
Topic: error Kernel failed: an illegal memory access was encountered (700) at line 143

Iluvathar

Hello,
I have had this error for 4 days. I'm completely stuck in my work and the deadline is approaching.
It occurs during the alignment between photos and laser scans.
I have a lot of laser scans (72) to align with 3500 photos. The dense cloud from all the scans is about 2 billion points, about 50 million points each. I have 256 GB of RAM, dual Xeon E5-2640 v4 CPUs, and a Quadro RTX 8000.
Maybe the point cloud from the laser scans is too big?
I'm working with Windows 10, up to date, and the NVIDIA driver is 537.42 (desktop/notebook driver for Quadro, also up to date).
I'm working with the latest Metashape build, 2.0.3 16915.

The error:
2023-09-27 10:10:06 Detecting points...
2023-09-27 10:10:06 Found 1 GPUs in 0 sec (CUDA: 0 sec, OpenCL: 0 sec)
2023-09-27 10:10:06 Using device: Quadro RTX 8000, 72 compute units, free memory: 47686/49151 MB, compute capability 7.5
2023-09-27 10:10:06   driver/runtime CUDA: 12020/10010
2023-09-27 10:10:06   max work group size 1024
2023-09-27 10:10:06   max work item sizes [1024, 1024, 64]
2023-09-27 10:10:12 Warning: cudaStreamDestroy failed: an illegal memory access was encountered (700)
2023-09-27 10:10:13 Finished processing in 307.578 sec (exit code 0)
2023-09-27 10:10:13 Error: Kernel failed: an illegal memory access was encountered (700) at line 143


Thanks for the help!
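
For anyone comparing logs across machines, a minimal Python sketch like the one below can pull the GPU line, the CUDA driver/runtime versions, and the warning and error lines out of a log such as the one quoted above. The regular expressions are modelled only on those quoted lines, so other Metashape versions may need different patterns. `summarize_log` and `decode_cuda_version` are hypothetical helper names, not part of Metashape, and the decoding assumes the usual CUDA version encoding of 1000 * major + 10 * minor.

```python
import re

# Patterns modelled on the log lines quoted in this thread (assumption:
# other Metashape builds may format these lines differently).
DEVICE_RE = re.compile(
    r"Using device: (?P<name>.+?), (?P<units>\d+) compute units, "
    r"free memory: (?P<free>\d+)/(?P<total>\d+) MB, "
    r"compute capability (?P<cc>[\d.]+)"
)
CUDA_RE = re.compile(r"driver/runtime CUDA: (?P<driver>\d+)/(?P<runtime>\d+)")
PROBLEM_RE = re.compile(r"(?:Warning|Error): (?P<message>.+)$")


def decode_cuda_version(value):
    """Split a CUDA version integer (1000 * major + 10 * minor) into (major, minor)."""
    return value // 1000, (value % 1000) // 10


def summarize_log(text):
    """Collect device info, CUDA versions and warning/error messages from log text."""
    summary = {"devices": [], "cuda": None, "problems": []}
    for line in text.splitlines():
        m = DEVICE_RE.search(line)
        if m:
            summary["devices"].append({
                "name": m["name"],
                "compute_units": int(m["units"]),
                "free_mb": int(m["free"]),
                "total_mb": int(m["total"]),
                "compute_capability": m["cc"],
            })
            continue
        m = CUDA_RE.search(line)
        if m:
            # Stored as (driver, runtime), each a (major, minor) pair.
            summary["cuda"] = (decode_cuda_version(int(m["driver"])),
                               decode_cuda_version(int(m["runtime"])))
            continue
        m = PROBLEM_RE.search(line)
        if m:
            summary["problems"].append(m["message"])
    return summary


# Lines copied from the log quoted above.
SAMPLE_LOG = """\
2023-09-27 10:10:06 Using device: Quadro RTX 8000, 72 compute units, free memory: 47686/49151 MB, compute capability 7.5
2023-09-27 10:10:06   driver/runtime CUDA: 12020/10010
2023-09-27 10:10:12 Warning: cudaStreamDestroy failed: an illegal memory access was encountered (700)
2023-09-27 10:10:13 Error: Kernel failed: an illegal memory access was encountered (700) at line 143
"""
```

Calling `summarize_log(SAMPLE_LOG)` on these lines yields one device entry, the driver/runtime pair decoded as 12.2 and 10.1, and the two (700) messages, which makes it easier to line up reports from different systems.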

bgodfrey

I'm wondering whether you found the cause of this error and came up with a solution. I'm getting the same error message when trying to align 4000 photos. I'm using Metashape 1.8 on Linux. GPU: 2x NVIDIA RTX A6000; GPU RAM: 48 GB; system RAM: 128 GB; processor: Xeon E5-2620 v4 (16 cores). I don't get the error when I test on a subset of the photos (about 1500).

Thank you for any help.

Alexey Pasumansky

  • Agisoft Technical Support
Hello bgodfrey,

In most cases this error indicates GPU driver issues, so I suggest doing a clean install of the latest NVIDIA driver compatible with your system.

If the issue persists with the new driver installed, please share the log corresponding to the failed operation.
Best regards,
Alexey Pasumansky,
Agisoft LLC

bgodfrey

We upgraded the NVIDIA RTX A6000 driver from version 535.104.05 to version 550.54.15.  We also updated the cuda-12.2 install.

We still get the same error message.

I've attached the Metashape log and a log from a bash script that captured GPU memory usage at the time of failure.

Thoughts on how best to proceed to resolve this issue?
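
The bash script itself is not shown in the thread. As a rough equivalent, the Python sketch below polls `nvidia-smi` and appends per-GPU memory usage to a file while the alignment runs. The function names `parse_smi_csv` and `poll`, the 5-second interval, and the log file name are illustrative assumptions. It relies only on the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`.

```python
import subprocess
import time
from datetime import datetime

# nvidia-smi query: one CSV row per GPU, memory values in MiB without units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_smi_csv(output):
    """Parse rows such as '0, NVIDIA RTX A6000, 1234, 49140' into dicts."""
    rows = []
    for line in output.strip().splitlines():
        index, rest = line.split(",", 1)
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, used, total = rest.rsplit(",", 2)
        rows.append({
            "index": int(index),
            "name": name.strip(),
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return rows


def poll(interval_s=5.0, log_path="gpu_memory.log"):
    """Append a timestamped memory reading for every GPU until interrupted."""
    with open(log_path, "a") as log:
        while True:
            out = subprocess.run(
                QUERY, capture_output=True, text=True, check=True
            ).stdout
            stamp = datetime.now().isoformat(timespec="seconds")
            for gpu in parse_smi_csv(out):
                log.write(
                    f"{stamp} GPU{gpu['index']} {gpu['name']}: "
                    f"{gpu['used_mib']}/{gpu['total_mib']} MiB\n"
                )
            log.flush()  # keep the file current in case the job is killed
            time.sleep(interval_s)
```

Running `poll()` in a separate terminal during alignment and stopping it with Ctrl+C gives a record of how close each GPU was to its memory limit when the kernel error appeared.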

ikemarv

Hello, we have experienced the same issue.

We are using Metashape 2.10 on Linux with 2 L40 GPUs, 28 CPUs, and 120 GB RAM. The driver version is 535.129.03 and the CUDA version is 12.2.
When restarting the alignment after the error message, we get a “cudaMemGetInfo time out” error. After restarting Metashape and then restarting the alignment, we get the Kernel failed error again.

Due to external restrictions, we are not able to simply update the driver or reinstall it manually.

I’ve attached our log files as well. Could you please tell us how to resolve this issue?
Thanks in advance.

Alexey Pasumansky

  • Agisoft Technical Support
Hello!

You can try the OpenCL implementation instead of CUDA by creating the main/gpu_enable_cuda tweak via the Advanced preferences tab and setting its value to False.

You can also try High accuracy instead of Highest to reduce memory consumption.
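
To give a feel for why the accuracy setting matters, here is a small Python sketch of the image scaling that the Metashape user manual describes for Align Photos: High works on the original image size, Highest upscales by a factor of 4 in pixel count (2x per side), and each lower setting downscales by a further factor of 4. The helper `effective_megapixels` and the 100 MP example image are hypothetical, and how GPU memory actually scales with pixel count is an assumption not confirmed here, so treat the numbers as a rough guide only.

```python
# Pixel-count factor relative to the original image for each alignment
# accuracy setting, following the description in the Metashape user manual.
PIXEL_FACTOR = {
    "Highest": 4.0,     # upscaled 2x per side
    "High": 1.0,        # original size
    "Medium": 1 / 4,    # downscaled 2x per side
    "Low": 1 / 16,
    "Lowest": 1 / 64,
}


def effective_megapixels(image_mp, accuracy):
    """Return the megapixels actually processed per image at a given accuracy."""
    return image_mp * PIXEL_FACTOR[accuracy]


# Hypothetical 100 MP image: compare the two settings discussed above.
highest_mp = effective_megapixels(100, "Highest")
high_mp = effective_megapixels(100, "High")
```

For the hypothetical 100 MP image, Highest processes four times as many pixels per image as High, which is consistent with the advice to step down one level when memory is tight.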
Best regards,
Alexey Pasumansky,
Agisoft LLC

bgodfrey

Hello,

I upgraded to Metashape 2.1.1.

I added the tweak to our node that has 2 Nvidia RTX A6000 GPUs.  I get this new error message:
"Kernal linearColumnFilter_float_21: clWaitForEvents(1, &ev): CL_UNKNOWN_ERROR_CODE_-9999 (-9999) at line 700"

With 2.1.1 I get the same error message documented earlier if I don't add the tweak.

I am using High accuracy.

Additionally, I tested the same data and processes on a node that has 2 GTX 1080 Ti GPUs. With and without the tweak I get the same result: the program fails after a few hours and returns "651454 Killed "$dirname/$appname" "$@"".

Thank you for any additional suggestions to resolve this issue.

jhead

I'm also getting the same error as the original poster on 2.1.1.
My system has 2x RTX 6000 Ada GPUs (Stable Release R550 U6 (552.55)), and it seems to throw this error when I select the Highest setting with 150 MP imagery.
I'll try the OpenCL implementation and the New Feature branch and see if there are any differences.