Topic: Metashape errors on NVIDIA A100X with CUDA 11.6

djyoung

  • Newbie
  • Posts: 19
Metashape errors on NVIDIA A100X with CUDA 11.6
« on: November 10, 2022, 10:57:26 PM »
Hello -- I am attempting to run Metashape on the NSF Jetstream2 GPU virtual machines (https://jetstream-cloud.org/), which use A100X GPUs (partitioned into multiple vGPUs) and which currently run CUDA 11.6. The OS is Ubuntu 22.04 and I am running Metashape 1.8.4 via the Python API.

Almost every Metashape run ends in an error or a hang. The failures vary in type and occur at different stages, even across identical runs. Sometimes during depth map generation I get the error "Exception: Kernel failed: unspecified launch failure (719) at line 329". More commonly, though, Metashape simply hangs during the depth map generation or filtering stage, with one CPU core pegged at 100% and the remaining CPU cores and the GPU at 0%. Neither GPU memory nor main RAM is ever close to exhausted.
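
Since the most common failure described above is a process that never returns, one way to make such runs fail fast is to launch the processing script in a child process with a timeout. The following is a minimal sketch using only the Python standard library. The function name `run_with_timeout`, the script name `process_project.py`, and the six-hour limit are hypothetical placeholders, not part of Metashape.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return 'ok', 'error', or 'hang' (killed after timeout_s seconds)."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return "hang"
    return "ok" if result.returncode == 0 else "error"


# Hypothetical usage; process_project.py stands in for the actual workflow script:
# status = run_with_timeout([sys.executable, "process_project.py"], timeout_s=6 * 3600)
```

A timeout like this only detects the hang so that a batch job can stop or retry; it does not explain why the run stalled.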

I have attempted to disable CUDA, as suggested in another thread for users with similar issues: https://www.agisoft.com/forum/index.php?topic=11771.15
That did not resolve the errors, though; it only changed them from CUDA errors to OpenCL errors.

I also attempted to change the tweak "main/depth_max_gpu_multiplier" to 1 (via the Python API), but it had no effect on processing. It appears that the tweak is not recognized by the Python API.

Has Agisoft identified any additional solutions to these sorts of errors? Is there any more information I can provide to help identify solutions? Thank you in advance!

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • Posts: 14857
Re: Metashape errors on NVIDIA A100X with CUDA 11.6
« Reply #1 on: November 13, 2022, 03:24:46 PM »
Hello djyoung,

Can you please specify which NVIDIA driver version is installed and describe how it was installed on the virtual machine that you are using?
Best regards,
Alexey Pasumansky,
Agisoft LLC

djyoung

  • Newbie
  • Posts: 19
Re: Metashape errors on NVIDIA A100X with CUDA 11.6
« Reply #2 on: November 14, 2022, 09:46:30 PM »
Hi Alexey -- the driver version is 510.85.02, CUDA version 11.6.

It is installed via NVIDIA vGPU GRID software (current version 14.2). https://docs.nvidia.com/grid/index.html

The GRID drivers are installed from an RPM or DEB package provided by NVIDIA as part of the Jetstream2 VM image build process. The Jetstream2 platform has an automated pipeline that creates new images weekly and, as part of that process, installs the official NVIDIA vGPU GRID drivers via the operating system's package manager.

It is a licensed product and not the freely available display driver.

Here's what is reported by nvidia-smi currently:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100X-20C      On   | 00000000:00:06.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1140MiB / 20480MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     98244      C   python                           1139MiB |
+-----------------------------------------------------------------------------+


Please let me know if you need any additional details. Thank you!

djyoung

  • Newbie
  • Posts: 19
Re: Metashape errors on NVIDIA A100X with CUDA 11.6
« Reply #3 on: November 23, 2022, 12:07:35 AM »
Hello Alexey -- I'm just following up to see if you have any thoughts on how we can troubleshoot this. Thank you!

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • Posts: 14857
Re: Metashape errors on NVIDIA A100X with CUDA 11.6
« Reply #4 on: November 23, 2022, 03:55:34 PM »
Hello djyoung,


Did Metashape ever process successfully with the GPU enabled on this system, or have you had these processing errors from the start?

Do you get similar problems at the image matching stage when the GPU is enabled?

Are you able to install older drivers (but no older than 435.xx) on your system, and if so, do you still get GPU processing errors with them?

If you can also run a long, resource-demanding CUDA stress test on the same system, it would be interesting to know whether it runs properly or also fails after a while.
Best regards,
Alexey Pasumansky,
Agisoft LLC

djyoung

  • Newbie
  • Posts: 19
Re: Metashape errors on NVIDIA A100X with CUDA 11.6
« Reply #5 on: November 27, 2022, 11:53:41 PM »
Hi Alexey,

I only started using this system for Metashape about 2 months ago, and I've always gotten the same errors; it has never worked correctly.

I do not get these errors in the image matching stage, only in the dense point cloud stage. However, in the image matching stage, GPU utilization according to nvidia-smi is usually only around 3%. During the dense cloud stage (when the problems occur), utilization is usually around 50-75%.
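
Utilization figures like these can be sampled in a script rather than read off the nvidia-smi table by eye. The sketch below is one possible approach, assuming nvidia-smi is on the PATH and using its `--query-gpu` CSV output. The helper names `parse_gpu_line` and `sample_gpus` are hypothetical, and fields that the driver does not report as numbers are mapped to None.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line from nvidia-smi into a dict; non-numeric fields become None."""
    def to_int(field):
        field = field.strip()
        return int(field) if field.isdigit() else None

    util, used, total = (to_int(f) for f in line.split(","))
    return {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}


def sample_gpus():
    """Query every visible GPU once (requires nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]


# Example with the values from the nvidia-smi table posted earlier in this thread:
# parse_gpu_line("0, 1140, 20480")
```

Calling `sample_gpus()` in a loop with a timestamp would show whether utilization drops to 0% at the moment a run stalls.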

I just ran a GPU stress test for 24 hours (https://github.com/wilicc/gpu-burn) on the same system. GPU utilization was around 99% and I got no errors.

Unfortunately, I don't have the option of installing older drivers. When I tried, it caused many package conflicts and I was unable to run nvidia-smi. The system support staff also told me it's not possible because of the way the vGPU system works (the driver needs to be set by the hypervisor, which allocates GPU slices to the VMs).

Is there any form of Metashape logging I can turn on that might help diagnose this problem?

Based on a suggestion in another thread (https://www.agisoft.com/forum/index.php?topic=11771.15), I tried to set the tweak "main/depth_max_gpu_multiplier" to "1", but it did not change anything. During the depth map/dense cloud stage, the terminal still prints:
[GPU 1] ...
[GPU 2] ...
[GPU 1] ...
[GPU 2] ...


This is the stage where the problems occur. I tried to set the tweak via the Python API as follows:
    Metashape.app.settings.setValue("main/depth_max_gpu_multiplier", 1)
Is this correct? Even when I do this, I still get [GPU 1] ... [GPU 2] ..., but I suspect the tweak is supposed to eliminate the [GPU 2] computations.
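
One way to check whether a tweak was at least stored is to read it back after setting it. The sketch below assumes the settings object exposes a `value(key)` getter alongside the `setValue(key, value)` call used above; that getter is an assumption about the API, and `FakeSettings` is a hypothetical stand-in so the example runs without Metashape installed.

```python
def set_and_read_back(settings, key, value):
    """Store a tweak through a settings object and return the value read back."""
    settings.setValue(key, value)
    return settings.value(key)


class FakeSettings:
    """Hypothetical stand-in for Metashape.app.settings (assumed setValue/value interface)."""

    def __init__(self):
        self._store = {}

    def setValue(self, key, value):
        self._store[key] = value

    def value(self, key):
        return self._store.get(key)


stored = set_and_read_back(FakeSettings(), "main/depth_max_gpu_multiplier", 1)

# With Metashape available, the same helper would be called as (assumption):
# set_and_read_back(Metashape.app.settings, "main/depth_max_gpu_multiplier", 1)
```

A successful read-back only shows that the key was saved; it does not show that the depth map stage actually honors the tweak.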

Any help would be much appreciated! Thank you.

djyoung

  • Newbie
  • Posts: 19
Re: Metashape errors on NVIDIA A100X with CUDA 11.6
« Reply #6 on: December 03, 2022, 06:32:22 PM »
Hi Alexey -- do you have any thoughts on my message above? I'd appreciate your input when you get a chance.

NVIDIA released new vGPU drivers (510.108.03, which still use CUDA 11.6), which the Jetstream2 computing platform just installed. https://docs.nvidia.com/grid/index.html


I just tried a run with them, but I had the same issue: the workflow hung in the middle of the depth map stage.

Thanks for any help!