Dear community and support team,
I am currently running a Photoscan Pro processing node on a headless machine built for GPU processing (specs below). Whenever I submit a GPU-intensive Photoscan task to the node, e.g. depth reconstruction, Photoscan keeps emitting the following error:
Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
The node keeps attempting the processing, but it yields this error message over and over and makes no progress.
How can I solve this problem? Does the server need some specific configuration? Other software on the same machine does not produce any CUDA errors, and GPU utilization looks fine there.
The basic specs of the compute server are as follows:
Machine: Dell PowerEdge T640, 2x Intel Xeon Gold 6148 @ 2.4 GHz, 382GiB system memory
Graphics: 4x NVIDIA GK210GL [Tesla K80] driver: nvidia v: 390.67
System: Distro: CentOS Linux release 7.5.1804, Kernel: 3.10.0-862.3.2.el7.x86_64
Here is a shortened log output from the node (full logfile attached):
BuildDepthMaps.buildDepthMaps (39/57): quality = High, depth filtering = Disabled
loaded depth map partition in 0.005463 sec
Using device: Tesla K80, 13 compute units, 11441 MB global memory, compute capability 3.7
driver version: 9010, runtime version: 5050
max work group size 1024
max work item sizes [1024, 1024, 64]
Using CUDA device 'Tesla K80' in concurrent. (2 times)
loaded photos in 9.57762 seconds
[GPU] estimating 1787x1754x416 disparity using 894x877x8u tiles
timings: rectify: 0.360702 disparity: 1.02108 borders: 0.016318 filter: 0.013563 fill: 0
[GPU] estimating 1525x988x192 disparity using 1525x988x8u tiles
timings: rectify: 0.017841 disparity: 0.404869 borders: 0.008467 filter: 0.008001 fill: 0
[...skipping similar output...]
[GPU] estimating 1084x1371x352 disparity using 1084x1371x8u tiles
timings: rectify: 0.019118 disparity: 0.407436 borders: 0.008471 filter: 0.007898 fill: 0
Depth reconstruction devices performance:
- 100% done by Tesla K80
Total time: 47.6308 seconds
Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
processing failed in 69.2783 sec
BuildDepthMaps.buildDepthMaps (40/57): quality = High, depth filtering = Disabled
loaded depth map partition in 0.00156 sec
Using device: Tesla K80, 13 compute units, 11441 MB global memory, compute capability 3.7
driver version: 9010, runtime version: 5050
max work group size 1024
max work item sizes [1024, 1024, 64]
Using CUDA device 'Tesla K80' in concurrent. (2 times)
loaded photos in 9.11162 seconds
[GPU] estimating 2410x1787x288 disparity using 1205x894x8u tiles
timings: rectify: 0.298753 disparity: 1.19054 borders: 0.021145 filter: 0.0178 fill: 0
[GPU] estimating 795x2439x256 disparity using 795x1220x8u tiles
timings: rectify: 0.023439 disparity: 0.56942 borders: 0.010525 filter: 0.009718 fill: 0
[...skipping similar output...]
[GPU] estimating 1246x1360x320 disparity using 1246x1360x8u tiles
timings: rectify: 0.021423 disparity: 0.486459 borders: 0.009245 filter: 0.008953 fill: 0
Depth reconstruction devices performance:
- 100% done by Tesla K80
Total time: 49.4865 seconds
Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
processing failed in 59.189 sec
[... and so on...]
Currently I start the node using this command line:
~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 0 --gpu_mask 1 --root ~/SfM 2>&1 | tee -a ~/log_photoscan/photoscan_node.log
As the command shows, I currently apply a GPU mask and leave CPU processing disabled, but the issue does not change when I alter those settings, enable the CPU, or use all four GPUs; see the variants below.
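These are the variants I tried, with the dispatch address and paths elided as above; I am assuming here that --gpu_mask is a bitmask, so 15 (binary 1111) should select all four boards:

# one GPU plus CPU processing enabled
~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 1 --gpu_mask 1 --root ~/SfM

# all four GPUs, CPU disabled
~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 0 --gpu_mask 15 --root ~/SfM

The error output is the same in every case.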
When I check the GPU activity with nvidia-smi, I can see that the GPU is actually utilized and doing something:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:3D:00.0 Off | 0 |
| N/A 72C P0 128W / 149W | 787MiB / 11441MiB | 98% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:3E:00.0 Off | 0 |
| N/A 34C P0 70W / 149W | 82MiB / 11441MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 00000000:60:00.0 Off | 0 |
| N/A 47C P0 58W / 149W | 82MiB / 11441MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 00000000:61:00.0 Off | 0 |
| N/A 26C P8 29W / 149W | 20MiB / 11441MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 251735 C /home/xxxxxxxx/photoscan-pro/photoscan 767MiB |
| 1 251735 C /home/xxxxxxxx/photoscan-pro/photoscan 62MiB |
| 2 251735 C /home/xxxxxxxx/photoscan-pro/photoscan 62MiB |
+-----------------------------------------------------------------------------+
Strangely, two additional small 62MiB processes sit idle on GPUs 1 and 2, although I applied the GPU mask.

How can I fix the error? I would like to make full use of the Tesla GPUs and plan to invest in more GPU power, but first I need to make sure that Photoscan can handle this setup. Is there a misconfiguration in our GPU server setup, or is this a Photoscan issue?
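To rule out the mask handling, I could also hide the other boards from the process entirely via the standard CUDA environment variable, e.g.:

# restrict the node process to the first GPU only
CUDA_VISIBLE_DEVICES=0 ~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 0 --gpu_mask 1 --root ~/SfM

(This assumes Photoscan's CUDA backend honors CUDA_VISIBLE_DEVICES; I have not verified that yet.)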
Any help is appreciated.
If required, I can do more test runs, change settings, and provide logfile output - please let me know.
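For example, I could record per-process GPU usage during a failing run with something like the following (the output filename is just a placeholder):

# log which process occupies which GPU, every 5 seconds
while true; do nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv,noheader; sleep 5; done | tee -a ~/log_photoscan/gpu_usage.log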
We also have various MPI implementations available on the server if that is useful.
Thank you and best regards,
Simon