Dear community and support team,
I am currently running a Photoscan Pro processing node on a headless machine built for GPU processing (specs below). Whenever I submit a GPU-intensive Photoscan task to the node, e.g. depth reconstruction, Photoscan keeps emitting the following error:
Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
The node keeps attempting the processing, but it yields this error message over and over and makes no progress.
How can I solve this problem? Does the server need some specific configuration? Other software on the same machine does not produce any CUDA errors, and GPU utilization looks fine there.
The basic specs of the compute server are as follows:
Machine: Dell PowerEdge T640, 2x Intel Xeon Gold 6148 @ 2.4 GHz, 382GiB system memory
Graphics: 4x NVIDIA GK210GL [Tesla K80] driver: nvidia v: 390.67
System: Distro: CentOS Linux release 7.5.1804, Kernel: 3.10.0-862.3.2.el7.x86_64
Here is a shortened log output from the node (full logfile attached):
BuildDepthMaps.buildDepthMaps (39/57): quality = High, depth filtering = Disabled
loaded depth map partition in 0.005463 sec
Using device: Tesla K80, 13 compute units, 11441 MB global memory, compute capability 3.7
driver version: 9010, runtime version: 5050
max work group size 1024
max work item sizes [1024, 1024, 64]
Using CUDA device 'Tesla K80' in concurrent. (2 times)
loaded photos in 9.57762 seconds
[GPU] estimating 1787x1754x416 disparity using 894x877x8u tiles
timings: rectify: 0.360702 disparity: 1.02108 borders: 0.016318 filter: 0.013563 fill: 0
[GPU] estimating 1525x988x192 disparity using 1525x988x8u tiles
timings: rectify: 0.017841 disparity: 0.404869 borders: 0.008467 filter: 0.008001 fill: 0
[...skipping similar output...]
[GPU] estimating 1084x1371x352 disparity using 1084x1371x8u tiles
timings: rectify: 0.019118 disparity: 0.407436 borders: 0.008471 filter: 0.007898 fill: 0
Depth reconstruction devices performance:
- 100% done by Tesla K80
Total time: 47.6308 seconds
Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
processing failed in 69.2783 sec
BuildDepthMaps.buildDepthMaps (40/57): quality = High, depth filtering = Disabled
loaded depth map partition in 0.00156 sec
Using device: Tesla K80, 13 compute units, 11441 MB global memory, compute capability 3.7
driver version: 9010, runtime version: 5050
max work group size 1024
max work item sizes [1024, 1024, 64]
Using CUDA device 'Tesla K80' in concurrent. (2 times)
loaded photos in 9.11162 seconds
[GPU] estimating 2410x1787x288 disparity using 1205x894x8u tiles
timings: rectify: 0.298753 disparity: 1.19054 borders: 0.021145 filter: 0.0178 fill: 0
[GPU] estimating 795x2439x256 disparity using 795x1220x8u tiles
timings: rectify: 0.023439 disparity: 0.56942 borders: 0.010525 filter: 0.009718 fill: 0
[...skipping similar output...]
[GPU] estimating 1246x1360x320 disparity using 1246x1360x8u tiles
timings: rectify: 0.021423 disparity: 0.486459 borders: 0.009245 filter: 0.008953 fill: 0
Depth reconstruction devices performance:
- 100% done by Tesla K80
Total time: 49.4865 seconds
Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
processing failed in 59.189 sec
[... and so on...]
Currently I start the node using this command line:
~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 0 --gpu_mask 1 --root ~/SfM 2>&1 | tee -a ~/log_photoscan/photoscan_node.log
As the command shows, I currently apply a GPU mask and leave CPU processing disabled, but the issue does not change when I alter those settings, enable the CPU, or use all four GPUs; see the variants below.
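These are the variants I tried, with the dispatch address and paths elided as above; I am assuming here that --gpu_mask is a bitmask, so 15 (binary 1111) should select all four boards:

# one GPU plus CPU processing enabled
~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 1 --gpu_mask 1 --root ~/SfM

# all four GPUs, CPU disabled
~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 0 --gpu_mask 15 --root ~/SfM

The error output is the same in every case.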
When I check the GPU activity with nvidia-smi, I can see that the GPU is actually utilized and doing something:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:3D:00.0 Off | 0 |
| N/A 72C P0 128W / 149W | 787MiB / 11441MiB | 98% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:3E:00.0 Off | 0 |
| N/A 34C P0 70W / 149W | 82MiB / 11441MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 00000000:60:00.0 Off | 0 |
| N/A 47C P0 58W / 149W | 82MiB / 11441MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 00000000:61:00.0 Off | 0 |
| N/A 26C P8 29W / 149W | 20MiB / 11441MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 251735 C /home/xxxxxxxx/photoscan-pro/photoscan 767MiB |
| 1 251735 C /home/xxxxxxxx/photoscan-pro/photoscan 62MiB |
| 2 251735 C /home/xxxxxxxx/photoscan-pro/photoscan 62MiB |
+-----------------------------------------------------------------------------+
Strangely, two additional small 62MiB processes sit idle on GPUs 1 and 2, although I applied the GPU mask.

How can I fix the error? I would like to make full use of the Tesla GPUs and plan to invest in more GPU power, but first I need to make sure that Photoscan can handle this setup. Is there a misconfiguration in our GPU server setup, or is this a Photoscan issue?
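To rule out the mask handling, I could also hide the other boards from the process entirely via the standard CUDA environment variable, e.g.:

# restrict the node process to the first GPU only
CUDA_VISIBLE_DEVICES=0 ~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 0 --gpu_mask 1 --root ~/SfM

(This assumes Photoscan's CUDA backend honors CUDA_VISIBLE_DEVICES; I have not verified that yet.)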
Any help is appreciated.
If required, I can do more test runs, change settings, and provide logfile output - please let me know.
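For example, I could record per-process GPU usage during a failing run with something like the following (the output filename is just a placeholder):

# log which process occupies which GPU, every 5 seconds
while true; do nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv,noheader; sleep 5; done | tee -a ~/log_photoscan/gpu_usage.log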
We also have various MPI implementations available on the server if that is useful.
Thank you and best regards,
Simon