Topic: Photoscan compute node yields CUDA_ERROR_INVALID_DEVICE errors

simon_r

Dear community and support team,

I am currently running a Photoscan Pro node on a headless machine designed for GPU processing (specs below).

Whenever I submit a GPU-intensive Photoscan task to the node, e.g. depth reconstruction, Photoscan repeatedly reports the following error:
Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
The node keeps trying to process the task, but each attempt ends with this error message and no progress is made.

How can I solve this problem? Does the server need some specific configuration? Other software does not yield CUDA errors on the same machine, and the GPU utilization works fine there.

The basic specs of the compute server are as follows:
Machine: Dell PowerEdge T640, 2x Intel Xeon Gold 6148 @ 2.4 GHz, 382GiB system memory
Graphics: 4x NVIDIA GK210GL [Tesla K80] driver: nvidia v: 390.67
System: Distro: CentOS Linux release 7.5.1804, Kernel: 3.10.0-862.3.2.el7.x86_64

This is a (shortened) log output of the node: (full logfile attached)
Code:
BuildDepthMaps.buildDepthMaps (39/57): quality = High, depth filtering = Disabled
loaded depth map partition in 0.005463 sec
Using device: Tesla K80, 13 compute units, 11441 MB global memory, compute capability 3.7
  driver version: 9010, runtime version: 5050
  max work group size 1024
  max work item sizes [1024, 1024, 64]
Using CUDA device 'Tesla K80' in concurrent. (2 times)
loaded photos in 9.57762 seconds
[GPU] estimating 1787x1754x416 disparity using 894x877x8u tiles
timings: rectify: 0.360702 disparity: 1.02108 borders: 0.016318 filter: 0.013563 fill: 0
[GPU] estimating 1525x988x192 disparity using 1525x988x8u tiles
timings: rectify: 0.017841 disparity: 0.404869 borders: 0.008467 filter: 0.008001 fill: 0
[...skipping similar output...]
[GPU] estimating 1084x1371x352 disparity using 1084x1371x8u tiles
timings: rectify: 0.019118 disparity: 0.407436 borders: 0.008471 filter: 0.007898 fill: 0

Depth reconstruction devices performance:
 - 100% done by Tesla K80
Total time: 47.6308 seconds

Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
processing failed in 69.2783 sec
BuildDepthMaps.buildDepthMaps (40/57): quality = High, depth filtering = Disabled
loaded depth map partition in 0.00156 sec
Using device: Tesla K80, 13 compute units, 11441 MB global memory, compute capability 3.7
  driver version: 9010, runtime version: 5050
  max work group size 1024
  max work item sizes [1024, 1024, 64]
Using CUDA device 'Tesla K80' in concurrent. (2 times)
loaded photos in 9.11162 seconds
[GPU] estimating 2410x1787x288 disparity using 1205x894x8u tiles
timings: rectify: 0.298753 disparity: 1.19054 borders: 0.021145 filter: 0.0178 fill: 0
[GPU] estimating 795x2439x256 disparity using 795x1220x8u tiles
timings: rectify: 0.023439 disparity: 0.56942 borders: 0.010525 filter: 0.009718 fill: 0
[...skipping similar output...]
[GPU] estimating 1246x1360x320 disparity using 1246x1360x8u tiles
timings: rectify: 0.021423 disparity: 0.486459 borders: 0.009245 filter: 0.008953 fill: 0

Depth reconstruction devices performance:
 - 100% done by Tesla K80
Total time: 49.4865 seconds

Error: CUDA_ERROR_INVALID_DEVICE (101) at line 123
processing failed in 59.189 sec
[... and so on...]
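
For anyone who wants to summarize a long node log like the one above, here is a minimal Python sketch that lists which tasks ended in an error. It only assumes the two line shapes visible in this excerpt ("Name.name (i/n): ..." for a task start and "Error: NAME (code) ..." for a failure); the function name `failed_tasks` is my own and not part of Photoscan.

```python
import re

# Line shapes taken from the log excerpt in this post (assumed to be stable).
TASK_RE = re.compile(r"^(\w+\.\w+) \((\d+)/(\d+)\)")
ERROR_RE = re.compile(r"^Error: (\w+) \((\d+)\)")


def failed_tasks(log_text):
    """Return (task, index, total, error_name, error_code) for each failed task."""
    failures = []
    current = None  # the most recently started task, if any
    for line in log_text.splitlines():
        task = TASK_RE.match(line)
        if task:
            current = (task.group(1), int(task.group(2)), int(task.group(3)))
            continue
        error = ERROR_RE.match(line)
        if error and current is not None:
            failures.append(current + (error.group(1), int(error.group(2))))
    return failures
```

Reading the file written by `tee` and passing its contents to `failed_tasks` gives one tuple per failed partition, which makes it easier to see whether every partition fails or only some of them.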

Currently I start the node using this command line:
Code:
~/photoscan-pro/photoscan.sh --node --dispatch xxx.xxx.xxx.xxx:5841 --capability gpu --cpu_enable 0 --gpu_mask 1 --root ~/SfM  2>&1 | tee -a ~/log_photoscan/photoscan_node.log

Currently I use a GPU mask and have not enabled the CPU, but the issue persists when I change those settings, enable the CPU, or use all four GPUs.
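
To illustrate the mask value, here is a small Python sketch. It rests on an assumption that should be checked against the Photoscan documentation: that --gpu_mask is a decimal bitmask in which bit i enables GPU i (this matches "--gpu_mask 1" selecting only the first GPU). The helper names are hypothetical.

```python
def gpu_mask(enabled_indices):
    """Build a bitmask with bit i set for every enabled GPU index i.

    Assumption: --gpu_mask interprets its argument this way.
    """
    mask = 0
    for i in enabled_indices:
        mask |= 1 << i
    return mask


def enabled_gpus(mask):
    """Inverse helper: list the GPU indices whose bit is set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]
```

Under that assumption, `gpu_mask([0])` is 1 (first GPU only) and `gpu_mask(range(4))` is 15 (all four GPUs).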

When I check the GPU activity with nvidia-smi, I see that the GPU is actually being utilized:
Code:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25                 Driver Version: 390.25                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:3D:00.0 Off |                    0 |
| N/A   72C    P0   128W / 149W |    787MiB / 11441MiB |     98%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:3E:00.0 Off |                    0 |
| N/A   34C    P0    70W / 149W |     82MiB / 11441MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:60:00.0 Off |                    0 |
| N/A   47C    P0    58W / 149W |     82MiB / 11441MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:61:00.0 Off |                    0 |
| N/A   26C    P8    29W / 149W |     20MiB / 11441MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    251735      C   /home/xxxxxxxx/photoscan-pro/photoscan       767MiB |
|    1    251735      C   /home/xxxxxxxx/photoscan-pro/photoscan        62MiB |
|    2    251735      C   /home/xxxxxxxx/photoscan-pro/photoscan        62MiB |
+-----------------------------------------------------------------------------+

(strangely, two additional small 62MiB idle processes are running on additional GPUs although I applied the GPU mask)


How can I fix the error? I would like to make full use of the Tesla GPUs and plan to invest in more GPU power, but I need to make sure that Photoscan can handle this setup. Is our GPU server misconfigured, or is this a Photoscan issue?

Any help is appreciated.
If required, I can do more test runs, change settings, and provide logfile output - please let me know.
We also have various MPI implementations available on the server if that is useful.

Thank you and best regards,
Simon

Alexey Pasumansky

  • Agisoft Technical Support
Re: Photoscan compute node yields CUDA_ERROR_INVALID_DEVICE errors
« Reply #1 on: June 20, 2018, 10:12:41 PM »
Hello Simon,

The issue could be caused by Compute Mode=Exclusive Process (see "E. Process" in nvidia-smi output), so please try to switch all GPUs to Compute Mode=Default:
1. run "sudo nvidia-smi -i 0 -c 0" (this switches the mode for the first GPU; to switch the mode for the other GPUs, use -i 1, -i 2 or -i 3),
2. check that nvidia-smi now prints Default instead of E. Process for the first GPU,
3. try to calculate depth maps with the first GPU again.
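
The steps above could be scripted roughly as follows. This is only a sketch, assuming a Python 3.7+ interpreter on the server and that "nvidia-smi --query-gpu=index,compute_mode --format=csv,noheader" prints lines such as "0, Default" or "1, Exclusive_Process"; the exact mode strings may differ between driver versions, and the function names are my own.

```python
import subprocess


def set_mode_command(gpu_index):
    """Command from step 1: switch one GPU to compute mode 0 (Default)."""
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-c", "0"]


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' lines into a {index: mode} dict."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def gpus_not_default(modes):
    """Step 2: indices of GPUs that still report something other than Default."""
    return sorted(i for i, mode in modes.items() if mode != "Default")


def switch_all_to_default(gpu_indices):
    """Run step 1 for each GPU, then query the modes; needs sudo rights."""
    for i in gpu_indices:
        subprocess.run(set_mode_command(i), check=True)
    query = ["nvidia-smi", "--query-gpu=index,compute_mode",
             "--format=csv,noheader"]
    out = subprocess.run(query, check=True, capture_output=True, text=True)
    return gpus_not_default(parse_compute_modes(out.stdout))
```

Calling `switch_all_to_default(range(4))` on the machine described in this thread should return an empty list once all four GPUs report Default; after that, step 3 (recalculating the depth maps) can be repeated.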
 
Another workaround is to create the following tweak via the Advanced preferences tab: main/depth_max_gpu_multiplier, and set it to 1. Note that this could slow down depth map calculation.
Best regards,
Alexey Pasumansky,
Agisoft LLC

simon_r

Re: Photoscan compute node yields CUDA_ERROR_INVALID_DEVICE errors
« Reply #2 on: June 21, 2018, 11:11:36 AM »
Dear Alexey,
Thank you for your reply. I will change the compute mode and run a test again.

How can I apply the tweak on a compute node? Do I run the Photoscan (node) installation in standalone mode, set the tweak, and start it node mode again? Or is there a dedicated command line argument when starting the node?

Thanks again and best regards,
Simon


Alexey Pasumansky

  • Agisoft Technical Support
Re: Photoscan compute node yields CUDA_ERROR_INVALID_DEVICE errors
« Reply #3 on: June 27, 2018, 05:25:23 PM »
Hello Simon,

You need to set the "tweak" on the client machine from which the network task is sent to the server. The parameter should then be sent along as a property of the task.
Best regards,
Alexey Pasumansky,
Agisoft LLC