Topic: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128

andyroo

  • Sr. Member
  • Posts: 440
more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« on: February 09, 2019, 01:40:50 AM »
I have been having CUDA errors on a couple of our workstations since v. 1.4-ish. I recently upgraded my most problematic one from 1.4.3 (OpenCL) to Metashape 1.5.1 (CUDA) and, sure enough, got the same error immediately on starting alignment (5069 16 MP images in two camera groups):

Error: CUDA_ERROR_UNKNOWN (999) at line 128

This is a dual Xeon E5-2643 v3 with 512 GB RAM and three GPUs: two EVGA 980 Ti hybrid GPUs and an EVGA Titan X hybrid running the display.

I then updated the GPU drivers, did a repair reinstall (upgrade-in-place) of Windows 10, disabled my primary video GPU (EVGA Titan X) in Tools/Preferences/GPU, and unchecked "Use CPU when performing GPU accelerated processing".

Interestingly, things worked ok when I started alignment after doing this.
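
For anyone who would rather script this device selection than click through the Preferences dialog, here is a minimal sketch. It assumes the Metashape Python API takes the selection as an integer bitmask (`Metashape.app.gpu_mask`, bit i for device i) and the CPU option as `Metashape.app.cpu_enable`. The helper name `gpu_mask_for` is made up for illustration, and the device order used here (0 and 2 = 980 Ti, 1 = Titan X) is an assumption; check the order Metashape itself reports before relying on it.

```python
def gpu_mask_for(enabled_indices):
    """Return an integer bitmask with bit i set for each enabled device index i."""
    mask = 0
    for i in enabled_indices:
        if i < 0:
            raise ValueError("device index must be non-negative")
        mask |= 1 << i
    return mask


# Assumed layout (hypothetical): devices 0 and 2 are the 980 Ti cards,
# device 1 is the Titan X that drives the display.
mask_without_titan = gpu_mask_for([0, 2])
print(bin(mask_without_titan))

# Inside Metashape's own Python console only (not runnable elsewhere):
# import Metashape
# Metashape.app.gpu_mask = mask_without_titan   # enable only the two 980 Ti cards
# Metashape.app.cpu_enable = False              # same as unchecking "Use CPU..."
```

Keeping the mask calculation separate from the Metashape calls makes it easy to check the value before applying it.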

I was optimistic that it would work if I left "Use CPU..." unchecked, so I enabled the Titan X and attempted the alignment again, and got the CUDA error. It threw the error on the Titan X immediately (I infer this because it threw the error on the third image), which makes me think that maybe another process on that GPU is the cause. These are all the processes that nvidia-smi reports on the GPUs:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5108      C   ...les\Agisoft\Metashape Pro\metashape.exe N/A      |
|    1       744    C+G   ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    1      1320    C+G   Insufficient Permissions                   N/A      |
|    1      5108    C+G   ...les\Agisoft\Metashape Pro\metashape.exe N/A      |
|    1      5704    C+G   ...hell.Experiences.TextInput.InputApp.exe N/A      |
|    1      6352    C+G   ...6)\Google\Chrome\Application\chrome.exe N/A      |
|    1      7212    C+G   C:\Windows\explorer.exe                    N/A      |
|    1      7540    C+G   ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
|    2      5108      C   ...les\Agisoft\Metashape Pro\metashape.exe N/A      |
+-----------------------------------------------------------------------------+

(PID 1320 is Desktop Window Manager)
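
To see at a glance which processes share a GPU with Metashape, a table like the one above can be grouped by GPU index. This is only a sketch: the regular expression assumes the older boxed nvidia-smi layout shown here (GPU, PID, Type, Process name, Usage), and the names `ROW`, `processes_by_gpu` and `SAMPLE` are illustrative. The sample rows are copied from the listing above.

```python
import re
from collections import defaultdict

# One table row: | GPU  PID  Type  Process name  Usage |
ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+(\S+)\s+(.+?)\s+(\S+)\s+\|$")


def processes_by_gpu(table_text):
    """Group (pid, type, name) tuples by GPU index; non-row lines are skipped."""
    by_gpu = defaultdict(list)
    for line in table_text.splitlines():
        m = ROW.match(line.strip())
        if m:
            gpu, pid, kind, name, _usage = m.groups()
            by_gpu[int(gpu)].append((int(pid), kind, name))
    return dict(by_gpu)


SAMPLE = r"""
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5108      C   ...les\Agisoft\Metashape Pro\metashape.exe N/A      |
|    1       744    C+G   ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    1      1320    C+G   Insufficient Permissions                   N/A      |
|    1      5108    C+G   ...les\Agisoft\Metashape Pro\metashape.exe N/A      |
|    2      5108      C   ...les\Agisoft\Metashape Pro\metashape.exe N/A      |
"""

for gpu, procs in sorted(processes_by_gpu(SAMPLE).items()):
    others = [p for p in procs if "metashape" not in p[2].lower()]
    print(f"GPU {gpu}: {len(procs)} processes, {len(others)} not Metashape")
```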

Here's a partial log showing the successful partial run, and the failed next run after I enabled the Titan X:

Code:
2019-02-08 13:52:46 Agisoft Metashape Professional Version: 1.5.1 build 7618 (64 bit)
2019-02-08 13:52:46 Platform: Windows
2019-02-08 13:52:46 CPU: Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz (server)
2019-02-08 13:52:46 CPU family: 6 model: 63 signature: 306F2h
2019-02-08 13:52:46 RAM: 511.9 GB
2019-02-08 13:52:49 OpenGL Vendor: NVIDIA Corporation
2019-02-08 13:52:49 OpenGL Renderer: GeForce GTX TITAN X/PCIe/SSE2
2019-02-08 13:52:49 OpenGL Version: 4.6.0 NVIDIA 418.81
2019-02-08 13:52:49 Maximum Texture Size: 16384
2019-02-08 13:52:49 Quad Buffered Stereo: not enabled
2019-02-08 13:52:49 ARB_vertex_buffer_object: supported
2019-02-08 13:52:49 ARB_texture_non_power_of_two: supported

2019-02-08 13:52:54 LoadProject: path = E:/Elwha/20190206PlaneCam/SfM/20190206_Quinault.psx
2019-02-08 13:52:54 Loading project...
2019-02-08 13:52:54 loaded project in 0.548 sec
2019-02-08 13:52:54 Finished processing in 0.548 sec (exit code 1)
2019-02-08 13:54:11 AlignPhotos: accuracy = High, preselection = generic, keypoint limit = 0, tiepoint limit = 0, apply masks = 0, filter tie points = 0, adaptive fitting = 0
2019-02-08 13:54:11 Matching photos...
2019-02-08 13:54:14 saved matching data in 0.016 sec
2019-02-08 13:54:19 scheduled 100 keypoint detection groups
2019-02-08 13:54:19 saved keypoint partition in 0.016 sec
2019-02-08 13:54:19 groups: 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128395
2019-02-08 13:54:22 340435 of 5069 used (6716.02%)
2019-02-08 13:54:22 scheduled 100 keypoint matching groups
2019-02-08 13:54:22 saved matching partition in 0.062 sec
2019-02-08 13:54:23 loaded keypoint partition in 0 sec
2019-02-08 13:54:23 loaded matching data in 0 sec
2019-02-08 13:54:23 Detecting points...
2019-02-08 13:54:40 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:54:40   driver version: 418.81, driver/runtime CUDA: 10010/5050
2019-02-08 13:54:40   max work group size 1024
2019-02-08 13:54:40   max work item sizes [1024, 1024, 64]
2019-02-08 13:54:40 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:54:40   driver version: 418.81, driver/runtime CUDA: 10010/5050
2019-02-08 13:54:40   max work group size 1024
2019-02-08 13:54:40   max work item sizes [1024, 1024, 64]
2019-02-08 13:54:41 [GPU] photo 0: 99254 points
2019-02-08 13:54:41 [GPU] photo 100: 109523 points
2019-02-08 13:54:41 [GPU] photo 200: 102206 points
2019-02-08 13:54:41 [GPU] photo 300: 124713 points
2019-02-08 13:54:42 [GPU] photo 400: 117956 points
2019-02-08 13:54:42 [GPU] photo 500: 126365 points
2019-02-08 13:54:42 [GPU] photo 600: 69406 points
2019-02-08 13:54:42 [GPU] photo 700: 117144 points
2019-02-08 13:54:42 [GPU] photo 800: 114690 points
2019-02-08 13:54:42 [GPU] photo 900: 105174 points
2019-02-08 13:54:43 [GPU] photo 1000: 94910 points
2019-02-08 13:54:43 [GPU] photo 1100: 93596 points
2019-02-08 13:54:43 [GPU] photo 1200: 94911 points
2019-02-08 13:54:43 [GPU] photo 1300: 99096 points
2019-02-08 13:54:43 [GPU] photo 1400: 83928 points
2019-02-08 13:54:43 [GPU] photo 1500: 82285 points
2019-02-08 13:54:44 [GPU] photo 1600: 93443 points
2019-02-08 13:54:44 [GPU] photo 1700: 99221 points
2019-02-08 13:54:44 [GPU] photo 1800: 76264 points
2019-02-08 13:54:44 [GPU] photo 2000: 57386 points
2019-02-08 13:54:44 [GPU] photo 1900: 99425 points
2019-02-08 13:54:44 [GPU] photo 2100: 96156 points
2019-02-08 13:54:45 [GPU] photo 2200: 98751 points
2019-02-08 13:54:45 [GPU] photo 2300: 121368 points
2019-02-08 13:54:45 [GPU] photo 2400: 113657 points
2019-02-08 13:54:45 [GPU] photo 2500: 100553 points
2019-02-08 13:54:45 [GPU] photo 2600: 139357 points
2019-02-08 13:54:45 [GPU] photo 2700: 145186 points
2019-02-08 13:54:46 [GPU] photo 2800: 74595 points
2019-02-08 13:54:46 [GPU] photo 2900: 114896 points
2019-02-08 13:54:46 [GPU] photo 3000: 83435 points
2019-02-08 13:54:46 [GPU] photo 3100: 97808 points
2019-02-08 13:54:46 [GPU] photo 3200: 114143 points
2019-02-08 13:54:46 [GPU] photo 3300: 48723 points
2019-02-08 13:54:47 [GPU] photo 3400: 122766 points
2019-02-08 13:54:47 [GPU] photo 3500: 123042 points
2019-02-08 13:54:47 [GPU] photo 3600: 89272 points
2019-02-08 13:54:47 [GPU] photo 3700: 79337 points
2019-02-08 13:54:47 [GPU] photo 3800: 87967 points
2019-02-08 13:54:47 [GPU] photo 3900: 74224 points
2019-02-08 13:54:48 [GPU] photo 4000: 119007 points
2019-02-08 13:54:48 [GPU] photo 4100: 97702 points
2019-02-08 13:54:48 [GPU] photo 4200: 115408 points
2019-02-08 13:54:48 [GPU] photo 4300: 98065 points
2019-02-08 13:54:48 [GPU] photo 4400: 79883 points
2019-02-08 13:54:48 [GPU] photo 4500: 117093 points
2019-02-08 13:54:48 [GPU] photo 4600: 123136 points
2019-02-08 13:54:49 [GPU] photo 4700: 103121 points
2019-02-08 13:54:49 [GPU] photo 4800: 84256 points
2019-02-08 13:54:49 [GPU] photo 4900: 118860 points
2019-02-08 13:54:49 [GPU] photo 5000: 98751 points
2019-02-08 13:54:49 points detected in 26.06 sec
2019-02-08 13:54:50 loaded keypoint partition in 0 sec
2019-02-08 13:54:50 loaded matching data in 0 sec
2019-02-08 13:54:50 Detecting points...
2019-02-08 13:54:50 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:54:50   driver version: 418.81, driver/runtime CUDA: 10010/5050
2019-02-08 13:54:50   max work group size 1024
2019-02-08 13:54:50   max work item sizes [1024, 1024, 64]
2019-02-08 13:54:50 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:54:50   driver version: 418.81, driver/runtime CUDA: 10010/5050
2019-02-08 13:54:50   max work group size 1024
2019-02-08 13:54:50   max work item sizes [1024, 1024, 64]
2019-02-08 13:54:51 [GPU] photo 1: 131726 points
...<clipped for brevity>
2019-02-08 13:55:26 [GPU] photo 404: 102017 points
2019-02-08 13:55:26 [GPU] photo 504: 126837 points
2019-02-08 13:55:27 Finished batch processing in 75.744 sec (exit code 0)
2019-02-08 13:55:32 Error: Aborted by user
2019-02-08 13:56:00 Saving project...
2019-02-08 13:56:00 saved project in 0.079 sec
2019-02-08 13:56:00 AlignPhotos: accuracy = High, preselection = generic, keypoint limit = 0, tiepoint limit = 0, apply masks = 0, filter tie points = 0, adaptive fitting = 0
2019-02-08 13:56:00 Matching photos...
2019-02-08 13:56:03 saved matching data in 0 sec
2019-02-08 13:56:08 scheduled 100 keypoint detection groups
2019-02-08 13:56:08 saved keypoint partition in 0 sec
2019-02-08 13:56:08 groups: 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128449 128395
2019-02-08 13:56:11 340435 of 5069 used (6716.02%)
2019-02-08 13:56:11 scheduled 100 keypoint matching groups
2019-02-08 13:56:11 saved matching partition in 0.046 sec
2019-02-08 13:56:12 loaded keypoint partition in 0 sec
2019-02-08 13:56:12 loaded matching data in 0 sec
2019-02-08 13:56:12 Detecting points...
2019-02-08 13:56:12 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:56:12   driver version: 418.81, driver/runtime CUDA: 10010/5050
2019-02-08 13:56:12   max work group size 1024
2019-02-08 13:56:12   max work item sizes [1024, 1024, 64]
2019-02-08 13:56:12 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:56:12   driver version: 418.81, driver/runtime CUDA: 10010/5050
2019-02-08 13:56:12   max work group size 1024
2019-02-08 13:56:12   max work item sizes [1024, 1024, 64]
2019-02-08 13:56:12 Using device: GeForce GTX TITAN X, 24 compute units, free memory: 10093/12288 MB, compute capability 5.2
2019-02-08 13:56:12   driver version: 418.81, driver/runtime CUDA: 10010/5050
2019-02-08 13:56:12   max work group size 1024
2019-02-08 13:56:12   max work item sizes [1024, 1024, 64]
2019-02-08 13:56:13 [GPU] photo 100: 109523 points
2019-02-08 13:56:13 [GPU] photo 0: 99254 points
2019-02-08 13:56:13 Error: CUDA_ERROR_UNKNOWN (999) at line 128
2019-02-08 13:56:13 Saving project...
2019-02-08 13:56:13 saved project in 0.063 sec
2019-02-08 13:56:13 OptimizeCameras: f, cx, cy, k1-k3, p1, p2, adaptive fitting = 0
2019-02-08 13:56:13 Optimizing camera locations...
2019-02-08 13:56:13 coordinates applied in 0 sec
2019-02-08 13:56:13 Finished processing in 0.015 sec
2019-02-08 13:56:13 Saving project...
2019-02-08 13:56:13 saved project in 0.063 sec
2019-02-08 13:56:13 Finished batch processing in 13.109 sec (exit code 1)

I don't have anything more to add at this point. I have to reboot every time I get a CUDA error to make the GPU happy again, so I'll try to figure out how to kill these processes on the Titan and see if I can get it to run OK.
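
When comparing several runs, it can help to pull out of a log like the one above which devices were active when the CUDA error appeared. The sketch below is one way to do that under stated assumptions: it treats each "Detecting points..." line as the start of a new device list (other processing stages may announce devices differently), and the names `DEVICE`, `ERROR` and `cuda_failures` are illustrative. The sample lines are copied from the failed run.

```python
import re

DEVICE = re.compile(r"Using device: (.+?), \d+ compute units")
ERROR = re.compile(r"Error: (CUDA_ERROR_\S+) \((-?\d+)\) at line (\d+)")


def cuda_failures(log_text):
    """Return one dict per CUDA error line, with the devices announced before it."""
    failures = []
    devices = []
    for line in log_text.splitlines():
        if "Detecting points..." in line:
            devices = []  # assumed marker for the start of a new GPU stage
            continue
        m = DEVICE.search(line)
        if m:
            devices.append(m.group(1))
            continue
        m = ERROR.search(line)
        if m:
            failures.append({
                "error": m.group(1),
                "code": int(m.group(2)),
                "line": int(m.group(3)),
                "devices": list(devices),
            })
    return failures


SAMPLE_LOG = """\
2019-02-08 13:56:12 Detecting points...
2019-02-08 13:56:12 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:56:12 Using device: GeForce GTX 980 Ti, 22 compute units, free memory: 5088/6144 MB, compute capability 5.2
2019-02-08 13:56:12 Using device: GeForce GTX TITAN X, 24 compute units, free memory: 10093/12288 MB, compute capability 5.2
2019-02-08 13:56:13 [GPU] photo 100: 109523 points
2019-02-08 13:56:13 [GPU] photo 0: 99254 points
2019-02-08 13:56:13 Error: CUDA_ERROR_UNKNOWN (999) at line 128
"""

for failure in cuda_failures(SAMPLE_LOG):
    print(failure["error"], failure["code"], failure["devices"])
```

Note that this only shows which devices were enabled at the time; the log does not say which of them raised the error.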

andyroo

Re: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« Reply #1 on: February 20, 2019, 10:08:25 PM »
Update - finally got around to attempting to generate a dense cloud with only my 2 980Ti GPUs and got the same damn CUDA error:

Code:
2019-02-19 18:30:35 Error: CUDA_ERROR_UNKNOWN (999) at line 128

I would love to know how to enable OpenCL in Metashape (tweaks?)

Andy

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • Posts: 14851
Re: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« Reply #2 on: February 21, 2019, 08:24:25 PM »
Hello Andy,

It's quite difficult to investigate such issues, as they seem to be related to hardware configuration (anything from a mechanical defect to the power supply or over/under-clocking) or to drivers. So far we have not been able to reproduce anything similar on our test configurations, which include NVIDIA cards of different series.

Best regards,
Alexey Pasumansky,
Agisoft LLC

andyroo

Re: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« Reply #3 on: June 19, 2019, 10:22:59 PM »
Just wanted to update this with latest.

I updated the problem machine to Win 10 v1903 yesterday using the Windows Update Assistant and installed Metashape 1.5.3.

I'm 18h into alignment (5297 100mpix images) with all 3 GPUs using CUDA and it appears to be working again.

NVidia Driver Version 397.93 (2x 980Ti & 1 Titan X)
Win 10 v 1903 build 18362.175

Andy

marcel.d

  • Newbie
  • Posts: 15
Re: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« Reply #4 on: March 09, 2021, 01:43:45 PM »
Hi Andy!

We are currently facing the same issue. Can you please confirm whether you are still running Windows 10 v 1903 and Nvidia driver 397.93 or whether you have since updated?

With some Nvidia driver versions we get this: https://www.agisoft.com/forum/index.php?topic=13156.0

But with others (e.g. 417.22) we do get the same error as you do. Our current Windows version is 1803.

Other specs:
CPU: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (server)
CPU family: 6 model: 79 signature: 406F1h
RAM: 511.9 GB
OpenGL Vendor: NVIDIA Corporation
OpenGL Renderer: GeForce GTX TITAN X/PCIe/SSE2

Since the setup is fairly similar, I hope to get your feedback, and maybe it'll also solve our issues. Thanks!

Marcel

Alexey Pasumansky

Re: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« Reply #5 on: March 09, 2021, 03:25:50 PM »
Hello Marcel,

Please check the latest NVIDIA driver 461.72:
https://www.nvidia.com/Download/driverResults.aspx/170886/en-us

If it doesn't solve the problem after system reboot, please provide the complete log related to the failed operation.

Best regards,
Alexey Pasumansky,
Agisoft LLC

RHenriques

  • Full Member
  • Posts: 225
Re: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« Reply #6 on: March 11, 2021, 12:14:18 AM »
I've been experiencing this kind of error too on macOS (High Sierra). The problem seems to occur under high memory demand on overclocked GPUs (in my case an MSI Nvidia 1080 Ti). The other Nvidia card is not giving any problems. As I proposed before regarding this issue, it would be great if we could lower the workload of a particular GPU a little. If I lower the processing parameters, both GPUs work just fine.

uqkgold1

  • Newbie
  • Posts: 1
Re: more CUDA ERROR - Error: CUDA_ERROR_UNKNOWN (999) at line 128
« Reply #7 on: December 09, 2022, 02:40:27 AM »
Hi,

I am having a similar issue. When I try to align my photos, I receive the following error message.

2022-12-02 11:08:59 Found 1 GPUs in 0.092482 sec (CUDA: 0.046111 sec, OpenCL: 0.046333 sec)
2022-12-02 11:09:00 Using device: Tesla T4, 40 compute units, free memory: 14971/15109 MB, compute capability 7.5
2022-12-02 11:09:00    driver/runtime CUDA: 11000/10010
2022-12-02 11:09:00    max work group size 1024
2022-12-02 11:09:00    max work item sizes [1024, 1024, 64]
2022-12-02 11:09:00 Finished processinguinoh89939 sec (exit, code 0)
2022-12-02 11:09:00 Error: CUDA_ERROR_UNKNOWN_CODE_-1 (-1) at line 149

I am working remotely on a Wiener HPC using the CVL virtual computing system.
Here are its specifications:
   CPU: AMD EPYC 7302
   RAM up to 180 GB
   GPU: NVidia Tesla T4, CUDA 11.0
   Operating System: Linux

I've tried using a smaller collection of photos and changing the setting for "Use CPU when performing GPU accelerated processing," but still receive the same error message.

Any ideas on why I am receiving these errors or what I could do to resolve them? Is this still common for NVidia GPUs?

Thanks,
Kirsten
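
A note on reading the `driver/runtime CUDA` pairs in the logs in this thread (10010/5050 in the first post, 11000/10010 here). Assuming Metashape prints these values unchanged from the CUDA driver and runtime version queries, which encode a version as 1000 × major + 10 × minor, they can be decoded with a small helper. The function names below are illustrative, not part of any Metashape or CUDA API.

```python
import re


def decode_cuda_version(value):
    """Decode CUDA's integer version (1000 * major + 10 * minor) into (major, minor)."""
    major, rest = divmod(value, 1000)
    return major, rest // 10


def driver_and_runtime(log_line):
    """Extract and decode the 'driver/runtime CUDA: A/B' pair from a log line, if present."""
    m = re.search(r"driver/runtime CUDA: (\d+)/(\d+)", log_line)
    if not m:
        return None
    return decode_cuda_version(int(m.group(1))), decode_cuda_version(int(m.group(2)))


line = "2022-12-02 11:09:00    driver/runtime CUDA: 11000/10010"
print(driver_and_runtime(line))
```

Under that assumption, the line above reads as a driver supporting CUDA 11.0 and an application runtime built against CUDA 10.1.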