Forum

Author Topic: Network processing - MatchPhotos.prematch - initializing GPUs takes long time?  (Read 1748 times)

andyroo

  • Sr. Member
  • ****
  • Posts: 406
    • View Profile
Metashape 1.6.5 network processing on HPC.

I'm trying to understand the worker/processing node logfile in the network monitor during the MatchPhotos.prematch part of alignment.

It looks like ~1/2 of the time is being taken initializing the GPUs in each iteration of the task, but I don't know if there is more going on behind the scenes.

In the logfile excerpt below, it looks like each task (iteration of MatchPhotos.prematch) takes ~140-160s, and for each GPU (4 total) "free memory" takes ~18s - I'm not sure if that's querying free memory, or freeing the memory from the last run, or something else, but by my math, each iteration spends about 72s on the "free memory:" portion of the task. Is that normal?

Log excerpt below going through a few iterations starting from the end of point detection:

Code: [Select]
2020-11-29 23:47:18 [GPU] photo 82059: 70000 points
2020-11-29 23:47:18 points detected in 150.208 sec
2020-11-29 23:47:23 processing finished in 162.678 sec
2020-11-29 23:50:02 MatchPhotos.prematch (4/100): accuracy = High, preselection = generic, reference, keypoint limit = 70000, tiepoint limit = 0, apply masks = 0, filter tie points = 0
2020-11-29 23:50:10 loaded matching partition in 0.025084 sec
2020-11-29 23:50:10 loaded keypoint partition in 0.018421 sec
2020-11-29 23:50:54 loaded keypoints in 37.2838 sec
2020-11-29 23:50:54 loaded matching data in 0.000232 sec
2020-11-29 23:50:54 Selecting pairs...
2020-11-29 23:50:54 Found 4 GPUs in 0.261444 sec (CUDA: 0.217975 sec, OpenCL: 0.043425 sec)
2020-11-29 23:51:12 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:51:12   driver/runtime CUDA: 10010/6050
2020-11-29 23:51:12   max work group size 1024
2020-11-29 23:51:12   max work item sizes [1024, 1024, 64]
2020-11-29 23:51:12   got device properties in 0.001026 sec, free memory in 18.1027 sec
2020-11-29 23:51:30 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:51:30   driver/runtime CUDA: 10010/6050
2020-11-29 23:51:30   max work group size 1024
2020-11-29 23:51:30   max work item sizes [1024, 1024, 64]
2020-11-29 23:51:30   got device properties in 0.000813 sec, free memory in 18.3504 sec
2020-11-29 23:51:49 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:51:49   driver/runtime CUDA: 10010/6050
2020-11-29 23:51:49   max work group size 1024
2020-11-29 23:51:49   max work item sizes [1024, 1024, 64]
2020-11-29 23:51:49   got device properties in 0.000794 sec, free memory in 18.2349 sec
2020-11-29 23:52:06 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:52:06   driver/runtime CUDA: 10010/6050
2020-11-29 23:52:06   max work group size 1024
2020-11-29 23:52:06   max work item sizes [1024, 1024, 64]
2020-11-29 23:52:06   got device properties in 0.002548 sec, free memory in 17.4495 sec
2020-11-29 23:53:18 977304 matches found in 144.231 sec
2020-11-29 23:53:18 matches combined in 0.133228 sec
2020-11-29 23:53:19 filtered 84711 out of 579510 matches (14.6177%) in 0.116332 sec
2020-11-29 23:53:19 saved matches in 0.006697 sec
2020-11-29 23:53:19 processing finished in 197.588 sec
2020-11-29 23:53:22 MatchPhotos.prematch (11/100): accuracy = High, preselection = generic, reference, keypoint limit = 70000, tiepoint limit = 0, apply masks = 0, filter tie points = 0
2020-11-29 23:53:29 loaded matching partition in 0.010066 sec
2020-11-29 23:53:29 loaded keypoint partition in 0.004592 sec
2020-11-29 23:53:38 loaded keypoints in 7.61736 sec
2020-11-29 23:53:38 loaded matching data in 0.000374 sec
2020-11-29 23:53:38 Selecting pairs...
2020-11-29 23:53:38 Found 4 GPUs in 0.23591 sec (CUDA: 0.196422 sec, OpenCL: 0.039459 sec)
2020-11-29 23:53:56 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:53:56   driver/runtime CUDA: 10010/6050
2020-11-29 23:53:56   max work group size 1024
2020-11-29 23:53:56   max work item sizes [1024, 1024, 64]
2020-11-29 23:53:56   got device properties in 0.000824 sec, free memory in 18.0572 sec
2020-11-29 23:54:14 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:54:14   driver/runtime CUDA: 10010/6050
2020-11-29 23:54:14   max work group size 1024
2020-11-29 23:54:14   max work item sizes [1024, 1024, 64]
2020-11-29 23:54:14   got device properties in 0.000802 sec, free memory in 17.8316 sec
2020-11-29 23:54:32 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:54:32   driver/runtime CUDA: 10010/6050
2020-11-29 23:54:32   max work group size 1024
2020-11-29 23:54:32   max work item sizes [1024, 1024, 64]
2020-11-29 23:54:32   got device properties in 0.000784 sec, free memory in 17.5461 sec
2020-11-29 23:54:49 Using device: Tesla V100-SXM2-16GB, 80 compute units, free memory: 15806/16130 MB, compute capability 7.0
2020-11-29 23:54:49   driver/runtime CUDA: 10010/6050
2020-11-29 23:54:49   max work group size 1024
2020-11-29 23:54:49   max work item sizes [1024, 1024, 64]
2020-11-29 23:54:49   got device properties in 0.000805 sec, free memory in 17.486 sec
2020-11-29 23:56:01 269548 matches found in 142.999 sec
2020-11-29 23:56:01 matches combined in 0.062111 sec
2020-11-29 23:56:02 filtered 30416 out of 151205 matches (20.1157%) in 0.045621 sec
2020-11-29 23:56:02 saved matches in 0.006546 sec
2020-11-29 23:56:02 processing finished in 160.668 sec

Thanks for any insight. Happy to read more about network processing if I'm missing an informative reference/resource.

[edit] - a quick additional question/comment - I notice that in the code snippet below during the step between "setting point indices..." and "generated XXX tiepoints", there is a long time (almost an hour) with almost no CPU activity (I couldn't tell if the GPUs were active on this node). I was reviewing my workstation log and this long period occurs in non-network processing too - is this a GPU or CPU step? I guess it slows down with many tiepoints?

Code: [Select]
2020-11-30 01:46:24 MatchPhotos.finalize (1/1): accuracy = High, preselection = generic, reference, keypoint limit = 70000, tiepoint limit = 0, apply masks = 0, filter tie points = 0
2020-11-30 01:46:30 loaded matching data in 0.016731 sec
2020-11-30 01:46:31 loaded matching partition in 0.113757 sec
2020-11-30 01:46:31 loaded keypoint partition in 0.000596 sec
2020-11-30 01:46:44 loaded matches in 8.68071 sec
2020-11-30 01:46:44 setting point indices... 63857782 done in 25.3933 sec
2020-11-30 02:37:58 generated 63857782 tie points, 3.26963 average projections
2020-11-30 02:38:12 removed 344871 multiple indices
2020-11-30 02:38:13 removed 21032 track
« Last Edit: November 30, 2020, 11:47:37 AM by andyroo »

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • *****
  • Posts: 14003
    • View Profile
Hello andyroo,

According to the logs the calls related to the estimation of the free local memory on the GPU devices are taking longer time, than expected. But it could be related to the specifics of the current system configuration (OS, driver version, hardware).

You can try to switch to OpenCL implementation from CUDA and see if the initialization procedure works faster for the current setup.

Also you can try to run the processing nodes with --inprocess command line key, in this case the memory access calls in the beginning of each sub-task starting from the second one would also work faster.
Best regards,
Alexey Pasumansky,
Agisoft LLC

andyroo

  • Sr. Member
  • ****
  • Posts: 406
    • View Profile
Thank you for the tips, Alexey.

Is there a "correct" way to run a CUDA-capable node in openCL? should I just not pre-load the CUDA module, or should I use a tweak in the GUI or a switch when I launch the node?

my worker slurm script is as follows (I will add the --inprocess flag you mentioned):

Code: [Select]
#SBATCH -t 08:00:00
#SBATCH -p gpu
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -c 36
#SBATCH --hint=nomultithread
#SBATCH --gres=gpu:4
#SBATCH --gres-flags=disable-binding
#SBATCH -o %j-metashape-worker.out

module use /home/software/arc/modulefiles
module load cuda10.0/toolkit/10.0.130
module load metashape/1.6.5-pro

srun metashape.sh --node --dispatch xxx.xxx.xxx.xxx --capability any --cpu_enable 0 --gpu_mask 15 -platform offscreen

andyroo

  • Sr. Member
  • ****
  • Posts: 406
    • View Profile
WOW.

I added the --inprocess switch to some CPU nodes on a cpu-bound portion of a network filtering task, and it made a pretty dramatic difference - basically HALVED the processing time. Screenshot is from two identical hardware nodes spawned with and without this switch.

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • *****
  • Posts: 14003
    • View Profile
Hello andyroo,

Quote
Is there a "correct" way to run a CUDA-capable node in openCL? should I just not pre-load the CUDA module, or should I use a tweak in the GUI or a switch when I launch the node?
If you are able to open Metashape in GUI mode on the nodes it will be sufficient to set main/gpu_enable_cuda tweak value to False, close application and re-start the node.
Best regards,
Alexey Pasumansky,
Agisoft LLC

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • *****
  • Posts: 14003
    • View Profile
I added the --inprocess switch to some CPU nodes on a cpu-bound portion of a network filtering task, and it made a pretty dramatic difference - basically HALVED the processing time. Screenshot is from two identical hardware nodes spawned with and without this switch.
Hello andyroo,

Have you checked the timing for the matching stage?

It would be interesting to compare the total time spent on the same project (with the same processing parameters) with and without "inprocess" flag.
Best regards,
Alexey Pasumansky,
Agisoft LLC

andyroo

  • Sr. Member
  • ****
  • Posts: 406
    • View Profile
I will have to revisit this on a smaller project, unless you would be informed by timings in logfiles from a few nodes running the matching stage with and without the --inprocess flag. This project takes a few days to align (still on the alignment finalizing step, which seems to be limited to one node), so I won't be re-aligning to compare times.

Is there an equivalent tweak to --inprocess in the non-network performance? I could try it on my local workstation if so. Still trying to figure out how to update William's benchmark script to be network/node-aware.
« Last Edit: December 01, 2020, 08:35:46 PM by andyroo »

andyroo

  • Sr. Member
  • ****
  • Posts: 406
    • View Profile
Hello andyroo,

If you are able to open Metashape in GUI mode on the nodes it will be sufficient to set main/gpu_enable_cuda tweak value to False, close application and re-start the node.

I don't think there's an easy way for me to open the metashape GUI on a node, but I can avoid loading the CUDA module. Is there a way to add the tweak with a startup switch?
« Last Edit: December 02, 2020, 10:15:07 AM by andyroo »

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • *****
  • Posts: 14003
    • View Profile
Hello andyroo,

You can create the .py script with the following lines:
Code: [Select]
import Metashape
Metashape.app.settings.setValue("main/gpu_enable_cuda", "0")

and execute on the node in headless mode (maybe additional argument will be required: -plaform offscreen):
./metashape.sh -r script.py

then run Metashape in node mode on that computer.
« Last Edit: December 02, 2020, 09:01:15 PM by Alexey Pasumansky »
Best regards,
Alexey Pasumansky,
Agisoft LLC

andyroo

  • Sr. Member
  • ****
  • Posts: 406
    • View Profile
perfect! Thank you!