TLDR: GPU masking with Metashape 1.7.2 on a CentOS Linux node appears to be mirrored/reversed, or applied from high-to-low.
Not sure if this is expected behavior because the examples I found in the API/forum were ambiguous.
We had uncorrectable GPU memory errors on one of the cards in an HPC GPU node (CentOS), which I worked around by masking the offending GPU:
Jun 21 19:56:35 dl-0001 kernel: NVRM: GPU at PCI:0000:89:00: GPU-<censored>
Jun 21 19:56:35 dl-0001 kernel: NVRM: GPU Board Serial Number: <censored>
Jun 21 19:56:35 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a44da).
Jun 21 19:56:37 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a249c).
Jun 21 19:56:40 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 6, subpartition 0.
Jun 21 19:56:40 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a2ca1).
nvidia-smi reported GPU2 was bad (of GPUs 0-3):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   30C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   29C    P0    38W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    1 |
| N/A   30C    P0    38W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
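For anyone who wants to script the check rather than read the table by eye, here is a minimal sketch. It assumes nvidia-smi is run in CSV query mode with the fields index, pci.bus_id and ecc.errors.uncorrected.volatile.total (that command is my assumption and not what I ran above), and the helper name bad_gpus is made up for illustration. The sample text is hand-copied from the table above so the snippet runs without a GPU.

```python
import csv
import io

# Assumed query (not the command used in this post):
#   nvidia-smi --query-gpu=index,pci.bus_id,ecc.errors.uncorrected.volatile.total \
#              --format=csv,noheader
# Check `nvidia-smi --help-query-gpu` on your driver before relying on these field names.


def bad_gpus(csv_text):
    """Return the indices of GPUs with a nonzero volatile uncorrectable ECC count."""
    bad = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        index, _bus_id, ecc = (field.strip() for field in row[:3])
        # Non-numeric values (e.g. a card without ECC support) are ignored.
        if ecc.isdigit() and int(ecc) > 0:
            bad.append(int(index))
    return bad


# Hand-copied from the nvidia-smi table in this post.
sample = """0, 00000000:61:00.0, 0
1, 00000000:62:00.0, 0
2, 00000000:89:00.0, 1
3, 00000000:8A:00.0, 0
"""

print(bad_gpus(sample))
```

On a live node the same function could be fed the captured stdout of the query command instead of the sample string.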
But my attempts to mask led to some confusion about the expected mask behavior. For clarity, I'm writing the masks below in binary, but they were converted to decimal for the metashape --gpu_mask argument (i.e. binary 1011 = decimal 11).
With a depth mapping job active on the node, we confirmed that mask 0011 activated GPUs 0 and 1, mask 1100 crashed the metashape process, and mask 1011 (decimal 11) enabled GPUs 0, 1, and 3. Metashape was called as below to mask GPU2:
srun metashape.sh --node --dispatch $ip4 --capability any --cpu_enable 1 --gpu_mask 11 --inprocess -platform offscreen
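
To make the bit order explicit, here is a small sketch of the mapping that matches what we saw: the least significant bit (the rightmost digit when written in binary) corresponds to GPU 0. This is only inferred from the tests above, not confirmed against the Metashape documentation, and the helper names gpu_mask and enabled_gpus are made up for illustration.

```python
def gpu_mask(enabled, ngpus=4):
    """Build a decimal mask from a list of GPU indices, assuming bit i <-> GPU i."""
    mask = 0
    for i in enabled:
        if not 0 <= i < ngpus:
            raise ValueError(f"GPU index {i} is outside 0..{ngpus - 1}")
        mask |= 1 << i  # set bit i, counting from the least significant bit
    return mask


def enabled_gpus(mask):
    """Decode a decimal mask back into GPU indices under the same assumption."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# Enable GPUs 0, 1 and 3, leaving the faulty GPU 2 out.
mask = gpu_mask([0, 1, 3])
print(mask, format(mask, "04b"))  # 11 1011
print(enabled_gpus(0b0011))       # [0, 1]
```

Under this reading, the binary string is read right to left, which is why 1011 skips GPU 2 rather than GPU 1.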