Topic: GPU mask is reversed, or applied from high-to-low (CentOS)

andyroo

GPU mask is reversed, or applied from high-to-low (CentOS)
« on: June 22, 2021, 10:11:02 PM »
TL;DR: GPU masking with Metashape 1.7.2 on a CentOS Linux node is mirrored/reversed, or applied from high to low.

I'm not sure if this is expected behavior, because the examples I found in the API docs and on the forum were ambiguous.

We had uncorrectable GPU memory errors on one of the cards in an HPC GPU node (CentOS), which I worked around by masking the offending GPU:
Code: [Select]
Jun 21 19:56:35 dl-0001 kernel: NVRM: GPU at PCI:0000:89:00: GPU-<censored>
Jun 21 19:56:35 dl-0001 kernel: NVRM: GPU Board Serial Number: <censored>
Jun 21 19:56:35 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a44da).
Jun 21 19:56:37 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a249c).
Jun 21 19:56:40 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 6, subpartition 0.
Jun 21 19:56:40 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a2ca1).

nvidia-smi reported that GPU 2 (of GPUs 0-3) was bad:
Code: [Select]
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   30C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   29C    P0    38W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    1 |
| N/A   30C    P0    38W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

But my attempts to mask it led to some confusion about the expected mask behavior. For clarity, I represent the masks below in binary, but they were converted to decimal for the metashape --gpu_mask argument (i.e., binary 1011 = decimal 11).
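
The binary-to-decimal conversion can be sketched in Python as below. This is only an illustration, not part of Metashape: the helper name mask_from_indices is hypothetical, and it assumes that bit i, counted from the least significant (rightmost) bit, selects device index i.

```python
def mask_from_indices(enabled):
    """Return the integer mask with bit i set for every device index i in `enabled`.

    Assumption: bit 0 (the rightmost binary digit) corresponds to device 0.
    """
    mask = 0
    for i in enabled:
        if i < 0:
            raise ValueError("device index must be non-negative")
        mask |= 1 << i
    return mask


# Enable GPUs 0, 1 and 3 on a four-GPU node, leaving GPU 2 masked out.
mask = mask_from_indices([0, 1, 3])
print(format(mask, "04b"), mask)  # prints "1011 11"
```

The decimal value printed last is what would be passed to --gpu_mask.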

With a depth-mapping job active on the node, we confirmed that mask 0011 activated GPUs 0 and 1, mask 1100 crashed the Metashape process, and mask 1011 (decimal 11) enabled GPUs 0, 1, and 3. Metashape was called as below to mask GPU 2:
Code: [Select]
srun metashape.sh --node --dispatch $ip4 --capability any --cpu_enable 1 --gpu_mask 11 --inprocess -platform offscreen

Alexey Pasumansky

  • Agisoft Technical Support
Re: GPU mask is reversed, or applied from high-to-low (CentOS)
« Reply #1 on: June 23, 2021, 11:53:42 AM »
Hello Andy,

The GPU mask is applied according to the GPU order in the Metashape.app.enumGPUDevices() list. Can you please check whether the device order in this list corresponds to the gpu_mask definition?

If the numbers correspond to the device index in this list, then the binary mask should follow this order:
876543210
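
As a rough illustration of that ordering, a decimal mask can be decoded back into device indices like this. The helper devices_from_mask is hypothetical and pure Python; it assumes the rightmost bit is index 0 in the enumGPUDevices() list and does not call Metashape itself.

```python
def devices_from_mask(mask, device_count):
    """List the device indices whose bit is set in `mask`.

    Assumption: bit 0 (rightmost) is the first device in the enumerated list.
    """
    return [i for i in range(device_count) if (mask >> i) & 1]


print(devices_from_mask(11, 4))  # binary 1011 -> [0, 1, 3]
print(devices_from_mask(12, 4))  # binary 1100 -> [2, 3]
```

Under this reading, mask 1100 selects devices 2 and 3. Whether index 2 is the same card that nvidia-smi lists as GPU 2 depends on the enumeration order, which is not verified here.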
Best regards,
Alexey Pasumansky,
Agisoft LLC