Messages - andyroo

Bug Reports / GPU mask is reversed, or applied from high-to-low (CentOS)

« on: June 22, 2021, 10:11:02 PM »

TLDR: GPU masking with Metashape 1.7.2 on a CentOS Linux node is mirrored/reversed, or applied from high-to-low.

Not sure if this is expected behavior because the examples I found in the API/forum were ambiguous.

We had uncorrectable GPU memory errors on one of our cards on an HPC GPU node (CentOS) that I worked around by masking the offending GPU:

Code:

Jun 21 19:56:35 dl-0001 kernel: NVRM: GPU at PCI:0000:89:00: GPU-<censored>
Jun 21 19:56:35 dl-0001 kernel: NVRM: GPU Board Serial Number: <censored>
Jun 21 19:56:35 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a44da).
Jun 21 19:56:37 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a249c).
Jun 21 19:56:40 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU in the framebuffer at partition 6, subpartition 0.
Jun 21 19:56:40 dl-0001 kernel: NVRM: Xid (PCI:0000:89:00): 63, Dynamic Page Retirement: New page retired, reboot to activate (0x00000000000a2ca1).

nvidia-smi reported that GPU 2 (of GPUs 0-3) was bad:

Code:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   30C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                    0 |
| N/A   29C    P0    38W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    1 |
| N/A   30C    P0    38W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

But my attempts to mask led to some confusion about expected mask behavior. For clarity below I'm representing the masks as binary, but they were converted to decimal in my metashape --gpu_mask argument (i.e., binary 1011 = decimal 11).

With a depth mapping job active on the node, we confirmed that masking 0011 activated GPUs 0 and 1, masking 1100 crashed the metashape process, and masking 1011 (decimal 11) enabled GPUs 0, 1, and 3. Metashape was called as below to mask GPU 2:

Code:

srun metashape.sh --node --dispatch $ip4 --capability any --cpu_enable 1 --gpu_mask 11 --inprocess -platform offscreen
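
To make the two possible readings of the mask concrete, here is a minimal Python sketch that decodes a decimal mask into GPU indices both ways: with the rightmost binary digit standing for GPU 0, and with the leftmost digit standing for GPU 0. The helper name `enabled_gpus` is made up for illustration and is not part of Metashape; how Metashape itself interprets the mask is exactly the open question in this post.

```python
def enabled_gpus(mask, gpu_count, lsb_first=True):
    """Return the GPU indices switched on by an integer bitmask.

    lsb_first=True  -> the rightmost binary digit corresponds to GPU 0
    lsb_first=False -> the leftmost binary digit corresponds to GPU 0
    """
    bits = format(mask, "0{}b".format(gpu_count))  # e.g. 11 -> "1011"
    if lsb_first:
        bits = bits[::-1]  # reverse so that string position i is bit i
    return [i for i, bit in enumerate(bits) if bit == "1"]


# Masks taken from the post: binary 0011 (decimal 3) and 1011 (decimal 11)
for mask in (3, 11):
    print(mask, enabled_gpus(mask, 4), enabled_gpus(mask, 4, lsb_first=False))
```

Purely as arithmetic on the numbers above, the reported behavior (0011 activating GPUs 0 and 1, 1011 activating GPUs 0, 1, and 3) is what the rightmost-digit-is-GPU-0 reading produces.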

Python and Java API / HPC scripting and node usage - best practices for processing nicely?

« on: June 17, 2021, 05:11:56 PM »

I'm trying to set up --nice slurm scripts to run big jobs at a low priority but using all available nodes. I have a couple questions about best practices:

Is it possible to pass a signal to the server/monitor/node to die/suspend nicely (i.e. pause/stop, finish the current task, then quit)? Right now by default if I scancel a job it just dies, but I can pass signals to child processes of the batch script. I'd like to figure out how my --nice nodes could exit/suspend in a way that finishes whatever subtask I'm in the middle of (because some of them take hours).
Is it possible to specify that certain nodes get priorities for specific long-running tasks (like AlignCameras.finalize)? Ideally I'd assign that task to a node that has normal priority, and not a --nice node. Also ideally if there's an unused node of the right type available (CPU vs GPU), I'd like to start a fresh job to maximize my time in case I'm near the end of my allocation for a given node, since interrupting that task sometimes costs up to 24+ hours.
Are there other Metashape tips/tricks for network processing nicely that I haven't thought of? I'm especially interested in being able to spawn and retire nodes as needed during different stages of a batch or script. At the moment I have a workflow that is batch-driven, where some steps run scripts that affect the whole document, and others are simple batch steps. I imagine to do good node management I'd need to go to 100% scripted.
If anyone has some example Python code that spawns and retires nodes, I would be much obliged if you shared it.

Thanks!

Bug Reports / 1.7.3 Possible save bug if disk full

« on: June 16, 2021, 11:26:50 PM »

I ran out of space on my work disk while saving a project, so I deleted some stuff unrelated to the project to make room, then tried re-saving. During the second attempt at saving I got an error that the file is being used by another process. I have no other instances of Metashape running. Guessing my only option is to save the entire project as another project.

Full console errors below:

Code:

2021-06-16 10:46:56 SaveProject: path = D:/FloSup/FloSup_Align4d/FloSup_batch_2/optimize_tests/FloSup_4D_202008-202104.psx
2021-06-16 10:46:56 Saving project...
2021-06-16 10:52:11 Error: Can't write file: There is not enough space on the disk (112): D:/FloSup/FloSup_Align4d/FloSup_batch_2/optimize_tests/FloSup_4D_202008-202104.files/10/0/point_cloud/point_cloud.zip.tmp
2021-06-16 10:52:11 Error: Can't remove file: The process cannot access the file because it is being used by another process (32): D:/FloSup/FloSup_Align4d/FloSup_batch_2/optimize_tests/FloSup_4D_202008-202104.files/10/0/point_cloud/point_cloud.zip.tmp
2021-06-16 10:52:11 Finished processing in 314.97 sec (exit code 0)
2021-06-16 10:52:11 Error: Can't write file: There is not enough space on the disk (112): D:/FloSup/FloSup_Align4d/FloSup_batch_2/optimize_tests/FloSup_4D_202008-202104.files/10/0/point_cloud/point_cloud.zip.tmp
2021-06-16 13:11:10 SaveProject: path = D:/FloSup/FloSup_Align4d/FloSup_batch_2/optimize_tests/FloSup_4D_202008-202104.psx
2021-06-16 13:11:11 Saving project...
2021-06-16 13:16:20 Error: Can't remove file: The process cannot access the file because it is being used by another process (32): D:/FloSup/FloSup_Align4d/FloSup_batch_2/optimize_tests/FloSup_4D_202008-202104.files/10/0/point_cloud/point_cloud.zip.tmp
2021-06-16 13:16:20 Finished processing in 309.418 sec (exit code 0)
2021-06-16 13:16:20 Error: Can't replace file or directory: The process cannot access the file because it is being used by another process (32): D:/FloSup/FloSup_Align4d/FloSup_batch_2/optimize_tests/FloSup_4D_202008-202104.files/10/0/point_cloud/point_cloud.zip
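
A pre-save check on free disk space would not fix the locked .tmp file, but it could help avoid reaching that state from a script. A minimal sketch using only the Python standard library follows; `has_free_space` is a hypothetical helper, and the amount of headroom a given project needs is something you would have to estimate yourself.

```python
import shutil


def has_free_space(folder, required_bytes):
    """True if the volume holding `folder` has at least `required_bytes` free."""
    return shutil.disk_usage(folder).free >= required_bytes


# Example: require 50 GiB of headroom (an arbitrary illustrative figure)
# on the volume holding the current working directory.
if not has_free_space(".", 50 * 1024**3):
    print("Not enough free space; free some room before saving.")
```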

Bug Reports / 1.7.3 [win10] Error in console and python interpreter dies until restart

« on: June 11, 2021, 11:02:28 PM »

I'm getting the error below periodically. The first time it happened I was doing two things at once (sorry don't remember - I think I ran code right after saving) and I thought maybe I did the one thing too soon after the other.

This time (2nd time, a day later) I was just doing standard stuff in the console. I don't think I did a full reboot after the last error. Will try that now.

When I got googly on the error, the only stuff that jumped out was folks fixing a similar error by upgrading ipykernel or downgrading tornado.

Here's the log showing me triggering the error with a few commands prior.

Code:

In[29]: chunk
Out[29]: 2021-06-11 12:35:15 <Chunk 'Copy of Hatteras_Inlet_to_Ocracoke_Inlet_RuPaRe_x2_FloSup_4D_202008-202104'>

In[30]: group_label = chunk.camera_groups[0].label

In[31]: group_label
2021-06-11 12:35:48 ERROR:tornado.general:Uncaught exception in ZMQStream callback
2021-06-11 12:35:48 Traceback (most recent call last):
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\zmq\eventloop\zmqstream.py", line 438, in _run_callback
2021-06-11 12:35:48     callback(*args, **kwargs)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 120, in _handle_event
2021-06-11 12:35:48     event_f()
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 214, in <lambda>
2021-06-11 12:35:48     self.schedule(lambda : self._really_send(*args, **kwargs))
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 222, in _really_send
2021-06-11 12:35:48     self.socket.send_multipart(msg, *args, **kwargs)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\inprocess\socket.py", line 62, in send_multipart
2021-06-11 12:35:48     self.message_sent += 1
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 585, in __set__
2021-06-11 12:35:48     self.set(obj, value)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 574, in set
2021-06-11 12:35:48     obj._notify_trait(self.name, old_value, new_value)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 1134, in _notify_trait
2021-06-11 12:35:48     self.notify_change(Bunch(
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 1176, in notify_change
2021-06-11 12:35:48     c(change)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\inprocess\ipkernel.py", line 130, in _io_dispatch
2021-06-11 12:35:48     ident, msg = self.session.recv(self.iopub_socket, copy=False)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\jupyter_client\session.py", line 814, in recv
2021-06-11 12:35:48     msg_list = socket.recv_multipart(mode, copy=copy)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 246, in __getattr__
2021-06-11 12:35:48     warnings.warn("Accessing zmq Socket attribute %s on BackgroundSocket" % attr,
2021-06-11 12:35:48 DeprecationWarning: Accessing zmq Socket attribute recv_multipart on BackgroundSocket
2021-06-11 12:35:48 ERROR:tornado.general:Uncaught exception in zmqstream callback
2021-06-11 12:35:48 Traceback (most recent call last):
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\zmq\eventloop\zmqstream.py", line 456, in _handle_events
2021-06-11 12:35:48     self._handle_recv()
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\zmq\eventloop\zmqstream.py", line 486, in _handle_recv
2021-06-11 12:35:48     self._run_callback(callback, msg)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\zmq\eventloop\zmqstream.py", line 438, in _run_callback
2021-06-11 12:35:48     callback(*args, **kwargs)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 120, in _handle_event
2021-06-11 12:35:48     event_f()
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 214, in <lambda>
2021-06-11 12:35:48     self.schedule(lambda : self._really_send(*args, **kwargs))
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 222, in _really_send
2021-06-11 12:35:48     self.socket.send_multipart(msg, *args, **kwargs)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\inprocess\socket.py", line 62, in send_multipart
2021-06-11 12:35:48     self.message_sent += 1
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 585, in __set__
2021-06-11 12:35:48     self.set(obj, value)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 574, in set
2021-06-11 12:35:48     obj._notify_trait(self.name, old_value, new_value)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 1134, in _notify_trait
2021-06-11 12:35:48     self.notify_change(Bunch(
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 1176, in notify_change
2021-06-11 12:35:48     c(change)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\inprocess\ipkernel.py", line 130, in _io_dispatch
2021-06-11 12:35:48     ident, msg = self.session.recv(self.iopub_socket, copy=False)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\jupyter_client\session.py", line 814, in recv
2021-06-11 12:35:48     msg_list = socket.recv_multipart(mode, copy=copy)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 246, in __getattr__
2021-06-11 12:35:48     warnings.warn("Accessing zmq Socket attribute %s on BackgroundSocket" % attr,
2021-06-11 12:35:48 DeprecationWarning: Accessing zmq Socket attribute recv_multipart on BackgroundSocket
2021-06-11 12:35:48 ERROR:asyncio:Exception in callback BaseAsyncIOLoop._handle_events(2028, 1)
2021-06-11 12:35:48 handle: <Handle BaseAsyncIOLoop._handle_events(2028, 1)>
2021-06-11 12:35:48 Traceback (most recent call last):
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\asyncio\events.py", line 81, in _run
2021-06-11 12:35:48     self._context.run(self._callback, *self._args)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\tornado\platform\asyncio.py", line 139, in _handle_events
2021-06-11 12:35:48     handler_func(fileobj, events)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\zmq\eventloop\zmqstream.py", line 456, in _handle_events
2021-06-11 12:35:48     self._handle_recv()
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\zmq\eventloop\zmqstream.py", line 486, in _handle_recv
2021-06-11 12:35:48     self._run_callback(callback, msg)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\zmq\eventloop\zmqstream.py", line 438, in _run_callback
2021-06-11 12:35:48     callback(*args, **kwargs)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 120, in _handle_event
2021-06-11 12:35:48     event_f()
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 214, in <lambda>
2021-06-11 12:35:48     self.schedule(lambda : self._really_send(*args, **kwargs))
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 222, in _really_send
2021-06-11 12:35:48     self.socket.send_multipart(msg, *args, **kwargs)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\inprocess\socket.py", line 62, in send_multipart
2021-06-11 12:35:48     self.message_sent += 1
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 585, in __set__
2021-06-11 12:35:48     self.set(obj, value)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 574, in set
2021-06-11 12:35:48     obj._notify_trait(self.name, old_value, new_value)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 1134, in _notify_trait
2021-06-11 12:35:48     self.notify_change(Bunch(
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\traitlets\traitlets.py", line 1176, in notify_change
2021-06-11 12:35:48     c(change)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\inprocess\ipkernel.py", line 130, in _io_dispatch
2021-06-11 12:35:48     ident, msg = self.session.recv(self.iopub_socket, copy=False)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\jupyter_client\session.py", line 814, in recv
2021-06-11 12:35:48     msg_list = socket.recv_multipart(mode, copy=copy)
2021-06-11 12:35:48   File "C:\Program Files\Agisoft\Metashape Pro\python\lib\site-packages\ipykernel\iostream.py", line 246, in __getattr__
2021-06-11 12:35:48     warnings.warn("Accessing zmq Socket attribute %s on BackgroundSocket" % attr,
2021-06-11 12:35:48 DeprecationWarning: Accessing zmq Socket attribute recv_multipart on BackgroundSocket

After that the interpreter dies, and there's like a 30 second unresponsive spinning wheel after each command but no output:

Code:

In [32]: chunk

In [33]: Metashape.app.document

In [34]: print('oh no!')

In [35]:

[edit] after I closed metashape, the attached window hung around for a minute or two with the message:

IOStream.flush timed out

repeating a dozen times or so.
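
Since the fixes found by searching involve ipykernel and tornado versions, it may help to record which versions are bundled with the install before changing anything. This is a small sketch using `importlib.metadata` (Python 3.8 or newer, which is an assumption about the bundled interpreter); the package list is taken from the paths in the traceback above, with `pyzmq` assumed to be the distribution name behind the `zmq` module.

```python
from importlib import metadata


def installed_versions(packages=("ipykernel", "tornado", "pyzmq",
                                 "traitlets", "jupyter_client")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_versions())
```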

General / Network process surprise in 1.7.2 - align resumes after cancel & divide project!

« on: June 04, 2021, 01:26:07 AM »

I recently attempted a network processing alignment with around 140k images. When the Alignment.Cleanup failed due to out-of-RAM errors on the cleanup node, I killed the job in Monitor, then divided the project into two separate projects by copying the original chunk, deleting half the images from the original chunk, and the other half from the copied chunk. Each chunk was saved out to a new PSX file, and the original was closed without saving. "save keypoints" is not enabled (since I can't selectively delete them when/if I delete some images and divide the project later for dense matching). I also killed the server and monitor and restarted all processes.

I then restarted network alignment on the PSX saved out from the original chunk. There was an initial error about "can't resume matching without keypoints" or something like that, before the nodes started on the AlignCameras.align task without performing any matching (?!).

BUT - there is a point_cloud folder (~40GB) in the original project .files hierarchy, and in both of the sub-projects I divided and saved out, there are also ~20GB point_cloud folders, despite the fact that I canceled alignment because cleanup couldn't continue, and that I didn't have "save keypoints" enabled. So it appears that the network processing saved the matched-but-not-aligned state - effectively saving the keypoints anyway since the align stage didn't complete?! (if so, yay!).

The nodes appear to be crunching through what they perceive as valid matches (pic of monitor attached), and I'm confused - did the server save the partially complete state/matched keypoints, or are these data/status saved in the original project and transferred when I exported the edited chunk as a new PSX? Did the network processing task, because it didn't complete, somehow save the "matched-but-not-aligned" state of the original project? Would this NOT be saved in the copied chunk, but only as a property of the original chunk? So many questions.

The interesting thing to me was that the project skipped matching entirely, but essentially restarted the align task from some post-matching point - even though I don't have "save keypoints" enabled, and after it looked like it initially tried to restart align.cleanup. I'll have to take a look at the logs when this is done (and see if it runs out of RAM again during cleanup) but I'm guessing since each sub-project only has half the points, that the task will complete.

This would be a nice "feature" in non-network mode, and it makes me want to bench running a large project in network vs non-network mode on a single workstation (I do this with small projects sometimes to test python script). I could definitely see value in running projects in network mode if it allows me to skip re-matching even if I choose not to save keypoints, if the process is somehow interrupted.

General / Metashape 1.6.5/1.7.2 observation on memory limit to alignment size

« on: June 03, 2021, 07:55:52 PM »

I haven't seen updated alignment ram usage numbers lately so I figured I'd share my latest learnings. I processed two collections of 36 MPix aerial images with roughly the same geometry. The first was processed in Metashape 1.6.5 and the second in 1.7.2.

Working on a cluster with 384 GB of RAM, the alignment limit (on high) appears to be between 82,000 and 139,000 images, with the final step of alignment being the limiting factor (performed on a single node).

Maximum RAM usage to align 82,129 images was 173.18GB in 1.6.5.11249. If this scaled linearly, 139,152 images should take ~293GB of RAM. But we ran out of RAM on a 384GB node trying to complete the alignment stage in 1.7.2 with that number of images.

Obviously these are different versions, but wanted to share what I know.

Andy
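
The linear extrapolation quoted above can be reproduced in a couple of lines of Python. The function name is made up for illustration, and linear scaling is the post's own working assumption rather than a documented property of the alignment step; the out-of-RAM failure on the 384GB node suggests the real growth is steeper, or that 1.7.2 behaves differently.

```python
def extrapolate_ram_gb(known_images, known_ram_gb, target_images):
    """Scale a measured peak RAM figure linearly with image count."""
    return known_ram_gb * target_images / known_images


# 82,129 images peaked at 173.18 GB; estimate for 139,152 images.
estimate = extrapolate_ram_gb(82129, 173.18, 139152)
print("{:.0f} GB".format(estimate))
```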

General / Re: New and still wondering...

« on: June 03, 2021, 05:24:50 AM »

Quote from: parry on May 26, 2021, 10:37:00 PM

...What might be useful for users like ourselves is to get a reference image set, with a set of assigned settings for alignment, cloud generation, texture generation etc. so we could assure ourselves of our approach and understand what happens when we use different settings.

parry and Lebobo - you might find it useful to download the Puget Systems Metashape Benchmark(s) and try them. They have a small and a big one here:

https://www.pugetsystems.com/labs/articles/Metashape-Benchmark-1457/

Bug Reports / Re: Error: std::bad_alloc building mesh from dense cloud - 1.7.2 node w/384GB RAM

« on: May 10, 2021, 11:19:54 PM »

Just started looking at "completed" meshes, and the first one I checked has some serious problems (see screenshots). Only partial reconstruction, and some artifacts that extend far outside the region. Screenshots of the model view window attached showing close and far views. Far view is sparse cloud vs mesh. Close view is dense cloud vs mesh. Mesh was reconstructed from batch dialog.

~~It *looks* like the top right of the mesh *could* be about where the upper right boundary of the original region was, but I'm not certain~~ [edit - the other messed up mesh shot off to the upper left, so not at all consistent with my previous (in ~~strikeout~~) hypothesis]. The project was aligned over a large area, then divided into smaller regions saved to new PSX files for region-based dense cloud processing.

-EDIT- looked at the remaining completed meshes and 2 of 6 were bad. Both had ~190GB of RAM as max usage (the two largest meshes - using about half of the total 384GB available on each node). Compared mesh and DEM construction time and size from the dense cloud, and there didn't appear to be a consistent advantage to mesh - meshes were generally about the same except where they broke, then ~2x longer. Best mesh performance was about 1/2 the time of the DEM, but avg was approximately equal. Mesh size across these 6 DEMs was consistently smaller, but only by about 17%, which wasn't enough to warrant pursuing this further. For our purposes, mesh-based ortho reconstruction is not as performant as DEM-based because of the requirement for single-node/full-allocation grid generation.

Bug Reports / Re: Error: std::bad_alloc building mesh from dense cloud - 1.7.2 node w/384GB RAM

« on: May 10, 2021, 09:32:13 PM »

It looks like in batch mode network processing, when erroring out with bad_alloc, metashape treats errors differently than on my workstation, and recycles/restarts the batch job after cycling through all chunks. Is this expected/optional behavior, and is there an option to change this so the job completes rather than endlessly cycling? Screenshot attached.

Bug Reports / Re: Error: std::bad_alloc building mesh from dense cloud - 1.7.2 node w/384GB RAM

« on: May 09, 2021, 10:55:43 PM »

Hi Alexey,

Can I calculate the amount of RAM that will be used in the complete grid from the dimensions reported from the grid? ~~I'm wondering if I can use the custom mesh size to optimize the grid without reducing resolution more than I have to.~~

-EDIT- It looks like the grid size when generating a mesh from a dense cloud is a function of the depth map resolution x the region size. I'm wondering how much memory each grid cell needs - this would allow me to more accurately calculate the maximum region I can reconstruct at once, and to tile the dense cloud appropriately.

For example, I see that when I shrink the region/grid from 244691x628622 to 91307x600192 then instead of getting bad_alloc error I appear to be using about 200GB, which is fairly close to 32-bit float x raster size if I'm doing my math right...
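
As a rough check of that math, here is a small Python sketch that multiplies the grid dimensions by an assumed 4 bytes per cell (one 32-bit float). The 4-byte figure is the guess made above, not a confirmed detail of how Metashape stores the grid, and `grid_bytes` is a hypothetical helper.

```python
def grid_bytes(width, height, bytes_per_cell=4):
    """Memory for a dense raster, assuming a fixed number of bytes per cell."""
    return width * height * bytes_per_cell


# The grid that failed with bad_alloc, then the reduced grid that ran.
for width, height in ((244691, 628622), (91307, 600192)):
    size = grid_bytes(width, height)
    print("{}x{}: {:.0f} GB ({:.0f} GiB)".format(
        width, height, size / 1e9, size / 2**30))
```

Under that assumption the reduced grid comes out a little over 200 GB, and the original grid comes out well above the 384GB available on the node.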

Bug Reports / Re: Error: std::bad_alloc building mesh from dense cloud - 1.7.2 node w/384GB RAM

« on: May 08, 2021, 09:39:23 PM »

Quote from: Alexey Pasumansky on May 08, 2021, 09:07:22 PM

It seems Metashape was not able to allocate sufficient RAM for 248536x639273 grid.

Hi Alexey,

I have 256GB of RAM, and it looked like it never tried to allocate more than about 10% of that. Does the allocation error occur before the memory is assigned?

The DEMs (interpolated and uninterpolated) I already built for this chunk are much larger than what the mesh is attempting: 450228x942522 (DEMs built directly from the dense cloud).

I guess there is some operation for the meshing that requires more RAM than the DEM from dense cloud does?

Andy

Bug Reports / Re: Error: std::bad_alloc building mesh from dense cloud - 1.7.2 node w/384GB RAM

« on: May 08, 2021, 07:51:49 PM »

Just finished reconstruction on my local workstation using the same parameters as on the node, and got two bad allocation errors (the successful (2nd) chunk is a fraction of the size of the 1st and 3rd chunks, which threw bad allocation errors).

Interestingly I was at my workstation when the second one happened, and RAM usage was about 10% of my total RAM at that point. Is it reporting a bad allocation from an earlier step? Here's the log for the whole batch:

Code:

2021-05-08 08:55:22 saved project in 0.036 sec
2021-05-08 08:55:22 BuildModel: quality = High, depth filtering = Mild, PM version, reuse depth maps, source data = Dense cloud, surface type = Height field, face count = High, interpolation = Enabled, vertex colors = 0
2021-05-08 08:55:22 Generating mesh...
2021-05-08 08:58:13 generating 244691x628622 grid (0.00219517 resolution)
2021-05-08 08:58:13 rasterizing dem...2021-05-08 08:58:14 Error: bad allocation
Saving project...
2021-05-08 08:58:14 saved project in 0.035 sec
2021-05-08 08:58:14 BuildModel: quality = High, depth filtering = Mild, PM version, reuse depth maps, source data = Dense cloud, surface type = Height field, face count = High, interpolation = Enabled, vertex colors = 0
2021-05-08 08:58:14 Generating mesh...
2021-05-08 08:59:00 generating 192407x45168 grid (0.00214067 resolution)
2021-05-08 08:59:00 rasterizing dem... done in 50.546 sec
2021-05-08 08:59:51 filtering dem... done in 84.444 sec
2021-05-08 09:01:32 constructed triangulation from 4820125 vertices, 9640244 faces
2021-05-08 09:02:22 grid interpolated in 67.125 sec
2021-05-08 09:07:34 triangulating... 91578521 points 183151732 faces done in 1646.28 sec
2021-05-08 09:35:05 Peak memory used: 68.87 GB at 2021-05-08 09:35:00
2021-05-08 09:35:05 Finished processing in 2210.72 sec
2021-05-08 09:35:05 Saving project...
2021-05-08 09:35:57 saved project in 52.595 sec
2021-05-08 09:35:57 BuildModel: quality = High, depth filtering = Mild, PM version, reuse depth maps, source data = Dense cloud, surface type = Height field, face count = High, interpolation = Enabled, vertex colors = 0
2021-05-08 09:35:57 Generating mesh...
2021-05-08 09:37:52 generating 248536x639273 grid (0.00215781 resolution)
2021-05-08 09:37:52 rasterizing dem...2021-05-08 09:37:52 Error: bad allocation
Saving project...
2021-05-08 09:37:52 saved project in 0.048 sec
2021-05-08 09:37:52 Finished batch processing in 2550.7 sec (exit code 1)

I'll reprocess a single chunk and see what my max ram usage gets to. Is it possible to approximately calculate the maximum number of vertices I can have in a mesh with a given amount of RAM? Is there any other troubleshooting/diagnosis I can do to pin down the cause?

Mesh is being built with the parameters shown in the attachment. Only things I changed from the defaults were Surface Type (default Arbitrary, changed to Height field), Custom Face Count (default 200,000; changed to 0 - but shouldn't make a difference since I chose 'high' and not 'custom'), and calculate vertex colors (default yes, changed to no because I figured it would save time and I don't need them).

-EDIT- I ran the batch with just the first chunk selected. It failed in 100s and RAM usage never got above ~22MB
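
To compare grid sizes across chunks without reading the log by eye, a short Python sketch like the one below can pull the dimensions out of the "generating WxH grid" lines and flag the ones followed by an allocation error. It relies only on the line formats visible in the log above; `grid_outcomes` is a hypothetical helper, and other Metashape versions may word these lines differently.

```python
import re

GRID_RE = re.compile(r"generating (\d+)x(\d+) grid")


def grid_outcomes(log_lines):
    """Pair each 'generating WxH grid' line with whether an allocation error followed."""
    results = []
    for line in log_lines:
        match = GRID_RE.search(line)
        if match:
            results.append({"width": int(match.group(1)),
                            "height": int(match.group(2)),
                            "failed": False})
        elif results and ("bad allocation" in line or "bad_alloc" in line):
            results[-1]["failed"] = True  # error belongs to the most recent grid
    return results


sample = [
    "2021-05-08 08:58:13 generating 244691x628622 grid (0.00219517 resolution)",
    "2021-05-08 08:58:13 rasterizing dem...2021-05-08 08:58:14 Error: bad allocation",
    "2021-05-08 08:59:00 generating 192407x45168 grid (0.00214067 resolution)",
    "2021-05-08 08:59:00 rasterizing dem... done in 50.546 sec",
]
for entry in grid_outcomes(sample):
    print(entry["width"] * entry["height"], "cells, failed:", entry["failed"])
```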

Bug Reports / Re: Error: std::bad_alloc building mesh from dense cloud - 1.7.2 node w/384GB RAM

« on: May 08, 2021, 06:53:24 PM »

Hi Alexey,

Thanks for the quick reply!

Quote from: Alexey Pasumansky on May 08, 2021, 03:22:46 PM

How many points are there in the dense point cloud? Does it help to use Medium face count preset for Build Model operation?

~3.8 billion points (3,783,821,907)

Quote from: Alexey Pasumansky on May 08, 2021, 03:22:46 PM

If possible, please also provide the screenshot of the source dense cloud with the bounding box.

screenshot attached - <sigh> apparently I didn't save this particular psx after running the bounding box script (or I missed this one), so it doesn't/didn't have smaller, PCS-oriented bounding boxes. Too many irons in the fire... I also attached a screenshot of the "corrected" bounding box.

I downloaded the project from the HPC last night to try it on my local workstation in non-network mode. If it works here, I'll try the modified extent on the HPC. If that works, I'll also probably try my local workstation in network mode (with the single machine acting as host, monitor, node, and GUI).

Bug Reports / Error: std::bad_alloc building mesh from dense cloud - 1.7.2 node w/384GB RAM

« on: May 08, 2021, 04:48:50 AM »

Getting std::bad_alloc when trying to build an interpolated (not extrapolated) mesh on high from the dense cloud on some big chunks. Dense cloud is GCS (NAD83(2011)). I have successfully built interpolated and uninterpolated DEMs, and orthoimages for these chunks.

We first built an uninterpolated DEM from the dense cloud for the elevation model, then built an interpolated DEM and orthophoto (using the interpolated DEM).

I am now trying to build a mesh from the dense cloud to use for a comparison orthoimage (because in smaller experiments the mesh was much faster and smaller than the interpolated DEM).

The mesh was generated after rotating the bounding box to the DEM projected coordinate system (PCS = NAD83 UTM). Rotation was performed to minimize the height/width of the nodata collars on the DEM generated from the dense cloud, since if it stays rotated, the DEM bounds go all the way to the corners of the along-track-oriented (not PCS-oriented) bounding box. I wonder if the mesh is failing because it's doing grid interpolation over the whole empty area of the rotated bounding box. In that case, I need to switch the order or re-rotate the region to be oriented with the data, but it will probably still fail on another section that is L-shaped with a bunch of empty space.

These are the details from the node. I also included the preceding (smaller) mesh generation, which succeeded:

2021-05-07 17:45:55 BuildModel: source data = Dense cloud, surface type = Height field, face count = High, interpolation = Enabled, vertex colors = 0
2021-05-07 17:45:56 Generating mesh...
2021-05-07 17:46:20 generating 213317x132869 grid (0.00214379 resolution)
2021-05-07 17:46:20 rasterizing dem... done in 81.9141 sec
2021-05-07 17:47:42 filtering dem... done in 375.867 sec
2021-05-07 17:55:06 constructed triangulation from 21327465 vertices, 42654924 faces
2021-05-07 17:57:38 grid interpolated in 220.33 sec
2021-05-07 18:13:56 triangulating... 106374525 points 212748181 faces done in 4727.18 sec
2021-05-07 19:32:45 Peak memory used: 181.40 GB at 2021-05-07 19:32:43
2021-05-07 19:33:00 processing finished in 6425.13 sec
2021-05-07 19:33:00 BuildModel: source data = Dense cloud, surface type = Height field, face count = High, interpolation = Enabled, vertex colors = 0
2021-05-07 19:33:01 Generating mesh...
2021-05-07 19:33:37 generating 262471x233536 grid (0.00219694 resolution)
2021-05-07 19:33:37 rasterizing dem... done in 209.04 sec
2021-05-07 19:37:06 filtering dem... done in 847.863 sec
2021-05-07 19:53:17 constructed triangulation from 23493503 vertices, 46987000 faces
2021-05-07 19:57:34 grid interpolated in 380.113 sec
2021-05-07 20:20:53 Error: std::bad_alloc
2021-05-07 20:20:53 processing failed in 2872.89 sec
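
A quick back-of-envelope check on the two runs above, as a sketch only. It takes the grid sizes and the 181.40 GB peak from the log, and assumes purely for illustration that peak memory scales linearly with the number of grid cells. That scaling is my assumption, not documented Metashape behavior.

```python
# Grid sizes and peak memory copied from the node log above.
ok_grid = (213317, 132869)      # run that finished
failed_grid = (262471, 233536)  # run that hit std::bad_alloc
ok_peak_gb = 181.40             # "Peak memory used" for the run that finished
node_ram_gb = 384               # RAM on the node

def cells(grid):
    """Total number of cells in a (columns, rows) grid."""
    return grid[0] * grid[1]

ratio = cells(failed_grid) / cells(ok_grid)

# ASSUMPTION: peak memory grows linearly with cell count (not confirmed by Agisoft).
projected_peak_gb = ok_peak_gb * ratio

print(f"cells: {cells(ok_grid):,} vs {cells(failed_grid):,} (x{ratio:.2f})")
print(f"projected peak: {projected_peak_gb:.0f} GB on a {node_ram_gb} GB node")
```

Under that assumption the failed grid has a bit more than twice as many cells and would need slightly more than the node's 384 GB, which would be consistent with the bad_alloc. It is only an estimate, though.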

Bug Reports / Re: Installation fails on linux

« on: May 07, 2021, 09:28:20 PM »

Quote from: Alexey Pasumansky on May 07, 2021, 01:30:27 PM

Which OS distribution you are using...

CentOS 7.7.1908

Quote from: Alexey Pasumansky on May 07, 2021, 01:30:27 PM

and if you are working on the computer remotely (if so, then how the remote connection is established)

vncserver with Xfce desktop environment

the fix was:

yum install xcb-util-wm xcb-util-image xcb-util-keysyms xcb-util-renderutil

working now

[EDIT - adding new observations below]

I'm seeing these messages in the terminal now. I included the last debug line before the repeating sequence of console messages - with one buried error amongst the sequence. Been running for about 45 minutes now with ~8 repeats of the sequence:

Code: [Select]

loaded library "udev"
Only C and default locale supported with the posix collation implementation
Only C and default locale supported with the posix collation implementation
Case insensitive sorting unsupported in the posix collation implementation
Numeric mode unsupported in the posix collation implementation
Only C and default locale supported with the posix collation implementation
Only C and default locale supported with the posix collation implementation
Case insensitive sorting unsupported in the posix collation implementation
Numeric mode unsupported in the posix collation implementation
Only C and default locale supported with the posix collation implementation
Only C and default locale supported with the posix collation implementation
Case insensitive sorting unsupported in the posix collation implementation
Numeric mode unsupported in the posix collation implementation
Only C and default locale supported with the posix collation implementation
Only C and default locale supported with the posix collation implementation
Case insensitive sorting unsupported in the posix collation implementation
Numeric mode unsupported in the posix collation implementation
Only C and default locale supported with the posix collation implementation
Only C and default locale supported with the posix collation implementation
Case insensitive sorting unsupported in the posix collation implementation
Numeric mode unsupported in the posix collation implementation
qt.qpa.xcb: QXcbConnection: XCB error: 3 (BadWindow), sequence: 38163, resource id: 16689444, major code: 40 (TranslateCoords), minor code: 0
Only C and default locale supported with the posix collation implementation
Only C and default locale supported with the posix collation implementation
Case insensitive sorting unsupported in the posix collation implementation
Numeric mode unsupported in the posix collation implementation
Only C and default locale supported with the posix collation implementation
Only C and default locale supported with the posix collation implementation
Case insensitive sorting unsupported in the posix collation implementation
Numeric mode unsupported in the posix collation implementation

wondering if it's related to this post that says installing libicu-dev (before building Qt) makes the messages disappear...
