Forum

Author Topic: network processing - host/worker freeze/disconnect (freshdesk ticket Tuesday)  (Read 2365 times)

andyroo

  • Sr. Member
  • ****
  • Posts: 443
    • View Profile
Hi Agisoft folks. I'm running version 2.1.0.17526 on host and workers, on Windows 10 (host and some workers) and Windows 11 (a few workers).

TLDR; I'm having problems during align.finalize with freezing/disconnecting workers, and after an overnight freeze without disconnect, I was able to restart worker progress today by pressing the <enter> key in the console window?!

Additional details below, and I filed a freshdesk ticket (#194391) a couple days ago and have been updating it. Some information is repeated here in case it helps anyone or anyone has any insight...

On my first network align of ~26,000 images everything went fine until the align.finalize step and then all of the workers were periodically disconnected about 275 times with a message something like this on the monitor (on the host machine):

2024-01-06 14:40:04 192.168.88.201:58729] recv: An established connection was aborted by the software in your host machine (10053)
2024-01-06 14:40:04 [192.168.88.201:58729] failed #0 AlignCameras.finalize (1/1): Connection closed
2024-01-06 14:40:04 [192.168.88.201:58729] send: An existing connection was forcibly closed by the remote host (10054)
2024-01-06 14:40:04 [192.168.88.201:58729] worker removed

...

after about 250 times, I disconnected all nodes, saved the batch, updated network card drivers, restarted the host and a single node, and tried again. That was the day before yesterday. Overnight the worker disconnected another 25 or so times, then finally finished the first step(?!)

So that's weird - why would it fail 275 times then suddenly work? But it gets weirder -  I noticed yesterday around 5pm that the worker node appeared frozen - there was no activity on the host/monitor on the worker/details progress graph, and the last update was at 2024-01-10 14:59:30. I left it alone until a few moments ago, and was surprised to see that the worker node never disconnected. I noticed that the host showed this:
2024-01-10 14:59:30 adjusting: !xx

while the worker showed this:
2024-01-10 14:59:30 adjusting: !x

Out of frustration, or desperation, or I don't know why, I hit <enter> in the console (cmd) window of the worker, and then it showed this!
2024-01-10 14:59:30 adjusting: !xxx

AND weirdest of all (to me) the host graph started updating again, and the worker appears to be running normally...

The current worker log looks like this (with around 25 more disconnects above that I didn't paste in)

...
x2024-01-10 08:18:51 Error: Aborted by user
2024-01-10 08:18:51 processing failed in 1950.93 sec
disconnected from server
connected to 192.168.88.205:5840
registration accepted
2024-01-10 08:19:48 AlignCameras.finalize (1/1): subtask = finalize, adaptive_fitting = off, level = 6, cache_path = //SFM-HOST/Network_SfM/psx/SBC_master_all_images_guided_5k.files/0/align.1
2024-01-10 08:20:26 3 blocks: 20308 146 2
2024-01-10 08:34:38 block: 14 sensors, 20456 cameras, 381612284 points, 1593718574 projections
2024-01-10 08:34:38 block_sensors: 0.0118561 MB (0.0127029 MB allocated)
2024-01-10 08:34:38 block_cameras: 7.95941 MB (11.8527 MB allocated)
2024-01-10 08:34:38 block_points: 17468.8 MB (21044.8 MB allocated)
2024-01-10 08:34:38 block_tracks: 1455.74 MB (1455.74 MB allocated)
2024-01-10 08:34:38 block_obs: 72954.6 MB (72954.6 MB allocated)
2024-01-10 08:34:38 block_ofs: 2911.47 MB (2911.47 MB allocated)
2024-01-10 08:34:38 block_fre: 0 MB (0 MB allocated)
2024-01-10 08:37:58 adding 353608275 points, 0 far, 1612701 inaccurate, 14316 invisible, 179 weak
2024-01-10 08:40:59 adjusting: !x[192.168.88.205:5840] recv: An existing connection was forcibly closed by the remote host (10054)
x2024-01-10 08:52:17 Error: Aborted by user
2024-01-10 08:52:17 processing failed in 1949.04 sec
disconnected from server
connected to 192.168.88.205:5840
registration accepted
2024-01-10 08:53:15 AlignCameras.finalize (1/1): subtask = finalize, adaptive_fitting = off, level = 6, cache_path = //SFM-HOST/Network_SfM/psx/SBC_master_all_images_guided_5k.files/0/align.1
2024-01-10 08:53:53 3 blocks: 20308 146 2
2024-01-10 09:08:08 block: 14 sensors, 20456 cameras, 381612284 points, 1593718574 projections
2024-01-10 09:08:08 block_sensors: 0.0118561 MB (0.0127029 MB allocated)
2024-01-10 09:08:08 block_cameras: 7.95941 MB (11.8527 MB allocated)
2024-01-10 09:08:08 block_points: 17468.8 MB (21044.8 MB allocated)
2024-01-10 09:08:08 block_tracks: 1455.74 MB (1455.74 MB allocated)
2024-01-10 09:08:08 block_obs: 72954.6 MB (72954.6 MB allocated)
2024-01-10 09:08:08 block_ofs: 2911.47 MB (2911.47 MB allocated)
2024-01-10 09:08:08 block_fre: 0 MB (0 MB allocated)
2024-01-10 09:11:28 adding 353608275 points, 0 far, 1612701 inaccurate, 14316 invisible, 179 weak
2024-01-10 09:14:34 adjusting: !xxxxxxxxxxxxx!x!x!x 0.823228 -> 0.75973
2024-01-10 10:40:05 disabled 1 points
2024-01-10 10:43:18 adding 1641944 points, 117372 far, 1628257 inaccurate, 14315 invisible, 180 weak
2024-01-10 10:43:18 optimized in 5510.16 seconds
2024-01-10 10:43:18 f 8429.71, cx 27.8211, cy 29.5716, k1 -0.118365, k2 0.125958, k3 0.0315367
2024-01-10 10:43:18 f 8430.09, cx 25.449, cy 29.4083, k1 -0.118596, k2 0.127053, k3 0.030009
2024-01-10 10:43:18 f 8428.33, cx -0.407983, cy 16.2905, k1 -0.11752, k2 0.1239, k3 0.0364546
2024-01-10 10:43:18 f 8433.91, cx -2.50267, cy 15.1906, k1 -0.117851, k2 0.125403, k3 0.0334202
2024-01-10 10:43:18 f 8426.79, cx 6.52533, cy 14.7927, k1 -0.118174, k2 0.12028, k3 0.0464695
2024-01-10 10:43:18 f 8427.1, cx 10.2056, cy 13.7934, k1 -0.11839, k2 0.123723, k3 0.0398974
2024-01-10 10:43:18 f 8427.91, cx 39.1095, cy 31.2033, k1 -0.118537, k2 0.128099, k3 0.0267767
2024-01-10 10:43:18 f 8429.72, cx 20.7913, cy 33.5567, k1 -0.117826, k2 0.123211, k3 0.0376948
2024-01-10 10:43:18 f 7366.84, cx -13.8719, cy 18.8519, k1 -0.0897347, k2 0.117177, k3 -0.0388326
2024-01-10 10:43:18 f 8428.96, cx 22.1275, cy 34.2011, k1 -0.11777, k2 0.12464, k3 0.0328493
2024-01-10 10:43:18 f 8748.63, cx 4.73646, cy -35.1598, k1 -0.114165, k2 0.15207, k3 0.0610519
2024-01-10 10:43:18 f 8193.61, cx 0, cy 0, k1 0, k2 0, k3 0
2024-01-10 10:43:18 f 10242, cx 0, cy 0, k1 0, k2 0, k3 0
2024-01-10 10:43:18 f 8193.61, cx 0, cy 0, k1 0, k2 0, k3 0
2024-01-10 10:46:12 adjusting: !xxxxxxxxxxxxx!x!x!x 0.74288 -> 0.742355
2024-01-10 12:13:53 final block size: 20456
2024-01-10 12:17:13 adding 353617053 points, 0 far, 1613259 inaccurate, 14284 invisible, 179 weak
2024-01-10 12:17:13 (3 px, 2 3d) sigma filtering...
2024-01-10 12:20:19 adjusting: !xxxxxxxxxxxxx!x!x!x 0.822517 -> 0.759586
2024-01-10 13:47:34 point variance: 0.841952 px, threshold: 2.52586 px
2024-01-10 13:50:44 adding 1449532 points, 9624987 far (2.52586 px threshold), 1323685 inaccurate, 2899 invisible, 91 weak
2024-01-10 13:51:13 removed 4 cameras: 20395, 20397, 20398, 20399
2024-01-10 13:51:13 removed 4 stations
2024-01-10 13:53:46 adjusting: !xxxxxxxxxxxxx!x 0.278704 -> 0.27798
2024-01-10 14:54:00 point variance: 0.306791 px, threshold: 0.920373 px
2024-01-10 14:57:03 adding 1254004 points, 14764293 far (0.920373 px threshold), 1111041 inaccurate, 1055 invisible, 137 weak
2024-01-10 14:59:30 adjusting: !xxxxxxxx


[edit]
Update - the host killed the process when it was about 87% complete. Looks like I might have to restart from scratch and process on a local machine. Going to try to copy everything to the worker node and run host/worker/monitor all on one machine so I don't have to restart everything but I need to modify paths... not sure how that's going to work...

« Last Edit: January 12, 2024, 01:35:58 AM by andyroo »

Alexey Pasumansky

  • Agisoft Technical Support
  • Hero Member
  • *****
  • Posts: 15086
    • View Profile
Hello Andy,

How much RAM is available on the nodes that fail to complete the tasks? Do you see if memory consumption reaches the available limit before the nodes disconnect?
Best regards,
Alexey Pasumansky,
Agisoft LLC

andyroo

  • Sr. Member
  • ****
  • Posts: 443
    • View Profile
Hi Alexey,

I replied on freshdesk with basically the same text but posting here for other folks in case it's useful.

TLDR; I updated to 2.1.0.17532 and launched host/monitor/worker with admin privileges :o and was able to finish the align.finalize step

Three of the nodes the host rotated through had enough RAM (one with 192GB and two with 256GB), and disk swap/virtual memory was enabled (175GB to NVMe) on the one with 64GB. I watched the node with 64GB for a while yesterday before it crashed and the RAM got up to just about full then it looked like disk swap kicked in.

I didn't see crash reports on any of the nodes, and it looked like the host was hanging up on them based on the messages in the host log:

recv: An established connection was aborted by the software in your host machine (10053)
failed #0 AlignCameras.finalize (1/1): Connection closed
send: An existing connection was forcibly closed by the remote host (10054)


After I updated the server and a worker node to 2.1.0.17532, and edited the JSON to enter that version, AND launched the host and worker and monitor processes with admin privileges THEN the job finally completed, although it looks like the worker crashed and generated an error report before recovering and finally finishing.