Hi Agisoft folks. I'm running version 2.1.0.17526 on host and workers, on Windows 10 (host and some workers) and Windows 11 (a few workers).
TL;DR: I'm having problems during align.finalize with workers freezing or disconnecting. After one worker froze overnight without disconnecting, I was able to restart its progress today by pressing the <enter> key in its console window?!
Additional details are below. I filed a Freshdesk ticket (#194391) a couple of days ago and have been updating it. Some of that information is repeated here in case it helps anyone, or in case anyone has any insight...
On my first network align of ~26,000 images, everything went fine until the align.finalize step. Then the workers were disconnected over and over, about 275 times in total, each time with messages something like this on the monitor (on the host machine):
2024-01-06 14:40:04 [192.168.88.201:58729] recv: An established connection was aborted by the software in your host machine (10053)
2024-01-06 14:40:04 [192.168.88.201:58729] failed #0 AlignCameras.finalize (1/1): Connection closed
2024-01-06 14:40:04 [192.168.88.201:58729] send: An existing connection was forcibly closed by the remote host (10054)
2024-01-06 14:40:04 [192.168.88.201:58729] worker removed
...
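In case anyone wants to count these disconnects without scrolling through the monitor, here is a minimal Python sketch. It assumes the monitor output has been saved to a text file in the same format as the lines above. The function name `count_worker_removals` and the `sample` text are my own illustration, not part of Metashape.

```python
import re
from collections import Counter

# Matches monitor lines such as:
#   2024-01-06 14:40:04 [192.168.88.201:58729] worker removed
# The opening bracket is optional in case it was lost when pasting.
REMOVED = re.compile(r"\[?(\d+\.\d+\.\d+\.\d+):\d+\] worker removed")

def count_worker_removals(log_text):
    """Count 'worker removed' lines per worker IP address."""
    counts = Counter()
    for line in log_text.splitlines():
        match = REMOVED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Sample copied from the monitor output quoted above.
sample = """\
2024-01-06 14:40:04 [192.168.88.201:58729] recv: An established connection was aborted by the software in your host machine (10053)
2024-01-06 14:40:04 [192.168.88.201:58729] failed #0 AlignCameras.finalize (1/1): Connection closed
2024-01-06 14:40:04 [192.168.88.201:58729] send: An existing connection was forcibly closed by the remote host (10054)
2024-01-06 14:40:04 [192.168.88.201:58729] worker removed
"""

for ip, n in count_worker_removals(sample).items():
    print(ip, n)
```

To run it on a real log, read the saved file into a string and pass that in place of `sample`.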
After about 250 disconnects, I disconnected all nodes, saved the batch, updated the network card drivers, restarted the host and a single node, and tried again. That was the day before yesterday. Overnight the worker disconnected another 25 or so times, then finally finished the first step(?!)
So that's weird: why would it fail 275 times and then suddenly work? But it gets weirder. Yesterday around 5pm I noticed that the worker node appeared frozen: the worker's details/progress graph in the monitor on the host showed no activity, and the last update was at 2024-01-10 14:59:30. I left it alone until a few moments ago, and was surprised to see that the worker node had never disconnected. I noticed that the host showed this:
2024-01-10 14:59:30 adjusting: !xx
while the worker showed this:
2024-01-10 14:59:30 adjusting: !x
Out of frustration, or desperation, or I don't know why, I hit <enter> in the console (cmd) window of the worker, and then it showed this!
2024-01-10 14:59:30 adjusting: !xxx
AND weirdest of all (to me) the host graph started updating again, and the worker appears to be running normally...
The current worker log looks like this (there are around 25 more disconnects above this that I didn't paste in):
...
x2024-01-10 08:18:51 Error: Aborted by user
2024-01-10 08:18:51 processing failed in 1950.93 sec
disconnected from server
connected to 192.168.88.205:5840
registration accepted
2024-01-10 08:19:48 AlignCameras.finalize (1/1): subtask = finalize, adaptive_fitting = off, level = 6, cache_path = //SFM-HOST/Network_SfM/psx/SBC_master_all_images_guided_5k.files/0/align.1
2024-01-10 08:20:26 3 blocks: 20308 146 2
2024-01-10 08:34:38 block: 14 sensors, 20456 cameras, 381612284 points, 1593718574 projections
2024-01-10 08:34:38 block_sensors: 0.0118561 MB (0.0127029 MB allocated)
2024-01-10 08:34:38 block_cameras: 7.95941 MB (11.8527 MB allocated)
2024-01-10 08:34:38 block_points: 17468.8 MB (21044.8 MB allocated)
2024-01-10 08:34:38 block_tracks: 1455.74 MB (1455.74 MB allocated)
2024-01-10 08:34:38 block_obs: 72954.6 MB (72954.6 MB allocated)
2024-01-10 08:34:38 block_ofs: 2911.47 MB (2911.47 MB allocated)
2024-01-10 08:34:38 block_fre: 0 MB (0 MB allocated)
2024-01-10 08:37:58 adding 353608275 points, 0 far, 1612701 inaccurate, 14316 invisible, 179 weak
2024-01-10 08:40:59 adjusting: !x[192.168.88.205:5840] recv: An existing connection was forcibly closed by the remote host (10054)
x2024-01-10 08:52:17 Error: Aborted by user
2024-01-10 08:52:17 processing failed in 1949.04 sec
disconnected from server
connected to 192.168.88.205:5840
registration accepted
2024-01-10 08:53:15 AlignCameras.finalize (1/1): subtask = finalize, adaptive_fitting = off, level = 6, cache_path = //SFM-HOST/Network_SfM/psx/SBC_master_all_images_guided_5k.files/0/align.1
2024-01-10 08:53:53 3 blocks: 20308 146 2
2024-01-10 09:08:08 block: 14 sensors, 20456 cameras, 381612284 points, 1593718574 projections
2024-01-10 09:08:08 block_sensors: 0.0118561 MB (0.0127029 MB allocated)
2024-01-10 09:08:08 block_cameras: 7.95941 MB (11.8527 MB allocated)
2024-01-10 09:08:08 block_points: 17468.8 MB (21044.8 MB allocated)
2024-01-10 09:08:08 block_tracks: 1455.74 MB (1455.74 MB allocated)
2024-01-10 09:08:08 block_obs: 72954.6 MB (72954.6 MB allocated)
2024-01-10 09:08:08 block_ofs: 2911.47 MB (2911.47 MB allocated)
2024-01-10 09:08:08 block_fre: 0 MB (0 MB allocated)
2024-01-10 09:11:28 adding 353608275 points, 0 far, 1612701 inaccurate, 14316 invisible, 179 weak
2024-01-10 09:14:34 adjusting: !xxxxxxxxxxxxx!x!x!x 0.823228 -> 0.75973
2024-01-10 10:40:05 disabled 1 points
2024-01-10 10:43:18 adding 1641944 points, 117372 far, 1628257 inaccurate, 14315 invisible, 180 weak
2024-01-10 10:43:18 optimized in 5510.16 seconds
2024-01-10 10:43:18 f 8429.71, cx 27.8211, cy 29.5716, k1 -0.118365, k2 0.125958, k3 0.0315367
2024-01-10 10:43:18 f 8430.09, cx 25.449, cy 29.4083, k1 -0.118596, k2 0.127053, k3 0.030009
2024-01-10 10:43:18 f 8428.33, cx -0.407983, cy 16.2905, k1 -0.11752, k2 0.1239, k3 0.0364546
2024-01-10 10:43:18 f 8433.91, cx -2.50267, cy 15.1906, k1 -0.117851, k2 0.125403, k3 0.0334202
2024-01-10 10:43:18 f 8426.79, cx 6.52533, cy 14.7927, k1 -0.118174, k2 0.12028, k3 0.0464695
2024-01-10 10:43:18 f 8427.1, cx 10.2056, cy 13.7934, k1 -0.11839, k2 0.123723, k3 0.0398974
2024-01-10 10:43:18 f 8427.91, cx 39.1095, cy 31.2033, k1 -0.118537, k2 0.128099, k3 0.0267767
2024-01-10 10:43:18 f 8429.72, cx 20.7913, cy 33.5567, k1 -0.117826, k2 0.123211, k3 0.0376948
2024-01-10 10:43:18 f 7366.84, cx -13.8719, cy 18.8519, k1 -0.0897347, k2 0.117177, k3 -0.0388326
2024-01-10 10:43:18 f 8428.96, cx 22.1275, cy 34.2011, k1 -0.11777, k2 0.12464, k3 0.0328493
2024-01-10 10:43:18 f 8748.63, cx 4.73646, cy -35.1598, k1 -0.114165, k2 0.15207, k3 0.0610519
2024-01-10 10:43:18 f 8193.61, cx 0, cy 0, k1 0, k2 0, k3 0
2024-01-10 10:43:18 f 10242, cx 0, cy 0, k1 0, k2 0, k3 0
2024-01-10 10:43:18 f 8193.61, cx 0, cy 0, k1 0, k2 0, k3 0
2024-01-10 10:46:12 adjusting: !xxxxxxxxxxxxx!x!x!x 0.74288 -> 0.742355
2024-01-10 12:13:53 final block size: 20456
2024-01-10 12:17:13 adding 353617053 points, 0 far, 1613259 inaccurate, 14284 invisible, 179 weak
2024-01-10 12:17:13 (3 px, 2 3d) sigma filtering...
2024-01-10 12:20:19 adjusting: !xxxxxxxxxxxxx!x!x!x 0.822517 -> 0.759586
2024-01-10 13:47:34 point variance: 0.841952 px, threshold: 2.52586 px
2024-01-10 13:50:44 adding 1449532 points, 9624987 far (2.52586 px threshold), 1323685 inaccurate, 2899 invisible, 91 weak
2024-01-10 13:51:13 removed 4 cameras: 20395, 20397, 20398, 20399
2024-01-10 13:51:13 removed 4 stations
2024-01-10 13:53:46 adjusting: !xxxxxxxxxxxxx!x 0.278704 -> 0.27798
2024-01-10 14:54:00 point variance: 0.306791 px, threshold: 0.920373 px
2024-01-10 14:57:03 adding 1254004 points, 14764293 far (0.920373 px threshold), 1111041 inaccurate, 1055 invisible, 137 weak
2024-01-10 14:59:30 adjusting: !xxxxxxxx
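For anyone comparing their own worker logs, here is a rough Python sketch that pulls two things out of a log like the one above: how long each failed run lasted, and how much memory the block buffers report as allocated. It assumes the log format shown above. The function name `summarize_worker_log` and the trimmed `sample` are my own illustration, not anything from Metashape.

```python
import re

# "processing failed in 1950.93 sec"
FAILED = re.compile(r"processing failed in ([\d.]+) sec")
# "block_points: 17468.8 MB (21044.8 MB allocated)"
BLOCK_MB = re.compile(r"(block_\w+): ([\d.]+) MB \(([\d.]+) MB allocated\)")

def summarize_worker_log(log_text):
    """Return (failed-run durations in seconds, allocated MB per block buffer).

    If a buffer is reported more than once, the last reported value is kept.
    """
    durations = [float(m.group(1)) for m in FAILED.finditer(log_text)]
    allocated = {}
    for m in BLOCK_MB.finditer(log_text):
        allocated[m.group(1)] = float(m.group(3))
    return durations, allocated

# Lines copied from the worker log above (trimmed to the relevant ones).
sample = """\
x2024-01-10 08:18:51 Error: Aborted by user
2024-01-10 08:18:51 processing failed in 1950.93 sec
2024-01-10 08:34:38 block_sensors: 0.0118561 MB (0.0127029 MB allocated)
2024-01-10 08:34:38 block_cameras: 7.95941 MB (11.8527 MB allocated)
2024-01-10 08:34:38 block_points: 17468.8 MB (21044.8 MB allocated)
2024-01-10 08:34:38 block_tracks: 1455.74 MB (1455.74 MB allocated)
2024-01-10 08:34:38 block_obs: 72954.6 MB (72954.6 MB allocated)
2024-01-10 08:34:38 block_ofs: 2911.47 MB (2911.47 MB allocated)
2024-01-10 08:34:38 block_fre: 0 MB (0 MB allocated)
x2024-01-10 08:52:17 Error: Aborted by user
2024-01-10 08:52:17 processing failed in 1949.04 sec
"""

durations, allocated = summarize_worker_log(sample)
print("failed run durations (s):", durations)
print("total allocated (GB):", round(sum(allocated.values()) / 1024, 1))
```

In the pasted log, both failed runs end after roughly 1950 seconds, and the block buffers together report a little over 96 GB allocated. I don't know whether either number is related to the disconnects, but they seemed worth noting.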
[edit]
Update - the host killed the process when it was about 87% complete. It looks like I might have to restart from scratch and process on a local machine. I'm going to try copying everything to the worker node and running host/worker/monitor all on one machine so I don't have to restart everything, but I'll need to modify paths... not sure how that's going to work...