Whilst the video may be continuous, this does not guarantee successful alignment. Image blur from movement, poor focus, lack of detail and too few tie points will all cause alignment to fail - take a close look at the images and see if there is any pattern to the failures.
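One quick way to screen extracted frames for blur (a rough sketch, not a definitive check - the synthetic images and the pass/fail comparison here are illustrative assumptions) is the variance-of-Laplacian measure: sharp frames produce high variance, soft frames low variance:

```python
import numpy as np

def laplacian_variance(gray):
    """Variance of the discrete Laplacian of a greyscale image.
    Low values suggest a blurred (soft) frame."""
    g = gray.astype(float)
    # 4-neighbour discrete Laplacian on the interior pixels
    lap = (-4 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return lap.var()

# Synthetic demo: a sharp checkerboard vs the same image heavily smoothed
sharp = (np.indices((64, 64)).sum(axis=0) % 2) * 255.0
blurred = sharp.copy()
for _ in range(10):  # crude repeated box blur
    blurred = (blurred
               + np.roll(blurred, 1, 0) + np.roll(blurred, -1, 0)
               + np.roll(blurred, 1, 1) + np.roll(blurred, -1, 1)) / 5

print(laplacian_variance(sharp) > laplacian_variance(blurred))  # True
```

Ranking the extracted frames by this score and inspecting the worst ones often reveals whether motion blur correlates with the failed alignments.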
26 USBL measurements over 10 minutes will require interpolation to estimate positions for every frame that falls between the measured fixes - I would hesitate to suggest this will help with alignment or deliver a well-scaled model.
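As a sketch of that interpolation step (the timestamps and coordinates below are invented; a real pipeline would parse the USBL log and the frame extraction times), linear interpolation per axis is the usual starting point:

```python
import numpy as np

# Hypothetical USBL fixes: time (s) and easting/northing/depth (m)
fix_t = np.array([0.0, 23.0, 47.0, 71.0])
fix_xyz = np.array([[100.0, 200.0, -30.0],
                    [102.5, 201.0, -30.2],
                    [105.1, 202.4, -30.1],
                    [107.8, 203.9, -30.4]])

# Frame timestamps, e.g. one frame per second extracted from the video
frame_t = np.arange(0.0, 72.0, 1.0)

# Linear interpolation per axis gives an estimated position per frame
frame_xyz = np.column_stack(
    [np.interp(frame_t, fix_t, fix_xyz[:, i]) for i in range(3)])

print(frame_xyz[23])  # the frame at t=23 s coincides with the second fix
```

Note this assumes the vehicle moves steadily between fixes; any USBL outlier or station-keeping manoeuvre will smear error across every interpolated frame, which is one reason I would not lean on these values for alignment.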
Extracting images from video means the cameras are treated as NC - not calibrated - and whilst Metashape can estimate the calibration during alignment, the missing focal length may be contributing to the alignment issues.
ROV cameras tend to be good at what they are designed for - seeing what is in front of the lens, recording video in low light and guiding the operator - but they may not deliver the high-quality stills that work best for photogrammetry. Can you share the camera data?
We use a similar technique but work with GPS points taken every 2-4 seconds. Using these for scaling delivers very consistent results, but we would not use these values to aid camera alignment:
https://accupixel.co.uk/2021/07/26/new-release-gps-position-processing/

Not all images need a GPS reference for scaling and location, so the first steps would be to validate the source image quality, rerun the alignment, and then apply the GPS values during recursive optimisation.
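A minimal sketch of that selective referencing idea - only the frames closest in time to an actual fix receive a position, the rest stay unreferenced (the frame rate, fix times and tolerance below are all assumptions for illustration):

```python
import numpy as np

# Hypothetical frame timestamps (s) and GPS/USBL fix times (s)
frame_t = np.arange(0.0, 60.0, 0.5)      # frames extracted at 2 fps
fix_t = np.array([1.0, 3.7, 6.1, 8.9])   # fixes roughly every 2-4 s

# For each fix, take the nearest frame in time, and keep it only if it
# falls within half a frame interval of the fix
tolerance = 0.25
nearest = np.argmin(np.abs(frame_t[:, None] - fix_t[None, :]), axis=0)
keep = np.abs(frame_t[nearest] - fix_t) <= tolerance

referenced = nearest[keep]
print(referenced)  # indices of the frames that get a position reference
```

Frames without a reference still contribute tie points to alignment; the referenced subset then scales and locates the model during optimisation.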