Went with my original idea of summing the view vectors of each camera, and this seems to have put me on the right track. The algorithm assumes that the imagery is either horizontal or pointing towards the ground - cameras that point up slightly are OK, but if your dataset also includes sky (i.e. full image coverage), then this approach won't work. It also assumes that the image set covers 360 degrees, so the x and y components of the camera view vectors essentially cancel each other out and leave only a vector that points vertically.
After the images have been loaded, assigned to a station camera group, matched and aligned, you can then sum all the camera view vectors (ignoring non-aligned cameras) by transforming the direction (0, 0, 1) from camera coordinates to world space using only the rotation part of each camera's transform - this gives you a vector in world space of where the camera is pointing. Once you have summed all those vectors, the result should point roughly straight down in model space. You can then determine a rotation matrix that aligns this vector with the vertical world axis and apply that to the export task.
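For anyone following along, the summing step looks roughly like this in the Metashape Python API (untested sketch - I'm assuming camera.transform.mulv() and the Matrix constructor behave the way I remember, so double-check against the API reference before relying on it):

```python
import math
import Metashape  # assumes this runs inside Metashape's Python console

chunk = Metashape.app.document.chunk

# Sum the view directions of every aligned camera. (0, 0, 1) in camera
# coordinates is the viewing axis; mulv() applies only the rotation part
# of the camera transform, so we get a direction rather than a point.
view_sum = [0.0, 0.0, 0.0]
for camera in chunk.cameras:
    if camera.transform is None:  # skip non-aligned cameras
        continue
    v = camera.transform.mulv(Metashape.Vector([0.0, 0.0, 1.0]))
    view_sum = [view_sum[i] + v[i] for i in range(3)]

length = math.sqrt(sum(c * c for c in view_sum))
d = [c / length for c in view_sum]  # averaged "down" direction in model coords

# Rotation that takes d onto the -Z axis (straight down), via the Rodrigues
# formula R = I + K + K^2 * (1 - cos) / sin^2, with k = d x target.
target = [0.0, 0.0, -1.0]
k = [d[1] * target[2] - d[2] * target[1],
     d[2] * target[0] - d[0] * target[2],
     d[0] * target[1] - d[1] * target[0]]
cos_a = sum(d[i] * target[i] for i in range(3))
sin2 = sum(ki * ki for ki in k)

if sin2 < 1e-12:
    # already (anti-)parallel to the vertical axis - nothing to correct
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
else:
    K = [[0.0, -k[2], k[1]],
         [k[2], 0.0, -k[0]],
         [-k[1], k[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)]
          for i in range(3)]
    f = (1.0 - cos_a) / sin2
    R = [[(1.0 if i == j else 0.0) + K[i][j] + f * K2[i][j]
          for j in range(3)] for i in range(3)]

rotation = Metashape.Matrix(R)  # 3x3 levelling rotation to fold into the export
print(rotation)
```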
I still see some undulations, so it's not perfect, but at least I have a starting point that I can iterate on.
The other thing I noticed is that there is almost always one camera pointing along an axis, so I might revise my algorithm to find that camera and pull its gimbal pitch out as my rotation value. Since we rely on DJI drones, we will have the gimbal angles in the image metadata; this second approach wouldn't work for drones that don't record that data, or for image sets where Metashape doesn't have the camera reference rotation data.
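If I do go down that route, the idea would be something like the sketch below (again untested - I'm assuming the loaded yaw/pitch/roll ends up in camera.reference.rotation in that order, and "pointing along an axis" here just means the yaw is closest to a cardinal direction):

```python
import Metashape

chunk = Metashape.app.document.chunk

best_camera = None
best_offset = 180.0

for camera in chunk.cameras:
    rot = camera.reference.rotation  # (yaw, pitch, roll) in degrees, or None
    if rot is None:
        continue
    yaw = rot.x % 90.0               # distance of the yaw from the nearest
    offset = min(yaw, 90.0 - yaw)    # cardinal axis (0/90/180/270)
    if offset < best_offset:
        best_offset = offset
        best_camera = camera

if best_camera is not None:
    pitch = best_camera.reference.rotation.y  # gimbal pitch from the DJI metadata
    print(best_camera.label, "is closest to an axis; pitch =", pitch)
```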