This is a very complex question indeed.
Being somewhat of a purist, when we recently set up a lab dedicated to large volumes of small-object photogrammetry, I insisted we go to the lengths of selecting a camera and lens with no IS functionality, as I didn't trust that the sensor would be properly centred even with IS switched off. The combination was also a good value proposition otherwise (Sony a6400 + 50 FE macro).
If I recall correctly, a modern IS system can vary up to 7 parameters simultaneously:
- the sensor can translate in 2 directions, which is equivalent to a shift in the principal point (Cx and Cy)
- the sensor can be rotated around the optical axis -- this should look almost like the camera simply being rotated, so should be fully corrected with basic position optimisations
- the sensor can rotate around the other 2 axes, which could do all sorts of unpredictable things to the results
- an optical element in the lens can move in 2 directions to give similar results to the sensor translating; minor optical aberrations may be introduced here
Notably, neither focal length nor focus point should be affected by any of these. In theory. (I’ve also been debating whether changing focus between images could have similar issues, but that’s a different discussion. And don’t get me started on rolling shutter effects on all of the above.)
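To illustrate the first bullet, here is a minimal sketch of why an in-plane sensor translation looks like a change in Cx and Cy. It assumes an ideal pinhole camera with no distortion, and the function names, focal length, and offsets are all made up for illustration; the sign of the shift is just a convention I picked.

```python
def project(point, f, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel coordinates."""
    X, Y, Z = point
    return (f * X / Z + cx, f * Y / Z + cy)


def project_with_sensor_shift(point, f, cx, cy, dx, dy):
    """Same projection, but with the sensor translated by (dx, dy) pixels in its plane.

    Assumed convention: moving the sensor by +dx makes the image land dx pixels
    lower in sensor coordinates.
    """
    u, v = project(point, f, cx, cy)
    return (u - dx, v - dy)


def max_difference(points, f, cx, cy, dx, dy):
    """Largest pixel gap between 'shifted sensor' and 'shifted principal point'."""
    worst = 0.0
    for p in points:
        a = project_with_sensor_shift(p, f, cx, cy, dx, dy)
        b = project(p, f, cx - dx, cy - dy)  # same camera, principal point moved instead
        worst = max(worst, abs(a[0] - b[0]), abs(a[1] - b[1]))
    return worst


# Hypothetical numbers: focal length in pixels, nominal principal point, IS offset.
points = [(0.0, 0.0, 1.0), (0.1, -0.05, 0.8), (-0.2, 0.15, 1.5)]
diff = max_difference(points, f=12800.0, cx=3000.0, cy=2000.0, dx=7.5, dy=-3.25)
print(diff)  # expected to be zero up to floating-point rounding
```

If that holds, the catch for photogrammetry is presumably that a single shared Cx/Cy for the whole image set cannot absorb an offset that differs from shot to shot.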
Also, do remember that IS can allow slower shutter speeds, which in turn allow a lower ISO or a smaller aperture. Either may reduce noise or increase sharpness in the image, which may have other positive effects.
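For a feel of the size of that trade-off, here is a rough sketch of the usual exposure-stop arithmetic (one stop is a factor of 2 in shutter time or ISO, and a factor of sqrt(2) in f-number). The shutter speeds, ISO, and aperture below are purely illustrative, not measurements from any particular camera.

```python
import math


def stops_gained(shutter_without_is_s, shutter_with_is_s):
    """Number of stops of extra light from lengthening the exposure time."""
    return math.log2(shutter_with_is_s / shutter_without_is_s)


def iso_after(iso, stops):
    """ISO that gives the same exposure if all gained stops are spent on ISO."""
    return iso / (2.0 ** stops)


def f_number_after(f_number, stops):
    """f-number that gives the same exposure if all gained stops are spent on aperture."""
    return f_number * (math.sqrt(2.0) ** stops)


# Hypothetical example: IS lets us go from 1/200 s to 1/25 s.
stops = stops_gained(1 / 200, 1 / 25)
print(f"stops gained: {stops:.2f}")
print(f"ISO 800 could drop to about ISO {iso_after(800, stops):.0f}")
print(f"f/4 could stop down to about f/{f_number_after(4.0, stops):.1f}")
```

In practice the gained stops can be split between ISO and aperture, and stopping down too far runs into diffraction, so this is only the exposure side of the story.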
There is also a huge variable that I can't think of a way to evaluate. We know that the IS elements can move during the capture (which is only moderately controversial, as they at least attempt to keep the image as it was at the start of the exposure), but how far do they move before the capture? If the implementation keeps this distance very small, IS is likely better on than off. I know some implementations turn IS on when the user half-presses the shutter. The only thing I can say is that we can basically guarantee the offset will be as good as random for every shot, up to an undetermined maximum in each direction, as it depends on the details of the dynamics of the camera shake at that time.
Remember that manufacturers often recommend against leaving IS on when using a tripod, and it can straight up reduce image quality. IS is specifically designed and optimised to compensate for human hand shake.
In short, I have no idea. I'd love to see a well-run study into this though, one that covers all the variables.