One other 'out of the box' idea:
In Photoshop or similar, downsample the detailed images by a factor of two and save them, then repeat a few times. That gives you several sets at different resolutions, from the original resolution down to one that closely matches the aerial images, with a few intermediate resolutions in between. Add all of these images to a single chunk and align.
You can also try grouping the sets of identical images at the different resolutions into 'camera stations', and that might... do... something...?
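If doing the downsampling by hand in Photoshop gets tedious, it could be scripted. Below is a rough sketch in Python, assuming Pillow is installed. The function names, the `_x2`/`_x4` file naming and the `target_width` parameter (the pixel width at which a detailed image roughly matches the aerial resolution) are all made up for illustration. Note it resizes from the original each time instead of repeatedly re-saving the previous step, which should avoid stacking up JPEG compression loss.

```python
from pathlib import Path


def pyramid_sizes(width, height, target_width, max_levels=8):
    """List (width, height) per level, halving until the next
    halving would drop below target_width (hypothetical parameter)."""
    sizes = [(width, height)]
    while len(sizes) < max_levels and sizes[-1][0] // 2 >= target_width:
        w, h = sizes[-1]
        sizes.append((max(1, w // 2), max(1, h // 2)))
    return sizes


def save_pyramid(src, out_dir, target_width):
    """Write one downsampled copy of src per level into out_dir.
    Level 0 is the original file itself, so it is not re-saved."""
    from PIL import Image  # Pillow (third-party), imported only when needed

    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with Image.open(src) as img:
        sizes = pyramid_sizes(img.width, img.height, target_width)
        for level, size in enumerate(sizes):
            if level == 0:
                continue
            # Always resize from the full-resolution original.
            small = img.resize(size, Image.LANCZOS)
            out = out_dir / f"{src.stem}_x{2 ** level}{src.suffix}"
            small.save(out, quality=95)
            written.append(out)
    return written
```

For example, `pyramid_sizes(8000, 6000, 1000)` gives four levels (full, 1/2, 1/4 and 1/8 size), and `save_pyramid("IMG_0001.jpg", "pyramid", 1000)` would write the three reduced copies next to each other, ready to add to the chunk along with the originals.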