Hi there,
Based on my experience, combining aerial and terrestrial images requires the following:
1. High-resolution images containing a lot of detail. You shoot much more detailed images from the ground than from a Phantom (I don't know what kind of camera you use).
2. Good overlap between the images. In my experience, oblique images taken from about "2nd floor elevation" work well.
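To put a rough number on the difference in detail from point 1, you can compare the ground sample distance (the size of one pixel on the subject) of the two cameras. This is only a minimal sketch using the standard pinhole approximation. The function name and all camera values below are hypothetical placeholders, not the specs of any particular drone or camera, so plug in your own.

```python
def ground_sample_distance_cm(sensor_width_mm, focal_length_mm,
                              distance_m, image_width_px):
    """Approximate size of one pixel on the subject, in centimetres.

    Pinhole approximation: pixel pitch scaled by distance / focal length.
    """
    return (sensor_width_mm * distance_m * 100.0) / (focal_length_mm * image_width_px)


# Hypothetical small-sensor aerial camera, shooting from 50 m away.
aerial_gsd = ground_sample_distance_cm(
    sensor_width_mm=6.17, focal_length_mm=3.6,
    distance_m=50.0, image_width_px=4000)

# Hypothetical handheld camera with a longer lens, shooting from 10 m away.
ground_gsd = ground_sample_distance_cm(
    sensor_width_mm=23.5, focal_length_mm=50.0,
    distance_m=10.0, image_width_px=6000)

print(f"aerial: {aerial_gsd:.2f} cm/px")
print(f"ground: {ground_gsd:.2f} cm/px")
print(f"ratio:  {aerial_gsd / ground_gsd:.1f}x")
```

A large ratio between the two values suggests the matcher has to bridge a big jump in scale, which is where intermediate oblique images are likely to help.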
When I do this type of modelling, I first collect nadir images (as if I were producing an orthophoto), then I take a camera with a zoom lens and fly around the site shooting oblique imagery. If there is still not enough detail, I take the camera in hand and shoot additional imagery.
Needless to say, I don't use a UAV but an MAV (small aircraft), because in my country it is illegal to fly drones in inhabited areas and around buildings.
See some examples of our results:
Here and