In general you need to try to fill the frame of each image with the object you're scanning.
For 'thin' objects this means getting closer to them, and therefore taking a lot more photos.
If the thin object doesn't take up much of the image Metashape won't know that it's 'important' and it will just align based on what else it can see.
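To get a feel for how little of the frame a thin object covers, here's a rough Python sketch. It's only my own pinhole-camera approximation, nothing to do with how Metashape works internally, and the function name, the 10 cm post width and the 65 degree field of view are made-up example values.

```python
import math

def frame_fill_fraction(object_width_m, distance_m, horizontal_fov_deg):
    """Fraction of the image width covered by an object of the given width.

    Simple pinhole model: the scene width visible at a given distance is
    2 * distance * tan(fov / 2). The result is clamped to 1.0.
    """
    visible_width_m = 2 * distance_m * math.tan(math.radians(horizontal_fov_deg) / 2)
    return min(1.0, object_width_m / visible_width_m)

# Hypothetical example: a 10 cm wide post, camera with a 65 degree horizontal FOV.
for distance_m in (3.0, 1.0, 0.5):
    fraction = frame_fill_fraction(0.10, distance_m, 65)
    print(f"{distance_m} m away: post fills {fraction:.1%} of the frame width")
```

Halving the distance doubles the share of the frame the post takes up, which is why thin objects need you to get in close and take more photos to cover them.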
If you mask out everything in the background and leave only a small part of the frame unmasked (the thin object), then the angles to the matching points won't be well enough distributed to give a good 3D position estimate (alignment), or there might simply not be enough unmasked image to find matching points at all.
If you do fill the frame of each image with your object but it has no texture, then Metashape also won't find many useful matching points.
So you may need to include some background in the images to help it align. Don't worry too much about cars, people or clouds, as Metashape won't find many points there anyway. Only worry about masking moving trees once you've got a good alignment and want to make it better.
If you get closer to your object then background objects will start to go out of focus anyway, assuming your camera is not fixed focus like a GoPro etc.
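If you want to put numbers on that, the standard thin-lens blur formula gives a rough idea. This is only a sketch: the function name and the example lens (24 mm at f/4, background 10 m away) are my own assumptions, and real lenses won't match it exactly.

```python
def background_blur_mm(focal_mm, f_number, focus_mm, background_mm):
    """Blur circle diameter on the sensor (mm) for a point at background_mm
    when the lens is focused at focus_mm (thin-lens approximation)."""
    aperture_mm = focal_mm / f_number
    magnification = focal_mm / (focus_mm - focal_mm)
    return aperture_mm * magnification * abs(background_mm - focus_mm) / background_mm

# Hypothetical example: 24 mm lens at f/4, background 10 m behind the camera's subject line.
for focus_mm in (3000, 1000, 500):
    blur = background_blur_mm(24, 4, focus_mm, 10_000)
    print(f"focused at {focus_mm / 1000} m: background blur is about {blur:.3f} mm on the sensor")
```

The closer the focus distance, the bigger the blur circle for the same background, so the background contributes fewer sharp features to match on.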
Using an external point cloud as a data source is only available in the Pro version, and won't really help photos that can't align by themselves. Markers are also a Pro-only feature, and manually aligned photos that wouldn't align by themselves will still not be very useful when you come to the meshing stage.
Trying to force photos to align that just won't align by themselves is a thankless task, even with the Pro version, and it's almost always necessary to go back and take more well-focussed photos with plenty of the object in them and lots and lots of overlap.
Yes, depth filtering will have more impact on thin objects, but I've never found that changing that setting helped if I was getting bad results with the default.
I would almost always use the depth maps method for mesh generation and skip the dense point cloud generation step.
For a ~2m fence 'post' for example, I would take ~30 photos in a circle around it, all facing in towards it, and angled down to include the bottom half and some of the ground. Take another ring of 30 photos looking more horizontally to get the middle and some background. Then take a third ring of 30 with the camera held high above your head, so you can capture the top half without pointing the camera at the sky. The sky is no use, cloudy or clear!
That should give you 90 photos that align ok, but depending on the texture of the post and the distance to it, might still not give you a nice clean model. If it's a shiny or clean plain post then it will be very hard, but if it's a nice old wooden one then you can get better results by just getting closer and taking more photos.
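If it helps to picture the three rings for the fence post, here's a small Python sketch that lays out the camera stations. The 1 m radius, the camera heights and the tilt angles are just my guesses for illustration, so adjust them to suit your post and lens.

```python
import math

def ring(radius_m, height_m, pitch_deg, n_photos=30):
    """Camera stations evenly spaced on a circle, all facing the post at the origin."""
    stations = []
    for i in range(n_photos):
        angle = 2 * math.pi * i / n_photos
        x = radius_m * math.cos(angle)
        y = radius_m * math.sin(angle)
        # Yaw (compass-style angle in the ground plane) that points back at the post.
        yaw_deg = math.degrees(math.atan2(-y, -x)) % 360
        stations.append((x, y, height_m, yaw_deg, pitch_deg))
    return stations

def fence_post_plan(radius_m=1.0):
    # Heights and pitch angles below are assumed values, not measured ones.
    low = ring(radius_m, 1.5, -40)    # tilted down: bottom half plus some ground
    mid = ring(radius_m, 1.2, 0)      # roughly horizontal: middle plus background
    high = ring(radius_m, 2.2, -20)   # held overhead: top half without the sky
    return low + mid + high

plan = fence_post_plan()
print(len(plan), "photos,", 360 / 30, "degrees between neighbours in each ring")
```

With 30 photos per ring, each step around the post is 12 degrees, which keeps plenty of overlap between neighbouring photos.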