Will your output be just a dense cloud, or are you also planning to create a mesh with textures?
What final quality do you need for the depth maps/dense cloud: ultra high, high, or just medium?
If your projects are aerial photos, they are easier and less time-consuming to process (fewer depth maps need to be filtered together).
If your project is archaeology or the interior/exterior of a building, where parts of the object were photographed from tens of different angles, it will be more time-consuming (more depth maps need to be filtered together).
If you can answer these questions, it will be clearer whether you need to invest more in the GPU(s) or in the CPU. The alignment phase is handled well by the GPU, so that is an easy decision. The rest of the processing is split roughly 50:50 between CPU and GPU, depending on what the output will be.
2000-7000 is a lot of photos (especially at 60 Mpix), but it can still be done on one computer if there is no time pressure to finish the project.
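Just to put a number on that photo count, here is a rough sketch in Python. It only multiplies the figures from this thread (2000-7000 photos, 60 Mpix each) to get the raw pixel volume; it says nothing about actual processing time, which depends on quality settings and hardware. The function name is made up for illustration.

```python
# Back-of-the-envelope pixel volume for the photo sets discussed here.
# Inputs (2000-7000 photos at 60 Mpix) are taken from the thread;
# total_gigapixels is a hypothetical helper, not part of any Metashape API.

def total_gigapixels(photo_count: int, megapixels_per_photo: float) -> float:
    """Return the total number of gigapixels in a photo set."""
    return photo_count * megapixels_per_photo / 1000.0


if __name__ == "__main__":
    for count in (2000, 7000):
        gpix = total_gigapixels(count, 60)
        print(f"{count} photos x 60 Mpix = {gpix:.0f} Gpix")
```

So the project sizes mentioned range from about 120 to 420 gigapixels of source imagery.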
I would build one computer and see how it handles the project in terms of time. I don't know enough about Metashape network processing: what data is shared, which phase gets sped up the most, and so on.
I would go for one PC with an AMD 16/24/32-core CPU and one or two GPUs (if 24 GB of VRAM is not needed, the 3080 Ti is the better option). The amount of RAM depends on the quality of the output.
My guess is that the whole $50k does not need to be spent on hardware + licenses for projects of the size you mentioned.
We'll see what others suggest.