Author Topic: HPC scripting and node usage - best practices for processing nicely?

andyroo
I'm trying to set up --nice Slurm scripts to run big jobs at low priority while still using all available nodes. I have a couple of questions about best practices:
  • Is it possible to pass a signal to the server/monitor/node so it dies or suspends nicely (i.e. pauses or stops, finishes its current task, then quits)? Right now, if I scancel a job it just dies by default, though I can pass signals to child processes of the batch script. I'd like to figure out how my --nice nodes could exit or suspend in a way that finishes whatever subtask they're in the middle of, because some of those subtasks take hours (rough sketch of what I'm picturing after this list).
  • Is it possible to specify that certain nodes get priority for specific long-running tasks (like AlignCameras.finalize)? Ideally I'd assign that task to a node running at normal priority, not a --nice node. Also, if an unused node of the right type (CPU vs. GPU) is available, I'd like to start a fresh job on it to maximize my time when I'm near the end of my allocation on a given node, since interrupting that task can cost 24+ hours (see the stage-submission sketch below).
  • Are there other Metashape tips/tricks for network processing nicely that I haven't thought of? I'm especially interested in being able to spawn and retire nodes as needed during different stages of a batch or script (spawn/retire skeleton below). At the moment my workflow is batch-driven: some steps run scripts that affect the whole document, and others are simple batch steps. I imagine that to do good node management I'd need to go 100% scripted.
  • If anyone has example Python code that spawns and retires nodes, I'd be much obliged.
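
To make the first question concrete, here's a minimal sketch of the graceful shutdown I'm picturing: a Python wrapper that drives the subtasks and traps a signal, so that `scancel --signal=USR1 --batch <jobid>` (or a pre-timeout warning via `#SBATCH --signal=B:USR1@600`) lets the current subtask finish before the job exits. The task list is obviously a placeholder for the real work:

```python
import signal
import subprocess
import sys

# Placeholder subtasks; stand-ins for the real hours-long steps.
TASKS = ["chunk_001", "chunk_002", "chunk_003"]

stop_requested = False

def request_stop(signum, frame):
    # Just set a flag; the loop checks it between subtasks, so the
    # subtask that is currently running finishes before we exit.
    global stop_requested
    stop_requested = True

# Delivered by: scancel --signal=USR1 --batch <jobid>
# or ahead of the time limit via: #SBATCH --signal=B:USR1@600
signal.signal(signal.SIGUSR1, request_stop)

for task in TASKS:
    if stop_requested:
        print(f"Stop requested; exiting cleanly before {task}", flush=True)
        sys.exit(0)
    subprocess.run(["echo", f"processing {task}"], check=True)  # real work here
```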
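
For the second question, the only Slurm-side approach I've come up with is submitting the interruption-sensitive stage as its own job without --nice on the right partition, while the expendable worker nodes get a large --nice value. The partition names and script names here are made up for my cluster:

```python
import subprocess

def submit(stage_script, nice=None, partition="cpu"):
    # Interruption-sensitive stages (e.g. whatever runs AlignCameras.finalize)
    # go in at normal priority; filler worker jobs get --nice so they yield.
    cmd = ["sbatch", f"--partition={partition}"]
    if nice is not None:
        cmd.append(f"--nice={nice}")
    cmd.append(stage_script)
    subprocess.run(cmd, check=True)

submit("align_finalize.sbatch", partition="gpu")   # protect the long finalize
submit("worker_node.sbatch", nice=10000)           # preemptible filler node
```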
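
And for spawning/retiring nodes, the (untested) skeleton I have in mind just tracks Slurm job IDs and combines the graceful-stop trap above with scancel. Here metashape_node.sbatch is an assumed wrapper script that launches a processing node:

```python
import subprocess

NODE_SCRIPT = "metashape_node.sbatch"  # assumed wrapper that launches a node

def spawn_nodes(count, nice=10000):
    """Submit `count` node jobs; return their Slurm job IDs."""
    job_ids = []
    for _ in range(count):
        out = subprocess.run(
            ["sbatch", "--parsable", f"--nice={nice}", NODE_SCRIPT],
            check=True, capture_output=True, text=True)
        job_ids.append(out.stdout.strip().split(";")[0])  # --parsable: "jobid[;cluster]"
    return job_ids

def retire_nodes(job_ids):
    """Ask each node job to wind down gracefully (see the trap sketch above)."""
    for jid in job_ids:
        subprocess.run(["scancel", "--signal=USR1", "--batch", jid], check=True)

nodes = spawn_nodes(8)   # e.g. lots of cheap --nice nodes for matching
# ... run the stage ...
retire_nodes(nodes)      # scale back down before the next stage
```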
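
On going 100% scripted: as far as I can tell from the Python API reference, network jobs can be submitted with Metashape.NetworkClient, converting each task with toNetworkTask(). Roughly like this, where the host name and paths are placeholders, and createBatch wants the project path relative to the network root, if I'm reading the docs right:

```python
import Metashape

client = Metashape.NetworkClient()
client.connect("cluster-head")  # placeholder server host

doc = Metashape.Document()
doc.open("/net/share/project.psx")  # placeholder project path
chunk = doc.chunk

task = Metashape.Tasks.MatchPhotos()       # configure one stage
network_task = task.toNetworkTask(chunk)   # convert for network processing

# createBatch expects the project path relative to the network root
batch_id = client.createBatch("project.psx", [network_task])
client.resumeBatch(batch_id)
```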

Thanks!