Slurm

Some of the CAC's Private Clusters are managed with OpenHPC, which includes the Slurm Workload Manager (Slurm for short). Slurm (Simple Linux Utility for Resource Management) is a group of utilities used for managing workloads on compute clusters.

This page is intended to give users an overview of Slurm. Some of the information on this page has been adapted from the Cornell Virtual Workshop topics on the Stampede2 Environment and Advanced Slurm. For a more in-depth tutorial, please review these topics directly.

Overview

Some clusters use Slurm as both the batch queuing system and the scheduling mechanism. This means that jobs are submitted to Slurm from a login node, and Slurm handles scheduling these jobs on nodes as resources become available. Users submit jobs to the batch component, which is responsible for maintaining one or more queues (also known as "partitions"). These jobs include information about themselves as well as a set of resource requests. Resource requests range from the number of CPUs or nodes to specific node requirements (e.g., only use nodes with more than 2 GB of RAM). A separate component, called the scheduler, is responsible for figuring out when and where these jobs can run on the cluster. The scheduler must take into account the priority of the job, any reservations that may exist, when currently running jobs are likely to end, and so on. Once informed of the scheduling decisions, the batch system handles starting your job at the appropriate time and place. Slurm handles both of these components, so you don't have to think of them as separate processes; you just need to know how to submit jobs to the batch queue(s).

Note: Refer to the documentation for your cluster to determine what queues/partitions are available.
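As an illustration, here is a minimal batch script sketch that expresses such resource requests as #SBATCH directives. The partition name, resource amounts, and program below are placeholders rather than values specific to any particular cluster; replace them with settings appropriate for your cluster and job.

  #!/bin/bash
  #SBATCH --job-name=testjob          # name shown in queue listings
  #SBATCH --partition=normal          # placeholder partition/queue name; check your cluster's documentation
  #SBATCH --nodes=1                   # number of nodes requested
  #SBATCH --ntasks=1                  # number of tasks (processes)
  #SBATCH --cpus-per-task=4           # CPU cores per task
  #SBATCH --mem=2G                    # memory per node
  #SBATCH --time=00:30:00             # wall-clock time limit (HH:MM:SS)
  #SBATCH --output=testjob_%j.out     # standard output file; %j expands to the job ID

  # Commands to run go below the directives
  ./my_program

Submitting this script with sbatch testjob.sh places the job in the queue; Slurm starts it once the requested resources become available.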

Common Commands

  • sinfo displays information about nodes and partitions/queues. Use -l for more detailed output.
  • scontrol show nodes displays the state of the nodes.
  • scontrol show partition displays the state of the partitions/queues.
  • sbatch testjob.sh submits a job, where testjob.sh is the script you want to run (see the example session after this list).
  • srun -p <partition> --pty /bin/bash starts an interactive job on the given partition.
  • scontrol show job <job id> displays the state of a job.
  • scancel <job id> cancels a job.

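Putting these commands together, a typical session might look like the following sketch; the job ID 123456 is made up and will differ on your system.

  $ sinfo                           # check which partitions and nodes are available
  $ sbatch testjob.sh
  Submitted batch job 123456
  $ scontrol show job 123456        # inspect the job's state and allocated resources
  $ scancel 123456                  # cancel the job if it is no longer needed
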
References