Some of the CAC's Private Clusters are managed with OpenHPC, which includes the Slurm Workload Manager (Slurm for short). Slurm (originally the Simple Linux Utility for Resource Management) is a group of utilities used for managing workloads on compute clusters.
This page is intended to give users an overview of Slurm. Some of the information on this page has been adapted from the Cornell Virtual Workshop topics on the Stampede2 Environment and Advanced Slurm. For a more in-depth tutorial, please review these topics directly.
One important distinction of Slurm configurations on CAC private clusters is that scheduling is typically done by CPU, not by node. This means that by default, a node may be shared among multiple users.
Some clusters use Slurm as both the batch queuing system and the scheduling mechanism. This means that jobs are submitted to Slurm from a login node, and Slurm handles scheduling these jobs on nodes as resources become available. Users submit jobs to the batch component, which is responsible for maintaining one or more queues (also known as "partitions"). Each job carries information about itself as well as a set of resource requests. Resource requests include anything from the number of CPUs or nodes to specific node requirements (e.g., only use nodes with > 2GB RAM). A separate component, called the scheduler, is responsible for figuring out when and where these jobs can be run on the cluster. The scheduler needs to take into account the priority of the job, any reservations that may exist, when currently running jobs are likely to end, etc. Once informed of scheduling information, the batch system will handle starting your job at the appropriate time and place. Slurm handles both of these components, so you don't have to think of them as separate processes; you just need to know how to submit jobs to the batch queue(s).
Note: Refer to the documentation for your cluster to determine what queues/partitions are available.
This section covers general job submission and job script composition; for more specific details on how to run jobs or job scripts and use queues on your particular system, see the documentation for the Private Cluster you are working on. Also note, many of the following commands have several options. For full details, see the man page for the command, or the Slurm Docs.
Common commands used to display information:
- sinfo displays information about nodes and partitions/queues. Use -l for more detailed information.
- scontrol show nodes views the state of the nodes.
- scontrol show partition views the state of the partition/queue.
Here are some common Job Control commands:
- sbatch testjob.sh submits a job, where testjob.sh is the script you want to run. Also see the Job Scripts section and the sbatch documentation.
- srun -p <partition> --pty /bin/bash -l starts an interactive job for you and opens a login shell where you can enter commands. Also see the srun documentation.
  - Note: remember to exit the session once you are done to free resources for other users.
- squeue -u my_userid shows the state of jobs for user my_userid. Also see the squeue documentation.
- scontrol show job <job id> views the state of a job. Also see the scontrol documentation.
- scancel <job id> cancels a job. Also see the scancel documentation.
- squeue with no arguments retrieves summary information on all jobs scheduled.
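The commands above are often chained together in a script, which requires capturing the job ID that sbatch reports. The following is a minimal sketch of one way to do that, assuming sbatch prints its usual "Submitted batch job <id>" line on success. The helper name job_id_from_sbatch is hypothetical and is not part of Slurm.

```shell
#!/bin/bash
# Sketch: extract the job ID from the line that sbatch prints on success.
# job_id_from_sbatch is a hypothetical helper, not a Slurm command.
job_id_from_sbatch() {
    # sbatch reports "Submitted batch job <id>"; the ID is the last field.
    awk '/^Submitted batch job/ {print $NF}'
}

# Demonstration with a canned message, so this runs without a cluster:
echo "Submitted batch job 12345" | job_id_from_sbatch

# On a cluster, the same helper could drive the other commands:
#   jobid=$(sbatch testjob.sh | job_id_from_sbatch)
#   squeue -u "$USER"
#   scontrol show job "$jobid"
#   scancel "$jobid"
```

The commented lines show the intended use on a login node; they are left as comments so the sketch can be tried anywhere.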
Once the job has completed, the stdout and stderr streams will be put in your $HOME directory named with the job id. To verify the job ran successfully, examine these output files.
The following table shows key directives that may be specified for each job. If these directives are not supplied, then defaults go into effect which could influence how your job runs. Some clusters may require one or more of these arguments to be provided.
| Meaning | Flag | Allowed Value | Example | Default |
|---|---|---|---|---|
| Submission queue | -p | Queue/partition name (valid names are cluster dependent) | -p normal | (cluster dependent) |
| Job walltime | -t | hh:mm:ss (not to exceed time limit of queue) | -t 00:05:00 | time limit of queue |
| Number of tasks | -n | 1 ... number of CPUs on N nodes (Slurm calculates N, if -N is not present) | -n 16 | 2*N (2, if -N is also not present) |
| Number of nodes | -N | 1 ... NP = number of nodes in partition (if N > NP, job is queued but never runs) | -N 2 | enough to satisfy -n (1, if -n is also not present) |
Specifying a time limit is helpful not only to you, but also to the scheduler and to your fellow cluster users. If you know in advance that your job will not take the maximum running time allowed in the queue, you should use the -t option to give an accurate expectation. This encourages the scheduler to "backfill" your job so it might run sooner than it otherwise would; and as a nice bonus, your job might get out of the way of other waiting jobs.
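If your run-time estimate is in seconds (from a timing test, say), it has to be converted to the hh:mm:ss form that -t expects. The sketch below shows one way to do the conversion. The helper name to_walltime is hypothetical, and the sample durations are arbitrary.

```shell
#!/bin/bash
# Sketch: turn an estimated run time in seconds into the hh:mm:ss form
# accepted by -t. to_walltime is a hypothetical helper name.
to_walltime() {
    secs=$1
    printf '%02d:%02d:%02d\n' \
        $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

to_walltime 300     # five minutes
to_walltime 5400    # ninety minutes
```

The result could then be passed along as, e.g., sbatch -t "$(to_walltime 300)" testjob.sh. It is usually wise to leave some margin above the measured time, since a job that reaches its limit is terminated.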
The maximum number of tasks that is feasible for a job depends on the hardware that is available in the chosen queue. In the absence of any other information, Slurm allocates the minimum number of nodes that can accommodate -n tasks with each task occupying one CPU. (Be careful that this number of nodes does not exceed the total number in the partition!)
It is important to know that Slurm typically counts each physical core of a multi-core processor as two CPUs. This is due to Intel's hyperthreading technology, which makes each physical core appear to be two hardware threads to the OS. For this reason, your job will always be assigned an even number of CPUs. Even a job consisting of just one task is assigned two CPUs.
Accordingly, when scheduling jobs, Slurm calculates the total number of CPUs per node as follows:
CPUs/node = (boards/node) * (sockets/board) * (cores/socket) * (hardware threads/core) = 1 * 2 * (cores/socket) * 2
Most of the above factors are fixed, because nearly all CAC clusters consist of dual-socket nodes, i.e., each of the 2 sockets in the node holds one Intel multi-core processor. The number of cores per processor is generally the only variable, and it can vary quite a lot, even within a single queue. Check your cluster's documentation for details.
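As a worked example of the formula, the sketch below computes CPUs/node for a dual-socket node and then the minimum number of nodes needed to hold a given number of tasks at one CPU per task. The figure of 8 cores per socket and the task count of 40 are assumptions chosen for illustration, and the helper names cpus_per_node and min_nodes are hypothetical.

```shell
#!/bin/bash
# Sketch: apply the CPUs/node formula, then find how many nodes are needed
# for a task count. Both helpers are hypothetical, not Slurm commands.

# Arguments: boards/node, sockets/board, cores/socket, hardware threads/core
cpus_per_node() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# Arguments: number of tasks, CPUs per node (rounds up to whole nodes)
min_nodes() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Assumed hardware: 1 board, 2 sockets, 8 cores/socket, 2 threads/core
cpn=$(cpus_per_node 1 2 8 2)
echo "CPUs/node = $cpn"
echo "nodes needed for 40 tasks = $(min_nodes 40 "$cpn")"
```

With these assumed values the node has 32 CPUs, so 40 tasks need 2 nodes. Substitute the cores/socket value from your cluster's documentation.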
How Many Tasks and Nodes Do I Need?
If you just want to run a serial application, it is sufficient to specify -n 1, which schedules your job on 1 CPU of 1 dedicated core. (This means Slurm actually has to assign you 2 CPUs = 2 hardware threads = 1 physical core.) However, there are many cases requiring more careful consideration:
- Your job script launches multiple processes (using srun, e.g.) and exits when all are done
- Your application is parallelized with MPI to run on multiple cores, across multiple nodes
- Your application is multithreaded with OpenMP to use multiple cores on the same node
- Your MATLAB or Simulink application is parallelized with the Parallel Computing Toolbox to use multiple cores
- Your application (including MATLAB) has very high memory requirements, in excess of 1-2 GB/core
In these situations, you should assign -n and -N (along with other options) to make sure your job has the resources it will need to run successfully. Please refer to the Parallel Applications section for guidance.
Example Command-line Job Submission
All of the key options can be specified on the command line with sbatch. For example, say you had the following script "simple_cmd.sh" to run:
    #!/bin/bash
    #Ensures that the node can sleep
    #print date and hostname
    date
    hostname
    #verify that sleep 5 works
    time sleep 5
In order to run this on the command-line, you could issue (where short is an available queue on the system):
$ sbatch -p short -t 00:01:00 -n 1 simple_cmd.sh
After you submit the job, see if you can quickly detect it in the queue with squeue. When it completes, you will find the output and error files in the same directory from which you submitted the job, with the names "slurm-%j.out" and "slurm-%j.err", where %j is the Slurm job number.
To save yourself from typing the same options repeatedly, there is an easier way to submit options to sbatch, as demonstrated in Job Scripts.
| Meaning | Flag | Value | Example | Default |
|---|---|---|---|---|
| Name of job | -J | any string that can be part of a filename | -J SimpleJob | filename of job script |
| Stdout | -o | filename (with or without path) | -o $HOME/project1/%j.out | slurm-%j.out |
| Stderr | -e | filename (with or without path) | -e $HOME/project1/%j.err | slurm-%j.err |
| Job dependency | -d | type:job_id | -d=afterok:1234 | (none) |
| Email address | --mail-user | email@domain | --mail-user=email@example.com | (none) |
| Email notification type | --mail-type | BEGIN, END, FAIL, REQUEUE, or ALL | --mail-type=ALL | (none) |
| Environment variable(s) to pass | --export | varname[=varvalue], ALL, NONE | --export ALL,LOC=/tmp/x | ALL |
Note: for more specific examples on how to write job scripts and use queues on your particular system, see the documentation for the Private Cluster you are working on.
Simple Job Script
Options to sbatch can be put into the batch script itself. This makes it easy to copy and paste them into new scripts, and lets you be confident that a job is submitted with the same arguments over and over again.
All that is required is to place the command line options in the batch script and prepend them with #SBATCH. They appear as comments to the shell, but Slurm parses them for you and applies them to your job. Here is an example, 'batch.sh', which also illustrates some of the useful environment variables that Slurm sets for you in your batch environment:
    #!/bin/bash
    #SBATCH -p short
    #SBATCH -t 00:01:00
    #SBATCH -n 1

    echo "starting at `date` on `hostname`"

    # Print Slurm job properties
    echo "SLURM_JOB_ID = $SLURM_JOB_ID"
    echo "SLURM_NTASKS = $SLURM_NTASKS"
    echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
    echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
    echo "SLURM_TASKS_PER_NODE = $SLURM_TASKS_PER_NODE"
    echo "SLURM_JOB_CPUS_PER_NODE = $SLURM_JOB_CPUS_PER_NODE"
Submit the above script with sbatch and examine the output in 'slurm-NNNNN.out' (where 'NNNNN' is the job number) to see the values of the Slurm environment variables.
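The other sbatch options (job name, output files, email notification) can be embedded in a script in the same way. The sketch below writes such a script with a here-document and counts the directives in it. The file name options_demo.sh, the job name OptionsDemo, and the address email@example.com are placeholders, and 'short' is assumed to be a valid queue on your cluster. Relative output paths are used because #SBATCH lines are read by Slurm rather than by the shell, so shell variables in them are not expanded.

```shell
#!/bin/bash
# Sketch: generate a job script that sets a job name, output files, and
# email notification. File name, job name, and address are placeholders.
cat > options_demo.sh <<'EOF'
#!/bin/bash
#SBATCH -p short
#SBATCH -t 00:01:00
#SBATCH -n 1
#SBATCH -J OptionsDemo
#SBATCH -o OptionsDemo-%j.out
#SBATCH -e OptionsDemo-%j.err
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=END

echo "job $SLURM_JOB_ID ran on $(hostname)"
EOF

# Count the #SBATCH directives that sbatch would read from the file.
grep -c '^#SBATCH' options_demo.sh
```

The generated file could then be submitted with sbatch options_demo.sh. Any option given on the sbatch command line takes precedence over the matching directive in the script.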
As mentioned in How Many Tasks and Nodes Do I Need?, careful consideration must be given to the -n and -N arguments for parallel applications of various kinds. Furthermore, Slurm provides additional flags that can help you to get the right resources for your job. The flags for parallel jobs described in this section are -n, -N, --cpus-per-task, --tasks-per-node, and --exclusive.
The -n option is the primary way to obtain the right number of CPUs. This does not mean that there must be exactly that number of real "tasks" to run (except in the case of MPI jobs); the important part is that Slurm must assign you the appropriate number of CPUs and nodes for the requested -n. In general, each task will be allocated 1 CPU (at a minimum, because CPUs/node must be an even number). The assigned CPUs and nodes will depend on resource availability, the number of tasks requested, and the values of other Slurm options.
However, you may be better off asking Slurm to give you twice as many CPUs as tasks. When 2 tasks run on 2 hardware threads (or Slurm CPUs) in the same physical core, there will be contention for the resources of that core (cycles, registers, caches, etc.). Therefore, it is often better if your request includes --cpus-per-task=2, or equivalently, -c 2.
If -N is specified along with -n, Slurm will allocate the exact number of nodes specified by -N, then distribute the total number of tasks specified by -n among the nodes. For example, -n 16 -N 2 will specify 16 tasks to be launched, distributed among the 2 allocated nodes. Each task will be allocated 1 CPU (or 2 if --cpus-per-task=2, though in any case, CPUs/node will be an even number).
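The arithmetic behind "CPUs/node will be an even number" can be sketched as follows. This is only an illustration of the rule as stated above (tasks times CPUs per task, rounded up to whole physical cores of 2 hardware threads each), not a reproduction of Slurm's actual allocation logic, which also depends on cluster configuration. The helper name cpus_on_node is hypothetical.

```shell
#!/bin/bash
# Sketch: estimate the CPUs assigned on one node, assuming 2 hardware
# threads per core so that the total is rounded up to an even number.
# cpus_on_node is a hypothetical helper, not a Slurm command.
# Arguments: tasks placed on the node, CPUs per task (default 1)
cpus_on_node() {
    tasks=$1
    cpus_per_task=${2:-1}
    want=$(( tasks * cpus_per_task ))
    echo $(( (want + 1) / 2 * 2 ))
}

cpus_on_node 8      # 8 tasks, 1 CPU each
cpus_on_node 8 2    # 8 tasks with --cpus-per-task=2
cpus_on_node 7      # an odd task count is rounded up
```

For the -n 16 -N 2 example with 8 tasks on each node, this gives 8 CPUs per node, or 16 per node with --cpus-per-task=2. A node holding 7 tasks would still be assigned 8 CPUs.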
Are tasks and CPUs distributed evenly among the nodes? Let's find out. The example batch script below, 'srun_batch.sh', also illustrates how srun can be used to launch multiple processes within an sbatch job:
    #!/bin/bash
    #SBATCH -p short
    #SBATCH -t 00:01:00
    #SBATCH -n 16
    #SBATCH -N 2
    #SBATCH -J srun_batch
    #SBATCH -o sbatch-%j.out
    #SBATCH -e sbatch-%j.err

    echo "starting at `date` on `hostname`"

    # Print Slurm job properties
    echo "SLURM_JOB_ID = $SLURM_JOB_ID"
    echo "SLURM_NTASKS = $SLURM_NTASKS"
    echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
    echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
    echo "SLURM_TASKS_PER_NODE = $SLURM_TASKS_PER_NODE"
    echo "SLURM_JOB_CPUS_PER_NODE = $SLURM_JOB_CPUS_PER_NODE"

    srun -n $SLURM_NTASKS srun_hello.sh

    echo "ended at `date` on `hostname`"
    exit 0
The commands in the above script will run just on the first node allocated to your job, and not in parallel. However, the srun command inside it will launch multiple, parallel processes of the script 'srun_hello.sh' on the full set of CPUs allocated to your job. This script, shown below, reports the Slurm variables that are defined in the environment of each srun process. Please create both of these scripts on your cluster, submit the first script to sbatch, then watch for the output in 'sbatch-NNNNN.out'.
    #!/bin/bash
    echo "Hello from `hostname`," \
    "$SLURM_CPUS_ON_NODE CPUs are allocated here" \
    "(or $SLURM_JOB_CPUS_PER_NODE)," \
    "I am rank $SLURM_PROCID on node $SLURM_NODEID," \
    "my task ID on this node is $SLURM_LOCALID"
You will find in most cases that the tasks are not distributed evenly among the nodes. But in MPI applications, such a distribution is often desired. One option would be to use --tasks-per-node instead of -n to determine the number of MPI tasks (processes) running on each node. In the above example, you could use --tasks-per-node=8 instead of -n 16, because 16 tasks ÷ 2 nodes = 8 tasks per node.
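When checking the distribution, note that Slurm reports SLURM_TASKS_PER_NODE in a compressed form: a value such as 8(x2) means 8 tasks on each of 2 nodes, while 9,7 means 9 tasks on the first node and 7 on the second. The sketch below expands that form into one count per node so that uneven layouts are easy to spot. The helper name expand_tasks_per_node is hypothetical, and the sample values are made up for illustration.

```shell
#!/bin/bash
# Sketch: expand the compressed SLURM_TASKS_PER_NODE format, e.g.
# "2(x3),1" -> "2 2 2 1". expand_tasks_per_node is a hypothetical helper.
expand_tasks_per_node() {
    result=""
    old_ifs=$IFS
    IFS=,
    for item in $1; do
        case $item in
            *"(x"*")")
                # "count(xreps)": count tasks on each of reps nodes
                count=${item%%"(x"*}
                reps=${item#*"(x"}
                reps=${reps%")"}
                ;;
            *)
                count=$item
                reps=1
                ;;
        esac
        i=0
        while [ "$i" -lt "$reps" ]; do
            result="${result:+$result }$count"
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
    echo "$result"
}

expand_tasks_per_node "8(x2)"    # even: 8 tasks on each of 2 nodes
expand_tasks_per_node "9,7"      # uneven: 9 on one node, 7 on the other
```

Inside a job script, the helper would be called as expand_tasks_per_node "$SLURM_TASKS_PER_NODE".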
Another option is to use the --exclusive flag to prevent other users from sharing your nodes with you. One effect of this option is to cause tasks to be distributed evenly among nodes. But it's an option you probably want anyway, if you request fewer tasks than there are CPUs on each node. By using it, you can, for example, allow each process to consume a greater percentage of a node's memory. Or, you can allow each process to fork multiple threads (with OpenMP, say) and use all the CPUs on a node in that way.
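For the multithreaded case, a job script along the following lines is one possibility. It is a sketch that assumes 'short' is a valid queue and that your program reads the standard OpenMP variable OMP_NUM_THREADS. The executable name my_openmp_app is hypothetical and is left commented out. The fallback value of 1 only applies when the script is run outside of a Slurm job.

```shell
#!/bin/bash
#SBATCH -p short
#SBATCH -t 00:01:00
#SBATCH -n 1
#SBATCH --exclusive

# Sketch: one task on an exclusive node, with one OpenMP thread per CPU
# that Slurm reports for the node. SLURM_CPUS_ON_NODE is set by Slurm;
# the default of 1 is only a fallback for running outside a job.
export OMP_NUM_THREADS=${SLURM_CPUS_ON_NODE:-1}

echo "running with $OMP_NUM_THREADS OpenMP threads on $(hostname)"

# ./my_openmp_app    # hypothetical executable; replace with your program
```

Because each physical core counts as two CPUs, you may prefer to set OMP_NUM_THREADS to half of SLURM_CPUS_ON_NODE so that one thread runs per physical core. Which setting performs better depends on the application.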
MATLAB Parallel Computing Toolbox (PCT) is a special case. The best plan may be to specify -n 1 --exclusive when you submit the batch job. Then, have your MATLAB script fill up the node with MATLAB parallel workers, so that one worker runs on each physical core.