Slurm
Some of the CAC's Private Clusters are managed with OpenHPC, which includes the Slurm Workload Manager (Slurm for short). Slurm (Simple Linux Utility for Resource Management) is a group of utilities used for managing workloads on compute clusters.
This page is intended to give users an overview of Slurm. Some of the information on this page has been adapted from the Cornell Virtual Workshop topics on the Stampede2 Environment and Advanced Slurm. For a more in-depth tutorial, please review these topics directly.
Overview
Some clusters use Slurm as both the batch queuing system and the scheduling mechanism. Jobs are submitted to Slurm from a login node, and Slurm handles scheduling them on compute nodes as resources become available. Users submit jobs to the batch component, which is responsible for maintaining one or more queues (also known as "partitions"). Each job includes information about itself as well as a set of resource requests, which can range from the number of CPUs or nodes to specific node requirements (e.g. only use nodes with > 2 GB RAM). A separate component, called the scheduler, is responsible for figuring out when and where these jobs can run on the cluster. The scheduler must take into account the priority of the job, any reservations that may exist, when currently running jobs are likely to end, and so on. Once the scheduling decision is made, the batch system starts your job at the appropriate time and place. Because Slurm handles both components, you don't have to think of them as separate processes; you just need to know how to submit jobs to the batch queue(s).
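As a concrete illustration, a submission from a login node might look like the sketch below. The partition name "normal", the resource amounts, and the script name testjob.sh are placeholders, not values specific to any particular cluster:

```bash
# Request 1 node, 16 tasks, and 30 minutes of walltime in a (hypothetical) "normal" partition.
# Slurm replies with a job id, e.g. "Submitted batch job 123456".
sbatch -N 1 -n 16 -t 00:30:00 -p normal testjob.sh
```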
Note: Refer to the documentation for your cluster to determine what queues/partitions are available.
Common Commands
Most of the following commands accept many options; for full details, see the Slurm Docs.
Display Info
- sinfo displays information about nodes and partitions/queues. Use -l for more detailed information.
- scontrol show nodes views the state of the nodes.
- scontrol show partition views the state of the partition/queue.
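For example, before submitting a job you might check what resources are available. This is a minimal sketch; the partition name "normal" is a placeholder for whatever partitions your cluster defines:

```bash
sinfo -l                        # detailed summary of partitions/queues and node states
scontrol show nodes             # full state of every node
scontrol show partition normal  # full state of one partition ("normal" is a placeholder name)
```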
Job Control
- sbatch testjob.sh submits a job, where testjob.sh is the script you want to run. Also see the sbatch documentation (https://slurm.schedmd.com/sbatch.html).
- srun -p <partition> --pty /bin/bash starts an interactive job. Also see the srun documentation (https://slurm.schedmd.com/srun.html).
- squeue -u my_userid shows the state of jobs for user my_userid. Also see the squeue documentation (https://slurm.schedmd.com/squeue.html).
- scontrol show job <job id> views the state of a job. Also see the scontrol documentation (https://slurm.schedmd.com/scontrol.html).
- scancel <job id> cancels a job. Also see the scancel documentation (https://slurm.schedmd.com/scancel.html).
- squeue with no arguments retrieves summary information on all jobs scheduled.
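Put together, a typical interaction might look like the following sketch, where the job id 123456 and the script name testjob.sh are placeholders:

```bash
sbatch testjob.sh          # submit the batch script; Slurm prints "Submitted batch job 123456"
squeue -u $USER            # list your own pending and running jobs
scontrol show job 123456   # detailed state of one job (placeholder job id)
scancel 123456             # cancel that job if it is no longer needed
```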
Once the job has completed, its stdout and stderr streams will be written to output files in your $HOME directory, named with the job id. To verify that the job ran successfully, examine these output files.
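For example, assuming the default Slurm naming convention slurm-<job id>.out (the exact file name and location may differ on your cluster), you can inspect the results after the job finishes:

```bash
ls ~/slurm-123456.out     # output file named with the (placeholder) job id
less ~/slurm-123456.out   # check stdout/stderr for errors or expected results
```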
Slurm Required Arguments
The following table shows the directives that must be provided with each job. Note that the maximum number of tasks/processes depends on the system you are running on, and some systems may have additional required arguments.
| Meaning | Flag | Value | Example |
|---|---|---|---|
| Job Walltime | -t | hh:mm:ss | -t 00:05:00 (5 minutes) |
| Number of tasks/processes | -n | 1 ... (nodes * max tasks/processes per node) | -n 16 (16 tasks/processes) |
| Submission Queue | -p | Queue Name | -p normal (the "normal" queue/partition) |
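These arguments can be given on the sbatch command line or embedded in the job script as #SBATCH directives. The sketch below assumes a partition named "normal" and uses ./my_program as a placeholder for your own executable; adjust the values for your cluster:

```bash
#!/bin/bash
#SBATCH -J testjob     # job name (optional, for readability in squeue)
#SBATCH -t 00:05:00    # walltime: 5 minutes
#SBATCH -n 16          # 16 tasks/processes
#SBATCH -p normal      # submission queue/partition (placeholder name)

# Launch the tasks; ./my_program stands in for your own executable.
srun ./my_program
```

Submit the script with sbatch testjob.sh, as described in the Job Control section above.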
References
- Slurm Docs
- Quick Start User Guide - lists all of the available Slurm commands
- Command/Option Summary - a two-page PDF
- Frequently Asked Questions - includes FAQs for Management, Users, and Administrators
- Convenient SLURM Commands - examples for getting information on jobs and controlling jobs
- sbatch - used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.