Revision as of 16:59, 22 February 2019
Getting Started
General Information
- pool is a private cluster with restricted access to the following groups:
- Head node: pool.cac.cornell.edu (access via ssh)
- OpenHPC deployment running CentOS 7.6
- Cluster scheduler: Slurm 17.11.10
- /home: 15TB directory server (NFS exported to all cluster nodes)
- 9 compute nodes c000[1-9], hyperthreading on.
- Current Cluster Status: Ganglia.
- Please send any questions and report problems to: cac-help@cornell.edu
How To Login
- To get started, login to the head node pool.cac.cornell.edu via ssh.
- If you are unfamiliar with Linux and ssh, we suggest reading the Linux Tutorial and looking into how to Connect to Linux before proceeding.
- You will be prompted for your CAC account password
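A typical login session can be sketched as follows; the account name `abc123` is a placeholder for your own CAC username, not a real account:

```shell
# Connect to the head node over ssh; replace 'abc123' with your
# CAC account name (placeholder for illustration):
#
#   ssh abc123@pool.cac.cornell.edu
#
# You will then be prompted for your CAC account password.
echo "head node: pool.cac.cornell.edu"
```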
Hardware
- c000[1-9] have hyperthreading turned on.

Node Names | Memory per node | Model name | Processor count per node | Core(s) per socket | Sockets | Thread(s) per core
Networking
- All nodes have a 1GB ethernet connection for eth0 on a private net served out from the pool head node.
Running Jobs
Slurm
Queues/Partitions
("Partition" is the term used by Slurm.)

- hyperthreading is turned on for ALL nodes
- all partitions have a default walltime of 1 hour
- pool has X separate queues:
Queue/Partition | Number of nodes | Node Names | Limits
normal (default) | 9 | c000[1-9] | walltime limit: 4 hours
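A minimal batch script header targeting the normal partition might look like the sketch below; the job name, walltime, and output filename are illustrative placeholders, and the walltime stays under the partition's 4-hour limit. The script can be syntax-checked locally with `bash -n` even without a cluster:

```shell
# Write a minimal batch script for the 'normal' partition and
# syntax-check it locally (the #SBATCH lines are bash comments,
# read only by sbatch on the cluster).
cat > normal_job.sh << 'EOF'
#!/bin/bash
#SBATCH -J NormalTest          # job name (placeholder)
#SBATCH -p normal              # default partition: 9 nodes, c000[1-9]
#SBATCH --time=03:00:00        # stay under the 4-hour walltime limit
#SBATCH -o normaltest-%j.out   # stdout file; %j expands to the job id
echo "running on $(hostname)"
EOF
bash -n normal_job.sh && echo "script OK"
```

On the cluster, `sbatch normal_job.sh` would submit it to the normal queue.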
Common Slurm Commands
Slurm Quickstart Guide: https://slurm.schedmd.com/quickstart.html

Command/Option Summary (two page PDF): https://slurm.schedmd.com/pdfs/summary.pdf
Slurm HELP
Slurm Workload Manager Quick Start User Guide (https://slurm.schedmd.com/quickstart.html) - this page lists all of the available Slurm commands

Slurm Workload Manager Frequently Asked Questions (https://slurm.schedmd.com/faq.html) - includes FAQs for management, users, and administrators

Convenient SLURM Commands (https://rc.fas.harvard.edu/resources/documentation/convenient-slurm-commands/) - has examples for getting information on jobs and controlling jobs

Slurm Workload Manager - sbatch (https://slurm.schedmd.com/sbatch.html) - used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
A few Slurm commands to initially get familiar with:

scontrol show nodes
scontrol show partition

Submit a job: sbatch testjob.sh
Interactive job: srun -p short --pty /bin/bash

scontrol show job [job id]
scancel [job id]
sinfo -l
Example in Short Partition/Queue
Example sbatch file to run a job in the short partition/queue; save as example.sh:
#!/bin/bash
## -J sets the name of the job
#SBATCH -J TestJob

## -p sets the partition (queue); this example targets the short partition
#SBATCH -p short

## 10 min
#SBATCH --time=00:10:00

## sets the tasks per core (default=2; keep the default if you want to take advantage of hyperthreading)
## 1 gives each task a whole core; the default of 2 places two tasks per core via hyperthreading
#SBATCH --ntasks-per-core=1

## request 4GB per core
#SBATCH --mem-per-cpu=4GB

## define the job's stdout file
#SBATCH -o testshort-%j.out

## define the job's stderr file
#SBATCH -e testshort-%j.err

echo "starting at `date` on `hostname`"

# Print the SLURM job ID.
echo "SLURM_JOBID=$SLURM_JOBID"

echo "hello world `hostname`"

echo "ended at `date` on `hostname`"
exit 0
Submit/Run your job:
sbatch example.sh
View your job:
scontrol show job 9
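Rather than reading the job ID off the screen, it can be captured directly: `sbatch` prints "Submitted batch job <id>" by default, and with the `--parsable` flag it prints just the ID. The cluster commands below are a sketch that requires a live Slurm installation; the parsing of the default message is demonstrated locally:

```shell
# On the cluster (sketch; requires Slurm):
#   jobid=$(sbatch --parsable example.sh)
#   scontrol show job "$jobid"
#   scancel "$jobid"          # cancel it if needed
#
# Parsing the default sbatch message can be demonstrated locally:
out="Submitted batch job 9"
jobid=${out##* }    # strip everything up to the last space
echo "job id: $jobid"
```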