OH10 Cluster

From CAC Documentation wiki

OH10 General Information

  • OH10 is a private cluster with restricted access to the oh10_0001 group
  • OH10 currently has one head node (oh10.cac.cornell.edu) and 4 compute nodes (c00[1-4])
  • Head node: oh10.cac.cornell.edu (access via ssh)
    • OpenHPC deployment running CentOS 7.3.1611
    • Cluster scheduler: Slurm 16.05
    • /home (1.6 TB) directory server (nfs exported to all cluster nodes)
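To reach the cluster, ssh to the head node. A minimal example; "netid" is a placeholder for your own account name:

```shell
# Log in to the OH10 head node (replace netid with your username).
ssh netid@oh10.cac.cornell.edu
```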

Installed Software

Software environments are managed with the 'module' system. Run: module avail
to list the environments available for each software version.
For example, to set up the environment for Quantum Espresso, you would type:
* module avail
* module load qe/6.1
* module list (shows the modules you currently have loaded)
When done with the environment, either log out OR type:
* module unload qe/6.1
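The same module commands can go directly in a batch script, so the environment is set up on the compute node where the job runs. A minimal sketch using the qe/6.1 module; "scf.in" and "scf.out" are hypothetical Quantum Espresso input/output file names:

```shell
#!/bin/bash
# Set up the Quantum Espresso environment on the compute node.
module load qe/6.1
# Confirm what is loaded (goes to the job's stdout).
module list
# Run a QE executable; scf.in is a hypothetical input file.
pw.x < scf.in > scf.out
# Optional: clean up the environment before the script exits.
module unload qe/6.1
```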

Package and Version      Module             Notes
firefox (52.3)                              Installed only on the head node
gromacs (5.1.4)          gromacs/5.1.4
mathematica (11.1.0)     mathematica/11.1
openmpi (1.10.3)         /opt/ohpc/pub/mpi
papi (5.4.3)             papi/5.4.3
quantum espresso (6.1)   qe/6.1

Queues/Partitions ("partition" is the term used by Slurm)

OH10 has one queue/partition: normal. 'normal' is the default partition name in Slurm.

Slurm HELP

Slurm Workload Manager Quick Start User Guide - this page lists all of the available Slurm commands

Slurm Workload Manager Frequently Asked Questions includes FAQs for Management, Users and Administrators

Convenient SLURM Commands has examples for getting information on jobs and controlling jobs

Slurm Workload Manager - sbatch - used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

A few slurm commands to initially get familiar with:
scontrol show nodes
scontrol show partition

Submit a job: sbatch testjob.sh
Interactive job: srun -p normal --pty /bin/bash

scontrol show job [job id]
scancel [job id]
sinfo -l

Running jobs on the oh10 cluster

Example sbatch file to run a job in the normal partition/queue; save as example.sh:

#!/bin/bash
## -J sets the name of the job
#SBATCH -J TestJob
## -p sets the partition (queue)
#SBATCH -p normal
## 10 min
#SBATCH --time=00:10:00
## request a single task (core)
#SBATCH -n 1
## uncomment to request 300MB per core
## #SBATCH --mem-per-cpu=300
## define the job's stdout file
#SBATCH -o testlong-%j.out
## define the job's stderr file
#SBATCH -e testlong-%j.err

echo "starting at `date` on `hostname`"

# Print the SLURM job ID.
echo "SLURM job id: $SLURM_JOB_ID"

echo "hello world `hostname`"

echo "ended at `date` on `hostname`"
exit 0

Then submit it:

sbatch example.sh

Then check the job, using the job ID printed by sbatch (9 here):

scontrol show job 9

You should see the node it ran on and that it was run in the 'default' (i.e. normal) partition/queue.
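The job's output lands in the files named by the -o and -e options, with %j expanded to the numeric job ID. Assuming the job ID was 9, as in the scontrol example above:

```shell
# %j in the -o/-e filenames expands to the job ID (9 in this example).
cat testlong-9.out   # stdout: the echo lines from the script
cat testlong-9.err   # stderr: empty if the job ran cleanly
```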