VEGA Cluster


VEGA General Information

  • VEGA is a private cluster with restricted access to the rad332_0001 group
  • VEGA has 6 nodes: vega0[1-6]
  • Head node: vega01.cac.cornell.edu (access via ssh)
    • CentOS 7.3.1611
    • Cluster scheduler: Slurm 17.02
    • /home (1.9 TB) directory server (NFS-exported to the other 5 nodes)
    • /scratch (1.9 TB) found on vega0[2-5]
    • /scratch (3.6 TB) found on vega06

Getting Started with the VEGA cluster

vega01 currently acts as the head node for scheduling and cluster access (until a dedicated head node is purchased later in the year).

Best Practices for the Vega cluster:

  • ssh to vega01.cac.cornell.edu (see the example below)
  • schedule your jobs using a batch script (see below)
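
For example, to connect to the head node from your local machine (assuming your CAC user ID is myuserid, a placeholder):

ssh myuserid@vega01.cac.cornell.edu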

Partitions (i.e. queues)

("partition" is the term used by slurm)

VEGA currently has 2 partitions:

  • normal (default)
    Number of nodes: 5
    Node Names: vega0[2-6]
    Sockets: 2, CoresPerSocket: 24, ThreadsPerCore: 2
    Maximum Memory: no limit
    /scratch per node: 1.9 TB on vega0[2-5]; 3.6 TB on vega06
    Maximum allowed time: no limit
  • short (see the example batch header after this list)
    Number of nodes: 1
    Node Names: vega01
    Sockets: 1, CoresPerSocket: 24, ThreadsPerCore: 2
    Maximum Cores: 24
    Maximum Memory: no limit
    /scratch: 1.9 TB on vega0[2-5]; 3.6 TB on vega06
    Maximum allowed time: 1 hour
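
For example, to target the short partition on vega01 (which enforces the 1-hour limit) rather than the default normal partition, the batch script header could look like the following sketch (the job name and time are placeholders):

#!/bin/bash
## -J sets the name of the job
#SBATCH -J ShortTest
## -p selects the partition; "short" runs on vega01
#SBATCH -p short
## request a single task (core)
#SBATCH -n 1
## must stay within the 1-hour limit of the short partition
#SBATCH --time=00:30:00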

Installed Software

Software environments are managed with the 'module' system. Use: module avail
to list the environment modules (and software versions) that are available.
    
EXAMPLE:
To be sure you are using the environment set up for Intel MPI, you would type:
* module avail
* module load intelmpi
* module list (shows the modules you currently have loaded)
- when done using the environment, either log out OR type:
* module unload intelmpi
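
The same module commands can be used inside a batch script so that a job runs with the intended environment. A minimal sketch, assuming an MPI program built against Intel MPI (the executable name my_mpi_prog is a placeholder; adjust the task count and launch command to match your program):

#!/bin/bash
#SBATCH -J MpiTest
#SBATCH -p normal
## request 4 tasks (cores)
#SBATCH -n 4
#SBATCH --time=00:10:00

## load the Intel compiler and MPI environments for this job
module load intel17
module load intelmpi
## record which modules are active in the job output
module list

## launch the MPI program (my_mpi_prog is a placeholder)
mpirun -np 4 ./my_mpi_prog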


Package and Version                  Module               Notes
Devtoolset (7)                       devtoolset7          contains gcc 7.1.1
Firefox                                                   Installed on vega01 ONLY
gcc (4.8.5)                          Default - no module
Intel C Compiler (17.04)             intel17
Intel Fortran Compiler (17.04)       intel17
Intel MPI (2017 U3 Build 20170405)   intelmpi
Intel Python 2 (2.7.13)              intelpython2
Intel Python 3 (3.5.3)               intelpython3
psi4conda 1.1                        psi4conda/1.1/py2.7  Built with Intel Python 2.7
psi4conda 1.1                        psi4conda/1.1/py3.5  Built with Intel Python 3.5
Quantum Espresso 6.1                 quantum-espresso6.1
  • It is usually possible to install software in your home directory (see the sketch after this list).
  • List installed software via RPM: 'rpm -qa'. Use grep to search for specific software: rpm -qa | grep sw_name [e.g. rpm -qa | grep perl ]
  • Send email to cac-help@cornell.edu to request software installation or updates. You are not limited to the packages listed above (installations may require the permission of the cluster PI).
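
A minimal sketch of one common way to install software under your home directory, for an autotools-style package (the package name, version, and install prefix below are placeholders):

# unpack and build the package
tar xzf mypackage-1.0.tar.gz
cd mypackage-1.0
./configure --prefix=$HOME/software/mypackage
make
make install

# put the installation on your PATH (add this line to ~/.bashrc to make it permanent)
export PATH=$HOME/software/mypackage/bin:$PATH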

Common Slurm Commands

Slurm Quickstart Guide

Command/Option Summary (two page PDF)

Slurm Workload Manager Quick Start User Guide - this page lists all of the available Slurm commands

Slurm Workload Manager Frequently Asked Questions includes FAQs for Management, Users and Administrators

Convenient SLURM Commands has examples for getting information on jobs and controlling jobs

Slurm Workload Manager - sbatch - used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

A few Slurm commands to get familiar with initially:

sinfo -l
scontrol show nodes
scontrol show partition

Submit a job: sbatch testjob.sh
Interactive Job: srun -p normal --pty /bin/bash

scontrol show job [job id]
scancel [job id]

squeue -u userid
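
For example, a typical submit-and-monitor sequence might look like the following sketch (testjob.sh is a placeholder script, myuserid is a placeholder user ID, and 12345 stands for the job ID that sbatch prints when the job is submitted):

# submit the batch script; sbatch reports the new job's ID
sbatch testjob.sh

# list your queued and running jobs
squeue -u myuserid

# show details for the job, then cancel it if necessary
scontrol show job 12345
scancel 12345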

Running jobs on the VEGA cluster

Use the local /scratch space during jobs rather than writing to your NFS-mounted home directory.

It is faster to perform file I/O locally on /scratch and copy complete data files to/from $HOME at the beginning and end of the job than to perform I/O over the network ($HOME is network-mounted on vega0[2-6]).


Example sbatch file to run a job in the normal partition; save as example.sh:

#!/bin/bash
## -J sets the name of the job
#SBATCH -J TestJob
## -p sets the partition (queue)
#SBATCH -p normal
## 10 min
#SBATCH --time=00:10:00
## request a single task (core)
#SBATCH -n 1
## request 300 MB per core (uncomment the next line to enable)
## #SBATCH --mem-per-cpu=300
## define the job's stdout file
#SBATCH -o testnormal-%j.out
## define the job's stderr file
#SBATCH -e testnormal-%j.err

echo "moving data to the scratch space..."
mkdir -p /scratch/[userid]
rsync -av $HOME/mydata /scratch/[userid]/ 
echo "starting at `date` on `hostname`"  > /scratch/[userid]/testfile

# Print the SLURM job ID.
echo "SLURM_JOBID=$SLURM_JOBID"

echo "hello world `hostname`"
# Use the data you need from /scratch and be sure to write your job's
# output into your /scratch/[userid] directory.

# Copy output data back to home
rsync -av /scratch/[userid]/dataoutput $HOME/tmpdir

exit 0

Then run:

sbatch example.sh

Then check on the job using the job ID that sbatch printed when it submitted the job (9 in this example):

scontrol show job 9

You should see details about the job that ran. Be sure to take a look at the *.err and *.out files for that job ID (for the example above: cat testnormal-9.*).