VEGA Cluster


VEGA General Information

  • VEGA is a private cluster with restricted access to the rad332_0001 group
  • VEGA has 6 nodes: vega0[1-6]
  • vega01.cac.cornell.edu acts as the head/scheduler node until a dedicated head node is purchased later in the year (access via ssh)
    • CentOS 7.3.1611
    • Cluster scheduler: Slurm 17.02
    • /home (1.9 TB) directory server (NFS-exported to the 5 other nodes)

Getting Started with the VEGA cluster

Partitions (i.e. queues; "partition" is the term used by Slurm)

VEGA currently has only one partition, normal:

  • normal (default)
Number of nodes: 5
Node names: vega0[2-6]
Sockets: 2, CoresPerSocket: 24, ThreadsPerCore: 2
Maximum memory: no limit
/tmp per node: 409 GB
Maximum allowed time: no limit
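
These limits can also be confirmed from a login shell on vega01 with standard Slurm query commands, for example:

sinfo -p normal                 # one-line summary of the normal partition and its nodes
scontrol show partition normal  # full partition settings (time limit, default memory, ...)
scontrol show node vega02       # per-node details (sockets, cores, memory, TmpDisk)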


Common Slurm Commands

Slurm Quickstart Guide

Command/Option Summary (two page PDF)

Software

Installed Software

The 'module' system is implemented. Use:
module avail
to list the environment modules you can load for each installed software version.

EXAMPLE:
To be sure you are using the environment set up for Intel MPI, you would type:
* module avail
* module load intelmpi
* module list (shows which modules you currently have loaded)
When you are done using the environment, either log out OR type:
* module unload intelmpi
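
For compiled MPI codes, here is a minimal sketch of a build session using the intel17 and intelmpi modules from the table below. The source file name hello.c is only a placeholder, and whether intelmpi needs intel17 loaded alongside it should be confirmed on VEGA:

module load intel17 intelmpi    # Intel 17 compilers plus the Intel MPI environment
module list                     # confirm both modules are loaded
mpiicc hello.c -o hello         # Intel MPI compiler wrapper for the Intel C compiler
module unload intel17 intelmpi  # clean up when finished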


Package and Version                  Module                Notes
Devtoolset (7)                       devtoolset7           contains gcc 7.1.1
Firefox                                                    Installed on vega01 ONLY
gcc (4.8.5)                          Default - no module
Intel C Compiler (17.04)             intel17
Intel Fortran Compiler (17.04)       intel17
Intel mpi (2017 U3 Build 20170405)   intelmpi
Intel python 2 (2.7.13)              intelpython2
Intel python 3 (3.5.3)               intelpython3
psi4conda 1.1                        psi4conda/1.1/py2.7   Built with Intel Python 2.7
psi4conda 1.1                        psi4conda/1.1/py3.5   Built with Intel Python 3.5
Quantum Espresso 6.1                 quantum-espresso6.1
  • It is usually possible to install software in your home directory; see the sketch after this list.
  • List installed system software via rpm: 'rpm -qa'. Use grep to search for specific software: rpm -qa | grep sw_name [e.g. rpm -qa | grep perl]
  • Send email to cac-help@cornell.edu to request installation or updates of software. You are not limited to what appears in the list above (installation may require the permission of the cluster PI).
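
If a package needs to be built from source, one common pattern is to install it under your home directory. The sketch below assumes an autotools-style package and uses purely illustrative names; not every package follows this layout:

tar xzf mypackage-1.0.tar.gz        # illustrative package name
cd mypackage-1.0
./configure --prefix=$HOME/local    # install under ~/local instead of a system path
make && make install
export PATH=$HOME/local/bin:$PATH   # pick up the user-local binaries in this shell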

Slurm HELP

Slurm Workload Manager Quick Start User Guide - this page lists all of the available Slurm commands

Slurm Workload Manager Frequently Asked Questions includes FAQs for Management, Users and Administrators

Convenient SLURM Commands has examples for getting information on jobs and controlling jobs

Slurm Workload Manager - sbatch - used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

A few slurm commands to initially get familiar with:
scontrol show nodes
scontrol show partition

Submit a job: sbatch testjob.sh
Interactive Job: srun -p normal --pty /bin/bash

scontrol show job [job id]
scancel [job id]
sinfo -l
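
For example, an interactive session on the normal partition with explicit resources might look like the following (the time limit and task count are only illustrative):

sinfo -l                                              # check node states in the normal partition
srun -p normal -n 1 --time=01:00:00 --pty /bin/bash   # interactive shell on a compute node
hostname                                              # should report one of vega0[2-6]
exit                                                  # release the allocation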

Running jobs on the VEGA cluster



Example sbatch file to run a job in the normal partition/queue; save as example.sh:

#!/bin/bash
## -J sets the name of the job
#SBATCH -J TestJob
## -p sets the partition (queue)
#SBATCH -p normal
## 10 min
#SBATCH --time=00:10:00
## request a single task (core)
#SBATCH -n 1
## request 300 MB per core (directive is commented out; remove one '#' to enable)
## #SBATCH --mem-per-cpu=300
## define the job's stdout file
#SBATCH -o testlong-%j.out
## define the job's stderr file
#SBATCH -e testlong-%j.err

echo "starting at `date` on `hostname`"

# Print the SLURM job ID.
echo "SLURM_JOBID=$SLURM_JOBID"

echo "hello world `hostname`"

echo "ended at `date` on `hostname`"
exit 0

Then run:

sbatch example.sh

Then check the status of the job, using the job ID that sbatch reports:

scontrol show job [job id]

You should see the node it ran on and that it was run in the normal partition/queue.
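
For parallel runs with Intel MPI, a batch script could follow the same pattern. The sketch below is only illustrative: the executable name ./my_mpi_app and the task count are hypothetical, and whether mpirun or srun is the preferred launcher on VEGA should be confirmed with CAC.

#!/bin/bash
#SBATCH -J MpiTest
#SBATCH -p normal
#SBATCH --time=00:10:00
## request 8 tasks (cores); adjust for your program
#SBATCH -n 8
#SBATCH -o mpitest-%j.out
#SBATCH -e mpitest-%j.err

## load the Intel MPI environment (module name from the table above)
module load intelmpi

## launch the (hypothetical) MPI executable across the allocated tasks
mpirun -np $SLURM_NTASKS ./my_mpi_app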