VEGA Cluster
THIS IS AN OBSOLETE WIKI PAGE KEPT FOR HISTORICAL REASONS
For current VEGA information, see the Pool Cluster page.
- VEGA is a private cluster with restricted access to the rad332_0001 group
- VEGA has 6 nodes: vega0[1-6]
- Head node: vega01.cac.cornell.edu (access via ssh)
- CentOS 7.3.1611
- Cluster scheduler: Slurm 17.02
- /home (1.9 TB) directory server (NFS-exported to the other 5 nodes)
- /scratch (1.9TB) found on vega0[2-5]
- /scratch (3.6TB) found on vega06
- Cluster Status: Ganglia. ** COMING SOON **
- Questions/Issues: cac-help@cornell.edu
Getting Started with the VEGA cluster
vega01 currently acts as the head node for scheduling and cluster access (until a dedicated head node is purchased later in the year).
Best Practices for the Vega cluster:
- ssh to vega01.cac.cornell.edu
- schedule your jobs using a batch script (see below)
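For example, from a terminal on your own machine (replace [userid] with your CAC account name; the placeholder is illustrative):

ssh [userid]@vega01.cac.cornell.edu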
Partitions (i.e. queues)
("partition" is the term used by slurm)
VEGA currently has 2 partitions:
- normal (default)
- Number of nodes: 5
- Node Names: vega0[2-6]
- Sockets:2, CoresPerSocket:24, ThreadsPerCore:2
- Maximum Memory: no limit
- /scratch per node: 1.9 TB on vega0[2-5]; 3.6 TB on vega06
- Maximum allowed time: no limit
- short
- Number of nodes: 1
- Node Names: vega01
- Sockets:1, CoresPerSocket:24, ThreadsPerCore:2
- Maximum Cores: 24
- Maximum Memory: no limit
- /scratch: 1.9 TB on vega0[2-5]; 3.6 TB on vega06
- Maximum allowed time: 1 hour
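As referenced above, here is a minimal sketch of the #SBATCH directives used to pick a partition. The job name and time limit shown are illustrative placeholders, not site requirements:

#!/bin/bash
## -p selects the partition; "normal" is the default
#SBATCH -p normal
#SBATCH -J MyJob
#SBATCH --time=00:30:00
## to use the "short" partition instead, keep --time within its 1 hour limit:
## #SBATCH -p short
## #SBATCH --time=01:00:00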
Installed Software
The 'module' system is implemented. Use module avail to list the environments you can put yourself in for each software version.
EXAMPLE: To be sure you are using the environment set up for Intel MPI, you would type:
- module avail
- module load intelmpi
- module list (shows which modules you have loaded)
When done using the environment, either log out OR type:
- module unload intelmpi
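A typical session might look like the following sketch; the module names intel17 and intelmpi are taken from the table below, and the which check is only an illustration of how to confirm the environment took effect:

module load intel17
module load intelmpi
module list                 # confirm both modules are now loaded
which mpirun                # should resolve to the Intel MPI installation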
| Package and Version | Module | Notes |
| --- | --- | --- |
| Devtoolset (7) | devtoolset7 | contains gcc 7.1.1 |
| Firefox | | Installed on vega01 ONLY |
| gcc (4.8.5) | Default - no module | |
| Intel C Compiler (17.04) | intel17 | |
| Intel Fortran Compiler (17.04) | intel17 | |
| Intel mpi (2017 U3 Build 20170405) | intelmpi | |
| Intel python 2 (2.7.13) | intelpython2 | |
| Intel python 3 (3.5.3) | intelpython3 | |
| psi4conda 1.1 | psi4conda/1.1/py2.7 | Built with Intel Python 2.7 |
| psi4conda 1.1 | psi4conda/1.1/py3.5 | Built with Intel Python 3.5 |
| Quantum Espresso 6.1 | quantum-espresso6.1 | |
- It is usually possible to install software in your home directory.
- List installed software via RPM: rpm -qa. Use grep to search for specific software: rpm -qa | grep sw_name (e.g. rpm -qa | grep perl)
- Send email to cac-help@cornell.edu to request software installation or updates. You are not limited to what appears in the list above (possibly pending the permission of the cluster PI).
Common Slurm Commands
Command/Option Summary (two page PDF)
Slurm Workload Manager Quick Start User Guide - this page lists all of the available Slurm commands
Slurm Workload Manager Frequently Asked Questions includes FAQs for Management, Users and Administrators
Convenient SLURM Commands has examples for getting information on jobs and controlling jobs
Slurm Workload Manager - sbatch - used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
A few Slurm commands to get familiar with initially:
- sinfo -l
- scontrol show nodes
- scontrol show partition
- Submit a job: sbatch testjob.sh
- Interactive job: srun -p normal --pty /bin/bash
- scontrol show job [job id]
- scancel [job id]
- squeue -u userid
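Putting a few of these together, a typical submit-and-monitor cycle might look like this sketch; the job ID 1234 and [userid] are illustrative placeholders:

sbatch testjob.sh
squeue -u [userid]          # is the job pending or running?
scontrol show job 1234      # full details for one job
scancel 1234                # cancel it if something is wrong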
Running jobs on the VEGA cluster
Use the /scratch space during jobs instead of writing to your NFS-mounted home directory.
It is faster to perform local file I/O and copy complete data files to/from $HOME at the beginning and end of the job than to perform I/O over the network ($HOME is network-mounted on vega0[2-6]); the example script below follows this pattern.
Example sbatch file to run a job in the normal partition; save as example.sh:
#!/bin/bash
## -J sets the name of the job
#SBATCH -J TestJob
## -p sets the partition (queue)
#SBATCH -p normal
## 10 min
#SBATCH --time=00:10:00
## request a single task (core)
#SBATCH -n 1
## request 300MB per core
#SBATCH --mem-per-cpu=300
## define the job's stdout file
#SBATCH -o testnormal-%j.out
## define the job's stderr file
#SBATCH -e testnormal-%j.err

echo "moving data to the scratch space..."
mkdir -p /scratch/[userid]
rsync -av $HOME/mydata /scratch/[userid]/

echo "starting at `date` on `hostname`" > /scratch/[userid]/testfile

# Print the SLURM job ID.
echo "SLURM_JOBID=$SLURM_JOBID"

echo "hello world `hostname`"

# Use the data from /scratch that you need, and be sure to write your job
# output into your /scratch/[userid] directory.

# Copy output data back to home
rsync -av /scratch/[userid]/dataoutput $HOME/tmpdir

exit 0
Then run:
sbatch example.sh
Then check the job (using the job ID that sbatch reported, here 9):
scontrol show job 9
You should see details about the job that ran. Be sure to take a look at your *.err and *.out files for that job ID (for the example above: cat testnormal-9.*).