ATLAS2 Cluster
ATLAS2 General Information
- ATLAS2 is a private cluster with access restricted to the bs54_0001 group. (It is currently being built and is not yet available to the group. ETA: 1/20/17)
- ATLAS2 currently has one head node (atlas2.cac.cornell.edu) and 6 compute nodes (c0012, c0015, c0016, c0034, c0044, and c0046); it will have the full 56 compute nodes upon completion.
- Head node: atlas2.cac.cornell.edu (access via ssh)
- OpenHPC deployment running CentOS 7.3.1611
- Cluster scheduler: Slurm 16.05
- /home (15TB) directory server (nfs exported to all cluster nodes)
- Cluster Status: Ganglia.
- Please send any questions and report problems to: cac-help@cornell.edu
Getting Started with the ATLAS cluster
Queues/Partitions ("partition" is the term used by Slurm)
ATLAS2 has 5 separate queues:
- short (default)
- Number of nodes: 28 servers (total: 56 cpu, 336 cores)
- Node Names: c00[17-44] **CURRENTLY ONLY c0034 & c0044 are up for testing of the short queue! **
- Memory per node: 48GB
- /tmp per node: 409GB
- Limits: Maximum of 28 nodes, walltime limit: 4 hours
- long
- Number of nodes: 18 servers (total: 36 cpu, 216 cores)
- Node Names: c00[17-34] **CURRENTLY ONLY c0034 is up for testing of the long queue! **
- Memory per node: 48GB
- /tmp per node: 409GB
- Limits: Maximum of 18 nodes, walltime limit: 504 hours
- inter (interactive)
- Number of nodes: 12 servers (total: 24 cpu, 144 cores)
- Node Names: c00[45-56] **CURRENTLY ONLY c0046 is up for testing of the inter queue! **
- Memory per node: 48GB
- /tmp per node: 68GB
- Limits: Maximum of 12 nodes, walltime limit: 168 hours
- bigmem
- Number of Nodes 12 servers (total: 24 cpu, 144 cores)
- Node Names: c00[01-12] **CURRENTLY ONLY c0012 is up for testing of the bigmem queue! **
- HW: Intel X5690 3.46GHz, 128GB SSD drive
- Memory per node: 96GB
- /tmp per node: 68GB
- Limits: Maximum of 12 nodes, walltime limit: 168 hours
- gpu
- Number of nodes: 4 servers (total: 8 cpu, 48 cores)
- Node Names: c00[13-16] **CURRENTLY ONLY c0015 & c0016 are up for testing of the gpu queue! **
- HW: 2x Intel X5670 2.93GHz, 500GB SATA drive, 1 Tesla M2090
- Memory per node: 48GB
- /tmp per node: 409GB
- Limits: Maximum of 4 nodes, walltime limit: 168 hours
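The node names above use Slurm's range notation (e.g. c00[17-44] means c0017 through c0044). A minimal bash sketch of that expansion, using the short-queue range from the table above, can help sanity-check node counts:

```shell
#!/bin/bash
# Expand a Slurm-style node range (e.g. c00[17-44]) into individual hostnames.
expand_nodes() {
  local start=$1 end=$2
  for i in $(seq "$start" "$end"); do
    printf "c%04d\n" "$i"
  done
}

# short queue: c00[17-44]
short_nodes=$(expand_nodes 17 44)
echo "$short_nodes" | head -2   # c0017, c0018
echo "$short_nodes" | wc -l     # 28, matching the short queue's 28-node maximum
```

On a live system, `scontrol show hostnames 'c00[17-44]'` performs the same expansion.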
Common Slurm Commands
Command/Option Summary (two page PDF)
Software
Installed Software
The Lmod module system is implemented. Type: module avail to list the environments available for installed software versions. (For a more complete listing, type: module spider)
EXAMPLE: To be sure you are using the environment set up for gdal2:
- module avail
- module load gdal2
- When done, either log out and log back in, or type: module unload gdal2
You can create your own modules and place them in your $HOME. Once created, type: module use $HOME/path/to/personal/modulefiles
This will prepend the path to $MODULEPATH. (Type: echo $MODULEPATH to confirm.)
| Package and Version | Location | Module Available | Notes |
|---|---|---|---|
| cplex studio 128 | /opt/ohpc/pub/ibm/ILOG/CPLEX_Studio128/ | cplex/12.8 | |
| cuda toolkit 8-0 | /usr/local/cuda-8.0 | | c0015 & c0016 (gpus) |
| gcc 7.2.0 | /opt/ohpc/pub/compiler/gcc/7.2.0/bin/gcc | gnu7/7.2.0 | |
| gcc 4.8.5 (default) | /usr/bin/gcc | | |
| gdal 2.2.3 | /opt/ohpc/pub/gdal2.2.3 | gdal/2.2.3 | |
| java openjdk 1.8.0 | /usr/bin/java | | |
| Python 2.7.5 (default) | /usr/bin/python | | The system-wide installation of packages is no longer supported. See below for Anaconda/miniconda install information. |
| R 3.4.3 | /usr/bin/R | | The system-wide installation of packages is no longer supported. |
| Subversion (svn) 1.7 | /usr/bin/svn | | |
- It is usually possible to install software in your home directory.
- List installed software via rpms: rpm -qa. Use grep to search for specific software: rpm -qa | grep sw_name (e.g., rpm -qa | grep perl)
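To illustrate the personal-modulefiles workflow described above, here is a minimal sketch that creates a Tcl modulefile for a hypothetical package "mytool" installed under $HOME/sw/mytool (the package name and install paths are examples only, not software on ATLAS2):

```shell
#!/bin/bash
# Create a personal modulefile tree for a hypothetical "mytool" 1.0 install.
# Paths and names here are illustrative assumptions.
mkdir -p "$HOME/modulefiles/mytool"

cat > "$HOME/modulefiles/mytool/1.0" <<'EOF'
#%Module1.0
## Personal modulefile for a hypothetical mytool 1.0 install in $HOME/sw.
prepend-path PATH            $env(HOME)/sw/mytool/bin
prepend-path LD_LIBRARY_PATH $env(HOME)/sw/mytool/lib
EOF

# Then register the personal tree and load the module:
# module use $HOME/modulefiles
# module load mytool/1.0
```

After `module use`, `module avail` will list mytool/1.0 alongside the system-provided modules.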
How to run jobs
Running Jobs on the ATLAS2 cluster
Slurm Workload Manager Quick Start User Guide - this page lists all of the available Slurm commands
Slurm Workload Manager Frequently Asked Questions includes FAQs for Management, Users and Administrators
Convenient SLURM Commands has examples for getting information on jobs and controlling jobs
Slurm Workload Manager - sbatch - used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
Example sbatch file to run a job in the short partition/queue; save as example.sh:
#!/bin/bash
#SBATCH -J TestJob
#SBATCH -p short
#SBATCH --time=00:10:00
#SBATCH -n1
#SBATCH --mem-per-cpu=30000
#SBATCH -o sampletest-%j.out
#SBATCH -e sampletest-%j.err

echo "starting at `date` on `hostname`"

# Print the SLURM job ID.
echo "SLURM_JOBID=$SLURM_JOBID"

echo "hello world example"
sleep 30

# Run the vmtest application
## echo "running vmtest 256 100000"
## $HOME/vmtest 256 100000

echo "ended at `date` on `hostname`"
exit 0
Then run:
sbatch example.sh
Then check the job's status (use the job ID printed by sbatch; 9 here is just an example):
scontrol show job 9
You should see the node it ran on and that it was run in the short partition/queue.
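The --time value in the script above uses HH:MM:SS; Slurm also accepts D-HH:MM:SS for multi-day requests, which is convenient for the longer walltime limits listed earlier. A small sketch of the conversion from an hour budget:

```shell
#!/bin/bash
# Convert an hour budget into Slurm's D-HH:MM:SS walltime syntax.
hours_to_walltime() {
  local hours=$1
  printf "%d-%02d:00:00" $((hours / 24)) $((hours % 24))
}

hours_to_walltime 504; echo   # long-queue limit  -> 21-00:00:00
hours_to_walltime 4;   echo   # short-queue limit -> 0-04:00:00
```

The resulting string can be passed directly, e.g. #SBATCH --time=21-00:00:00.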
MATLAB: Running MDCS Jobs on the ATLAS2 cluster