Difference between revisions of "ATLAS2 Cluster"

Latest revision as of 19:41, 21 January 2020

Getting Started

General Information

  • ATLAS2 is a private cluster with restricted access to the bs54_0001 group.
  • Head node: atlas2.cac.cornell.edu (access via ssh)
    • OpenHPC deployment running CentOS 7.6
    • Cluster scheduler: Slurm 17.11.10
    • /home 15TB directory server (NFS exported to all cluster nodes)
    • Intel(R) Xeon(R) E5-2637 v4 @ 3.5GHz; supports vector extensions up to AVX2
  • 55 compute nodes c00[01-16, 31-48,50-70]
  • Current Cluster Status: Ganglia.
  • Please send any questions and report problems to: cac-help@cornell.edu

How To Login

  • To get started, log in to the head node atlas2.cac.cornell.edu via ssh.
  • If you are unfamiliar with Linux and ssh, we suggest reading the Linux Tutorial and looking into how to Connect to Linux before proceeding.
  • You will be prompted for your CAC account password.

Hardware

All compute nodes have hyperthreading turned on and are Xeon generations that support vector extensions up to SSE4.2.

Node Names       | Memory per node | Model name                           | Processor count per node | Core(s) per socket | Sockets | Thread(s) per core
c00[01-12]       | 94GB            | Intel(R) Xeon(R) CPU X5690 @ 3.47GHz | 24                       | 6                  | 2       | 2
c00[13-16]       | 47GB            | Intel(R) Xeon(R) CPU X5670 @ 2.93GHz | 24                       | 6                  | 2       | 2
c00[31-48,50-58] | 47GB            | Intel(R) Xeon(R) CPU X5670 @ 2.93GHz | 24                       | 6                  | 2       | 2
c00[59-70]       | 47GB            | Intel(R) Xeon(R) CPU X5690 @ 3.47GHz | 24                       | 6                  | 2       | 2

Networking

  • All nodes have a 1Gb Ethernet connection for eth0 on a private network served out from the atlas2 head node.
  • All nodes have an Infiniband connection:
    • InfiniPath_QLE7340n (QDR speed, 8Gbits/sec)

Running Jobs with Slurm

For detailed information and a quick-start guide, see the Slurm page.

ATLAS2 Queues/Partitions

("Partition" is the term used by Slurm)

  • hyperthreading is turned on for ALL nodes
  • all partitions have a default time of 1 hour
  • ATLAS2 has 5 separate queues:
Queue/Partition    | Number of nodes | Node Names              | Limits
short (default)    | 31              | c00[13-16,31-48,50-58]  | walltime limit: 4 hours
long               | 22              | c00[13-16,31-48]        | walltime limit: 504 hours
inter ~Interactive | 12              | c00[59-70]              | walltime limit: 168 hours
bigmem             | 12 servers      | c00[01-12]              | Maximum of 12 nodes, walltime limit: 168 hours
normal             | 55 servers      | c00[01-16, 31-48,50-70] | walltime limit: 4 hours
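
The walltime limits in the table above can be checked before a job is submitted. The following is a minimal sketch, not part of the cluster tooling: the function names walltime_limit_hours and fits_partition are hypothetical, and the limits are copied by hand from the table, so they would need updating if the partitions change.

```shell
#!/bin/bash
# Walltime limits (hours) copied from the ATLAS2 partition table.
walltime_limit_hours() {
    case "$1" in
        short|normal) echo 4 ;;
        long)         echo 504 ;;
        inter|bigmem) echo 168 ;;
        *)            echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

# Succeeds (exit status 0) when the requested hours fit within the partition limit.
fits_partition() {
    local limit
    limit=$(walltime_limit_hours "$1") || return 1
    [ "$2" -le "$limit" ]
}

# Example: a 6-hour job does not fit in short but fits in long.
fits_partition short 6 || echo "6h exceeds the short limit"
fits_partition long 6  && echo "6h fits in long"
```

On the cluster itself, scontrol show partition reports the authoritative limits.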

Example in Short Partition/Queue

Example sbatch file to run a job in the short partition/queue; save as example.sh:

#!/bin/bash
## -J sets the name of the job
#SBATCH -J TestJob

## -p sets the partition (queue)
#SBATCH -p short

## 10 min
#SBATCH --time=00:10:00

## sets the tasks per core (default=2; keep default if you want to take advantage of hyperthreading)
## 2 will take whole cores, but will divide by 2 with hyperthreading
#SBATCH --ntasks-per-core=1

## request 4GB per CPU (may limit # of tasks, depending on total memory)
#SBATCH --mem-per-cpu=4GB

## define the job's stdout file
#SBATCH -o testshort-%j.out

## define the job's stderr file
#SBATCH -e testshort-%j.err

echo "starting at $(date) on $(hostname)"

# Print the Slurm job ID.
echo "SLURM_JOBID=$SLURM_JOBID"

echo "hello world $(hostname)"

echo "ended at $(date) on $(hostname)"
exit 0
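
The comment on --mem-per-cpu notes that the memory request may limit the number of tasks. As a rough worked example of that arithmetic, the sketch below divides node memory by the per-CPU request and caps the result at the logical CPU count. The function name max_tasks_per_node is hypothetical, and the figures assume the nominal 47GB/94GB and 24 logical CPUs from the hardware table; the memory Slurm actually schedules is usually somewhat less than the nominal amount, so treat the result as an upper bound only.

```shell
#!/bin/bash
# Upper bound on tasks per node: limited by memory and by logical CPUs.
# Arguments: node memory (GB), --mem-per-cpu (GB), logical CPUs per node.
max_tasks_per_node() {
    local by_mem=$(( $1 / $2 ))   # integer division rounds down
    local cpus=$3
    if [ "$by_mem" -lt "$cpus" ]; then
        echo "$by_mem"
    else
        echo "$cpus"
    fi
}

max_tasks_per_node 47 4 24   # 47GB nodes: memory allows at most 11 tasks
max_tasks_per_node 94 4 24   # 94GB nodes: memory allows at most 23 tasks
```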

Submit/Run your job:

sbatch example.sh

View your job (replace 9 with the job ID that sbatch reported):

scontrol show job 9
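
sbatch prints a confirmation line of the form "Submitted batch job 9" when it accepts a job. The sketch below shows one way to capture that ID in a script rather than typing it by hand; the helper name extract_job_id is hypothetical.

```shell
#!/bin/bash
# Pull the numeric job ID out of sbatch's confirmation line,
# e.g. "Submitted batch job 9" -> 9. Prints nothing for any other input.
extract_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ {print $4}'
}

# Usage on the cluster (requires Slurm, so shown as comments):
#   jobid=$(extract_job_id "$(sbatch example.sh)")
#   scontrol show job "$jobid"
```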

Software

The cluster is managed with OpenHPC, which uses yum to install available software from the installed repositories.

  • To view all options of yum, type: man yum
  • To view installed repositories, type: yum repolist
  • To check whether your requested software package is in one of the installed repositories, use: yum search <package>
  • e.g. to search for available variations of tau, you would type:
 yum search tau

Installed Software

Package and Version    | Location                                 | module available | Notes
cplex studio 128       | /opt/ohpc/pub/ibm/ILOG/CPLEX_Studio128/  | cplex/12.8       |
cuda toolkit 9.0       | /opt/ohpc/pub/cuda-9.0                   |                  | cudnn 9.0 in targets/x86_64-linux/lib/
cuda toolkit 9.1       | /opt/ohpc/pub/cuda-9.1                   |                  | cudnn 9.1 in targets/x86_64-linux/lib/
cuda toolkit 9.2       | /opt/ohpc/pub/cuda-9.2                   |                  | cudnn 9.2 in targets/x86_64-linux/lib/
cuda toolkit 10.0      | /opt/ohpc/pub/cuda-10.0                  |                  | cudnn 7.4.1 for cuda10 in targets/x86_64-linux/lib/
gcc 7.2.0              | /opt/ohpc/pub/compiler/gcc/7.2.0/bin/gcc | gnu7/7.2.0       |
gcc 4.8.5 (default)    | /usr/bin/gcc                             |                  |
gdal 2.2.3             | /opt/ohpc/pub/gdal2.2.3                  | gdal/2.2.3       |
java openjdk 1.8.0     | /usr/bin/java                            |                  |
Python 2.7.5 (default) | /usr/bin/python                          |                  | The system-wide installation of packages is no longer supported. See below for Anaconda/miniconda install information.
R 3.5.1                | /usr/bin/R                               |                  | The system-wide installation of packages is no longer supported.
Subversion (svn) 1.7   | /usr/bin/svn                             |                  |
  • It is usually possible to install software in your home directory.
  • List installed software via rpms: 'rpm -qa'. Use grep to search for specific software: rpm -qa | grep sw_name [e.g. rpm -qa | grep perl ]

Modules

Since this cluster is managed with OpenHPC, the Lmod Module System is implemented. You can see detailed information and instructions at the linked page.

Example: To be sure you are using the environment set up for cplex, you would type:

$ module avail
$ module load cplex

When done, either log out and log back in or type: module unload cplex

You can also create your own modules and place them in your $HOME. For instructions, see the Modules (Lmod) page.

Once created, type module use $HOME/path/to/personal/modulefiles. This will prepend the path to $MODULEPATH. Type echo $MODULEPATH to confirm.
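
As an illustration of what a personal modulefile might contain, the sketch below writes a minimal Lmod modulefile in Lua that prepends an install prefix's bin directory to PATH. The helper name write_personal_module, the module name myapp, and the directory $HOME/modulefiles are all hypothetical; whatis and prepend_path are standard Lmod modulefile functions. The Modules (Lmod) page remains the reference for the recommended layout.

```shell
#!/bin/bash
# Write a minimal Lmod modulefile (Lua) that puts <prefix>/bin on PATH.
# Arguments: modulefiles root, module name, version, install prefix.
write_personal_module() {
    local root=$1 name=$2 version=$3 prefix=$4
    mkdir -p "$root/$name"
    cat > "$root/$name/$version.lua" <<EOF
-- Personal module for $name $version (illustrative)
whatis("Name: $name")
whatis("Version: $version")
prepend_path("PATH", "$prefix/bin")
EOF
}

# Usage on the cluster (shown as comments):
#   write_personal_module "$HOME/modulefiles" myapp 1.0 "$HOME/appdir"
#   module use $HOME/modulefiles
#   module load myapp/1.0
```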

Build software from source into your home directory ($HOME)

* download and extract your source
* cd to your extracted source directory
* ./configure --prefix=$HOME/appdir
[You need to refer to your source documentation to get the full list of options you can provide 'configure' with.]
* make
* make install

The binary would then be located in ~/appdir/bin. 
* Add the following to your $HOME/.bashrc: 
      export PATH="$HOME/appdir/bin:$PATH"
* Reload the .bashrc file with: source ~/.bashrc (or log out and log back in)
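
If the build steps are repeated, the export line can end up in .bashrc several times. The sketch below appends it only when that exact line is not already present; the helper name add_path_line is hypothetical, and the line it writes is the one given above.

```shell
#!/bin/bash
# Append the PATH export to a startup file only if that exact line is absent,
# so running this twice does not duplicate the entry.
add_path_line() {
    local rc_file=$1
    # Single quotes keep $HOME and $PATH literal so they expand at login time.
    local line='export PATH="$HOME/appdir/bin:$PATH"'
    touch "$rc_file"
    grep -qxF -- "$line" "$rc_file" || printf '%s\n' "$line" >> "$rc_file"
}

# Usage (shown as a comment): add_path_line "$HOME/.bashrc"
```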

How to Install R packages in your home directory

Reference: http://cran.r-project.org/doc/manuals/R-admin.html#Managing-libraries

************************************************************************************
NOTE: Steps 1) through 4) need to be done only once. After your Rlibs directory
has been created and your R_LIBS environment is set, you can install additional 
packages using step 5).
************************************************************************************

Know your R library search path:
    Start R and run .libPaths()  Sample output is shown below:
    > .libPaths()
     [1] "/usr/lib64/R/library"

Now we will create a local Rlibs directory and add this to the library search path.
NOTE: Make sure R is NOT running before you proceed.

1) Create a directory in your home directory where you would like to install the R packages, e.g. Rlibs 
mkdir  ~/Rlibs

2) Create a .profile file in your home directory (or modify existing) using your favorite editor (emacs, vim, nano, etc)  
   
     Add the following to your .profile:
     #!/bin/sh
     if [ -n "$R_LIBS" ]; then
        export R_LIBS="$HOME/Rlibs:$R_LIBS"
     else
        export R_LIBS="$HOME/Rlibs"
     fi

3) To apply the new R_LIBS path, run "source ~/.profile" (or log out and log back in) 

4) Confirm the change is in your library path:
     start R
> .libPaths()
[1] "$HOME/Rlibs"     
[2] "/usr/lib64/R/library"   

  
5) Install the package in your local directory 
>install.packages("packagename","~/Rlibs","https://cran.r-project.org/")
e.g. to install the package snow:
>install.packages("snow","~/Rlibs","https://cran.r-project.org/")

6) For more help with install.packages() use
>?install.packages

7) To see which libraries are available in your R library path, run library() 
The output will show your local packages and the system wide packages
>library()
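
The R_LIBS logic from step 2) can also be written with a single POSIX parameter expansion, which avoids a stray trailing colon when R_LIBS is empty. This is a sketch for checking the resulting value, not a required change; the function name build_r_libs and the path /opt/shared/Rlibs are hypothetical.

```shell
#!/bin/sh
# Prepend $HOME/Rlibs to R_LIBS, adding ":" only when R_LIBS already has a value.
# ${R_LIBS:+:$R_LIBS} expands to ":$R_LIBS" if R_LIBS is set and non-empty,
# and to nothing otherwise.
build_r_libs() {
    printf '%s\n' "$HOME/Rlibs${R_LIBS:+:$R_LIBS}"
}

unset R_LIBS
build_r_libs                 # prints $HOME/Rlibs with no trailing colon
R_LIBS=/opt/shared/Rlibs     # hypothetical pre-existing library path
build_r_libs                 # prints $HOME/Rlibs:/opt/shared/Rlibs
```

In .profile the equivalent single line would be: export R_LIBS="$HOME/Rlibs${R_LIBS:+:$R_LIBS}"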

How to Install Python Anaconda (miniconda) in your home directory

Please take the tutorials to assist you with your management of conda packages: https://conda.io/docs/user-guide/tutorials/index.html

MPI

  • To use MPI with Infiniband, use openmpi3
  • The mpich default transport is TCP/IP (not Infiniband)
  • NOTE: mvapich2/2.2 will NOT work on this cluster
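
A job script for an MPI program built against openmpi3 might look like the one generated below. This is a sketch under several assumptions: the helper name write_mpi_job, the job name MpiTest, and the executable ./hello_mpi are hypothetical, and the module names gnu7 and openmpi3 are taken from this page, so confirm them with module avail before use.

```shell
#!/bin/bash
# Write an example sbatch script for an MPI program built against openmpi3.
# Arguments: output path, MPI executable (hypothetical), number of tasks.
write_mpi_job() {
    local out=$1 exe=$2 ntasks=$3
    cat > "$out" <<EOF
#!/bin/bash
#SBATCH -J MpiTest
#SBATCH -p short
#SBATCH --time=00:10:00
#SBATCH --ntasks=$ntasks
#SBATCH -o mpitest-%j.out

## load the compiler and MPI modules (confirm the names with: module avail)
module load gnu7 openmpi3

mpirun $exe
EOF
}

# Usage on the head node (shown as comments):
#   write_mpi_job mpi_job.sh ./hello_mpi 24
#   sbatch mpi_job.sh
```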