WALLE2 Cluster

General Information

  • Walle2 is a private cluster. Access is restricted to the ek436_0001 group.
  • Head node: walle2.cac.cornell.edu (access via ssh)
  • Compute nodes:
    • 5 non-GPU nodes (c0001-c0005): 40 cores/80 threads, 768 GB of RAM
    • 2 GPU nodes (g0001: 2 x NVidia V100, g0002: 2 x NVidia T4): 24 cores/48 threads, 384 GB of RAM
  • Data on the Walle2 cluster storage is NOT backed up.
  • Please send questions and problem reports to: cac-help@cornell.edu

Scheduler/Queues

There are two Slurm partitions: normal and gpu. If no partition is specified in the sbatch or srun command, jobs will run in the normal partition.

There are two types of GPUs: NVidia V100 (2, on node g0001) and NVidia T4 (2, on node g0002). To access a GPU, you MUST specify BOTH the gpu partition (-p gpu) AND the GPU type and count (--gres=gpu:<type>:<number of GPUs>).

Examples of submitting an interactive job to the gpu partition via srun:

To request 1 NVidia V100 GPU (node g0001 has 2 V100s):

  srun -p gpu --gres=gpu:v100:1 --pty bash

To request 2 NVidia T4 GPUs (node g0002 has 2 T4s):

  srun -p gpu --gres=gpu:t4:2 --pty bash
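
For non-interactive work, the same options go in an sbatch batch script. Below is a minimal sketch; the job name, CPU count, and time limit are placeholder values, not site defaults:

#!/bin/bash
#SBATCH --job-name=gpu-test        # placeholder job name
#SBATCH --partition=gpu            # GPU jobs must use the gpu partition
#SBATCH --gres=gpu:v100:1          # GPU type (v100 or t4) and count
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # placeholder CPU count
#SBATCH --time=01:00:00            # placeholder time limit (the partitions have no limit)

nvidia-smi                         # show the GPU(s) allocated to this job

Submit the script with sbatch <script name>.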

See the Requesting GPUs section for information on how to request GPUs on compute nodes for your jobs.

Slurm Partitions
Slurm Partition   Description                   Time Limit
normal*           Non-GPU nodes                 None
gpu               NVidia V100 or T4 GPU nodes   None
(* = default partition)
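
Because normal is the default partition, CPU-only jobs do not need a -p option; the two commands below are equivalent (myjob.sh is a placeholder script name):

  sbatch myjob.sh
  sbatch -p normal myjob.sh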

Software

Work with Environment Modules

Set up the working environment for each software package using the module command. The module command automatically loads any dependent modules.

To show the currently loaded modules (these are loaded by default by the system configuration):

-bash-4.2$ module list

Currently Loaded Modules:
  1) autotools   3) gnu9/9.3.0   5) libfabric/1.10.1   7) ohpc
  2) prun/2.0    4) ucx/1.8.0    6) openmpi4/4.0.4

To show all available modules:

-bash-4.2$ module avail

------------------- /opt/ohpc/pub/moduledeps/gnu9-openmpi4 --------------------
   adios/1.13.1            netcdf-fortran/4.5.2        py3-mpi4py/3.0.3
   boost/1.75.0            netcdf/4.7.3                py3-scipy/1.5.1
   fftw/3.3.8       (L)    opencoarrays/2.9.2          scalapack/2.1.0
   hypre/2.18.1            petsc/3.14.4                slepc/3.14.2
   jdftx/1.6.0      (L)    petsc/3.15.0         (D)    slepc/3.15.0       (D)
   mfem/4.2                phdf5/1.10.6                superlu_dist/6.1.1
   mumps/5.2.1             pnetcdf/1.12.1              trilinos/13.0.0
   netcdf-cxx/4.3.1        ptscotch/6.0.6

------------------------ /opt/ohpc/pub/moduledeps/gnu9 ------------------------
   gsl/2.6       (L)    mpich/3.3.2-ofi        py3-numpy/1.19.0
   hdf5/1.10.6   (L)    mvapich2/2.3.4         superlu/5.2.1
   impi/2021.2.0        openblas/0.3.7  (L)
   metis/5.1.0          openmpi4/4.0.5  (L)

-------------------------- /opt/ohpc/pub/modulefiles --------------------------
   autotools          (L)    julia/1.6.1             os
   cmake/3.19.4              libfabric/1.11.2 (L)    prun/2.1  (L)
   cuda/11.3                 mkl/2021.2.0.610 (L)    py3-libs
   gnu9/9.3.0         (L)    octave/6.3.0     (L)    ucx/1.9.0 (L)
   intel/2021.2.0.610        ohpc             (L)

  Where:
   D:  Default Module
   L:  Module is loaded

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".

To load a module and verify that it is loaded:

-bash-4.2$ module load cmake
-bash-4.2$ module list

Currently Loaded Modules:
  1) autotools   3) gnu9/9.3.0   5) libfabric/1.10.1   7) ohpc
  2) prun/2.0    4) ucx/1.8.0    6) openmpi4/4.0.4     8) cmake/3.16.2
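
To remove a module from your environment, use module unload; module purge removes all currently loaded modules, including the defaults shown above (module load ohpc should restore the default toolchain):

-bash-4.2$ module unload cmake
-bash-4.2$ module purge
-bash-4.2$ module load ohpc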

Manage Modules in Your Python Virtual Environment

Python 3.6.8 is installed. Users can manage their own Python environment (including installing needed modules) using virtual environments. See the documentation on virtual environments at python.org for details.

Create Virtual Environment

You can create as many virtual environments as needed, each in its own directory.

python3 -m venv <your virtual environment directory>

Activate Virtual Environment

You need to activate a virtual environment before using it:

source <your virtual environment directory>/bin/activate
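
When you are finished working in the environment, return to the system Python with:

deactivate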

Install Python Modules Using pip

After activating your virtual environment, you can install Python modules into it:

  • It's always a good idea to update pip first:
pip install --upgrade pip
  • Install the module:
pip install <module name>
  • List installed python modules in the environment:
pip list
  • Example: Install tensorflow and keras like this:
-bash-4.2$ python3 -m venv tensorflow
-bash-4.2$ source tensorflow/bin/activate
(tensorflow) -bash-4.2$ pip install --upgrade pip
Collecting pip
  Using cached https://files.pythonhosted.org/packages/30/db/9e38760b32e3e7f40cce46dd5fb107b8c73840df38f0046d8e6514e675a1/pip-19.2.3-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 18.1
    Uninstalling pip-18.1:
      Successfully uninstalled pip-18.1
Successfully installed pip-19.2.3
(tensorflow) -bash-4.2$ pip install tensorflow keras
Collecting tensorflow
  Using cached https://files.pythonhosted.org/packages/de/f0/96fb2e0412ae9692dbf400e5b04432885f677ad6241c088ccc5fe7724d69/tensorflow-1.14.0-cp36-cp36m-manylinux1_x86_64.whl
:
:
:
Successfully installed absl-py-0.8.0 astor-0.8.0 gast-0.2.2 google-pasta-0.1.7 grpcio-1.23.0 h5py-2.9.0 keras-2.2.5 keras-applications-1.0.8 keras-preprocessing-1.1.0 markdown-3.1.1 numpy-1.17.1 protobuf-3.9.1 pyyaml-5.1.2 scipy-1.3.1 six-1.12.0 tensorboard-1.14.0 tensorflow-1.14.0 tensorflow-estimator-1.14.0 termcolor-1.1.0 werkzeug-0.15.5 wheel-0.33.6 wrapt-1.11.2
(tensorflow) -bash-4.2$ pip list modules
Package              Version
-------------------- -------
absl-py              0.8.0  
astor                0.8.0  
gast                 0.2.2  
google-pasta         0.1.7  
grpcio               1.23.0 
h5py                 2.9.0  
Keras                2.2.5  
Keras-Applications   1.0.8  
Keras-Preprocessing  1.1.0  
Markdown             3.1.1  
numpy                1.17.1 
pip                  19.2.3 
protobuf             3.9.1  
PyYAML               5.1.2  
scipy                1.3.1  
setuptools           40.6.2 
six                  1.12.0 
tensorboard          1.14.0 
tensorflow           1.14.0 
tensorflow-estimator 1.14.0 
termcolor            1.1.0  
Werkzeug             0.15.5 
wheel                0.33.6 
wrapt                1.11.2 

Software List

*GNU Compilers 9.4.0
  • Path:  /opt/ohpc/pub/compiler/gcc/9.4.0
  • Notes: module load gnu9/9.4.0

Intel Compilers (2021 Update 2)
  • Path:  /opt/intel/oneapi/compiler/2021.2.0
  • Notes: module load intel/2021.2.0.610

MKL 2021.2.0.610 (2021 Update 2)
  • Path:  /opt/intel/oneapi/mkl/2021.2.0
  • Notes: module load mkl/2021.2.0.610

*openmpi 4.0.5
  • Path:  /opt/ohpc/pub/mpi/openmpi4-gnu9 and /opt/ohpc/pub/mpi/openmpi4-intel
  • Notes: module load openmpi4

Intel MPI 2021.2.0
  • Path:  /opt/intel/oneapi/mpi/2021.2.0
  • Notes: module load impi/2021.2.0

Julia 1.6.1
  • Path:  /opt/ohpc/pub/compiler/julia/1.6.1
  • Notes: module load julia/1.6.1

JDFTx 1.6.0
  • Path:  /opt/ohpc/pub/apps/jdftx/1.6.0
  • Notes: module load jdftx/1.6.0

CUDA 11.5
  • Path:  /usr/local/cuda-11.5
  • Notes: module load cuda/11.5

petsc / petsc4py 3.15.0
  • Path:  /opt/ohpc/pub/libs/gnu9/openmpi4/petsc/3.15.0 and /opt/ohpc/pub/libs/intel/impi/petsc/3.15.0
  • Notes: module load petsc/3.15.0

slepc / slepc4py 3.15.0
  • Path:  /opt/ohpc/pub/libs/gnu9/openmpi4/slepc/3.15.0 and /opt/ohpc/pub/libs/intel/impi/slepc/3.15.0
  • Notes: module load slepc/3.15.0

Python3 Modules (pytorch with CUDA support, numpy, scipy, matplotlib, pandas, tensorflow, keras, sklearn, umap)
  • Path:  /opt/ohpc/pub/utils/py3-libs/
  • Notes: module load py3-libs
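
For example, to use the prebuilt Python 3 libraries on a GPU node, load py3-libs and check that PyTorch can see a GPU. This is a minimal sketch; run it inside a job on the gpu partition (e.g. via the srun command shown in Scheduler/Queues):

-bash-4.2$ module load py3-libs
-bash-4.2$ python3 -c "import torch; print(torch.cuda.is_available())"

If a GPU has been allocated to the job, the second command should print True.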