TARDIS Cluster

TARDIS General Information

  • Tardis is a private cluster; access is restricted to Craig Fennie's cjf76_0001 group.
  • Tardis has one head node and 85 compute nodes (1084 cores in total - compute node details can be found below under the 'Queues' section)
  • Head node: tardis.cac.cornell.edu (access via ssh only) serves as:
    • the cluster scheduler
    • Rocks Cluster server for compute node deployment
  • Compute nodes: (see below for node names and further information by the queue)
  • Rocks 6.1 with CentOS 6.3
  • Cluster Status: Ganglia (http://tardis.cac.cornell.edu/ganglia/)
  • Submit help requests at https://www.cac.cornell.edu/help or by email to help@cac.cornell.edu

Getting Started

Log in via ssh to tardis.cac.cornell.edu:

   ssh USERID@tardis.cac.cornell.edu

General information about running jobs and logging in can be found in the Getting started documentation.

Reminder: that documentation describes the v4 cluster, not Tardis; the commands and batch-job syntax are similar, but use qsub, not nsub.

showq displays running jobs.

Once you have created a job batch file, submit it with 'qsub jobname' (not nsub, as written in the v4 documentation).

Use checkjob and canceljob to check on or cancel your job.
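
For illustration, here is a minimal sketch of a Torque batch script and the submit/monitor cycle; the job name, queue, resource requests, and program are placeholders to adapt to your own work:

    #!/bin/sh
    #PBS -N testjob               # job name
    #PBS -q c1                    # queue to run in (see 'Queues' below)
    #PBS -l nodes=1:ppn=8         # one node, all 8 of its cores
    #PBS -l walltime=01:00:00     # one-hour limit

    cd $PBS_O_WORKDIR             # start where the job was submitted
    ./my_program                  # placeholder executable

Then, from the head node:

    qsub testjob.sh               # submit; prints the job ID
    showq                         # watch the queue
    checkjob <jobid>              # detailed status for one job
    canceljob <jobid>             # cancel it if necessary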

Scheduler

Maui 3.3.1/Torque 3.0.5; no project ID required.


Queues

Running qstat -q shows that Tardis has six separate queues; the nodes in each queue are described below, and an example of targeting a specific queue follows the list.

  • c1: compute-1-[1-16].tardis
  • c2: compute-2-[1-16].tardis
  • c3: compute-3-[1-16].tardis
  • c4: compute-4-[1-16].tardis
  • c5: compute-5-[1-16].tardis
  • c6: compute-6-[1-5].tardis
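
To direct a job to a particular queue, give the queue name at submission time or in the script itself; c3 here is only an example:

    qsub -q c3 testjob.sh

or, equivalently, inside the batch file:

    #PBS -q c3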


c1: (original queue, slowest processors in the cluster).

 Number of nodes: 16 servers (8 cores per server, 128 total cores)
 Node names: compute-1-[1-16].tardis
 HW: Intel Xeon E5430 2.66GHz
 Memory: 16GB ram/server
 /tmp: 77GB

c2:

 Number of nodes: 16 servers (8 cores per server, 128 total cores)
 Node names: compute-2-[1-16].tardis
 HW: Intel Xeon E5530 2.40GHz
 Memory: 16GB ram/server
 /tmp: 113GB

c3: (fastest processors)

 Number of nodes: 16 servers (20 cores per server, 320 total cores)
 Node names: compute-3-[1-16].tardis
 HW: Intel Xeon E5-2670v2 2.50GHz
 Memory: 48GB ram/server
 /tmp: 242GB

c4:

 Number of nodes: 16 servers (12 cores per server, 192 total cores)
 Node names: compute-4-[1-16].tardis
 HW: Intel Xeon E5-2640 2.50GHz
 Memory: 32GB ram/server
 /tmp: 242GB

c5:

 Number of nodes: 16 servers (16 cores per server, 256 total cores)
 Node names: compute-5-[1-16].tardis
 HW: Intel Xeon E5-2680 2.70GHz
 Memory: 32GB ram/server
 /tmp: 242GB

c6: (for collaborators)

 Number of nodes: 5 servers (12 cores per server, 60 total cores)
 Node names: compute-6-[1-5].tardis
 HW: Intel Xeon E5-2643v2 3.50GHz
 Memory: 32GB ram/server
 /tmp: 242GB

Bear in mind that the login node has some of the newest hardware, from the Intel Xeon E5 family. Therefore, programs compiled on the login node may not run on the older hardware in queues c1-c2, unless you supply the correct compiler options for those architectures. Options affecting the Intel SSE level can be especially sensitive.
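
For example, to produce a binary that still runs on the oldest c1 nodes (Xeon E5430, which supports up to SSE4.1), you can cap the instruction set at build time; the exact spelling of these flags depends on your compiler version, so treat them as a sketch:

    # Intel compilers: generate code no newer than SSE4.1
    icc -O2 -xSSE4.1 -o my_program my_program.c

    # GNU compilers: target the Core2-era architecture of the c1 nodes
    gcc -O2 -march=core2 -msse4.1 -o my_program my_program.c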

Module System

The "module load" command helps you set up the software environment (e.g., $PATH) correctly for certain applications such as ABINIT and Phonopy.

More importantly, the module system helps you set up the proper environment for your preferred compiler, especially when you want to build or run MPI-parallel software. Two choices need to be made: (a) the compiler family and (b) the MPI implementation. On Tardis, four combinations are possible. Taking Fortran as an example, (a) can be either gfortran or ifort, and (b) can be either OpenMPI or Intel MPI. Program names like mpif90 and mpiifort are merely wrappers that pair a particular compiler with a particular MPI implementation. Here is how you achieve the four combinations on Tardis:

Compiler and MPI        Module to load               Compile command      Run command
gcc/gfortran/OpenMPI    rocks-openmpi                mpicc or mpif90      mpiexec --mca btl openib,self,sm
icc/ifort/OpenMPI       openmpi-1.6.2-intel-x86_64   mpicc or mpif90      mpiexec
gcc/gfortran/Intel MPI  intel-mpi-4.0.3-x86_64       mpicc or mpif90      mpiexec
icc/ifort/Intel MPI     intel-mpi-4.0.3-x86_64       mpiicc or mpiifort   mpiexec

In all cases, mpirun and mpiexec are equivalent. The rocks-openmpi module is loaded by default; however, mpiexec (or mpirun) needs some extra options to tell it to use the InfiniBand interconnect. If you are unsure what you're getting in cases 1-3, try (a) "mpif90 --version" and (b) "which mpiexec": (a) tells you the compiler, and (b) tells you the MPI implementation you are using, as indicated by the path.
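
For instance, after loading the Intel-compiled OpenMPI module, the two checks look like this:

    module load openmpi-1.6.2-intel-x86_64
    mpif90 --version    # reports the underlying compiler (ifort in this case)
    which mpiexec       # the path shows which MPI installation is active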

When you submit a batch job, don't assume the environment on the batch nodes automatically matches the one on the login node, even if you submitted the job with "qsub -V". You will again want to use module commands in batch to set up a run environment that matches the one in place when the application was built. Be aware, however, that in batch the shell's module command is not automatically recognized. The module system must first be initialized with a special command:

    . /usr/share/Modules/init/sh
    module load my-favorite-MPI-and-compiler

Don't omit the dot! You can run the above commands in two ways:

  • Insert them into your batch script, somewhere near the top (as in the example after this list).
  • Put them in your $HOME/.profile. This is handy if you always use the same set of modules. It is the only way that works with rocks-openmpi.
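
Putting this together, the top of a batch script for a job built with Intel MPI might look like the sketch below; the queue, node counts, walltime, and program name are placeholders, and depending on the MPI you may also need to pass a process count or machine file to mpiexec:

    #!/bin/sh
    #PBS -q c4
    #PBS -l nodes=2:ppn=12,walltime=04:00:00

    # Initialize the module system, then load the toolchain used at build time
    . /usr/share/Modules/init/sh
    module load intel-mpi-4.0.3-x86_64

    cd $PBS_O_WORKDIR
    mpiexec ./my_mpi_program    # placeholder executable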

Other tips:

  • Use "module list" to see what modules are currently loaded.
  • Use "module avail" to see what modules are available.
  • If you want to switch compilers, use "module switch", or at least do "module unload" first, to reset the current compiler-specific environment (see the example after this list).
  • You can use "module purge" if you want to be absolutely sure that you're starting with a clean slate.
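
For example, to move from the default OpenMPI setup to the Intel MPI stack:

    module list                          # see what is loaded now
    module unload rocks-openmpi          # clear its compiler/MPI-specific settings
    module load intel-mpi-4.0.3-x86_64   # load the stack you want
    module list                          # confirm the change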

Cluster Tips

Cluster Status: Ganglia (http://tardis.cac.cornell.edu/ganglia/)

Submit help requests at https://www.cac.cornell.edu/help or by email to help@cac.cornell.edu.