LIPID Cluster


LIPID Cluster General Information

LIPID is a private cluster; access is restricted to the gwf3_0001 group.

  • Head node: lipid.cac.cornell.edu
  • Rocks 6.1.1 with CentOS 6.5 (upgraded 9/15/14)
  • 17 compute nodes, each with two quad-core Intel Xeon E5620 CPUs @ 2.40GHz (hyperthreaded: 16 virtual cores per node), 12GB memory, and 100GB /tmp.
  • Cluster Status: Ganglia.
  • Submit help requests via the help page or by sending email to cac-help@cornell.edu

Maui Scheduler and Job submission/monitoring commands

  • Scheduler: maui 3.2.5; Resource manager: torque 2.4.7

Jobs are scheduled by the Maui scheduler with the Torque resource manager. We suggest submitting jobs with a batch file that uses PBS Directives ('Options' section).

Common Maui Commands

If you have any experience with PBS/Torque or SGE, the Maui commands will look familiar. The most commonly used are:

qsub - Job submission (the jobid will be displayed for the submitted job)

  • $ qsub jobscript.sh

showq - Display queue information.

  • $ showq (dump everything)
  • $ showq -r (show running jobs)
  • $ showq -u foo42 (shows foo42's jobs)

checkjob - Display job information. (You can only checkjob your own jobs.)

  • $ checkjob -A jobid (get dense key-value pair information on the job)
  • $ checkjob -v jobid (get verbose information on the job)

canceljob - Cancel Job. (You can only cancel your own jobs.)

  • $ canceljob jobid
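
Putting these together, a typical session might look like the following sketch. The job id (12345), the script name, and the user name are placeholders; qsub prints the real id (the exact form may differ) when the job is submitted.

$ qsub jobscript.sh
12345.lipid.cac.cornell.edu
$ showq -u foo42
$ checkjob -v 12345
$ canceljob 12345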

Setting Up your Job Submission Batch File

Options can be passed to qsub directly on the command line; however, we suggest running your jobs from a batch script. PBS Directives are qsub command-line arguments placed at the top of the batch script, one per line, each prefixed with '#PBS' (no space between '#' and 'PBS'); see the 'Options' section for the full list.
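
As a minimal sketch of such a batch file (the job name, program name, and resource values are placeholders to adjust for your own work):

#!/bin/sh
#PBS -l nodes=1,walltime=1:00:00
#PBS -N myjob
#PBS -j oe
#PBS -q all
#PBS -S /bin/bash

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"
./myprogram

Submit it with qsub (e.g. "qsub myjob.sh"); note that on LIPID the nodes value counts cores, as explained in the Quick Tutorial below.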

Queues (an example of selecting one follows this list):

  • all
    • contains compute-1-[1-17]
    • wallclock limit: 72 hours (3 days)
  • long
    • contains compute-1-[1-7]
    • wallclock limit: 336 hours (14 days)
  • priority
    • contains compute-1-[1-17]
    • wallclock limit: 72 hours (3 days)
    • This queue is on an "honor" system; please use it only when your job is important enough that it should not wait behind jobs already submitted to 'long' or 'all'. It does not stop or preempt running jobs; it only raises the priority of a job that is waiting to run.
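
For example, to send a job to the 'long' queue for a run of up to 14 days, pass the queue on the command line (the script name is a placeholder):

$ qsub -q long -l walltime=336:00:00 jobscript.sh

or set the equivalent directives inside the batch file:

#PBS -q long
#PBS -l walltime=336:00:00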

Software Installed:

  • charmm v39b1 (/opt/charmm; /usr/local/bin/charmm)
  • gromacs v4.5.7
  • gromacs-custom (/opt/gromacs)
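
As a rough sketch of invoking this software from a batch script (the input and output file names are placeholders, and the command forms follow the usual CHARMM and GROMACS 4.5 conventions rather than anything specific to LIPID):

# CHARMM v39b1, installed as /usr/local/bin/charmm; myrun.inp is a placeholder input script
charmm < myrun.inp > myrun.out

# GROMACS 4.5.7: preprocess the run input, then run it
grompp -f md.mdp -c conf.gro -p topol.top -o topol.tpr
mdrun -deffnm topol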

Quick Tutorial

The batch system on LIPID treats each core of a node as a "virtual processor." That means the nodes keyword in batch scripts refers to the number of cores scheduled, not the number of physical machines.
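
For example (the values are only illustrative):

# requests one virtual core
#PBS -l nodes=1
# requests 16 virtual cores, i.e. one full hyperthreaded node
#PBS -l nodes=16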

Running an MPI Job on the Whole Cluster

First use showq to see how many cores are available. It may be fewer than 272 (17 nodes × 16 virtual cores) if a node is down.

#!/bin/sh
#PBS -l nodes=272,walltime=10:00
#PBS -N test
#PBS -j oe
#PBS -q all
#PBS -S /bin/bash

set -x
cd "$PBS_O_WORKDIR"
nhosts=$(wc -l < "$PBS_NODEFILE")

mpiexec -np $nhosts ring -v

Running an MPI Job using 8 Tasks Per Node

Because each node has 8 physical cores, you may want to limit jobs to 8 MPI tasks per node. The example below requests four full nodes (nodes=64, i.e. 64 virtual cores); the node file then lists each node 16 times, so make a copy with each node listed only 8 times and hand that copy to MPI.

#!/bin/sh
#PBS -l nodes=64,walltime=10:00
#PBS -N test
#PBS -j oe
#PBS -q all
#PBS -S /bin/bash

set -x
cd "$PBS_O_WORKDIR"

# Construct a copy of the hostfile with only 8 entries per node.
# MPI can use this to run 8 tasks on each node.
uniq "$PBS_NODEFILE" | awk '{for (i = 0; i < 8; i++) print}' > nodefile.8way

# Run 8 tasks on each of the 4 allocated nodes
mpiexec --hostfile nodefile.8way ring -v

Running Many Copies of a Serial Job

To run 30 separate instances of the same program, use the scheduler's task array feature via the "-t" option. The "nodes" parameter here refers to a single core.

#!/bin/sh
#PBS -l nodes=1,walltime=10:00
#PBS -t 1-30
#PBS -N test
#PBS -j oe
#PBS -q all
#PBS -S /bin/bash

set -x
cd "$PBS_O_WORKDIR"
echo Run my job.

When you start jobs this way, the separate tasks will pile onto the nodes one per core, like a box of hamsters.
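
If each instance needs its own input, the task index can be used to pick it out. The sketch below assumes per-task input files named input_1.dat through input_30.dat and a program called myprogram, both placeholders; PBS_ARRAYID is the environment variable Torque sets for jobs submitted with -t.

#!/bin/sh
#PBS -l nodes=1,walltime=10:00
#PBS -t 1-30
#PBS -N test
#PBS -j oe
#PBS -q all
#PBS -S /bin/bash

cd "$PBS_O_WORKDIR"
# PBS_ARRAYID holds this task's index (1..30); use it to select per-task files
./myprogram "input_${PBS_ARRAYID}.dat" > "output_${PBS_ARRAYID}.log"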