AIDA Cluster

AIDA General Information

  • AIDA is a private cluster; access is restricted to members of the bs54_0001 and rab38_0001 groups.
  • Head node: aida.cac.cornell.edu (access via ssh; see the login sketch after this list)
  • 12 GPU nodes
    • 6 with V100 GPUs (c0017-c0022)
    • 6 with A100 GPUs (c0071-c0076)
  • Many nodes repurposed from the former Atlas2 cluster
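A minimal login sketch, assuming you already have access through one of the groups above; replace <cac_userid> with your CAC user name:

    # Connect to the AIDA head node from your own machine
    ssh <cac_userid>@aida.cac.cornell.edu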

Hardware

  • All GPU nodes support vector extensions up to AVX-512
  • All nodes have hyperthreading turned on.

c00[17-22]:

    CPUs: 2x 18-core Intel Xeon Skylake 6154, base clock 3.0GHz (turbo up to 3.7GHz)

c0017: 5x Nvidia Tesla V100 GPUs (16GB each)

    Memory: 754GB
    swap: 187GB
    /tmp: 700GB

c00[18-21]: 5x Nvidia Tesla V100 GPUs (16GB each)

    Memory: 376GB
    swap: 187GB
    /tmp: 700GB

c0022: 2x Nvidia Tesla V100 GPUs (16GB each)

    Memory: 1.5TB
    swap: 187GB
    /tmp: 100GB
    /scratch: 1TB

c00[71-76]:

    CPUs: 2x 28-core Intel Xeon Ice Lake Gold 6348, base clock 2.6GHz
    GPUs: 4x Nvidia Tesla A100 (80GB each)
    Memory: 1TB
    swap: 187GB
    /tmp: 3TB
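To confirm what a node actually provides once a job is running on it, standard tools can be used; a quick sketch (output varies by node):

    # List the GPUs visible on the current node
    nvidia-smi -L

    # Show the socket/core/thread layout (hyperthreading doubles the logical CPU count)
    lscpu | grep -E 'Socket|Core|Thread'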

Networking

File Systems

Scheduler/Queues

  • The cluster scheduler is Slurm. See the Slurm documentation page for details, and the Requesting GPUs section for information on how to request GPUs on compute nodes for your jobs.
  • To request GPUs, use one of the following (a batch-script sketch follows the partition table below):
    1. --gres=gpu:2g.20gb:<number of MIG devices> or --gres=gpu:1g.10gb:1 to request MIG devices. The job will land on one of c0002, c0003, or c0004.
    2. --gres=gpu:a100:<number of GPUs> to request entire A100 GPUs. The job will land on node c0001.
  • Remember, hyperthreading is enabled on the cluster, so Slurm considers each physical core to consist of two logical CPUs.
  • Partitions (queues):

    Name     Description                                       Time Limit
    normal   all nodes, each node with 4 Nvidia A100 GPUs      no limit
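
A minimal batch-script sketch, assuming a job that needs one full A100 and a handful of CPU cores; the job name, core count, and time limit are placeholders, not requirements:

    #!/bin/bash
    #SBATCH --job-name=gpu-test        # example name
    #SBATCH --partition=normal         # the cluster's only partition
    #SBATCH --gres=gpu:a100:1          # one full A100; use e.g. --gres=gpu:2g.20gb:1 for a MIG slice instead
    #SBATCH --cpus-per-task=8          # logical CPUs; with hyperthreading, 2 per physical core
    #SBATCH --time=01:00:00            # optional; the normal partition has no time limit

    # Report which GPU(s) Slurm assigned to this job
    nvidia-smi -L

Submit the script with sbatch. For interactive work the same options can be passed to srun, for example: srun --partition=normal --gres=gpu:1g.10gb:1 --pty bash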

Help

  • Submit questions or requests at https://www.cac.cornell.edu/help or by sending email to help@cac.cornell.edu. Please include AIDA in the subject line.