AIDA Cluster

AIDA General Information

  • AIDA is a private cluster; access is restricted to members of the bs54_0001 and rab38_0001 groups.
  • Head node: aida.cac.cornell.edu (access via ssh; see the example command after this list)
  • 12 GPU nodes
    • 6 with V100 GPUs (c0017-c0022)
    • 6 with A100 GPUs (c0071-c0076)
  • many nodes from the former Atlas2 cluster
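
As a quick illustration of logging in, the following command connects to the head node over ssh. The username shown is a placeholder for your own CAC account name, not a documented format:

    ssh <your CAC username>@aida.cac.cornell.edu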

Hardware

  • All GPU nodes support vector extensions up to AVX-512.
  • All nodes have hyperthreading turned on (a quick check is shown below).
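
To confirm these capabilities from a shell on a node, a minimal check with the standard lscpu tool looks like the sketch below; the specific avx512 flag names are whatever the CPU reports, not something defined by this page.

    # AVX-512 support: look for avx512* entries in the CPU flags
    lscpu | grep -o 'avx512[a-z0-9_]*' | sort -u

    # Hyperthreading: "Thread(s) per core: 2" means two logical CPUs per physical core
    lscpu | grep 'Thread(s) per core'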

c00[17-22]:

    2x 18-core Intel Xeon Skylake 6154 CPUs, base clock 3.0GHz (turbo up to 3.7GHz)

c0017: 5x Nvidia Tesla V100 GPUs (16GB each)

    Memory: 754GB
    swap: 187GB
    /tmp: 700GB

c00[18-21]: 5x Nvidia Tesla V100 GPUs (16GB each)

     Memory: 376GB
     swap: 187GB
     /tmp: 700GB

c0022: 2x Nvidia Tesla V100 GPUs (16GB each)

     Memory: 1.5TB 
     swap: 187GB
     /tmp: 100GB
     /scratch: 1TB

c00[71-76]:

    2x 28-core Intel Xeon Ice Lake Gold 6348 CPUs, base clock 2.6GHz
    4x Nvidia Tesla A100 GPUs (80GB each)
    Memory: 1TB
    swap: 187GB
    /tmp: 3TB

Networking

File Systems

Scheduler/Queues

  • The cluster scheduler is Slurm.
  • See Slurm documentation page for details.
  • See the Requesting GPUs section for information on how to request GPUs on compute nodes for your jobs (example batch scripts are sketched after the partition table below):
    1. --gres=gpu:2g.20gb:<number of MIG devices> or --gres=gpu:1g.10gb:1 to request MIG devices. The job will land on one of the A100 nodes with MIG configured.
    2. --gres=gpu:a100:<number of GPUs> to request entire A100 GPUs. The job will land on an A100 node with no MIG.
  • Remember, hyperthreading is enabled on the cluster, so Slurm considers each physical core to consist of two logical CPUs.
  • Partitions (queues):
    Name     Description                                      Time Limit
    normal   all nodes, each node with 4 Nvidia A100 GPUs     no limit
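
For the MIG case, here is a minimal sketch of a batch script. The --gres line and partition name come from this page; the job name, time limit, CPU count, and the nvidia-smi check are illustrative assumptions rather than site requirements. Since hyperthreading is on, --cpus-per-task is counted in logical CPUs (two per physical core).

    #!/bin/bash
    #SBATCH --job-name=mig-example       # illustrative name
    #SBATCH --partition=normal
    #SBATCH --gres=gpu:2g.20gb:1         # one 2g.20gb MIG device; use gpu:1g.10gb:1 for a smaller slice
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4            # 4 logical CPUs = 2 physical cores with hyperthreading
    #SBATCH --time=01:00:00              # optional here, since the normal partition has no time limit

    nvidia-smi -L                        # list the MIG device(s) assigned to the job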
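
Requesting whole A100 GPUs follows the same pattern with the a100 gres type; again, everything except the --gres syntax and the partition name is an illustrative assumption. Submit either script with sbatch <script name>.

    #!/bin/bash
    #SBATCH --job-name=a100-example      # illustrative name
    #SBATCH --partition=normal
    #SBATCH --gres=gpu:a100:2            # two entire A100 GPUs; the job lands on a non-MIG A100 node
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8            # 8 logical CPUs = 4 physical cores with hyperthreading
    #SBATCH --time=04:00:00

    nvidia-smi -L                        # should list two full A100 GPUs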

Help

  • Submit questions or requests via the CAC Help page or by sending email to help@cac.cornell.edu. Please include AIDA in the subject line.