High Performance Computing Cluster (HPC) ctcomp3

Description

The cluster's compute part consists of:

  • 9 servers for general-purpose computing.
  • 1 “fat node” for jobs that require a lot of memory.
  • 6 servers for GPU computing.

Users only have direct access to the login node, which has more limited resources and must not be used for computing.
All nodes are interconnected by a 10Gb network.
A distributed storage system with 220 TB of capacity is accessible from all nodes and is connected via a dual 25Gb fiber network.

Name | Model | Processor | Memory | GPU
hpc-login2 | Dell R440 | 1 x Intel Xeon Silver 4208 @ 2.10 GHz (8c) | 16 GB | -
hpc-node[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 @ 2.20 GHz (18c) | 192 GB | -
hpc-node[3-9] | Dell R740 | 2 x Intel Xeon Gold 5220R @ 2.20 GHz (24c) | 192 GB | -
hpc-fat1 | Dell R840 | 4 x Intel Xeon Gold 6248 @ 2.50 GHz (20c) | 1 TB | -
hpc-gpu[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 @ 2.20 GHz (18c) | 192 GB | 2x Nvidia Tesla V100S 32GB
hpc-gpu3 | Dell R7525 | 2 x AMD EPYC 7543 @ 2.80 GHz (32c) | 256 GB | 2x Nvidia Ampere A100 40GB
hpc-gpu4 | Dell R7525 | 2 x AMD EPYC 7543 @ 2.80 GHz (32c) | 256 GB | 1x Nvidia Ampere A100 80GB
hpc-gpu5 | Dell R7725 | 2 x AMD EPYC 9255 @ 3.25 GHz (24c) | 384 GB | 2x Nvidia L4 24GB
hpc-gpu6 | Dell R7725 | 2 x AMD EPYC 9255 @ 3.25 GHz (24c) | 384 GB | 2x Nvidia L4 24GB

Connection to the system

To access the cluster, you must request access beforehand through the incident form. Users without access permission will receive an “incorrect password” message.

Access is via an SSH connection to the login node (172.16.242.211):

ssh <username>@hpc-login2.inv.usc.es

Storage, directories and file systems

No backups are made of any of the cluster's file systems!!

Users' HOME on the cluster is on the shared file system, so it is accessible from all nodes of the cluster. Its path is defined in the environment variable $HOME.
Each node has a local 1 TB partition for scratch, which is erased when each job ends. It can be accessed via the environment variable $LOCAL_SCRATCH in scripts.
For data to be shared by groups of users, request the creation of a folder in the shared storage that will only be accessible by the members of the group.

Directory | Variable | Mount point | Capacity
Home | $HOME | /mnt/beegfs/home/<username> | 220 TB*
Local scratch | $LOCAL_SCRATCH | varies | 1 TB
Group folder | $GRUPOS/<name> | /mnt/beegfs/groups/<name> | 220 TB*

* the storage is shared
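
Because the local scratch area is erased when the job ends, results have to be copied back to shared storage from inside the job script. The following is a minimal sketch of that pattern, not a site-provided template: the job name, the results directory (results_demo) and the stand-in computation are hypothetical, and the fallback to a temporary directory is an assumption added only so the script can be tried outside a job.

```shell
#!/bin/bash
#SBATCH --job-name=scratch_demo      # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --time=00:10:00

# $LOCAL_SCRATCH is defined inside a job; the mktemp fallback is an
# assumption so the script can also be tried outside a job.
SCRATCH="${LOCAL_SCRATCH:-$(mktemp -d)}"
WORKDIR="$SCRATCH/work"
RESULTS="${RESULTS_DIR:-$PWD/results_demo}"   # hypothetical destination on shared storage

mkdir -p "$WORKDIR" "$RESULTS"

# Stand-in for the real computation: write output to the fast local disk.
echo "result computed on ${HOSTNAME:-unknown}" > "$WORKDIR/output.txt"

# Copy the results back before the job ends and the scratch area is erased.
cp "$WORKDIR/output.txt" "$RESULTS/"
```

Input data can be staged in the same way, by copying it from $HOME or a group folder into $LOCAL_SCRATCH at the start of the script.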

IMPORTANT NOTICE

The shared file system performs poorly when working with many small files. To improve performance in such scenarios, create a file system inside an image file and mount it to work directly on it. The procedure is as follows:

  • Create the image file in your home:
## truncate image.name -s SIZE (in bytes, or with a suffix such as G)
truncate ejemplo.ext4 -s 20G
  • Create a file system in the image file:
## mkfs.ext4 -T small -m 0 image.name
## -T small options optimized for small files
## -m 0 Do not reserve space for root
mkfs.ext4 -T small -m 0 ejemplo.ext4
  • Mount the image with the script mount_image.py (requires sudo):
## By default it is mounted in /mnt/imagenes/<username>/ in read-only mode.
sudo mount_image.py ejemplo.ext4
  • To unmount the image, use the script umount_image.py (requires sudo):
sudo umount_image.py

An image can be mounted read-write from only one node at a time, but it can be mounted read-only from any number of nodes.

The mount script accepts these options:

--mount-point <path>   (optional) Mounts the image in a subdirectory: /mnt/imagenes/<username>/<path>
--rw                   (optional) Mounts the image read-write; by default it is mounted read-only.

The unmount script accepts a single optional parameter, the same path that was used with --mount-point when mounting:

--mount-point <path>   (optional)
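
An image created with truncate is normally a sparse file: its apparent size is the requested one, but disk blocks are only allocated as data is written. This depends on the underlying file system, so treat it as an assumption to verify rather than a guarantee. The sketch below uses a hypothetical 20 MiB image to compare both sizes.

```shell
img="demo_image.ext4"                 # hypothetical file name
truncate -s 20M "$img"                # 20 MiB for the demo; the procedure above uses 20G

apparent=$(stat -c %s "$img")         # apparent (logical) size in bytes
allocated=$(du -k "$img" | cut -f1)   # space actually allocated, in KiB

echo "apparent size: $apparent bytes"
echo "allocated:     $allocated KiB"

rm -f "$img"                          # clean up the demo file
```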

File and data transfer

SCP

From your local machine to the cluster:

scp filename <username>@hpc-login2:/<path>

From the cluster to your local machine:

scp filename <username>@<hostname>:/<path>

SCP manual page

SFTP

To transfer multiple files or to browse the file system.

<hostname>:~$ sftp <user_name>@hpc-login2
sftp>
sftp> ls
sftp> cd <path>
sftp> put <file>
sftp> get <file>
sftp> quit

SFTP manual page

RSYNC

SSHFS

Requires the sshfs package to be installed.
It allows, for example, mounting the home directory of a user's workstation on hpc-login2:

## Mount
sshfs  <username>@ctdeskxxx.inv.usc.es:/home/<username> <mount_point>
## Unmount
fusermount -u <mount_point>

SSHFS manual page

Available software

All nodes have the basic software that is installed by default with AlmaLinux 8.4, in particular:

  • GCC 8.5.0
  • Python 3.6.8
  • Perl 5.26.3

On GPU nodes, additionally:

  • NVIDIA driver 560.35.03
  • CUDA 11.6
  • libcudnn 8.7

To use any other software not installed on the system or a different version there are three options:

  1. Use Modules with the modules that are already installed (or request the installation of a new module if it is not available)
  2. Use a container (uDocker or Apptainer/Singularity)
  3. Use Conda

A module is the simplest solution to use software without modifications or hard-to-satisfy dependencies.
A container is ideal when dependencies are complicated and/or the software is highly customized. It is also the best solution if what you seek is reproducibility, ease of distribution and teamwork.
Conda is the best solution if you need the latest version of a library or program or packages not available otherwise.

Using modules/Lmod

Lmod documentation

# See available modules:
module avail
# Load a module:
module load <module_name>
# Unload a module:
module unload <module_name>
# See modules loaded in your environment:
module list
# You can use ml as a shorthand for the module command:
ml avail
# To get information about a module:
ml spider <module_name>

Running software containers

uDocker

uDocker manual
uDocker is installed as a module, so it needs to be loaded into the environment:

ml udocker

Apptainer/Singularity

Apptainer documentation
Apptainer is installed on each node's system, so nothing needs to be done to use it.

CONDA

Conda documentation
Miniconda is the minimal version of Anaconda and only includes the conda environment manager, Python and a few necessary packages. From there each user only downloads and installs the packages they need.

# Get miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install it 
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize miniconda for the bash shell
~/miniconda3/bin/conda init bash

Using SLURM

The cluster's queue manager is SLURM.

The term CPU identifies a physical core of a socket. Hyperthreading is disabled, so each node has as many CPUs available as (number of sockets) * (number of physical cores per socket).
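
As a quick check of that formula against the hardware table, here is a minimal sketch; the function name cpus_per_node is purely illustrative.

```shell
# CPUs per node = sockets * physical cores per socket
cpus_per_node() {
  echo $(( $1 * $2 ))
}

cpus_per_node 4 20   # hpc-fat1:      4 sockets x 20 cores -> 80
cpus_per_node 2 18   # hpc-node[1-2]: 2 sockets x 18 cores -> 36
cpus_per_node 2 32   # hpc-gpu3:      2 sockets x 32 cores -> 64
```

These values match the CPUS column that sinfo reports for the same nodes.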

Available resources

hpc-login2 ~]$ ver_estado.sh
=============================================================================================================
  NODE     STATE                         CORES IN USE                           MEM USAGE     GPUS(Use/Total)
=============================================================================================================
 hpc-fat1    up   0%[--------------------------------------------------]( 0/80) RAM:  0%     ---
 hpc-gpu1    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
 hpc-gpu2    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
 hpc-gpu3    up   0%[--------------------------------------------------]( 0/64) RAM:  0%   A100_40 (0/2)
 hpc-gpu4    up   1%[|-------------------------------------------------]( 1/64) RAM: 35%   A100_80 (1/1)
 hpc-gpu5    up   0%[--------------------------------------------------]( 0/48) RAM:  0%   L4 (0/2)
 hpc-gpu6    up   0%[--------------------------------------------------]( 0/48) RAM:  0%   L4 (0/2)
 hpc-node1   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 hpc-node2   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 hpc-node3   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node4   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node5   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node6   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node7   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node8   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 hpc-node9   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
=============================================================================================================
TOTALS: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]
 
hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
# There is an alias for this command:
hpc-login2 ~]$ ver_recursos
NODELIST                        CPUS                  MEMORY                AVAIL_FEATURES        GRES                           
hpc-fat1                        80                    1027273               cpu_intel             (null)                         
hpc-gpu[1-2]                    36                    187911                cpu_intel             gpu:V100S:2                    
hpc-gpu3                        64                    253282                cpu_amd               gpu:A100_40:2                  
hpc-gpu4                        64                    253282                cpu_amd               gpu:A100_80:1(S:0)             
hpc-gpu[5-6]                    48                    375484                cpu_amd               gpu:L4:2(S:1)
hpc-node[1-2]                   36                    187645                cpu_intel             (null)                         
hpc-node[3-9]                   48                    187645                cpu_intel             (null)
 
# To see the current usage of resources: (CPUS (Allocated/Idle/Other/Total))
hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed
# There is an alias for this command:
hpc-login2 ~]$ ver_uso
NODELIST            CPUS(A/I/O/T)       MEMORY              FREE_MEM            GRES                GRES_USED
hpc-fat1            80/0/0/80           1027273             900850              (null)              gpu:0,mps:0
hpc-gpu1            16/20/0/36          187911              181851              gpu:V100S:2(S:0-1)  gpu:V100S:2(IDX:0-1)
hpc-gpu2            4/32/0/36           187911              183657              gpu:V100S:2(S:0-1)  gpu:V100S:1(IDX:0),m
hpc-gpu3            2/62/0/64           253282              226026              gpu:A100_40:2       gpu:A100_40:2(IDX:0-
hpc-gpu4            1/63/0/64           253282              244994              gpu:A100_80:1(S:0)  gpu:A100_80:1(IDX:0)
hpc-gpu5            8/40/0/48           375484              380850              gpu:L4:2(S:1)       gpu:L4:1(IDX:1),mps:
hpc-gpu6            0/48/0/48           375484              380969              gpu:L4:2(S:1)       gpu:L4:0(IDX:N/A),mp
hpc-node1           36/0/0/36           187645              121401              (null)              gpu:0,mps:0
hpc-node2           36/0/0/36           187645              130012              (null)              gpu:0,mps:0
hpc-node3           36/12/0/48          187645              126739              (null)              gpu:0,mps:0
hpc-node4           36/12/0/48          187645              126959              (null)              gpu:0,mps:0
hpc-node5           36/12/0/48          187645              128572              (null)              gpu:0,mps:0
hpc-node6           36/12/0/48          187645              127699              (null)              gpu:0,mps:0
hpc-node7           36/12/0/48          187645              127002              (null)              gpu:0,mps:0
hpc-node8           36/12/0/48          187645              128182              (null)              gpu:0,mps:0
hpc-node9           36/12/0/48          187645              127312              (null)              gpu:0,mps:0

Nodes

A node is the SLURM compute unit and corresponds to a physical server.

# Show information about a node:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18 
   CPUAlloc=0 CPUTot=36 CPULoad=0.00
   AvailableFeatures=cpu_intel
   ActiveFeatures=cpu_intel
   Gres=(null)
   NodeAddr=hpc-node1 NodeHostName=hpc-node1 Version=21.08.6
   OS=Linux 4.18.0-305.el8.x86_64 #1 SMP Wed May 19 18:55:28 EDT 2021 
   RealMemory=187645 AllocMem=0 FreeMem=166801 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defaultPartition 
   BootTime=2022-03-01T13:13:56 SlurmdStartTime=2022-03-01T15:36:48
   LastBusyTime=2022-03-07T14:34:12
   CfgTRES=cpu=36,mem=187645M,billing=36
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Partitions

Partitions in SLURM are logical groups of nodes. The cluster has a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.

# Show partition information:
hpc-login2 ~]$ sinfo
defaultPartition*    up   infinite     11   idle hpc-fat1,hpc-gpu[1-6],hpc-node[1-9]

Jobs

Jobs in SLURM are allocations of resources to a user for a specified time. Jobs are identified by a sequential number or JOBID.
A job (JOB) consists of one or more steps (STEPS), each composed of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that runs sequentially in a JOB and there is one TASK for each program that runs in parallel. Therefore in the simplest case, for example launching a job that runs the hostname command, the JOB has a single STEP and a single TASK.

Queue system (QOS)

The queue to which each job is submitted defines priority, limits and also the user's relative “cost”.

# Show queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the most relevant information:
hpc-login2 ~]$ ver_colas
      Name   Priority           Flags UsageFactor                        MaxTRES     MaxWall                      MaxTRESPU MaxJobsPU MaxSubmitPU 
---------- ---------- --------------- ----------- ------------------------------ ----------- ------------------------------ --------- ----------- 
   regular        100     DenyOnLimit    1.000000      cpu=200,gres/gpu=1,node=4  4-04:00:00      cpu=200,gres/gpu=4,node=4        10          50 
interacti+        200     DenyOnLimit    1.000000              gres/gpu=1,node=1    04:00:00              gres/gpu=1,node=1         1           1 
    urgent        300     DenyOnLimit    2.000000              gres/gpu=1,node=1    04:00:00              cpu=36,gres/gpu=2         5          15 
      long        100     DenyOnLimit    1.000000              gres/gpu=1,node=4  8-04:00:00                     gres/gpu=2         1           5 
     large        100     DenyOnLimit    1.000000             cpu=200,gres/gpu=2  4-04:00:00                     gres/gpu=2         2          10 
     small        100     DenyOnLimit    1.000000        cpu=6,gres/gpu=0,node=2  6-00:00:00                        cpu=400       400         800 
     short        150     DenyOnLimit    1.000000        cpu=6,gres/gpu=0,node=2    04:00:00                                       40         100 

# Priority: is the relative priority of each queue.
# DenyOnLimit: the job will not run if it does not meet the queue limits.
# UsageFactor: the relative cost to the user of running a job in that queue.
# MaxTRES: limits per job.
# MaxWall: maximum time a job can be running.
# MaxTRESPU: global limits per user.
# MaxJobsPU: Maximum number of jobs a user can have running.
# MaxSubmitPU: Maximum number of jobs a user can have in total queued and running.
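
MaxWall values use the days-hours:minutes:seconds notation. When deciding which queue fits a job, it can help to convert them to seconds and compare them with the expected runtime. The sketch below assumes only the two forms that appear in the table (D-HH:MM:SS and HH:MM:SS) with two-digit fields; slurm_time_to_seconds is a hypothetical helper, not a SLURM command.

```shell
slurm_time_to_seconds() {
  local t="$1" days=0 h m s rest
  case "$t" in
    *-*) days="${t%%-*}"; t="${t#*-}" ;;   # split off the optional day count
  esac
  h="${t%%:*}"; rest="${t#*:}"
  m="${rest%%:*}"; s="${rest#*:}"
  # Drop one leading zero so values like 08 are not read as octal.
  h="${h#0}"; m="${m#0}"; s="${s#0}"
  echo $(( days * 86400 + h * 3600 + m * 60 + s ))
}

slurm_time_to_seconds 4-04:00:00   # regular -> 360000
slurm_time_to_seconds 04:00:00     # short   -> 14400
```

A job expected to run for 5 days (432000 seconds) would therefore not fit in regular, but would fit in long.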

Submitting a job to the queue system

Resource specification

By default, a job submitted without any resource specification goes to the default QOS (regular) and is assigned one node, one CPU and 4 GB of RAM. Its time limit is that of the queue (4 days and 4 hours). This is very inefficient; whenever possible, specify at least these three parameters when submitting a job:

  1. The number of nodes (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).
  2. Memory (--mem) per node or memory per cpu (--mem-per-cpu).
  3. Estimated job runtime ( --time )

Additionally it may be useful to add the following parameters:

-J, --job-name      Job name. Default: name of the executable.
-q, --qos           Name of the queue to which the job is submitted. Default: regular.
-o, --output        File or filename pattern to which all standard output and error is redirected.
    --gres          Type and/or number of GPUs requested for the job.
-C, --constraint    Restrict the job to nodes with Intel or AMD processors (cpu_intel or cpu_amd).
    --exclusive     Request that the job does not share nodes with other jobs.
-w, --nodelist      List of nodes on which to run the job.

How resources are allocated

By default the method of allocation between nodes is block allocation (all available cores on a node are allocated before using another). The default allocation method within each node is cyclic allocation (the required cores are distributed equally among the sockets available on the node).
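
To make the cyclic policy concrete, the toy sketch below prints how successive core requests on one node would alternate between its sockets. It only illustrates the idea and is not how SLURM computes the binding; cyclic_layout and its arguments are hypothetical.

```shell
# Round-robin assignment of core requests to sockets (illustration only).
cyclic_layout() {
  local cores="$1" sockets="$2" i=0
  while [ "$i" -lt "$cores" ]; do
    echo "core request $i -> socket $(( i % sockets ))"
    i=$(( i + 1 ))
  done
}

cyclic_layout 4 2
```

With 4 cores on a 2-socket node, each socket receives 2 cores. In srun and sbatch the layout can also be selected explicitly with -m/--distribution (for example --distribution=block:cyclic); check the SLURM documentation for the values supported by the installed version.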

Priority calculation

When a job is submitted to the queue system, the system first checks whether the requested resources fit within the limits of the corresponding queue. If any limit is exceeded, the submission is rejected.
If resources are available the job runs immediately, but if not it is queued. Each job has an assigned priority that determines the order in which jobs in the queue run when resources become available. To determine each job's priority three factors are weighted: the time it has been waiting in the queue (25%), the fixed priority of the queue (25%) and the user's fairshare (50%).
Fairshare is a dynamic calculation that SLURM performs for each user and is the difference between resources allocated and resources consumed over the last 14 days.

hpc-login2 ~]$ sshare -l 
      User  RawShares  NormShares    RawUsage   NormUsage   FairShare 
---------- ---------- ----------- ----------- -----------  ---------- 
                         1.000000     2872400                0.500000 
                    1    0.500000     2872400    1.000000    0.250000 
user_name         100    0.071429        4833    0.001726    0.246436

# RawShares: is the amount of resources in absolute terms assigned to the user. It is equal for all users.
# NormShares: The previous amount normalized to the total assigned resources.
# RawUsage: The number of CPU-seconds consumed by all the user's jobs.
# NormUsage: The previous amount normalized to the total CPU-seconds consumed in the cluster.
# FairShare: The FairShare factor, between 0 and 1. The more a user has used the cluster, the closer to 0 it is and the lower the priority of that user's jobs.
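
The weighting described above (25% queue wait time, 25% queue priority, 50% fairshare) can be illustrated with a small worked example. This is a simplified sketch that assumes each factor is already normalized to the range 0..1; the real value is computed by SLURM's priority plugin with the weights configured on the cluster, and job_priority is a hypothetical helper.

```shell
# Weighted sum of the three factors, each assumed to be in the range 0..1.
job_priority() {
  awk -v age="$1" -v qos="$2" -v fs="$3" \
    'BEGIN { printf "%.3f\n", 0.25 * age + 0.25 * qos + 0.50 * fs }'
}

job_priority 0.2 0.5 0.25   # heavy recent usage (low fairshare)  -> 0.300
job_priority 0.2 0.5 0.90   # light recent usage (high fairshare) -> 0.625
```

With the same wait time and queue, the user with the higher fairshare factor gets roughly twice the priority in this example.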

Job submission methods

  1. sbatch
  2. salloc
  3. srun

1. SBATCH
Used to submit a script to the queue system. It is batch processing and non-blocking.

# Create the script:
hpc-login2 ~]$ vim trabajo_ejemplo.sh
    #!/bin/bash
    #SBATCH --job-name=prueba            # Job name
    #SBATCH --nodes=1                    # -N Run all processes on a single node   
    #SBATCH --ntasks=1                   # -n Run a single task   
    #SBATCH --cpus-per-task=1            # -c Run 1 processor per task       
    #SBATCH --mem=1gb                    # Job memory request
    #SBATCH --time=00:05:00              # Time limit hrs:min:sec
    #SBATCH --qos=urgent                 # Queue
    #SBATCH --output=prueba_%j.log       # Standard output and error log
 
    echo "Hello World!"
 
hpc-login2 ~]$ sbatch trabajo_ejemplo.sh 

2. SALLOC
Used to obtain an immediate allocation of resources (nodes). As soon as the allocation is granted, the specified command is executed, or a shell is started if no command is given.

# Get 5 nodes and run a program.
hpc-login2 ~]$ salloc -N5 myprogram
# Get interactive access to a node (Press Ctrl+D to end the session):
hpc-login2 ~]$ salloc -N1 
# Get interactive exclusive access to a node
hpc-login2 ~]$ salloc -N1 --exclusive

3. SRUN
Used to launch a parallel job (preferred over using mpirun). It is interactive and blocking.

# Run hostname on 2 nodes
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2

Using GPU nodes

To request GPUs for a job, add one of these options to sbatch or srun:

--gres          Request GPUs per NODE: --gres=gpu[[:type]:count],...
--gpus or -G    Request GPUs per JOB:  --gpus=[type]:count,...

There are also the options --gpus-per-socket, --gpus-per-node and --gpus-per-task.

Examples:

## See the list of nodes and gpus:
hpc-login2 ~]$ ver_recursos
## Request 2 GPUs (any) for a JOB, add:
--gpus=2
## Request an A100 40G on one node and an A100 80G on another, add:
--gres=gpu:A100_40:1,gpu:A100_80:1 
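
Putting these options together, a complete batch script that asks for one A100 40GB GPU might look like the sketch below. The job name and resource values are illustrative, and the check assumes that SLURM exports CUDA_VISIBLE_DEVICES for jobs with allocated GPUs, which is its usual behavior but depends on the site configuration.

```shell
#!/bin/bash
#SBATCH --job-name=gpu_check         # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=00:10:00
#SBATCH --gres=gpu:A100_40:1         # one A100 40GB on the allocated node

# Print which GPUs the job can see; "none" means no GPU was allocated.
report_gpus() {
  echo "GPUs visible to this job: ${CUDA_VISIBLE_DEVICES:-none}"
}

report_gpus
```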

Job monitoring

## List all jobs in the queue
hpc-login2 ~]$ squeue
## List a user's jobs            
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <JOBID>
## List recent jobs
hpc-login2 ~]$ sacct -b
## Detailed historical information about a job:
hpc-login2 ~]$ sacct -l -j <JOBID>
## Debug information for a job for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
## See resource usage of a running job:
hpc-login2 ~]$ sstat <JOBID>

Controlling job output

Exit codes

By default these are the exit codes for the commands:

SLURM command | Exit code
salloc | 0 on success, 1 if the user's command could not be executed
srun | The highest exit code of all tasks executed, or 253 for an out-of-memory error
sbatch | 0 on success, otherwise the exit code of the process that failed

STDIN, STDOUT and STDERR

SRUN:
By default stdout and stderr of all TASKS are redirected to srun's stdout and stderr, and stdin is redirected from srun's stdin to all TASKS. This can be changed with:

-i, --input=<option>
-o, --output=<option>
-e, --error=<option>

And the options are:

  • all: default option.
  • none: Nothing is redirected.
  • taskid: Only redirected from and/or to the specified TASK id.
  • filename: Everything is redirected from and/or to the specified file.
  • filename pattern: Same as filename but with a file defined by a pattern

SBATCH:
By default, “/dev/null” is opened on the script's stdin, and stdout and stderr are redirected to a file named “slurm-%j.out”. This can be changed with:

-i, --input=<filename_pattern>
-o, --output=<filename_pattern>
-e, --error=<filename_pattern>

The reference for filename_pattern is in the sbatch documentation.

Sending emails

JOBS can be configured to send emails under certain circumstances using these two parameters (BOTH ARE REQUIRED):

--mail-type=<type> Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.
--mail-user=<user> The destination email address.

Job states in the queue system

hpc-login2 ~]$ squeue -l
JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1
 
## See queue usage status on the cluster:
hpc-login2 ~]$ estado_colas.sh
JOBS PER USER:
--------------
       usuario.uno:  3
       usuario.dos:  1
 
JOBS PER QOS:
--------------
             regular:  3
                long:  1
 
JOBS PER STATE:
--------------
             RUNNING:  3
             PENDING:  1
==========================================
Total JOBS in cluster:  4

Most common job STATES:

  • R RUNNING Job currently has an allocation.
  • CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
  • F FAILED Job terminated with non-zero exit code or other failure condition.
  • PD PENDING Job is awaiting resource allocation.

The full list of possible job states is in the SLURM documentation.

If a job is not running, the reason appears in the NODELIST(REASON) column of squeue. The SLURM documentation lists the reasons for which a job may be waiting to run.