Wiki do CiTIUS


Video of the presentation of the service (7/3/22) (Spanish only)

Description

The computing part of the cluster is made up of:

9 servers for general computing.
1 “fat node” for memory-intensive jobs.
4 servers for GPU computing.

Users only have direct access to the login node, which has more limited features and should not be used for computing.
All nodes are interconnected by a 10Gb network.
There is distributed storage accessible from all nodes with 220 TB of capacity connected by a dual 25Gb fibre network.

Name	Model	Processor	Memory	GPU
hpc-login2	Dell R440	1 x Intel Xeon Silver 4208 CPU @ 2.10GHz (8c)	16 GB	-
hpc-node[1-2]	Dell R740	2 x Intel Xeon Gold 5220 @ 2.20GHz (18c)	192 GB	-
hpc-node[3-9]	Dell R740	2 x Intel Xeon Gold 5220R @ 2.20GHz (24c)	192 GB	-
hpc-fat1	Dell R840	4 x Xeon Gold 6248 @ 2.50GHz (20c)	1 TB	-
~~hpc-gpu1~~*	Dell R740	2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)	192 GB	2x Nvidia Tesla V100S
hpc-gpu2	Dell R740	2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)	192 GB	2x Nvidia Tesla V100S
hpc-gpu3	Dell R7525	2 x AMD EPYC 7543 @ 2.80GHz (32c)	256 GB	2x Nvidia Ampere A100 40GB
hpc-gpu4	Dell R7525	2 x AMD EPYC 7543 @ 2.80GHz (32c)	256 GB	1x Nvidia Ampere A100 80GB

* Currently ctgpgpu8. It will be integrated into the cluster soon.

Accessing the system

Access to the cluster must be requested in advance through the incident form. Users without access permission will receive an “incorrect password” message.

Access is through an SSH connection to the login node:

ssh <username>@hpc-login2.inv.usc.es

Storage, directories and filesystems

None of the file systems in the cluster are backed up!!!

The users' HOME in the cluster is on the shared file system, so it is accessible from all nodes in the cluster. Its path is defined in the environment variable $HOME.
Each node has a local 1 TB scratch partition, which is deleted at the end of each job. It can be accessed in scripts through the $LOCAL_SCRATCH environment variable.
For data to be shared by a group of users, you must request the creation of a folder in the shared storage that will only be accessible to members of the group.

Directory	Variable	Mount point	Capacity
Home	$HOME	/mnt/beegfs/home/<username>	220 TB*
Local scratch	$LOCAL_SCRATCH	varies	1 TB
Group folder	$GRUPOS/<name>	/mnt/beegfs/groups/<name>	220 TB*

* storage is shared
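
Because the local scratch partition is deleted at the end of each job, results have to be copied back to shared storage before the job finishes. The following is a minimal sketch of that pattern, not a prescribed workflow. The file names and the `workdir` directory are hypothetical, and the fallback to a temporary directory is only there so the sketch can be tried outside a job, where $LOCAL_SCRATCH may be unset.

```shell
#!/bin/bash
# Sketch: stage data through the node-local scratch and copy results back.
# Assumption: outside a SLURM job $LOCAL_SCRATCH may be unset, so fall back to a temp dir.
scratch="${LOCAL_SCRATCH:-$(mktemp -d)}"
# Hypothetical stand-in for a directory under $HOME on the shared storage.
workdir="$(mktemp -d)"

# Hypothetical input file.
echo "input data" > "$workdir/input.txt"

# 1. Copy the input to the local scratch.
cp "$workdir/input.txt" "$scratch/"
# 2. Work on the local copy (here: a trivial transformation).
tr 'a-z' 'A-Z' < "$scratch/input.txt" > "$scratch/output.txt"
# 3. Copy the results back before the job ends, since the scratch is then deleted.
cp "$scratch/output.txt" "$workdir/"

cat "$workdir/output.txt"
```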

WARNING

The shared file system performs poorly when working with many small files. To improve performance in such scenarios, create a file system inside an image file and mount it to work directly on it. The procedure is as follows:

Create the image file in your home folder:

## truncate image.name -s SIZE_IN_BYTES
truncate example.ext4 -s 20G

Create a filesystem in the image file:

## mkfs.ext4 -T small -m 0 image.name
## -T small optimized options for small files
## -m 0 Do not reserve capacity for root user 
mkfs.ext4 -T small -m 0 example.ext4

Mount the image (using sudo) with the script mount_image.py:

## By default it is mounted at /mnt/imagenes/<username>/ in read-only mode.
sudo mount_image.py example.ext4

To unmount the image, use the script umount_image.py (also using sudo).

The mount script has these options:

--mount-point path   <-- (optional) Mounts the image in a subdirectory: /mnt/imagenes/<username>/<path>
--rw                  <-- (optional) By default the image is mounted read-only; with this option it is mounted read-write.

Do not mount the image file read-write from more than one node!!!

The unmount script has a single option:

--mount-point path   <-- (optional) The same path that was used with --mount-point when mounting.

Transferring files and data

SCP

From your local machine to the cluster:

scp filename <username>@hpc-login2:/<path>

From the cluster to your local machine:

scp filename <username>@<hostname>:/<path>

SCP man page

SFTP

Useful for transferring several files or for browsing the file system.

<hostname>:~$ sftp <user_name>@hpc-login2
sftp>
sftp> ls
sftp> cd <path>
sftp> put <file>
sftp> get <file>
sftp> quit

SFTP man page

RSYNC

RSYNC documentation

SSHFS

Requires the sshfs package to be installed locally.
It allows, for example, mounting the home directory of the user's local machine on hpc-login2:

## Mount
sshfs  <username>@ctdeskxxx.inv.usc.es:/home/<username> <mount_point>
## Unmount
fusermount -u <mount_point>

SSHFS man page

Available Software

All nodes have the basic software that is installed by default in AlmaLinux 8.4, in particular:

GCC 8.5.0
Python 3.6.8
Perl 5.26.3

To use software that is not installed on the system, or a different version of software that is, there are three options:

Use Modules with the modules that are already installed (or request the installation of a new module if it is not available).
Use a container (uDocker or Apptainer/Singularity)
Use Conda

A module is the simplest solution for software that needs no modifications and has no hard-to-satisfy dependencies.
A container is ideal when dependencies are complicated and/or the software is highly customised. It is also the best solution if you are looking for reproducibility, ease of distribution and teamwork.
Conda is the best solution if you need the latest version of a library or program or packages not otherwise available.

Modules/Lmod use

Lmod documentation

# See available modules:
module avail
# Load a module:
module load <module_name>
# Unload a module:
module unload <module_name>
# List modules loaded in your environment:
module list
# ml can be used as a shorthand for the module command:
ml avail
# Get information about a module:
ml spider <module_name>

Software containers execution

uDocker

uDocker manual
uDocker is installed as a module, so it needs to be loaded into the environment:

ml uDocker

Apptainer/Singularity

Apptainer/Singularity documentation
Apptainer/Singularity is installed on each node's system, so you don't need to do anything to use it.

CONDA

Conda Documentation
Miniconda is the minimal version of Anaconda and only includes the conda environment manager, Python and a few necessary packages. From there on, each user only downloads and installs the packages they need.

# Getting miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh
# Install 
sh Miniconda3-py39_4.11.0-Linux-x86_64.sh

Using SLURM

The cluster queue manager is SLURM.

The term CPU identifies a physical core in a socket. Hyperthreading is disabled, so each node has as many CPUs available as (number of sockets) * (number of physical cores per socket).
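
As a quick check of that formula, the sketch below computes the CPU count for two node types using the socket and core figures from the hardware table at the top of this page. The variable names are illustrative only.

```shell
# CPUs per node = sockets * physical cores per socket (hyperthreading is disabled).
# hpc-node[3-9]: 2 sockets with 24 cores each (from the hardware table).
sockets=2
cores_per_socket=24
node_cpus=$((sockets * cores_per_socket))
echo "hpc-node[3-9]: $node_cpus CPUs"

# hpc-fat1: 4 sockets with 20 cores each (from the hardware table).
fat_cpus=$((4 * 20))
echo "hpc-fat1: $fat_cpus CPUs"
```

The results (48 and 80) match the CPUS column reported by sinfo for those nodes.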

Available resources

hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
# There is an alias for that command:
hpc-login2 ~]$ ver_recursos
NODELIST                        CPUS                  MEMORY                AVAIL_FEATURES        GRES                           
hpc-fat1                        80                    1027273               cpu_intel             (null)                         
hpc-gpu[1-2]                    36                    187911                cpu_intel             gpu:V100S:2                    
hpc-gpu3                        64                    253282                cpu_amd               gpu:A100_40:2                  
hpc-gpu4                        64                    253282                cpu_amd               gpu:A100_80:1(S:0)             
hpc-node[1-2]                   36                    187645                cpu_intel             (null)                         
hpc-node[3-9]                   48                    187645                cpu_intel             (null)
 
# To see current resource use: (CPUS (Allocated/Idle/Other/Total))
hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed
# There is an alias for that command:
hpc-login2 ~]$ ver_uso
NODELIST            CPUS(A/I/O/T)       MEMORY              FREE_MEM            GRES                GRES_USED
hpc-fat1            80/0/0/80           1027273             900850              (null)              gpu:0,mps:0
hpc-gpu3            2/62/0/64           253282              226026              gpu:A100_40:2       gpu:A100_40:2(IDX:0-
hpc-gpu4            1/63/0/64           253282              244994              gpu:A100_80:1(S:0)  gpu:A100_80:1(IDX:0)
hpc-node1           36/0/0/36           187645              121401              (null)              gpu:0,mps:0
hpc-node2           36/0/0/36           187645              130012              (null)              gpu:0,mps:0
hpc-node3           36/12/0/48          187645              126739              (null)              gpu:0,mps:0
hpc-node4           36/12/0/48          187645              126959              (null)              gpu:0,mps:0
hpc-node5           36/12/0/48          187645              128572              (null)              gpu:0,mps:0
hpc-node6           36/12/0/48          187645              127699              (null)              gpu:0,mps:0
hpc-node7           36/12/0/48          187645              127002              (null)              gpu:0,mps:0
hpc-node8           36/12/0/48          187645              128182              (null)              gpu:0,mps:0
hpc-node9           36/12/0/48          187645              127312              (null)              gpu:0,mps:0

Nodes

A node is SLURM's computation unit and corresponds to a physical server.

# Show node info:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18 
   CPUAlloc=0 CPUTot=36 CPULoad=0.00
   AvailableFeatures=cpu_intel
   ActiveFeatures=cpu_intel
   Gres=(null)
   NodeAddr=hpc-node1 NodeHostName=hpc-node1 Version=21.08.6
   OS=Linux 4.18.0-305.el8.x86_64 #1 SMP Wed May 19 18:55:28 EDT 2021 
   RealMemory=187645 AllocMem=0 FreeMem=166801 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defaultPartition 
   BootTime=2022-03-01T13:13:56 SlurmdStartTime=2022-03-01T15:36:48
   LastBusyTime=2022-03-07T14:34:12
   CfgTRES=cpu=36,mem=187645M,billing=36
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Partitions

Partitions in SLURM are logical groups of nodes. In the cluster there is a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.

# Show partition info:
hpc-login2 ~]$ sinfo
defaultPartition*    up   infinite     11   idle hpc-fat1,hpc-gpu[1-4],hpc-node[1-9]

Jobs

Jobs in SLURM are resource allocations to a user for a given time. Jobs are identified by a sequential number or JOBID.
A JOB consists of one or more STEPS, each consisting of one or more TASKS that use one or more CPUs. There is one STEP for each program that executes sequentially in a JOB and there is one TASK for each program that executes in parallel. Therefore in the simplest case such as launching a job consisting of executing the hostname command the JOB has a single STEP and a single TASK.
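
To make the JOB / STEP / TASK relationship more concrete, the sketch below writes a batch script whose job contains two sequential steps. It is only an illustration: the file name steps_example.sh and the job name are hypothetical, and the script is written and syntax-checked locally rather than submitted.

```shell
# Write a batch script whose JOB contains two sequential STEPS.
# The file name steps_example.sh is a hypothetical placeholder.
cat > steps_example.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=steps_demo
#SBATCH --ntasks=2
#SBATCH --time=00:05:00

# STEP 0: a single srun call, executed by 2 parallel TASKS
srun hostname
# STEP 1: starts once step 0 has finished, limited to 1 TASK
srun --ntasks=1 date
EOF

# Check the syntax locally; on the cluster it would be submitted with: sbatch steps_example.sh
bash -n steps_example.sh && echo "syntax OK"
```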

Queue system (QOS)

The queue to which each job is submitted defines the priority, the limits and also the relative “cost” to the user.

# Show queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the relevant info:
hpc-login2 ~]$ ver_colas
      Name   Priority           Flags UsageFactor                     MaxTRES     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU 
---------- ---------- --------------- ----------- --------------------------- ----------- ------------- --------- ----------- 
   regular        100     DenyOnLimit    1.000000   cpu=200,gres/gpu=1,node=4  4-04:00:00                      10          50 
interactive       200     DenyOnLimit    1.000000                      node=1    04:00:00        node=1         1           1 
    urgent        300     DenyOnLimit    2.000000           gres/gpu=1,node=1    04:00:00        cpu=36         5          15 
      long        100     DenyOnLimit    1.000000           gres/gpu=1,node=4  8-08:00:00                                     
     large        100     DenyOnLimit    1.000000          cpu=200,gres/gpu=2  4-04:00:00                      10          25 
     admin        500                    0.000000

# Priority: the relative priority of each queue.
# DenyOnLimit: the job will not be executed if it does not comply with the queue limits.
# UsageFactor: the relative cost to the user of running jobs on that queue.
# MaxTRES: limits applied to each job.
# MaxWall: maximum time the job can run.
# MaxTRESPU: global limits per user.
# MaxJobsPU: maximum number of jobs a user can have running simultaneously.
# MaxSubmitPU: maximum number of jobs a user can have in total, both queued and running.
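
The MaxWall values are easier to compare once converted to a single unit. The helper below is a small sketch, not part of the cluster tooling: `to_seconds` is a hypothetical function name, and it assumes the limit is written as D-HH:MM:SS or HH:MM:SS, as in the table above.

```shell
# Convert a time limit written as [D-]HH:MM:SS into seconds.
# Assumption: only the two formats shown in the QOS table are handled.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[-:]' '{
        if (NF == 4) {
            # D-HH:MM:SS
            print $1 * 86400 + $2 * 3600 + $3 * 60 + $4
        } else {
            # HH:MM:SS
            print $1 * 3600 + $2 * 60 + $3
        }
    }'
}

regular=$(to_seconds 4-04:00:00)
long=$(to_seconds 8-08:00:00)
interactive=$(to_seconds 04:00:00)
echo "regular: $regular s, long: $long s, interactive: $interactive s"
```

With the limits above, regular and large allow 100 hours per job, long allows 200 hours, and interactive and urgent allow 4 hours.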

Submitting a job to the queue system

Resource specification

By default, if a job is submitted without specifying anything, the system sends it to the default QOS (regular) and assigns it one node, one CPU and all the available memory. The time limit for the job is that of the queue (4 days and 4 hours). This is very inefficient; ideally, as far as possible, at least three parameters should be specified when submitting jobs:

The number of nodes (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).
The memory per node (--mem) or the memory per CPU (--mem-per-cpu).
The estimated run time of the job (--time).

In addition, the following parameters may be useful:

-J	--job-name	Name for the job. Default: name of the executable
-q	--qos	Name of the queue the job is submitted to. Default: regular
-o	--output	File or file pattern to which all standard output and error output is redirected.
	--gres	Type and/or number of GPUs requested for the job.
-C	--constraint	To specify that nodes with Intel or AMD processors are wanted (cpu_intel or cpu_amd)
	--exclusive	To request that the job does not share nodes with other jobs.
-w	--nodelist	List of nodes on which to run the job

How resources are allocated

By default, the allocation method between nodes is block allocation (all available cores in a node are allocated before another node is used). The default allocation method within each node is cyclic allocation (the requested cores are distributed evenly among the available sockets in the node).

Priority calculation

When a job is submitted to the queue system, the first thing that happens is a check that the requested resources are within the limits set for the corresponding queue. If any limit is exceeded, the submission is cancelled.
If resources are available the job runs immediately; otherwise it is queued. Each job is assigned a priority that determines the order in which queued jobs run when resources become available. The priority of each job is a weighted combination of three factors: the time it has been waiting in the queue (25%), the fixed priority of the queue (25%) and the user's fairshare (50%).
Fairshare is a dynamic calculation that SLURM makes for each user: it is the difference between the resources allocated and the resources consumed over the last 14 days.

hpc-login2 ~]$ sshare -l 
      User  RawShares  NormShares    RawUsage   NormUsage   FairShare 
---------- ---------- ----------- ----------- -----------  ---------- 
                         1.000000     2872400                0.500000 
                    1    0.500000     2872400    1.000000    0.250000 
user_name         100    0.071429        4833    0.001726    0.246436

# RawShares: the amount of resources, in absolute terms, allocated to the user. It is the same for all users.
# NormShares: the previous amount normalized to the total resources allocated.
# RawUsage: the number of CPU-seconds consumed by all of the user's jobs.
# NormUsage: the previous amount normalized to the total CPU-seconds consumed in the cluster.
# FairShare: the FairShare factor, between 0 and 1. The more the cluster is used, the closer it gets to 0 and the lower the priority.
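
As a rough worked example of the 25% / 25% / 50% weighting, the sketch below combines three factors that each lie between 0 and 1. Only the FairShare value comes from the sshare output above. The age factor of 0.5 is an invented value, and the queue factor of 0.2 assumes that the regular queue's priority (100) is normalized against the highest queue priority (500); the real numbers SLURM reports may be scaled differently.

```shell
# Weighted job priority from three factors in the range 0..1.
# age: hypothetical value; qos: 100/500 under the normalization assumption above;
# fs: FairShare value taken from the sshare output.
priority=$(LC_ALL=C awk -v age=0.5 -v qos=0.2 -v fs=0.246436 \
    'BEGIN { printf "%.4f", 0.25 * age + 0.25 * qos + 0.50 * fs }')
echo "priority factor: $priority"
```

Here 0.25 * 0.5 + 0.25 * 0.2 + 0.50 * 0.246436 = 0.298218, so a user with a low FairShare value loses more priority than waiting time alone can quickly recover.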

Job submission

salloc
srun
sbatch

1. SALLOC
Used to obtain a resource allocation (nodes) in real time. As soon as it is granted, the specified command is run, or a shell if no command is given.

# Get 5 nodes and launch a job.
hpc-login2 ~]$ salloc -N5 myprogram
# Get interactive access to a node (press Ctrl+D to end the session):
hpc-login2 ~]$ salloc -N1

2. SRUN
Used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.

# Run hostname on 2 nodes
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2

3. SBATCH
Used to submit a script to the queue system. It is batch-oriented and non-blocking.

# Create the script:
hpc-login2 ~]$ vim trabajo_ejemplo.sh
    #!/bin/bash
    #SBATCH --job-name=prueba            # Job name
    #SBATCH --nodes=1                    # -N Run all processes on a single node   
    #SBATCH --ntasks=1                   # -n Run a single task   
    #SBATCH --cpus-per-task=1            # -c Run 1 processor per task       
    #SBATCH --mem=1gb                    # Job memory request
    #SBATCH --time=00:05:00              # Time limit hrs:min:sec
    #SBATCH --qos=urgent                 # Queue (QOS)
    #SBATCH --output=prueba_%j.log       # Standard output and error log
 
    echo "Hello World!"
 
hpc-login2 ~]$ sbatch trabajo_ejemplo.sh

Using the GPU nodes

To specifically request a GPU allocation for a job, add one of these options to sbatch or srun:

--gres	GPU request per NODE	--gres=gpu[[:type]:count],...
--gpus or -G	GPU request per JOB	--gpus=[type]:count,...

The options --gpus-per-socket, --gpus-per-node and --gpus-per-task are also available.
Examples:

## See the list of nodes and GPUs:
hpc-login2 ~]$ ver_recursos
## To request any 2 GPUs for a JOB, add:
--gpus=2
## To request a 40G A100 on one node and an 80G A100 on another, add:
--gres=gpu:A100_40:1,gpu:A100_80:1
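
A complete batch script requesting a GPU might look like the sketch below. The file name gpu_example.sh, the job name and the CPU, memory and time values are illustrative assumptions; the GPU type name is the one listed by ver_recursos. The script is only written to disk here, not submitted.

```shell
# Write a batch script that requests one A100 40GB GPU on a single node.
# File name and resource values are hypothetical placeholders.
cat > gpu_example.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu_demo
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8gb
#SBATCH --time=00:10:00
#SBATCH --gres=gpu:A100_40:1     # one A100 40GB on the allocated node

# SLURM normally exports the GPUs assigned to the job in CUDA_VISIBLE_DEVICES
echo "Visible GPUs: $CUDA_VISIBLE_DEVICES"
EOF

# On the cluster it would be submitted with: sbatch gpu_example.sh
bash -n gpu_example.sh && echo "syntax OK"
```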

Job monitoring

## List all jobs in the queue
hpc-login2 ~]$ squeue
## List the jobs of a user
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <JOBID>
## List recent jobs
hpc-login2 ~]$ sacct -b
## Detailed historical information about a job:
hpc-login2 ~]$ sacct -l -j <JOBID>
## Debug information about a job, for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
## See the resource usage of a running job:
hpc-login2 ~]$ sstat <JOBID>

Controlling job output

Exit codes

By default, these are the exit codes of the commands:

SLURM command	Exit code
salloc	0 on success, 1 if the user's command could not be executed
srun	The highest exit code among all executed tasks, or 253 for an out-of-memory error
sbatch	0 on success; otherwise, the exit code of the process that failed
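
Inside a batch script, the exit code of each step can be captured right after it runs. The following sketch shows the pattern; `sh -c 'exit 3'` is a stand-in for a real step such as `srun ./myprogram`, so that the example can be tried without the queue system.

```shell
#!/bin/bash
# Capture and report the exit code of a step.
# `sh -c 'exit 3'` is a placeholder for a real step (for example: srun ./myprogram).
sh -c 'exit 3'
status=$?

if [ "$status" -eq 0 ]; then
    echo "step finished successfully"
else
    echo "step failed with exit code $status"
fi
```

A real script would typically end with `exit "$status"` so that the failure is also recorded as the exit code of the job.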

STDIN, STDOUT and STDERR

SRUN:
By default, stdout and stderr of all TASKS are redirected to the stdout and stderr of srun, and stdin is redirected from the stdin of srun to all TASKS. This can be changed with:

-i, --input=<option>

-o, --output=<option>

-e, --error=<option>

The options are:

all: default option.
none: nothing is redirected.
taskid: redirection only from and/or to the specified TASK id.
filename: everything is redirected from and/or to the specified file.
filename pattern: same as filename, but with a file name defined by a pattern.

SBATCH:
By default, “/dev/null” is open on the script's stdin, and stdout and stderr are redirected to a file named “slurm-%j.out”. This can be changed with:

-i, --input=<filename_pattern>

-o, --output=<filename_pattern>

-e, --error=<filename_pattern>

The filename_pattern reference is in the sbatch documentation.

Sending e-mail

JOBS can be configured to send e-mail in certain circumstances using these two parameters (BOTH ARE REQUIRED):

--mail-type=<type>	Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.
--mail-user=<user>	The destination e-mail address.
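
The sketch below combines the two mail parameters with separate files for stdout and stderr. The file name mail_example.sh, the job name and the address user@example.com are placeholders, and the script is only written to disk, not submitted.

```shell
# Write a batch script that sends e-mail when the job ends or fails
# and keeps stdout and stderr in separate files.
# File name, job name and e-mail address are hypothetical placeholders.
cat > mail_example.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=mail_demo
#SBATCH --time=00:05:00
#SBATCH --output=mail_demo_%j.out      # %j is replaced by the JOBID
#SBATCH --error=mail_demo_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=user@example.com

echo "Hello World!"
EOF

# On the cluster it would be submitted with: sbatch mail_example.sh
bash -n mail_example.sh && echo "syntax OK"
```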

Job states in the queue system

hpc-login2 ~]# squeue -l
JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1

Most common job states (STATE):

R RUNNING Job currently has an allocation.
CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
F FAILED Job terminated with non-zero exit code or other failure condition.
PD PENDING Job is awaiting resource allocation.

Full list of possible job states.

If a job is not running, a reason is shown under REASON: list of the reasons why a job may be waiting to run.

High Performance Computing (HPC) cluster ctcomp3