====== High Performance Computing (HPC) cluster ctcomp3 ======
[[ https://web.microsoftstream.com/video/f5eba154-b597-4440-9307-3befd7597d78 | Video of the presentation of the service (7/3/22) (Spanish only) ]]
===== Description =====
The computing part of the cluster is made up of:
* 9 servers for general computing.
* 1 "fat node" for memory-intensive jobs.
* 4 servers for GPU computing.
Users only have direct access to the login node, which has more limited features and should not be used for computing. \\
All nodes are interconnected by a 10Gb network. \\
There is distributed storage accessible from all nodes with 220 TB of capacity connected by a dual 25Gb fibre network. \\
\\
^ Name ^ Model ^ Processor ^ Memory ^ GPU ^
| hpc-login2 | Dell R440 | 1 x Intel Xeon Silver 4208 CPU @ 2.10GHz (8c) | 16 GB | - |
| hpc-node[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 @ 2.20GHz (18c) | 192 GB | - |
| hpc-node[3-9] | Dell R740 | 2 x Intel Xeon Gold 5220R @ 2.20GHz (24c) | 192 GB | - |
| hpc-fat1 | Dell R840 | 4 x Intel Xeon Gold 6248 @ 2.50GHz (20c) | 1 TB | - |
| hpc-gpu[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 @ 2.20GHz (18c) | 192 GB | 2x Nvidia Tesla V100S |
| hpc-gpu3 | Dell R7525 | 2 x AMD EPYC 7543 @ 2.80GHz (32c) | 256 GB | 2x Nvidia Ampere A100 40GB |
| hpc-gpu4 | Dell R7525 | 2 x AMD EPYC 7543 @ 2.80GHz (32c) | 256 GB | 1x Nvidia Ampere A100 80GB |
===== Accessing the cluster =====
To access the cluster, you must request access in advance via the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users who do not have access permission will receive an "incorrect password" message.
Access is through an SSH connection to the login node (172.16.242.211):
ssh <username>@hpc-login2.inv.usc.es
===== Storage, directories and filesystems =====
None of the file systems in the cluster are backed up!!!
The users' HOME in the cluster is on the shared filesystem, so it is accessible from all nodes in the cluster. Its path is defined in the environment variable %%$HOME%%. \\
Each node has a local 1TB scratch partition, which is deleted at the end of each job. It can be accessed through the %%$LOCAL_SCRATCH%% environment variable in the scripts. \\
For data to be shared by groups of users, you must request the creation of a folder in the shared storage that will only be accessible by members of the group.\\
^ Directory ^ Variable ^ Mount point ^ Capacity ^
| Home | %%$HOME%% | /mnt/beegfs/home/ | 220 TB* |
| Local scratch | %%$LOCAL_SCRATCH%% | varies (local to each node) | 1 TB |
| Group folder | %%$GRUPOS/%% | /mnt/beegfs/groups/ | 220 TB* |
%%* the storage capacity is shared (Home and Group folders reside on the same 220 TB filesystem)%%
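As an illustration of how the local scratch is typically used inside a job script (the program and file names below are hypothetical), data is copied to %%$LOCAL_SCRATCH%%, processed there, and the results are copied back before the job ends:
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --mem=4gb
# Copy the input data to the node-local scratch (it is wiped when the job ends)
cp $HOME/input.dat $LOCAL_SCRATCH/
cd $LOCAL_SCRATCH
# Work on the fast local disk
./my_program input.dat > output.dat
# Copy the results back to the shared storage before the job finishes
cp output.dat $HOME/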
=== WARNING ===
The shared filesystem performs poorly when working with many small files. To improve performance in such scenarios, create a filesystem inside an image file and work directly on it. The procedure is as follows:
* Create the image file in your home folder:
## truncate image.name -s SIZE_IN_BYTES
truncate example.ext4 -s 20G
* Create a filesystem in the image file:
## mkfs.ext4 -T small -m 0 image.name
## -T small optimized options for small files
## -m 0 Do not reserve capacity for root user
mkfs.ext4 -T small -m 0 example.ext4
* Mount the image (using SUDO) with the script //mount_image.py// :
## By default it is mounted at /mnt/imagenes/ in read-only mode.
sudo mount_image.py example.ext4
* To unmount the image use the script //umount_image.py// (using SUDO)
The mount script has these options:
--mount-point path <-- (optional) mounts the image in a subdirectory of /mnt/imagenes/ with the given name.
--rw <-- (optional) by default the image is mounted read-only; with this option it is mounted read-write.
Do not mount the image file read-write from more than one node!!!
The unmount script has one option:
--mount-point path <-- (optional) the same path you used when mounting.
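As a sketch of a complete cycle, assuming %%--mount-point%% takes a name relative to /mnt/imagenes/ (the name myimg is illustrative):
## Mount the image read-write under /mnt/imagenes/myimg
sudo mount_image.py --rw --mount-point myimg example.ext4
## ... work with the files inside /mnt/imagenes/myimg ...
## Unmount it using the same mount point
sudo umount_image.py --mount-point myimg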
===== File and data transfer =====
=== SCP ===
From your local machine to the cluster:
scp filename <username>@hpc-login2:<destination_path>
From the cluster to your local machine:
scp filename <username>@<your_local_machine>:<destination_path>
[[https://man7.org/linux/man-pages/man1/scp.1.html | SCP man page]]
=== SFTP ===
To transfer several files or to navigate through the filesystem.
local_machine:~$ sftp <username>@hpc-login2
sftp> ls
sftp> cd <path>
sftp> put <filename>
sftp> get <filename>
sftp> quit
[[https://www.unix.com/man-page/redhat/1/sftp/ | SFTP man page]]
=== RSYNC ===
[[ https://rsync.samba.org/documentation.html | RSYNC documentation ]]
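A typical use is keeping a local folder synchronised with the cluster, transferring only the files that changed (user name and paths below are illustrative):
## Copy/update a local folder into your cluster home:
rsync -avz --progress ./my_data/ <username>@hpc-login2.inv.usc.es:~/my_data/
## Bring results back from the cluster:
rsync -avz <username>@hpc-login2.inv.usc.es:~/results/ ./results/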
=== SSHFS ===
Requires local installation of the sshfs package.\\
Allows, for example, mounting the home directory of your desktop machine on hpc-login2:
## Mount
sshfs <username>@ctdeskxxx.inv.usc.es:/home/<username> <mount_point>
## Unmount
fusermount -u <mount_point>
[[https://linux.die.net/man/1/sshfs | SSHFS man page]]
===== Available Software =====
All nodes have the basic software that is installed by default in AlmaLinux 8.4, in particular:
* GCC 8.5.0
* Python 3.6.8
* Perl 5.26.3
GPU nodes, in addition:
* nVidia Driver 510.47.03
* CUDA 11.6
* libcudnn 8.7
To use any other software not installed on the system, or a different version of the installed one, there are three options:
- Use Modules with the modules that are already installed (or request the installation of a new module if it is not available).
- Use a container (uDocker or Apptainer/Singularity)
- Use Conda
A module is the simplest solution for using software that requires no modifications and has no hard-to-satisfy dependencies.\\
A container is ideal when dependencies are complicated and/or the software is highly customised. It is also the best solution if you are looking for reproducibility, ease of distribution and teamwork.\\
Conda is the best solution if you need the latest version of a library or program or packages not otherwise available.\\
==== Modules/Lmod use====
[[ https://lmod.readthedocs.io/en/latest/010_user.html | Lmod documentation]]
# See available modules:
module avail
# Load a module:
module load <module_name>
# Unload a module:
module unload <module_name>
# List modules loaded in your environment:
module list
# ml can be used as a shorthand of the module command:
ml avail
# To get info about a module:
ml spider <module_name>
==== Software containers execution ====
=== uDocker ===
[[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker manual ]] \\
udocker is installed as a module, so it needs to be loaded into the environment:
ml uDocker
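A minimal session could look like this (the ubuntu image is just an example; use whatever image you need):
ml uDocker
## Download an image, create a container from it and run a command inside it:
udocker pull ubuntu:22.04
udocker create --name=myubuntu ubuntu:22.04
udocker run myubuntu cat /etc/os-release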
=== Apptainer/Singularity ===
[[ https://sylabs.io/guides/3.8/user-guide/ | Apptainer/Singularity documentation]] \\
Apptainer/Singularity is installed on each node's system, so you don't need to do anything to use it.
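For example, an image can be pulled from Docker Hub and a command run inside it (the image name is illustrative):
## Build a local image file from a Docker Hub image:
singularity pull ubuntu.sif docker://ubuntu:22.04
## Run a command inside the container:
singularity exec ubuntu.sif cat /etc/os-release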
==== CONDA ====
[[ https://docs.conda.io/en/latest/miniconda.html | Conda Documentation ]] \\
Miniconda is the minimal version of Anaconda: it includes only the conda environment manager, Python and a few essential packages. From there on, each user downloads and installs only the packages they need.
# Getting miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize for bash shell
~/miniconda3/bin/conda init bash
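Once initialised, each user creates and manages their own environments; the package names and versions below are just an example:
# Create an environment with a specific Python version and a package:
conda create -n myenv python=3.10 numpy
# Activate it before running your programs:
conda activate myenv
# Install additional packages later:
conda install scipy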
===== Using SLURM =====
The cluster queue manager is [[ https://slurm.schedmd.com/documentation.html | SLURM ]]. \\
The term CPU identifies a physical core in a socket. Hyperthreading is disabled, so each node has as many CPUs available as (number of sockets) * (number of physical cores per socket).
== Available resources ==
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
NODO ESTADO CORES EN USO USO MEM GPUS(Uso/Total)
=============================================================================================================
hpc-fat1 up 0%[--------------------------------------------------]( 0/80) RAM: 0% ---
hpc-gpu1 up 2%[||------------------------------------------------]( 1/36) RAM: 47% V100S (1/2)
hpc-gpu2 up 2%[||------------------------------------------------]( 1/36) RAM: 47% V100S (1/2)
hpc-gpu3 up 0%[--------------------------------------------------]( 0/64) RAM: 0% A100_40 (0/2)
hpc-gpu4 up 1%[|-------------------------------------------------]( 1/64) RAM: 35% A100_80 (1/1)
hpc-node1 up 0%[--------------------------------------------------]( 0/36) RAM: 0% ---
hpc-node2 up 0%[--------------------------------------------------]( 0/36) RAM: 0% ---
hpc-node3 up 0%[--------------------------------------------------]( 0/48) RAM: 0% ---
hpc-node4 up 0%[--------------------------------------------------]( 0/48) RAM: 0% ---
hpc-node5 up 0%[--------------------------------------------------]( 0/48) RAM: 0% ---
hpc-node6 up 0%[--------------------------------------------------]( 0/48) RAM: 0% ---
hpc-node7 up 0%[--------------------------------------------------]( 0/48) RAM: 0% ---
hpc-node8 up 0%[--------------------------------------------------]( 0/48) RAM: 0% ---
hpc-node9 up 0%[--------------------------------------------------]( 0/48) RAM: 0% ---
=============================================================================================================
TOTALES: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]
hpc-login2 ~]$ sinfo -e -o "%30N %20c %20m %20f %30G " --sort=N
# There is an alias for that command:
hpc-login2 ~]$ ver_recursos
NODELIST CPUS MEMORY AVAIL_FEATURES GRES
hpc-fat1 80 1027273 cpu_intel (null)
hpc-gpu[1-2] 36 187911 cpu_intel gpu:V100S:2
hpc-gpu3 64 253282 cpu_amd gpu:A100_40:2
hpc-gpu4 64 253282 cpu_amd gpu:A100_80:1(S:0)
hpc-node[1-2] 36 187645 cpu_intel (null)
hpc-node[3-9] 48 187645 cpu_intel (null)
# To see current resource use: (CPUS (Allocated/Idle/Other/Total))
hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed
# There is an alias for that command:
hpc-login2 ~]$ ver_uso
NODELIST CPUS(A/I/O/T) MEMORY FREE_MEM GRES GRES_USED
hpc-fat1 80/0/0/80 1027273 900850 (null) gpu:0,mps:0
hpc-gpu3 2/62/0/64 253282 226026 gpu:A100_40:2 gpu:A100_40:2(IDX:0-
hpc-gpu4 1/63/0/64 253282 244994 gpu:A100_80:1(S:0) gpu:A100_80:1(IDX:0)
hpc-node1 36/0/0/36 187645 121401 (null) gpu:0,mps:0
hpc-node2 36/0/0/36 187645 130012 (null) gpu:0,mps:0
hpc-node3 36/12/0/48 187645 126739 (null) gpu:0,mps:0
hpc-node4 36/12/0/48 187645 126959 (null) gpu:0,mps:0
hpc-node5 36/12/0/48 187645 128572 (null) gpu:0,mps:0
hpc-node6 36/12/0/48 187645 127699 (null) gpu:0,mps:0
hpc-node7 36/12/0/48 187645 127002 (null) gpu:0,mps:0
hpc-node8 36/12/0/48 187645 128182 (null) gpu:0,mps:0
hpc-node9 36/12/0/48 187645 127312 (null) gpu:0,mps:0
==== Nodes ====
A node is SLURM's computation unit and corresponds to a physical server.
# Show node info:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18
CPUAlloc=0 CPUTot=36 CPULoad=0.00
AvailableFeatures=cpu_intel
ActiveFeatures=cpu_intel
Gres=(null)
NodeAddr=hpc-node1 NodeHostName=hpc-node1 Version=21.08.6
OS=Linux 4.18.0-305.el8.x86_64 #1 SMP Wed May 19 18:55:28 EDT 2021
RealMemory=187645 AllocMem=0 FreeMem=166801 Sockets=2 Boards=1
State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=defaultPartition
BootTime=2022-03-01T13:13:56 SlurmdStartTime=2022-03-01T15:36:48
LastBusyTime=2022-03-07T14:34:12
CfgTRES=cpu=36,mem=187645M,billing=36
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
==== Partitions ====
Partitions in SLURM are logical groups of nodes. In the cluster there is a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.
# Show partition info:
hpc-login2 ~]$ sinfo
defaultPartition* up infinite 11 idle hpc-fat1,hpc-gpu[1-4],hpc-node[1-9]
==== Jobs ====
Jobs in SLURM are resource allocations to a user for a given time. Jobs are identified by a sequential number or JOBID. \\
A JOB consists of one or more STEPS, each consisting of one or more TASKS, each of which uses one or more CPUs. There is one STEP for each program that executes sequentially within a JOB and one TASK for each program that executes in parallel. Therefore, in the simplest case, such as a job that just runs the hostname command, the JOB has a single STEP and a single TASK (see the sketch below).
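As a sketch (the program names are hypothetical), the following batch script creates one JOB with two STEPS, one per srun line, each running 4 parallel TASKS:
#!/bin/bash
#SBATCH --ntasks=4
srun ./preprocess    # STEP 0: 4 TASKS run in parallel
srun ./solver        # STEP 1: starts when STEP 0 has finished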
==== Queue system (QOS) ====
The queue to which each job is submitted defines the priority, the limits and also the relative "cost" to the user.
# Show queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the relevant info:
hpc-login2 ~]$ ver_colas
Name Priority MaxTRES MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU
---------- ---------- ---------------------------------------- ----------- -------------------- --------- -----------
regular 100 cpu=200,gres/gpu=1,node=4 4-04:00:00 cpu=200,node=4 10 50
interactive 200 node=1 04:00:00 node=1 1 1
urgent 300 gres/gpu=1,node=1 04:00:00 cpu=36 5 15
long 100 gres/gpu=1,node=4 8-04:00:00 1 5
large 100 cpu=200,gres/gpu=2 4-04:00:00 2 10
admin 500
small 150 cpu=6,node=2 04:00:00 cpu=400 40 100
# Priority: the relative priority of each queue. \\
# DenyonLimit: the job will not be executed if it does not comply with the queue limits. \\
# UsageFactor: the relative cost for the user of running jobs in that queue. \\
# MaxTRES: limits applied to each job. \\
# MaxWall: maximum time the job can run. \\
# MaxTRESPU: global limits per user. \\
# MaxJobsPU: maximum number of jobs a user can have running simultaneously. \\
# MaxSubmitPU: maximum number of jobs a user can have in total, both queued and running.\\
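For example, to run a job longer than the 4-day limit of the regular queue, it can be submitted to the long queue (the script name is illustrative):
hpc-login2 ~]$ sbatch --qos=long --time=6-00:00:00 my_long_job.sh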
==== Sending a job to the queue system ====
== Requesting resources ==
By default, if you submit a job without specifying anything, the system submits it to the default QOS (regular) and allocates one node, one CPU and 4 GB of RAM. The time limit for job execution is that of the queue (4 days and 4 hours).
This is very inefficient; ideally you should specify at least these three parameters when submitting a job (a combined example is shown after the table below):
- %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
- %%Memory (--mem) per node or memory per cpu (--mem-per-cpu).%%
- %%Job execution time ( --time )%%
In addition, it may be interesting to add the following parameters:
| -J | %%--job-name%% |Job name. Default: executable name |
| -q | %%--qos%% |Name of the queue to which the job is sent. Default: regular |
| -o | %%--output%% |File or file pattern to which all standard and error output is redirected. |
| | %%--gres%% |Type and/or number of GPUs requested for the job. |
| -C | %%--constraint%% |To request nodes with Intel or AMD processors (cpu_intel or cpu_amd) |
| | %%--exclusive%% |To request exclusive use of the node(s), so that no other jobs run on them |
| -w | %%--nodelist%% |List of nodes to run the job on |
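Putting it together, a request could look like this (all values are purely illustrative): 1 node, 4 tasks, 2 CPUs per task, 8 GB of RAM per node, 2 hours, on an Intel node:
sbatch -N1 -n4 -c2 --mem=8G --time=02:00:00 -C cpu_intel --job-name=mytest my_script.sh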
== How resources are allocated ==
The default allocation method between nodes is block allocation (all available cores on a node are allocated before using another node). The default allocation method within each node is cyclic allocation (the required cores are distributed equally among the available sockets in the node).
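Both defaults can be overridden with the %%--distribution%% (-m) option of sbatch/srun; a minimal sketch (the program name is hypothetical), where the first field sets the distribution across nodes and the second the distribution across sockets within a node:
srun -N2 -n8 --distribution=cyclic:block ./my_program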
== Priority calculation ==
When a job is submitted to the queuing system, the first thing that happens is that the requested resources are checked to see if they fall within the limits set in the corresponding queue. If it exceeds any of them, the submission is cancelled. \\
If resources are available, the job is executed directly, but if not, it is queued. Each job is assigned a priority that determines the order in which the jobs in the queue are executed when resources are available. To determine the priority of each job, 3 factors are weighted: the time it has been waiting in the queue (25%), the fixed priority of the queue (25%) and the user's fairshare (50%). \\
The fairshare is a dynamic calculation made by SLURM for each user and is the difference between the resources allocated and the resources consumed over the last 14 days.
hpc-login2 ~]$ sshare -l
User RawShares NormShares RawUsage NormUsage FairShare
---------- ---------- ----------- ----------- ----------- ----------
1.000000 2872400 0.500000
1 0.500000 2872400 1.000000 0.250000
user_name 100 0.071429 4833 0.001726 0.246436
# RawShares: the amount of resources allocated to the user in absolute terms. It is the same for all users.\\
# NormShares: the above amount normalised to the total allocated resources.\\
# RawUsage: the number of CPU-seconds consumed by all of the user's jobs.\\
# NormUsage: RawUsage normalised to the total CPU-seconds consumed in the cluster.\\
# FairShare: The FairShare factor between 0 and 1. The higher the cluster usage, the closer to 0 and the lower the priority.\\
== Job submission ==
- sbatch
- salloc
- srun
1. SBATCH \\
Used to send a script to the queuing system. It is batch-processing and non-blocking.
# Create the script:
hpc-login2 ~]$ vim test_job.sh
#!/bin/bash
#SBATCH --job-name=test # Job name
#SBATCH --nodes=1 # -N Run all processes on a single node
#SBATCH --ntasks=1 # -n Run a single task
#SBATCH --cpus-per-task=1 # -c Run 1 processor per task
#SBATCH --mem=1gb # Job memory request
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --qos=urgent # Queue
#SBATCH --output=test%j.log # Standard output and error log
echo "Hello World!"
hpc-login2 ~]$ sbatch test_job.sh
2. SALLOC \\
It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.
# Get 5 nodes and launch a job.
hpc-login2 ~]$ salloc -N5 myprogram
# Get interactive access to a node (Press Ctrl+D to exit):
hpc-login2 ~]$ salloc -N1
# Get interactive EXCLUSIVE access to a node
hpc-login2 ~]$ salloc -N1 --exclusive
3. SRUN \\
It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
# Launch the hostname command on 2 nodes
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
==== GPU use ====
To specifically request a GPU allocation for a job, options must be added to sbatch or srun:
| %%--gres%% | Request gpus per NODE | %%--gres=gpu[[:type]:count],...%% |
| %%--gpus or -G%% | Request gpus per JOB | %%--gpus=[type]:count,...%% |
There are also the options %%--gpus-per-socket, --gpus-per-node and --gpus-per-task%%.\\
Examples:
## See the list of nodes and gpus:
hpc-login2 ~]$ ver_recursos
## Request any 2 GPUs for a JOB, add:
--gpus=2
## Request a 40G A100 at one node and an 80G A100 at another node, add:
--gres=gpu:A100_40:1,gpu:A100_80:1
==== Job monitoring ====
## List all jobs in the queue
hpc-login2 ~]$ squeue
## List a user's jobs:
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <jobid>
## List of recent jobs:
hpc-login2 ~]$ sacct -b
## Detailed historical information for a job:
hpc-login2 ~]$ sacct -l -j <jobid>
## Debug information of a job for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <jobid>
## View the resource usage of a running job:
hpc-login2 ~]$ sstat <jobid>
==== Configure job output ====
== Exit codes ==
By default these are the exit codes of the commands:
^ SLURM command ^ Exit code ^
| salloc | 0 success, 1 if the user's command cannot be executed |
| srun | The highest among all executed tasks, or 253 for an out-of-memory error. |
| sbatch | 0 success, if not, the corresponding exit code of the failed process |
== STDIN, STDOUT and STDERR ==
**SRUN:**\\
By default stdout and stderr are redirected from all TASKS to srun's stdout and stderr, and stdin is redirected from srun's stdin to all TASKS. This can be changed with:
| %%-i, --input=