High Performance Computing (HPC) Cluster ctcomp3
Description
The compute part of the cluster consists of:
- 9 general computing servers.
- 1 “fat node” for jobs that require a lot of memory.
- 6 servers for computing with GPU.
Users only have direct access to the login node, which has more limited capabilities and should not be used for computing.
All nodes are interconnected by a 10Gb network.
There is distributed storage accessible from all nodes with 220 TB of capacity connected via a dual 25Gb fiber network.
| Name | Model | Processor | Memory | GPU |
|---|---|---|---|---|
| hpc-login2 | Dell R440 | 1 x Intel Xeon Silver 4208 @ 2.10 GHz (8c) | 16 GB | - |
| hpc-node[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 @ 2.20 GHz (18c) | 192 GB | - |
| hpc-node[3-9] | Dell R740 | 2 x Intel Xeon Gold 5220R @ 2.20 GHz (24c) | 192 GB | - |
| hpc-fat1 | Dell R840 | 4 x Intel Xeon Gold 6248 @ 2.50 GHz (20c) | 1 TB | - |
| hpc-gpu[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 @ 2.20 GHz (18c) | 192 GB | 2x Nvidia Tesla V100S 32GB |
| hpc-gpu3 | Dell R7525 | 2 x AMD EPYC 7543 @ 2.80 GHz (32c) | 256 GB | 2x Nvidia Ampere A100 40GB |
| hpc-gpu4 | Dell R7525 | 2 x AMD EPYC 7543 @ 2.80 GHz (32c) | 256 GB | 1x Nvidia Ampere A100 80GB |
| hpc-gpu5 | Dell R7725 | 2 x AMD EPYC 9255 @ 3.25 GHz (24c) | 384 GB | 2x Nvidia L4 24GB |
| hpc-gpu6 | Dell R7725 | 2 x AMD EPYC 9255 @ 3.25 GHz (24c) | 384 GB | 2x Nvidia L4 24GB |
Connection to the system
Access to the cluster must be requested beforehand through the incident form. Users without access permission will receive a “wrong password” message.
Access is done via SSH to the login node (172.16.242.211):
ssh <username>@hpc-login2.inv.usc.es
Storage, directories, and file systems
Users' HOME directories are on the shared file system and are therefore accessible from all nodes of the cluster. The path is defined in the environment variable $HOME.
Each node has a local 1 TB scratch partition, whose contents are deleted after each job completes. It can be accessed in scripts through the environment variable $LOCAL_SCRATCH.
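A common pattern in job scripts is to stage data into the local scratch, compute there, and copy the results back to the shared home before the job ends (a sketch; program and file names are illustrative):

## Copy input data to the node-local scratch and work there
cp $HOME/input.dat $LOCAL_SCRATCH/
cd $LOCAL_SCRATCH
./my_program input.dat > output.dat
## Save the results back to the shared storage before the job ends
cp output.dat $HOME/results/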
For data that need to be shared among groups of users, a request must be made to create a folder in shared storage that will only be accessible by group members.
| Directory | Variable | Mount point | Capacity |
|---|---|---|---|
| Home | $HOME | /mnt/beegfs/home/<username> | 220 TB* |
| Local Scratch | $LOCAL_SCRATCH | varies | 1 TB |
| Group folder | $GROUPS/<name> | /mnt/beegfs/groups/<name> | 220 TB* |
* the storage is shared: the 220 TB capacity is common to the home and group directories.
IMPORTANT NOTICE
The shared file system performs poorly when working with many small files. To improve performance in such scenarios, it is necessary to create a file system in an image file and mount it to work directly on it. The procedure is as follows:
- Create the image file in your home:
## truncate image.name -s SIZE_IN_BYTES
truncate example.ext4 -s 20G
- Create a file system in the image file:
## mkfs.ext4 -T small -m 0 image.name
##   -T small  optimized options for small files
##   -m 0      do not reserve space for root
mkfs.ext4 -T small -m 0 example.ext4
- Mount the image (using SUDO) with the script mount_image.py:
## By default it is mounted at /mnt/images/<username>/ in read-only mode.
sudo mount_image.py example.ext4
- To unmount the image, use the script umount_image.py (using SUDO):
sudo umount_image.py
The mount script has the following options:
--mount-point path   <-- (optional) Creates subdirectories under /mnt/images/<username>/<path>.
--rw                 <-- (optional) By default the image is mounted read-only; with this option it is mounted read-write.
The unmount script accepts a single optional parameter: the same path that was given to --mount-point when mounting.
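Putting it together, a possible session might look like this (a sketch based on the options above; the path is illustrative):

## Mount the image read-write under /mnt/images/<username>/data
sudo mount_image.py example.ext4 --mount-point data --rw
## Unmount it, indicating the same path
sudo umount_image.py --mount-point data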
File and data transfer
SCP
From your local machine to the cluster:
scp filename <username>@hpc-login2:/<path>
From the cluster to your local machine:
scp filename <username>@<hostname>:/<path>
SFTP
To transfer multiple files or to navigate through the file system.
<hostname>:~$ sftp <user_name>@hpc-login2
sftp> ls
sftp> cd <path>
sftp> put <file>
sftp> get <file>
sftp> quit
RSYNC
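A minimal sketch of typical rsync usage over SSH (paths are illustrative); rsync transfers only the differences, which makes it well suited for repeated copies:

## From your local machine to the cluster
rsync -avh <local_dir> <username>@hpc-login2:~/<destination>
## From the cluster to your local machine
rsync -avh <username>@hpc-login2:~/<dir> <local_destination>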
SSHFS
Requires the installation of the sshfs package.
It allows, for instance, mounting the home directory of the user's machine on hpc-login2:
## Mount
sshfs <username>@ctdeskxxx.inv.usc.es:/home/<username> <mount_point>
## Unmount
fusermount -u <mount_point>
Available Software
All nodes have the base software that AlmaLinux 8.4 installs by default, in particular:
- GCC 8.5.0
- Python 3.6.8
- Perl 5.26.3
On the nodes with GPU, additionally:
- Nvidia Driver 560.35.03
- CUDA 11.6
- libcudnn 8.7
To use any other software not installed on the system or another version of it, there are three options:
- Use Modules with the modules already installed (or request the installation of a new module if it is not available)
- Use a container (uDocker or Apptainer/Singularity)
- Use Conda
A module is the simplest solution to use software without modifications or difficult-to-satisfy dependencies.
A container is ideal when dependencies are complicated and/or the software is highly customized. It is also the best solution if reproducibility, ease of distribution, and teamwork are what you're looking for.
Conda is the best solution if you need the latest version of a library or program or packages that are not available otherwise.
Using modules/Lmod
# View available modules:
module avail
# Load a module:
module load <module_name>
# Unload a module:
module unload <module_name>
# View the modules loaded in your environment:
module list
# ml can be used as an abbreviation of the module command:
ml avail
# To get information about a module:
ml spider <module_name>
Running software containers
uDocker
uDocker manual
uDocker is installed as a module, so it is necessary to load it in the environment:
ml udocker
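A minimal usage sketch (the image and container names are illustrative):

ml udocker
## Download an image from Docker Hub
udocker pull ubuntu:22.04
## Create a container from the image
udocker create --name=myubuntu ubuntu:22.04
## Run a command inside the container
udocker run myubuntu cat /etc/os-release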
Apptainer/Singularity
Apptainer documentation
Apptainer is installed in the system of each node, so nothing needs to be done to use it.
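A minimal usage sketch (the image name is illustrative):

## Build a local SIF image from a Docker Hub image
apptainer pull ubuntu.sif docker://ubuntu:22.04
## Run a command inside the container
apptainer exec ubuntu.sif cat /etc/os-release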
CONDA
Conda documentation
Miniconda is the minimal version of Anaconda and only includes the conda environment manager, Python, and a few necessary packages. From there, each user simply downloads and installs the packages they need.
# Get miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install it
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize miniconda for the bash shell
~/miniconda3/bin/conda init bash
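From there, a typical workflow is to create per-project environments (environment, version, and package names are illustrative):

# Create an environment with a specific Python version
conda create -n myenv python=3.10
# Activate it and install the packages you need
conda activate myenv
conda install numpy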
Using SLURM
The queue manager in the cluster is SLURM.
Available resources
hpc-login2 ~]# view_status.sh
=============================================================================================================
NODE        STATUS  CORES IN USE                                           MEM USE  GPUS(Use/Total)
=============================================================================================================
hpc-fat1    up      0%[--------------------------------------------------]( 0/80)  RAM:  0%  ---
hpc-gpu1    up      2%[||------------------------------------------------]( 1/36)  RAM: 47%  V100S (1/2)
hpc-gpu2    up      2%[||------------------------------------------------]( 1/36)  RAM: 47%  V100S (1/2)
hpc-gpu3    up      0%[--------------------------------------------------]( 0/64)  RAM:  0%  A100_40 (0/2)
hpc-gpu4    up      1%[|-------------------------------------------------]( 1/64)  RAM: 35%  A100_80 (1/1)
hpc-node1   up      0%[--------------------------------------------------]( 0/36)  RAM:  0%  ---
hpc-node2   up      0%[--------------------------------------------------]( 0/36)  RAM:  0%  ---
hpc-node3   up      0%[--------------------------------------------------]( 0/48)  RAM:  0%  ---
hpc-node4   up      0%[--------------------------------------------------]( 0/48)  RAM:  0%  ---
hpc-node5   up      0%[--------------------------------------------------]( 0/48)  RAM:  0%  ---
hpc-node6   up      0%[--------------------------------------------------]( 0/48)  RAM:  0%  ---
hpc-node7   up      0%[--------------------------------------------------]( 0/48)  RAM:  0%  ---
hpc-node8   up      0%[--------------------------------------------------]( 0/48)  RAM:  0%  ---
hpc-node9   up      0%[--------------------------------------------------]( 0/48)  RAM:  0%  ---
=============================================================================================================
TOTAL: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]

hpc-login2 ~]$ sinfo -e -o "%30N %20c %20m %20f %30G " --sort=N
# There is an alias for this command:
hpc-login2 ~]$ view_resources
NODELIST       CPUS  MEMORY   AVAIL_FEATURES  GRES
hpc-fat1       80    1027273  cpu_intel       (null)
hpc-gpu[1-2]   36    187911   cpu_intel       gpu:V100S:2
hpc-gpu3       64    253282   cpu_amd         gpu:A100_40:2
hpc-gpu4       64    253282   cpu_amd         gpu:A100_80:1(S:0)
hpc-node[1-2]  36    187645   cpu_intel       (null)
hpc-node[3-9]  48    187645   cpu_intel       (null)

# To see current resource usage: (CPUS (Allocated/Idle/Other/Total))
hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed
# There is an alias for this command:
hpc-login2 ~]$ view_usage
NODELIST   CPUS(A/I/O/T)  MEMORY   FREE_MEM  GRES                GRES_USED
hpc-fat1   80/0/0/80      1027273  900850    (null)              gpu:0,mps:0
hpc-gpu3   2/62/0/64      253282   226026    gpu:A100_40:2       gpu:A100_40:2(IDX:0-
hpc-gpu4   1/63/0/64      253282   244994    gpu:A100_80:1(S:0)  gpu:A100_80:1(IDX:0)
hpc-node1  36/0/0/36      187645   121401    (null)              gpu:0,mps:0
hpc-node2  36/0/0/36      187645   130012    (null)              gpu:0,mps:0
hpc-node3  36/12/0/48     187645   126739    (null)              gpu:0,mps:0
hpc-node4  36/12/0/48     187645   126959    (null)              gpu:0,mps:0
hpc-node5  36/12/0/48     187645   128572    (null)              gpu:0,mps:0
hpc-node6  36/12/0/48     187645   127699    (null)              gpu:0,mps:0
hpc-node7  36/12/0/48     187645   127002    (null)              gpu:0,mps:0
hpc-node8  36/12/0/48     187645   128182    (null)              gpu:0,mps:0
hpc-node9  36/12/0/48     187645   127312    (null)              gpu:0,mps:0
Nodes
A node is the SLURM computing unit and corresponds to a physical server.
# Show information about a node:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=36 CPULoad=0.00
   AvailableFeatures=cpu_intel
   ActiveFeatures=cpu_intel
   Gres=(null)
   NodeAddr=hpc-node1 NodeHostName=hpc-node1 Version=21.08.6
   OS=Linux 4.18.0-305.el8.x86_64 #1 SMP Wed May 19 18:55:28 EDT 2021
   RealMemory=187645 AllocMem=0 FreeMem=166801 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defaultPartition
   BootTime=2022-03-01T13:13:56 SlurmdStartTime=2022-03-01T15:36:48
   LastBusyTime=2022-03-07T14:34:12
   CfgTRES=cpu=36,mem=187645M,billing=36
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Partitions
Partitions in SLURM are logical groups of nodes. In the cluster, there is only one partition to which all nodes belong, so there is no need to specify it when submitting jobs.
# Show partition information:
hpc-login2 ~]$ sinfo
defaultPartition*    up   infinite     11   idle  hpc-fat1,hpc-gpu[3-4],hpc-node[1-9]
# When ctgpgpu7 and 8 are added to the cluster, they will appear as nodes hpc-gpu1 and 2 respectively.
Jobs
Jobs in SLURM are resource allocations to a user for a specified time. Jobs are identified by a sequential number or JOBID.
A job (JOB) consists of one or more steps (STEPS), each consisting of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that runs sequentially in a JOB and there is one TASK for each program that runs in parallel. Therefore, in the simplest case, such as launching a job that consists of executing the hostname command, the JOB has a single STEP and a single TASK.
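For example, a minimal script sketch that produces a JOB with two STEPS, the first consisting of two TASKS (the commands are standard):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=2
srun -n2 hostname    # STEP 0: two TASKS run in parallel
srun -n1 sleep 10    # STEP 1: a single TASK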
Queue system (QOS)
The queue to which each job is sent defines the priority, limits, and also the relative “cost” for the user.
# Show queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the most relevant information:
hpc-login2 ~]$ view_queues
      Name   Priority                        MaxTRES     MaxWall            MaxTRESPU  MaxJobsPU  MaxSubmitPU
---------- ---------- ------------------------------ ----------- -------------------- ---------- -----------
   regular        100      cpu=200,gres/gpu=1,node=4  4-04:00:00       cpu=200,node=4         10           50
interacti+        200                         node=1    04:00:00               node=1          1            1
    urgent        300              gres/gpu=1,node=1    04:00:00               cpu=36          5           15
      long        100              gres/gpu=1,node=4  8-04:00:00                               1            5
     large        100             cpu=200,gres/gpu=2  4-04:00:00                               2           10
     admin        500
     small        100        cpu=6,gres/gpu=0,node=2  6-00:00:00              cpu=400        400          800
     short        150                   cpu=6,node=2    04:00:00                              40          100
# Priority: the relative priority of each queue.
# DenyOnLimit: jobs that do not meet the limits of the queue are rejected.
# UsageFactor: the relative cost for the user of running a job in that queue.
# MaxTRES: limits per job.
# MaxWall: maximum time a job can run.
# MaxTRESPU: global limits per user.
# MaxJobsPU: maximum number of jobs that a user can have running.
# MaxSubmitPU: maximum number of jobs that a user can have queued and running in total.
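To send a job to a specific queue, name it at submission time (a sketch; the script name is illustrative):

hpc-login2 ~]$ sbatch --qos=short my_script.sh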
Submitting a job to the queue system
Resource specification
By default, if a job is submitted without specifying anything, the system sends it to the default QOS (regular) and assigns it one node, one CPU, and 4 GB of RAM. The time limit for job execution is that of the queue (4 days and 4 hours). This is very inefficient; ideally, at least these three parameters should be specified when submitting jobs (see the example after the list):
- The number of nodes (-N or --nodes), tasks (-n or --ntasks), and/or CPUs per task (-c or --cpus-per-task).
- The memory (--mem) per node, or the memory per CPU (--mem-per-cpu).
- The estimated execution time of the job (--time).
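For example, a reasonably specified submission (values and script name are illustrative):

hpc-login2 ~]$ sbatch --ntasks=4 --mem=8G --time=02:00:00 my_script.sh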
Additionally, it may be interesting to add the following parameters:
| Short | Long | Description |
|---|---|---|
| -J | --job-name | Name for the job. Default: the name of the executable. |
| -q | --qos | Name of the queue (QOS) to which the job is sent. Default: regular. |
| -o | --output | File or file pattern to which all standard output and standard error is redirected. |
| | --gres | Type and/or number of GPUs requested for the job. |
| -C | --constraint | To request nodes with Intel or AMD processors (cpu_intel or cpu_amd). |
| | --exclusive | To request that the job does not share nodes with other jobs. |
| -w | --nodelist | List of nodes on which to run the job. |
How resources are allocated
By default, the allocation method among nodes is block allocation (all available cores on a node are allocated before using another node). The default allocation method within each node is cyclic allocation (the required cores are evenly distributed among the available sockets of the node).
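This behavior can be changed with the -m/--distribution option of sbatch and srun (a sketch; myprogram is illustrative):

## Distribute tasks cyclically among nodes instead of filling each node first
srun -N2 -n8 -m cyclic myprogram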
Calculation of priority
When a job is submitted to the queue system, the first thing that happens is that it checks whether the requested resources fit within the limits set in the corresponding queue. If it exceeds any, the submission is canceled.
If resources are available, the job runs directly, but if not, it is queued. Each job is assigned a priority that determines the order in which jobs in the queue are executed when resources become available. To determine the priority of each job, three factors are weighted: the time it has been waiting in the queue (25%), the fixed priority of the queue (25%), and the user's fairshare (50%).
The fairshare is a dynamic calculation made by SLURM for each user and is the difference between resources allocated and resources consumed over the last 14 days.
hpc-login2 ~]$ sshare -l
      User  RawShares   NormShares    RawUsage   NormUsage  FairShare
---------- ---------- ------------ ----------- ----------- ----------
                          1.000000     2872400    0.500000
                    1     0.500000     2872400    1.000000   0.250000
 user_name        100     0.071429        4833    0.001726   0.246436
# RawShares: the amount of resources, in absolute terms, allocated to the user. It is the same for all users.
# NormShares: the previous amount normalized to the total allocated resources.
# RawUsage: the number of CPU-seconds consumed by all the user's jobs.
# NormUsage: the previous amount normalized to the total CPU-seconds consumed in the cluster.
# FairShare: the FairShare factor, between 0 and 1. The more you use the cluster, the closer it gets to 0 and the lower your priority.
Job submission
- sbatch
- salloc
- srun
1. SBATCH
Used to submit a script to the queue system. It is non-blocking batch processing.
# Create the script:
hpc-login2 ~]$ vim example_job.sh
#!/bin/bash
#SBATCH --job-name=test          # Job name
#SBATCH --nodes=1                # -N Run all processes on a single node
#SBATCH --ntasks=1               # -n Run a single task
#SBATCH --cpus-per-task=1        # -c Run 1 processor per task
#SBATCH --mem=1gb                # Job memory request
#SBATCH --time=00:05:00          # Time limit hrs:min:sec
#SBATCH --qos=urgent             # Queue
#SBATCH --output=test_%j.log     # Standard output and error log
echo "Hello World!"

hpc-login2 ~]$ sbatch example_job.sh
2. SALLOC
Used to obtain an immediate allocation of resources (nodes). As soon as it is granted, the specified command is executed, or a shell if none is given.
# Get 5 nodes and launch a job.
hpc-login2 ~]$ salloc -N5 myprogram
# Get interactive access to a node (press Ctrl+D to end the session):
hpc-login2 ~]$ salloc -N1
# Get EXCLUSIVE interactive access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
3. SRUN
Used to launch a parallel job (it is preferable to use mpirun). It is interactive and blocking.
# Launch a hostname on 2 nodes
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
Using nodes with GPU
To specifically request a GPU allocation for a job, add to sbatch or srun the options:
| --gres | GPUs requested per NODE | --gres=gpu[[:type]:count],... |
| --gpus or -G | GPUs requested per JOB | --gpus=[type]:count,... |
There are also the options --gpus-per-socket, --gpus-per-node, and --gpus-per-task.
Examples:
## View the list of nodes and gpus:
hpc-login2 ~]$ view_resources
## To request any 2 GPUs for a JOB, add:
--gpus=2
## To request one 40G A100 on one node and one 80G A100 on another, add:
--gres=gpu:A100_40:1,gpu:A100_80:1
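A complete GPU submission could look like this (a sketch; the script name and values are illustrative):

hpc-login2 ~]$ sbatch --gres=gpu:V100S:1 --mem=32G --time=12:00:00 my_gpu_job.sh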
Monitoring jobs
## List all jobs in the queue
hpc-login2 ~]$ squeue
## List the jobs of a user
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <JOBID>
## List recent jobs
hpc-login2 ~]$ sacct -b
## Detailed historical information about a job:
hpc-login2 ~]$ sacct -l -j <JOBID>
## Debug information of a job for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
## View resource usage of a running job:
hpc-login2 ~]$ sstat <JOBID>
Controlling job output
Exit codes
By default, these are the exit codes of the commands:
| SLURM command | Exit code |
|---|---|
| salloc | 0 on success, 1 if the user's command could not run |
| srun | The highest among all executed tasks or 253 for an out-of-memory error |
| sbatch | 0 on success; otherwise, the exit code corresponding to the failed process |
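The recorded state and exit code of a finished job can be checked with sacct (the output fields are chosen as an example):

hpc-login2 ~]$ sacct -j <JOBID> -o JobID,JobName,State,ExitCode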
STDIN, STDOUT, and STDERR
SRUN:
By default, stdout and stderr from all TASKS are redirected to the stdout and stderr of srun, and stdin is redirected from the stdin of srun to all TASKS. This can be changed with:
| -i, --input=<option> |
| -o, --output=<option> |
| -e, --error=<option> |
And the options are:
- all: default option.
- none: Redirects nothing.
- taskid: Redirects only from/to the specified TASK id.
- filename: Redirects everything from/to the specified file.
- filename pattern: Same as filename, but with a file name defined by a pattern.
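For example, to give each TASK its own output file (a sketch; %t is the SLURM filename pattern for the task id):

srun -N2 -n2 -o out_task_%t.txt hostname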
SBATCH:
By default, “/dev/null” is opened on the script's stdin, and stdout and stderr are redirected to a file named “slurm-%j.out”. This can be changed with:
| -i, --input=<filename_pattern> |
| -o, --output=<filename_pattern> |
| -e, --error=<filename_pattern> |
The filename_pattern reference is here.
Sending emails
JOBS can be configured to send emails under certain circumstances using these two parameters (BOTH ARE REQUIRED):
| --mail-type=<type> | Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50. |
| --mail-user=<user> | The destination email address. |
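For example, inside a job script (the address is illustrative):

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<username>@usc.es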
Job states in the queue system
hpc-login2 ~]# squeue -l
JOBID  PARTITION     NAME        USER    STATE      TIME  NODES  NODELIST(REASON)
 6547  defaultPa  example  <username>  RUNNING  22:54:55      1  hpc-fat1

## Check the state of queue usage in the cluster:
hpc-login2 ~]$ queue_status.sh
JOBS PER USER:
--------------
usuario.uno: 3
usuario.dos: 1

JOBS PER QOS:
--------------
regular: 3
long: 1

JOBS PER STATE:
--------------
RUNNING: 3
PENDING: 1
==========================================
Total JOBS in cluster: 4
Common job states (STATE):
- R RUNNING Job currently has an allocation.
- CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
- F FAILED Job terminated with a non-zero exit code or another failure condition.
- PD PENDING Job is awaiting resource allocation.
Complete list of possible job states.
If a job is not running, the reason appears in the REASON column: list of reasons why a job may be waiting for execution.