====== High Performance Computing Cluster (HPC) ctcomp3 ======
[[ https://web.microsoftstream.com/video/f5eba154-b597-4440-9307-3befd7597d78 | Video of the service presentation (7/3/22) ]]
===== Description =====
The computing part of the cluster consists of:
  * 9 general-purpose servers.
  * 1 "fat node" for memory-intensive jobs.
  * 6 servers for GPU computing.

Users only have direct access to the login node, which has more limited resources and should not be used for computing. \\
All nodes are interconnected by a 10Gb network. \\
There is distributed storage accessible from all nodes, with a capacity of 220 TB, connected via a dual 25Gb fiber network. \\
\\
^ Name ^ Model ^ Processor ^ Memory ^ GPU ^
| hpc-login2 | Dell R440 | 1 x Intel Xeon Silver 4208 CPU @ 2.10GHz (8c) | 16 GB | - |
| hpc-node[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 @2.2 GHz (18c) | 192 GB | - |
| hpc-node[3-9] | Dell R740 | 2 x Intel Xeon Gold 5220R @2.2 GHz (24c) | 192 GB | - |
| hpc-fat1 | Dell R840 | 4 x Xeon Gold 6248 @ 2.50GHz (20c) | 1 TB | - |
| hpc-gpu[1-2] | Dell R740 | 2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c) | 192 GB | 2x Nvidia Tesla V100S 32GB |
| hpc-gpu3 | Dell R7525 | 2 x AMD EPYC 7543 @2.80 GHz (32c) | 256 GB | 2x Nvidia Ampere A100 40GB |
| hpc-gpu4 | Dell R7525 | 2 x AMD EPYC 7543 @2.80 GHz (32c) | 256 GB | 1x Nvidia Ampere A100 80GB |
| hpc-gpu5 | Dell R7725 | 2 x AMD EPYC 9255 @3.25 GHz (24c) | 384 GB | 2x Nvidia L4 24GB |
| hpc-gpu6 | Dell R7725 | 2 x AMD EPYC 9255 @3.25 GHz (24c) | 384 GB | 2x Nvidia L4 24GB |
===== Connection to the system =====
To access the cluster, it is necessary to request it in advance through the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will receive a "wrong password" message.
Access is made through an SSH connection to the login node (172.16.242.211):

    ssh @hpc-login2.inv.usc.es

===== Storage, directories and file systems =====
No backups are made of any of the cluster's file systems!!

The HOME of users in the cluster is on the shared file system, so it is accessible from all nodes of the cluster. Its path is defined in the environment variable %%$HOME%%. \\
Each node has a local partition of 1 TB for scratch, which is deleted at the end of each job. It can be accessed in scripts through the environment variable %%$LOCAL_SCRATCH%%. \\
For data that need to be shared among groups of users, it is necessary to request the creation of a folder in the shared storage that will only be accessible by group members.\\
^ Directory ^ Variable ^ Mount point ^ Capacity ^
| Home | %%$HOME%% | /mnt/beegfs/home/ | 220 TB* |
| Local scratch | %%$LOCAL_SCRATCH%% | varies | 1 TB |
| Group folder | %% $GRUPOS/%% | /mnt/beegfs/groups/ | 220 TB* |
%%* storage is shared%%
=== IMPORTANT NOTICE ===
The shared file system performs poorly when working with many small files. To improve performance in such scenarios, it is necessary to create a file system in an image file and mount it to work directly on it. The procedure is as follows:
  * Create the image file in your home:

    ## truncate image.name -s SIZE_IN_BYTES
    truncate example.ext4 -s 20G

  * Create a file system in the image file:

    ## mkfs.ext4 -T small -m 0 image.name
    ## -T small optimized options for small files
    ## -m 0 Do not reserve space for root
    mkfs.ext4 -T small -m 0 example.ext4

  * Mount the image (using SUDO) with the script //mount_image.py// :

    ## By default, it is mounted in /mnt/images// in read-only mode.
    sudo mount_image.py example.ext4

  * To unmount the image use the script //umount_image.py// (using SUDO)

    sudo umount_image.py

The file can only be mounted from a single node if done in read-write mode, but it can be mounted from any number of nodes in read-only mode.
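The first step of the procedure above mentions a size in bytes but uses the suffix form ''20G''. The following minimal sketch shows how the two relate, assuming GNU ''truncate'' and ''stat'' (as shipped with AlmaLinux), where the ''M'' and ''G'' suffixes are powers of 1024. The helper name ''gib_to_bytes'' and the throwaway file are hypothetical and only serve as illustration.

```shell
#!/bin/bash
# Hypothetical helper: convert a size in GiB to the equivalent byte count.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Demonstrate the suffix on a small throwaway file (1 MiB instead of 20 GiB).
demo_dir=$(mktemp -d)
truncate -s 1M "$demo_dir/demo.img"
apparent_size=$(stat -c %s "$demo_dir/demo.img")   # apparent size in bytes
echo "1M  -> $apparent_size bytes"
echo "20G -> $(gib_to_bytes 20) bytes"
rm -r "$demo_dir"
```

On file systems that support sparse files, ''truncate'' only sets the apparent size, so the image normally occupies real space only as data is written into it.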
The mounting script has the following options:

    --mount-point path   <-- (optional) This option creates subdirectories under /mnt/images//
    --rw                 <-- (optional) By default it is mounted read-only, with this option it is mounted read-write.

The unmounting script only accepts, as an optional parameter, the same path that was used for mounting with the option:

    --mount-point   <-- (optional)

===== File and data transfer =====
=== SCP ===
From your local machine to the cluster:

    scp filename @hpc-login2:/

From the cluster to your local machine:

    scp filename @:/

[[https://man7.org/linux/man-pages/man1/scp.1.html | SCP manual page]]
=== SFTP ===
To transfer multiple files or to navigate through the file system.

    :~$ sftp @hpc-login2
    sftp>
    sftp> ls
    sftp> cd
    sftp> put
    sftp> get
    sftp> quit

[[https://www.unix.com/man-page/redhat/1/sftp/ | SFTP manual page]]
=== RSYNC ===
[[ https://rsync.samba.org/documentation.html | RSYNC Documentation ]]
=== SSHFS ===
Requires the installation of the sshfs package.\\
It allows, for example, mounting the user's home on hpc-login2:

    ## Mount
    sshfs @ctdeskxxx.inv.usc.es:/home/
    ## Unmount
    fusermount -u

[[https://linux.die.net/man/1/sshfs | SSHFS manual page]]
===== Available Software =====
All nodes have the basic software installed by default with AlmaLinux 8.4, particularly:
  * GCC 8.5.0
  * Python 3.6.8
  * Perl 5.26.3

On the GPU nodes, in addition:
  * Nvidia Driver 560.35.03
  * CUDA 11.6
  * libcudnn 8.7

To use any other software not installed on the system or another version of it, there are three options:
  - Use Modules with the already installed modules (or request the installation of a new module if not available)
  - Use a container (uDocker or Apptainer/Singularity)
  - Use Conda

A module is the simplest solution to use software without modifications or hard-to-satisfy dependencies.\\
A container is ideal when dependencies are complicated and/or the software is highly customized.
It is also the best solution when reproducibility, ease of distribution, and teamwork are the goal.\\
Conda is the best solution if the latest version of a library or program is needed, or when packages are not available in any other way.\\
==== Use of modules/Lmod ====
[[ https://lmod.readthedocs.io/en/latest/010_user.html | Lmod Documentation ]]

    # View available modules:
    module avail
    # Load a module:
    module
    # Unload a module:
    module unload
    # View loaded modules in your environment:
    module list
    # ml can be used as an abbreviation of the module command:
    ml avail
    # To obtain information about a module:
    ml spider

==== Running Software Containers ====
=== uDocker ===
[[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker Manual]] \\
uDocker is installed as a module, so it is necessary to load it into the environment:

    ml udocker

=== Apptainer/Singularity ===
[[ https://apptainer.org/docs/user/1.4/ | Apptainer Documentation ]] \\
Apptainer is installed on every node's system, so no additional setup is needed to use it.
==== CONDA ====
[[ https://docs.conda.io/en/latest/miniconda.html | Conda Documentation ]] \\
Miniconda is the minimal version of Anaconda and only includes the conda environment manager, Python, and a few necessary packages. From there, each user only downloads and installs the packages they need.

    # Get miniconda
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # Install it
    bash Miniconda3-latest-Linux-x86_64.sh
    # Initialize miniconda for the bash shell
    ~/miniconda3/bin/conda init bash

===== Use of SLURM =====
The queue manager in the cluster is [[ https://slurm.schedmd.com/documentation.html | SLURM ]]. \\
The term CPU identifies a physical core of a socket. Hyperthreading is disabled, so each node has as many available CPUs as (number of sockets) * (number of physical cores per socket).
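As a worked example of the CPU rule above, the following minimal sketch multiplies sockets by cores per socket. The socket and core counts are taken from the hardware table in the Description section; the helper name ''slurm_cpus'' is hypothetical and not a command available on the cluster.

```shell
#!/bin/bash
# Hypothetical helper: number of CPUs SLURM exposes on a node when
# hyperthreading is disabled (sockets * physical cores per socket).
slurm_cpus() {
    local sockets=$1 cores_per_socket=$2
    echo $(( sockets * cores_per_socket ))
}

slurm_cpus 2 18   # hpc-node[1-2]: 2 x Xeon Gold 5220 (18c)
slurm_cpus 2 24   # hpc-node[3-9]: 2 x Xeon Gold 5220R (24c)
slurm_cpus 4 20   # hpc-fat1:      4 x Xeon Gold 6248 (20c)
```

The three calls print 36, 48 and 80, which are the CPU counts to keep in mind when choosing values for %%--cpus-per-task%% or %%--ntasks%% on those nodes.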
== Available Resources == hpc-login2 ~]# view_status.sh ============================================================================================================= NODE STATE CORES IN USE MEMORY USE GPUS(Use/Total) ============================================================================================================= hpc-fat1 up 0%[--------------------------------------------------]( 0/80) RAM: 0% --- hpc-gpu1 up 2%[||------------------------------------------------]( 1/36) RAM: 47% V100S (1/2) hpc-gpu2 up 2%[||------------------------------------------------]( 1/36) RAM: 47% V100S (1/2) hpc-gpu3 up 0%[--------------------------------------------------]( 0/64) RAM: 0% A100_40 (0/2) hpc-gpu4 up 1%[|-------------------------------------------------]( 1/64) RAM: 35% A100_80 (1/1) hpc-gpu5 up 0%[--------------------------------------------------]( 0/48) RAM: 0% L4 (0/2) hpc-gpu6 up 0%[--------------------------------------------------]( 0/48) RAM: 0% L4 (0/2) hpc-node1 up 0%[--------------------------------------------------]( 0/36) RAM: 0% --- hpc-node2 up 0%[--------------------------------------------------]( 0/36) RAM: 0% --- hpc-node3 up 0%[--------------------------------------------------]( 0/48) RAM: 0% --- hpc-node4 up 0%[--------------------------------------------------]( 0/48) RAM: 0% --- hpc-node5 up 0%[--------------------------------------------------]( 0/48) RAM: 0% --- hpc-node6 up 0%[--------------------------------------------------]( 0/48) RAM: 0% --- hpc-node7 up 0%[--------------------------------------------------]( 0/48) RAM: 0% --- hpc-node8 up 0%[--------------------------------------------------]( 0/48) RAM: 0% --- hpc-node9 up 0%[--------------------------------------------------]( 0/48) RAM: 0% --- ============================================================================================================= TOTAL: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7] hpc-login2 ~]$ sinfo -e -o "%30N %20c %20m %20f %30G " --sort=N # 
There is an alias for this command: hpc-login2 ~]$ view_resources NODELIST CPUS MEMORY AVAIL_FEATURES GRES hpc-fat1 80 1027273 cpu_intel (null) hpc-gpu[1-2] 36 187911 cpu_intel gpu:V100S:2 hpc-gpu3 64 253282 cpu_amd gpu:A100_40:2 hpc-gpu4 64 253282 cpu_amd gpu:A100_80:1(S:0) hpc-gpu[5-6] 48 375484 cpu_amd gpu:L4:2(S:1) hpc-node[1-2] 36 187645 cpu_intel (null) hpc-node[3-9] 48 187645 cpu_intel (null) # To see the current resource usage: (CPUS (Allocated/Idle/Other/Total)) hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed # There is an alias for this command: hpc-login2 ~]$ view_usage NODELIST CPUS(A/I/O/T) MEMORY FREE_MEM GRES GRES_USED hpc-fat1 80/0/0/80 1027273 900850 (null) gpu:0,mps:0 hpc-gpu1 16/20/0/36 187911 181851 gpu:V100S:2(S:0-1) gpu:V100S:2(IDX:0-1) hpc-gpu2 4/32/0/36 187911 183657 gpu:V100S:2(S:0-1) gpu:V100S:1(IDX:0),m hpc-gpu3 2/62/0/64 253282 226026 gpu:A100_40:2 gpu:A100_40:2(IDX:0- hpc-gpu4 1/63/0/64 253282 244994 gpu:A100_80:1(S:0) gpu:A100_80:1(IDX:0) hpc-gpu5 8/40/0/48 375484 380850 gpu:L4:2(S:1) gpu:L4:1(IDX:1),mps: hpc-gpu6 0/48/0/48 375484 380969 gpu:L4:2(S:1) gpu:L4:0(IDX:N/A),mp hpc-node1 36/0/0/36 187645 121401 (null) gpu:0,mps:0 hpc-node2 36/0/0/36 187645 130012 (null) gpu:0,mps:0 hpc-node3 36/12/0/48 187645 126739 (null) gpu:0,mps:0 hpc-node4 36/12/0/48 187645 126959 (null) gpu:0,mps:0 hpc-node5 36/12/0/48 187645 128572 (null) gpu:0,mps:0 hpc-node6 36/12/0/48 187645 127699 (null) gpu:0,mps:0 hpc-node7 36/12/0/48 187645 127002 (null) gpu:0,mps:0 hpc-node8 36/12/0/48 187645 128182 (null) gpu:0,mps:0 hpc-node9 36/12/0/48 187645 127312 (null) gpu:0,mps:0 ==== Nodes ==== A node is the unit of computation of SLURM and corresponds to a physical server. 
# Show information about a node: hpc-login2 ~]$ scontrol show node hpc-node1 NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18 CPUAlloc=0 CPUTot=36 CPULoad=0.00 AvailableFeatures=cpu_intel ActiveFeatures=cpu_intel Gres=(null) NodeAddr=hpc-node1 NodeHostName=hpc-node1 Version=21.08.6 OS=Linux 4.18.0-305.el8.x86_64 #1 SMP Wed May 19 18:55:28 EDT 2021 RealMemory=187645 AllocMem=0 FreeMem=166801 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A Partitions=defaultPartition BootTime=2022-03-01T13:13:56 SlurmdStartTime=2022-03-01T15:36:48 LastBusyTime=2022-03-07T14:34:12 CfgTRES=cpu=36,mem=187645M,billing=36 AllocTRES= CapWatts=n/a CurrentWatts=0 AveWatts=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s ==== Partitions ==== Partitions in SLURM are logical groups of nodes. In the cluster, there is only one partition to which all nodes belong, so it is not necessary to specify it when submitting jobs. # Show partition information: hpc-login2 ~]$ sinfo defaultPartition* up infinite 11 idle hpc-fat1,hpc-gpu[1-6],hpc-node[1-9] # When ctgpgpu7 and 8 are added to the cluster, they will appear as nodes hpc-gpu1 and 2 respectively. ==== Jobs ==== Jobs in SLURM are resource assignments to a user for a specified time. Jobs are identified by a sequential number or JOBID. \\ A job (JOB) consists of one or more steps (STEPS), each consisting of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that is run sequentially in a JOB and there is one TASK for each program that is run in parallel. Therefore, in the simplest case, such as launching a job consisting of executing the command hostname, the JOB has a single STEP and a single TASK. ==== Queue System (QOS) ==== The queue to which each job is sent defines its priority, limits, and also the relative "cost" for the user. 
    # Show the queues
    hpc-login2 ~]$ sacctmgr show qos
    # There is an alias that shows only the most relevant information:
    hpc-login2 ~]$ view_queues
          Name   Priority           Flags UsageFactor                        MaxTRES     MaxWall                      MaxTRESPU MaxJobsPU MaxSubmitPU
    ---------- ---------- --------------- ----------- ------------------------------ ----------- ------------------------------ --------- -----------
       regular        100     DenyOnLimit    1.000000      cpu=200,gres/gpu=1,node=4  4-04:00:00      cpu=200,gres/gpu=4,node=4        10          50
    interacti+        200     DenyOnLimit    1.000000              gres/gpu=1,node=1    04:00:00              gres/gpu=1,node=1         1           1
        urgent        300     DenyOnLimit    2.000000              gres/gpu=1,node=1    04:00:00              cpu=36,gres/gpu=2         5          15
          long        100     DenyOnLimit    1.000000              gres/gpu=1,node=4  8-04:00:00                     gres/gpu=2         1           5
         large        100     DenyOnLimit    1.000000             cpu=200,gres/gpu=2  4-04:00:00                     gres/gpu=2         2          10
         small        100     DenyOnLimit    1.000000        cpu=6,gres/gpu=0,node=2  6-00:00:00                        cpu=400       400         800
         short        150     DenyOnLimit    1.000000        cpu=6,gres/gpu=0,node=2    04:00:00                                       40         100

# Priority: is the relative priority of each queue. \\
# DenyOnLimit: the job does not run if it does not meet the limits of the queue. \\
# UsageFactor: the relative cost for the user of running a job in that queue. \\
# MaxTRES: limits for each job. \\
# MaxWall: maximum time a job can be running. \\
# MaxTRESPU: global limits per user. \\
# MaxJobsPU: Maximum number of jobs a user can have running. \\
# MaxSubmitPU: Maximum number of jobs a user can have queued and running in total.\\
==== Submitting a Job to the Queue System ====
== Resource Specification ==
By default, if a job is submitted without specifying anything, the system sends it to the default QOS (regular) and assigns it one node, one CPU, and 4 GB of RAM. The time limit for job execution is that of the queue (4 days and 4 hours).
This is very inefficient; ideally, at least three parameters should be specified when submitting jobs:
  - %%The number of nodes (-N or --nodes), tasks (-n or --ntasks), and/or CPUs per task (-c or --cpus-per-task).%%
  - %%The memory (--mem) per node or memory per CPU (--mem-per-cpu).%%
  - %%The estimated execution time of the job ( --time )%%

Additionally, it may be useful to add the following parameters:
| -J | %%--job-name%% | Name for the job. Default: name of the executable |
| -q | %%--qos%% | Name of the queue to which the job is sent. Default: regular |
| -o | %%--output%% | File or file pattern to which all standard output and error is redirected. |
| | %%--gres%% | Type and/or number of GPUs requested for the job. |
| -C | %%--constraint%% | To specify that nodes with Intel or AMD processors (cpu_intel or cpu_amd) are desired. |
| | %%--exclusive%% | To request that the job does not share nodes with other jobs. |
| -w | %%--nodelist%% | List of nodes to execute the job on |
== How Resources are Assigned ==
By default, the method of assignment between nodes is block allocation (all available cores in a node are allocated before using another). The default method of allocation within each node is cyclic allocation (the requested CPUs are evenly distributed across the available sockets in the node).
== Calculating Priority ==
When a job is submitted to the queue system, the first thing that happens is a check to see if the requested resources fall within the limits set in the corresponding queue. If the job exceeds any limit, the submission is rejected. \\
If resources are available, the job executes directly, but if not, it is queued. Each job has an assigned priority that determines the order in which jobs in the queue are executed when resources become available. To determine the priority of each job, three factors are weighted: the time spent waiting in the queue (25%), the fixed priority of the queue (25%), and the user's fairshare (50%).
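As a rough illustration of the weighting described above, the following minimal sketch combines the three factors. It assumes that each factor is already normalized to a value between 0 and 1 and that the result is a plain weighted sum; the helper name ''job_priority'' and the sample values are hypothetical, and the actual priority is computed internally by SLURM.

```shell
#!/bin/bash
# Hypothetical helper (not a cluster command): weighted sum of the three
# priority factors, each assumed to be normalized to the range 0..1.
# Weights follow the text: queue wait time 25%, queue priority 25%, fairshare 50%.
job_priority() {
    local age=$1 qos=$2 fairshare=$3
    LC_ALL=C awk -v a="$age" -v q="$qos" -v f="$fairshare" \
        'BEGIN { printf "%.4f\n", 0.25 * a + 0.25 * q + 0.50 * f }'
}

# Same wait time and queue, different fairshare:
job_priority 0.5 0.5 1.0   # user with little recent usage
job_priority 0.5 0.5 0.0   # user with heavy recent usage
```

With these sample values the first call prints 0.7500 and the second 0.2500. Under the stated assumptions, fairshare alone can move a job across half of the priority range.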
\\ Fairshare is a value that SLURM computes dynamically for each user from the difference between the resources allocated to the user and the resources the user has consumed over the last 14 days.

    hpc-login2 ~]$ sshare -l
    User        RawShares  NormShares    RawUsage   NormUsage  FairShare
    ---------- ---------- ----------- ----------- ----------- ----------
                             1.000000     2872400               0.500000
                        1    0.500000     2872400    1.000000   0.250000
    user_name         100    0.071429        4833    0.001726   0.246436

# RawShares: the amount of resources, in absolute terms, assigned to the user. It is the same for all users.\\
# NormShares: the previous amount normalized to the total allocated resources.\\
# RawUsage: the number of CPU-seconds consumed by all the user's jobs.\\
# NormUsage: the previous amount normalized to the total CPU-seconds consumed in the cluster.\\
# FairShare: the FairShare factor, between 0 and 1. The greater the use of the cluster, the closer it will be to 0, and the lower the priority.\\
== Submitting Jobs ==
  - sbatch
  - salloc
  - srun

1. SBATCH \\
Used to submit a script to the queue system. It is batch processing and non-blocking.

    # Create the script:
    hpc-login2 ~]$ vim example_job.sh
        #!/bin/bash
        #SBATCH --job-name=test            # Job name
        #SBATCH --nodes=1                  # -N Run all processes on a single node
        #SBATCH --ntasks=1                 # -n Run a single task
        #SBATCH --cpus-per-task=1          # -c Run 1 processor per task
        #SBATCH --mem=1gb                  # Job memory request
        #SBATCH --time=00:05:00            # Time limit hrs:min:sec
        #SBATCH --qos=urgent               # Queue
        #SBATCH --output=test_%j.log       # Standard output and error log

        echo "Hello World!"

    hpc-login2 ~]$ sbatch example_job.sh

2. SALLOC \\
Used to obtain an immediate allocation of resources (nodes). As soon as the allocation is obtained, the specified command runs, or a shell if no command is given.

    # Get 5 nodes and launch a job.
    hpc-login2 ~]$ salloc -N5 myprogram
    # Get interactive access to a node (Press Ctrl+D to exit):
    hpc-login2 ~]$ salloc -N1
    # Get exclusive interactive access to a node
    hpc-login2 ~]$ salloc -N1 --exclusive

3.
SRUN \\
Used to launch a parallel job (it is preferable to use mpirun). It is interactive and blocking.

    # Run hostname on 2 nodes
    hpc-login2 ~]$ srun -N2 hostname
    hpc-node1
    hpc-node2

==== Use of Nodes with GPU ====
To request GPUs for a job, one of the following options must be added to sbatch or srun:
| %%--gres%% | GPU request per NODE | %%--gres=gpu[[:type]:count],...%% |
| %%--gpus or -G%% | GPU request per JOB | %%--gpus=[type:]count,...%% |
There are also the options %%--gpus-per-socket, --gpus-per-node and --gpus-per-task%%.\\
Examples:

    ## View the list of nodes and gpus:
    hpc-login2 ~]$ view_resources
    ## Request any 2 GPUs for a JOB, add:
    --gpus=2
    ## Request an A100 of 40G on one node and an A100 of 80G on another, add:
    --gres=gpu:A100_40:1,gpu:A100_80:1

==== Monitoring Jobs ====

    ## List of all jobs in the queue
    hpc-login2 ~]$ squeue
    ## List of a user's jobs
    hpc-login2 ~]$ squeue -u
    ## Cancel a job:
    hpc-login2 ~]$ scancel
    ## List of recent jobs
    hpc-login2 ~]$ sacct -b
    ## Detailed historical information about a job:
    hpc-login2 ~]$ sacct -l -j
    ## Debug information about a job for troubleshooting:
    hpc-login2 ~]$ scontrol show jobid -dd
    ## View resource usage of a running job:
    hpc-login2 ~]$ sstat

==== Controlling Job Outputs ====
== Exit Codes ==
By default, these are the exit codes of the commands:
^ SLURM command ^ Exit code ^
| salloc | 0 in case of success, 1 if the user's command could not be executed |
| srun | The highest among all executed tasks or 253 for an out-of-memory error |
| sbatch | 0 in case of success, otherwise the corresponding exit code of the failed process |
== STDIN, STDOUT and STDERR ==
**SRUN:**\\
By default stdout and stderr are redirected from all TASKS to the stdout and stderr of srun, and stdin is redirected from the stdin of srun to all TASKS. This can be changed with:
| %%-i, --input=