====== High-Performance Computing (HPC) cluster ======
[[ https:// ]]
===== Description =====
The computing part of the cluster is made up of:
  * 9 servers for general computation.
  * 1 "fat node" for memory-intensive jobs.
  * 6 servers for GPU computing.
Users only have direct access to the login node, which has more limited resources and is not intended for running computations. \\
All nodes are interconnected by a 10 Gb network. \\
There is distributed storage with a capacity of 220 TB, accessible from all nodes. \\
\\
^ Name ^ Model ^ Processor ^
| hpc-login2 | | |
| hpc-node[1-2] | | |
| hpc-node[3-9] | | |
| hpc-fat1 | | |
| hpc-gpu[1-2] | | |
| hpc-gpu3 | | |
| hpc-gpu4 | | |
| hpc-gpu5 | | |
| hpc-gpu6 | | |
===== Connecting to the system =====
To access the cluster, access must be requested in advance via [[ https:// ]].

Access is done through an SSH connection to the login node:
<code bash>
ssh <username>@hpc-login2
</code>
===== Storage, directories, and file systems =====
Users' HOME directories are on the shared file system, so they are accessible from all nodes of the cluster. Their path is defined in the environment variable %%$HOME%%. \\
Each node has a local scratch partition, the contents of which are deleted when the job finishes. \\
For data that must be shared by groups of users, there are shared group folders. \\
^ Directory ^ Path ^
| Home | %%$HOME%% |
| Local scratch | |
| Group folder | |
%%* storage is shared%%
=== IMPORTANT NOTICE ===
The shared file system does not handle large numbers of small files well. If you need to work with many small files, a better option is to pack them into an image file and mount it as a local file system:
  * Create the image file in your home directory:
<code bash>
## truncate image.name -s SIZE_IN_BYTES
truncate example.ext4 -s 20G
</code>
  * Create a file system inside the image file:
<code bash>
## mkfs.ext4 -T small -m 0 image.name
## -T small : options optimized for small files
## -m 0     : do not reserve space for the superuser
mkfs.ext4 -T small -m 0 example.ext4
</code>
  * Mount the image (using sudo) with the script //mount_image.py//:
<code bash>
## By default, it is mounted read-only:
sudo mount_image.py example.ext4
</code>
  * To unmount the image, use the script //umount_image.py//:
<code bash>
sudo umount_image.py
</code>
<note warning>
The image file can only be mounted from a single node in read-write mode, but it can be mounted from any number of nodes in read-only mode.
</note>
The mount script has these options:
<code>
--mount-point path <-- (optional) Path where the image is mounted.
--rw               <-- (optional) By default the image is mounted read-only; with this option it is mounted read-write.
</code>
The unmount script has these options:
<code>
--mount-point path <-- (optional) Path where the image to unmount is mounted.
</code>
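As an illustration, a possible end-to-end workflow could look like the following sketch. The image name, mount point and data paths are only examples, and the exact defaults of the mount/umount scripts may differ.
<code bash>
## Create and format a 20 GB image in the home directory (example names):
truncate datasets.ext4 -s 20G
mkfs.ext4 -T small -m 0 datasets.ext4
## Mount it read-write on a single node to copy the small files in (example paths):
sudo mount_image.py --rw --mount-point $HOME/mnt/datasets datasets.ext4
cp -r <directory_with_many_small_files>/* $HOME/mnt/datasets/
sudo umount_image.py --mount-point $HOME/mnt/datasets
## Later, jobs on any number of nodes can mount the same image read-only:
sudo mount_image.py --mount-point $HOME/mnt/datasets datasets.ext4
</code>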
===== File and data transfer =====
=== SCP ===
From your local machine to the cluster:
<code bash>
scp filename <username>@hpc-login2:<destination_directory>
</code>
[[ https:// ]]
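For the opposite direction (from the cluster to your local machine), the same command can be run from your local machine; the paths below are only placeholders.
<code bash>
## Copy a remote file from the cluster to a local directory:
scp <username>@hpc-login2:<remote_file> <local_destination_directory>
</code>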
=== SFTP ===
To transfer several files or to browse the file system:
<code bash>
<local>$ sftp <username>@hpc-login2
sftp> get remote_file
sftp> put local_file
sftp> quit
</code>
[[ https:// ]]
=== RSYNC ===
[[ https:// ]]
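As a sketch, rsync can be used much like scp but only transfers the differences, which is useful for large or repeated copies. The flags shown are common choices, not a site requirement, and the paths are placeholders.
<code bash>
## Copy a local directory to the cluster, preserving permissions and showing progress:
rsync -avh --progress <local_directory>/ <username>@hpc-login2:<destination_directory>/
## Running the same command again later only transfers the files that changed.
</code>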
=== SSHFS ===
Requires the installation of the sshfs package. \\
It allows you, for example, to mount the home directory on hpc-login2 from your local machine:
<code bash>
## Mount
<local>$ sshfs <username>@hpc-login2: <local_mount_point>
## Unmount
fusermount -u <local_mount_point>
</code>
[[ https:// ]]
===== Available Software =====
All nodes have the basic software that is installed by default with the operating system, including:
  * GCC 8.5.0
  * Python 3.6.8
  * Perl 5.26.3
On the nodes with GPU, additionally:
  * nVidia Driver 560.35.03
  * CUDA 11.6
  * libcudnn 8.7
To use any other software not installed on the system, or a different version of it, there are three options:
  - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available)
  - Use a container (uDocker or Apptainer/Singularity)
  - Use Conda
A module is the simplest solution when the required software and version are already available. \\
A container is ideal when dependencies are complicated and/or the software is highly customized. \\
Conda is the best solution if you need the latest version of a library or program, or packages that are not available by other means. \\
==== Using modules/Lmod ====
[[ https:// ]]
<code bash>
# View available modules:
module avail
# Load a module:
module load <module_name>
# Unload a module:
module unload <module_name>
# View loaded modules:
module list
# ml can be used as a shorthand for the module command:
ml avail
# To get information about a module:
ml spider <module_name>
</code>
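When several versions of the same module exist, a specific one can be selected by appending the version, as in this sketch; the module name, version and program are placeholders.
<code bash>
## Check which versions of a module are available:
ml spider <module_name>
## Load a specific version:
ml <module_name>/<version>
## The software provided by the module is now first in the PATH:
which <program_provided_by_module>
</code>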
==== Running software in containers ====
=== uDocker ===
[[ https:// ]]
uDocker is installed as a module, so it must be loaded first:
<code bash>
ml udocker
</code>
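A minimal sketch of using uDocker once the module is loaded; the container image and command are only examples.
<code bash>
## Download an image from Docker Hub (example image):
udocker pull ubuntu:22.04
## Create a named container from the image:
udocker create --name=myubuntu ubuntu:22.04
## Run a command inside the container:
udocker run myubuntu cat /etc/os-release
</code>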
=== Apptainer/Singularity ===
[[ https://apptainer.org/docs/user/1.4/ | Apptainer documentation ]]
Apptainer is installed on the operating system of every node, so nothing needs to be loaded in order to use it.
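A minimal sketch of running a containerized program with Apptainer; the image and command are illustrative.
<code bash>
## Build a local SIF image from a Docker Hub image (example image):
apptainer pull python_3.12.sif docker://python:3.12
## Run a command inside the container:
apptainer exec python_3.12.sif python3 --version
</code>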
==== CONDA ====
[[ https:// ]]
Miniconda is the minimal conda installer; from it you can create environments and install the packages you need. \\
<code bash>
# Get miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install it
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize miniconda for the bash shell
~/miniconda3/bin/conda init bash
</code>
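Once miniconda is initialized, environments are created and activated per project. The environment name, Python version and packages below are only examples.
<code bash>
## Create an environment with a specific Python version and some packages:
conda create -n myenv python=3.11 numpy scipy
## Activate it (also inside job scripts, before running the program):
conda activate myenv
## Deactivate it when finished:
conda deactivate
</code>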
| - | |||
===== Using SLURM =====
The cluster's job manager is [[ https:// | SLURM ]]. \\
<note tip>The term CPU refers to a physical core of a socket. Hyperthreading is disabled, so the number of CPUs available on each node is (number of sockets) x (number of physical cores per socket).</note>
== Available resources ==
<code bash>
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
  NODE            ...
=============================================================================================================
  ...
=============================================================================================================
TOTALS: [Cores : 3/688] [Mem(MB): 270000/...]

hpc-login2 ~]$ sinfo -e -o "..."
# There is an alias for this command:
hpc-login2 ~]$ ver_recursos
NODELIST        ...
hpc-gpu[1-2]    ...
hpc-gpu3        ...
hpc-gpu4        ...
hpc-gpu[5-6]    ...
hpc-node[1-2]   ...
hpc-node[3-9]   ...
# To see current resource usage:
hpc-login2 ~]$ sinfo -N -r -O ...
# There is an alias for this command:
hpc-login2 ~]$ ver_uso
NODELIST        ...
hpc-fat1        ...
hpc-gpu1        ...
hpc-gpu2        ...
hpc-gpu3        ...
hpc-gpu4        ...
hpc-gpu5        ...
hpc-gpu6        ...
hpc-node1       ...
hpc-node2       ...
...
</code>
==== Nodes ====
A node is SLURM's computing unit and corresponds to a physical server.
<code bash>
# Show node information:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18
...
</code>
==== Partitions ====
Partitions in SLURM are logical groups of nodes. In the cluster there is a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.
<code bash>
# Show partition information:
hpc-login2 ~]$ sinfo
defaultPartition*   ...
# When ctgpgpu7 and 8 are incorporated into the cluster, they will appear as nodes hpc-gpu1 and 2 respectively.
</code>
==== Jobs ====
Jobs in SLURM are allocations of resources to a user for a given amount of time. \\
A job (JOB) consists of one or more steps (STEPS), each consisting of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that is executed sequentially within a JOB.
==== Queue system ====
The queue to which each job is submitted defines its priority, its limits, and also the "relative cost" of the resources it consumes.
<code bash>
# Show queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the most relevant information:
hpc-login2 ~]$ show_queues
      Name   Priority   ...   MaxTRES      MaxWall   MaxTRESPU   MaxJobsPU   MaxSubmitPU
---------- ---------- ----- --------- ------------ ----------- ----------- -------------
       ...
interacti+        ...
    urgent        ...
      long        100        gres/...
       ...
</code>
# Priority: is the relative priority of each queue. \\
# DenyOnLimit: the job is not executed if it does not comply with the limits of the queue. \\
# UsageFactor: the relative cost for the user of running a job in this queue. \\
# MaxTRES: limits applied to each job. \\
# MaxWall: maximum time the job can run. \\
# MaxTRESPU: global limits per user. \\
# MaxJobsPU: maximum number of jobs that a user can have running at the same time. \\
# MaxSubmitPU: maximum number of jobs that a user can have submitted (running and pending) at the same time. \\
==== Submitting a job to the queue system ====
== Resource specification ==
By default, if a job is submitted without specifying anything, it is allocated 1 node with 1 CPU and the default amount of memory per CPU. \\
This is very inefficient; ideally, at least these three parameters should always be specified when submitting a job:
  - %%The number of nodes (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
  - %%The memory (--mem) per node or the memory per CPU (--mem-per-cpu).%%
  - %%The estimated execution time of the job (--time).%%
In addition, it may be interesting to add the following parameters (a combined example follows the table):
| -J | %%--job-name%% | Name for the job. |
| -q | %%--qos%% | Queue (QOS) to which the job is submitted. |
| -o | %%--output%% | File or file name pattern to which the job output is redirected. |
| -C | %%--constraint%% | To select nodes by their features. |
| | %%--exclusive%% | To request exclusive use of the node(s). |
| -w | %%--nodelist%% | List of specific nodes on which to run the job. |
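Putting the recommended and optional parameters together, a submission could look like this sketch; the queue name, resource amounts and script name are illustrative.
<code bash>
hpc-login2 ~]$ sbatch -J my_analysis -q urgent -o my_analysis_%j.log \
               -N 1 -n 4 -c 1 --mem=8G --time=02:00:00 my_script.sh
</code>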
== How resources are allocated ==
The default allocation method across nodes is block allocation (all available cores on one node are allocated before using another node). The default allocation method within each node is cyclic allocation (the required cores are distributed evenly among the sockets of the node).
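If a different distribution is needed, SLURM's generic %%-m / --distribution%% option can be used; this is standard SLURM behaviour rather than anything specific to this cluster, and the values below are just one possibility.
<code bash>
## Distribute tasks cyclically across the nodes and in blocks within each node:
hpc-login2 ~]$ sbatch -N2 -n8 --distribution=cyclic:block my_script.sh
</code>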
== Priority calculation ==
When a job is submitted to the queue system, the first thing that happens is a check of whether the requested resources are available. \\
If resources are available, the job is executed immediately; if not, it is placed in the queue, where its position depends on its priority. \\
The fair share is a dynamic calculation that SLURM performs based on the resources that each user has consumed recently: the more resources consumed, the lower the priority of the user's new jobs. \\
<code bash>
hpc-login2 ~]$ sshare -l
      Account       User   RawShares   NormShares     RawUsage    NormUsage    FairShare
------------- ---------- ----------- ------------ ------------ ------------ ------------
          ...
    user_name        ...
</code>
# RawShares: the amount of resources, in absolute terms, that is allocated to the user. \\
# NormShares: the above amount normalized to the total allocated resources. \\
# RawUsage: the amount of resources consumed by the user's jobs. \\
# NormUsage: the above amount normalized to the total resources consumed. \\
# FairShare: the FairShare factor, between 0 and 1. The more the cluster is used, the closer it gets to 0 and the lower the priority. \\
== Submitting jobs ==
  - sbatch
  - salloc
  - srun

1. SBATCH \\
Used to submit a script to the queue system. It is batch-processed and non-blocking.
<code bash>
# Create the script:
hpc-login2 ~]$ vim example_job.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --qos=urgent
#SBATCH --output=test_%j.log
echo "Hello World!"

# Submit it to the queue:
hpc-login2 ~]$ sbatch example_job.sh
</code>
2. SALLOC \\
Used to obtain an immediate allocation of resources (nodes). As soon as it is obtained, the specified command (or a shell, if none is given) is executed.
<code bash>
# Obtain 5 nodes and launch a job:
hpc-login2 ~]$ salloc -N5 myprogram
# Obtain interactive access to a node (press Ctrl+D to end the session):
hpc-login2 ~]$ salloc -N1
# Obtain exclusive interactive access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
</code>
3. SRUN \\
Used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
<code bash>
# Launch hostname on 2 nodes:
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
</code>
==== Using nodes with GPU ====
To specifically request GPUs for a job, one of these parameters must be added to the sbatch/srun commands:
| %%--gres%% | GPUs requested per NODE. |
| %%--gpus%% | GPUs requested per JOB. |
There are also the options %% --gpus-per-socket, --gpus-per-node and --gpus-per-task %%. \\
Examples:
<code bash>
## View the list of nodes and GPUs:
hpc-login2 ~]$ show_resources
## To request any 2 GPUs for a JOB, add:
--gpus=2
## To request 1 GPU per node, add:
--gres=gpu:1
</code>
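A minimal sketch of a GPU job script; the GPU count, memory, time limit and the program being run are illustrative values.
<code bash>
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=gpu_test_%j.log
# Show the GPU assigned to the job (replace with the real program):
nvidia-smi
</code>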
==== Monitoring jobs ====
<code bash>
## List all jobs in the queue:
hpc-login2 ~]$ squeue
## List the jobs of a user:
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <JOBID>
## List recent jobs:
hpc-login2 ~]$ sacct -b
## Detailed historical information for a job:
hpc-login2 ~]$ sacct -l -j <JOBID>
## Debug information of a job for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
## View resource usage of a running job:
hpc-login2 ~]$ sstat <JOBID>
</code>
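The sacct output can also be tailored with the standard %%--format%% option, for example to check how long a finished job ran and how much memory it actually used; the field list is just an example.
<code bash>
## Elapsed time and maximum memory used by each step of a job:
hpc-login2 ~]$ sacct -j <JOBID> --format=JobID,JobName,State,Elapsed,MaxRSS
</code>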
==== Controlling job output ====
== Exit codes ==
By default, these are the exit codes of the commands:
^ SLURM command ^ Exit code ^
| salloc | 0 on success, 1 if the user's command cannot be executed |
| srun | The highest among all tasks executed, or 253 for an out-of-memory error |
| sbatch | 0 on success; on failure, it returns the corresponding error code |
== STDIN, STDOUT and STDERR ==
**SRUN:**\\
By default, stdout and stderr are redirected from all TASKS to the stdout and stderr of srun. This behavior can be changed with these options:
| %%-i, --input=<option>%% |
| %%-o, --output=<option>%% |
| %%-e, --error=<option>%% |
And the options are:
  * //all//: the default. stdout and stderr are redirected from all tasks to srun.
  * //none//: stdout and stderr are not redirected.
  * //taskid//: stdout and stderr are redirected only from the task with the specified ID.
  * //filename//: stdout and stderr from all tasks are redirected to the specified file.
  * //filename pattern//: as //filename//, but with a pattern that generates one file per task.
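For example, using a filename pattern each task can write to its own file; %%%t%% expands to the task identifier (standard srun behaviour), and the file name is only an example.
<code bash>
## Each of the 4 tasks writes its output to out_0.txt ... out_3.txt:
hpc-login2 ~]$ srun -n4 -o out_%t.txt hostname
</code>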
**SBATCH:**\\
By default, "/dev/null" is connected to the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with these options:
| %%-i, --input=<filename_pattern>%% |
| %%-o, --output=<filename_pattern>%% |
| %%-e, --error=<filename_pattern>%% |
The reference for the filename patterns can be found [[ https:// | here ]].
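A typical pattern is to separate stdout and stderr into per-job files using %%%j%% (the job ID); the file names are only examples.
<code bash>
#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err
</code>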
==== Sending email notifications ====
Jobs can be configured to send emails under certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
| %%--mail-type=<type>%% | Possible values: BEGIN, END, FAIL, REQUEUE, ALL. |
| %%--mail-user=<email_address>%% | The destination email address. |
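For example, to receive an email when the job finishes or fails (the address is a placeholder):
<code bash>
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<email_address>
</code>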
==== Job states in the queue system ====
<code bash>
hpc-login2 ~]# squeue -l
JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
 6547 defaultPa      ...      ...      ...        ...        ...    ...   ...

## Check the queue usage status of the cluster:
hpc-login2 ~]$ queue_status.sh
JOBS PER USER:
--------------
       ...
JOBS PER QOS:
--------------
       ...
        long:  1
JOBS PER STATE:
--------------
       ...
==========================================
Total JOBS in cluster:  ...
</code>
The most common job states:
  * R RUNNING Job currently has an allocation.
  * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
  * PD PENDING Job is awaiting resource allocation.
[[ https:// | Here ]] you can see the full list of possible job states.
If a job is not running, the last column of squeue (NODELIST(REASON)) shows the reason why. \\