//en:centro:servizos:hpc — revision 2025/12/05 10:04 (current) by fernando.guillen//
====== High-Performance Computing Cluster (HPC) ctcomp3 ======
[[ https://web.microsoftstream.com/video/f5eba154-b597-4440-9307-3befd7597d78 | Video of the service presentation (3/7/22) ]]
===== Description =====
  
The computing part of the cluster comprises:
  *  9 servers for general computation.
  *  1 "fat node" for memory-intensive jobs.
  *  6 servers for GPU computing.
    
Users only have direct access to the login node, which has limited specifications and should not be used for computing. \\
All nodes are interconnected by a 10Gb network. \\
There is 220 TB of distributed storage, accessible from all nodes and connected through a dual 25Gb fiber network. \\
\\
^  Name           ^  Model       ^  Processor                                      ^  Memory  ^  GPU                          ^
|  hpc-login2     |  Dell R440   |  1 x Intel Xeon Silver 4208 CPU @ 2.10GHz (8c)  |  16 GB   |  -                            |
|  hpc-node[1-2]  |  Dell R740   |  2 x Intel Xeon Gold 5220 @2.2 GHz (18c)        |  192 GB  |  -                            |
|  hpc-node[3-9]  |  Dell R740   |  2 x Intel Xeon Gold 5220R @2.2 GHz (24c)       |  192 GB  |  -                            |
|  hpc-fat1       |  Dell R840   |  4 x Xeon Gold 6248 @ 2.50GHz (20c)             |  1 TB    |  -                            |
|  hpc-gpu[1-2]   |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB  |  2x Nvidia Tesla V100S 32GB   |
|  hpc-gpu3       |  Dell R7525  |  2 x AMD EPYC 7543 @2.80 GHz (32c)              |  256 GB  |  2x Nvidia Ampere A100 40GB   |
|  hpc-gpu4       |  Dell R7525  |  2 x AMD EPYC 7543 @2.80 GHz (32c)              |  256 GB  |  1x Nvidia Ampere A100 80GB   |
|  hpc-gpu5       |  Dell R7725  |  2 x AMD EPYC 9255 @3.25 GHz (24c)              |  384 GB  |  2x Nvidia L4 24GB            |
|  hpc-gpu6       |  Dell R7725  |  2 x AMD EPYC 9255 @3.25 GHz (24c)              |  384 GB  |  2x Nvidia L4 24GB            |
  
===== Connection to the system =====
To access the cluster, you must request access in advance through the [[https://citius.usc.es/uxitic/incidencias/add|issue report form]]. Users without access permission will receive a "wrong password" message.
  
Access is via SSH to the login node (172.16.242.211):
<code bash>
ssh <username>@hpc-login2.inv.usc.es
</code>
  
===== Storage, directories, and file systems =====
<note warning> No backup is made of any of the cluster's file systems!!</note>
Users' HOME on the cluster is on the shared file system, so it is accessible from all nodes. The path is defined in the environment variable %%$HOME%%. \\
Each node has a local scratch partition of 1 TB, which is deleted when each job finishes. It can be accessed in scripts through the environment variable %%$LOCAL_SCRATCH%%. \\
For data that must be shared by groups of users, a folder in the shared storage can be requested; it will only be accessible to group members.\\
^  Directory      ^  Variable             ^  Mount point                  ^  Capacity  ^
|  Home           |  %%$HOME%%            |  /mnt/beegfs/home/<username>  |  220 TB*   |
|  Local Scratch  |  %%$LOCAL_SCRATCH%%   |  varies                       |  1 TB      |
|  Group Folder   |  %%$GRUPOS/<name>%%   |  /mnt/beegfs/groups/<name>    |  220 TB*   |
%%* storage is shared%%
=== IMPORTANT NOTICE ===
The shared file system performs poorly when working with many small files. To improve performance in such scenarios, create a file system inside an image file and mount it to work directly on it. The procedure is as follows:
  * Create the image file in your home:
<code bash>
truncate example.ext4 -s 20G
</code>
  * Create a file system in the image file:
<code bash>
## mkfs.ext4 -T small -m 0 image.name
## -T small optimized options for small files
## -m 0 do not reserve space for root
mkfs.ext4 -T small -m 0 example.ext4
</code>
  * Mount the image (using SUDO) with the script //mount_image.py// :
<code bash>
## By default, it is mounted at /mnt/images/<username>/ in read-only mode.
sudo mount_image.py example.ext4
</code>
<note>
The file can only be mounted from a single node in readwrite mode, but can be mounted from any number of nodes in readonly mode.
</note>
The mount script has the following options:
<code>
--mount-point path   <-- (optional) Creates subdirectories under /mnt/images/<username>/<path>
--rw                 <-- (optional) By default the image is mounted readonly; with this option it is mounted readwrite.
</code>
The unmount script has the following options:
<code>
It only accepts, as an optional parameter, the same path used for mounting with the option
--mount-point  <-- (optional)
</code>
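Putting the steps above together, a full session might look like the sketch below. The image name is illustrative, and the name of the unmount script is an assumption, since it is not given above:
<code bash>
# Create a 20 GB image in the home directory and format it for small files
truncate example.ext4 -s 20G
mkfs.ext4 -T small -m 0 example.ext4
# Mount it readwrite (readwrite mounts are limited to a single node)
sudo mount_image.py --rw example.ext4
# ... work under /mnt/images/<username>/ ...
# Unmount when finished (script name assumed)
sudo umount_image.py
</code>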
===== File and data transfer =====
=== SCP ===
From your local machine to the cluster:
<code bash>
scp filename <username>@<hostname>:/<path>
</code>
[[https://man7.org/linux/man-pages/man1/scp.1.html | SCP manual page]]
=== SFTP ===
To transfer multiple files or to navigate the file system.
<code bash>
<hostname>:~$ sftp <user_name>@hpc-login2
sftp> quit
</code>
[[https://www.unix.com/man-page/redhat/1/sftp/ | SFTP manual page]]
=== RSYNC ===
[[ https://rsync.samba.org/documentation.html | RSYNC documentation ]]
=== SSHFS ===
Requires installation of the sshfs package.\\
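For example, a typical rsync invocation to copy a local folder to your cluster home (the flags and paths are illustrative, not a prescribed usage) could be:
<code bash>
# -a preserves permissions and times, -v is verbose, -z compresses in transit
rsync -avz ./results/ <username>@hpc-login2.inv.usc.es:~/results/
</code>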
Allows, for example, mounting the user's home directory from hpc-login2 on the local machine:
<code bash>
## Mount
fusermount -u <mount_point>
</code>
[[https://linux.die.net/man/1/sshfs | SSHFS manual page]]
  
===== Available Software =====
All nodes have the basic software installed by default with AlmaLinux 8.4, in particular:
  * GCC 8.5.0
  * Python 3.6.8
  * Perl 5.26.3
On the GPU nodes, additionally:
  * nVidia Driver 560.35.03
  * CUDA 11.6
  * libcudnn 8.7
To use any other software not installed on the system, or another version of it, there are three options:
  - Use Modules with the modules already installed (or request the installation of a new module if it is not available)
  - Use a container (uDocker or Apptainer/Singularity)
  - Use Conda
A module is the simplest solution for using software without modifications or hard-to-satisfy dependencies.\\
A container is ideal when the dependencies are complicated and/or the software is highly customized. It is also the best solution for reproducibility, ease of distribution, and teamwork.\\
Conda is the best solution if you need the latest version of a library or program, or packages not otherwise available.\\
  
  
<code bash>
# View loaded modules in your environment:
module list
# ml can be used as an abbreviation for the module command:
ml avail
# To get information about a module:
ml spider <module_name>
</code>
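As a sketch of a typical session (the module name is illustrative; check ml avail for the modules actually installed):
<code bash>
# Load a module into the environment
ml load gcc
# Unload it
ml unload gcc
# Unload all loaded modules
ml purge
</code>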
  
==== Running software containers ====
=== uDocker ===
[[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker Manual]] \\
  
=== Apptainer/Singularity ===
[[ https://apptainer.org/docs/user/1.4/ | Apptainer documentation ]] \\
Apptainer is installed locally on each node, so nothing needs to be done to use it.
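A minimal sketch of pulling and running a container (the image used here is illustrative):
<code bash>
# Pull an image from Docker Hub (creates almalinux_8.sif)
apptainer pull docker://almalinux:8
# Run a command inside the container
apptainer exec almalinux_8.sif cat /etc/redhat-release
</code>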
  
==== CONDA ====
[[ https://docs.conda.io/en/latest/miniconda.html | Conda documentation ]] \\
Miniconda is the minimal version of Anaconda; it includes only the conda environment manager, Python, and a few essential packages. From there, each user downloads and installs only the packages they need.
<code bash>
# Get miniconda
</code>
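Once miniconda is installed, a typical workflow might look like this sketch (the environment and package names are illustrative):
<code bash>
# Create an environment with a specific Python version
conda create -n myenv python=3.11
# Activate it, install packages, and deactivate when done
conda activate myenv
conda install numpy
conda deactivate
</code>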
===== Using SLURM =====
The workload manager in the cluster is [[ https://slurm.schedmd.com/documentation.html | SLURM ]]. \\
<note tip>The term CPU refers to a physical core of a socket. Hyperthreading is disabled, so each node has as many CPUs as (number of sockets) * (number of physical cores per socket).</note>
== Available Resources ==
<code bash>
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
  NODE     STATUS                        CORES IN USE                           MEMORY USE     GPUS(Use/Total)
=============================================================================================================
 hpc-fat1    up   0%[--------------------------------------------------]( 0/80) RAM:  0%     ---
 hpc-gpu3    up   0%[--------------------------------------------------]( 0/64) RAM:  0%   A100_40 (0/2)
 hpc-gpu4    up   1%[|-------------------------------------------------]( 1/64) RAM: 35%   A100_80 (1/1)
 hpc-gpu5    up   0%[--------------------------------------------------]( 0/48) RAM:  0%   L4 (0/2)
 hpc-gpu6    up   0%[--------------------------------------------------]( 0/48) RAM:  0%   L4 (0/2)
 hpc-node1   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 hpc-node2   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 hpc-node9   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
=============================================================================================================
TOTALS: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]
  
hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
# There is an alias for this command:
hpc-login2 ~]$ ver_recursos
NODELIST                        CPUS                  MEMORY                AVAIL_FEATURES        GRES                           
hpc-fat1                        80                    1027273               cpu_intel             (null)                         
hpc-gpu[1-2]                    36                    187911                cpu_intel             gpu:V100S:                   
hpc-gpu3                        64                    253282                cpu_amd               gpu:A100_40:                 
hpc-gpu4                        64                    253282                cpu_amd               gpu:A100_80:1(S:0)
hpc-gpu[5-6]                    48                    375484                cpu_amd               gpu:L4:2(S:1)          
hpc-node[1-2]                   36                    187645                cpu_intel             (null)                         
hpc-node[3-9]                   48                    187645                cpu_intel             (null)
  
# To see current resource usage: (CPUS (Allocated/Idle/Other/Total))
hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed
# There is an alias for this command:
hpc-login2 ~]$ ver_uso
NODELIST            CPUS(A/I/O/T)       MEMORY              FREE_MEM            GRES                GRES_USED
hpc-fat1            80/0/0/80           1027273             900850              (null)              gpu:0,mps:0
hpc-gpu1            16/20/0/36          187911              181851              gpu:V100S:2(S:0-1)  gpu:V100S:2(IDX:0-1)
hpc-gpu2            4/32/0/36           187911              183657              gpu:V100S:2(S:0-1)  gpu:V100S:1(IDX:0),m
hpc-gpu3            2/62/0/64           253282              226026              gpu:A100_40:      gpu:A100_40:2(IDX:0-
hpc-gpu4            1/63/0/64           253282              244994              gpu:A100_80:1(S:0)  gpu:A100_80:1(IDX:0)
hpc-gpu5            8/40/0/48           375484              380850              gpu:L4:2(S:1)       gpu:L4:1(IDX:1),mps:
hpc-gpu6            0/48/0/48           375484              380969              gpu:L4:2(S:1)       gpu:L4:0(IDX:N/A),mp
hpc-node1           36/0/0/36           187645              121401              (null)              gpu:0,mps:0
hpc-node2           36/0/0/36           187645              130012              (null)              gpu:0,mps:0
</code>
==== Nodes ====
A node is the computing unit of SLURM and corresponds to a physical server.
<code bash>
# Show information about a node:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18 
</code>
==== Partitions ====
Partitions in SLURM are logical groups of nodes. The cluster has a single partition to which all nodes belong, so there is no need to specify it when submitting jobs.
<code bash>
# Show partition information:
hpc-login2 ~]$ sinfo
defaultPartition*    up   infinite     11   idle hpc-fat1,hpc-gpu[3-4],hpc-node[1-9]
# When ctgpgpu7 and 8 are incorporated into the cluster, they will appear as nodes hpc-gpu1 and hpc-gpu2 respectively.
</code>
==== Jobs ====
Jobs in SLURM are allocations of resources to a user for a specified time. Jobs are identified by a sequential number or JOBID. \\
A job (JOB) consists of one or more steps (STEPS), each consisting of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that runs sequentially in a JOB and one TASK for each program that runs in parallel. Therefore, in the simplest case, such as launching a job that just executes the hostname command, the JOB has a single STEP and a single TASK.
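As an illustration of this terminology, a minimal sketch of a batch script with two STEPS, the second of which runs four TASKS in parallel:
<code bash>
#!/bin/bash
#SBATCH --ntasks=4

# STEP 1: a single TASK
srun -n 1 hostname
# STEP 2: four TASKS in parallel
srun -n 4 hostname
</code>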
  
==== Queue System (QOS) ====
The queue to which each job is sent defines the priority, limits, and also the "relative cost" for the user.
<code bash>
# Show the queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the most relevant information:
hpc-login2 ~]$ show_queues
      Name   Priority                        MaxTRES     MaxWall            MaxTRESPU MaxJobsPU MaxSubmitPU 
---------- ---------- ------------------------------ ----------- -------------------- --------- ----------- 
</code>
# Priority: the relative priority of each queue. \\
# DenyOnLimit: the job is not executed if it does not meet the queue's limits. \\
# UsageFactor: the relative cost for the user of running a job in that queue. \\
# MaxTRES: resource limits per job. \\
# MaxWall: maximum time the job can run. \\
# MaxTRESPU: global limits per user. \\
# MaxJobsPU: maximum number of jobs a user can have running. \\
# MaxSubmitPU: maximum number of jobs a user can have queued and running in total.\\
    
==== Submitting a job to the queue system ====
== Resource Specification ==
By default, if a job is submitted without specifying anything, the system sends it to the default QOS (regular) and assigns it one node, one CPU, and 4 GB of RAM. The time limit for job execution is that of the queue (4 days and 4 hours). 
This is very inefficient; ideally, you should specify at least three parameters when submitting jobs:
  -  %%The number of nodes (-N or --nodes), tasks (-n or --ntasks), and/or CPUs per task (-c or --cpus-per-task).%%
  -  %%The memory (--mem) per node or the memory per CPU (--mem-per-cpu).%%
  -  %%The estimated execution time of the job (--time)%%
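For example, a submission specifying those three parameters (the script name is illustrative) might look like:
<code bash>
# 1 node, 1 task with 4 CPUs, 8 GB of RAM per node, 2 hours maximum
sbatch -N 1 -n 1 -c 4 --mem=8G --time=02:00:00 job.sh
</code>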
  
|  -J  |  %%--job-name%%    |Name for the job. Default: name of the executable  |
|  -q  |  %%--qos%%         |Name of the queue to which the job is sent. Default: regular  |
|  -o  |  %%--output%%      |File or file pattern to which all standard and error output is redirected.  |
|      |  %%--gres%%        |Type and/or number of GPUs requested for the job.  |
|  -C  |  %%--constraint%%  |To request nodes with Intel or AMD processors (cpu_intel or cpu_amd)  |
|      |  %%--exclusive%%   |To request that the job does not share nodes with other jobs.  |
|  -w  |  %%--nodelist%%    |List of nodes on which to run the job  |
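Combining several of these parameters, a sketch of a GPU job request (the script and job names are illustrative; the GRES type names follow the GRES column shown by ver_recursos):
<code bash>
# Request one A100 40GB GPU on an AMD node, naming the job and its output file
sbatch -J myjob -o myjob_%j.log --gres=gpu:A100_40:1 -C cpu_amd job.sh
</code>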
  
== How resources are assigned ==
By default, the allocation method among nodes is block allocation (all available cores in a node are allocated before using another node). The allocation method within each node is cyclic allocation (the required cores are distributed evenly among the node's sockets).
  
== Calculating priority ==
When a job is submitted to the queue system, the requested resources are first checked against the limits set in the corresponding queue. If any limit is exceeded, the submission is canceled. \\
If resources are available, the job runs immediately; if not, it is queued. Each job has an assigned priority that determines the order in which queued jobs are executed when resources become available. The priority of each job is weighted from three factors: the time spent waiting in the queue (25%), the fixed priority of the queue (25%), and the user's fair share (50%). \\
The fair share is a dynamic calculation that SLURM makes for each user: the difference between the resources allocated to that user and the resources consumed over the last 14 days.
 <code bash> <code bash>
 hpc-login2 ~]$ sshare -l  hpc-login2 ~]$ sshare -l 
User        RawShares  NormShares  RawUsage  NormUsage  FairShare
user_name         100    0.071429        4833    0.001726    0.246436
</code>
# RawShares: the amount of resources, in absolute terms, allocated to the user. It is the same for all users.\\
# NormShares: the previous amount normalized to the total allocated resources.\\
# RawUsage: the number of cpu-seconds consumed by all of the user's jobs.\\
# NormUsage: the previous amount normalized to the total cpu-seconds consumed in the cluster.\\
# FairShare: the FairShare factor, between 0 and 1. The more the cluster is used, the closer it gets to 0 and the lower the priority.\\
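As an illustration only, the weighting described above can be sketched numerically. The 25/25/50 weights are from the text; the factor values below are made-up examples, not real scheduler output:

<code bash>
# Hypothetical normalized factors, each in [0, 1]:
age=0.80        # time spent waiting in the queue
queue=0.50      # fixed priority of the queue
fairshare=0.25  # user's fair share (cf. sshare above)

# priority = 25% age + 25% queue priority + 50% fair share
awk -v a="$age" -v q="$queue" -v f="$fairshare" \
    'BEGIN { printf "%.4f\n", 0.25*a + 0.25*q + 0.50*f }'
# prints 0.4500
</code>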
  
== Submitting jobs ==
1. SBATCH \\
Used to submit a batch script to the queue system. The script is queued and executed when resources become available.
<code bash>
# Submit a batch script:
hpc-login2 ~]$ sbatch job.sh
</code>
2. SALLOC \\
Used to obtain an immediate allocation of resources (nodes). As soon as it is granted, the specified command is executed (or a shell, by default).
<code bash>
# Obtain 5 nodes and launch a job:
hpc-login2 ~]$ salloc -N5 myprogram
# Obtain interactive access to a node (press Ctrl+D to end the session):
hpc-login2 ~]$ salloc -N1 
# Obtain exclusive interactive access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
</code>
3. SRUN \\
Used to launch a parallel job (preferable to using mpirun directly). It is interactive and blocking.
<code bash>
# Run hostname on 2 nodes:
hpc-login2 ~]$ srun -N2 hostname
</code>
  
==== Using nodes with GPU ====
To specifically request an allocation of GPUs for a job, add these options to sbatch or srun:
|  %%--gres%%  |  Request GPUs per NODE  |  %%--gres=gpu[[:type]:count],...%%  |
|  %%--gpus or -G%%  |  Request GPUs per JOB  |  %%--gpus=[type]:count,...%%  |
There are also the options %%--gpus-per-socket, --gpus-per-node and --gpus-per-task%%.\\
Examples:
<code bash>
## View the list of nodes and GPUs:
hpc-login2 ~]$ show_resources
## Request any 2 GPUs for a JOB, add:
--gpus=2
## Request one A100 of 40G on one node and one A100 of 80G on another, add:
</code>
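Putting it together, a minimal GPU batch script might look like this (the job name, time limit, and script body are hypothetical; use the type names listed by show_resources for typed requests):

<code bash>
#!/bin/bash
#SBATCH -J gpu_test          # job name (placeholder)
#SBATCH --gpus=1             # one GPU of any type for the job
#SBATCH --time=01:00:00      # time limit

# nvidia-smi shows the GPU(s) actually allocated to the job
nvidia-smi
</code>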
  
  
==== Monitoring jobs ====
<code bash>
## List all jobs in the queue:
hpc-login2 ~]$ squeue
## List the jobs of a user:
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <JOBID>
## List recent jobs:
hpc-login2 ~]$ sacct -b
## Detailed historical information about a job:
hpc-login2 ~]$ sacct -l -j <JOBID>
## Debug information for a job (for troubleshooting):
hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
## View resource usage of a running job:
hpc-login2 ~]$ sstat -j <JOBID>
  
</code>
==== Controlling job output ====
== Exit codes ==
By default, these are the exit codes of the commands:
^  SLURM command  ^  Exit code  ^
|  salloc  |  0 on success, 1 if the user's command could not be executed  |
|  srun  |  The highest exit code among all executed tasks, or 253 for an out-of-memory error  |
|  sbatch  |  0 on success; otherwise, the exit code of the failed process  |
  
== STDIN, STDOUT, and STDERR ==
**SRUN:**\\
By default, srun redirects stdin, stdout, and stderr between its own process and all the TASKS of the job. This behavior can be changed with:
|  %%-i, --input=<option>%%  |
|  %%-o, --output=<option>%%  |
|  %%-e, --error=<option>%%  |
And the options are:
  * //all//: the default option.
  * //none//: nothing is redirected.
  * //taskid//: only redirects from/to the specified TASK id.
  * //filename//: redirects everything from/to the specified file.
  * //filename pattern//: same as filename but with a file defined by a [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | pattern ]].
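For example, each task can write its output to its own file using a pattern (%%%t%% expands to the task id; the filename itself is an arbitrary choice):

<code bash>
# Each of the 4 tasks writes its stdout to its own file
# (task_0.out, task_1.out, ...):
hpc-login2 ~]$ srun -n4 --output=task_%t.out hostname
</code>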
  
**SBATCH:**\\
By default, "/dev/null" is opened on the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with:
|  %%-i, --input=<filename_pattern>%%  |
|  %%-o, --output=<filename_pattern>%%  |
|  %%-e, --error=<filename_pattern>%%  |
The filename_pattern reference can be found [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | here ]].
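For example, inside a job script (the job name is a placeholder; %%%x%% expands to the job name and %%%j%% to the job id, per the pattern reference above):

<code bash>
#SBATCH -J myjob             # job name (placeholder)
#SBATCH --output=%x_%j.out   # stdout -> myjob_<jobid>.out
#SBATCH --error=%x_%j.err    # stderr -> myjob_<jobid>.err
</code>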
  
==== Sending emails ====
Jobs can be configured to send emails under certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
|  %%--mail-type=<type>%%  |  Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  |
|  %%--mail-user=<email_address>%%  |  Destination email address.  |
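For example, to receive an email when the job ends or fails (the address is a placeholder):

<code bash>
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your_address@example.com
</code>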
  
  
==== Job states in the queue system ====
<code bash>
hpc-login2 ~]# squeue -l
6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1
  
## Check the usage status of the cluster's queues:
hpc-login2 ~]$ queue_status.sh
JOBS PER USER:
--------------
</code>
  * PD PENDING: Job is awaiting resource allocation.
    
[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Complete list of possible job states ]].\\
  
If a job is not running, a reason will be shown in the REASON column: [[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | list of reasons ]] why a job may be waiting for execution.