Differences

This shows you the differences between two versions of the page.

--- en:centro:servizos:hpc [2022/07/01 12:46] – fernando.guillen
+++ en:centro:servizos:hpc [2024/10/08 09:55] (current) – [CONDA] jorge.suarez
@@ Line 18: / Line 18: @@
 |  hpc-node[3-9]             |  Dell R740   |  2 x Intel Xeon Gold 5220R @2,2 GHz (24c)       |  192 GB   |  -                           |
 |  hpc-fat1                  |  Dell R840   |  4 x Xeon Gold 6248 @ 2.50GHz (20c)             |  1 TB     |  -                           |
-|  <del>hpc-gpu1</del>*  |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
+|  hpc-gpu[1-2]  |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
-|  hpc-gpu2  |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
 |  hpc-gpu3                  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)              |  256 GB   |  2x Nvidia Ampere A100 40GB  |
 |  hpc-gpu4                  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)              |  256 GB   |  1x Nvidia Ampere A100 80GB  |
-* Now ctgpgpu8. It will be integrated in the cluster soon.
-===== Accessing the system =====
+===== Accessing the cluster =====
 Access to the cluster must be requested in advance via the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will receive an "incorrect password" message.
-The access is done through an SSH connection to the login node:
+Access is through an SSH connection to the login node (172.16.242.211):
 <code bash>
 ssh <username>@hpc-login2.inv.usc.es
@@ Line 114: / Line 113: @@
   * Python 3.6.8
   * Perl 5.26.3
+On GPU nodes, additionally:
+  * NVIDIA driver 510.47.03
+  * CUDA 11.6
+  * libcudnn 8.7
 To use software that is not installed on the system, or a different version of installed software, there are three options:
   - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available).
@@ Line 143: / Line 145: @@
 === uDocker ===
 [[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker manual ]] \\
-uDocker is installed as a module, so it needs to be loaded into the environment:
+udocker is installed as a module, so it needs to be loaded into the environment:
 <code bash>
 ml uDocker
@@ Line 158: / Line 160: @@
 <code bash>
 # Getting miniconda
-wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh
+wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
 # Install
-sh Miniconda3-py39_4.11.0-Linux-x86_64.sh
+bash Miniconda3-latest-Linux-x86_64.sh
+# Initialize conda for the bash shell
+~/miniconda3/bin/conda init bash
 </code>
@@ Line 168: / Line 172: @@
 == Available resources ==
 <code bash>
+hpc-login2 ~]# ver_estado.sh
+=============================================================================================================
+  NODO     ESTADO                        CORES EN USO                           USO MEM     GPUS(Uso/Total)
+=============================================================================================================
+ hpc-fat1    up   0%[--------------------------------------------------]( 0/80) RAM:  0%     ---
+ hpc-gpu1    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
+ hpc-gpu2    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
+ hpc-gpu3    up   0%[--------------------------------------------------]( 0/64) RAM:  0%   A100_40 (0/2)
+ hpc-gpu4    up   1%[|-------------------------------------------------]( 1/64) RAM: 35%   A100_80 (1/1)
+ hpc-node1   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
+ hpc-node2   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
+ hpc-node3   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node4   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node5   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node6   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node7   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node8   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+ hpc-node9   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
+=============================================================================================================
+TOTALES: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]
 hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
 # There is an alias for that command:
@@ Line 238: / Line 262: @@
 # There is an alias that shows only the relevant info:
 hpc-login2 ~]$ ver_colas
-      Name   Priority           Flags UsageFactor                     MaxTRES     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU
+      Name    Priority                                  MaxTRES     MaxWall            MaxTRESPU MaxJobsPU MaxSubmitPU
----------- ---------- --------------- ----------- --------------------------- ----------- ------------- --------- -----------
+----------  ---------- ---------------------------------------- ----------- -------------------- --------- -----------
-   regular        100     DenyOnLimit    1.000000   cpu=200,gres/gpu=1,node=4  4-04:00:00                      10          50
+   regular         100                cpu=200,gres/gpu=1,node=4  4-04:00:00       cpu=200,node=4        10          50
-interactive       200     DenyOnLimit    1.000000                      node=1    04:00:00        node=1         1           1
+interactive        200                                   node=1    04:00:00               node=1         1           1
-    urgent        300     DenyOnLimit    2.000000           gres/gpu=1,node=1    04:00:00        cpu=36         5          15
+    urgent         300                        gres/gpu=1,node=1    04:00:00               cpu=36         5          15
-      long        100     DenyOnLimit    1.000000           gres/gpu=1,node=4  8-08:00:00
+      long         100                        gres/gpu=1,node=4  8-04:00:00                              1           5
-     large        100     DenyOnLimit    1.000000          cpu=200,gres/gpu=2  4-04:00:00                      10          25
+     large         100                       cpu=200,gres/gpu=2  4-04:00:00                              2          10
-     admin        500                    0.000000
+     admin         500
+     small         150                             cpu=6,node=2    04:00:00              cpu=400        40         100
 </code>
 # Priority: is the relative priority of each queue. \\
@@ Line 258: / Line 283: @@
 ==== Sending a job to the queue system ====
 == Requesting resources ==
-By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it a node, a CPU and all available memory. The time limit for job execution is that of the queue (4 days and 4 hours).
+By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it one node, one CPU and 4 GB of memory. The time limit for job execution is that of the queue (4 days and 4 hours).
 This is very inefficient; ideally, specify at least the following three parameters when submitting jobs:
   -  %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
@@ Line 295: / Line 320: @@
 == Job submission ==
+  - sbatch
   - salloc
   - srun
-  - sbatch
-. SALLOC \\
+. SBATCH \\
-It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.
-<code bash>
-# Get 5 nodes and launch a job.
-hpc-login2 ~]$ salloc -N5 myprogram
-# Get interactive access to a node (Press Ctrl+D to exit):
-hpc-login2 ~]$ salloc -N1
-</code>
-. SRUN \\
-It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
-<code bash>
-# Launch the hostname command on 2 nodes
-hpc-login2 ~]$ srun -N2 hostname
-hpc-node1
-hpc-node2
-</code>
-. SBATCH \\
 Used to send a script to the queuing system. It is batch-processing and non-blocking.
 <code bash>
@@ Line 334: / Line 343: @@
 hpc-login2 ~]$ sbatch test_job.sh
 </code>
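As a minimal sketch of the advice under "Requesting resources", a batch script might set nodes, tasks and CPUs per task explicitly instead of relying on the defaults. The job name, the memory and time values, and the choice of --mem and --time as the remaining parameters are assumptions for illustration, not values taken from this page. The fallbacks in the echo line exist only so the script can also be checked outside Slurm.

```bash
#!/bin/bash
#SBATCH --job-name=example_job      # hypothetical job name
#SBATCH --nodes=1                   # -N: number of nodes
#SBATCH --ntasks=1                  # -n: number of tasks
#SBATCH --cpus-per-task=4           # -c: CPUs per task
#SBATCH --mem=8G                    # memory per node (assumed value)
#SBATCH --time=01:00:00             # wall-time limit (assumed value)
#SBATCH --qos=regular               # QOS name as listed by ver_colas

# Slurm exports SLURM_JOB_ID and SLURM_CPUS_PER_TASK inside a job;
# the defaults below are used when the script runs outside Slurm.
describe_job() {
    echo "job ${SLURM_JOB_ID:-local} on $(uname -n) with ${SLURM_CPUS_PER_TASK:-1} CPU(s)"
}

describe_job
```

Saved as a file, the script would be submitted with sbatch in the same way as test_job.sh above.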
+. SALLOC \\
+It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.
+<code bash>
+# Get 5 nodes and launch a job.
+hpc-login2 ~]$ salloc -N5 myprogram
+# Get interactive access to a node (Press Ctrl+D to exit):
+hpc-login2 ~]$ salloc -N1
+# Get interactive EXCLUSIVE access to a node
+hpc-login2 ~]$ salloc -N1 --exclusive
+</code>
+. SRUN \\
+It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
+<code bash>
+# Launch the hostname command on 2 nodes
+hpc-login2 ~]$ srun -N2 hostname
+hpc-node1
+hpc-node2
+</code>
 ==== GPU use ====
@@ Line 372: / Line 400: @@
 By default, these are the exit codes of the commands:
 ^  SLURM command  ^  Exit code  ^
-|  salloc  |  0 en caso de éxito, 1 si no se puedo ejecutar el comando del usuario  |
+|  salloc  |  0 on success, 1 if the user's command could not be executed  |
-|  srun  |  El más alto de entre todas las tareas ejecutadas o 253 para un error out-of-mem  |
+|  srun  |  The highest exit code among all executed tasks, or 253 for an out-of-memory error  |
-|  sbatch  |  0 en caso de éxito, si no, el código de salida correspondiente del proceso que falló  |
+|  sbatch  |  0 on success; otherwise, the exit code of the process that failed  |
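The exit codes in the table can be checked from a script through the shell's $? variable. The sketch below uses a hypothetical helper, run_and_report, and stand-in commands (true and sh -c 'exit 3') so that it runs without Slurm. Inside a job one would wrap the real command instead, for example an srun call.

```bash
#!/bin/bash
# run_and_report is a hypothetical helper, not part of Slurm.
# It runs the given command, prints a one-line summary and
# returns the command's own exit code to the caller.
run_and_report() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "failed with exit code $status"
    fi
    return "$status"
}

# Stand-in commands; in a job this could be: run_and_report srun ./myprogram
run_and_report true
run_and_report sh -c 'exit 3' || true
```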
 == STDIN, STDOUT and STDERR ==
 **SRUN:**\\
-Por defecto stdout y stderr se redirigen de todos los TASKS a el stdout y stderr de srun, y stdin se redirecciona desde el stdin de srun a todas las TASKS. Esto se puede cambiar con:
+By default stdout and stderr are redirected from all TASKS to srun's stdout and stderr, and stdin is redirected from srun's stdin to all TASKS. This can be changed with:
-|  %%-i, --input=<opcion>%%    |
+|  %%-i, --input=<option>%%    |
-|  %%-o, --output=<opcion>%%   |
+|  %%-o, --output=<option>%%   |
-|  %%-e, --error=<opcion>%%   |
+|  %%-e, --error=<option>%%   |
-Y las opciones son:
+The options are:
-  * //all//: opción por defecto.
+  * //all//: the default option.
-  * //none//: No se redirecciona nada.
+  * //none//: Nothing is redirected.
-  * //taskid//: Solo se redirecciona desde y/o al TASK id especificado.
+  * //taskid//: Redirects only to and/or from the specified TASK id.
-  * //filename//: Se redirecciona todo desde y/o al fichero especificado.
+  * //filename//: Redirects everything to and/or from the specified file.
-  * //filename pattern//: Igual que filename pero con un fichero definido por un [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | patrón ]]
+  * //filename pattern//: Same as the filename option but with a file defined by a [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | pattern ]].
 **SBATCH:**\\
-Por defecto "/dev/null" está abierto en el stdin del script y stdout y stderror se redirigen a un fichero de nombre "slurm-%j.out". Esto se puede cambiar con:
+By default, "/dev/null" is open on the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with:
 |  %%-i, --input=<filename_pattern>%%  |
 |  %%-o, --output=<filename_pattern>%%  |
 |  %%-e, --error=<filename_pattern>%%  |
-La referencia de filename_pattern está [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | aquí ]].
+The filename_pattern reference is [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | here ]].
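A short sketch of how the sbatch redirection options might be used to split stdout and stderr into separate files. The job name and the file naming scheme are assumptions for illustration. In a filename pattern, %x expands to the job name and %j to the job ID.

```bash
#!/bin/bash
#SBATCH --job-name=io_demo          # hypothetical job name
#SBATCH --output=%x_%j.out          # stdout goes to io_demo_<jobid>.out
#SBATCH --error=%x_%j.err           # stderr goes to io_demo_<jobid>.err

# Writes one line to each stream so the split is visible in the two files.
report() {
    echo "normal progress message"
    echo "warning: something odd" >&2
}

report
```

Run outside Slurm, the #SBATCH lines are plain comments and both messages appear on the terminal.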
-==== Envío de correos ====
+==== Sending mail ====
-Se pueden configurar los JOBS para que envíen correos en determinadas circunstancias usando estos dos parámetros (**SON NECESARIOS AMBOS**):
+JOBS can be configured to send mail in certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
-|  %%--mail-type=<type>%%  |  Opciones: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  |
+|  %%--mail-type=<type>%%  |  Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  |
-|  %%--mail-user=<user>%%  |  La dirección de correo de destino.  |
+|  %%--mail-user=<user>%%  |  The destination email address.  |
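For illustration, a script header that sets both parameters together could look like the fragment below. The job name and the address are placeholders, and combining several mail types with commas is shown as one possible choice.

```bash
#!/bin/bash
#SBATCH --job-name=mail_demo               # hypothetical job name
#SBATCH --mail-type=END,FAIL               # send mail when the job ends or fails
#SBATCH --mail-user=user@example.org       # placeholder destination address

echo "job body goes here"
```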
-==== Estados de los trabajos en el sistema de colas ====
+==== Status of Jobs in the queuing system ====
 <code bash>
 hpc-login2 ~]# squeue -l
 JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
   defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1
+## Check status of queue use:
+hpc-login2 ~]$ estado_colas.sh
+JOBS PER USER:
+--------------
+       usuario.uno:  3
+       usuario.dos:  1
+JOBS PER QOS:
+--------------
+             regular:  3
+                long:  1
+JOBS PER STATE:
+--------------
+             RUNNING:  3
+             PENDING:  1
+==========================================
+Total JOBS in cluster:  4
 </code>
-Estados (STATE) más comunes de un trabajo:
+Common job states:
   * R RUNNING Job currently has an allocation.
   * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
@@ Line 415: / Line 462: @@
   * PD PENDING Job is awaiting resource allocation.
-[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Lista completa de posibles estados de un trabajo ]].\\
+[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Full list of possible job statuses ]].\\
-Si un trabajo no está en ejecución aparecerá una razón debajo de REASON:[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | Lista de las razones ]] por las que un trabajo puede estar esperando su ejecución.
+If a job is not running, a reason is displayed under REASON. See the [[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | list of reasons ]] why a job may be awaiting execution.
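To see only your own pending jobs together with the reason they are waiting, squeue can be filtered by user and state. This is a sketch: the helper name pending_jobs and the column widths are assumptions, and the guard only exists so the snippet also runs on a machine without Slurm.

```bash
#!/bin/bash
# pending_jobs is a hypothetical helper name.
# It lists the current user's pending jobs and why they are waiting.
pending_jobs() {
    if command -v squeue >/dev/null 2>&1; then
        # %i job ID, %P partition, %j job name, %T state, %r reason
        squeue -u "$(id -un)" -t PENDING -o "%.10i %.9P %.20j %.8T %r"
    else
        echo "squeue not available on this host"
    fi
}

pending_jobs
```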