Differences

This shows you the differences between two versions of the page.

en:centro:servizos:hpc [2022/07/01 12:46] fernando.guillenen:centro:servizos:hpc [2024/03/12 11:12] – [CONDA] fernando.guillen
Line 18: Line 18:
 |  hpc-node[3-9]  |  Dell R740  |  2 x Intel Xeon Gold 5220R @2,2 GHz (24c)  |  192 GB  |  -  | |  hpc-node[3-9]  |  Dell R740  |  2 x Intel Xeon Gold 5220R @2,2 GHz (24c)  |  192 GB  |  -  |
 |  hpc-fat1  |  Dell R840  |  4 x Xeon Gold 6248 @ 2.50GHz (20c)  |  1 TB  |  -  | |  hpc-fat1  |  Dell R840  |  4 x Xeon Gold 6248 @ 2.50GHz (20c)  |  1 TB  |  -  |
-|  <del>hpc-gpu1</del>  |  Dell R740  |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)  |  192 GB  |  2x Nvidia Tesla V100S  | +|  hpc-gpu[1-2]  |  Dell R740  |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)  |  192 GB  |  2x Nvidia Tesla V100S  |
-|  hpc-gpu2  |  Dell R740  |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)  |  192 GB  |  2x Nvidia Tesla V100S  |+
 |  hpc-gpu3  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)  |  256 GB  |  2x Nvidia Ampere A100 40GB  | |  hpc-gpu3  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)  |  256 GB  |  2x Nvidia Ampere A100 40GB  |
 |  hpc-gpu4  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)  |  256 GB  |  1x Nvidia Ampere A100 80GB  | |  hpc-gpu4  |  Dell R7525  |  2 x AMD EPYC 7543 @2,80 GHz (32c)  |  256 GB  |  1x Nvidia Ampere A100 80GB  |
-* Now ctgpgpu8. It will be integrated in the cluster soon. + 
-===== Accessing the system =====+===== Accessing the cluster =====
 To access the cluster, access must be requested in advance via [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users who do not have access permission will receive an "incorrect password" message. To access the cluster, access must be requested in advance via [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users who do not have access permission will receive an "incorrect password" message.
  
Line 114: Line 113:
   * Python 3.6.8   * Python 3.6.8
   * Perl 5.26.3   * Perl 5.26.3
 +GPU nodes, in addition: 
 +  * nVidia Driver 510.47.03 
 +  * CUDA 11.6 
 +  * libcudnn 8.7
 To use software that is not installed on the system, or a different version of installed software, there are three options: To use software that is not installed on the system, or a different version of installed software, there are three options:
   - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available).   - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available).
Line 143: Line 145:
 === uDocker ==== === uDocker ====
 [[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker manual ]] \\ [[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker manual ]] \\
-uDocker is installed as a module, so it needs to be loaded into the environment:+udocker is installed as a module, so it needs to be loaded into the environment:
 <code bash> <code bash>
 ml uDocker ml uDocker
Line 161: Line 163:
 # Install  # Install 
 sh Miniconda3-py39_4.11.0-Linux-x86_64.sh sh Miniconda3-py39_4.11.0-Linux-x86_64.sh
 +#  Initialize for bash shell
 +~/miniconda3/bin/conda init bash
 </code> </code>
  
Line 168: Line 172:
 == Available resources == == Available resources ==
 <code bash> <code bash>
 +hpc-login2 ~]# ver_estado.sh
 +=============================================================================================================
 +  NODO     ESTADO                        CORES EN USO                           USO MEM     GPUS(Uso/Total)
 +=============================================================================================================
 + hpc-fat1    up   0%[--------------------------------------------------]( 0/80) RAM:  0%     ---
 + hpc-gpu1    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
 + hpc-gpu2    up   2%[||------------------------------------------------]( 1/36) RAM: 47%   V100S (1/2)
 + hpc-gpu3    up   0%[--------------------------------------------------]( 0/64) RAM:  0%   A100_40 (0/2)
 + hpc-gpu4    up   1%[|-------------------------------------------------]( 1/64) RAM: 35%   A100_80 (1/1)
 + hpc-node1   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 + hpc-node2   up   0%[--------------------------------------------------]( 0/36) RAM:  0%     ---
 + hpc-node3   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 + hpc-node4   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 + hpc-node5   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 + hpc-node6   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 + hpc-node7   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 + hpc-node8   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 + hpc-node9   up   0%[--------------------------------------------------]( 0/48) RAM:  0%     ---
 +=============================================================================================================
 +TOTALES: [Cores : 3/688] [Mem(MB): 270000/3598464] [GPU: 3/ 7]
 hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
 # There is an alias for that command: # There is an alias for that command:
Line 238: Line 262:
 # There is an alias that shows only the relevant info: # There is an alias that shows only the relevant info:
 hpc-login2 ~]$ ver_colas hpc-login2 ~]$ ver_colas
-      Name   Priority           Flags UsageFactor                     MaxTRES     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU  +      Name    Priority                                  MaxTRES     MaxWall            MaxTRESPU MaxJobsPU MaxSubmitPU  
----------- ---------- --------------- ----------- --------------------------- ----------- ------------- --------- -----------  +----------  ---------- ---------------------------------------- ----------- -------------------- --------- -----------  
-   regular        100     DenyOnLimit    1.000000   cpu=200,gres/gpu=1,node=4  4-04:00:00                      10          50  +   regular         100                cpu=200,gres/gpu=1,node=4  4-04:00:00       cpu=200,node=4        10          50  
-interactive       200     DenyOnLimit    1.000000                      node=1    04:00:00        node=1                   1  +interactive        200                                   node=1    04:00:00               node=1                   1  
-    urgent        300     DenyOnLimit    2.000000           gres/gpu=1,node=1    04:00:00        cpu=36                  15  +    urgent         300                        gres/gpu=1,node=1    04:00:00               cpu=36                  15  
-      long        100     DenyOnLimit    1.000000           gres/gpu=1,node=4  8-08:00:00                                      +      long         100                        gres/gpu=1,node=4  8-04:00:00                                         
-     large        100     DenyOnLimit    1.000000          cpu=200,gres/gpu=2  4-04:00:00                      10          25  +     large         100                       cpu=200,gres/gpu=2  4-04:00:00                                       10  
-     admin        500                    0.000000 +     admin         500                                                                                                  
 +     small         150                             cpu=6,node=2    04:00:00              cpu=400        40         100 
 </code> </code>
 # Priority: is the relative priority of each queue. \\ # Priority: is the relative priority of each queue. \\
Line 258: Line 283:
 ==== Sending a job to the queue system ==== ==== Sending a job to the queue system ====
 == Requesting resources == == Requesting resources ==
-By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it a node, a CPU and all available memory. The time limit for job execution is that of the queue (4 days and 4 hours). +By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it one node, one CPU and 4 GB of memory. The time limit for job execution is that of the queue (4 days and 4 hours). 
 This is very inefficient. Ideally, specify at least the following three parameters, as precisely as possible, when submitting jobs: This is very inefficient. Ideally, specify at least the following three parameters, as precisely as possible, when submitting jobs:
   -  %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%   -  %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
Line 295: Line 320:
  
 == Job submission == == Job submission ==
 +  - sbatch
   - salloc   - salloc
   - srun   - srun
-  - sbatch 
  
-1. SALLOC \\ +1. SBATCH \\
-It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.  +
-<code bash> +
-# Get 5 nodes and launch a job. +
-hpc-login2 ~]$ salloc -N5 myprogram +
-# Get interactive access to a node (Press Ctrl+D to exit): +
-hpc-login2 ~]$ salloc -N1  +
-</code> +
-2. SRUN \\ +
-It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking. +
-<code bash> +
-# Launch the hostname command on 2 nodes +
-hpc-login2 ~]$ srun -N2 hostname +
-hpc-node1 +
-hpc-node2 +
-</code> +
-3. SBATCH \\+
 Used to send a script to the queuing system. It is batch-processing and non-blocking. Used to send a script to the queuing system. It is batch-processing and non-blocking.
 <code bash> <code bash>
Line 334: Line 343:
 hpc-login2 ~]$ sbatch test_job.sh  hpc-login2 ~]$ sbatch test_job.sh 
 </code> </code>
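
As a rough sketch of the resource request described above, the snippet below writes a small job script with explicit limits instead of relying on the defaults. The file name `sketch_job.sh`, the helper `write_job_script` and all resource values are illustrative assumptions, not part of the original page. `--mem`, `--time` and `--qos` are standard sbatch options.

```bash
#!/bin/bash
# Hypothetical helper: writes a minimal job script that requests explicit
# resources. File name and resource values are illustrative assumptions.
write_job_script() {
    local out="$1"
    cat > "$out" <<'EOF'
#!/bin/bash
#SBATCH --job-name=sketch        # illustrative job name
#SBATCH --nodes=1                # -N: number of nodes
#SBATCH --ntasks=1               # -n: number of tasks
#SBATCH --cpus-per-task=4        # -c: CPUs per task
#SBATCH --mem=8G                 # memory per node
#SBATCH --time=01:00:00          # wall-clock limit (HH:MM:SS)
#SBATCH --qos=regular            # the default QOS according to the text

# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# the fallback value only matters when the script runs outside Slurm.
echo "Running on $(hostname) with ${SLURM_CPUS_PER_TASK:-1} CPUs per task"
EOF
}

write_job_script sketch_job.sh
# On the cluster the generated script would be submitted with:
#   sbatch sketch_job.sh
```

The `#SBATCH` lines are plain comments to bash, so the generated script can also be run directly for a quick check before it is queued.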
 +2. SALLOC \\
 +It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed. 
 +<code bash>
 +# Get 5 nodes and launch a job.
 +hpc-login2 ~]$ salloc -N5 myprogram
 +# Get interactive access to a node (Press Ctrl+D to exit):
 +hpc-login2 ~]$ salloc -N1 
 +</code>
 +3. SRUN \\
 +It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
 +<code bash>
 +# Launch the hostname command on 2 nodes
 +hpc-login2 ~]$ srun -N2 hostname
 +hpc-node1
 +hpc-node2
 +</code>
 +
  
 ==== GPU use ==== ==== GPU use ====
Line 372: Line 398:
 By default, these are the exit codes of the commands: By default, these are the exit codes of the commands:
 ^  SLURM command  ^  Exit code  ^ ^  SLURM command  ^  Exit code  ^
-|  salloc  |  0 en caso de éxito, 1 si no se puedo ejecutar el comando del usuario  | +|  salloc  |  0 on success, 1 if the user's command cannot be executed  | 
-|  srun  |  El más alto de entre todas las tareas ejecutadas o 253 para un error out-of-mem  | +|  srun  |  The highest exit code among all executed tasks, or 253 for an out-of-memory error  | 
-|  sbatch  |  0 en caso de éxito, si no, el código de salida correspondiente del proceso que falló  |+|  sbatch  |  0 on success; otherwise, the corresponding exit code of the failed process  |
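
The exit codes in the table can be checked from a shell script. The following is a minimal sketch; the function name `describe_srun_exit` and the message texts are hypothetical, and only the values 0 and 253 come from the table above.

```bash
#!/bin/bash
# Hypothetical helper: turns an srun exit code into a short message.
# Only 0 (success) and 253 (out-of-memory) are taken from the table;
# every other value is reported as the highest exit code of the tasks.
describe_srun_exit() {
    local rc="$1"
    case "$rc" in
        0)   echo "success" ;;
        253) echo "out-of-memory error" ;;
        *)   echo "task failed with exit code $rc" ;;
    esac
}

# Typical use on the cluster (myprogram is a placeholder):
#   srun -N1 myprogram; describe_srun_exit $?
describe_srun_exit 253
```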
  
 == STDIN, STDOUT and STDERR == == STDIN, STDOUT and STDERR ==
 **SRUN:**\\ **SRUN:**\\
-Por defecto stdout y stderr se redirigen de todos los TASKS a el stdout y stderr de srun, y stdin se redirecciona desde el stdin de srun a todas las TASKS. Esto se puede cambiar con: +By default, stdout and stderr are redirected from all TASKS to srun's stdout and stderr, and stdin is redirected from srun's stdin to all TASKS. This can be changed with:
-|  %%-i, --input=<opcion>%%    |  +|  %%-i, --input=<option>%%    |  
-|  %%-o, --output=<opcion>%%   | +|  %%-o, --output=<option>%%   | 
-|  %%-e, --error=<opcion>%%   | +|  %%-e, --error=<option>%%   | 
-Y las opciones son+And the options are:
-  * //all//: opción por defecto+  * //all//: the default option.
-  * //none//: No se redirecciona nada+  * //none//: Nothing is redirected
-  * //taskid//: Solo se redirecciona desde y/o al TASK id especificado+  * //taskid//: Redirects only to and/or from the specified TASK id. 
-  * //filename//: Se redirecciona todo desde y/o al fichero especificado+  * //filename//: Redirects everything to and/or from the specified file
-  * //filename pattern//: Igual que filename pero con un fichero definido por un [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | patrón ]]+  * //filename pattern//: Same as the filename option but with a file defined by a [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | pattern ]].
  
 **SBATCH:**\\ **SBATCH:**\\
-Por defecto "/dev/null" está abierto en el stdin del script y stdout y stderror se redirigen a un fichero de nombre "slurm-%j.out". Esto se puede cambiar con: +By default, "/dev/null" is open on the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with:
 |  %%-i, --input=<filename_pattern>%%  | |  %%-i, --input=<filename_pattern>%%  |
 |  %%-o, --output=<filename_pattern>%%  | |  %%-o, --output=<filename_pattern>%%  |
 |  %%-e, --error=<filename_pattern>%%  | |  %%-e, --error=<filename_pattern>%%  |
-La referencia de filename_pattern está [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | aquí ]].+The filename_pattern reference is [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | here ]].
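
To get a feel for how a filename pattern turns into a log file name, the sketch below mimics the expansion of two common specifiers, `%j` (job ID) and `%x` (job name). The function `preview_log_name` is hypothetical and only an illustration; Slurm performs the real expansion itself and supports more specifiers than these two.

```bash
#!/bin/bash
# Hypothetical helper: previews the file name that a pattern would produce.
# Only %j (job ID) and %x (job name) are handled here. Job names that
# contain "/" or "&" would need escaping before being passed to sed.
preview_log_name() {
    local pattern="$1" job_id="$2" job_name="$3"
    printf '%s\n' "$pattern" | sed -e "s/%j/${job_id}/g" -e "s/%x/${job_name}/g"
}

# Matching directives inside a job script would look like this:
#   #SBATCH --output=%x-%j.out
#   #SBATCH --error=%x-%j.err
preview_log_name "slurm-%j.out" 6547 example
preview_log_name "%x-%j.err" 6547 example
```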
  
-==== Envío de correos ==== +==== Sending mail ==== 
-Se pueden configurar los JOBS para que envíen correos en determinadas circunstancias usando estos dos parámetros (**SON NECESARIOS AMBOS**): +JOBS can be configured to send mail in certain circumstances using these two parameters (**BOTH ARE REQUIRED**): 
-|  %%--mail-type=<type>%%  |  Opciones: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  | +|  %%--mail-type=<type>%%  |  Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  | 
-|  %%--mail-user=<user>%%  |  La dirección de correo de destino.  | +|  %%--mail-user=<user>%%  |  The destination email address.  |
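
Since both parameters are required, a small wrapper can refuse to build the options when one is missing. This is a sketch only: the function name `mail_opts` and the address `user@example.org` are placeholders. Several mail types can be combined in a comma-separated list.

```bash
#!/bin/bash
# Hypothetical helper: prints the two mail options for sbatch, or fails
# when either the mail type or the mail user is missing.
mail_opts() {
    local types="$1" user="$2"
    if [ -z "$types" ] || [ -z "$user" ]; then
        echo "both a mail type and a mail user are required" >&2
        return 1
    fi
    printf -- '--mail-type=%s --mail-user=%s\n' "$types" "$user"
}

# Typical use on the cluster (address and script name are placeholders):
#   sbatch $(mail_opts END,FAIL user@example.org) test_job.sh
mail_opts END,FAIL user@example.org
```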
  
  
  
-==== Estados de los trabajos en el sistema de colas ====+==== Status of Jobs in the queuing system ====
 <code bash> <code bash>
 hpc-login2 ~]# squeue -l hpc-login2 ~]# squeue -l
 JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON) JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
 6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1 6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1
 +
 +## Check status of queue use:
 +hpc-login2 ~]$ estado_colas.sh
 +JOBS PER USER:
 +--------------
 +       usuario.uno:  3
 +       usuario.dos:  1
 +
 +JOBS PER QOS:
 +--------------
 +             regular:  3
 +                long:  1
 +
 +JOBS PER STATE:
 +--------------
 +             RUNNING:  3
 +             PENDING:  1
 +==========================================
 +Total JOBS in cluster:  4
 </code> </code>
-Estados (STATE) más comunes de un trabajo:+Common job states:
   * R RUNNING Job currently has an allocation.   * R RUNNING Job currently has an allocation.
   * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.    * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero. 
Line 415: Line 460:
   * PD PENDING Job is awaiting resource allocation.   * PD PENDING Job is awaiting resource allocation.
    
-[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Lista completa de posibles estados de un trabajo ]].\\+[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Full list of possible job statuses ]].\\
  
-Si un trabajo no está en ejecución aparecerá una razón debajo de REASON:[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | Lista de las razones ]] por las que un trabajo puede estar esperando su ejecución.+If a job is not running, a reason is shown under REASON: [[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | list of reasons ]] why a job may be waiting to run.
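
When many jobs are waiting, it can help to group them by reason. The sketch below assumes input lines of the form `<jobid> <reason>`, which squeue can produce with `-h` (no header), `-t PENDING` and the format string `"%i %r"`. The function name `count_pending_reasons` and the sample job IDs are made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: reads "<jobid> <reason>" lines from stdin and prints
# how many pending jobs share each reason, sorted by reason name.
count_pending_reasons() {
    awk '{
            $1 = ""                 # drop the job ID, keep the reason text
            sub(/^ +/, "")          # remove the leading separator left behind
            count[$0]++
         }
         END { for (r in count) printf "%s: %d\n", r, count[r] }' | sort
}

# Typical use on the cluster:
#   squeue -h -u "$USER" -t PENDING -o "%i %r" | count_pending_reasons
# Sample input with made-up job IDs:
printf '101 Resources\n102 Priority\n103 Resources\n' | count_pending_reasons
```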