Differences

This shows you the differences between two versions of the page.

--- en:centro:servizos:hpc [2016/05/24 09:23] – [Quick usage instructions] fernando.guillen
+++ en:centro:servizos:hpc [2022/06/30 11:46] – fernando.guillen
@@ Line 1: / Line 1: @@
-====== High Performance Computing (HPC) ======
+====== High Performance Computing (HPC) cluster ctcomp3  ======
+[[ https://web.microsoftstream.com/video/f5eba154-b597-4440-9307-3befd7597d78 | Video of the service presentation (7/3/22) (Spanish only) ]]
+===== Description =====
-===== Quick usage instructions =====
+The computing part of the cluster is made up of:
-----------------
+  * 9 servers for general computing.
-A summary of the steps necessary to get a job done:
+  * 1 "fat node" for memory-intensive jobs.
+  * 4 servers for GPU computing.
+Users only have direct access to the login node, which has more limited features and should not be used for computing. \\
+All nodes are interconnected by a 10Gb network. \\
+There is distributed storage accessible from all nodes with 220 TB of capacity connected by a dual 25Gb fibre network. \\
-  - [[ en:centro:servizos:hpc:acceso_al_cluster | Log into the cluster and copy the necessary files.]]
+\\
-  - [[ en:centro:servizos:hpc:escribir_script | Prepare the job for submission to the queue manager.]]
+^  Name                    ^  Model      ^  Processor                                     ^  Memory  ^  GPU                         ^
-  - [[ en:centro:servizos:hpc:envio_trabajo | Submit and manage the job in the queue manager.]]
+|  hpc-login2                |  Dell R440   |  1 x Intel Xeon Silver 4208 CPU @ 2.10GHz (8c)  |  16 GB    |  -                           |
+|  hpc-node[1-2]             |  Dell R740   |  2 x Intel Xeon Gold 5220 @ 2.20GHz (18c)       |  192 GB   |  -                           |
+|  hpc-node[3-9]             |  Dell R740   |  2 x Intel Xeon Gold 5220R @ 2.20GHz (24c)      |  192 GB   |  -                           |
+|  hpc-fat1                  |  Dell R840   |  4 x Xeon Gold 6248 @ 2.50GHz (20c)             |  1 TB     |  -                           |
+|  <del>hpc-gpu1</del>*      |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
+|  hpc-gpu2                  |  Dell R740   |  2 x Intel Xeon Gold 5220 CPU @ 2.20GHz (18c)   |  192 GB   |  2x Nvidia Tesla V100S       |
+|  hpc-gpu3                  |  Dell R7525  |  2 x AMD EPYC 7543 @ 2.80GHz (32c)              |  256 GB   |  2x Nvidia Ampere A100 40GB  |
+|  hpc-gpu4                  |  Dell R7525  |  2 x AMD EPYC 7543 @ 2.80GHz (32c)              |  256 GB   |  1x Nvidia Ampere A100 80GB  |
+* Now ctgpgpu8. It will be integrated into the cluster soon.
+===== Accessing the system =====
+To use the cluster, access must be requested in advance via the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will receive an "incorrect password" message.
+Access is through an SSH connection to the login node:
+<code bash>
+ssh <username>@hpc-login2.inv.usc.es
+</code>
+=====  Storage, directories and filesystems  =====
+<note warning> None of the file systems in the cluster are backed up!!!</note>
+Each user's HOME directory is on the shared file system, so it is accessible from all nodes in the cluster. Its path is defined in the environment variable %%$HOME%%. \\
+Each node has a local 1 TB scratch partition, which is deleted at the end of each job. It can be accessed in scripts through the %%$LOCAL_SCRATCH%% environment variable. \\
+For data to be shared by groups of users, you must request the creation of a folder in the shared storage that will only be accessible by members of the group.\\
+^  Directory        ^  Variable               ^  Mount point             ^  Capacity  ^
+|  Home           |  %%$HOME%%           |  /mnt/beegfs/home/<username>  |  220 TB*  |
+|  Local scratch  |  %%$LOCAL_SCRATCH%%  |  varies                       |  1 TB     |
+|  Group folder   |  %%$GRUPOS/<name>%%  |  /mnt/beegfs/groups/<name>    |  220 TB*  |
+%%* storage is shared %%
+=== WARNING ===
+The shared file system performs poorly when working with many small files. To improve performance in such scenarios, create a file system inside an image file and mount it to work directly on it. The procedure is as follows:
+  * Create the image file in your home folder:
+<code bash>
+## truncate image.name -s SIZE_IN_BYTES
+truncate example.ext4 -s 20G
+</code>
+  *  Create a filesystem in the image file:
+<code bash>
+## mkfs.ext4 -T small -m 0 image.name
+## -T small optimized options for small files
+## -m 0 Do not reserve capacity for root user
+mkfs.ext4 -T small -m 0 example.ext4
+</code>
+  * Mount the image (using SUDO) with the script  //mount_image.py// :
+<code bash>
+## By default it is mounted at /mnt/imagenes/<username>/ in read-only mode.
+sudo mount_image.py example.ext4
+</code>
+  * To unmount the image use the script //umount_image.py// (using SUDO)
-===== Introduction =====
+The mount script has these options:
--------------
+<code>
-High Performance Computing (HPC from now on) infrastructures offer CITIUS researchers a platform to resolve problems with high computational requirements. A computational cluster is an set of nodes interconnected by a dedicated network that can act as a single computational element. This offers a huge computational power (allowing the execution of a big parallel job or several concurrent small executions) in a shared infrastructure.
+--mount-point path   <-- (optional) This option creates subdirectories under /mnt/imagenes/<username>/<path>
+--rw                  <-- (optional) By default the image is mounted read-only; with this option it is mounted read-write.
+</code>
+<note warning> Do not mount the image file readwrite from more than one node!!!</note>
-A queue management system is a program that plans how and when jobs will execute using the available computational resources.  Allows for an efficient use of computational resources in systems with multiple users. In the our cluster we use PBS/TORQUE.
+The unmounting script only supports, as an optional parameter, the same path you used when mounting with the option:
+<code>
+--mount-point  <-- (optional)
+</code>
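+Putting the steps above together, a minimal sketch of the whole cycle could look like this. The image name, its size, the ''dataset'' directory, the use of ''$USER'' for the mount path and the position of ''%%--rw%%'' before the image name are assumptions made for illustration; adjust them to your case.
+<code bash>
+# Create a 20 GB sparse image file and format it for many small files
+truncate -s 20G example.ext4
+mkfs.ext4 -T small -m 0 example.ext4
+# Mount it read-write (on a single node only) and copy the data in
+sudo mount_image.py --rw example.ext4
+cp -r dataset "/mnt/imagenes/$USER/"
+# Unmount when finished
+sudo umount_image.py
+</code>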
+=====  Transferring files and data  =====
+=== SCP ===
+From your local machine to the cluster:
+<code bash>
+scp filename <username>@hpc-login2:/<path>
+</code>
+From the cluster to your local machine:
+<code bash>
+scp filename <username>@<hostname>:/<path>
+</code>
+[[https://man7.org/linux/man-pages/man1/scp.1.html | SCP man page]]
+=== SFTP ===
+To transfer several files or to navigate through the filesystem.
+<code bash>
+<hostname>:~$ sftp <user_name>@hpc-login2
+sftp>
+sftp> ls
+sftp> cd <path>
+sftp> put <file>
+sftp> get <file>
+sftp> quit
+</code>
+[[https://www.unix.com/man-page/redhat/1/sftp/ | SFTP man page]]
+=== RSYNC ===
+[[ https://rsync.samba.org/documentation.html | RSYNC documentation ]]
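+As a rough sketch of a typical transfer, the following copies a local directory to your home on the cluster and can be re-run to resume or update the copy. The ''results'' directory and the ''CLUSTER_USER'' variable are hypothetical; the host name is the one used in the SSH example.
+<code bash>
+CLUSTER_USER=your_username   # placeholder: replace with your cluster login
+# Preview what would be transferred without copying anything
+rsync -avz --dry-run results/ "$CLUSTER_USER@hpc-login2.inv.usc.es:results/"
+# Actual transfer: -a keeps permissions and times, -z compresses,
+# -P shows progress and keeps partially transferred files
+rsync -avzP results/ "$CLUSTER_USER@hpc-login2.inv.usc.es:results/"
+</code>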
+=== SSHFS ===
+Requires the sshfs package to be installed locally.\\
+For example, it allows you to mount the home directory of your local machine on hpc-login2:
+<code bash>
+## Mount
+sshfs  <username>@ctdeskxxx.inv.usc.es:/home/<username> <mount_point>
+## Unmount
+fusermount -u <mount_point>
+</code>
+[[https://linux.die.net/man/1/sshfs | SSHFS man page]]
-The way these systems work is:
+===== Available Software =====
-        - The user requests some resources to the queue manager for a computational task. This task is a set of instructions written in a script.
+All nodes have the basic software installed by default with AlmaLinux 8.4, in particular:
-        - The queue manager assigns the request to one of its queues.
+  * GCC 8.5.0
-        - When the requested resources are available and depending on the priorities established by the system, the queue manager executes the task and stores the output.
+  * Python 3.6.8
+  * Perl 5.26.3
-It is important to note that the request and the execution of a given task are independent actions that are not resolved atomically. In fact it is usual that the execution of the task has to wait in one of the queues until the requested resources are available. Also, interactive use is impossible.
+To use any other software not installed on the system, or a different version of it, there are three options:
+  - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available)
+  - Use a container (uDocker or Apptainer/Singularity)
+  - Use Conda
+A module is the simplest solution for using software without modifications or hard-to-satisfy dependencies.\\
+A container is ideal when dependencies are complicated and/or the software is highly customized. It is also the best solution when the goal is reproducibility, ease of distribution and teamwork.\\
+Conda is the best solution if you need the latest version of a library or program, or packages not available any other way.\\
-==== Hardware description ====
-Ctcomp2 is a heterogeneous cluster, composed of 8 HP Proliant BL685c G7, 5 Dell PowerEdge M910 and 5 Dell PowerEdge M620 nodes.
+==== Using modules/Lmod ====
-  * Each HP Proliant node has 4 AMD Opteron 6262 HE (16 cores) processors and 256 GB RAM(except node1 and the master with 128GB).
+[[ https://lmod.readthedocs.io/en/latest/010_user.html | Lmod documentation ]]
-  * Each Dell PowerEdge M910 node has 2  Intel Xeon L7555 (8 cores, 16 threads) processors and 64 GB RAM.
+<code bash>
-  * Each Dell PowerEdge M620 node has 2 Intel Xeon E5-2650L (8 cores, 16 threads) processors and 64 GB RAM.
+# List the available modules:
-  * Connection with the cluster is made at 1Gb but nodes are connected between them by several 10 GbE networks.
+module avail
+# Load a module:
+module load <module_name>
+# Unload a module:
+module unload <module_name>
+# List the modules loaded in your environment:
+module list
+# ml can be used as a shorthand for the module command:
+ml avail
+# To get information about a module:
+ml spider <module_name>
+</code>
-==== Software description ====
-The job management is done by the queue manager PBS/TORQUE. To improve energetic efficiency an on demand power on and off system called CLUES has been implemented.
-  * [[http://docs.adaptivecomputing.com/maui/index.php|MAUI 3.3.1]]
+==== Running software containers ====
-  * [[http://docs.adaptivecomputing.com/torque/4-1-7/help.htm|Torque 4.1.3]]
+=== uDocker ===
-  * [[http://www.grycap.upv.es/clues/eng/index.php|CLUES 0.88]]
+[[ https://indigo-dc.gitbook.io/udocker/user_manual | uDocker manual ]] \\
+uDocker is installed as a module, so it must be loaded into the environment:
+<code bash>
+ml uDocker
+</code>
-===== User queues =====
+=== Apptainer/Singularity ===
--------------
+[[ https://sylabs.io/guides/3.8/user-guide/ | Apptainer/Singularity documentation ]] \\
+Apptainer/Singularity is installed on the system of each node, so nothing needs to be done to use it.
-There are four user and eight system queues. The user queues are //routing// queues that set, depending on the number of computational numbers requested, the system queue in which each job is going to be executed. Users can't send their jobs directly to the system queues, jobs have to be submitted to the user queues.
-Independently of the type of queue used for job submissions, an user can only specify the following parameters: **node number**, **process number per node** and ** execution time**. Size of memory assigned and maximum execution time of a job are determined by the system queue in which the job gets routed. Jobs that exceed those limits during execution will be canceled.
+==== CONDA ====
-Therefore for jobs in which both memory and execution time are critical it is recommended to modify the number of process requested (even though not all of them get used during the execution) to guarantee that the job needs are fulfilled. The system queue also determines the maximum number of jobs per user and their priority. Users are allowed to specify the job execution time because a precise estimation of execution times allows the queue management system to use resources efficiently without disturbing established priorities. Anyway it is advisable to set an execution time long enough as to guarantee the correct execution of the job and avoid its cancellation.
+[[ https://docs.conda.io/en/latest/miniconda.html | Conda documentation ]] \\
- __To execute jobs that don't adjust to queue parameters get in touch with the IT department.__
+Miniconda is the minimal version of Anaconda and only includes the conda environment manager, Python and a few required packages. From there, each user downloads and installs only the packages they need.
+<code bash>
+# Download Miniconda
+wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh
+# Install it
+sh Miniconda3-py39_4.11.0-Linux-x86_64.sh
+</code>
-User queues are ''batch'', ''short'', ''bigmem'' and ''interactive''.
+===== Using SLURM =====
-  *  ''batch''. It's the default queue.((If no queue is specified with the ''-q'' parameter of the ''qsub'' command job will be assigned to the ''batch'' queue.)) Accepts up to 10 jobs per user. Jobs sent to this queue can be executed by any system queue.
+The queue manager in the cluster is [[ https://slurm.schedmd.com/documentation.html | SLURM ]]. \\
-  *  ''short''. This queue is designed to reduce the waiting time of jobs that don't need much computational time (maximum 12 hours) and that don't use many resources (less than 16 computational cores). It has more priority than the ''batch'' queue and admits up to 40 jobs per user. Jobs sent to this queue can be executed by the system queues:''np16'',''np8'', ''np4'',''np2'' and ''np1''. To send a job to this queue it is necessary to use the ''-q'' option of the ''qsub'' command explicitly.
+<note tip>The term CPU refers to a physical core of a socket. Hyperthreading is disabled, so each node has as many CPUs available as (number of sockets) * (number of physical cores per socket).</note>
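+The rule above can be checked against the hardware table with a few lines of shell. This is only an illustrative calculation; the function name is made up for the example.
+<code bash>
+# CPUs seen by SLURM on a node = sockets * physical cores per socket
+cpus_per_node() {
+    local sockets=$1 cores_per_socket=$2
+    echo $(( sockets * cores_per_socket ))
+}
+cpus_per_node 2 18   # hpc-node[1-2]
+cpus_per_node 2 24   # hpc-node[3-9]
+cpus_per_node 4 20   # hpc-fat1
+cpus_per_node 2 32   # hpc-gpu[3-4]
+</code>
+The results (36, 48, 80 and 64) match the CPUS column reported by sinfo.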
-<code>
+== Available resources ==
-ct$ qsub -q short script.sh
+<code bash>
+hpc-login2 ~]$ sinfo -e -o "%30N  %20c  %20m  %20f  %30G " --sort=N
+# There is an alias for this command:
+hpc-login2 ~]$ ver_recursos
+NODELIST                        CPUS                  MEMORY                AVAIL_FEATURES        GRES
+hpc-fat1                        80                    1027273               cpu_intel             (null)
+hpc-gpu[1-2]                    36                    187911                cpu_intel             gpu:V100S:2
+hpc-gpu3                        64                    253282                cpu_amd               gpu:A100_40:2
+hpc-gpu4                        64                    253282                cpu_amd               gpu:A100_80:1(S:0)
+hpc-node[1-2]                   36                    187645                cpu_intel             (null)
+hpc-node[3-9]                   48                    187645                cpu_intel             (null)
+# To see the current resource usage: (CPUS (Allocated/Idle/Other/Total))
+hpc-login2 ~]$ sinfo -N -r -O NodeList,CPUsState,Memory,FreeMem,Gres,GresUsed
+# There is an alias for this command:
+hpc-login2 ~]$ ver_uso
+NODELIST            CPUS(A/I/O/T)       MEMORY              FREE_MEM            GRES                GRES_USED
+hpc-fat1            80/0/0/80           1027273             900850              (null)              gpu:0,mps:0
+hpc-gpu3            2/62/0/64           253282              226026              gpu:A100_40:2       gpu:A100_40:2(IDX:0-
+hpc-gpu4            1/63/0/64           253282              244994              gpu:A100_80:1(S:0)  gpu:A100_80:1(IDX:0)
+hpc-node1           36/0/0/36           187645              121401              (null)              gpu:0,mps:0
+hpc-node2           36/0/0/36           187645              130012              (null)              gpu:0,mps:0
+hpc-node3           36/12/0/48          187645              126739              (null)              gpu:0,mps:0
+hpc-node4           36/12/0/48          187645              126959              (null)              gpu:0,mps:0
+hpc-node5           36/12/0/48          187645              128572              (null)              gpu:0,mps:0
+hpc-node6           36/12/0/48          187645              127699              (null)              gpu:0,mps:0
+hpc-node7           36/12/0/48          187645              127002              (null)              gpu:0,mps:0
+hpc-node8           36/12/0/48          187645              128182              (null)              gpu:0,mps:0
+hpc-node9           36/12/0/48          187645              127312              (null)              gpu:0,mps:0
 </code>
-  *  ''bigmem''. This queue is designed for jobs that need a lot of memory. This queue will set aside a full 64 core node for the job, so ''nodes=1:ppn=64'' in the ''-l'' option of ''qsub'' is required. This queue has more priority than the ''batch'' queue and is limited to two jobs per user. To send a job to this queue it is necessary to use the ''-q'' option of the ''qsub'' command explicitly:
+==== Nodes ====
-<code>
+A node is the computing unit of SLURM and corresponds to a physical server.
-ct$ qsub -q bigmem script.sh
+<code bash>
+# Show the information of a node:
+hpc-login2 ~]$ scontrol show node hpc-node1
+NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18
+   CPUAlloc=0 CPUTot=36 CPULoad=0.00
+   AvailableFeatures=cpu_intel
+   ActiveFeatures=cpu_intel
+   Gres=(null)
+   NodeAddr=hpc-node1 NodeHostName=hpc-node1 Version=21.08.6
+   OS=Linux 4.18.0-305.el8.x86_64 #1 SMP Wed May 19 18:55:28 EDT 2021
+   RealMemory=187645 AllocMem=0 FreeMem=166801 Sockets=2 Boards=1
+   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
+   Partitions=defaultPartition
+   BootTime=2022-03-01T13:13:56 SlurmdStartTime=2022-03-01T15:36:48
+   LastBusyTime=2022-03-07T14:34:12
+   CfgTRES=cpu=36,mem=187645M,billing=36
+   AllocTRES=
+   CapWatts=n/a
+   CurrentWatts=0 AveWatts=0
+   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
 </code>
-  *  ''interactive''. This is the only queue that admits interactive sessions in the computational nodes. Also only one job per user is allowed, with a maximum execution time of one hour and access to a single core of one node. Use of the ''interactive'' queue doesn't require the use of a //script//, but it is necessary to denote the interactivity of the job using the ''-I'' option:
+==== Partitions ====
-<code>
+Partitions in SLURM are logical groups of nodes. The cluster has a single partition that contains all the nodes, so there is no need to specify it when submitting jobs.
-ct$ qsub -q interactive -I
+<code bash>
+# Show the partition information:
+hpc-login2 ~]$ sinfo
+defaultPartition*    up   infinite     11   idle hpc-fat1,hpc-gpu[3-4],hpc-node[1-9]
+# When ctgpgpu7 and 8 are added to the cluster they will appear as nodes hpc-gpu1 and 2 respectively.
 </code>
+==== Jobs ====
+Jobs in SLURM are allocations of resources to a user for a given time. Jobs are identified by a sequential number or JOBID. \\
+A job (JOB) consists of one or more steps (STEPS), each consisting of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that runs sequentially in a JOB and one TASK for each program that runs in parallel. Therefore, in the simplest case, such as launching a job that runs the hostname command, the JOB has a single STEP and a single TASK.
-The system queues are ''np1'', ''np2'', ''np4'', ''np8'', ''np16'', ''np32'', ''np64'' y ''parallel''.
+==== Queue system (QOS) ====
-  *  ''np1''. Jobs that require 1 process and 1 node. Maximum memory for jobs in this queue is 1,99 GB and maximum execution time is 672 hours.
+The queue to which each job is submitted defines its priority, its limits and also the relative "cost" to the user.
-  *  ''np2''. Jobs that require 2 processes. Maximum memory for jobs in this queue is 3,75 GB and maximum execution time is 192 hours.
+<code bash>
-  *  ''np4''.Jobs that require 4 processes. Maximum memory for jobs in this queue is 7,5 GB and maximum execution time is 192 hours.
+# Show the queues
-  *  ''np8''. Jobs that require 8 processes and as much as 5 nodes. Maximum memory for jobs in this queue is 15 GB and maximum execution time is 192 hours.
+hpc-login2 ~]$ sacctmgr show qos
-  *  ''np16''. Jobs that require 16 processes and as much as 5 nodes. Maximum memory for jobs in this queue is 31 GB and maximum execution time is 192 hours.
+# There is an alias that shows only the most relevant information:
-  *  ''np32''. Jobs that require 32 processes and as much as 5 nodes. Maximum memory for jobs in this queue is 63 GB and maximum execution time is 288 hours.
+hpc-login2 ~]$ ver_colas
-  *  ''np64''. Jobs that require 64 processes and as much as 5 nodes. Maximum memory for jobs in this queue is 127 GB and maximum execution time is 384 hours.
+      Name   Priority           Flags UsageFactor                     MaxTRES     MaxWall     MaxTRESPU MaxJobsPU MaxSubmitPU
-  *  ''parallel''. Jobs that require more than 32 processes in at least two separate nodes.Maximum memory for jobs in this queue is 64 GB and maximum execution time is 192 hours.
+---------- ---------- --------------- ----------- --------------------------- ----------- ------------- --------- -----------
+   regular        100     DenyOnLimit    1.000000   cpu=200,gres/gpu=1,node=4  4-04:00:00                      10          50
+interactive       200     DenyOnLimit    1.000000                      node=1    04:00:00        node=1         1           1
+    urgent        300     DenyOnLimit    2.000000           gres/gpu=1,node=1    04:00:00        cpu=36         5          15
+      long        100     DenyOnLimit    1.000000           gres/gpu=1,node=4  8-08:00:00
+     large        100     DenyOnLimit    1.000000          cpu=200,gres/gpu=2  4-04:00:00                      10          25
+     admin        500                    0.000000
+</code>
+# Priority: the relative priority of each queue. \\
+# DenyOnLimit: the job is not run if it does not comply with the limits of the queue. \\
+# UsageFactor: the relative cost to the user of running a job in that queue. \\
+# MaxTRES: limits per job. \\
+# MaxWall: maximum time the job can be running. \\
+# MaxTRESPU: global limits per user. \\
+# MaxJobsPU: maximum number of jobs a user can have running. \\
+# MaxSubmitPU: maximum number of jobs a user can have queued and running in total.\\
+==== Submitting a job to the queue system ====
+== Resource specification ==
+By default, if a job is submitted without specifying anything, the system sends it to the default QOS (regular) and assigns it one node, one CPU and all the available memory. The time limit for the job is that of the queue (4 days and 4 hours).
+This is very inefficient; ideally, specify as far as possible at least three parameters when submitting jobs:
+  -  %%The number of nodes (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
+  -  %%The memory (--mem) per node or the memory per CPU (--mem-per-cpu).%%
+  -  %%The estimated run time of the job ( --time )%%
-The following table summarizes the characteristics of the user and system queues;
+In addition, it may be useful to add the following parameters:
+|  -J   |  %%--job-name%%  |Name for the job. Default: name of the executable  |
+|  -q   |  %%--qos%%       |Name of the queue to which the job is submitted. Default: regular  |
+|  -o   |  %%--output%%    |File or file pattern to which all standard and error output is redirected.  |
+|       |  %%--gres%%      |Type and/or number of GPUs requested for the job.  |
+|  -C   |  %%--constraint%%  |To request nodes with Intel or AMD processors (cpu_intel or cpu_amd)  |
+|    |  %%--exclusive%%  |To request that the job does not share nodes with other jobs.  |
+|  -w  |  %%--nodelist%%   |List of nodes on which to run the job  |
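+As an illustration, the header of a batch script that combines several of these parameters might look like the sketch below. The job name, resource figures and ''my_program'' are hypothetical values chosen for the example, not recommendations.
+<code bash>
+#!/bin/bash
+#SBATCH -J example_job              # Job name
+#SBATCH -q regular                  # Queue (QOS)
+#SBATCH -o example_job_%j.log       # stdout and stderr; %j expands to the JOBID
+#SBATCH -C cpu_intel                # Only nodes with Intel processors
+#SBATCH --exclusive                 # Do not share the node with other jobs
+#SBATCH -N 1 -n 1 -c 8              # 1 node, 1 task, 8 CPUs for that task
+#SBATCH --mem=16G                   # Memory per node
+#SBATCH --time=02:00:00             # Estimated run time (hrs:min:sec)
+srun ./my_program                   # my_program is a placeholder
+</code>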
-^ Queue             ^ Limits                                                                                                                   ||||||
+== How resources are allocated ==
-| :::              ^ Processes  ^ Nodes  ^ Memory (GB)  ^ Jobs/user  ^ Maximum time (hours)  ^ Priority((Higher = more priority))  ^
+By default, the allocation method between nodes is block allocation (all available cores in a node are allocated before using another node). The default allocation method within each node is cyclic allocation (the required cores are distributed equally among the available sockets in the node).
-| ''batch''        |      1-64 | -      | -             | 128               | -                      | 1                                            |
-| ''short''        |      1-16 | -      | -             | 256               | -                      | 3                                            |
+== Priority calculation ==
-| ''bigmem''       |        64 | -      | -             | 8                 | -                      | 2                                            |
+When a job is submitted to the queue system, the first thing that happens is a check of whether the requested resources fall within the limits set for the corresponding queue. If any limit is exceeded, the submission is cancelled. \\
-| ''interactive''  | 1         | 1      | 2             | 1                 | 1                      | 7                                            |
+If resources are available the job runs directly; otherwise it is queued. Each job is assigned a priority that determines the order in which queued jobs run when resources become available. To determine the priority of each job, 3 factors are weighted: the time it has been waiting in the queue (25%), the fixed priority of the queue (25%) and the user's fairshare (50%). \\
-| ''np1''          | 1         | 1      | 1,99          | 120               | 672                    | 6                                            |
+The fairshare is a dynamic calculation that SLURM makes for each user; it is the difference between the resources allocated and the resources consumed over the last 14 days.
-| ''np2''          | 2         | 2      | 3,75          | 120               | 192                    | 5                                            |
+<code bash>
-| ''np4''          | 4         | 4      | 7,5           | 60                | 192                    | 4                                            |
+hpc-login2 ~]$ sshare -l
-| ''np8''          | 8         | 5      | 15            | 60                | 192                    | 4                                            |
+      User  RawShares  NormShares    RawUsage   NormUsage   FairShare
-| ''np16''         | 16        | 5      | 31            | 15                | 192                    | 3                                            |
+---------- ---------- ----------- ----------- -----------  ----------
-| ''np32''         | 32        | 5      | 63            | 15                | 288                    | 2                                            |
+.000000     2872400                0.500000
-| ''np64''         | 64        | 5      | 127           | 3                 | 384                    | 1                                            |
+    0.500000     2872400    1.000000    0.250000
-| ''parallel''     | 32-160    | 5      | 64            | 15                | 192                    | 3                                            |
+user_name         100    0.071429        4833    0.001726    0.246436
+</code>
+# RawShares: the amount of resources allocated to the user in absolute terms. It is the same for all users.\\
+# NormShares: the previous amount normalized to the total resources allocated.\\
+# RawUsage: the number of CPU-seconds consumed by all the user's jobs.\\
+# NormUsage: the previous amount normalized to the total CPU-seconds consumed in the cluster.\\
+# FairShare: the FairShare factor, between 0 and 1. The more the cluster is used, the closer it gets to 0 and the lower the priority.\\
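+As an illustration of the 25/25/50 weighting described in this section, the sketch below combines three factors into a single value. It assumes each factor is already normalized between 0 and 1 and only mirrors the stated weights; the real value is computed internally by SLURM, and the function name is hypothetical.
+<code bash>
+# priority = 0.25 * age + 0.25 * qos + 0.50 * fairshare   (all factors in 0..1)
+job_priority() {
+    awk -v age="$1" -v qos="$2" -v fs="$3" \
+        'BEGIN { printf "%.4f\n", 0.25 * age + 0.25 * qos + 0.50 * fs }'
+}
+# A job that has waited half of the maximum age, in a queue with a fifth of
+# the top priority, from a user with a FairShare factor of 0.25
+job_priority 0.5 0.2 0.25
+</code>
+With these inputs the fairshare term contributes as much as the age term, even though its factor is half as large.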
+== Submitting jobs ==
+  - salloc
+  - srun
+  - sbatch
+. SALLOC \\
+Used to obtain a resource allocation (nodes) immediately. As soon as it is obtained, the specified command is run, or a shell if none is given.
+<code bash>
+# Get 5 nodes and launch a job.
+hpc-login2 ~]$ salloc -N5 myprogram
+# Get interactive access to a node (press Ctrl+D to end the session):
+hpc-login2 ~]$ salloc -N1
+</code>
+. SRUN \\
+Used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
+<code bash>
+# Run hostname on 2 nodes
+hpc-login2 ~]$ srun -N2 hostname
+hpc-node1
+hpc-node2
+</code>
+. SBATCH \\
+Used to submit a script to the queue system. It is batch-processing and non-blocking.
+<code bash>
+# Create the script:
+hpc-login2 ~]$ vim trabajo_ejemplo.sh
+    #!/bin/bash
+    #SBATCH --job-name=prueba            # Job name
+    #SBATCH --nodes=1                    # -N Run all processes on a single node
+    #SBATCH --ntasks=1                   # -n Run a single task
+    #SBATCH --cpus-per-task=1            # -c Run 1 processor per task
+    #SBATCH --mem=1gb                    # Job memory request
+    #SBATCH --time=00:05:00              # Time limit hrs:min:sec
+    #SBATCH --qos=urgent                 # Queue (QOS)
+    #SBATCH --output=prueba_%j.log       # Standard output and error log
+    echo "Hello World!"
+hpc-login2 ~]$ sbatch trabajo_ejemplo.sh
+</code>
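+Because the shared file system is slow with many small files and the local scratch is deleted when the job ends, one possible pattern is to stage data into %%$LOCAL_SCRATCH%%, compute there, and copy the results back before the script finishes. The sketch below is one way to do it; the function name, the file names and the ''wc'' placeholder computation are assumptions, and the ''mktemp'' fallback only exists so the script can be tried outside SLURM.
+<code bash>
+#!/bin/bash
+#SBATCH --job-name=scratch_demo
+#SBATCH --ntasks=1
+#SBATCH --mem=1gb
+#SBATCH --time=00:10:00
+#SBATCH --output=scratch_demo_%j.log
+
+# Copy the input to node-local scratch, work there, and copy the result back.
+run_in_scratch() {
+    local input=$1 outdir=$2
+    local workdir=${LOCAL_SCRATCH:-$(mktemp -d)}
+    cp "$input" "$workdir/" || return 1
+    # Placeholder computation: count the lines of the input file
+    ( cd "$workdir" && wc -l < "$(basename "$input")" > result.txt ) || return 1
+    mkdir -p "$outdir" && cp "$workdir/result.txt" "$outdir/"
+}
+
+# Example call (hypothetical paths):
+# run_in_scratch "$HOME/input.dat" "$HOME/results"
+</code>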
+==== Using the GPU nodes ====
+To specifically request a GPU allocation for a job, add the following options to sbatch or srun:
+|  %%--gres%%  |  GPU request per NODE  |  %%--gres=gpu[[:type]:count],...%%  |
+|  %%--gpus or -G%%  |  GPU request per JOB  |  %%--gpus=[type]:count,...%%  |
+The options %% --gpus-per-socket, --gpus-per-node and --gpus-per-task%% are also available.\\
+Examples:
+<code bash>
+## List the nodes and GPUs:
+hpc-login2 ~]$ ver_recursos
+## To request any 2 GPUs for a JOB, add:
+--gpus=2
+## To request a 40G A100 on one node and an 80G A100 on another, add:
+--gres=gpu:A100_40:1,gpu:A100_80:1
+</code>
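+A complete batch script that requests a GPU could be sketched as follows. The job name and resource figures are hypothetical, and it is assumed that ''nvidia-smi'' is available on the GPU nodes.
+<code bash>
+#!/bin/bash
+#SBATCH --job-name=gpu_test
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=16gb
+#SBATCH --time=00:10:00
+#SBATCH --gres=gpu:A100_40:1         # One 40 GB A100 on the node
+#SBATCH --output=gpu_test_%j.log
+# SLURM normally exports the indices of the allocated GPUs in this variable
+echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES:-none}"
+# Show the GPUs visible to the job
+nvidia-smi
+</code>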
+==== Job monitoring ====
+<code bash>
+## List all jobs in the queue
+hpc-login2 ~]$ squeue
+## List the jobs of a user
+hpc-login2 ~]$ squeue -u <login>
+## Cancel a job:
+hpc-login2 ~]$ scancel <JOBID>
+## List recent jobs
+hpc-login2 ~]$ sacct -b
+## Detailed historical information about a job:
+hpc-login2 ~]$ sacct -l -j <JOBID>
+## Debug information about a job for troubleshooting:
+hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
+## See the resource usage of a running job:
+hpc-login2 ~]$ sstat <JOBID>
+</code>
+==== Controlling job output ====
+== Exit codes ==
+By default these are the exit codes of the commands:
+^  SLURM command  ^  Exit code  ^
+|  salloc  |  0 on success, 1 if the user's command could not be executed  |
+|  srun  |  The highest among all the tasks executed, or 253 for an out-of-memory error  |
+|  sbatch  |  0 on success; otherwise, the exit code of the process that failed  |
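+In a script, these codes can be inspected through ''$?'' right after the command. A minimal sketch, with a hypothetical helper name and program:
+<code bash>
+# Translate an srun exit code into a short message
+describe_exit_code() {
+    case "$1" in
+        0)   echo "success" ;;
+        253) echo "out-of-memory error" ;;
+        *)   echo "failed with exit code $1" ;;
+    esac
+}
+# Typical use inside a job script (my_program is a placeholder):
+# srun ./my_program; describe_exit_code $?
+</code>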
+== STDIN, STDOUT and STDERR ==
+**SRUN:**\\
+By default stdout and stderr of all TASKS are redirected to the stdout and stderr of srun, and stdin is redirected from the stdin of srun to all TASKS. This can be changed with:
+|  %%-i, --input=<option>%%    |
+|  %%-o, --output=<option>%%   |
+|  %%-e, --error=<option>%%   |
+And the options are:
+  * //all//: default option.
+  * //none//: Nothing is redirected.
+  * //taskid//: Redirects only from and/or to the specified TASK id.
+  * //filename//: Redirects everything from and/or to the specified file.
+  * //filename pattern//: Same as filename but with a file defined by a [[ https://slurm.schedmd.com/srun.html#OPT_filename-pattern | pattern ]]
+**SBATCH:**\\
+By default "/dev/null" is open on the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with:
+|  %%-i, --input=<filename_pattern>%%  |
+|  %%-o, --output=<filename_pattern>%%  |
+|  %%-e, --error=<filename_pattern>%%  |
+The filename_pattern reference is [[ https://slurm.schedmd.com/sbatch.html#SECTION_%3CB%3Efilename-pattern%3C/B%3E | here ]].
+==== Sending e-mail ====
+JOBS can be configured to send e-mail in certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
+|  %%--mail-type=<type>%%  |  Options: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, TIME_LIMIT_50.  |
+|  %%--mail-user=<user>%%  |  The destination e-mail address.  |
+==== Job states in the queue system ====
+<code bash>
+hpc-login2 ~]# squeue -l
+JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
+  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1
+</code>
+Most common states (STATE) of a job:
+  * R RUNNING Job currently has an allocation.
+  * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
+  * F FAILED Job terminated with non-zero exit code or other failure condition.
+  * PD PENDING Job is awaiting resource allocation.
+[[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-STATE-CODES | Full list of possible job states ]].\\
+If a job is not running, a reason is shown under REASON: [[ https://slurm.schedmd.com/squeue.html#SECTION_JOB-REASON-CODES | list of reasons ]] why a job may be waiting to run.
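+To follow a job until it leaves the queue, the state can be polled from a script. The following is a sketch under the assumption that a 30-second polling interval is acceptable; the script name is the one from the sbatch example.
+<code bash>
+# Submit the script and keep only the job id (--parsable prints just the id)
+jobid=$(sbatch --parsable trabajo_ejemplo.sh)
+# Poll every 30 seconds while the job is still listed by squeue
+while state=$(squeue -h -j "$jobid" -o %T 2>/dev/null) && [ -n "$state" ]; do
+    echo "Job $jobid is $state"
+    sleep 30
+done
+# Once it has left the queue, show its final state and exit code
+sacct -j "$jobid" -o JobID,State,ExitCode
+</code>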
-  * Processes: Maximum number of processes by job in this queue.
-  * Nodes: Maximum numbers of nodes in which the job will be executed.
-  * Memory: Maximum virtual memory concurrently used by all the job processes.
-  * Jobs/user: Maximum number of jobs per user regardless of their state.
-  * Maximum time (hours): Maximum real time during which the job can be in the execution state.
-  * Priority: Priority of the execution queue related to the other queues. A higher value means more priority. Please note that lacking other criteria, any job sent with qsub will by default be executed in np1 using its limits.