| hpc-node[3-9] |
| hpc-fat1      |
| hpc-gpu[1-2]  |
| hpc-gpu3      |
| hpc-gpu4      |
===== Accessing the cluster =====
To access the cluster, access must be requested in advance via [[https:// ]].
Access is through an SSH connection to the login node (172.16.242.211):
<code bash>
ssh <username>@172.16.242.211
</code>
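Optionally, you can install your public SSH key on the login node so that the password is not requested on every connection. This is a minimal sketch: it assumes you connect from your own machine and ''myuser'' is a placeholder username.
<code bash>
# Generate a key pair on your own machine if you do not have one yet
ssh-keygen -t ed25519
# Copy the public key to the login node ("myuser" is a placeholder)
ssh-copy-id myuser@172.16.242.211
</code>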
  * Python 3.6.8
  * Perl 5.26.3
GPU nodes, in addition:
  * nVidia Driver 510.47.03
  * CUDA 11.6
  * libcudnn 8.7
To use any other software not installed on the system, or a different version of one that is, there are three options:
  - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available); see the example below.
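As an illustration of option 1, modules are listed and loaded with the usual Lmod-style commands (the module name and version below are only examples; the actual list is whatever ''ml av'' reports on the cluster):
<code bash>
# List the available modules
hpc-login2 ~]$ ml av
# Load one of them (the name/version is only an example)
hpc-login2 ~]$ ml Python/3.9
# Show what is currently loaded
hpc-login2 ~]$ ml list
</code>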
==== uDocker ====
[[ https:// ]]
udocker is installed as a module, so it must be loaded into the environment first:
<code bash>
ml uDocker
</code>
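As an illustration only, a typical udocker workflow after loading the module is to pull an image, create a container and run a command in it (the image and container names below are just examples):
<code bash>
# Pull an image from Docker Hub (the image is only an example)
udocker pull ubuntu:22.04
# Create a container from the image
udocker create --name=myubuntu ubuntu:22.04
# Run a command inside the container
udocker run myubuntu cat /etc/os-release
</code>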
<code bash>
# Getting miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize for bash shell
~/miniconda3/bin/conda init bash
</code>
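Once conda is initialized, environments are created and activated in the usual way (the environment name and package versions below are only examples):
<code bash>
# Create an isolated environment with a given Python version (values are examples)
conda create -n myenv python=3.9 numpy
# Activate it before running your code
conda activate myenv
# Deactivate it when finished
conda deactivate
</code>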
== Available resources ==
<code bash>
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
  NODO ...
=============================================================================================================
  (one line per node showing its state and the cores and memory currently in use)
=============================================================================================================
TOTALES: [Cores : 3/688] [Mem(MB): 270000/...]
hpc-login2 ~]$ sinfo -e -o "..."
# There is an alias for that command:
</code>
<code bash>
# There is an alias that shows only the relevant info:
hpc-login2 ~]$ ver_colas
Name         Priority   ...
----------   ---------  ...
...
interactive  ...
urgent       ...
long         ...
...
</code>
# Priority: the relative priority of each queue. \\
==== Sending a job to the queue system ====
== Requesting resources ==
By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it one node, one CPU and 4 GB of memory. The time limit for job execution is that of the queue (4 days and 4 hours).
This is very inefficient; ideally you should specify, as far as possible, at least the following (see the example after this list):
  - %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
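For example, a submission that specifies these parameters explicitly could look like this (a sketch only; the script name and all resource values are illustrative):
<code bash>
# 1 node, 4 tasks, 1 CPU per task, 8 GB of memory and a 2-hour limit (illustrative values)
hpc-login2 ~]$ sbatch -N 1 -n 4 -c 1 --mem=8G --time=02:00:00 my_job.sh
</code>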
== Job submission ==
  - sbatch
  - salloc
  - srun
1. SBATCH \\
It is used to submit a script to the queuing system. Processing is batch (deferred) and non-blocking.
<code bash>
hpc-login2 ~]$ sbatch test_job.sh
</code>
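As a reference, a batch script like ''test_job.sh'' typically contains a few #SBATCH directives followed by the commands to run. The sketch below is only illustrative (directives, values and commands are examples, not the actual contents of ''test_job.sh''):
<code bash>
#!/bin/bash
#SBATCH --job-name=test_job     # job name (illustrative)
#SBATCH --ntasks=1              # number of tasks
#SBATCH --cpus-per-task=4       # CPUs per task
#SBATCH --mem=8G                # memory for the job
#SBATCH --time=01:00:00         # time limit (hh:mm:ss)

# Commands executed on the allocated node
hostname
</code>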
2. SALLOC \\
It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.
<code bash>
# Get 5 nodes and launch a job.
hpc-login2 ~]$ salloc -N5 myprogram
# Get interactive access to a node (press Ctrl+D to exit):
hpc-login2 ~]$ salloc -N1
# Get interactive EXCLUSIVE access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
</code>
3. SRUN \\
It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
<code bash>
# Launch the hostname command on 2 nodes
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
</code>
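srun is also the usual way to start the parallel steps inside a batch script (a sketch assuming an MPI-style program; ''./my_mpi_program'' and the resource values are placeholders):
<code bash>
#!/bin/bash
#SBATCH --nodes=2               # illustrative values
#SBATCH --ntasks-per-node=4

# srun launches the tasks on the allocated nodes (instead of mpirun);
# "./my_mpi_program" is a placeholder for your own binary
srun ./my_mpi_program
</code>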
==== GPU use ====
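As a general Slurm illustration (not necessarily the exact local policy), GPUs are requested with the --gres option; the script name below is a placeholder:
<code bash>
# Illustrative only: request one GPU for a batch job
hpc-login2 ~]$ sbatch --gres=gpu:1 my_gpu_job.sh
# Or an interactive session on a GPU node
hpc-login2 ~]$ salloc -N1 --gres=gpu:1
</code>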
==== Status of jobs in the queuing system ====
<code bash>
hpc-login2 ~]# squeue -l
JOBID PARTITION ...
 6547 defaultPa ...

## Check status of queue use:
hpc-login2 ~]$ estado_colas.sh
JOBS PER USER:
--------------
  ...

JOBS PER QOS:
--------------
  ...
  long: 1

JOBS PER STATE:
--------------
  ...
==========================================
Total JOBS in cluster: ...
</code>
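To inspect or cancel a specific job, the standard Slurm commands can be used (the job id below is just the one from the example output above):
<code bash>
# Show the details of a particular job
hpc-login2 ~]$ scontrol show job 6547
# Cancel it if necessary
hpc-login2 ~]$ scancel 6547
</code>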
Common job states: