Differences

This shows you the differences between two versions of the page.

en:centro:servizos:hpc [2023/03/03 13:57] – [Queue system (QOS)] fernando.guillen en:centro:servizos:hpc [2024/03/12 11:12] – [CONDA] fernando.guillen
Line 163: Line 163:
 # Install  # Install 
 sh Miniconda3-py39_4.11.0-Linux-x86_64.sh sh Miniconda3-py39_4.11.0-Linux-x86_64.sh
 +#  Initialize for bash shell
 +~/miniconda3/bin/conda init bash
 </code> </code>
  
Line 281: Line 283:
 ==== Sending a job to the queue system ==== ==== Sending a job to the queue system ====
 == Requesting resources == == Requesting resources ==
-By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it a node, a CPU and all available memory. The time limit for job execution is that of the queue (4 days and 4 hours). +By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it a node, a CPU and 4 GB. The time limit for job execution is that of the queue (4 days and 4 hours). 
 This is very inefficient. Ideally, specify at least three parameters as precisely as possible when submitting jobs: This is very inefficient. Ideally, specify at least three parameters as precisely as possible when submitting jobs:
   -  %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%   -  %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
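
As a minimal sketch of such a request, the batch script below sets the node, task and CPU counts together with a memory amount and a time limit. The job name and all values are illustrative assumptions, not site recommendations; ''--mem'' and ''--time'' are the standard Slurm options for the memory and time limits mentioned above.

<code bash>
#!/bin/bash
#SBATCH --job-name=example      # hypothetical job name
#SBATCH --nodes=1               # -N: number of nodes
#SBATCH --ntasks=1              # -n: number of tasks
#SBATCH --cpus-per-task=4       # -c: CPUs per task (illustrative value)
#SBATCH --mem=8G                # memory per node (illustrative value)
#SBATCH --time=02:00:00         # time limit, hh:mm:ss (illustrative value)

# Slurm exports SLURM_CPUS_PER_TASK when -c is given; fall back to 1 otherwise.
cpus="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(uname -n) with ${cpus} CPU(s) per task"
</code>

Requesting only what the job needs usually lets the scheduler place it sooner than a request that takes the defaults.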
Line 318: Line 320:
  
 == Job submission == == Job submission ==
 +  - sbatch
   - salloc   - salloc
   - srun   - srun
-  - sbatch 
  
-1. SALLOC \\ +1. SBATCH \\
-It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.  +
-<code bash> +
-# Get 5 nodes and launch a job. +
-hpc-login2 ~]$ salloc -N5 myprogram +
-# Get interactive access to a node (Press Ctrl+D to exit): +
-hpc-login2 ~]$ salloc -N1  +
-</code> +
-2. SRUN \\ +
-It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking. +
-<code bash> +
-# Launch the hostname command on 2 nodes +
-hpc-login2 ~]$ srun -N2 hostname +
-hpc-node1 +
-hpc-node2 +
-</code> +
-3. SBATCH \\+
 Used to send a script to the queuing system. It is batch-processing and non-blocking. Used to send a script to the queuing system. It is batch-processing and non-blocking.
 <code bash> <code bash>
Line 357: Line 343:
 hpc-login2 ~]$ sbatch test_job.sh  hpc-login2 ~]$ sbatch test_job.sh 
 </code> </code>
 +2. SALLOC \\
 +It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed. 
 +<code bash>
 +# Get 5 nodes and launch a job.
 +hpc-login2 ~]$ salloc -N5 myprogram
 +# Get interactive access to a node (Press Ctrl+D to exit):
 +hpc-login2 ~]$ salloc -N1 
 +</code>
 +3. SRUN \\
 +It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
 +<code bash>
 +# Launch the hostname command on 2 nodes
 +hpc-login2 ~]$ srun -N2 hostname
 +hpc-node1
 +hpc-node2
 +</code>
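
Before calling ''srun'' interactively, it can help to confirm that the current shell holds an allocation obtained with ''salloc''. The sketch below relies on the ''SLURM_JOB_ID'' variable that Slurm sets inside an allocation; ''in_allocation'' is a hypothetical helper name, not a Slurm command.

<code bash>
# in_allocation: succeed if this shell runs inside a Slurm allocation.
# Hypothetical helper; it only checks the SLURM_JOB_ID environment variable.
in_allocation() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if in_allocation; then
    echo "Inside allocation ${SLURM_JOB_ID}"
else
    echo "No allocation: request one with salloc first"
fi
</code>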
 +
  
 ==== GPU use ==== ==== GPU use ====
Line 431: Line 434:
 JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON) JOBID PARTITION     NAME     USER      STATE       TIME  NODES NODELIST(REASON)
 6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1 6547  defaultPa  example <username>  RUNNING   22:54:55      1 hpc-fat1
 +
 +## Check status of queue use:
 +hpc-login2 ~]$ estado_colas.sh
 +JOBS PER USER:
 +--------------
 +       usuario.uno:  3
 +       usuario.dos:  1
 +
 +JOBS PER QOS:
 +--------------
 +             regular:  3
 +                long:  1
 +
 +JOBS PER STATE:
 +--------------
 +             RUNNING:  3
 +             PENDING:  1
 +==========================================
 +Total JOBS in cluster:  4
 </code> </code>
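
A per-state or per-QOS tally like the one above can be approximated with plain ''squeue'' output. The sketch below is an assumption about how such a summary might be built, not the actual ''estado_colas.sh'' script; ''tally'' is a hypothetical helper name.

<code bash>
# tally: count identical lines read from stdin and print "value:  count".
# Hypothetical helper; not part of estado_colas.sh.
tally() {
    sort | uniq -c | awk '{ printf "%20s:  %d\n", $2, $1 }'
}

# Only query Slurm where squeue is available.
if command -v squeue >/dev/null 2>&1; then
    echo "JOBS PER STATE:"
    squeue -h -o "%T" | tally     # %T: job state (RUNNING, PENDING, ...)
    echo "JOBS PER QOS:"
    squeue -h -o "%q" | tally     # %q: QOS of the job
fi
</code>

Because ''tally'' reads from standard input, it can be tried without a cluster, for example with ''printf 'RUNNING\nPENDING\n' | tally''.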
 Common job states: Common job states: