| hpc-node[3-9] |
| hpc-fat1      |
| hpc-gpu[1-2]  |
| hpc-gpu3      |
| hpc-gpu4      |
===== Accessing the cluster =====
To access the cluster, access must be requested in advance via [[https:// ]].
Access is through an SSH connection to the login node (172.16.242.211):
<code bash>
ssh <username>@172.16.242.211
</code>
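Optionally, you can install your public SSH key on the login node so that the password is not requested on every connection. This is a minimal sketch: it assumes you connect from your own machine and ''myuser'' is a placeholder username.
<code bash>
# Generate a key pair on your own machine if you do not have one yet
ssh-keygen -t ed25519
# Copy the public key to the login node ("myuser" is a placeholder)
ssh-copy-id myuser@172.16.242.211
</code>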
  * Python 3.6.8
  * Perl 5.26.3
GPU nodes, in addition:
  * nVidia Driver 510.47.03
  * CUDA 11.6
  * libcudnn 8.7
To use any other software not installed on the system, or a different version of one that is, there are three options:
  - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available); see the example below.
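As an illustration of option 1, modules are listed and loaded with the usual Lmod-style commands (the module name and version below are only examples; the actual list is whatever ''ml av'' reports on the cluster):
<code bash>
# List the available modules
hpc-login2 ~]$ ml av
# Load one of them (the name/version is only an example)
hpc-login2 ~]$ ml Python/3.9
# Show what is currently loaded
hpc-login2 ~]$ ml list
</code>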
==== uDocker ====
[[ https:// ]]
udocker is installed as a module, so it must be loaded into the environment first:
<code bash>
ml uDocker
</code>
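As an illustration only, a typical udocker workflow after loading the module is to pull an image, create a container and run a command in it (the image and container names below are just examples):
<code bash>
# Pull an image from Docker Hub (the image is only an example)
udocker pull ubuntu:22.04
# Create a container from the image
udocker create --name=myubuntu ubuntu:22.04
# Run a command inside the container
udocker run myubuntu cat /etc/os-release
</code>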
<code bash>
# Getting miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize for bash shell
~/miniconda3/bin/conda init bash
</code>
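Once conda is initialized, environments are created and activated in the usual way (the environment name and package versions below are only examples):
<code bash>
# Create an isolated environment with a given Python version (values are examples)
conda create -n myenv python=3.9 numpy
# Activate it before running your code
conda activate myenv
# Deactivate it when finished
conda deactivate
</code>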
== Available resources ==
<code bash>
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
  NODO ...
=============================================================================================================
  (one line per node showing its state and the cores and memory currently in use)
=============================================================================================================
TOTALES: [Cores : 3/688] [Mem(MB): 270000/...]
hpc-login2 ~]$ sinfo -e -o "..."
# There is an alias for that command:
</code>
<code bash>
# There is an alias that shows only the relevant info:
hpc-login2 ~]$ ver_colas
Name         Priority   ...
----------   ---------  ...
...
interactive  ...
urgent       ...
long         ...
...
</code>
# Priority: the relative priority of each queue. \\
==== Sending a job to the queue system ====
== Requesting resources ==
By default, if you submit a job without specifying anything, the system submits it to the default (regular) QOS and assigns it one node, one CPU and 4 GB of memory. The time limit for job execution is that of the queue (4 days and 4 hours).
This is very inefficient; ideally you should specify, as far as possible, at least the following (see the example after this list):
  - %%Node number (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
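For example, a submission that specifies these parameters explicitly could look like this (a sketch only; the script name and all resource values are illustrative):
<code bash>
# 1 node, 4 tasks, 1 CPU per task, 8 GB of memory and a 2-hour limit (illustrative values)
hpc-login2 ~]$ sbatch -N 1 -n 4 -c 1 --mem=8G --time=02:00:00 my_job.sh
</code>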
== Job submission ==
  - sbatch
  - salloc
  - srun
1. SBATCH \\
It is used to submit a script to the queuing system. Processing is batch (deferred) and non-blocking.
<code bash>
hpc-login2 ~]$ sbatch test_job.sh
</code>
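As a reference, a batch script like ''test_job.sh'' typically contains a few #SBATCH directives followed by the commands to run. The sketch below is only illustrative (directives, values and commands are examples, not the actual contents of ''test_job.sh''):
<code bash>
#!/bin/bash
#SBATCH --job-name=test_job     # job name (illustrative)
#SBATCH --ntasks=1              # number of tasks
#SBATCH --cpus-per-task=4       # CPUs per task
#SBATCH --mem=8G                # memory for the job
#SBATCH --time=01:00:00         # time limit (hh:mm:ss)

# Commands executed on the allocated node
hostname
</code>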
2. SALLOC \\
It is used to immediately obtain an allocation of resources (nodes). As soon as it is obtained, the specified command or a shell is executed.
<code bash>
# Get 5 nodes and launch a job.
hpc-login2 ~]$ salloc -N5 myprogram
# Get interactive access to a node (press Ctrl+D to exit):
hpc-login2 ~]$ salloc -N1
# Get interactive EXCLUSIVE access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
</code>
3. SRUN \\
It is used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
<code bash>
# Launch the hostname command on 2 nodes
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
</code>
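srun is also the usual way to start the parallel steps inside a batch script (a sketch assuming an MPI-style program; ''./my_mpi_program'' and the resource values are placeholders):
<code bash>
#!/bin/bash
#SBATCH --nodes=2               # illustrative values
#SBATCH --ntasks-per-node=4

# srun launches the tasks on the allocated nodes (instead of mpirun);
# "./my_mpi_program" is a placeholder for your own binary
srun ./my_mpi_program
</code>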
==== GPU use ====
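As a general Slurm illustration (not necessarily the exact local policy), GPUs are requested with the --gres option; the script name below is a placeholder:
<code bash>
# Illustrative only: request one GPU for a batch job
hpc-login2 ~]$ sbatch --gres=gpu:1 my_gpu_job.sh
# Or an interactive session on a GPU node
hpc-login2 ~]$ salloc -N1 --gres=gpu:1
</code>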
==== Status of jobs in the queuing system ====
<code bash>
hpc-login2 ~]# squeue -l
JOBID PARTITION ...
 6547 defaultPa ...

## Check status of queue use:
hpc-login2 ~]$ estado_colas.sh
JOBS PER USER:
--------------
  ...

JOBS PER QOS:
--------------
  ...
  long: 1

JOBS PER STATE:
--------------
  ...
==========================================
Total JOBS in cluster: ...
</code>
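To inspect or cancel a specific job, the standard Slurm commands can be used (the job id below is just the one from the example output above):
<code bash>
# Show the details of a particular job
hpc-login2 ~]$ scontrol show job 6547
# Cancel it if necessary
hpc-login2 ~]$ scancel 6547
</code>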
Common job states: