====== High-Performance Computing (HPC) Cluster ======
[[ https:// ]]
===== Description =====
The cluster is made up of:
  * 9 servers for general computing.
  * 1 "fat node" for memory-intensive tasks.
  * 6 servers for GPU computing.
Users only have direct access to the login node, which has more limited resources. \\
All nodes are interconnected by a 10Gb network. \\
There is distributed storage accessible from all nodes, with a capacity of 220 TB. \\
\\
^ Name ^ Model ^ Processor ^ ... ^
===== Connection to the system =====
Access to the cluster must be requested in advance. \\
Access is done via an SSH connection to the login node (172.16.242.211):
<code bash>
ssh <username>@172.16.242.211
</code>
===== Storage, directories, and file systems =====
<note warning>No backup is made of any of the cluster's file systems!!</note>
Users' HOME on the cluster is on the shared file system, so it is accessible from all nodes of the cluster. \\
Each node has a local 1 TB scratch partition, which is deleted after each job finishes. \\
For data that must be shared by a group of users there are group folders. \\
^ Directory ^ ... ^
| Home | %%$HOME%% |
| Local Scratch | ... |
| Group Folder | ... |
%%* storage is shared%%
=== IMPORTANT NOTICE ===
The shared file system does not handle large numbers of small files well. If you need to work with many small files, create an image file, format it, and mount it as follows:
  * Create the image file in your home:
<code bash>
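## Illustrative example of creating an empty 10 GB image (adjust name and size as needed):
truncate -s 10G example.ext4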
## mkfs.ext4 -T small -m 0 image.name
## -T small: options optimized for small files
## -m 0: do not reserve space for the superuser
mkfs.ext4 -T small -m 0 example.ext4
</code>
  * Mount the image (using SUDO) with the mount_image.py script:
<code bash>
## By default, it is mounted in read-only mode:
sudo mount_image.py example.ext4
</code>
<note warning>
The image file can only be mounted from a single node in read-write mode, but it can be mounted from any number of nodes in read-only mode.
</note>
The mount script has these options:
<code>
--mount-point path  <-- (optional) Path where the image is mounted; if it is not given, a default mount point is used.
--rw                <-- (optional) By default, it is mounted read-only; with this option, it is mounted read-write.
</code>
The unmount script has these options:
<code>
--mount-point path  <-- (optional)
</code>
===== File and data transfer =====
=== SCP ===
From your local machine to the cluster:
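The commands below are an illustrative sketch, assuming the login node address given above:
<code bash>
# Upload a file from your local machine to your cluster home:
scp myfile.txt <username>@172.16.242.211:~/
# Download a file from the cluster to the current local directory:
scp <username>@172.16.242.211:~/myfile.txt .
</code>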
[[https:// ]]
=== SFTP ===
To transfer multiple files or to navigate the file system.
<code bash>
sftp <username>@172.16.242.211
</code>
[[https:// ]]
=== RSYNC ===
[[ https:// ]]
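For example (an illustrative invocation; adjust paths to your case):
<code bash>
# Synchronize a local directory into your cluster home, showing progress:
rsync -avh --progress ./mydata <username>@172.16.242.211:~/
</code>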
=== SSHFS ===
Requires installation of the sshfs package. \\
Allows, for example, mounting the user's cluster home on their local machine:
<code bash>
## Mount
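## (Illustrative commands; adjust user and local directory to your case)
sshfs <username>@172.16.242.211: <local_directory>
## Unmount
fusermount -u <local_directory>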
</code>
===== Available Software =====
All nodes have the basic software installed by default with AlmaLinux 8.4, in particular:
  * GCC 8.5.0
  * Python 3.6.8
  * ...
  * libcudnn 8.7
To use any other software not installed on the system or another version of it, there are three options:
  - Use Modules with the modules already installed (or request the installation of a new module if it is not available)
  - Use a container (uDocker or Apptainer/Singularity)
  - Use Conda
A module is the simplest solution for using software without modifications or difficult-to-satisfy dependencies. \\
A container is ideal when dependencies are complicated and/or the software is highly customized. It is also the best solution if the goal is reproducibility. \\
Conda is the best solution when you need the latest version of a library or program, or packages that are not otherwise available. \\
==== Using modules/Lmod ====
[[ https:// ]]
<code bash>
# View available modules:
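module avail
# Load a module:
module load <module_name>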
# Unload a module:
module unload <module_name>
# View loaded modules:
module list
# ml can be used as an abbreviation for the module command:
ml avail
# To get information about a module:
ml spider <module_name>
</code>
==== Running software containers ====
=== uDocker ===
[[ https:// ]]
uDocker is installed as a module, so it is necessary to load it into the environment:
<code bash>
ml udocker
</code>
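As an illustration of a typical workflow once the module is loaded (image and container names are only examples):
<code bash>
# Pull an image from Docker Hub:
udocker pull ubuntu:22.04
# Create a named container from the image:
udocker create --name=myubuntu ubuntu:22.04
# Run a command inside the container:
udocker run myubuntu cat /etc/os-release
</code>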
=== Apptainer/Singularity ===
[[ https:// ]]
Apptainer is installed on all the nodes.
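A minimal usage sketch (the image is just an example):
<code bash>
# Download an image from Docker Hub as a SIF file:
apptainer pull ubuntu.sif docker://ubuntu:22.04
# Run a command inside the container:
apptainer exec ubuntu.sif cat /etc/os-release
</code>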
==== CONDA ====
[[ https:// ]]
Miniconda is the minimal installation of Conda: it includes only conda, Python, and a few basic packages.
<code bash>
# Get miniconda
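## (Illustrative download; check the Miniconda page for the current installer name)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install it in your home:
sh Miniconda3-latest-Linux-x86_64.sh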
</code>
===== Using SLURM =====
The job manager in the cluster is [[ https:// ]].
<note tip>The term CPU identifies a physical core of a socket. Hyperthreading is disabled, so each node has as many available CPUs as (number of sockets) * (number of physical cores per socket); for example, a node with 2 sockets of 18 cores each provides 36 CPUs.</note>
== Available resources ==
<code bash>
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
  NODE ...
=============================================================================================================
  ...
=============================================================================================================
  TOTALS: [Cores : 3/688] [Mem(MB): 270000/...]

hpc-login2 ~]$ sinfo -e -o "..."
# There is an alias for this command:
hpc-login2 ~]$ ver_recursos
NODELIST       ...
hpc-fat1       ...
hpc-gpu[1-2]   ...
hpc-gpu3       ...
hpc-gpu4       ...
hpc-gpu[5-6]   ...
hpc-node[1-2]  ...
hpc-node[3-9]  ...

hpc-login2 ~]$ sinfo -N -r -O NodeList,...
# There is an alias for this command:
hpc-login2 ~]$ ver_uso
NODELIST    ...
hpc-fat1    ...
hpc-gpu1    ...
hpc-gpu2    ...
hpc-gpu3    ...
hpc-gpu4    ...
hpc-gpu5    ...
hpc-gpu6    ...
hpc-node1   ...
hpc-node2   ...
...
</code>
==== Nodes ====
A node is the computing unit in SLURM and corresponds to a physical server.
<code bash>
# Show information about a node:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18 ...
</code>
==== Partitions ====
Partitions in SLURM are logical groups of nodes. In the cluster there is a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.
<code bash>
# Show partition information:
hpc-login2 ~]$ sinfo
defaultPartition* ...
# When ctgpgpu7 and 8 are incorporated into the cluster, they will appear as nodes hpc-gpu1 and 2 respectively.
</code>
==== Jobs ====
Jobs in SLURM are allocations of resources assigned to a user for a certain amount of time. \\
A job (JOB) consists of one or more steps (STEPS), each consisting of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that runs sequentially in a JOB and one TASK for each program that runs in parallel. Therefore, in the simplest case, such as launching a job that consists of running a single program, the JOB has a single STEP with a single TASK.
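For instance, a batch script that runs one serial program and then one parallel program generates one JOB with two STEPS, the second of them with 4 TASKS (a sketch):
<code bash>
#!/bin/bash
#SBATCH --ntasks=4
srun -n1 ./preprocess   # STEP 0: a single TASK
srun -n4 ./solver       # STEP 1: 4 TASKS in parallel
</code>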
==== Queue system (QOS) ====
The queue to which each job is submitted defines its priority, its limits, and also the relative "cost" of the resources it uses.
<code bash>
# Show queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the most relevant information:
hpc-login2 ~]$ show_queues
      Name ...
---------- ---------- ------------------------------ ----------- -------------------- --------- -----------
       ...
</code>
# Priority: the relative priority of each queue \\
# DenyLimit: the job does not run if it does not meet the limits of the queue \\
# UsageFactor: the relative cost of consuming resources in this queue \\
# MaxTRES: limits per job \\
# MaxWall: maximum time the job can run \\
# MaxTRESPU: global limits per user \\
# MaxJobsPU: maximum number of jobs that a user can have running \\
# MaxSubmitPU: maximum number of jobs that a user can have submitted (running or pending) \\
==== Submitting a job to the queue system ====
== Resource requests ==
By default, if a job is submitted without specifying anything, the system sends it to the default QOS (regular) and assigns it a node, one CPU, and 4 GB of RAM. The time limit for job execution is that of the queue (4 days and 4 hours). \\
This is very inefficient; as a minimum you should always specify (see the example below):
  - %%The number of nodes (-N or --nodes), tasks (-n or --ntasks), and/or CPUs per task (-c or --cpus-per-task).%%
  - %%The memory (--mem) per node or memory per CPU (--mem-per-cpu).%%
  - %%The estimated execution time of the job (--time)%%
Additionally, the following parameters may be of interest:
^ Option ^ Long form ^ Description ^
| -J | %%--job-name%% | Name for the job |
| -q | %%--qos%% | Queue (QOS) to which the job is submitted |
| -o | %%--output%% | File to which standard output is redirected |
| -C | %%--constraint%% | Request nodes with a specific feature |
| -w | %%--nodelist%% | Request specific nodes |
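For example, a submission combining the minimum requirements with some of these options might look like this (values are illustrative):
<code bash>
sbatch -J myjob -o myjob_%j.log -N1 -n1 -c8 --mem=16G --time=12:00:00 job.sh
</code>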
== How resources are assigned ==
By default, the allocation method between nodes is block allocation (all the cores of a node are allocated before moving on to another node), and the allocation method within a node is cyclic allocation (cores are distributed in a round-robin fashion among the sockets of the node). \\
== Calculating priority ==
When a job is submitted to the queue system, it is first checked whether the requested resources fit within the limits of the assigned queue. \\
If resources are available, the job runs immediately; if not, it is queued. Each job is assigned a priority that determines the order in which the queued jobs run. \\
The fair share is a dynamic calculation based on the resources that each user has consumed recently: the more resources consumed, the lower the priority of new jobs. \\
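For reference, SLURM's multifactor priority plugin combines several weighted factors; which factors and weights are active depends on the cluster configuration:
<code>
Job_priority = PriorityWeightAge       * age_factor
             + PriorityWeightFairshare * fairshare_factor
             + PriorityWeightJobSize   * jobsize_factor
             + PriorityWeightPartition * partition_factor
             + PriorityWeightQOS       * qos_factor
</code>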
<code bash>
hpc-login2 ~]$ sshare -l
...
user_name ...
</code>
# RawShares: the number of shares of the cluster's resources assigned to the user \\
# NormShares: the above value normalized to the total of assigned shares \\
# RawUsage: the amount of resources consumed by the user's jobs \\
# NormUsage: the above value normalized to the total resources consumed in the cluster \\
# FairShare: the resulting fair-share factor, between 0 and 1; the more resources consumed, the lower it gets \\
== Submitting jobs ==
  - sbatch
  - salloc
  - srun
1. SBATCH \\
Used to submit a script to the queue system. It is batch (deferred) processing and non-blocking.
<code bash>
# Create the script:
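## A minimal illustrative script (adapt names and resources to your job):
hpc-login2 ~]$ vim test_job.sh
    #!/bin/bash
    #SBATCH --job-name=test          # Name of the job
    #SBATCH --nodes=1                # 1 node
    #SBATCH --ntasks=1               # 1 task
    #SBATCH --cpus-per-task=4        # 4 CPUs for the task
    #SBATCH --mem=8G                 # 8 GB of RAM
    #SBATCH --time=04:00:00          # Time limit (hh:mm:ss)
    srun hostname
# Submit the script to the queue:
hpc-login2 ~]$ sbatch test_job.sh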
</code>
2. SALLOC \\
Used to obtain an immediate allocation of resources (nodes). As soon as it is obtained, the specified command is executed (or a shell, by default).
<code bash>
# Obtain 5 nodes and launch a job:
hpc-login2 ~]$ salloc -N5 myprogram
# Obtain interactive access to a node (press Ctrl+D to end the access):
hpc-login2 ~]$ salloc -N1
# Obtain EXCLUSIVE access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
</code>
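3. SRUN \\
Used to launch a parallel task or job step, either interactively from the login node (blocking) or from inside an sbatch script, where each call to srun creates a STEP. For example (illustrative):
<code bash>
# Run a command as 2 parallel tasks and wait for the result:
hpc-login2 ~]$ srun -n2 hostname
</code>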
==== Using nodes with GPU ====
To specifically request the allocation of GPUs, one of these options must be added to the job:
| %%--gres%% | Request of GPUs per NODE | %%--gres=gpu[:type]:count%% |
| %%--gpus or -G%% | Request of GPUs per JOB | %%--gpus=[type]:count%% |
There are also the options %%--gpus-per-socket, --gpus-per-node and --gpus-per-task%%.
Examples:
<code bash>
## View the list of nodes and GPUs:
hpc-login2 ~]$ show_resources
## Request 2 GPUs of any type for a JOB, add:
--gpus=2
## Request one A100 of 40G on one node and one A100 of 80G on another, add:
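## (illustrative syntax; substitute the GPU type names reported by show_resources)
--gpus=A100_40:1,A100_80:1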
</code>
==== Monitoring jobs ====
<code bash>
## List all jobs in the queue:
hpc-login2 ~]$ squeue
## List the jobs of a user:
hpc-login2 ~]$ squeue -u <username>
## Cancel a job:
hpc-login2 ~]$ scancel <job_id>
## List recent jobs:
hpc-login2 ~]$ sacct -b
## Detailed historical information of a job:
hpc-login2 ~]$ sacct -l -j <job_id>
## Debug information of a job for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <job_id>
</code>
By default, these are the exit codes of the commands:
^ SLURM command ^ Exit code ^
| salloc | Exit code of the executed command |
| srun | The highest among the exit codes of all the tasks executed |
| sbatch | Exit code of the batch script |
== STDIN, STDOUT, and STDERR ==
**SRUN:**\\
By default, stdout and stderr of all TASKS are redirected to the stdout and stderr of srun itself. This can be changed with:
| %%-i, --input=<option>%% | stdin redirection |
| %%-o, --output=<option>%% | stdout redirection |
| %%-e, --error=<option>%% | stderr redirection |
And the options are:
  * //all//: default option.
  * //none//: nothing is redirected.
  * //taskid//: redirects only to/from the task with the given ID.
  * //filename//: redirects everything to/from the specified file.
  * //filename pattern//: like //filename//, but using a pattern that can contain replacement symbols such as %j (JobID) or %t (task ID).
**SBATCH:**\\
By default, "/dev/null" is redirected to stdin, and stdout and stderr are redirected to a file named "slurm-%j.out" (%j is the JobID). This can be changed with:
| %%-i, --input=<filename_pattern>%% |
| %%-o, --output=<filename_pattern>%% |
| %%-e, --error=<filename_pattern>%% |
The reference for the //filename pattern// replacement symbols is in the sbatch documentation.
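For example (illustrative), to get one output file per job, named with the JobID:
<code bash>
#SBATCH --output=result_%j.log
</code>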
==== Sending emails ====
Jobs can be configured to send emails under certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
| %%--mail-type=<type>%% | Valid values include: BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT |
| %%--mail-user=<email_address>%% | Destination email address |
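For example (illustrative):
<code bash>
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=username@example.com
</code>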
==== Job states ====
<code bash>
hpc-login2 ~]# squeue -l
...
 6547 defaultPa ...
## Check the queue usage status of the cluster:
hpc-login2 ~]$ queue_status.sh
JOBS PER USER:
--------------
       user.one: 3
       user.two: 1

JOBS PER QOS:
--------------
       ...

Total JOBS in cluster: ...
</code>
Most common job states (STATE):
  * R RUNNING Job currently has an allocation.
  * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
  * F FAILED Job terminated with a non-zero exit code or other failure condition.
  * PD PENDING Job is awaiting resource allocation.
[[ https:// ]]
If a job is not running, squeue shows the reason in the NODELIST(REASON) column.