====== High-Performance Computing (HPC) cluster ======
[[ https:// ]]
===== Description =====
The computing part of the cluster is made up of:
  * 9 servers for general computation.
  * 1 "fat node" for memory-intensive jobs.
  * 6 servers for GPU computing.
Users only have direct access to the login node, which has more limited resources and is not intended for running computations. \\
All nodes are interconnected by a 10 Gb network. \\
There is distributed storage with a capacity of 220 TB, accessible from all nodes. \\
\\
^ Name ^ Model ^ Processor ^
| hpc-login2 | | |
| hpc-node[1-2] | | |
| hpc-node[3-9] | | |
| hpc-fat1 | | |
| hpc-gpu[1-2] | | |
| hpc-gpu3 | | |
| hpc-gpu4 | | |
| hpc-gpu5 | | |
| hpc-gpu6 | | |
===== Connecting to the system =====
To access the cluster, access must be requested in advance via [[ https:// ]].

Access is done through an SSH connection to the login node:
<code bash>
ssh <username>@hpc-login2
</code>
===== Storage, directories, and file systems =====
Users' HOME directories are on the shared file system, so they are accessible from all nodes of the cluster. Their path is defined in the environment variable %%$HOME%%. \\
Each node has a local scratch partition, the contents of which are deleted when the job finishes. \\
For data that must be shared by groups of users, there are shared group folders. \\
^ Directory ^ Path ^
| Home | %%$HOME%% |
| Local scratch | |
| Group folder | |
%%* storage is shared%%
=== IMPORTANT NOTICE ===
The shared file system does not handle large numbers of small files well. If you need to work with many small files, a better option is to pack them into an image file and mount it as a local file system:
  * Create the image file in your home directory:
<code bash>
## truncate image.name -s SIZE_IN_BYTES
truncate example.ext4 -s 20G
</code>
  * Create a file system inside the image file:
<code bash>
## mkfs.ext4 -T small -m 0 image.name
## -T small : options optimized for small files
## -m 0     : do not reserve space for the superuser
mkfs.ext4 -T small -m 0 example.ext4
</code>
  * Mount the image (using sudo) with the script //mount_image.py//:
<code bash>
## By default, it is mounted read-only:
sudo mount_image.py example.ext4
</code>
  * To unmount the image, use the script //umount_image.py//:
<code bash>
sudo umount_image.py
</code>
<note warning>
The image file can only be mounted from a single node in read-write mode, but it can be mounted from any number of nodes in read-only mode.
</note>
The mount script has these options:
<code>
--mount-point path <-- (optional) Path where the image is mounted.
--rw               <-- (optional) By default the image is mounted read-only; with this option it is mounted read-write.
</code>
The unmount script has these options:
<code>
--mount-point path <-- (optional) Path where the image to unmount is mounted.
</code>
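As an illustration, a possible end-to-end workflow could look like the following sketch. The image name, mount point and data paths are only examples, and the exact defaults of the mount/umount scripts may differ.
<code bash>
## Create and format a 20 GB image in the home directory (example names):
truncate datasets.ext4 -s 20G
mkfs.ext4 -T small -m 0 datasets.ext4
## Mount it read-write on a single node to copy the small files in (example paths):
sudo mount_image.py --rw --mount-point $HOME/mnt/datasets datasets.ext4
cp -r <directory_with_many_small_files>/* $HOME/mnt/datasets/
sudo umount_image.py --mount-point $HOME/mnt/datasets
## Later, jobs on any number of nodes can mount the same image read-only:
sudo mount_image.py --mount-point $HOME/mnt/datasets datasets.ext4
</code>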
===== File and data transfer =====
=== SCP ===
From your local machine to the cluster:
<code bash>
scp filename <username>@hpc-login2:<destination_directory>
</code>
[[ https:// ]]
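For the opposite direction (from the cluster to your local machine), the same command can be run from your local machine; the paths below are only placeholders.
<code bash>
## Copy a remote file from the cluster to a local directory:
scp <username>@hpc-login2:<remote_file> <local_destination_directory>
</code>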
=== SFTP ===
To transfer several files or to browse the file system:
<code bash>
<local>$ sftp <username>@hpc-login2
sftp> get remote_file
sftp> put local_file
sftp> quit
</code>
[[ https:// ]]
=== RSYNC ===
[[ https:// ]]
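As a sketch, rsync can be used much like scp but only transfers the differences, which is useful for large or repeated copies. The flags shown are common choices, not a site requirement, and the paths are placeholders.
<code bash>
## Copy a local directory to the cluster, preserving permissions and showing progress:
rsync -avh --progress <local_directory>/ <username>@hpc-login2:<destination_directory>/
## Running the same command again later only transfers the files that changed.
</code>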
=== SSHFS ===
Requires the installation of the sshfs package. \\
It allows you, for example, to mount the home directory on hpc-login2 from your local machine:
<code bash>
## Mount
<local>$ sshfs <username>@hpc-login2: <local_mount_point>
## Unmount
fusermount -u <local_mount_point>
</code>
[[ https:// ]]
===== Available Software =====
All nodes have the basic software that is installed by default with the operating system, including:
  * GCC 8.5.0
  * Python 3.6.8
  * Perl 5.26.3
On the nodes with GPU, additionally:
  * nVidia Driver 560.35.03
  * CUDA 11.6
  * libcudnn 8.7
To use any other software not installed on the system, or a different version of it, there are three options:
  - Use Modules with the modules that are already installed (or request the installation of a new module if it is not available)
  - Use a container (uDocker or Apptainer/Singularity)
  - Use Conda
A module is the simplest solution when the required software and version are already available. \\
A container is ideal when dependencies are complicated and/or the software is highly customized. \\
Conda is the best solution if you need the latest version of a library or program, or packages that are not available by other means. \\
==== Using modules/Lmod ====
[[ https:// ]]
<code bash>
# View available modules:
module avail
# Load a module:
module load <module_name>
# Unload a module:
module unload <module_name>
# View loaded modules:
module list
# ml can be used as a shorthand for the module command:
ml avail
# To get information about a module:
ml spider <module_name>
</code>
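When several versions of the same module exist, a specific one can be selected by appending the version, as in this sketch; the module name, version and program are placeholders.
<code bash>
## Check which versions of a module are available:
ml spider <module_name>
## Load a specific version:
ml <module_name>/<version>
## The software provided by the module is now first in the PATH:
which <program_provided_by_module>
</code>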
==== Running software in containers ====
=== uDocker ===
[[ https:// ]]
uDocker is installed as a module, so it must be loaded first:
<code bash>
ml udocker
</code>
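A minimal sketch of using uDocker once the module is loaded; the container image and command are only examples.
<code bash>
## Download an image from Docker Hub (example image):
udocker pull ubuntu:22.04
## Create a named container from the image:
udocker create --name=myubuntu ubuntu:22.04
## Run a command inside the container:
udocker run myubuntu cat /etc/os-release
</code>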
=== Apptainer/Singularity ===
[[ https://apptainer.org/docs/user/1.4/ | Apptainer documentation ]]
Apptainer is installed on the operating system of every node, so nothing needs to be loaded in order to use it.
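A minimal sketch of running a containerized program with Apptainer; the image and command are illustrative.
<code bash>
## Build a local SIF image from a Docker Hub image (example image):
apptainer pull python_3.12.sif docker://python:3.12
## Run a command inside the container:
apptainer exec python_3.12.sif python3 --version
</code>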
==== CONDA ====
[[ https:// ]]
Miniconda is the minimal conda installer; from it you can create environments and install the packages you need. \\
<code bash>
# Get miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install it
bash Miniconda3-latest-Linux-x86_64.sh
# Initialize miniconda for the bash shell
~/miniconda3/bin/conda init bash
</code>
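Once miniconda is initialized, environments are created and activated per project. The environment name, Python version and packages below are only examples.
<code bash>
## Create an environment with a specific Python version and some packages:
conda create -n myenv python=3.11 numpy scipy
## Activate it (also inside job scripts, before running the program):
conda activate myenv
## Deactivate it when finished:
conda deactivate
</code>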
| - | |||
===== Using SLURM =====
The cluster's job manager is [[ https:// | SLURM ]]. \\
<note tip>The term CPU refers to a physical core of a socket. Hyperthreading is disabled, so the number of CPUs available on each node is (number of sockets) x (number of physical cores per socket).</note>
== Available resources ==
<code bash>
hpc-login2 ~]# ver_estado.sh
=============================================================================================================
  NODE            ...
=============================================================================================================
  ...
=============================================================================================================
TOTALS: [Cores : 3/688] [Mem(MB): 270000/...]

hpc-login2 ~]$ sinfo -e -o "..."
# There is an alias for this command:
hpc-login2 ~]$ ver_recursos
NODELIST        ...
hpc-gpu[1-2]    ...
hpc-gpu3        ...
hpc-gpu4        ...
hpc-gpu[5-6]    ...
hpc-node[1-2]   ...
hpc-node[3-9]   ...
# To see current resource usage:
hpc-login2 ~]$ sinfo -N -r -O ...
# There is an alias for this command:
hpc-login2 ~]$ ver_uso
NODELIST        ...
hpc-fat1        ...
hpc-gpu1        ...
hpc-gpu2        ...
hpc-gpu3        ...
hpc-gpu4        ...
hpc-gpu5        ...
hpc-gpu6        ...
hpc-node1       ...
hpc-node2       ...
...
</code>
==== Nodes ====
A node is SLURM's computing unit and corresponds to a physical server.
<code bash>
# Show node information:
hpc-login2 ~]$ scontrol show node hpc-node1
NodeName=hpc-node1 Arch=x86_64 CoresPerSocket=18
...
</code>
==== Partitions ====
Partitions in SLURM are logical groups of nodes. In the cluster there is a single partition to which all nodes belong, so it is not necessary to specify it when submitting jobs.
<code bash>
# Show partition information:
hpc-login2 ~]$ sinfo
defaultPartition*   ...
# When ctgpgpu7 and 8 are incorporated into the cluster, they will appear as nodes hpc-gpu1 and 2 respectively.
</code>
==== Jobs ====
Jobs in SLURM are allocations of resources to a user for a given amount of time. \\
A job (JOB) consists of one or more steps (STEPS), each consisting of one or more tasks (TASKS) that use one or more CPUs. There is one STEP for each program that is executed sequentially within a JOB.
==== Queue system ====
The queue to which each job is submitted defines its priority, its limits, and also the "relative cost" of the resources it consumes.
<code bash>
# Show queues
hpc-login2 ~]$ sacctmgr show qos
# There is an alias that shows only the most relevant information:
hpc-login2 ~]$ show_queues
      Name   Priority   ...   MaxTRES      MaxWall   MaxTRESPU   MaxJobsPU   MaxSubmitPU
---------- ---------- ----- --------- ------------ ----------- ----------- -------------
       ...
interacti+        ...
    urgent        ...
      long        100        gres/...
       ...
</code>
# Priority: is the relative priority of each queue. \\
# DenyOnLimit: the job is not executed if it does not comply with the limits of the queue. \\
# UsageFactor: the relative cost for the user of running a job in this queue. \\
# MaxTRES: limits applied to each job. \\
# MaxWall: maximum time the job can run. \\
# MaxTRESPU: global limits per user. \\
# MaxJobsPU: maximum number of jobs that a user can have running at the same time. \\
# MaxSubmitPU: maximum number of jobs that a user can have submitted (running and pending) at the same time. \\
==== Submitting a job to the queue system ====
== Resource specification ==
By default, if a job is submitted without specifying anything, it is allocated 1 node with 1 CPU and the default amount of memory per CPU. \\
This is very inefficient; ideally, at least these three parameters should always be specified when submitting a job:
  - %%The number of nodes (-N or --nodes), tasks (-n or --ntasks) and/or CPUs per task (-c or --cpus-per-task).%%
  - %%The memory (--mem) per node or the memory per CPU (--mem-per-cpu).%%
  - %%The estimated execution time of the job (--time).%%
In addition, it may be interesting to add the following parameters (a combined example follows the table):
| -J | %%--job-name%% | Name for the job. |
| -q | %%--qos%% | Queue (QOS) to which the job is submitted. |
| -o | %%--output%% | File or file name pattern to which the job output is redirected. |
| -C | %%--constraint%% | To select nodes by their features. |
| | %%--exclusive%% | To request exclusive use of the node(s). |
| -w | %%--nodelist%% | List of specific nodes on which to run the job. |
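Putting the recommended and optional parameters together, a submission could look like this sketch; the queue name, resource amounts and script name are illustrative.
<code bash>
hpc-login2 ~]$ sbatch -J my_analysis -q urgent -o my_analysis_%j.log \
               -N 1 -n 4 -c 1 --mem=8G --time=02:00:00 my_script.sh
</code>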
== How resources are allocated ==
The default allocation method across nodes is block allocation (all available cores on one node are allocated before using another node). The default allocation method within each node is cyclic allocation (the required cores are distributed evenly among the sockets of the node).
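If a different distribution is needed, SLURM's generic %%-m / --distribution%% option can be used; this is standard SLURM behaviour rather than anything specific to this cluster, and the values below are just one possibility.
<code bash>
## Distribute tasks cyclically across the nodes and in blocks within each node:
hpc-login2 ~]$ sbatch -N2 -n8 --distribution=cyclic:block my_script.sh
</code>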
== Priority calculation ==
When a job is submitted to the queue system, the first thing that happens is a check of whether the requested resources are available. \\
If resources are available, the job is executed immediately; if not, it is placed in the queue, where its position depends on its priority. \\
The fair share is a dynamic calculation that SLURM performs based on the resources that each user has consumed recently: the more resources consumed, the lower the priority of the user's new jobs. \\
<code bash>
hpc-login2 ~]$ sshare -l
      Account       User   RawShares   NormShares     RawUsage    NormUsage    FairShare
------------- ---------- ----------- ------------ ------------ ------------ ------------
          ...
    user_name        ...
</code>
# RawShares: the amount of resources, in absolute terms, that is allocated to the user. \\
# NormShares: the above amount normalized to the total allocated resources. \\
# RawUsage: the amount of resources consumed by the user's jobs. \\
# NormUsage: the above amount normalized to the total resources consumed. \\
# FairShare: the FairShare factor, between 0 and 1. The more the cluster is used, the closer it gets to 0 and the lower the priority. \\
== Submitting jobs ==
  - sbatch
  - salloc
  - srun

1. SBATCH \\
Used to submit a script to the queue system. It is batch-processed and non-blocking.
<code bash>
# Create the script:
hpc-login2 ~]$ vim example_job.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --qos=urgent
#SBATCH --output=test_%j.log
echo "Hello World!"

# Submit it to the queue:
hpc-login2 ~]$ sbatch example_job.sh
</code>
2. SALLOC \\
Used to obtain an immediate allocation of resources (nodes). As soon as it is obtained, the specified command (or a shell, if none is given) is executed.
<code bash>
# Obtain 5 nodes and launch a job:
hpc-login2 ~]$ salloc -N5 myprogram
# Obtain interactive access to a node (press Ctrl+D to end the session):
hpc-login2 ~]$ salloc -N1
# Obtain exclusive interactive access to a node:
hpc-login2 ~]$ salloc -N1 --exclusive
</code>
3. SRUN \\
Used to launch a parallel job (preferable to using mpirun). It is interactive and blocking.
<code bash>
# Launch hostname on 2 nodes:
hpc-login2 ~]$ srun -N2 hostname
hpc-node1
hpc-node2
</code>
==== Using nodes with GPU ====
To specifically request GPUs for a job, one of these parameters must be added to the sbatch/srun commands:
| %%--gres%% | GPUs requested per NODE. |
| %%--gpus%% | GPUs requested per JOB. |
There are also the options %% --gpus-per-socket, --gpus-per-node and --gpus-per-task %%. \\
Examples:
<code bash>
## View the list of nodes and GPUs:
hpc-login2 ~]$ show_resources
## To request any 2 GPUs for a JOB, add:
--gpus=2
## To request 1 GPU per node, add:
--gres=gpu:1
</code>
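A minimal sketch of a GPU job script; the GPU count, memory, time limit and the program being run are illustrative values.
<code bash>
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus=1
#SBATCH --mem=16G
#SBATCH --time=01:00:00
#SBATCH --output=gpu_test_%j.log
# Show the GPU assigned to the job (replace with the real program):
nvidia-smi
</code>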
==== Monitoring jobs ====
<code bash>
## List all jobs in the queue:
hpc-login2 ~]$ squeue
## List the jobs of a user:
hpc-login2 ~]$ squeue -u <login>
## Cancel a job:
hpc-login2 ~]$ scancel <JOBID>
## List recent jobs:
hpc-login2 ~]$ sacct -b
## Detailed historical information for a job:
hpc-login2 ~]$ sacct -l -j <JOBID>
## Debug information of a job for troubleshooting:
hpc-login2 ~]$ scontrol show jobid -dd <JOBID>
## View resource usage of a running job:
hpc-login2 ~]$ sstat <JOBID>
</code>
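The sacct output can also be tailored with the standard %%--format%% option, for example to check how long a finished job ran and how much memory it actually used; the field list is just an example.
<code bash>
## Elapsed time and maximum memory used by each step of a job:
hpc-login2 ~]$ sacct -j <JOBID> --format=JobID,JobName,State,Elapsed,MaxRSS
</code>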
==== Controlling job output ====
== Exit codes ==
By default, these are the exit codes of the commands:
^ SLURM command ^ Exit code ^
| salloc | 0 on success, 1 if the user's command cannot be executed |
| srun | The highest among all tasks executed, or 253 for an out-of-memory error |
| sbatch | 0 on success; on failure, it returns the corresponding error code |
== STDIN, STDOUT and STDERR ==
**SRUN:**\\
By default, stdout and stderr are redirected from all TASKS to the stdout and stderr of srun. This behavior can be changed with these options:
| %%-i, --input=<option>%% |
| %%-o, --output=<option>%% |
| %%-e, --error=<option>%% |
And the options are:
  * //all//: the default. stdout and stderr are redirected from all tasks to srun.
  * //none//: stdout and stderr are not redirected.
  * //taskid//: stdout and stderr are redirected only from the task with the specified ID.
  * //filename//: stdout and stderr from all tasks are redirected to the specified file.
  * //filename pattern//: as //filename//, but with a pattern that generates one file per task.
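For example, using a filename pattern each task can write to its own file; %%%t%% expands to the task identifier (standard srun behaviour), and the file name is only an example.
<code bash>
## Each of the 4 tasks writes its output to out_0.txt ... out_3.txt:
hpc-login2 ~]$ srun -n4 -o out_%t.txt hostname
</code>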
**SBATCH:**\\
By default, "/dev/null" is connected to the script's stdin, and stdout and stderr are redirected to a file named "slurm-%j.out". This can be changed with these options:
| %%-i, --input=<filename_pattern>%% |
| %%-o, --output=<filename_pattern>%% |
| %%-e, --error=<filename_pattern>%% |
The reference for the filename patterns can be found [[ https:// | here ]].
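A typical pattern is to separate stdout and stderr into per-job files using %%%j%% (the job ID); the file names are only examples.
<code bash>
#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err
</code>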
==== Sending email notifications ====
Jobs can be configured to send emails under certain circumstances using these two parameters (**BOTH ARE REQUIRED**):
| %%--mail-type=<type>%% | Possible values: BEGIN, END, FAIL, REQUEUE, ALL. |
| %%--mail-user=<email_address>%% | The destination email address. |
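For example, to receive an email when the job finishes or fails (the address is a placeholder):
<code bash>
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<email_address>
</code>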
==== Job states in the queue system ====
<code bash>
hpc-login2 ~]# squeue -l
JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
 6547 defaultPa      ...      ...      ...        ...        ...    ...   ...

## Check the queue usage status of the cluster:
hpc-login2 ~]$ queue_status.sh
JOBS PER USER:
--------------
       ...
JOBS PER QOS:
--------------
       ...
        long:  1
JOBS PER STATE:
--------------
       ...
==========================================
Total JOBS in cluster:  ...
</code>
The most common job states:
  * R RUNNING Job currently has an allocation.
  * CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
  * PD PENDING Job is awaiting resource allocation.
[[ https:// | Here ]] you can see the full list of possible job states.
If a job is not running, the last column of squeue (NODELIST(REASON)) shows the reason why. \\