====== GPGPU computing servers ======

===== Service description =====

These servers are intended for GPU computing (GPGPU), aimed at compute-intensive tasks, machine learning, data processing and scientific simulation that require acceleration by graphics hardware.

==== Open-access servers ====

Any researcher at the center can request access to these servers. Access is granted after application and validation.

^ Node ^ Server ^ CPU ^ RAM ^ GPUs ^ Operating System ^ Job management ^
| ''ctgpgpu4'' | PowerEdge R730 | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]] | 128 GB | 2 × Nvidia GP102GL 24GB (Tesla P40, 2016) | AlmaLinux 9.1 \\ • CUDA 12.0 | **Slurm (mandatory use)** |

  * Servers in the HPC computing cluster: [[ centro:servizos:hpc | HPC computing cluster ]]
  * Servers at CESGA: [[ centro:servizos:cesga | Request access ]]

==== Restricted-access servers ====

Access to these servers is restricted to a specific group or project, or is more tightly controlled for resource management and planning reasons. Before requesting the service, check the up-to-date information in Xici, which details the particular conditions of each server (access criteria, priorities, usage conditions, etc.).
^ Node ^ Server ^ CPU ^ RAM ^ GPUs ^ Operating System ^ Job management ^
| ''ctgpgpu5'' | PowerEdge R730 | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]] | 128 GB | 2 × Nvidia GP102GL (Tesla P40) | Ubuntu 22.04 \\ • Nvidia driver 590 \\ • CUDA Toolkit 12.5 and 13.1 (default) | n/a |
| ''ctgpgpu6'' | SIE LADON 4214 | 2 × [[https://ark.intel.com/content/www/us/en/ark/products/193385/intel-xeon-silver-4214-processor-16-5m-cache-2-20-ghz.html|Intel Xeon Silver 4214]] | 192 GB | Nvidia Quadro P6000 24GB (2018) \\ Nvidia Quadro RTX8000 48GB (2019) \\ 2 × Nvidia A30 24GB (2020) | CentOS 7.9 \\ • Nvidia driver 535.86.10 (CUDA 12.2) | n/a |
| ''ctgpgpu9'' | Dell PowerEdge R750 | 2 × [[https://ark.intel.com/content/www/es/es/ark/products/215274/intel-xeon-gold-6326-processor-24m-cache-2-90-ghz.html|Intel Xeon Gold 6326]] | 128 GB | 2 × NVIDIA Ampere A100 80GB | AlmaLinux 8.6 \\ • NVIDIA driver 515.48.07 (CUDA 11.7) | n/a |
| ''ctgpgpu11'' | Gigabyte G482-Z54 | 2 × [[https://www.amd.com/es/products/cpu/amd-epyc-7413|AMD EPYC 7413 @2.65 GHz (24c)]] | 256 GB | 5 × NVIDIA Ampere A100 80GB | AlmaLinux 9.1 \\ • NVIDIA driver 520.61.05 (CUDA 11.8) | n/a |
| ''ctgpgpu12'' | Dell PowerEdge R760 | 2 × [[https://ark.intel.com/content/www/xl/es/ark/products/232376.html|Intel Xeon Silver 4410Y]] | 384 GB | 2 × NVIDIA Hopper H100 80GB | AlmaLinux 9.2 \\ • NVIDIA driver 555.42.06 (CUDA 12.5) | n/a |
| ''ctgpgpu13'' | Gigabyte G493-ZB1-AAP1 | 2 × AMD EPYC 9474F (48c) | 1536 GB | Nvidia RTX Pro 6000 Blackwell Server Edition \\ Nvidia H100 NVL \\ Nvidia L40S | AlmaLinux 9.6 \\ • NVIDIA driver 580.95.05 (CUDA 13.0) | gpuctl |
| ''ctgpgpu14'' | Gigabyte R283-ZF0-AAL1 | 2 × [[https://www.amd.com/es/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9554.html|AMD EPYC 9554 (128c)]] | 768 GB | 2 × Nvidia Blackwell Pro 6000 96GB | AlmaLinux 10.1 \\ • NVIDIA driver 595.71.05 (CUDA 13.2) | n/a |
| ''ctgpgpu15'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.6 | Slurm |
| ''ctgpgpu16'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | gpuctl |
| ''ctgpgpu17'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | gpuctl |
| ''ctgpgpu18'' ⚠️ | SIE LADON (MegaRAC SP-X) | 2 × AMD EPYC 9335 (24c) | 1536 GB | 4 × NVIDIA H200 | Ubuntu 22.04 | gpuctl |

⚠️ The servers ''ctgpgpu15'', ''ctgpgpu16'', ''ctgpgpu17'' and ''ctgpgpu18'' are temporarily installed and assigned; their configuration and access rules may change in the future.

===== Service registration =====

Not all servers are available at all times for any use. To access a server, request it in advance through the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will receive an incorrect-password message when they try to log in.

===== User manual =====

==== Connecting to the servers ====

Connect to the servers via SSH. The names and IP addresses of the servers are as follows:

^ Host ^ FQDN ^ IP ^
| ''ctgpgpu4'' | ctgpgpu4.inv.usc.es | 172.16.242.201 |
| ''ctgpgpu5'' | ctgpgpu5.inv.usc.es | 172.16.242.202 |
| ''ctgpgpu6'' | ctgpgpu6.inv.usc.es | 172.16.242.205 |
| ''ctgpgpu9'' | ctgpgpu9.inv.usc.es | 172.16.242.94 |
| ''ctgpgpu11'' | ctgpgpu11.inv.usc.es | 172.16.242.96 |
| ''ctgpgpu12'' | ctgpgpu12.inv.usc.es | 172.16.242.97 |
| ''ctgpgpu14'' | ctgpgpu14.inv.usc.es | 172.16.242.99 |
| ''ctgpgpu15'' | ctgpgpu15.inv.usc.es | 172.16.242.207 |
| ''ctgpgpu16'' | ctgpgpu16.inv.usc.es | 172.16.242.212 |
| ''ctgpgpu17'' | ctgpgpu17.inv.usc.es | 172.16.242.213 |
| ''ctgpgpu18'' | ctgpgpu18.inv.usc.es | 172.16.242.208 |

Connection is only available from the center's network.
To connect from other locations or from the RAI network it is necessary to use the [[:en:centro:servizos:vpn:start|VPN]] or the [[:en:centro:servizos:pasarela_ssh|SSH gateway]].

==== Job management with SLURM ====

On servers where a Slurm queue manager is present, its use is mandatory to submit jobs and thus avoid conflicts between processes, since two jobs must not be run at the same time. Users are limited to using two cores and 1GB of RAM outside Slurm.

To submit a job to the queue use the ''sbatch'' command with a Slurm script, or ''srun'' directly:

  srun cuda_program program_arguments

It is mandatory to request at least one GPU when submitting jobs to this server or the job will be rejected with the following message:

  Job rejected: a GPU must be requested (e.g. --gres=gpu:1 or --gpus=H200:1).

The ''srun'' process waits for the job to run before returning control to the user. If you do not want to wait, session managers such as ''screen'' can be used, allowing you to leave the job running, disconnect the session without worry and recover the console output later.

Alternatively, you can use ''nohup'' and put the job in the background with ''&''. In this case the output is saved in the ''nohup.out'' file:

  nohup srun cuda_program program_arguments &

To see the queue status use the ''squeue'' command.
The command shows output similar to this:

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      9 servidore ca_water pablo.qu PD       0:00      1 (Resources)
     10 servidore ca_water pablo.qu PD       0:00      1 (Priority)
     11 servidore ca_water pablo.qu PD       0:00      1 (Priority)
     12 servidore ca_water pablo.qu PD       0:00      1 (Priority)
     13 servidore ca_water pablo.qu PD       0:00      1 (Priority)
     14 servidore ca_water pablo.qu PD       0:00      1 (Priority)
      8 servidore ca_water pablo.qu  R       0:11      1 ctgpgpu2

You can also obtain an interactive view, updated every second, with the ''smap'' command:

  smap -i 1

==== Reservation and job management with gpuctl ====

On servers that use ''gpuctl'', it is necessary to use the ''gpu'' command to request GPUs and be able to run jobs on them.

The ''gpu'' command does not include a job queue or parallelism control. It can be used interactively or from scripts, but to queue multiple jobs or run them in parallel you should use external tools, such as ''task-spooler'' (command ''tsp'').

There are two main workflows:

=== 1. Automatic GPU reservation ===

This method is best if you only want to run a single command. ''gpu exec'' acts as a //wrapper//: it takes care of claiming the GPU, preparing the necessary environment for the command and releasing it when finished.

  gpu exec python ./train.py

This is the recommended option for simple runs, as it avoids having to manually manage the reservation and release of the GPU.

=== 2. Manual GPU reservation ===

This method is best if you want to keep a reservation active across several commands, either in an interactive session or in a workflow with several consecutive runs.

  gpu claim
  gpu exec python ./train.py
  gpu release

Or, alternatively:

  gpu claim
  eval "$(gpu env --shell)"
  python ./train.py
  gpu release

If a reservation already exists, ''gpu exec'' reuses it and automatically prepares the appropriate environment for the command.
After creating or modifying a reservation, you must run ''eval "$(gpu env --shell)"'' again if commands will be launched directly from the shell. Alternatively, you can use ''gpu exec'', which prepares the environment on each execution.

=== Lifetime of reservations ===

Reservations are kept as long as real compute activity is detected on the GPU. If a GPU remains reserved without activity for a prolonged time, the reservation may be lost automatically due to inactivity. If you then try to run a job assuming the reservation is still active, the job may fail.

The reservation is not maintained simply by having a terminal open, but by real activity on the GPU.

The inactivity expiration times are configured as follows:

  * Monday to Friday, from 10:00 to 18:00 → 1 hour
  * Other times → 2 hours
  * If the reservation appears abandoned (no user process on the system nor any shell) → 10 minutes

=== Checking the reservation status ===

If you are working with a manual reservation and want to check if it is still active, you can use:

  gpu mine

Or, if you want to see general information about all GPUs, the time when real usage was last observed for each, and the current configured expiration time, you can run:

  gpu status

This is especially recommended if time has passed since the last run or if the GPU may have been unused for a long period.

=== Queue behavior of reservations ===

Although ''gpu'' does not implement a job queue as such, there is an implicit waiting system for GPU reservation.

When a command that requires reserving a GPU is executed and none are available, the command does not fail but instead remains blocked waiting for a GPU to become free. The user enters an internal queue managed by ''gpuctl'' and the execution will continue automatically as soon as the reservation can be satisfied.

This behavior applies to both manual and automatic reservations.
For example:

  gpu exec python ./train.py

If no GPUs are available, this command will wait until one is released. Similarly:

  gpu claim

will also remain blocked until it can obtain a GPU.

=== Selecting a specific GPU ===

By default, ''gpuctl'' automatically assigns an available GPU. However, it is possible to request a specific GPU by its index. Both ''gpu claim'' and ''gpu exec'' allow the ''--gpu-index'' parameter for this purpose:

  gpu claim --gpu-index 0
  gpu exec --gpu-index 1 python ./train.py

In this case, the command will attempt to reserve the specified GPU. If that GPU is not available, the command will wait in queue until that particular GPU is released.

This behavior allows fixing runs to specific GPUs, but may increase waiting time if the chosen GPU is in high demand.

=== Opportunistic second GPU reservation ===

In addition to the main reservation, which is made with priority in the ''guaranteed'' queue, it is possible to reserve a second GPU in the ''burst_preemptible'' queue on some servers (check the reservation policy message when entering the server; if applicable it will refer to a second opportunistic GPU).

This second GPU can be used whenever it is free, but it is not guaranteed: it can be lost even in the middle of a job if another user needs that GPU for their primary reservation. This allows taking advantage of additional free capacity when available, but jobs that depend on this second GPU must be prepared to tolerate its loss.

To request this second GPU, it is enough to run ''gpu claim'' a second time:

  gpu claim
  gpu claim

You can also explicitly specify the number of GPUs:

  gpu claim --numgpus 2

In this case, the command will wait until it can reserve the requested number of GPUs.
Note that these two forms are not exactly equivalent:

  * with ''gpu claim'' followed by another ''gpu claim'', the first reservation is obtained earlier and the second is requested later;
  * with ''gpu claim --numgpus 2'', the request is made jointly and the command will wait until both GPUs can be reserved.

After making any reservation you must run ''eval "$(gpu env --shell)"'' again or perform executions with ''gpu exec''.

If the second GPU is lost later for being ''preemptible'', the jobs using it could fail or be interrupted.

Therefore:

  * the first GPU is the priority, guaranteed reservation;
  * the second GPU is opportunistic and will only be available while it is not needed for another priority reservation;
  * if any GPU cannot be reserved, the command will remain waiting and the user will enter the corresponding queue;
  * when reserving two GPUs at once, the command will wait until both can be reserved at the same time.

=== Using task-spooler (tsp) ===

Since ''gpu'' does not include a job queue or parallelism control, a practical option to queue executions is to use ''task-spooler'', with the ''tsp'' command, combined with ''gpu''.

== Run several jobs in series with the same reservation ==

  gpu claim
  tsp gpu exec python ./train1.py
  tsp gpu exec python ./train2.py
  tsp gpu exec python ./train3.py
  tsp gpu release

This method allows reusing the same reservation for several consecutive jobs queued in ''tsp''.

== Run several jobs in parallel ==

If you want to run several jobs at the same time, you can increase the number of ''slots'' of ''tsp'':

  gpu claim
  tsp -S 2
  tsp gpu exec python ./train1.py
  tsp gpu exec python ./train2.py
  tsp gpu exec python ./train3.py
  tsp gpu exec python ./train4.py
  tsp -w
  tsp -S 1
  gpu release

In this case, be careful not to run ''gpu release'' before all jobs have finished. First make sure the queue has finished, and only then release the GPU.
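One way to make sure that every queued job has finished before releasing the GPU is to wait for each job individually. As far as we know, ''tsp -w'' without an argument waits only for the most recently added job, so with more than one slot an earlier job could still be running when it returns. The following is a minimal sketch, not an official tool: the function name ''run_parallel_jobs'' is hypothetical, and it assumes that ''tsp'' prints the job ID on standard output when a job is queued and that ''tsp -w ID'' waits for that job.

```shell
# Hypothetical helper: queue several Python scripts in parallel on ONE
# reserved GPU and release the GPU only after every job has finished.
# Assumption: `tsp <command>` prints the new job ID on stdout.
run_parallel_jobs() {
    slots="$1"
    shift
    ids=""

    # Block until a GPU is reserved (see "Queue behavior of reservations").
    gpu claim || return 1

    # Allow $slots jobs to run at the same time.
    tsp -S "$slots"

    # Queue every script and remember its job ID.
    for script in "$@"; do
        id="$(tsp gpu exec python "$script")"
        ids="$ids $id"
    done

    # Wait for each job by ID, not only for the last one added.
    for id in $ids; do
        tsp -w "$id"
    done

    # Restore the default single slot and release the reservation.
    tsp -S 1
    gpu release
}
```

For example, ''run_parallel_jobs 2 ./train1.py ./train2.py ./train3.py ./train4.py'' reproduces the sequence above. All jobs still share the single reserved GPU.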
Attention: running several jobs in parallel does not imply reserving multiple GPUs. If only one GPU is reserved, all processes will share that same GPU and compete for its resources, such as memory and compute time.

Also remember that the reservation is only maintained while real compute activity on the GPU is detected. If the jobs go too long without using it, the reservation may automatically expire due to inactivity.

=== Using tmux ===

Using ''tmux'' allows keeping terminal sessions, disconnecting from them and reattaching later, thus maintaining an interactive session without keeping the connection open. Disconnecting also does not send ''HUP'' to processes.

To start a ''tmux'' session or reconnect to it if it already exists, it is recommended to use a fixed name:

  tmux new -A -s main

This creates the ''main'' session if it does not exist, or reconnects to it if it already exists.

Once inside ''tmux'', to detach while keeping the session active press ''Control+b'', release, and then press ''d''. At this point you can disconnect from the machine without interrupting processes.

To return to the same session, just run the previous command again.

If you want to terminate the session completely from outside:

  tmux kill-session -t main

=== GPU partitioning with MIG profiles ===

''gpuctl'' allows partitioning a GPU into several MIG instances to isolate resources and run smaller workloads in parallel on a single physical GPU in a more controlled way.

This functionality is only available on compatible GPUs; otherwise it cannot be used. So far, only NVIDIA H200 GPUs have this function enabled.

To reserve a GPU and enable it in MIG mode, use one of:

  gpu claim --mig half
  gpu claim --mig third
  gpu claim --mig quarter

This will split the GPU into two, three or four partitions, choosing the most appropriate MIG profiles according to the compatible GPU. Note that the ''--mig'' parameter can only be used when exactly one GPU is claimed.
Once the MIG reservation is created, ''gpu exec'' and ''gpu env'' automatically prepare ''CUDA_VISIBLE_DEVICES'' with the corresponding MIG UUIDs.

=== Email notifications ===

''gpuctl'' can send email or Telegram notifications about the state of reservations and GPU usage.

== View current configuration ==

  gpu notify

== Enable or disable notifications ==

  gpu notify --mode off
  gpu notify --mode email
  gpu notify --mode telegram
  gpu notify --mode both

When notifications are enabled, emails or Telegram messages are sent in the following cases:

  * when a GPU is automatically released for remaining unused;
  * when the enforcer kills a process for using a GPU without having an active reservation or for losing an opportunistic GPU;
  * when a reservation request is queued;
  * and later when that reservation is granted.

No notification is sent when a reservation is granted immediately without waiting, when the GPU is released manually with ''release'', or when it is released at the end of an ''exec'' that made an automatic reservation.

== Configure a custom email ==

  gpu notify --email your.email@domain

If a custom email is not configured, the address obtained from LDAP will be used by default.

To revert to the default email:

  gpu notify --clear-email

== Pair Telegram account ==

Run the following command and follow the instructions to complete the pairing process:

  gpu notify --pair-telegram

To remove the pairing:

  gpu notify --unpair-telegram

== Send custom notifications ==

The user can send custom notifications at various points in their workflow using:

  gpu notify --send "message content"

This can be useful to receive personal alerts at the start or end of an experiment, after completing a processing phase, or at any other relevant point in the workflow. The configured notification method is used, or email if none is configured.
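As an illustration of the start/end alerts described above, the following sketch wraps a run in a small shell function. The name ''run_and_notify'' and the message texts are hypothetical, and the sketch assumes that ''gpu exec'' returns the exit status of the command it wraps, which is not confirmed here.

```shell
# Hypothetical helper: send a notification before and after a run.
# Assumption: `gpu exec` exits with the status of the wrapped command.
run_and_notify() {
    label="$1"
    shift
    rc=0

    gpu notify --send "START ${label}"

    # Run the command through gpu exec so the GPU environment is prepared.
    gpu exec "$@" || rc=$?

    if [ "$rc" -eq 0 ]; then
        gpu notify --send "DONE ${label}"
    else
        gpu notify --send "FAILED ${label} (exit code ${rc})"
    fi
    return "$rc"
}
```

A call such as ''run_and_notify "training run" python ./train.py'' would then send one message when the run starts and another when it ends, using whichever notification method is configured.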
=== Usage reports ===

To consult a summary of the recent historical GPU usage by process, you can use:

  gpu report

This shows the recorded PIDs, number of samples available and aggregated metrics of GPU, CPU and memory usage.

If you want to see data for a specific process, with ASCII graphs and the timeline of stored samples, you can use:

  gpu report --pid PID

If there are many samples, the output may be summarized. To see the full series:

  gpu report --pid PID --all-samples

=== Practical recommendations ===

  * To run a single command, the simplest is usually ''gpu exec''.
  * To keep a reservation active across several commands, it is more appropriate to use ''gpu claim'' + ''gpu exec'' + ''gpu release''.
  * If you will work directly from the shell after making a manual reservation, run ''eval "$(gpu env --shell)"'' or prefix commands with ''gpu exec''.
  * To leave sessions active and recover them later, ''tmux'' is usually the most convenient option.
  * To queue multiple jobs, ''tsp'' is a practical alternative.
  * If you use ''tsp'' with parallelism, remember that this does not reserve additional GPUs: processes will share the GPU(s) visible at that moment.
  * If a reservation remains too long without real compute activity on the GPU, it may be lost automatically due to inactivity.