====== GPGPU Computing Servers ======
===== Service Description =====
These servers are intended for general-purpose GPU computing (GPGPU): compute-intensive tasks such as machine learning, data processing, and scientific simulation that require acceleration by graphics hardware.
==== Freely Accessible Servers ====
Any researcher from the center can request access to these servers. Access is granted upon request and validation.
^ Node ^ Server ^ CPU ^ RAM ^ GPUs ^ Operating System ^ Job Management ^
| ''ctgpgpu4'' | PowerEdge R730 | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]] | 128 GB | 2 × Nvidia GP102GL 24GB (Tesla P40, 2016) | AlmaLinux 9.1 \\ • CUDA 12.0 | **Slurm (mandatory use)** |
* Servers in the HPC computing cluster: [[ centro:servizos:hpc | HPC Computing Cluster ]]
* Servers in CESGA: [[ centro:servizos:cesga | Request Access ]]
==== Restricted Access Servers ====
Access to these servers is restricted to a specific group, specific project, or is more controlled due to resource management and planning issues.
It is essential to check the updated information on Xici when requesting the service, as it details the specific circumstances of each server (access criteria, priorities, usage conditions, etc.).
^ Node ^ Server ^ CPU ^ RAM ^ GPUs ^ Operating System ^ Job Management ^
| ''ctgpgpu5'' | PowerEdge R730 | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]] | 128 GB | 2 × Nvidia GP102GL (Tesla P40) | Ubuntu 22.04 \\ • Nvidia Driver 590 \\ • CUDA Toolkit 12.5 and 13.1 (default) | n/a |
| ''ctgpgpu6'' | SIE LADON 4214 | 2 × [[https://ark.intel.com/content/www/us/en/ark/products/193385/intel-xeon-silver-4214-processor-16-5m-cache-2-20-ghz.html|Intel Xeon Silver 4214]] | 192 GB | Nvidia Quadro P6000 24GB (2018) \\ Nvidia Quadro RTX8000 48GB (2019) \\ 2 × Nvidia A30 24GB (2020) | CentOS 7.9 \\ • Nvidia Driver 535.86.10 (CUDA 12.2) | n/a |
| ''ctgpgpu9'' | Dell PowerEdge R750 | 2 × [[https://ark.intel.com/content/www/es/es/ark/products/215274/intel-xeon-gold-6326-processor-24m-cache-2-90-ghz.html|Intel Xeon Gold 6326]] | 128 GB | 2 × NVIDIA Ampere A100 80GB | AlmaLinux 8.6 \\ • Nvidia Driver 515.48.07 (CUDA 11.7) | n/a |
| ''ctgpgpu11'' | Gigabyte G482-Z54 | 2 × [[https://www.amd.com/es/products/cpu/amd-epyc-7413|AMD EPYC 7413 @2.65 GHz (24c)]] | 256 GB | 5 × NVIDIA Ampere A100 80GB | AlmaLinux 9.1 \\ • Nvidia Driver 520.61.05 (CUDA 11.8) | n/a |
| ''ctgpgpu12'' | Dell PowerEdge R760 | 2 × [[https://ark.intel.com/content/www/xl/es/ark/products/232376.html|Intel Xeon Silver 4410Y]] | 384 GB | 2 × NVIDIA Hopper H100 80GB | AlmaLinux 9.2 \\ • Nvidia Driver 555.42.06 (CUDA 12.5) | n/a |
| ''ctgpgpu15'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.6 | gpuctl* |
| ''ctgpgpu16'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | gpuctl* |
| ''ctgpgpu17'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | gpuctl* |
| ''ctgpgpu18'' ⚠️ | SIE LADON (MegaRAC SP-X) | 2 × AMD EPYC 9335 (24c) | 1536 GB | 4 × NVIDIA H200 | Ubuntu 22.04 | gpuctl |
⚠️ Servers ''ctgpgpu15'', ''ctgpgpu16'', ''ctgpgpu17'', and ''ctgpgpu18'' have a temporary installation and assignment; their configuration and access may change around May 2026.
(*) For now, these servers use only ''ts'' (task-spooler), but a migration to a gpuctl-based system with optional ''tsp'' (task-spooler) usage is planned soon.
===== Service Registration =====
Not all servers are available at all times or for any use. To access the servers, a request must be made in advance through the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will receive an "incorrect password" error when trying to connect.
===== User Manual =====
==== Connecting to the Servers ====
Connections to the servers are made via SSH. The names and IP addresses of the servers are as follows:
^ Node ^ FQDN ^ IP ^
| ''ctgpgpu4'' | ctgpgpu4.inv.usc.es | 172.16.242.201 |
| ''ctgpgpu5'' | ctgpgpu5.inv.usc.es | 172.16.242.202 |
| ''ctgpgpu6'' | ctgpgpu6.inv.usc.es | 172.16.242.205 |
| ''ctgpgpu9'' | ctgpgpu9.inv.usc.es | 172.16.242.94 |
| ''ctgpgpu11'' | ctgpgpu11.inv.usc.es | 172.16.242.96 |
| ''ctgpgpu12'' | ctgpgpu12.inv.usc.es | 172.16.242.97 |
| ''ctgpgpu15'' | ctgpgpu15.inv.usc.es | 172.16.242.207 |
| ''ctgpgpu16'' | ctgpgpu16.inv.usc.es | 172.16.242.212 |
| ''ctgpgpu17'' | ctgpgpu17.inv.usc.es | 172.16.242.213 |
| ''ctgpgpu18'' | ctgpgpu18.inv.usc.es | 172.16.242.208 |
The connection is only available from the center's network. To connect from other locations or from the RAI network, you need to use the [[:en:centro:servizos:vpn:start|VPN]] or the [[:en:centro:servizos:pasarela_ssh|SSH gateway]].
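When connecting from outside through the SSH gateway, the jump can be configured once in ''~/.ssh/config'' so that a plain ''ssh ctgpgpu12'' works transparently. This is only a sketch: the gateway host name and the user names below are placeholders, to be replaced with the actual values given on the SSH gateway page.

```
# ~/.ssh/config — sketch; replace the placeholders with real values
Host gateway
    HostName <gateway-host-from-the-SSH-gateway-page>
    User <your-username>

Host ctgpgpu*
    HostName %h.inv.usc.es
    User <your-username>
    ProxyJump gateway
```

With this in place, ''ssh ctgpgpu12'' expands ''%h'' to the full FQDN and tunnels the connection through the gateway automatically.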
==== Job Management with SLURM ====
On servers with the Slurm queue manager, its use is mandatory for submitting jobs: it prevents conflicts between processes, since two jobs must not run at the same time.
To submit a job to the queue, use the command ''srun'':
srun programa_cuda argumentos_programa_cuda
The ''srun'' process waits for the job to finish before returning control to the user. If you do not want to wait, a session manager such as ''screen'' can be used to leave the job running, disconnect from the session without worry, and retrieve the console output later.
Alternatively, ''nohup'' can be used and the job sent to the background with ''&''. In this case, the output is saved in the ''nohup.out'' file:
nohup srun programa_cuda argumentos_programa_cuda &
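The page only documents ''srun'', but standard Slurm also accepts batch scripts submitted with ''sbatch'', which queue the job and write its output to a file without needing ''screen'' or ''nohup''. A minimal sketch (the job name and output pattern are illustrative; available options depend on the cluster's configuration):

```shell
# Create a minimal Slurm batch script (sketch; the directives are illustrative).
cat > job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=cuda_job
#SBATCH --output=cuda_job-%j.out
./programa_cuda argumentos_programa_cuda
EOF
# Submit with: sbatch job.sh
# Slurm writes stdout/stderr to cuda_job-<jobid>.out; no open terminal needed.
```

''%j'' in the output pattern is replaced by the job ID, so repeated submissions do not overwrite each other.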
To check the status of the queue, the command ''squeue'' is used. The command shows output similar to this:
JOBID PARTITION     NAME     USER ST  TIME  NODES NODELIST(REASON)
    9 servidore ca_water pablo.qu PD  0:00      1 (Resources)
   10 servidore ca_water pablo.qu PD  0:00      1 (Priority)
   11 servidore ca_water pablo.qu PD  0:00      1 (Priority)
   12 servidore ca_water pablo.qu PD  0:00      1 (Priority)
   13 servidore ca_water pablo.qu PD  0:00      1 (Priority)
   14 servidore ca_water pablo.qu PD  0:00      1 (Priority)
    8 servidore ca_water pablo.qu  R  0:11      1 ctgpgpu2
An interactive view, updated every second, can also be obtained with the ''smap'' command:
smap -i 1
==== Job Management with TS ====
On servers using ''ts'' as the job manager, its use is mandatory for executing tasks that use the GPU, in order to avoid conflicts and ensure proper resource allocation.
To request a GPU, the option ''-G 1'' (or the number of required GPUs) must precede the command:
ts -G 1 programa_cuda argumentos_programa_cuda
For example:
ts -G 1 python train.py --epochs 100
The system will handle queuing the job and executing it when a GPU is available.
For more advanced usage examples (multiple GPUs, additional resources, specific options, etc.), the following command can be used:
usage-overview
==== Management of Reservations and Jobs with gpuctl ====
On servers using ''gpuctl'', it is necessary to use the ''gpu'' command to request GPUs and execute jobs on them.
The ''gpu'' command does not incorporate a job queue or parallelism control. It can be used interactively or from scripts, but to queue multiple jobs or execute them in parallel, external tools like ''task-spooler'' (command ''tsp'') must be used.
There are two main workflows:
=== 1. Automatic GPU Reservation ===
This method is the most suitable if you only want to execute a single command. ''gpu exec'' acts as a //wrapper//: it takes care of claiming the GPU, preparing the environment the command needs, and releasing the GPU upon completion.
gpu exec python ./train.py
This is the recommended option for simple executions as it avoids managing GPU reservation and release manually.
=== 2. Manual GPU Reservation ===
This method is most suitable if you want to maintain an active reservation between several commands, whether in an interactive session or in a workflow with multiple consecutive executions.
gpu claim
gpu exec python ./train.py
gpu release
Alternatively:
gpu claim
eval "$(gpu env --shell)"
python ./train.py
gpu release
If there is already an active reservation, ''gpu exec'' reuses it and automatically prepares the appropriate environment for the command.
It is important to remember that, after creating or modifying a reservation, you must run ''eval "$(gpu env --shell)"'' again if you are going to launch the commands directly from the shell. Alternatively, you can use ''gpu exec'', which prepares the environment for each execution.
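To illustrate why the ''eval'' step is needed, here is a runnable sketch. The assumption (not confirmed by this page) is that ''gpu env --shell'' prints ''export'' lines; the variable and value shown are illustrative. The point is that the command's output is just text until the current shell evaluates it:

```shell
# Stand-in for "$(gpu env --shell)": assume it emits export lines like this
# (the variable and value are illustrative).
env_output='export CUDA_VISIBLE_DEVICES=0'

# Printing the output does not change the environment; eval-ing it does,
# and only in the shell where the eval runs.
eval "$env_output"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
```

This is also why a new shell, or a changed reservation, requires running the ''eval'' again: the exported variables live only in the shell that evaluated them.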
=== Life of Reservations ===
Reservations are maintained as long as real computational activity is detected on the GPU.
If a GPU remains reserved without activity for an extended time, the reservation may be lost automatically due to inactivity. If you then try to execute a job assuming the reservation is still active, that job may fail.
A reservation is kept alive by real GPU activity, not simply by having a terminal open.
The expiration times without real GPU activity are configured as follows:
* Monday to Friday, from 10:00 to 18:00 → 1 hour
* At other times → 2 hours
* If the reservation appears abandoned (no user processes on the system or any shell) → 10 minutes
=== Checking the Reservation Status ===
If you are working with a manual reservation and want to check if it is still active, you can use:
gpu mine
Or, if you want to see general information about all GPUs, when real use was last detected on each one, and the currently configured expiration time, you can execute:
gpu status
This is especially advisable if time has passed since the last execution or if the GPU may have been unused for a long period.
=== Reserving an Opportunistic Second GPU ===
Besides the main reservation, which has priority in the ''guaranteed'' queue, it is possible to reserve a second GPU in the ''burst_preemptible'' queue.
This second GPU can be used as long as it is free, but it is not guaranteed: it can be lost even during a job if another user needs that GPU for their main reservation.
This allows taking advantage of additional spare capacity when available, but jobs depending on this second GPU must be prepared to tolerate its loss.
To request this second GPU, simply running ''gpu claim'' a second time is sufficient:
gpu claim
gpu claim
You can also specify the number of GPUs explicitly:
gpu claim --numgpus 2
In this case, the command will wait until it can reserve the requested number of GPUs.
It is important to note that these two methods are not exactly equivalent:
* with ''gpu claim'' followed by another ''gpu claim'', the first reservation is obtained first and the second is requested afterwards;
* with ''gpu claim --numgpus 2'', the request is made jointly and the command will wait until both requested GPUs can be reserved simultaneously.
It is essential to remember that, after making any reservation, you must run ''eval "$(gpu env --shell)"'' again, or run your commands through ''gpu exec''.
If the second GPU is later lost due to being ''preemptible'', jobs that are using it could fail or be interrupted.
Therefore:
* the first GPU is the priority and guaranteed reservation;
* the second GPU is opportunistic and will only be available while it is not needed for another priority reservation;
* if any GPU cannot be reserved, the command will wait and the user will move to the corresponding queue;
* when reserving two GPUs at once, the command will wait until they can both be reserved simultaneously.
=== Using task-spooler (tsp) ===
Since ''gpu'' does not incorporate a job queue or parallelism control, a practical option for queuing executions is to use ''task-spooler'', with the command ''tsp'', combined with ''gpu''.
== Running multiple jobs in series with the same reservation ==
gpu claim
tsp gpu exec python ./train1.py
tsp gpu exec python ./train2.py
tsp gpu exec python ./train3.py
tsp gpu release
This method allows reusing the same reservation for multiple consecutive jobs queued in ''tsp''.
== Running multiple jobs in parallel ==
If several jobs are to be run simultaneously, the number of ''tsp'' slots can be increased:
gpu claim
tsp -S 2
tsp gpu exec python ./train1.py
tsp gpu exec python ./train2.py
tsp gpu exec python ./train3.py
tsp gpu exec python ./train4.py
tsp -w
tsp -S 1
gpu release
In this case, be careful not to run ''gpu release'' before all jobs have finished: first make sure the queue has drained (''tsp -w'' waits for this), and only then release the GPU.
Attention: running multiple jobs in parallel does not imply reserving multiple GPUs. If only one GPU is reserved, all processes will share that same GPU and compete for its resources, such as memory and compute time.
It is also important to remember that the reservation is maintained only while real computational activity is detected on the GPU. If jobs spend too much time without using it, the reservation may automatically expire due to inactivity.
=== Using tmux ===
''tmux'' keeps terminal sessions alive on the server: you can detach, disconnect, and reattach later, maintaining an interactive session without having to keep the connection open. Detaching does not send ''HUP'' to the running processes.
To start a ''tmux'' session or reconnect to it if it already exists, it is recommended to use a fixed name:
tmux new -A -s main
This creates the ''main'' session if it does not exist, or reconnects to it if it was already created.
Once inside ''tmux'', to disconnect while maintaining the session active, you must press ''Control+b'', release, and then press ''d''.
At this point, you can disconnect from the machine without interrupting the processes. To return to the same session, simply execute the previous command again.
If you wish to terminate the session completely from outside:
tmux kill-session -t main
=== Partitioning a GPU with MIG Profiles ===
''gpuctl'' allows partitioning a GPU into several MIG instances to isolate resources and run smaller workloads in parallel on a single physical GPU in a more controlled manner.
This functionality is only available on compatible GPUs. Currently, only the NVIDIA H200 GPUs have it enabled.
To reserve a GPU and enable it in MIG mode:
gpu claim --mig half
gpu claim --mig third
gpu claim --mig quarter
This will partition the GPU into two, three, or four instances, selecting the most suitable MIG profiles for the card model. Note that the ''--mig'' parameter can only be used when claiming exactly one GPU.
Once a MIG reservation is created, ''gpu exec'' and ''gpu env'' will automatically prepare ''CUDA_VISIBLE_DEVICES'' with the corresponding MIG UUIDs.
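As a sketch of what that environment looks like (the UUID below is illustrative, not a real device): with MIG, ''CUDA_VISIBLE_DEVICES'' lists ''MIG-<uuid>'' identifiers instead of plain GPU indices, and CUDA applications then see only those instances.

```shell
# Illustrative only: a MIG instance is addressed by a MIG-<uuid> identifier,
# not by a numeric GPU index. This UUID is made up for the example.
export CUDA_VISIBLE_DEVICES="MIG-a1b2c3d4-e5f6-7890-abcd-ef1234567890"
echo "$CUDA_VISIBLE_DEVICES"
```

Since ''gpu exec'' and ''gpu env'' set this up automatically, there is normally no need to set the variable by hand.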
=== Email Notifications ===
''gpuctl'' allows configuring the sending of email notifications related to the status of reservations and GPU usage.
== View Current Configuration ==
gpu notify
== Activate or Deactivate Notifications ==
gpu notify --mode off
gpu notify --mode email
gpu notify --mode telegram
gpu notify --mode both
When notifications are activated, emails or Telegram messages are sent in the following cases:
  * when a GPU is automatically released because it has remained unused;
  * when the //enforcer// kills a process that is using a GPU without an active reservation, or when an opportunistic GPU is lost;
  * when a reservation request is queued;
  * and later, when that queued reservation is granted.
No notifications are sent when a reservation is granted immediately without waiting, when the GPU is released manually with ''release'', or after an ''exec'' that made an automatic reservation finishes.
== Configure a Custom Email ==
gpu notify --email your.email@domain
If a custom email is not configured, the address obtained from LDAP will be used by default. To revert to the default email:
gpu notify --clear-email
== Pair Telegram Account ==
After executing the following command:
gpu notify --pair-telegram
Follow the instructions to complete the pairing process. To unpair:
gpu notify --unpair-telegram
== Sending Custom Notifications ==
The user can send custom notifications at various points in their workflow by using:
gpu notify --send "message content"
This can be useful for receiving personal alerts at the beginning or end of an experiment, after completing a phase of processing, or at any other relevant point in the workflow.
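One common pattern is to chain the notification after the job with ''&&'', so it fires only if the job succeeds. This is a runnable sketch with stand-ins: ''sleep 1'' represents the real experiment, and the ''echo'' stands in for the actual ''gpu notify --send'' call.

```shell
# 'sleep 1' stands in for the real job (e.g. python train.py);
# replace the echo with:  gpu notify --send "experiment finished"
sleep 1 && echo "notify: experiment finished"
```

With ''||'' instead of ''&&'' the same pattern can alert on failure; ''tsp'' can queue such a chained command like any other job.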
The configured preferred notification method will be used; if neither email nor Telegram has been configured, email is used by default.
=== Usage Reports ===
To consult a summary of the recent usage history of the GPU by process, you can use:
gpu report
This displays the registered PIDs, the number of available samples, and aggregated metrics of GPU, CPU, and memory usage.
If you want to see the data for a specific process, with ASCII graphs and the timeline of stored samples, you can use:
gpu report --pid PID
If there are many samples, the output may appear summarized. To view the full series:
gpu report --pid PID --all-samples
=== Practical Recommendations ===
* To execute a single command, the simplest option is usually ''gpu exec''.
* To maintain an active reservation among several commands, it is more suitable to use ''gpu claim'' + ''gpu exec'' + ''gpu release''.
* If you are going to work directly from the shell after making a manual reservation, you must execute ''eval "$(gpu env --shell)"'' or prepend ''gpu exec''.
* To leave active sessions and recover them later, ''tmux'' is usually the most convenient option.
* For queuing multiple jobs, ''tsp'' is a practical alternative.
* If using ''tsp'' with parallelism, it should be remembered that it does not reserve additional GPUs: processes will share the GPU or visible GPUs at that time.
* If a reservation remains inactive for too long without real GPU computation, it may be lost automatically due to inactivity.