====== GPGPU Computing Servers ======

===== Service Description =====

These servers are intended for general-purpose GPU computing (GPGPU): computationally intensive tasks, machine learning, data processing, and scientific simulation that benefit from GPU hardware acceleration.

==== Public Access Servers ====

Any researcher from the center can request access to these servers. Access is granted upon request and validation.

^ Node         ^ Server          ^ CPU                                                                                                                    ^ RAM     ^ GPUs                                       ^ Operating System             ^ Job Management          ^
| ''ctgpgpu4'' | PowerEdge R730  | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]]  | 128 GB  | 2 × Nvidia GP102GL 24GB (Tesla P40, 2016)  | AlmaLinux 9.1 \\ • CUDA 12.0  | **Slurm (mandatory use)**  |

  * Servers in the HPC computing cluster: [[ center:servizos:hpc | HPC computing cluster ]]
  * Servers at CESGA: [[ center:servizos:cesga | Request access ]]

==== Restricted Access Servers ====

On these servers, access is restricted to a specific group or project, or is more tightly controlled for resource management and planning reasons.

Before requesting the service, check the up-to-date information in Xici, which details the particular circumstances of each server (access criteria, priorities, conditions of use, etc.).

^ Node              ^ Server                  ^ CPU                                                                                                                                                    ^ RAM      ^ GPUs                                                                                                ^ Operating System                                                                ^ Job Management  ^
| ''ctgpgpu5''      | PowerEdge R730            | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]]                                  | 128 GB   | 2 × Nvidia GP102GL (Tesla P40)                                                                      | Ubuntu 22.04 \\ • Nvidia Driver 590 \\ • CUDA Toolkit 12.5 and 13.1 (default)  | n/a                   |
| ''ctgpgpu6''      | SIE LADON 4214            | 2 × [[https://ark.intel.com/content/www/us/en/ark/products/193385/intel-xeon-silver-4214-processor-16-5m-cache-2-20-ghz.html|Intel Xeon Silver 4214]]  | 192 GB   | Nvidia Quadro P6000 24GB (2018) \\ Nvidia Quadro RTX8000 48GB (2019) \\ 2 × Nvidia A30 24GB (2020)  | CentOS 7.9 \\ • Nvidia Driver 535.86.10 (CUDA 12.2)                              | n/a                   |
| ''ctgpgpu9''      | Dell PowerEdge R750       | 2 × [[https://ark.intel.com/content/www/es/es/ark/products/215274/intel-xeon-gold-6326-processor-24m-cache-2-90-ghz.html|Intel Xeon Gold 6326]]        | 128 GB   | 2 × NVIDIA Ampere A100 80GB                                                                         | AlmaLinux 8.6 \\ • Nvidia Driver 515.48.07 (CUDA 11.7)                           | n/a                   |
| ''ctgpgpu11''     | Gigabyte G482-Z54         | 2 × [[https://www.amd.com/es/products/cpu/amd-epyc-7413|AMD EPYC 7413 @2.65 GHz (24c)]]                                                                | 256 GB   | 5 × NVIDIA Ampere A100 80GB                                                                         | AlmaLinux 9.1 \\ • Nvidia Driver 520.61.05 (CUDA 11.8)                           | n/a                   |
| ''ctgpgpu12''     | Dell PowerEdge R760       | 2 × [[https://ark.intel.com/content/www/xl/es/ark/products/232376.html|Intel Xeon Silver 4410Y]]                                                       | 384 GB   | 2 × NVIDIA Hopper H100 80GB                                                                         | AlmaLinux 9.2 \\ • Nvidia Driver 555.42.06 (CUDA 12.5)                           | n/a                   |
| ''ctgpgpu13''     | Gigabyte G493-ZB1-AAP1    | 2 × AMD EPYC 9474F (48c)                                                                                                                               | 1536 GB  | Nvidia RTX Pro 6000 Blackwell Server Edition \\ Nvidia H100 NVL \\ Nvidia L40S                      | AlmaLinux 9.6 \\ • Nvidia Driver 580.95.05 (CUDA 13.0)                           | gpuctl                |
| ''ctgpgpu14''     | Gigabyte R283-ZF0-AAL1    | 2 × [[https://www.amd.com/es/products/processors/server/epyc/4th-generation-9004-and-8004-series/amd-epyc-9554.html|AMD EPYC 9554 (128c)]]            | 768 GB   | 2 × Nvidia Blackwell Pro 6000 96GB                                                                  | AlmaLinux 10.1 \\ • Nvidia Driver 595.71.05 (CUDA 13.2)                          | n/a                   |
| ''ctgpgpu15'' ⚠️  | SIE LADON (Gigabyte)      | 2 × AMD EPYC 9474F (48c)                                                                                                                               | 768 GB   | 4 × NVIDIA H200 NVL                                                                                 | AlmaLinux 9.6                                                                    | gpuctl*               |
| ''ctgpgpu16'' ⚠️  | SIE LADON (Gigabyte)      | 2 × AMD EPYC 9474F (48c)                                                                                                                               | 768 GB   | 4 × NVIDIA H200 NVL                                                                                 | AlmaLinux 9.7                                                                    | gpuctl                |
| ''ctgpgpu17'' ⚠️  | SIE LADON (Gigabyte)      | 2 × AMD EPYC 9474F (48c)                                                                                                                               | 768 GB   | 4 × NVIDIA H200 NVL                                                                                 | AlmaLinux 9.7                                                                    | gpuctl                |
| ''ctgpgpu18'' ⚠️  | SIE LADON (MegaRAC SP-X)  | 2 × AMD EPYC 9335 (24c)                                                                                                                                | 1536 GB  | 4 × NVIDIA H200                                                                                     | Ubuntu 22.04                                                                     | gpuctl                |

⚠️ The servers ''ctgpgpu15'', ''ctgpgpu16'', ''ctgpgpu17'' and ''ctgpgpu18'' are installed and assigned on a temporary basis; their configuration and access may change around May 2026.

(*) For now, servers marked with an asterisk only use ''ts'' (task-spooler), but a migration to a ''gpuctl''-based system with optional ''tsp'' (task-spooler) is planned soon.

===== Service Registration =====
Not all servers are available at all times for every use. To access a server, you must first request it through the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will get an incorrect-password error when they try to log in.

===== User Manual =====
==== Connecting to the Servers ====
Connect to the servers via SSH. Their names and IP addresses are as follows:

^ Node ^ FQDN ^ IP ^
| ''ctgpgpu4''  | ctgpgpu4.inv.usc.es  | 172.16.242.201 |
| ''ctgpgpu5''  | ctgpgpu5.inv.usc.es  | 172.16.242.202 |
| ''ctgpgpu6''  | ctgpgpu6.inv.usc.es  | 172.16.242.205 |
| ''ctgpgpu9''  | ctgpgpu9.inv.usc.es  | 172.16.242.94  |
| ''ctgpgpu11'' | ctgpgpu11.inv.usc.es | 172.16.242.96  |
| ''ctgpgpu12'' | ctgpgpu12.inv.usc.es | 172.16.242.97  |
| ''ctgpgpu15'' | ctgpgpu15.inv.usc.es | 172.16.242.207 |
| ''ctgpgpu16'' | ctgpgpu16.inv.usc.es | 172.16.242.212 |
| ''ctgpgpu17'' | ctgpgpu17.inv.usc.es | 172.16.242.213 |
| ''ctgpgpu18'' | ctgpgpu18.inv.usc.es | 172.16.242.208 |

Connection is only available from the center's network. To connect from other locations or from the RAI network, it is necessary to use the [[:en:centro:servizos:vpn:start|VPN]] or the [[:en:centro:servizos:pasarela_ssh|SSH gateway]].
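
As an illustration, the connection step can be wrapped in a small shell helper. This is only a sketch: the function name ''ctssh'' and the ''CT_USER'' variable are hypothetical and not part of the service. It assumes a Bash shell, the ''NODE.inv.usc.es'' naming shown in the table above, and that you are on the center's network or connected through the VPN.

<code bash>
# Hypothetical helper: open an SSH session on a GPU node given its short name.
# CT_USER is an assumed variable holding your center username (defaults to $USER).
ctssh() {
  local node="$1"; shift
  ssh "${CT_USER:-$USER}@${node}.inv.usc.es" "$@"
}

# Usage:
#   ctssh ctgpgpu4              # interactive session
#   ctssh ctgpgpu4 nvidia-smi   # run a single command and return
</code>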

==== Job Management with SLURM ====

On servers with a Slurm queue manager, all jobs must be submitted through it. This avoids conflicts between processes, since no two jobs should run at the same time.

To submit a job to the queue, use the ''srun'' command:

  srun cuda_program cuda_program_arguments

The ''srun'' process blocks until the job finishes before returning control to the user. If you do not want to wait, run it inside a terminal session manager such as ''screen'': you can then leave the job queued, disconnect, and retrieve the console output later.

Alternatively, you can use ''nohup'' and send the job to the background with ''&''. In this case, the output is saved in the ''nohup.out'' file:

  nohup srun cuda_program cuda_program_arguments &
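
When several jobs are launched this way from the same directory, they all append to the same ''nohup.out''. One possible way to keep the output of each job apart is to redirect it explicitly. The helper below is a sketch under that assumption; ''submit_bg'' is a hypothetical name and Bash is assumed.

<code bash>
# Hypothetical helper: submit a command through srun in the background,
# writing its output to a log file of your choice instead of nohup.out.
submit_bg() {
  local log="$1"; shift
  nohup srun "$@" > "$log" 2>&1 &
  echo "Submitted in background (PID $!); output goes to $log"
}

# Usage:
#   submit_bg run1.log ./cuda_program arg1 arg2
#   submit_bg run2.log ./cuda_program arg3 arg4
</code>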

To see the status of the queue, use the ''squeue'' command. The command displays output similar to this:

<code>JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    9    server ca_water pablo.qu PD  0:00     1 (Resources)
   10    server ca_water pablo.qu PD  0:00     1 (Priority)
   11    server ca_water pablo.qu PD  0:00     1 (Priority)
   12    server ca_water pablo.qu PD  0:00     1 (Priority)
   13    server ca_water pablo.qu PD  0:00     1 (Priority)
   14    server ca_water pablo.qu PD  0:00     1 (Priority)
    8    server ca_water pablo.qu  R  0:11     1 ctgpgpu2</code>
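
To focus on your own jobs, the ''squeue'' output can be filtered by user and state. The following is a minimal sketch, assuming a Bash shell; ''my_queue_summary'' is a hypothetical helper name, while ''-h'' (no header), ''-u'' (user) and ''-t'' (state) are standard ''squeue'' options.

<code bash>
# Hypothetical helper: count how many of your own jobs are running and pending.
my_queue_summary() {
  local user="${USER:-$(id -un)}" running pending
  running=$(squeue -h -u "$user" -t RUNNING | wc -l)
  pending=$(squeue -h -u "$user" -t PENDING | wc -l)
  # $(( )) strips any padding that wc may add around the number
  echo "running=$((running)) pending=$((pending))"
}

# Usage:
#   my_queue_summary
</code>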

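An interactive view, refreshed every second, is also available with the ''smap'' command:

  smap -i 1

If ''smap'' is not installed on the node, ''watch -n 1 squeue'' gives a similar periodically refreshed view.
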
==== Managing Reservations and Jobs with gpuctl ====

On servers that use ''gpuctl'', you must use the ''gpu'' command to request a GPU before you can run jobs on it.

The ''gpu'' command does not provide a job queue or parallelism control. It can be used interactively or from scripts, but to enqueue multiple jobs or run them in parallel you need an external tool such as ''task-spooler'' (command ''tsp'').

There are two main workflows:

=== 1. Automatic GPU Reservation ===

This method is best if you only want to execute a single command. ''gpu exec'' acts as a //wrapper//: it takes care of claiming the GPU, preparing the necessary environment for the command, and releasing it upon completion.

<code bash>
gpu exec python ./train.py
</code>

This is the recommended option for simple executions, as it removes the need to reserve and release the GPU manually.

=== 2. Manual GPU Reservation ===

This method is best if you want to maintain an active reservation between multiple commands, whether in an interactive session or in a workflow with consecutive executions.

<code bash>
gpu claim
gpu exec python ./train.py
gpu release
</code>

Or, alternatively:

<code bash>
gpu claim
eval "$(gpu env --shell)"
python ./train.py
gpu release
</code>

If there is already an active reservation, ''gpu exec'' will reuse it and automatically prepare the proper environment for the command.

Remember that after creating or modifying a reservation, you must execute ''<nowiki>eval "$(gpu env --shell)"</nowiki>'' if you are going to launch commands directly from the shell. Alternatively, use ''gpu exec'', which prepares the environment on each execution.
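
In scripts, a manual reservation is easy to leave behind if one of the commands fails before ''gpu release'' is reached. A possible pattern is to release from an exit trap. The sketch below assumes a Bash shell and that ''gpu claim'' and ''gpu exec'' return a non-zero exit status on failure (this page does not confirm that); ''run_reserved'' is a hypothetical name.

<code bash>
# Hypothetical helper: run several Python scripts, one after another,
# under a single manual reservation that is always released at the end.
# The body runs in a subshell ( ... ), so the trap does not leak into your shell.
run_reserved() (
  gpu claim || exit 1
  trap 'gpu release' EXIT          # runs on normal exit and on failure
  for script in "$@"; do
    # Assumption: gpu exec propagates the exit status of the wrapped command.
    gpu exec python "$script" || exit 1
  done
)

# Usage:
#   run_reserved ./train1.py ./train2.py
</code>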

=== Lifetime of Reservations ===

Reservations are maintained as long as real computational activity on the GPU is detected.

If a GPU remains reserved with no activity for a prolonged period, the reservation may be lost automatically due to inactivity. A job launched afterwards on the assumption that the reservation is still active may therefore fail.

The reservation is not maintained simply by having an open terminal, but by actual activity on the GPU.

The inactivity timeouts (time without real activity on the GPU before the reservation expires) are configured as follows:

  * Monday to Friday, from 10:00 to 18:00 → 1 hour
  * At other times → 2 hours
  * If the reservation seems abandoned (there are no user processes in the system nor any shell) → 10 minutes
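
The rules above can be written down as a small function, which may help when planning pauses between executions. This is only an illustration of the listed policy, not code taken from ''gpuctl'': the function name is hypothetical, the window is assumed to cover 10:00 up to but not including 18:00, and the server's local time is assumed.

<code bash>
# Hypothetical helper: inactivity timeout (in minutes) that applies at a given moment.
# Arguments (all optional): day of week (1-7, Monday = 1), hour (0-23), abandoned (yes|no).
# Defaults to the current local day and hour, and to a non-abandoned reservation.
reservation_timeout() {
  local dow="${1:-$(date +%u)}" hour="${2:-$(date +%H)}" abandoned="${3:-no}"
  hour=$((10#$hour))               # force base 10: "08" is not a valid octal number
  if [ "$abandoned" = yes ]; then
    echo 10
  elif [ "$dow" -le 5 ] && [ "$hour" -ge 10 ] && [ "$hour" -lt 18 ]; then
    echo 60                        # Monday to Friday, 10:00 to 18:00 (end assumed exclusive)
  else
    echo 120
  fi
}

# Usage:
#   reservation_timeout            # timeout that applies right now
#   reservation_timeout 6 12       # Saturday at 12:00
</code>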

=== Checking Reservation Status ===

If you are working with a manual reservation and want to check if it is still active, you can use:

<code bash>
gpu mine
</code>

Or, if you want to see general information about all GPUs, the time at which real use was captured for each, and the currently configured expiration time, you can execute:

<code bash>
gpu status
</code>

Checking the status is especially advisable if some time has passed since the last execution or if the GPU may have gone unused for a long period.

=== Queue Behavior of Reservations ===

Although ''gpu'' does not implement a job queue per se, there is an implicit waiting system for GPU reservations.

When a command that requires reserving a GPU is executed and none are available, the command does not fail; it remains blocked waiting for a free GPU. The user enters an internal queue managed by ''gpuctl'', and execution will automatically continue as soon as the reservation can be satisfied.

This behavior applies to both manual and automatic reservations.

For example:

<code bash>gpu exec python ./train.py </code>

If no GPUs are available, this command will wait until one is freed.

Similarly:

<code bash>gpu claim </code>

will also remain blocked until a GPU can be obtained.

=== Selecting a Specific GPU ===

By default, ''gpuctl'' automatically assigns an available GPU. However, it is possible to request a specific GPU by its index.

Both ''gpu claim'' and ''gpu exec'' allow you to use the parameter ''--gpu-index'' for this purpose:

<code bash> 
gpu claim --gpu-index 0 
gpu exec --gpu-index 1 python ./train.py 
</code>

In this case, the command will attempt to reserve specifically the indicated GPU. If that GPU is not available, the command will wait in the queue until that particular GPU is freed.

This behavior lets you pin executions to specific GPUs, but it may increase the wait time if the chosen GPU is in high demand.
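
On servers with different GPU models, it can help to look up the index by model name before claiming. The sketch below uses the standard ''nvidia-smi'' query options. It assumes a Bash shell, that ''nvidia-smi'' can be run on the server, and that the indexes used by ''gpuctl'' match those reported by ''nvidia-smi''; this page confirms neither, so compare with ''gpu status'' first. ''gpu_index_by_name'' is a hypothetical name.

<code bash>
# Hypothetical helper: print the index of the first GPU whose name contains
# the given text (for example "L40S"), or nothing if there is no match.
gpu_index_by_name() {
  nvidia-smi --query-gpu=index,name --format=csv,noheader |
    awk -F', ' -v pat="$1" 'index($2, pat) { print $1; exit }'
}

# Usage (assuming gpuctl and nvidia-smi use the same numbering):
#   gpu claim --gpu-index "$(gpu_index_by_name L40S)"
</code>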

=== Reserving a Second Opportunistic GPU ===

In addition to the primary reservation, which is made with priority in the ''guaranteed'' queue, some servers allow reserving a second GPU in the ''burst_preemptible'' queue. Check the reservation policy message shown when you log in to the server: where this applies, it mentions a second opportunistic GPU.

This second GPU can be used as long as it is free, but it is not guaranteed: it may be lost even in the middle of a job if another user needs that GPU for their primary reservation.

This allows taking advantage of additional free capacity when available, but jobs that depend on this second GPU must be prepared to tolerate its loss.

To request this second GPU, just execute ''gpu claim'' a second time:

<code bash>
gpu claim
gpu claim
</code>

It is also possible to explicitly specify the number of GPUs:

<code bash>
gpu claim --numgpus 2
</code>

In this case, the command will wait until it can reserve the requested number of GPUs.

It is important to note that these two methods are not exactly equivalent:

  * with ''gpu claim'' followed by another ''gpu claim'', the first reservation is obtained first and the second is requested afterward;
  * with ''<nowiki>gpu claim --numgpus 2</nowiki>'', the request is made collectively and the command will wait until the two requested GPUs can be reserved.

It is important to remember that after making any reservation, you need to execute ''<nowiki>eval "$(gpu env --shell)"</nowiki>'' or perform the executions with ''gpu exec''.

If the second GPU is later lost due to being ''preemptible'', the jobs using it may fail or be interrupted.

Therefore:

  * the first GPU is the priority and guaranteed reservation;
  * the second GPU is opportunistic and will only be available while not needed for another priority reservation;
  * if a GPU cannot be reserved, the command will remain waiting, and the user will enter the corresponding queue;
  * when reserving two GPUs at once, the command will wait until both can be reserved simultaneously.
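
Because the second GPU may disappear, a script that expects two GPUs can check how many are actually visible before launching. The sketch below counts the entries in ''CUDA_VISIBLE_DEVICES'' after the environment has been loaded. It assumes a Bash shell and that ''gpu env'' exports that variable for ordinary reservations as well as for MIG ones; ''count_visible_gpus'' is a hypothetical name.

<code bash>
# Hypothetical helper: print how many devices are listed in CUDA_VISIBLE_DEVICES.
count_visible_gpus() {
  local -a devs
  if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
    echo 0
    return
  fi
  IFS=',' read -r -a devs <<< "$CUDA_VISIBLE_DEVICES"
  echo "${#devs[@]}"
}

# Usage:
#   gpu claim --numgpus 2
#   eval "$(gpu env --shell)"
#   if [ "$(count_visible_gpus)" -lt 2 ]; then echo "only one GPU visible" >&2; fi
</code>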

=== Using task-spooler (tsp) ===

As ''gpu'' does not include a job queue or parallelism control, a practical option for queuing executions is to use ''task-spooler'', with the ''tsp'' command, combined with ''gpu''.

== Running Multiple Jobs in Series with the Same Reservation ==

<code bash>
gpu claim
tsp gpu exec python ./train1.py
tsp gpu exec python ./train2.py
tsp gpu exec python ./train3.py
tsp gpu release
</code>

This method allows reusing the same reservation for several consecutive jobs queued in ''tsp''.

== Running Multiple Jobs in Parallel ==

If you want to run several jobs at the same time, you can increase the number of ''tsp'' slots:

<code bash>
gpu claim
tsp -S 2
tsp gpu exec python ./train1.py
tsp gpu exec python ./train2.py
tsp gpu exec python ./train3.py
tsp gpu exec python ./train4.py
tsp -w
tsp -S 1
gpu release
</code>

In this case, you must be careful not to execute ''gpu release'' before all jobs finish. You should first ensure that the queue has completed, and only then release the GPU.
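
Note that ''tsp -w'' without an argument waits only for the most recently added job, and with more than one slot an earlier job may still be running when that one ends. One way to make sure the whole batch has finished is to record the job ID that ''tsp'' prints on submission and wait for each one. The following is a sketch assuming a Bash shell; ''queue_and_wait'' is a hypothetical name.

<code bash>
# Hypothetical helper: queue every script given as argument through tsp
# and return only when all of them have finished.
queue_and_wait() {
  local script id
  local -a ids=()
  for script in "$@"; do
    id=$(tsp gpu exec python "$script")   # tsp prints the ID of the new job
    ids+=("$id")
  done
  for id in "${ids[@]}"; do
    # tsp -w ID blocks until that job ends and returns its exit status
    tsp -w "$id" || echo "job $id finished with an error" >&2
  done
}

# Usage:
#   gpu claim
#   tsp -S 2
#   queue_and_wait ./train1.py ./train2.py ./train3.py ./train4.py
#   tsp -S 1
#   gpu release
</code>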

Note: running multiple jobs in parallel does not imply reserving multiple GPUs. If only one GPU is reserved, all processes will share that same GPU and compete for its resources, such as memory and computation time.

Remember also that the reservation is only maintained while real computational activity on the GPU is detected. If jobs spend too long without using the GPU, the reservation may expire automatically due to inactivity.

=== Using tmux ===

''tmux'' keeps terminal sessions alive so that you can detach from them and reattach later, which allows interactive work without keeping the connection open. Detaching does not send ''HUP'' to the processes either.

To start a ''tmux'' session or reconnect to it if it already exists, it is recommended to use a fixed name:

<code bash>
tmux new -A -s main
</code>

This creates the ''main'' session if it does not exist, or reconnects to it if it was already created.

Once inside ''tmux'', to disconnect while keeping the session active, you should press ''Control+b'', release, and then press ''d''.

At this point, you can disconnect from the machine without interrupting the processes. To return to the same session, just execute the previous command again.

If you want to completely terminate the session from outside:

<code bash>
tmux kill-session -t main
</code>
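
''tmux'' can also start a job directly in a detached session, without attaching first. The sketch below uses the standard ''tmux has-session'' and ''tmux new-session -d'' commands and assumes a Bash shell; ''start_detached'' is a hypothetical name. The arguments are joined into a single command line, so arguments containing spaces would need extra quoting, and the session closes by itself when the command ends.

<code bash>
# Hypothetical helper: run a command in a new detached tmux session.
start_detached() {
  local name="$1"; shift
  if tmux has-session -t "$name" 2>/dev/null; then
    echo "tmux session '$name' already exists" >&2
    return 1
  fi
  tmux new-session -d -s "$name" "$*"
}

# Usage:
#   start_detached train gpu exec python ./train.py
#   tmux attach -t train      # look at the job while it is running
</code>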

=== Partitioning a GPU with MIG Profiles ===

''gpuctl'' allows partitioning a GPU into multiple MIG instances to isolate resources and run smaller loads in parallel on a single physical GPU in a more controlled manner.

This feature is only available on compatible GPUs; otherwise, it cannot be used. For now, only NVIDIA H200 GPUs have this function enabled.

To reserve a GPU and activate it in MIG mode, use one of the following:

<code bash>
gpu claim --mig half      # two partitions
gpu claim --mig third     # three partitions
gpu claim --mig quarter   # four partitions
</code>

This partitions the GPU into two, three, or four instances, choosing the most suitable MIG profiles for the GPU model. Note that the ''<nowiki>--mig</nowiki>'' parameter can only be used when claiming exactly one GPU.

Once the MIG reservation is created, ''gpu exec'' and ''gpu env'' automatically prepare ''CUDA_VISIBLE_DEVICES'' with the corresponding MIG UUIDs.
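
To run one process per MIG instance, each process can be started with ''CUDA_VISIBLE_DEVICES'' restricted to a single UUID. The sketch below assumes a Bash shell and that, after ''<nowiki>eval "$(gpu env --shell)"</nowiki>'', the variable holds a comma-separated list of MIG UUIDs; the exact format produced by ''gpuctl'' is an assumption, and ''run_per_mig'' is a hypothetical name.

<code bash>
# Hypothetical helper: start the same command once per device listed in
# CUDA_VISIBLE_DEVICES, each copy seeing only its own device, and wait for all.
run_per_mig() {
  local dev
  local -a devs pids=()
  IFS=',' read -r -a devs <<< "${CUDA_VISIBLE_DEVICES:-}"
  for dev in "${devs[@]}"; do
    CUDA_VISIBLE_DEVICES="$dev" "$@" &   # the override applies to this process only
    pids+=("$!")
  done
  wait "${pids[@]}"
}

# Usage:
#   gpu claim --mig half
#   eval "$(gpu env --shell)"
#   run_per_mig python ./train.py
#   gpu release
</code>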

=== Email Notifications ===

''gpuctl'' can send notifications by email or Telegram about the status of reservations and GPU usage.

== Viewing Current Configuration ==

<code bash>
gpu notify
</code>

== Activating or Deactivating Notifications ==

<code bash>
gpu notify --mode off
gpu notify --mode email
gpu notify --mode telegram
gpu notify --mode both
</code>

When notifications are activated, emails or Telegram messages are sent in the following cases:

  * when a GPU is automatically released due to being unused;
  * when the //enforcer// kills a process for using a GPU without an active reservation, or after an opportunistic GPU has been lost;
  * when a reservation request is queued;
  * and later when that reservation is granted.

No notifications are sent if a reservation is made and granted immediately without needing to wait, or if the GPU is released manually with ''release'' or upon finishing an ''exec'' that made an automatic reservation.

== Configuring a Custom Email ==

<code bash>
gpu notify --email your.email@domain
</code>

If a custom email is not configured, the address obtained from LDAP will be used by default. To revert to the default email:

<code bash>
gpu notify --clear-email
</code>

== Pairing a Telegram Account ==

Run the following command and follow the instructions it prints to complete the pairing process:

<code bash>
gpu notify --pair-telegram
</code>

To unpair:

<code bash>
gpu notify --unpair-telegram
</code>

== Sending Custom Notifications ==

The user can send custom notifications at various points in their workflow by using:

<code bash>
gpu notify --send "message content"
</code>

This can be useful for receiving personal alerts at the beginning or end of an experiment, after completing a phase of processing, or at any other relevant point in the workflow.

The message is sent through the configured notification mode, or by email if none is configured.
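
For example, a long execution can be wrapped so that a message with its exit status is sent when it ends. This is a sketch assuming a Bash shell; ''notify_done'' is a hypothetical name, and the message text is only an example.

<code bash>
# Hypothetical helper: run a command and send a notification when it ends,
# whether it succeeded or failed. Returns the exit status of the command.
notify_done() {
  local status=0
  "$@" || status=$?
  gpu notify --send "Command '$*' finished with exit status $status"
  return "$status"
}

# Usage:
#   notify_done gpu exec python ./train.py
</code>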

=== Usage Reports ===

To view a summary of recent GPU usage history by process, you can use:

<code bash> gpu report </code>

This shows the registered PIDs, the number of available samples, and aggregated usage metrics for GPU, CPU, and memory.

If you want to see data from a specific process, with ASCII graphs and the timeline of stored samples, you can use:

<code bash> gpu report --pid PID </code>

If there are many samples, the output may appear summarized. To see the complete series:

<code bash> gpu report --pid PID --all-samples </code>

=== Practical Recommendations ===

  * To run a single command, the simplest option is usually ''gpu exec''.
  * To maintain an active reservation between several commands, using ''gpu claim'' + ''gpu exec'' + ''gpu release'' is more appropriate.
  * If you are going to work directly from the shell after making a manual reservation, you need to run ''<nowiki>eval "$(gpu env --shell)"</nowiki>'' or prefix with ''gpu exec''.
  * To leave active sessions and retrieve them later, ''tmux'' is usually the most convenient option.
  * To queue multiple jobs, ''tsp'' is a practical alternative.
  * If using ''tsp'' with parallelism, remember that it does not reserve additional GPUs: the processes will share the GPU or GPUs visible at that moment.
  * If a reservation remains too long without real computational activity on the GPU, it may automatically be lost due to inactivity.