
GPGPU Computing Servers

Service Description

These servers are intended for general-purpose GPU computing (GPGPU): computationally intensive tasks, machine learning, data processing, and scientific simulation that require graphics-hardware acceleration.

Open Access Servers

Any researcher from the center can request access to these servers. Access is granted upon request and validation.

ctgpgpu4
  Server: PowerEdge R730
  CPU: 2 × Intel Xeon E5-2623 v4
  RAM: 128 GB
  GPUs: 2 × Nvidia GP102GL 24 GB (Tesla P40, 2016)
  Operating System: AlmaLinux 9.1
  • CUDA 12.0
  Job Management: Slurm (mandatory use)

Restricted Access Servers

Access to these servers is restricted to a specific group or project, or is more tightly controlled for resource-management and planning reasons.

It is essential to check the up-to-date information in Xici when requesting the service; it details the particular circumstances of each server (access criteria, priorities, usage conditions, etc.).

ctgpgpu5
  Server: PowerEdge R730
  CPU: 2 × Intel Xeon E5-2623 v4
  RAM: 128 GB
  GPUs: 2 × Nvidia GP102GL (Tesla P40)
  Operating System: Ubuntu 22.04
  • Nvidia Driver 590
  • CUDA Toolkit 12.5 and 13.1 (default)
  Job Management: n/a

ctgpgpu6
  Server: SIE LADON 4214
  CPU: 2 × Intel Xeon Silver 4214
  RAM: 192 GB
  GPUs: Nvidia Quadro P6000 24 GB (2018), Nvidia Quadro RTX 8000 48 GB (2019), 2 × Nvidia A30 24 GB (2020)
  Operating System: CentOS 7.9
  • Nvidia Driver 535.86.10 (CUDA 12.2)
  Job Management: n/a

ctgpgpu9
  Server: Dell PowerEdge R750
  CPU: 2 × Intel Xeon Gold 6326
  RAM: 128 GB
  GPUs: 2 × NVIDIA Ampere A100 80 GB
  Operating System: AlmaLinux 8.6
  • Nvidia Driver 515.48.07 (CUDA 11.7)
  Job Management: n/a

ctgpgpu11
  Server: Gigabyte G482-Z54
  CPU: 2 × AMD EPYC 7413 @ 2.65 GHz (24 cores)
  RAM: 256 GB
  GPUs: 5 × NVIDIA Ampere A100 80 GB
  Operating System: AlmaLinux 9.1
  • Nvidia Driver 520.61.05 (CUDA 11.8)
  Job Management: n/a

ctgpgpu12
  Server: Dell PowerEdge R760
  CPU: 2 × Intel Xeon Silver 4410Y
  RAM: 384 GB
  GPUs: 2 × NVIDIA Hopper H100 80 GB
  Operating System: AlmaLinux 9.2
  • Nvidia Driver 555.42.06 (CUDA 12.5)
  Job Management: n/a

ctgpgpu15
  Server: SIE LADON (Gigabyte)
  CPU: 2 × AMD EPYC 9474F (48 cores)
  RAM: 768 GB
  GPUs: 4 × NVIDIA H200 NVL
  Operating System: AlmaLinux 9.6
  Job Management: ts

ctgpgpu16
  Server: SIE LADON (Gigabyte)
  CPU: 2 × AMD EPYC 9474F (48 cores)
  RAM: 768 GB
  GPUs: 4 × NVIDIA H200 NVL
  Operating System: AlmaLinux 9.7
  Job Management: ts

ctgpgpu17
  Server: SIE LADON (Gigabyte)
  CPU: 2 × AMD EPYC 9474F (48 cores)
  RAM: 768 GB
  GPUs: 4 × NVIDIA H200 NVL
  Operating System: AlmaLinux 9.7
  Job Management: ts

ctgpgpu18
  Server: SIE LADON (MegaRAC SP-X)
  CPU: 2 × AMD EPYC 9335 (24 cores)
  RAM: 1536 GB
  GPUs: 4 × NVIDIA H200
  Operating System: Ubuntu 22.04
  Job Management: ts

Service Registration

Not all servers are available at all times or for every use. Access to the servers must be requested in advance through the incident report form. Users who do not have access permission will receive an "incorrect password" message when trying to connect.

User Manual

Connecting to the Servers

Connection to the servers is via SSH. The server names and IP addresses are as follows:

Node FQDN IP
ctgpgpu4 ctgpgpu4.inv.usc.es 172.16.242.201
ctgpgpu5 ctgpgpu5.inv.usc.es 172.16.242.202
ctgpgpu6 ctgpgpu6.inv.usc.es 172.16.242.205
ctgpgpu9 ctgpgpu9.inv.usc.es 172.16.242.94
ctgpgpu11 ctgpgpu11.inv.usc.es 172.16.242.96
ctgpgpu12 ctgpgpu12.inv.usc.es 172.16.242.97
ctgpgpu15 ctgpgpu15.inv.usc.es 172.16.242.207
ctgpgpu16 ctgpgpu16.inv.usc.es 172.16.242.212
ctgpgpu17 ctgpgpu17.inv.usc.es 172.16.242.213
ctgpgpu18 ctgpgpu18.inv.usc.es 172.16.242.208

The connection is only available from the center's network. To connect from other locations or from the RAI network, you must use the VPN or the SSH gateway.
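
For convenience, the servers can be declared in the SSH client configuration. The sketch below is illustrative: the hostnames come from the table above, but the username (jane.doe) and the gateway hostname (gateway.example.org) are placeholders for the actual credentials and gateway name provided by the service.

```
# Fragment of ~/.ssh/config (illustrative; username and gateway are placeholders)
Host ctgpgpu4
    HostName ctgpgpu4.inv.usc.es
    User jane.doe

# From outside the center's network, hop through the SSH gateway
Host ctgpgpu4-ext
    HostName ctgpgpu4.inv.usc.es
    User jane.doe
    ProxyJump jane.doe@gateway.example.org
```

With this in place, ssh ctgpgpu4 connects directly from the center's network.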

Job Management with SLURM

On servers with a Slurm queue manager, its use is mandatory for submitting jobs: it prevents conflicts between processes by ensuring that two jobs do not run at the same time.

To submit a job to the queue, the srun command is used:

srun cuda_program cuda_program_args

The srun process blocks until the job finishes before returning control to the user. If you do not want to wait, a session manager such as screen can be used: the job keeps running after you detach from the session, and you can reattach later to retrieve the console output.

Alternatively, nohup can be used by sending the job to the background with &. In this case, the output is saved in the nohup.out file:

nohup srun cuda_program cuda_program_args &
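
For longer runs, a batch script can be submitted with sbatch instead, so no open session is needed at all; Slurm writes the program's output to a file. A minimal sketch, assuming your binary is called cuda_program (the job name, output pattern, and script name below are illustrative):

```
#!/bin/bash
#SBATCH --job-name=cuda_job          # illustrative job name
#SBATCH --output=cuda_job_%j.out     # %j expands to the Slurm job ID

# Replace with your actual program and arguments
./cuda_program cuda_program_args
```

Saved as job.sh, it is submitted with: sbatch job.sh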

To view the status of the queue, the squeue command is used. The command shows output similar to this:

JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
9  servidore ca_water pablo.qu    PD       0:00      1 (Resources)
10 servidore ca_water pablo.qu    PD       0:00      1 (Priority)
11 servidore ca_water pablo.qu    PD       0:00      1 (Priority)
12 servidore ca_water pablo.qu    PD       0:00      1 (Priority)
13 servidore ca_water pablo.qu    PD       0:00      1 (Priority)
14 servidore ca_water pablo.qu    PD       0:00      1 (Priority)
 8 servidore ca_water pablo.qu     R       0:11      1 ctgpgpu2

An interactive view, updated every second, can also be obtained with the smap command:

smap -i 1

Job Management with TS

On servers using ts as the job manager, it is mandatory to use it to run tasks that use the GPU, in order to avoid conflicts and ensure correct resource allocation.

To request a GPU, the option -G 1 (or the number of GPUs needed) must be added:

ts -G 1 cuda_program cuda_program_args

For example:

ts -G 1 python train.py --epochs 100

The system will handle putting the job in the queue and executing it when a GPU becomes available.

For more advanced examples (multiple GPUs, additional resources, specific options, etc.), consult:

usage-overview