GPGPU Computing Servers
Service Description
These servers are intended for GPU computing (GPGPU), focused on computationally intensive tasks, machine learning, data processing, and scientific simulation that require acceleration by graphics hardware.
Open Access Servers
Any researcher from the center can request access to these servers. Access is granted upon request and validation.
| Node | Server | CPU | RAM | GPUs | Operating System | Job Management |
|---|---|---|---|---|---|---|
| ctgpgpu4 | PowerEdge R730 | 2 × Intel Xeon E5-2623 v4 | 128 GB | 2 × Nvidia GP102GL 24GB (Tesla P40, 2016) | AlmaLinux 9.1 • CUDA 12.0 | Slurm (mandatory use) |
- Non-cluster HPC computing servers: HPC Computing Cluster
- Servers in CESGA: Request access
Restricted Access Servers
Access to these servers is restricted to a specific group or project, or is more tightly controlled for resource-management and planning reasons.
It is essential to check the up-to-date information in Xici when requesting the service; it details the particular circumstances of each server (access criteria, priorities, usage conditions, etc.).
| Node | Server | CPU | RAM | GPUs | Operating System | Job Management |
|---|---|---|---|---|---|---|
| ctgpgpu5 | PowerEdge R730 | 2 × Intel Xeon E5-2623 v4 | 128 GB | 2 × Nvidia GP102GL (Tesla P40) | Ubuntu 22.04 • Nvidia Driver 590 • CUDA Toolkit 12.5 and 13.1 (default) | n/a |
| ctgpgpu6 | SIE LADON 4214 | 2 × Intel Xeon Silver 4214 | 192 GB | Nvidia Quadro P6000 24GB (2018) • Nvidia Quadro RTX8000 48GB (2019) • 2 × Nvidia A30 24GB (2020) | CentOS 7.9 • Nvidia Driver 535.86.10 (CUDA 12.2) | n/a |
| ctgpgpu9 | Dell PowerEdge R750 | 2 × Intel Xeon Gold 6326 | 128 GB | 2 × NVIDIA Ampere A100 80GB | AlmaLinux 8.6 • Nvidia Driver 515.48.07 (CUDA 11.7) | n/a |
| ctgpgpu11 | Gigabyte G482-Z54 | 2 × AMD EPYC 7413 @ 2.65 GHz (24c) | 256 GB | 5 × NVIDIA Ampere A100 80GB | AlmaLinux 9.1 • Nvidia Driver 520.61.05 (CUDA 11.8) | n/a |
| ctgpgpu12 | Dell PowerEdge R760 | 2 × Intel Xeon Silver 4410Y | 384 GB | 2 × NVIDIA Hopper H100 80GB | AlmaLinux 9.2 • Nvidia Driver 555.42.06 (CUDA 12.5) | n/a |
| ctgpgpu15 | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.6 | ts |
| ctgpgpu16 | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | ts |
| ctgpgpu17 | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | ts |
| ctgpgpu18 | SIE LADON (MegaRAC SP-X) | 2 × AMD EPYC 9335 (24c) | 1536 GB | 4 × NVIDIA H200 | Ubuntu 22.04 | ts |
Service Registration
Not all servers are available at all times or for every use. Access must be requested in advance through the incident report form. Users who do not yet have access permission will receive an "incorrect password" error when attempting to log in.
User Manual
Connecting to the Servers
Connections to the servers are made via SSH. The names and IP addresses of the servers are as follows:
| Node | FQDN | IP |
|---|---|---|
| ctgpgpu4 | ctgpgpu4.inv.usc.es | 172.16.242.201 |
| ctgpgpu5 | ctgpgpu5.inv.usc.es | 172.16.242.202 |
| ctgpgpu6 | ctgpgpu6.inv.usc.es | 172.16.242.205 |
| ctgpgpu9 | ctgpgpu9.inv.usc.es | 172.16.242.94 |
| ctgpgpu11 | ctgpgpu11.inv.usc.es | 172.16.242.96 |
| ctgpgpu12 | ctgpgpu12.inv.usc.es | 172.16.242.97 |
| ctgpgpu15 | ctgpgpu15.inv.usc.es | 172.16.242.207 |
| ctgpgpu16 | ctgpgpu16.inv.usc.es | 172.16.242.212 |
| ctgpgpu17 | ctgpgpu17.inv.usc.es | 172.16.242.213 |
| ctgpgpu18 | ctgpgpu18.inv.usc.es | 172.16.242.208 |
The connection is only available from the center's network. To connect from other locations or from the RAI network, use the VPN or the SSH gateway.
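For example, to open a session on ctgpgpu4 (replace `username` with your center account; the gateway hostname below is only a placeholder, check the service documentation for the real one):

```shell
# From inside the center's network
ssh username@ctgpgpu4.inv.usc.es

# From outside, jumping through the SSH gateway
# (the gateway hostname here is a placeholder)
ssh -J username@ssh-gateway.example.usc.es username@ctgpgpu4.inv.usc.es
```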
Job Management with SLURM
On servers with the Slurm queue manager, its use is mandatory for submitting jobs: it prevents conflicts between processes by ensuring that two jobs do not run on the same resources at the same time.
To submit a job to the queue, the srun command is used:
srun cuda_program cuda_program_arguments
The srun process waits for the job to finish before returning control to the user. If you do not want to wait, a terminal multiplexer such as screen can be used: the job keeps running, the session can be disconnected safely, and the console output can be retrieved later.
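A minimal screen workflow might look like this (the session name is arbitrary, and `cuda_program` is a placeholder for your own binary):

```shell
screen -S cudajob                        # start a named session
srun ./cuda_program cuda_program_arguments  # run the job inside it
# Detach with Ctrl+A then D; the job keeps running in the session.
screen -r cudajob                        # reattach later to check the output
```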
Alternatively, nohup can be used, sending the job to the background with &. In this case, the output is saved to the nohup.out file:
nohup srun cuda_program cuda_program_arguments &
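On a standard Slurm setup, longer jobs can also be wrapped in a batch script and submitted with sbatch. This is a sketch using standard Slurm options; whether GPU GRES is configured, and the partition names, depend on each server:

```shell
#!/bin/bash
#SBATCH --job-name=cuda_job         # name shown by squeue
#SBATCH --output=cuda_job_%j.log    # stdout/stderr file; %j expands to the job id
#SBATCH --gres=gpu:1                # request one GPU, if GRES is configured

./cuda_program cuda_program_arguments
```

Submit it with `sbatch job.sh`; control returns immediately and the output is written to the log file.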
To view the status of the queue, the squeue command is used. The command shows output similar to this:
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    9 servidore ca_water pablo.qu PD  0:00     1 (Resources)
   10 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   11 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   12 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   13 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   14 servidore ca_water pablo.qu PD  0:00     1 (Priority)
    8 servidore ca_water pablo.qu  R  0:11     1 ctgpgpu2
An interactive view, updated every second, can also be obtained with the smap command:
smap -i 1
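Note that smap has been deprecated and removed from recent Slurm releases. Where it is unavailable, a comparable auto-refreshing view can be obtained with GNU watch (assuming it is installed):

```shell
# Refresh the queue listing every second; exit with Ctrl+C
watch -n 1 squeue
```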
Job Management with TS
On servers using ts (Task Spooler) as the job manager, all tasks that use the GPU must be submitted through it, in order to avoid conflicts and ensure correct resource allocation.
To request a GPU, the option -G 1 (or the number of GPUs needed) must be added:
ts -G 1 cuda_program cuda_program_arguments
For example:
ts -G 1 python train.py --epochs 100
The system will handle putting the job in the queue and executing it when a GPU becomes available.
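Assuming the standard Task Spooler commands are available, the queue and job output can be inspected like this (the job id 4 is only an example):

```shell
ts        # list queued, running, and finished jobs
ts -c 4   # print the stored output of job 4
ts -t 4   # follow the output of job 4 while it runs
```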
For more advanced examples (multiple GPUs, additional resources, specific options, etc.), consult:
usage-overview