====== GPGPU Computing Servers ======

===== Service Description =====

These servers are intended for GPU computing (GPGPU): intensive computing tasks, machine learning, data processing, and scientific simulations that require hardware acceleration.

==== Public Access Servers ====

Any researcher from the center can request access to these servers. Access is granted upon prior request and validation.

^ Node ^ Server ^ CPU ^ RAM ^ GPUs ^ Operating System ^ Job Management ^
| ''ctgpgpu4'' | PowerEdge R730 | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]] | 128 GB | 2 × Nvidia GP102GL 24 GB (Tesla P40, 2016) | AlmaLinux 9.1 \\ • CUDA 12.0 | **Slurm (mandatory use)** |

  * Servers in the HPC computing cluster: [[centro:servizos:hpc|HPC Computing Cluster]]
  * Servers at CESGA: [[centro:servizos:cesga|Request Access]]

==== Restricted Access Servers ====

Access to these servers is restricted to a specific group or project, or is more tightly controlled for resource management and planning reasons. When requesting the service, check the up-to-date information in Xici, which details the particular circumstances of each server (access criteria, priorities, usage conditions, etc.).
^ Node ^ Server ^ CPU ^ RAM ^ GPUs ^ Operating System ^ Job Management ^
| ''ctgpgpu5'' | PowerEdge R730 | 2 × [[https://ark.intel.com/products/92980/Intel-Xeon-Processor-E5-2623-v4-10M-Cache-2_60-GHz|Intel Xeon E5-2623 v4]] | 128 GB | 2 × Nvidia GP102GL (Tesla P40) | Ubuntu 22.04 \\ • Nvidia Driver 590 \\ • CUDA Toolkit 12.5 and 13.1 (default) | n/a |
| ''ctgpgpu6'' | SIE LADON 4214 | 2 × [[https://ark.intel.com/content/www/us/en/ark/products/193385/intel-xeon-silver-4214-processor-16-5m-cache-2-20-ghz.html|Intel Xeon Silver 4214]] | 192 GB | Nvidia Quadro P6000 24 GB (2018) \\ Nvidia Quadro RTX 8000 48 GB (2019) \\ 2 × Nvidia A30 24 GB (2020) | CentOS 7.9 \\ • Nvidia Driver 535.86.10 (CUDA 12.2) | n/a |
| ''ctgpgpu9'' | Dell PowerEdge R750 | 2 × [[https://ark.intel.com/content/www/es/es/ark/products/215274/intel-xeon-gold-6326-processor-24m-cache-2-90-ghz.html|Intel Xeon Gold 6326]] | 128 GB | 2 × NVIDIA Ampere A100 80 GB | AlmaLinux 8.6 \\ • NVIDIA Driver 515.48.07 (CUDA 11.7) | n/a |
| ''ctgpgpu11'' | Gigabyte G482-Z54 | 2 × [[https://www.amd.com/es/products/cpu/amd-epyc-7413|AMD EPYC 7413 @2.65 GHz (24c)]] | 256 GB | 5 × NVIDIA Ampere A100 80 GB | AlmaLinux 9.1 \\ • NVIDIA Driver 520.61.05 (CUDA 11.8) | n/a |
| ''ctgpgpu12'' | Dell PowerEdge R760 | 2 × [[https://ark.intel.com/content/www/xl/es/ark/products/232376.html|Intel Xeon Silver 4410Y]] | 384 GB | 2 × NVIDIA Hopper H100 80 GB | AlmaLinux 9.2 \\ • NVIDIA Driver 555.42.06 (CUDA 12.5) | n/a |
| ''ctgpgpu15'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.6 | ''ts'' |
| ''ctgpgpu16'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | ''ts'' |
| ''ctgpgpu17'' ⚠️ | SIE LADON (Gigabyte) | 2 × AMD EPYC 9474F (48c) | 768 GB | 4 × NVIDIA H200 NVL | AlmaLinux 9.7 | ''ts'' |
| ''ctgpgpu18'' ⚠️ | SIE LADON (MegaRAC SP-X) | 2 × AMD EPYC 9335 (24c) | 1536 GB | 4 × NVIDIA H200 | Ubuntu 22.04 | ''ts'' |

⚠️ The servers ''ctgpgpu15'',
''ctgpgpu16'', ''ctgpgpu17'' and ''ctgpgpu18'' have a temporary installation and assignment, and their configuration and access may change around May 2026.

===== Service Registration =====

Not all servers are available at all times for every use. To access the servers, you must request access in advance through the [[https://citius.usc.es/uxitic/incidencias/add|incident form]]. Users without access permission will receive an "incorrect password" message.

===== User Manual =====

==== Connecting to the Servers ====

Connections are made via SSH. The names and IP addresses of the servers are as follows:

^ Node ^ FQDN ^ IP ^
| ''ctgpgpu4'' | ctgpgpu4.inv.usc.es | 172.16.242.201 |
| ''ctgpgpu5'' | ctgpgpu5.inv.usc.es | 172.16.242.202 |
| ''ctgpgpu6'' | ctgpgpu6.inv.usc.es | 172.16.242.205 |
| ''ctgpgpu9'' | ctgpgpu9.inv.usc.es | 172.16.242.94 |
| ''ctgpgpu11'' | ctgpgpu11.inv.usc.es | 172.16.242.96 |
| ''ctgpgpu12'' | ctgpgpu12.inv.usc.es | 172.16.242.97 |
| ''ctgpgpu15'' | ctgpgpu15.inv.usc.es | 172.16.242.207 |
| ''ctgpgpu16'' | ctgpgpu16.inv.usc.es | 172.16.242.212 |
| ''ctgpgpu17'' | ctgpgpu17.inv.usc.es | 172.16.242.213 |
| ''ctgpgpu18'' | ctgpgpu18.inv.usc.es | 172.16.242.208 |

The connection is only available from the center's network. To connect from other locations or from the RAI network, you must use the [[:en:centro:servizos:vpn:start|VPN]] or the [[:en:centro:servizos:pasarela_ssh|SSH gateway]].

==== Job Management with SLURM ====

On servers with the Slurm queue manager, its use is mandatory for submitting jobs, in order to avoid conflicts between processes: two jobs must not run at the same time.

To submit a job to the queue, use the ''srun'' command:

<code>
srun programa_cuda argumentos_programa_cuda
</code>

The ''srun'' process waits for the job to run before returning control to the user.
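Because ''srun'' blocks until the job finishes, Slurm jobs can also be submitted non-interactively with ''sbatch''. A minimal batch-script sketch follows; the job name, output file, and the ''--gres=gpu:1'' line are illustrative assumptions (they depend on how GRES is configured on each server), not site-specific values:

```shell
#!/bin/bash
#SBATCH --job-name=cuda_job        # illustrative job name
#SBATCH --output=cuda_job-%j.out   # stdout/stderr file; %j expands to the job ID
#SBATCH --gres=gpu:1               # request one GPU (assumes GRES is configured)

# The command runs when Slurm schedules the job; submitting with
#   sbatch script.sh
# returns control to the shell immediately.
programa_cuda argumentos_programa_cuda
```

Submission is non-blocking, so no ''screen'' or ''nohup'' is needed; the output is collected in the file given by ''--output''.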
If you do not want to wait, you can use a console session manager such as ''screen'': it lets you leave the job running, disconnect the session without worry, and retrieve the console output later. Alternatively, you can use ''nohup'' and send the job to the background with ''&''. In that case, the output is saved to the ''nohup.out'' file:

<code>
nohup srun programa_cuda argumentos_programa_cuda &
</code>

To check the status of the queue, use the ''squeue'' command. It produces output similar to this:

<code>
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    9 servidore ca_water pablo.qu PD  0:00     1 (Resources)
   10 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   11 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   12 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   13 servidore ca_water pablo.qu PD  0:00     1 (Priority)
   14 servidore ca_water pablo.qu PD  0:00     1 (Priority)
    8 servidore ca_water pablo.qu  R  0:11     1 ctgpgpu2
</code>

You can also get an interactive view, updated every second, with the ''smap'' command:

<code>
smap -i 1
</code>

==== Job Management with TS ====

On servers that use ''ts'' as the job manager, its use is mandatory for tasks that use the GPU, in order to avoid conflicts and ensure correct resource allocation.

To request a GPU, prepend the option ''-G 1'' (or the number of GPUs needed):

<code>
ts -G 1 programa_cuda argumentos_programa_cuda
</code>

For example:

<code>
ts -G 1 python train.py --epochs 100
</code>

The system will queue the job and run it when a GPU is available.

For more advanced examples (multiple GPUs, additional resources, specific options, etc.), you can use the command:

<code>
usage-overview
</code>
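Assuming the ''ts'' on these servers is the standard Task Spooler utility, its built-in options can be used to inspect the queue and retrieve job output. A sketch (the job ID ''3'' is illustrative; check ''ts -h'' on the server, since local builds may differ):

```shell
ts        # list the queue: job IDs, state (queued/running/finished), output files
ts -c 3   # print the full output of job 3
ts -t 3   # tail the output of job 3 while it is running
ts -i 3   # show detailed information about job 3
ts -r 3   # remove job 3 from the queue (only if it is not running)
```

The per-job output files listed by ''ts'' can also be read directly, which is useful for checking on long training runs after reconnecting.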