Wiki do CiTIUS

Service description

Servers with graphics cards:

ctgpgpu2:
- Dell Precision R5400
- 2 x Intel Xeon E5440
- 8 GB RAM (4 x DDR2 FB-DIMM 667 MHz)
- 1 Nvidia GK104 [GeForce GTX 680]
- Ubuntu 18.04 operating system
  - Slurm (mandatory to queue jobs!)
  - CUDA 9.2 (Nvidia official repo)
  - Docker-ce 18.06 (Docker official repo)
  - Nvidia-docker 2.0.3 (Nvidia official repo)
  - Nvidia cuDNN v7.2.1 for CUDA 9.2
  - Intel Parallel Studio Professional for C++ 2015 (single license! coordinate with other users!)
ctgpgpu3:
- PowerEdge R720
- 1 x Intel Xeon E5-2609
- 16 GB RAM (1 DDR3 DIMM 1600MHz)
- Connected to a graphics card expansion box with:
  - Gigabyte GeForce GTX Titan 6GB (2014)
  - Nvidia Titan X Pascal 12GB (2016)
- Ubuntu 18.04 operating system
  - Slurm (mandatory to queue jobs!)
  - CUDA 9.2 (Nvidia official repo)
  - Docker-ce 18.06 (Docker official repo)
  - Nvidia-docker 2.0.3 (Nvidia official repo)
  - Nvidia cuDNN v7.2.1 for CUDA 9.2
  - Intel Parallel Studio Professional for C++ 2015 (single license! coordinate with other users!)
  - ROS Melodic Morenia (ROS official repo)
ctgpgpu4:
- PowerEdge R730
- 2 x Intel Xeon E5-2623 v4
- 128 GB RAM (4 DDR4 DIMM 2400MHz)
- 2 x Nvidia GP102GL 24GB [Tesla P40]
- CentOS 7.4
  - Docker 17.09 and nvidia-docker 1.0.1
  - OpenCV 2.4.5
  - Dlib, Caffe, Caffe2 and pycaffe
  - Python 3.4: cython, easydict, sonnet
  - TensorFlow
ctgpgpu5:
- PowerEdge R730
- 2 x Intel Xeon E5-2623 v4
- 128 GB RAM (4 DDR4 DIMM 2400MHz)
- 2 x Nvidia GP102GL 24GB [Tesla P40]
- Ubuntu 16.04
  - Slurm (mandatory to queue jobs!)
  - Modules for library version management
  - CUDA 9.0
  - OpenCV 2.4 and 3.4
  - Atlas 3.10.3
  - MAGMA
  - TensorFlow
  - Caffe

ctgpgpu6:
- SIE LADON 4214
- 2 x Intel Xeon Silver 4214
- 192 GB RAM (12 DDR4 DIMM 2933MHz)
- Nvidia Quadro P6000 24GB (2018)
- CentOS 7.7
  - Nvidia Driver 418.87.00 for CUDA 10.1
  - Docker 19.03
  - Nvidia-docker
ctgpgpu7:
- Dell PowerEdge R740
- 2 x Intel Xeon Gold 5220
- 192 GB RAM (12 DDR4 DIMM 2667MHz)
- 2 x Nvidia Tesla V100S 32GB (2019)
- CentOS 8.1
  - Slurm (mandatory to queue jobs!)
  - Modules for library version management
  - Nvidia Driver 440.64.00 for CUDA 10.2
  - Docker 19.03
  - Nvidia-docker
ctgpgpu8:
- Dell PowerEdge R740
- 2 x Intel Xeon Gold 5220
- 192 GB RAM (12 DDR4 DIMM 2667MHz)
- 2 x Nvidia Tesla V100S 32GB (2019)
- CentOS 8.1
  - Slurm (mandatory to queue jobs!)
  - Modules for library version management
  - Nvidia Driver 440.64.00 for CUDA 10.2
  - Docker 19.03
  - Nvidia-docker

Activation

All CiTIUS users can access this service. Because not all servers are available at all times, you must register beforehand by filling in the requests and problem reporting form.

User Manual

How to connect to the servers

Use SSH. The hostnames and IP addresses are:

ctgpgpu2.inv.usc.es - 172.16.242.92:22
ctgpgpu3.inv.usc.es - 172.16.242.93:22
ctgpgpu4.inv.usc.es - 172.16.242.201:22
ctgpgpu5.inv.usc.es - 172.16.242.202:22
ctgpgpu6.inv.usc.es - 172.16.242.205:22
ctgpgpu7.inv.usc.es - 172.16.242.207:22
ctgpgpu8.inv.usc.es - 172.16.242.208:22

Connection is only possible from inside the CiTIUS network. To connect from elsewhere, or from the RAI network, you need to use the VPN or the SSH gateway.
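As a rough sketch of the gateway route, OpenSSH's -J option (available since OpenSSH 7.3) opens the connection through an intermediate host. The helper below is hypothetical, and gateway.example.org is only a placeholder because the real gateway hostname is not given on this page; set GPU_GATEWAY to the actual one.

```shell
# Hypothetical helper: connect to a GPU server through an SSH jump host.
# gateway.example.org is a placeholder, NOT the real CiTIUS gateway name;
# override it with the GPU_GATEWAY environment variable.
gpu_ssh() {
    user=$1
    server=$2
    gateway=${GPU_GATEWAY:-gateway.example.org}
    # Build the ssh command line: -J names the jump host to go through.
    set -- ssh -J "$user@$gateway" "$user@$server.inv.usc.es"
    if [ -n "${DRY_RUN:-}" ]; then
        # Only print the command, without opening a connection.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

For example, DRY_RUN=1 gpu_ssh your_user ctgpgpu7 prints the command that would be run, and gpu_ssh your_user ctgpgpu7 runs it.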

Automatic server power-off

The servers switch themselves off after an hour of being idle. To switch them on again use the remote power service.

Servers won't switch themselves off if there is an open SSH or Screen session.

Job management with SLURM

On servers with queue management software installed, you must use it to submit jobs. This avoids conflicts between processes, since two jobs should not run at the same time.

To submit a job to the queue, use the srun command:

srun cuda_program arguments_of_cuda_program

The srun process waits until the job has run before returning control to the user. If you don't want to wait, you can use a console session manager such as screen. This way you can leave the job in the queue and disconnect the session without losing the job's output, which can be recovered at any later time.

Alternatively, nohup can be used and the job sent to the background with &. In that case the output is written to the file nohup.out:

nohup srun cuda_program cuda_program_arguments &
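Since nohup appends to the same nohup.out for every job started from a directory, it can help to give each job its own log file. The following is a minimal sketch, not a site-provided tool: submit_bg and the LAUNCHER variable are hypothetical names, and LAUNCHER exists only so the helper can be tried on a machine without Slurm.

```shell
# Hypothetical helper: queue a command in the background with its own log file.
# Usage: submit_bg LOGFILE COMMAND [ARGS...]
# LAUNCHER defaults to srun; overriding it is an assumption made for testing.
submit_bg() {
    log=$1
    shift
    # stdout and stderr both go to the chosen log instead of nohup.out.
    nohup "${LAUNCHER:-srun}" "$@" > "$log" 2>&1 &
    # Print the PID of the background process so it can be tracked.
    echo "$!"
}
```

For example, submit_bg run1.log cuda_program cuda_program_arguments queues the job and leaves its output in run1.log.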

To check the queue status, use the squeue command. Its output looks similar to this:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    9 servidore ca_water pablo.qu PD       0:00      1 (Resources)
   10 servidore ca_water pablo.qu PD       0:00      1 (Priority)
   11 servidore ca_water pablo.qu PD       0:00      1 (Priority)
   12 servidore ca_water pablo.qu PD       0:00      1 (Priority)
   13 servidore ca_water pablo.qu PD       0:00      1 (Priority)
   14 servidore ca_water pablo.qu PD       0:00      1 (Priority)
    8 servidore ca_water pablo.qu  R       0:11      1 ctgpgpu2
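
When the queue is long, a quick count of jobs per state can be easier to read than the full listing. The sketch below assumes the default column layout shown above, where the state (ST) is the fifth column; queue_summary is a hypothetical helper name.

```shell
# Hypothetical helper: count jobs per state from squeue-style output on stdin.
# Assumes the default layout, with the job state (ST) in the fifth column.
queue_summary() {
    # Skip the header line, tally column 5, then print "STATE COUNT" pairs.
    awk 'NR > 1 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

It would be used as squeue | queue_summary. For the listing above, PD is the state of jobs waiting in the queue and R the state of the running job.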

The smap command gives an interactive view of the queue. With -i 1 it refreshes every second:

smap -i 1

GPGPU computation servers