====== Submit and manage jobs ======
===== Submit jobs to the queue =====
--------------
Jobs are submitted to the queue with the ''qsub'' command, which takes one mandatory argument: the name of a shell script.
ct$ qsub script.sh
The ''qsub'' command accepts on the command line the same options that can be specified as ''#PBS'' comments inside the script.
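As a sketch of the equivalence between the two styles, the script below embeds the options as ''#PBS'' comments; the commented-out ''qsub'' line passes the same options on the command line instead. The job name, resource values, and ''my_program'' are illustrative, not taken from this cluster's configuration.

```shell
# Sketch: a minimal PBS job script (names and resource values are illustrative).
cat > job.sh <<'EOF'
#!/bin/bash
#PBS -N job_name          # job name
#PBS -l nodes=1:ppn=4     # 1 node, 4 cores per node
#PBS -l walltime=01:00:00 # maximum wall-clock time
#PBS -m ae                # mail on abort and on end
cd "$PBS_O_WORKDIR"
./my_program
EOF

# The same submission, with the options on the command line instead of
# as #PBS comments (shown only, not executed here):
# qsub -N job_name -l nodes=1:ppn=4 -l walltime=01:00:00 -m ae job.sh
```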
===== Check job, queue or node state =====
---------------
==== Queue information ====
The ''qstat'' command shows the queue status,
ct$ qstat -q # Global queue information
server: ctcomp2
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
graphic32 -- -- 04:00:00 -- 0 0 -- E R
np16 -- -- 192:00:0 -- 0 0 -- E R
np32 -- -- 288:00:0 -- 0 0 -- E R
especial -- -- 672:00:0 -- 0 0 -- E R
parallel -- -- 192:00:0 -- 0 0 -- E R
np2 -- -- 192:00:0 -- 0 0 -- E R
np8 -- -- 192:00:0 -- 0 0 -- E R
short -- -- -- -- 0 0 -- E R
graphic1 -- -- 04:00:00 -- 0 0 -- E R
np1 -- -- 672:00:0 -- 0 0 -- E R
batch -- -- -- -- 0 0 -- E R
np4 -- -- 192:00:0 -- 0 0 -- E R
interactive -- -- 01:00:00 -- 0 0 -- E R
np64 -- -- 384:00:0 -- 0 0 -- E R
graphic -- -- -- -- 0 0 -- E R
bigmem -- -- -- -- 0 0 -- E R
graphic8 -- -- 04:00:00 -- 0 0 -- E R
----- -----
0 0
The first letter of the state column indicates if the queue is (E)nabled or (D)isabled and the second letter if the queue is (R)unning or (S)topped.
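The two state letters can be used to filter the queue list. The snippet below is a sketch that embeds a few sample lines in the format shown above (on the cluster you would pipe real ''qstat -q'' output instead); it assumes the two state letters are the last two whitespace-separated columns.

```shell
# Sketch: keep only (E)nabled and (R)unning queues from qstat -q-style lines.
qstat_q_sample='graphic32 -- -- 04:00:00 -- 0 0 -- E R
np16 -- -- 192:00:0 -- 0 0 -- E R
some_queue -- -- -- -- 0 0 -- D S'

# Print the name of every queue whose state is "E R".
echo "$qstat_q_sample" | awk '$(NF-1) == "E" && $NF == "R" { print $1 }'
```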
==== Job information ====
Each job is assigned a JOB_ID that is used as a unique identifier. If the job was submitted with the ''-t'' option, it is identified as ''job_id[index]''.
ct$ qstat # General information about the user's jobs
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
999999.ctcomp2 job_name user_name 38:05:59 R np32
The Time Use column shows the CPU time used.
The S column is the job state, which can be one of the following:
* C - Job is completed after having run.
* E - Job is exiting after having run.
* H - Job is held.
* Q - Job is queued, eligible to run or be routed.
* R - Job is running.
* T - Job is being moved to a new location.
* W - Job is waiting for its execution time (set with the ''-a'' option) to be reached.
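The state column lends itself to quick summaries. The sketch below counts jobs per state from lines in the ''qstat'' format shown above (sample lines are embedded for illustration; on the cluster you would pipe real ''qstat'' output); it assumes the state letter is column 5.

```shell
# Sketch: count jobs per state (column 5) from qstat-style output.
qstat_sample='999999.ctcomp2 job_name user_name 38:05:59 R np32
999998.ctcomp2 other_job user_name 00:00:00 Q np32
999997.ctcomp2 third_job user_name 01:12:03 R np16'

echo "$qstat_sample" | awk '{ count[$5]++ } END { for (s in count) print s, count[s] }'
```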
ct$ qstat -f 999999.ctcomp2 # Specific job info
Job Id: 999999.ctcomp2.innet
Job_Name = job_name
Job_Owner = user_name@ctcomp2.innet
job_state = Q
queue = np32
server = ctcomp2.innet
Checkpoint = u
ctime = Fri Feb 12 10:09:34 2016
Error_Path = ctcomp2.innet:/home/local/user_name/job_name.e999999
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = ae
Mail_Users = user_name@usc.es
mtime = Fri Feb 12 10:09:34 2016
Output_Path = ctcomp2.innet:/home/local/user_name/job_name.o999999
Priority = 0
qtime = Fri Feb 12 10:09:34 2016
Rerunable = True
Resource_List.neednodes = 1:ppn=32:intel:xeonl
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=32:intel:xeonl
Resource_List.vmem = 63gb
Resource_List.walltime = 12:00:00
substate = 10
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/local/user_name,
PBS_O_LOGNAME=user_name,
PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
PBS_O_MAIL=/var/mail/user_name,PBS_O_SHELL=/bin/bash,
PBS_O_LANG=es_ES.UTF-8,PBS_O_WORKDIR=/home/local/user_name,
PBS_O_HOST=ctcomp2.innet,PBS_O_SERVER=ctcomp2
euser = user_name
egroup = citius
queue_rank = 2110
queue_type = E
etime = Fri Feb 12 10:09:34 2016
submit_args = script.sh
fault_tolerant = False
job_radix = 0
submit_host = ctcomp2.innet
A useful attribute of finished jobs is the EXIT_STATUS, which is shown once the JOB_STATE is C.
^ Internal code ^ EXIT_STATUS value ^ Meaning ^
| JOB_EXEC_OVERLIMIT | -10 | Job exceeded a resource limit |
| JOB_EXEC_STDOUTFAIL | -9 | Job could not create/open its stdout/stderr files |
| JOB_EXEC_CMDFAIL | -8 | Exec() of user command failed |
| JOB_EXEC_BADRESRT | -7 | Job restart failed |
| JOB_EXEC_INITRMG | -6 | Job aborted on MOM init, chkpt, ok migrate |
| JOB_EXEC_INITRST | -5 | Job aborted on MOM init, chkpt, no migrate |
| JOB_EXEC_INITABT | -4 | Job aborted on MOM initialization |
| JOB_EXEC_RETRY | -3 | Job execution failed, do retry |
| JOB_EXEC_FAIL2 | -2 | Job execution failed, after files, no retry |
| JOB_EXEC_FAIL1 | -1 | Job execution failed, before files, no retry |
| JOB_EXEC_OK | 0 | Job execution successful |
| | 1-256 | Exit status of the top-level shell |
| | >256 | Job ended by a UNIX signal; subtracting 256 gives the signal number. |
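The table's three ranges can be turned into a small helper. The function below is a sketch (the name ''exit_status_meaning'' is invented for illustration) that classifies an exit status value according to the rules above.

```shell
# Sketch: interpret a PBS exit_status value per the table above.
exit_status_meaning() {
  local es=$1
  if [ "$es" -eq 0 ]; then
    echo "success"                          # JOB_EXEC_OK
  elif [ "$es" -gt 256 ]; then
    echo "killed by signal $((es - 256))"   # >256: UNIX signal
  elif [ "$es" -gt 0 ]; then
    echo "shell exited with status $es"     # 1-256: top-level shell status
  else
    echo "PBS internal error $es"           # negative: internal PBS code
  fi
}

exit_status_meaning 0     # success
exit_status_meaning 265   # killed by signal 9 (265 - 256 = SIGKILL)
exit_status_meaning -3    # PBS internal error -3 (JOB_EXEC_RETRY)
```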
ct$ checkjob 999999.ctcomp2 # Scheduler information about a specific job
checking job 999999
State: Running
Creds: user:user_name group:citius class:np32 qos:DEFAULT
WallTime: 00:25:46 of 12:00:00
SubmitTime: Tue Feb 16 10:40:31
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
StartTime: Tue Feb 16 10:40:32
Total Tasks: 32
Req[0] TaskCount: 32 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [active][intel][xeonl]
Allocated Nodes:
[inode15:32]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '137092' (-00:25:32 -> 11:34:28 Duration: 12:00:00)
PE: 32.00 StartPriority: 21
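The ''WallTime: used of limit'' line in the ''checkjob'' output can be turned into a percentage. The sketch below (the ''to_seconds'' helper is invented for illustration, and the sample line is copied from the output above) assumes bash and the ''HH:MM:SS'' format.

```shell
# Sketch: how much of its walltime has the job used? (assumes bash)
to_seconds() {
  # Convert HH:MM:SS to seconds; 10# forces base 10 despite leading zeros.
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

line='WallTime:   00:25:46 of 12:00:00'
used=$(echo "$line" | awk '{print $2}')
limit=$(echo "$line" | awk '{print $4}')
echo "used $(( 100 * $(to_seconds "$used") / $(to_seconds "$limit") ))% of walltime"
```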
ct$ tracejob -n 3 999999.ctcomp2 # Shows the log entries related to the given job id.
Job: 999999.ctcomp2.innet
02/10/2016 15:22:26 S enqueuing into batch, state 1 hop 1
02/10/2016 15:22:26 S dequeuing from batch, state QUEUED
02/10/2016 15:22:26 S enqueuing into np1, state 1 hop 1
02/10/2016 15:22:26 S Job Run at request of citiuscap@ctcomp2.innet
02/10/2016 15:22:26 S Not sending email: User does not want mail of this type.
02/10/2016 15:22:26 A queue=batch
02/10/2016 15:22:26 A queue=np1
02/10/2016 15:22:26 A user=user_name group=citius
jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146 start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00
02/10/2016 16:08:34 S Exit_status=0 resources_used.cput=00:46:14
resources_used.mem=234868kb resources_used.vmem=1002480kb
resources_used.walltime=00:46:08
02/10/2016 16:08:34 S on_job_exit valid pjob: 999999.ctcomp2.innet (substate=50)
02/10/2016 16:08:34 A user=user_name group=citius jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146 start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00 session=7304 end=1455116914 Exit_status=0 resources_used.cput=00:46:14 resources_used.mem=234868kb resources_used.vmem=1002480kb resources_used.walltime=00:46:08
02/10/2016 17:08:35 S dequeuing from np1, state COMPLETE
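The accounting (''A'') records are ''key=value'' pairs, so individual fields can be picked out with standard tools. The sketch below uses a trimmed copy of the sample record above; on the cluster you would pipe real ''tracejob'' output instead.

```shell
# Sketch: extract fields from a tracejob accounting record (key=value pairs).
record='Exit_status=0 resources_used.cput=00:46:14 resources_used.mem=234868kb resources_used.walltime=00:46:08'

# One pair per line, then select the fields of interest.
for field in Exit_status resources_used.walltime; do
  echo "$record" | tr ' ' '\n' | grep "^$field="
done
```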
==== Node info ====
To get a global view of the cluster state, the ''nodes-usage'' command can be used.
ct$ nodes-usage
+----------------------------------+-------------------+
| USAGE | NODE |
+----------------------------------+-------------------+
| ################################ | node1 (64/64) |
| ################################ | node2 (64/64) |
| | node3 (0/64) |
| ################################ | node4 (64/64) |
| | node5 (0/64) |
| ################################ | node6 (64/64) |
| | node7 (0/64) |
| | inode11 (0/32) |
| | inode12 (0/??) |
| | inode13 (0/32) |
| | inode14 (0/32) |
| | inode15 (0/??) |
| | inode16 (0/32) |
| | inode17 (0/??) |
| | inode18 (0/??) |
| ## | inode19 (2/32) |
| ############################ | inode20 (28/32) |
+----------------------------------+-------------------+
| ############## | TOTAL (286/640) |
+----------------------------------+-------------------+
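The per-node ''(used/total)'' pairs also allow the TOTAL row to be recomputed by hand. The sketch below embeds a few entries in the format shown above (illustrative subset, not the full node list) and sums them with ''awk''.

```shell
# Sketch: recompute cluster occupancy from "name (used/total)" entries.
nodes='node1 (64/64)
node3 (0/64)
inode19 (2/32)
inode20 (28/32)'

# Split each line on "(", "/" and ")" so $2 = used and $3 = total.
echo "$nodes" | awk -F'[(/)]' '{ used += $2; total += $3 }
                               END { printf "%d/%d cores in use\n", used, total }'
```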
To see which users are running jobs on a given node, the ''node-users'' command can be used:
ct$ node-users node1
Tracing node jobs...................................................................
jorge.suarez natalia.fernandez
To get more detailed information on the nodes, the ''pbsnodes'' command can be used:
ct$ pbsnodes #Detailed information on all nodes
node1
state = free
np = 64
properties = amd,bigmem,test,active,activeX
ntype = cluster
status = rectime=1455267717,varattr=,jobs=,state=free,netload=86957182662,gres=,loadave=0.00,ncpus=64,physmem=132250896kb,availmem=162914704kb,totmem=163499276kb,idletime=1876325,nusers=0,nsessions=0,uname=Linux node1 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
node2
state = down,offline
np = 64
properties = amd,bigmem
ntype = cluster
status = rectime=1454919087,varattr=,jobs=,state=free,netload=1185896,gres=,loadave=0.00,ncpus=64,physmem=264633540kb,availmem=295220244kb,totmem=295881920kb,idletime=11140,nusers=0,nsessions=0,uname=Linux node2 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
......
ct$ pbsnodes -l # List nodes that are shut down (down) or not available (offline)
node2 down,offline
node3 down,offline
node4 down,offline
node5 down,offline
node6 down,offline
node7 down,offline
inode11 down,offline
inode12 down,offline
inode13 down,offline
inode14 down,offline
inode15 down,offline
inode17 down,offline
inode18 down,offline
inode19 down,offline
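The ''pbsnodes -l'' output can be summarized per state. The sketch below embeds a few sample lines in that format (illustrative, not the full list above) and counts how many nodes report each state.

```shell
# Sketch: count unavailable nodes by state from pbsnodes -l-style output.
pbsnodes_l='node2 down,offline
node3 down,offline
inode19 offline'

# Column 2 is a comma-separated state list; a node may match both states.
echo "$pbsnodes_l" | awk '$2 ~ /down/    { d++ }
                          $2 ~ /offline/ { o++ }
                          END { print d " down, " o " offline" }'
```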
===== Cancel a job from the queue =====
-------------
The ''qdel'' command lets the user delete a job: it first sends the job a TERM signal and then a KILL signal. It takes as its argument the PBS identifier assigned to the job, which can be obtained with the ''qstat'' command.
ct$ qdel job_id
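Combining ''qstat'' and ''qdel'' allows bulk cancellation, for example of all queued (state ''Q'') jobs. The sketch below is a dry run over sample ''qstat''-style lines embedded for illustration: it only prints the ''qdel'' commands it would run. On the cluster you would pipe real ''qstat'' output and remove the ''echo'' to actually cancel the jobs.

```shell
# Sketch: dry run of cancelling every queued job (state Q, column 5).
qstat_sample='999998.ctcomp2 other_job user_name 00:00:00 Q np32
999999.ctcomp2 job_name user_name 38:05:59 R np32'

echo "$qstat_sample" | awk '$5 == "Q" { print $1 }' | while read -r job_id; do
  echo qdel "$job_id"   # dry run: prints the command instead of running it
done
```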