Submit and manage jobs

Submit jobs to the queue


Jobs are submitted with the qsub command, whose only mandatory argument is the name of a shell script.

ct$ qsub script.sh

The qsub command accepts on the command line the same options that can be written as #PBS comments in the script.
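As a minimal sketch of that equivalence, the script below requests a job name, resources and mail events through #PBS comments, and a comment shows the same request on the command line. The option values are illustrative only, borrowed from the sample output elsewhere on this page, and are not site recommendations.

```shell
#!/bin/bash
#PBS -N job_name
#PBS -l nodes=1:ppn=1,walltime=12:00:00
#PBS -m ae

# Same request given on the command line instead of in the script:
#   qsub -N job_name -l nodes=1:ppn=1,walltime=12:00:00 -m ae script.sh

# PBS_O_WORKDIR is set by the batch system to the submission directory;
# fall back to the current directory when the script is run by hand.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1
echo "Job ${PBS_JOBID:-unknown} running in $workdir"
```

Since the #PBS lines are ordinary shell comments, the script also runs unchanged outside the queue system, which is handy for a quick test.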

Check job, queue or node state


Queue information

The qstat command shows the queue status:

ct$ qstat -q # Global queue information
server: ctcomp2
 
Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
graphic32          --      --    04:00:00   --    0   0 --   E R
np16               --      --    192:00:0   --    0   0 --   E R
np32               --      --    288:00:0   --    0   0 --   E R
especial           --      --    672:00:0   --    0   0 --   E R
parallel           --      --    192:00:0   --    0   0 --   E R
np2                --      --    192:00:0   --    0   0 --   E R
np8                --      --    192:00:0   --    0   0 --   E R
short              --      --       --      --    0   0 --   E R
graphic1           --      --    04:00:00   --    0   0 --   E R
np1                --      --    672:00:0   --    0   0 --   E R
batch              --      --       --      --    0   0 --   E R
np4                --      --    192:00:0   --    0   0 --   E R
interactive        --      --    01:00:00   --    0   0 --   E R
np64               --      --    384:00:0   --    0   0 --   E R
graphic            --      --       --      --    0   0 --   E R
bigmem             --      --       --      --    0   0 --   E R
graphic8           --      --    04:00:00   --    0   0 --   E R
                                               ----- -----
                                                   0     0

The first letter of the State column indicates whether the queue is (E)nabled or (D)isabled, and the second letter whether it is (R)unning or (S)topped.
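The two letters can be decoded in a script with a small helper such as the following. This is only a sketch: describe_queue_state is a hypothetical function name, and it expects the two letters as separate arguments.

```shell
# Translate the two letters of the State column into words.
# $1: E or D (enabled/disabled), $2: R or S (running/stopped).
describe_queue_state() {
    case "$1" in
        E) enabled="enabled" ;;
        D) enabled="disabled" ;;
        *) enabled="unknown" ;;
    esac
    case "$2" in
        R) started="running" ;;
        S) started="stopped" ;;
        *) started="unknown" ;;
    esac
    echo "$enabled, $started"
}

describe_queue_state E R
```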

Job information

Each time a job is submitted, it is assigned a JOB_ID that serves as its unique identifier. If the job was submitted with the -t option, each of its sub-jobs is identified as job_id[index].
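A short sketch of how a script might use the index of a job submitted with -t. It assumes Torque's PBS_ARRAYID environment variable, which holds the index of each sub-job; the input file naming scheme is hypothetical.

```shell
#!/bin/bash
# Submitted with, for example:  qsub -t 0-3 script.sh

# PBS_ARRAYID holds the index of this sub-job; default to 0 when the
# script is run outside the batch system.
index="${PBS_ARRAYID:-0}"

# Hypothetical naming scheme: one input file per index.
input="input_${index}.dat"
echo "Sub-job $index would process $input"
```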

ct$ qstat  # General information about the user's jobs
Job id                    Name             User               Time Use S Queue
------------------------- ---------------- ---------------    -------- - -----
999999.ctcomp2            job_name         user_name          38:05:59 R np32       

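The Time Use column shows the CPU time used. The S column shows the job state, which can be one of the following:

State  Meaning
-----  -------
C      Job is completed after having run
E      Job is exiting after having run
H      Job is held
Q      Job is queued, eligible to run or be routed
R      Job is running
T      Job is being moved to a new location
W      Job is waiting for its execution time to be reached
S      Job is suspended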
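To read the S column from a script, the qstat output can be filtered with awk. The sketch below assumes the default column layout shown above, where the state is the second-to-last field; job_states is a hypothetical helper name.

```shell
# Print "job_id state" for every job line in qstat output read from stdin.
# Header and separator lines are skipped because they do not start with a digit.
job_states() {
    awk '$1 ~ /^[0-9]+/ { print $1, $(NF-1) }'
}

# Typical use on the cluster:
#   qstat | job_states
```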

ct$ qstat -f 999999.ctcomp2   # Specific job info
Job Id: 999999.ctcomp2.innet
    Job_Name = job_name
    Job_Owner = user_name@ctcomp2.innet
    job_state = Q
    queue = np32
    server = ctcomp2.innet
    Checkpoint = u
    ctime = Fri Feb 12 10:09:34 2016
    Error_Path = ctcomp2.innet:/home/local/user_name/job_name.e999999
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = ae
    Mail_Users = user_name@usc.es
    mtime = Fri Feb 12 10:09:34 2016
    Output_Path = ctcomp2.innet:/home/local/user_name/job_name.o999999
    Priority = 0
    qtime = Fri Feb 12 10:09:34 2016
    Rerunable = True
    Resource_List.neednodes = 1:ppn=32:intel:xeonl
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=32:intel:xeonl
    Resource_List.vmem = 63gb
    Resource_List.walltime = 12:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/local/user_name,
	PBS_O_LOGNAME=user_name,
	PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
	PBS_O_MAIL=/var/mail/user_name,PBS_O_SHELL=/bin/bash,
	PBS_O_LANG=es_ES.UTF-8,PBS_O_WORKDIR=/home/local/user_name,
	PBS_O_HOST=ctcomp2.innet,PBS_O_SERVER=ctcomp2
    euser = user_name
    egroup = citius
    queue_rank = 2110
    queue_type = E
    etime = Fri Feb 12 10:09:34 2016
    submit_args = script.sh
    fault_tolerant = False
    job_radix = 0
    submit_host = ctcomp2.innet

A useful attribute of finished jobs is EXIT_STATUS, which is shown once JOB_STATE is C (completed).

Internal code        EXIT_STATUS value  Meaning
-------------------  -----------------  -------
JOB_EXEC_OVERLIMIT   -10
JOB_EXEC_STDOUTFAIL  -9
JOB_EXEC_CMDFAIL     -8                 Exec() of user command failed
JOB_EXEC_BADRESRT    -7                 Job restart failed
JOB_EXEC_INITRMG     -6                 Job aborted on MOM init, chkpt, ok migrate
JOB_EXEC_INITRST     -5                 Job aborted on MOM init, chkpt, no migrate
JOB_EXEC_INITABT     -4                 Job aborted on MOM initialization
JOB_EXEC_RETRY       -3                 Job execution failed, do retry
JOB_EXEC_FAIL2       -2                 Job execution failed, after files, no retry
JOB_EXEC_FAIL1       -1                 Job execution failed, before files, no retry
JOB_EXEC_OK          0                  Job execution successful
                     1-256              Exit status of the top-level shell
                     >256               Job ended by a UNIX signal; subtracting 256 gives the signal number
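The ranges in the table translate directly into a small classifier. This is a sketch that follows the table as given; explain_exit_status is a hypothetical helper name and the wording of its messages is illustrative.

```shell
# Classify an EXIT_STATUS value according to the ranges in the table.
explain_exit_status() {
    code=$1
    if [ "$code" -lt 0 ]; then
        echo "batch system failure (internal code $code)"
    elif [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -le 256 ]; then
        echo "shell exit status $code"
    else
        # Values above 256 encode a signal: subtract 256 to get its number.
        echo "killed by signal $((code - 256))"
    fi
}

explain_exit_status 265
```

For example, a value of 265 corresponds to signal 9 (KILL) and 271 to signal 15 (TERM).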

The checkjob command shows the scheduler's view of a specific job:

ct$ checkjob 999999.ctcomp2                # Info about a specific job
 
checking job 999999
 
State: Running
Creds:  user:user_name  group:citius  class:np32  qos:DEFAULT
WallTime: 00:25:46 of 12:00:00
SubmitTime: Tue Feb 16 10:40:31
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
 
StartTime: Tue Feb 16 10:40:32
Total Tasks: 32
 
Req[0]  TaskCount: 32  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [active][intel][xeonl]
Allocated Nodes:
[inode15:32]
 
 
IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE
 
Reservation '137092' (-00:25:32 -> 11:34:28  Duration: 12:00:00)
PE:  32.00  StartPriority:  21
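
The tracejob command searches the server logs for a job; the -n option sets how many days back to look:

ct$ tracejob -n 3 999999.ctcomp2   # Shows the log content related to the indicated jobid.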
Job: 999999.ctcomp2.innet
 
02/10/2016 15:22:26  S    enqueuing into batch, state 1 hop 1
02/10/2016 15:22:26  S    dequeuing from batch, state QUEUED
02/10/2016 15:22:26  S    enqueuing into np1, state 1 hop 1
02/10/2016 15:22:26  S    Job Run at request of citiuscap@ctcomp2.innet
02/10/2016 15:22:26  S    Not sending email: User does not want mail of this type.
02/10/2016 15:22:26  A    queue=batch
02/10/2016 15:22:26  A    queue=np1
02/10/2016 15:22:26  A    user=user_name group=citius
                          jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146 start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00
02/10/2016 16:08:34  S    Exit_status=0 resources_used.cput=00:46:14
                          resources_used.mem=234868kb resources_used.vmem=1002480kb
                          resources_used.walltime=00:46:08
02/10/2016 16:08:34  S    on_job_exit valid pjob: 999999.ctcomp2.innet (substate=50)
02/10/2016 16:08:34  A    user=user_name group=citius jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146 start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00 session=7304 end=1455116914 Exit_status=0 resources_used.cput=00:46:14 resources_used.mem=234868kb resources_used.vmem=1002480kb resources_used.walltime=00:46:08
02/10/2016 17:08:35  S    dequeuing from np1, state COMPLETE
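The start= and end= fields of the accounting record are UNIX timestamps, so the elapsed time can be recomputed from them and checked against resources_used.walltime. A minimal sketch, with elapsed_hms as a hypothetical helper name:

```shell
# Print the time between two UNIX timestamps as HH:MM:SS.
# $1: start timestamp, $2: end timestamp.
elapsed_hms() {
    secs=$(( $2 - $1 ))
    printf '%02d:%02d:%02d\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Values taken from the accounting record above (start= and end=).
elapsed_hms 1455114146 1455116914   # prints 00:46:08
```

The result matches the resources_used.walltime=00:46:08 value reported in the same record.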

Node info

To get a global view of the cluster state, use the nodes-usage command:

$ nodes-usage
+----------------------------------+-------------------+
| USAGE                            | NODE              |
+----------------------------------+-------------------+
| ################################ | node1 (64/64)     |
| ################################ | node2 (64/64)     |
|                                  | node3 (0/64)      |
| ################################ | node4 (64/64)     |
|                                  | node5 (0/64)      |
| ################################ | node6 (64/64)     |
|                                  | node7 (0/64)      |
|                                  | inode11 (0/32)    |
|                                  | inode12 (0/??)    |
|                                  | inode13 (0/32)    |
|                                  | inode14 (0/32)    |
|                                  | inode15 (0/??)    |
|                                  | inode16 (0/32)    |
|                                  | inode17 (0/??)    |
|                                  | inode18 (0/??)    |
| ##                               | inode19 (2/32)    |
| ############################     | inode20 (28/32)   |
+----------------------------------+-------------------+
| ##############                   | TOTAL (286/640)   |
+----------------------------------+-------------------+
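The (used/total) pairs in this output can be turned into a count of free cores per node. The sketch below assumes the exact table layout shown above; free_cores is a hypothetical helper name, and nodes whose core count is reported as ?? are skipped.

```shell
# Read nodes-usage output on stdin and print "node free_cores" per node.
free_cores() {
    # Keep only rows of the form "| ... | name (used/total) |" and
    # reduce them to "name used total".
    sed -n 's#.*| *\([A-Za-z0-9]*\) (\([0-9]*\)/\([0-9][0-9]*\)) *| *$#\1 \2 \3#p' |
    awk '$1 != "TOTAL" { print $1, $3 - $2 }'
}

# Typical use on the cluster:
#   nodes-usage | free_cores
```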

To list the users with jobs on a given node, use the node-users <node> command:

$ node-users node1
Tracing node jobs...................................................................
jorge.suarez natalia.fernandez

To get more detailed information on the nodes, use the pbsnodes command:

ct$ pbsnodes  # Detailed information on all nodes
node1
     state = free
     np = 64
     properties = amd,bigmem,test,active,activeX
     ntype = cluster
     status = rectime=1455267717,varattr=,jobs=,state=free,netload=86957182662,gres=,loadave=0.00,ncpus=64,physmem=132250896kb,availmem=162914704kb,totmem=163499276kb,idletime=1876325,nusers=0,nsessions=0,uname=Linux node1 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
 
node2
     state = down,offline
     np = 64
     properties = amd,bigmem
     ntype = cluster
     status = rectime=1454919087,varattr=,jobs=,state=free,netload=1185896,gres=,loadave=0.00,ncpus=64,physmem=264633540kb,availmem=295220244kb,totmem=295881920kb,idletime=11140,nusers=0,nsessions=0,uname=Linux node2 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
 
......

ct$ pbsnodes -l  # List nodes that are shut down (down) or unavailable (offline)
node2                down,offline
node3                down,offline
node4                down,offline
node5                down,offline
node6                down,offline
node7                down,offline
inode11              down,offline
inode12              down,offline
inode13              down,offline
inode14              down,offline
inode15              down,offline
inode17              down,offline
inode18              down,offline
inode19              down,offline
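The detailed pbsnodes output can be condensed to one line per node with awk. This sketch assumes the layout shown above, with the node name at the start of a line and indented "key = value" attributes below it; node_summary is a hypothetical helper name.

```shell
# Read full pbsnodes output on stdin and print "node state np" per node.
node_summary() {
    awk '
        /^[^ \t]/ && NF == 1       { node = $1 }            # unindented line: node name
        $1 == "state" && $2 == "=" { state = $3 }           # e.g. free or down,offline
        $1 == "np" && $2 == "="    { print node, state, $3 }
    '
}

# Typical use on the cluster:
#   pbsnodes | node_summary
```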

Cancel a job from the queue


The qdel command lets the user delete a job. It works by first sending the job a TERM signal and then a KILL signal. The command takes as its argument the PBS identifier assigned to the job, which can be found with the qstat command.

ct$ qdel job_id
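Because TERM arrives before KILL, a job script can catch it and tidy up before it is killed. The following is a minimal sketch of that idea; the scratch directory and the cleanup function are assumptions of this example, not part of the cluster setup.

```shell
#!/bin/bash
# Scratch directory for temporary files (location chosen by mktemp).
scratch=$(mktemp -d)

# Remove temporary files; called on TERM and at the normal end of the job.
cleanup() {
    rm -rf "$scratch"
}

# qdel sends TERM first: clean up and exit before the KILL arrives.
# 143 is the conventional exit code for termination by TERM (128 + 15).
trap 'cleanup; exit 143' TERM

# ... the real work of the job would go here, writing into "$scratch" ...
echo "working in $scratch"

cleanup   # normal end of the job
```

The handler should finish quickly, since the KILL signal that follows cannot be caught.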