====== Submit and manage jobs ======
===== Submit jobs to the queue =====
--------------
Jobs are submitted to the queue with the ''qsub'' command, which takes one mandatory argument: the name of a shell script.
ct$ qsub script.sh
The ''qsub'' command accepts on the command line the same options that can be specified as ''#PBS'' comments inside the script.
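As a sketch of the equivalence between the two styles, the script below embeds the options as ''#PBS'' comments; the commented-out ''qsub'' line passes the same options on the command line instead. The job name, resource values, and ''my_program'' are illustrative, not taken from this cluster's configuration.

```shell
# Sketch: a minimal PBS job script (names and resource values are illustrative).
cat > job.sh <<'EOF'
#!/bin/bash
#PBS -N job_name          # job name
#PBS -l nodes=1:ppn=4     # 1 node, 4 cores per node
#PBS -l walltime=01:00:00 # maximum wall-clock time
#PBS -m ae                # mail on abort and on end
cd "$PBS_O_WORKDIR"
./my_program
EOF

# The same submission, with the options on the command line instead of
# as #PBS comments (shown only, not executed here):
# qsub -N job_name -l nodes=1:ppn=4 -l walltime=01:00:00 -m ae job.sh
```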
===== Check job, queue or node state =====
---------------
==== Queue information ====
The ''qstat'' command shows the queue status,
ct$ qstat -q # Global queue information
server: ctcomp2
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
graphic32 -- -- 04:00:00 -- 0 0 -- E R
np16 -- -- 192:00:0 -- 0 0 -- E R
np32 -- -- 288:00:0 -- 0 0 -- E R
especial -- -- 672:00:0 -- 0 0 -- E R
parallel -- -- 192:00:0 -- 0 0 -- E R
np2 -- -- 192:00:0 -- 0 0 -- E R
np8 -- -- 192:00:0 -- 0 0 -- E R
short -- -- -- -- 0 0 -- E R
graphic1 -- -- 04:00:00 -- 0 0 -- E R
np1 -- -- 672:00:0 -- 0 0 -- E R
batch -- -- -- -- 0 0 -- E R
np4 -- -- 192:00:0 -- 0 0 -- E R
interactive -- -- 01:00:00 -- 0 0 -- E R
np64 -- -- 384:00:0 -- 0 0 -- E R
graphic -- -- -- -- 0 0 -- E R
bigmem -- -- -- -- 0 0 -- E R
graphic8 -- -- 04:00:00 -- 0 0 -- E R
----- -----
0 0
The first letter of the state column indicates if the queue is (E)nabled or (D)isabled and the second letter if the queue is (R)unning or (S)topped.
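The two state letters can be used to filter the queue list. The snippet below is a sketch that embeds a few sample lines in the format shown above (on the cluster you would pipe real ''qstat -q'' output instead); it assumes the two state letters are the last two whitespace-separated columns.

```shell
# Sketch: keep only (E)nabled and (R)unning queues from qstat -q-style lines.
qstat_q_sample='graphic32 -- -- 04:00:00 -- 0 0 -- E R
np16 -- -- 192:00:0 -- 0 0 -- E R
some_queue -- -- -- -- 0 0 -- D S'

# Print the name of every queue whose state is "E R".
echo "$qstat_q_sample" | awk '$(NF-1) == "E" && $NF == "R" { print $1 }'
```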
==== Job information ====
Each job is assigned a JOB_ID that is used as a unique identifier. If the job was submitted with the ''-t'' option, it is identified as ''job_id[index]''.
ct$ qstat # General information about the user's jobs
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
999999.ctcomp2 job_name user_name 38:05:59 R np32
The Time Use column shows the CPU time used.
The S column is the job state, which can be one of the following:
* C - Job is completed after having run.
* E - Job is exiting after having run.
* H - Job is held.
* Q - Job is queued, eligible to run or be routed.
* R - Job is running.
* T - Job is being moved to a new location.
* W - Job is waiting for its execution time (set with the ''-a'' option) to be reached.
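The state column lends itself to quick summaries. The sketch below counts jobs per state from lines in the ''qstat'' format shown above (sample lines are embedded for illustration; on the cluster you would pipe real ''qstat'' output); it assumes the state letter is column 5.

```shell
# Sketch: count jobs per state (column 5) from qstat-style output.
qstat_sample='999999.ctcomp2 job_name user_name 38:05:59 R np32
999998.ctcomp2 other_job user_name 00:00:00 Q np32
999997.ctcomp2 third_job user_name 01:12:03 R np16'

echo "$qstat_sample" | awk '{ count[$5]++ } END { for (s in count) print s, count[s] }'
```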
ct$ qstat -f 999999.ctcomp2 # Specific job info
Job Id: 999999.ctcomp2.innet
Job_Name = job_name
Job_Owner = user_name@ctcomp2.innet
job_state = Q
queue = np32
server = ctcomp2.innet
Checkpoint = u
ctime = Fri Feb 12 10:09:34 2016
Error_Path = ctcomp2.innet:/home/local/user_name/job_name.e999999
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = ae
Mail_Users = user_name@usc.es
mtime = Fri Feb 12 10:09:34 2016
Output_Path = ctcomp2.innet:/home/local/user_name/job_name.o999999
Priority = 0
qtime = Fri Feb 12 10:09:34 2016
Rerunable = True
Resource_List.neednodes = 1:ppn=32:intel:xeonl
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=32:intel:xeonl
Resource_List.vmem = 63gb
Resource_List.walltime = 12:00:00
substate = 10
Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/local/user_name,
PBS_O_LOGNAME=user_name,
PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
PBS_O_MAIL=/var/mail/user_name,PBS_O_SHELL=/bin/bash,
PBS_O_LANG=es_ES.UTF-8,PBS_O_WORKDIR=/home/local/user_name,
PBS_O_HOST=ctcomp2.innet,PBS_O_SERVER=ctcomp2
euser = user_name
egroup = citius
queue_rank = 2110
queue_type = E
etime = Fri Feb 12 10:09:34 2016
submit_args = script.sh
fault_tolerant = False
job_radix = 0
submit_host = ctcomp2.innet
A useful attribute of finished jobs is the EXIT_STATUS, which is shown once the JOB_STATE is C.
^ Internal code ^ EXIT_STATUS value ^ Meaning ^
| JOB_EXEC_OVERLIMIT | -10 | Job exceeded a resource limit |
| JOB_EXEC_STDOUTFAIL | -9 | Job could not create/open its stdout/stderr files |
| JOB_EXEC_CMDFAIL | -8 | Exec() of user command failed |
| JOB_EXEC_BADRESRT | -7 | Job restart failed |
| JOB_EXEC_INITRMG | -6 | Job aborted on MOM init, chkpt, ok migrate |
| JOB_EXEC_INITRST | -5 | Job aborted on MOM init, chkpt, no migrate |
| JOB_EXEC_INITABT | -4 | Job aborted on MOM initialization |
| JOB_EXEC_RETRY | -3 | Job execution failed, do retry |
| JOB_EXEC_FAIL2 | -2 | Job execution failed, after files, no retry |
| JOB_EXEC_FAIL1 | -1 | Job execution failed, before files, no retry |
| JOB_EXEC_OK | 0 | Job execution successful |
| | 1-256 | Exit status of the top-level shell |
| | >256 | Job ended by a UNIX signal; subtracting 256 gives the signal number. |
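The table's three ranges can be turned into a small helper. The function below is a sketch (the name ''exit_status_meaning'' is invented for illustration) that classifies an exit status value according to the rules above.

```shell
# Sketch: interpret a PBS exit_status value per the table above.
exit_status_meaning() {
  local es=$1
  if [ "$es" -eq 0 ]; then
    echo "success"                          # JOB_EXEC_OK
  elif [ "$es" -gt 256 ]; then
    echo "killed by signal $((es - 256))"   # >256: UNIX signal
  elif [ "$es" -gt 0 ]; then
    echo "shell exited with status $es"     # 1-256: top-level shell status
  else
    echo "PBS internal error $es"           # negative: internal PBS code
  fi
}

exit_status_meaning 0     # success
exit_status_meaning 265   # killed by signal 9 (265 - 256 = SIGKILL)
exit_status_meaning -3    # PBS internal error -3 (JOB_EXEC_RETRY)
```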
ct$ checkjob 999999.ctcomp2 # Scheduler information about a specific job
checking job 999999
State: Running
Creds: user:user_name group:citius class:np32 qos:DEFAULT
WallTime: 00:25:46 of 12:00:00
SubmitTime: Tue Feb 16 10:40:31
(Time Queued Total: 00:00:01 Eligible: 00:00:01)
StartTime: Tue Feb 16 10:40:32
Total Tasks: 32
Req[0] TaskCount: 32 Partition: DEFAULT
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [active][intel][xeonl]
Allocated Nodes:
[inode15:32]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 1
PartitionMask: [ALL]
Flags: RESTARTABLE
Reservation '137092' (-00:25:32 -> 11:34:28 Duration: 12:00:00)
PE: 32.00 StartPriority: 21
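The ''WallTime: used of limit'' line in the ''checkjob'' output can be turned into a percentage. The sketch below (the ''to_seconds'' helper is invented for illustration, and the sample line is copied from the output above) assumes bash and the ''HH:MM:SS'' format.

```shell
# Sketch: how much of its walltime has the job used? (assumes bash)
to_seconds() {
  # Convert HH:MM:SS to seconds; 10# forces base 10 despite leading zeros.
  local h m s
  IFS=: read -r h m s <<< "$1"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

line='WallTime:   00:25:46 of 12:00:00'
used=$(echo "$line" | awk '{print $2}')
limit=$(echo "$line" | awk '{print $4}')
echo "used $(( 100 * $(to_seconds "$used") / $(to_seconds "$limit") ))% of walltime"
```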
ct$ tracejob -n 3 999999.ctcomp2 # Shows the log entries related to the given job id.
Job: 999999.ctcomp2.innet
02/10/2016 15:22:26 S enqueuing into batch, state 1 hop 1
02/10/2016 15:22:26 S dequeuing from batch, state QUEUED
02/10/2016 15:22:26 S enqueuing into np1, state 1 hop 1
02/10/2016 15:22:26 S Job Run at request of citiuscap@ctcomp2.innet
02/10/2016 15:22:26 S Not sending email: User does not want mail of this type.
02/10/2016 15:22:26 A queue=batch
02/10/2016 15:22:26 A queue=np1
02/10/2016 15:22:26 A user=user_name group=citius
jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146 start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00
02/10/2016 16:08:34 S Exit_status=0 resources_used.cput=00:46:14
resources_used.mem=234868kb resources_used.vmem=1002480kb
resources_used.walltime=00:46:08
02/10/2016 16:08:34 S on_job_exit valid pjob: 999999.ctcomp2.innet (substate=50)
02/10/2016 16:08:34 A user=user_name group=citius jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146 start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00 session=7304 end=1455116914 Exit_status=0 resources_used.cput=00:46:14 resources_used.mem=234868kb resources_used.vmem=1002480kb resources_used.walltime=00:46:08
02/10/2016 17:08:35 S dequeuing from np1, state COMPLETE
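The accounting (''A'') records are ''key=value'' pairs, so individual fields can be picked out with standard tools. The sketch below uses a trimmed copy of the sample record above; on the cluster you would pipe real ''tracejob'' output instead.

```shell
# Sketch: extract fields from a tracejob accounting record (key=value pairs).
record='Exit_status=0 resources_used.cput=00:46:14 resources_used.mem=234868kb resources_used.walltime=00:46:08'

# One pair per line, then select the fields of interest.
for field in Exit_status resources_used.walltime; do
  echo "$record" | tr ' ' '\n' | grep "^$field="
done
```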
==== Node info ====
To get a global view of the cluster state, the ''nodes-usage'' command can be used.
ct$ nodes-usage
+----------------------------------+-------------------+
| USAGE | NODE |
+----------------------------------+-------------------+
| ################################ | node1 (64/64) |
| ################################ | node2 (64/64) |
| | node3 (0/64) |
| ################################ | node4 (64/64) |
| | node5 (0/64) |
| ################################ | node6 (64/64) |
| | node7 (0/64) |
| | inode11 (0/32) |
| | inode12 (0/??) |
| | inode13 (0/32) |
| | inode14 (0/32) |
| | inode15 (0/??) |
| | inode16 (0/32) |
| | inode17 (0/??) |
| | inode18 (0/??) |
| ## | inode19 (2/32) |
| ############################ | inode20 (28/32) |
+----------------------------------+-------------------+
| ############## | TOTAL (286/640) |
+----------------------------------+-------------------+
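The per-node ''(used/total)'' pairs also allow the TOTAL row to be recomputed by hand. The sketch below embeds a few entries in the format shown above (illustrative subset, not the full node list) and sums them with ''awk''.

```shell
# Sketch: recompute cluster occupancy from "name (used/total)" entries.
nodes='node1 (64/64)
node3 (0/64)
inode19 (2/32)
inode20 (28/32)'

# Split each line on "(", "/" and ")" so $2 = used and $3 = total.
echo "$nodes" | awk -F'[(/)]' '{ used += $2; total += $3 }
                               END { printf "%d/%d cores in use\n", used, total }'
```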
To see which users are running jobs on a given node, the ''node-users'' command can be used:
ct$ node-users node1
Tracing node jobs...................................................................
jorge.suarez natalia.fernandez
To get more detailed information on the nodes, the ''pbsnodes'' command can be used:
ct$ pbsnodes #Detailed information on all nodes
node1
state = free
np = 64
properties = amd,bigmem,test,active,activeX
ntype = cluster
status = rectime=1455267717,varattr=,jobs=,state=free,netload=86957182662,gres=,loadave=0.00,ncpus=64,physmem=132250896kb,availmem=162914704kb,totmem=163499276kb,idletime=1876325,nusers=0,nsessions=0,uname=Linux node1 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
node2
state = down,offline
np = 64
properties = amd,bigmem
ntype = cluster
status = rectime=1454919087,varattr=,jobs=,state=free,netload=1185896,gres=,loadave=0.00,ncpus=64,physmem=264633540kb,availmem=295220244kb,totmem=295881920kb,idletime=11140,nusers=0,nsessions=0,uname=Linux node2 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 0
......
ct$ pbsnodes -l # List nodes that are shut down (down) or not available (offline)
node2 down,offline
node3 down,offline
node4 down,offline
node5 down,offline
node6 down,offline
node7 down,offline
inode11 down,offline
inode12 down,offline
inode13 down,offline
inode14 down,offline
inode15 down,offline
inode17 down,offline
inode18 down,offline
inode19 down,offline
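The ''pbsnodes -l'' output can be summarized per state. The sketch below embeds a few sample lines in that format (illustrative, not the full list above) and counts how many nodes report each state.

```shell
# Sketch: count unavailable nodes by state from pbsnodes -l-style output.
pbsnodes_l='node2 down,offline
node3 down,offline
inode19 offline'

# Column 2 is a comma-separated state list; a node may match both states.
echo "$pbsnodes_l" | awk '$2 ~ /down/    { d++ }
                          $2 ~ /offline/ { o++ }
                          END { print d " down, " o " offline" }'
```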
===== Cancel a job from the queue =====
-------------
The ''qdel'' command lets the user delete a job: it first sends the job a TERM signal and then a KILL signal. It takes as its argument the PBS identifier assigned to the job, which can be obtained with the ''qstat'' command.
ct$ qdel job_id
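Combining ''qstat'' and ''qdel'' allows bulk cancellation, for example of all queued (state ''Q'') jobs. The sketch below is a dry run over sample ''qstat''-style lines embedded for illustration: it only prints the ''qdel'' commands it would run. On the cluster you would pipe real ''qstat'' output and remove the ''echo'' to actually cancel the jobs.

```shell
# Sketch: dry run of cancelling every queued job (state Q, column 5).
qstat_sample='999998.ctcomp2 other_job user_name 00:00:00 Q np32
999999.ctcomp2 job_name user_name 38:05:59 R np32'

echo "$qstat_sample" | awk '$5 == "Q" { print $1 }' | while read -r job_id; do
  echo qdel "$job_id"   # dry run: prints the command instead of running it
done
```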