Submit and manage jobs
Submit jobs to the queue
Jobs are submitted with the qsub command, which takes one mandatory argument: the name of a shell script.
ct$ qsub script.sh
The qsub command accepts on the command line the same options that can be written as #PBS comments in the script.
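To illustrate that equivalence, the sketch below reads the #PBS lines of a script and prints the qsub command line that would request the same options. The helper name pbs_to_qsub and the sample script contents are hypothetical, chosen only for illustration; -N and -l are standard qsub options. Nothing is actually submitted.

```shell
#!/bin/sh
# Print the qsub command line equivalent to the #PBS directives of a script.
# pbs_to_qsub is a hypothetical helper: it only prints, it never submits.
pbs_to_qsub() {
    script=$1
    # Keep only the lines starting with "#PBS", drop that prefix and
    # join the remaining options on a single line.
    opts=$(sed -n 's/^#PBS[[:space:]]\{1,\}//p' "$script" | tr '\n' ' ')
    printf 'qsub %s%s\n' "$opts" "$script"
}

# Sample script with illustrative directives (the values are assumptions).
demo_script=$(mktemp)
cat > "$demo_script" <<'EOF'
#!/bin/bash
#PBS -N job_name
#PBS -l nodes=1:ppn=32
#PBS -l walltime=12:00:00
cd "$PBS_O_WORKDIR"
./my_program
EOF

# Prints: qsub -N job_name -l nodes=1:ppn=32 -l walltime=12:00:00 <path of the sample script>
pbs_to_qsub "$demo_script"
rm -f "$demo_script"
```

Either form should request the same resources, so options that rarely change can stay in the script while occasional overrides go on the command line.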
Check job, queue or node state
Queue information
The qstat command shows the queue status:
ct$ qstat -q  # Global queue information
server: ctcomp2
Queue            Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
graphic32        --     --       04:00:00 --     0   0 -- E R
np16             --     --       192:00:0 --     0   0 -- E R
np32             --     --       288:00:0 --     0   0 -- E R
especial         --     --       672:00:0 --     0   0 -- E R
parallel         --     --       192:00:0 --     0   0 -- E R
np2              --     --       192:00:0 --     0   0 -- E R
np8              --     --       192:00:0 --     0   0 -- E R
short            --     --       --       --     0   0 -- E R
graphic1         --     --       04:00:00 --     0   0 -- E R
np1              --     --       672:00:0 --     0   0 -- E R
batch            --     --       --       --     0   0 -- E R
np4              --     --       192:00:0 --     0   0 -- E R
interactive      --     --       01:00:00 --     0   0 -- E R
np64             --     --       384:00:0 --     0   0 -- E R
graphic          --     --       --       --     0   0 -- E R
bigmem           --     --       --       --     0   0 -- E R
graphic8         --     --       04:00:00 --     0   0 -- E R
                                             ----- -----
                                                 0     0
The first letter of the State column indicates whether the queue is (E)nabled or (D)isabled, and the second letter whether it is (R)unning or (S)topped.
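As a small illustration, the two letters can be decoded with a shell case statement. The function name describe_queue_state is hypothetical; the letters and their meanings are the ones described above.

```shell
#!/bin/sh
# Decode the two-letter State column of "qstat -q", e.g. "E R".
# describe_queue_state is a hypothetical helper name.
describe_queue_state() {
    case $1 in
        E) accept="enabled" ;;
        D) accept="disabled" ;;
        *) accept="unknown" ;;
    esac
    case $2 in
        R) run="running" ;;
        S) run="stopped" ;;
        *) run="unknown" ;;
    esac
    printf '%s, %s\n' "$accept" "$run"
}

describe_queue_state E R   # prints "enabled, running"
```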
Job information
Each submitted job is assigned a JOB_ID that serves as its unique identifier. If the job was submitted with the -t option (a job array), each element is identified as job_id[index].
ct$ qstat  # General information about the user's jobs
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
999999.ctcomp2            job_name         user_name       38:05:59 R np32
The Time Use column shows the CPU time used. The S column is the job state, which can be one of the following:
- C - Job is completed after having run.
- E - Job is exiting after having run.
- H - Job is held.
- Q - Job is queued, eligible to run or routed.
- R - Job is running.
- T - Job is being moved to a new location.
- W - Job is waiting for its execution time (-a option) to be reached.
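The state letters above can be tallied from the default qstat listing. The following is a minimal sketch, assuming the six-column layout shown in the example (two header lines, state letter in the fifth column); count_job_states is a hypothetical helper name, and the sample rows other than the first are made up for the demonstration.

```shell
#!/bin/sh
# Count jobs per state from the default "qstat" listing read on stdin.
# Assumes two header lines and the state letter in column 5.
count_job_states() {
    awk 'NR > 2 && NF >= 6 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Demonstration on canned text instead of a live qstat call.
count_job_states <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
999999.ctcomp2            job_name         user_name       38:05:59 R np32
999998.ctcomp2            job_name         user_name       0        Q np32
999997.ctcomp2            job_name         user_name       01:10:00 R np8
EOF
```

On the cluster the same function could be fed directly, for example with `qstat | count_job_states`.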
ct$ qstat -f 999999.ctcomp2  # Specific job info
Job Id: 999999.ctcomp2.innet
    Job_Name = job_name
    Job_Owner = user_name@ctcomp2.innet
    job_state = Q
    queue = np32
    server = ctcomp2.innet
    Checkpoint = u
    ctime = Fri Feb 12 10:09:34 2016
    Error_Path = ctcomp2.innet:/home/local/user_name/job_name.e999999
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = ae
    Mail_Users = user_name@usc.es
    mtime = Fri Feb 12 10:09:34 2016
    Output_Path = ctcomp2.innet:/home/local/user_name/job_name.o999999
    Priority = 0
    qtime = Fri Feb 12 10:09:34 2016
    Rerunable = True
    Resource_List.neednodes = 1:ppn=32:intel:xeonl
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=32:intel:xeonl
    Resource_List.vmem = 63gb
    Resource_List.walltime = 12:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/local/user_name,
        PBS_O_LOGNAME=user_name,
        PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games,
        PBS_O_MAIL=/var/mail/user_name,PBS_O_SHELL=/bin/bash,
        PBS_O_LANG=es_ES.UTF-8,PBS_O_WORKDIR=/home/local/user_name,
        PBS_O_HOST=ctcomp2.innet,PBS_O_SERVER=ctcomp2
    euser = user_name
    egroup = citius
    queue_rank = 2110
    queue_type = E
    etime = Fri Feb 12 10:09:34 2016
    submit_args = script.sh
    fault_tolerant = False
    job_radix = 0
    submit_host = ctcomp2.innet
A useful attribute of finished jobs is EXIT_STATUS, which is shown once JOB_STATE is C.
Internal code | EXIT_STATUS value | Meaning |
---|---|---|
JOB_EXEC_OVERLIMIT | -10 | |
JOB_EXEC_STDOUTFAIL | -9 | |
JOB_EXEC_CMDFAIL | -8 | Exec() of user command failed |
JOB_EXEC_BADRESRT | -7 | Job restart failed |
JOB_EXEC_INITRMG | -6 | Job aborted on MOM init, chkpt, ok migrate |
JOB_EXEC_INITRST | -5 | Job aborted on MOM init, chkpt, no migrate |
JOB_EXEC_INITABT | -4 | Job aborted on MOM initialization |
JOB_EXEC_RETRY | -3 | Job execution failed, do retry |
JOB_EXEC_FAIL2 | -2 | Job execution failed, after files, no retry |
JOB_EXEC_FAIL1 | -1 | Job execution failed, before files, no retry |
JOB_EXEC_OK | 0 | Job execution successful |
 | 1-256 | Exit status of the top-level shell |
 | >256 | Job ended by a UNIX signal; subtracting 256 gives the signal number. |
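The ranges in the table can be turned into a quick check. The sketch below follows the table as written (negative values are internal codes, 0 is success, 1-256 is the shell's exit status, larger values encode a signal); explain_exit_status is a hypothetical helper name.

```shell
#!/bin/sh
# Interpret an EXIT_STATUS value according to the table above.
# explain_exit_status is a hypothetical helper name.
explain_exit_status() {
    status=$1
    if [ "$status" -lt 0 ]; then
        echo "batch system could not run the job (internal code $status)"
    elif [ "$status" -eq 0 ]; then
        echo "job execution successful"
    elif [ "$status" -le 256 ]; then
        echo "top-level shell exited with status $status"
    else
        # Values above 256 encode the signal number plus 256.
        echo "job ended by signal $((status - 256))"
    fi
}

explain_exit_status 0     # job execution successful
explain_exit_status 265   # job ended by signal 9
explain_exit_status -8    # batch system could not run the job (internal code -8)
```

For instance, a value of 271 corresponds to signal 15 (TERM), which is what a job cancelled before finishing would typically report.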
ct$ checkjob 999999.ctcomp2  # Info about a specific job
checking job 999999
State: Running
Creds:  user:user_name  group:citius  class:np32  qos:DEFAULT
WallTime: 00:25:46 of 12:00:00
SubmitTime: Tue Feb 16 10:40:31
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)
StartTime: Tue Feb 16 10:40:32
Total Tasks: 32
Req[0]  TaskCount: 32  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [active][intel][xeonl]
Allocated Nodes:
[inode15:32]
IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE
Reservation '137092' (-00:25:32 -> 11:34:28  Duration: 12:00:00)
PE:  32.00  StartPriority:  21
ct$ tracejob -n 3 999999.ctcomp2  # Show the log entries related to the given job ID
Job: 136553.ctcomp2.innet
02/10/2016 15:22:26  S    enqueuing into batch, state 1 hop 1
02/10/2016 15:22:26  S    dequeuing from batch, state QUEUED
02/10/2016 15:22:26  S    enqueuing into np1, state 1 hop 1
02/10/2016 15:22:26  S    Job Run at request of citiuscap@ctcomp2.innet
02/10/2016 15:22:26  S    Not sending email: User does not want mail of this type.
02/10/2016 15:22:26  A    queue=batch
02/10/2016 15:22:26  A    queue=np1
02/10/2016 15:22:26  A    user=user_name group=citius jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146
                          start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1
                          Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00
02/10/2016 16:08:34  S    Exit_status=0 resources_used.cput=00:46:14 resources_used.mem=234868kb resources_used.vmem=1002480kb
                          resources_used.walltime=00:46:08
02/10/2016 16:08:34  S    on_job_exit valid pjob: 999999.ctcomp2.innet (substate=50)
02/10/2016 16:08:34  A    user=user_name group=citius jobname=job_name queue=np1 ctime=1455114146 qtime=1455114146 etime=1455114146
                          start=1455114146 owner=user_name@ctcomp2.innet exec_host=inode19/24 Resource_List.neednodes=1:ppn=1
                          Resource_List.nodect=1 Resource_List.nodes=1:ppn=1 Resource_List.vmem=2040mb Resource_List.walltime=12:00:00
                          session=7304 end=1455116914 Exit_status=0 resources_used.cput=00:46:14 resources_used.mem=234868kb
                          resources_used.vmem=1002480kb resources_used.walltime=00:46:08
02/10/2016 17:08:35  S    dequeuing from np1, state COMPLETE
Node info
To get a global view of the cluster state, use the nodes-usage command:
$ nodes-usage
+----------------------------------+-------------------+
| USAGE                            | NODE              |
+----------------------------------+-------------------+
| ################################ | node1 (64/64)     |
| ################################ | node2 (64/64)     |
|                                  | node3 (0/64)      |
| ################################ | node4 (64/64)     |
|                                  | node5 (0/64)      |
| ################################ | node6 (64/64)     |
|                                  | node7 (0/64)      |
|                                  | inode11 (0/32)    |
|                                  | inode12 (0/??)    |
|                                  | inode13 (0/32)    |
|                                  | inode14 (0/32)    |
|                                  | inode15 (0/??)    |
|                                  | inode16 (0/32)    |
|                                  | inode17 (0/??)    |
|                                  | inode18 (0/??)    |
| ##                               | inode19 (2/32)    |
| ############################     | inode20 (28/32)   |
+----------------------------------+-------------------+
| ##############                   | TOTAL (286/640)   |
+----------------------------------+-------------------+
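When choosing a queue it can help to list only the nodes that still have free cores. The sketch below parses text in the nodes-usage layout shown above, where each row ends in a cell such as "node3 (0/64)"; free_cores is a hypothetical helper name, and the parsing assumes that layout stays as in the example. Nodes reported as "(0/??)" and the TOTAL row are skipped.

```shell
#!/bin/sh
# List nodes with free cores from nodes-usage style text read on stdin.
# Output: "<node> <free cores>", one node per line.
free_cores() {
    awk -F'|' 'NF >= 4 && match($3, /\([0-9]+\/[0-9]+\)/) {
        split($3, word, " ")                                # word[1] = node name
        split(substr($3, RSTART + 1, RLENGTH - 2), c, "/")  # c[1] = used, c[2] = total
        free = c[2] - c[1]
        if (word[1] != "TOTAL" && free > 0) print word[1], free
    }'
}

# Demonstration on a few rows copied from the example output.
free_cores <<'EOF'
+----------------------------------+-------------------+
| USAGE                            | NODE              |
+----------------------------------+-------------------+
| ################################ | node1 (64/64)     |
|                                  | node3 (0/64)      |
|                                  | inode12 (0/??)    |
| ##                               | inode19 (2/32)    |
+----------------------------------+-------------------+
| ##############                   | TOTAL (286/640)   |
+----------------------------------+-------------------+
EOF
```

On the cluster this could be used as `nodes-usage | free_cores`.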
To list the users running jobs on a given node, use the node-users <node> command:
$ node-users node1
Tracing node jobs...................................................................
jorge.suarez
natalia.fernandez
To get more detailed information on the nodes, use the pbsnodes command:
ct$ pbsnodes  # Detailed information on all nodes
node1
     state = free
     np = 64
     properties = amd,bigmem,test,active,activeX
     ntype = cluster
     status = rectime=1455267717,varattr=,jobs=,state=free,netload=86957182662,gres=,loadave=0.00,ncpus=64,physmem=132250896kb,availmem=162914704kb,totmem=163499276kb,idletime=1876325,nusers=0,nsessions=0,uname=Linux node1 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
node2
     state = down,offline
     np = 64
     properties = amd,bigmem
     ntype = cluster
     status = rectime=1454919087,varattr=,jobs=,state=free,netload=1185896,gres=,loadave=0.00,ncpus=64,physmem=264633540kb,availmem=295220244kb,totmem=295881920kb,idletime=11140,nusers=0,nsessions=0,uname=Linux node2 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
......
ct$ pbsnodes -l  # List nodes that are shut down (down) or unavailable (offline)
node2                down,offline
node3                down,offline
node4                down,offline
node5                down,offline
node6                down,offline
node7                down,offline
inode11              down,offline
inode12              down,offline
inode13              down,offline
inode14              down,offline
inode15              down,offline
inode17              down,offline
inode18              down,offline
inode19              down,offline
Cancel a job from the queue
The qdel command removes a job from the queue. It works by sending the job first a TERM signal and then a KILL signal. It takes as its argument the PBS identifier assigned to the job, which can be found with the qstat command.
ct$ qdel job_id
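Since qdel needs the identifiers reported by qstat, the two commands can be combined. The sketch below only extracts the IDs of one user's queued (state Q) jobs from text in the default qstat layout shown earlier; queued_job_ids is a hypothetical helper name, and the sample rows other than the first are made up for the demonstration. It deletes nothing by itself.

```shell
#!/bin/sh
# Print the IDs of a user's queued jobs from the default "qstat" listing
# read on stdin. Assumes columns: Job id, Name, User, Time Use, S, Queue.
queued_job_ids() {
    awk -v user="$1" 'NR > 2 && $3 == user && $5 == "Q" { print $1 }'
}

# Demonstration on canned text instead of a live qstat call.
queued_job_ids user_name <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
999999.ctcomp2            job_name         user_name       38:05:59 R np32
999998.ctcomp2            job_name         user_name       0        Q np32
999997.ctcomp2            job_name         other_user      0        Q np8
EOF
```

After reviewing the printed list, the IDs could be handed to qdel, for example with `qstat | queued_job_ids "$USER" | xargs qdel`.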