Job Submission Tutorial


MPI users: refer to the section on invoking MPI executables within batch jobs.

Our batch scheduler is a combination of two packages that interoperate: PBS/Torque and MOAB. PBS (or "Portable Batch System") is known as the "resource manager." PBS was originally developed for NASA in the mid-1990s and emerged as the principal resource manager in the high-performance computing world. There are now several versions of PBS out there (just like there are versions of Linux). Ours is an open-source version known as "Torque."

MOAB is known as the "workload manager." A true understanding of the distinction between MOAB and PBS is not needed to use the products effectively, however. It suffices to say that PBS keeps track of the minutiae of the jobs that you submit (e.g., job name, working directory, whether you get email when the job completes), while MOAB figures out how best to arrange those jobs on the compute nodes themselves. Consequently, when submitting your job, modifying your job, or constructing your job script, you will be interacting with PBS. Once your job is in the queue, you interact with MOAB to watch it as it progresses.


Submitting jobs with PBS

PBS comes with thorough man pages, so for complete documentation of PBS commands you are encouraged to type "man pbs" and go from there.

Jobs are submitted using the "qsub" command. Type "man qsub" for information on the plethora of options that it offers.

Let's say I have an executable called "myprog". Let me try to submit it to PBS:

[richardc@athena0 ill]$ qsub myprog
qsub:  file must be an ascii script

Oops... that didn't work, because qsub expects a shell script. Any shell should work, so use your favorite one. So I write a simple script called "myscript.csh",

 #!/bin/csh
 cd $PBS_O_WORKDIR
 ./myprog argument1 argument2

and then I submit it:

[richardc@athena0 ill]$ qsub myscript.csh
34736.athena0.npl.washington.edu

That worked! Note the use of the $PBS_O_WORKDIR environment variable. This is important, since by default PBS on our cluster will start executing the commands in your shell script from your home directory. To go to the directory in which you executed "qsub", cd to $PBS_O_WORKDIR. There are several other useful PBS environment variables that we will encounter later.

Specifying PBS job parameters

By default, any script you submit will run on a single processor for a maximum of 4 hours. The name of the job will be the name of the script, and it will not email you when it starts, finishes, or is interrupted. Stdout and stderr are collected into separate files named after the job number.

You can change the default behavior of PBS by passing it parameters. These parameters can be specified on the command line or inside the shell script itself. For example, let's say I want to send stdout and stderr to files that are different from the defaults:

 qsub -e myprog.err -o myprog.out myscript.csh

Alternatively, I can actually edit myscript.csh to include these parameters. I can specify any PBS command line parameter I want in a line that begins with "#PBS":

 #!/bin/csh
 #PBS -e myprog.err
 #PBS -o myprog.out
 cd $PBS_O_WORKDIR
 ./myprog argument1 argument2

Now I just submit my modified script with no command-line arguments:

 qsub myscript.csh

Useful PBS parameters

Here is an example of a more involved PBS script that requests only 1 hour of execution time, renames the job, and sends email when the job begins, ends, or aborts:

 #!/bin/csh
 
 # Name of my job:
 #PBS -N My-Program
 
 # Run for 1 hour:
 #PBS -l walltime=1:00:00
 
 # Where to write stderr:
 #PBS -e myprog.err
 
 # Where to write stdout: 
 #PBS -o myprog.out
 
 # Send me email when my job aborts, begins, or ends
 #PBS -m abe
 
 # This command switches to the directory from which the "qsub" command was run:
 cd $PBS_O_WORKDIR
 
 #  Now run my program
 ./myprog argument1 argument2
 
 echo Done!

Some more useful PBS parameters:

  • -M: Specify your email address.
  • -j oe: Merge standard output and standard error into the standard output file.
  • -V: Export all your environment variables to the batch job.
  • -I: Run an interactive job (see below).

Once again, you are encouraged to consult the qsub manpage for more options.
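To tie several of these together, a script header using -M, -m, -j oe, and -V might look like the following sketch (the email address is just a placeholder):

 #!/bin/csh
 # Name the job and send mail (to a placeholder address) on abort/begin/end
 #PBS -N My-Program
 #PBS -M someuser@example.edu
 #PBS -m abe
 # Merge stderr into stdout and write the combined stream to one file
 #PBS -j oe
 #PBS -o myprog.log
 # Export my current environment variables to the job
 #PBS -V
 
 cd $PBS_O_WORKDIR
 ./myprog argument1 argument2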

Debugging with PBS: Interactive jobs

Any job submitted with the -I flag will execute interactively. What this means is that instead of executing a script that you provide, you will be given a command line on the compute node where your script would have executed. This allows you to run your program multiple times and watch the output as it is produced. Also note that there is a special queue set up for debugging; using this queue is described in the Debug queue subsection of the section on job routing and priority management.
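For example, to request a one-hour interactive session on a single core (the walltime and resource values here are just illustrative), you could run:

 qsub -I -l walltime=1:00:00,nodes=1:ppn=1

When the scheduler finds a free slot, you are dropped into a shell on the assigned compute node; type "exit" to end the session and release the node.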

Special concerns for running OpenMP programs

By default, PBS assigns you 1 core on 1 node. You can, however, run your job on up to 8 cores per node. Therefore, if you want to run an OpenMP program, you must specify the number of processors per node. This is done with the flag -l nodes=1:ppn=<cores> where "<cores>" is the number of OpenMP threads you wish to use.

Keep in mind that you still must set the OMP_NUM_THREADS environment variable within your PBS script, e.g.:

 #!/bin/csh
 #PBS -N My-OpenMP-Script
 #PBS -l nodes=1:ppn=8
 #PBS -l walltime=1:00:00
 
 cd $PBS_O_WORKDIR
 setenv OMP_NUM_THREADS 8
 ./MyOpenMPProgram

Using reservations

Up until now, we have been happy to run on any node that the cluster assigned us. However, you may be part of a group that owns specific nodes on the cluster. These nodes will run any user's job if nobody from the group has asked for them specifically. If you are part of a group that owns specific nodes, you will have access to these nodes by using that group's standing reservation. This process was updated on 4/30/2008 and is now documented in a new section on job routing and priority management.

The current groups with standing reservations are as follows:

  • Astro
  • CENPA
  • INT
  • Physics
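For example, a member of the CENPA group can request the group's standing reservation by passing the corresponding qos on the qsub command line, just as in the walk-through example later on this page (see the section on job routing and priority management for the details for your own group):

 qsub -l walltime=3:10:00,qos=cenpa myscript.csh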

Using special queues

In addition to the default queue (where all of the jobs in the examples on this page are routed), there are two other queues: debug and scavenge. These queues are also documented in the new section on job routing and priority management.

Using the PBS_NODEFILE for multi-threaded jobs

Until now, we have only dealt with serial jobs. In a serial job, your PBS script will automatically be executed on the target node assigned by the scheduler. If you asked for more than one node, however, your script will only execute on the first node of the set of nodes allocated to you. To access the remainder of the nodes, you must either use MPI or manually launch threads. But which nodes to run on? PBS gives you a list of nodes in a file at the location pointed to by the PBS_NODEFILE environment variable.

In your shell script, you can therefore ascertain the nodes on which your job can run by looking at the file in the location specified by this variable:

 #!/bin/csh
 #PBS -l nodes=2:ppn=8
 
 echo The nodefile for this job is stored at $PBS_NODEFILE
 cat $PBS_NODEFILE

When you run this job, you should then get output similar to:

 The nodefile for this job is stored at /opt/torque/aux//33168.athena0.npl.washington.edu
 compute-3-1.local
 compute-3-1.local
 compute-3-1.local
 compute-3-1.local
 compute-3-1.local
 compute-3-1.local
 compute-3-1.local
 compute-3-1.local
 compute-3-2.local
 compute-3-2.local
 compute-3-2.local
 compute-3-2.local
 compute-3-2.local
 compute-3-2.local
 compute-3-2.local
 compute-3-2.local

If you have an application that manually forks processes onto the nodes of your job, you are responsible for parsing the PBS_NODEFILE to determine which nodes those are. If we ever catch you running on nodes that are not yours, we will provide your name and contact info to the other Athena users whose jobs you have interfered with and let vigilante justice take its course.
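As a minimal sketch of manual process launching (the worker program name is hypothetical, and this assumes you can ssh between your allocated nodes), a script might start one copy per unique node like this:

 #!/bin/csh
 #PBS -l nodes=2:ppn=8
 cd $PBS_O_WORKDIR
 # Start one (hypothetical) worker on each unique node listed in the nodefile
 foreach node (`sort -u $PBS_NODEFILE`)
     ssh $node "cd $PBS_O_WORKDIR; ./my_worker" &
 end
 # Wait for all of the background ssh processes to finish
 wait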

MPI jobs also require you to feed the PBS_NODEFILE to 'mpirun'. For more information on running MPI parallel jobs, please refer to the section on invoking MPI executables within batch jobs.
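The exact flags depend on which MPI implementation you are using, so treat the following as a sketch ("my_mpi_program" is a placeholder) and see the MPI section for the invocation that applies on Athena:

 # Count the processors PBS gave us and hand the nodefile to mpirun
 set num_procs = `wc -l < $PBS_NODEFILE`
 mpirun -np $num_procs -machinefile $PBS_NODEFILE ./my_mpi_program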


Watching your jobs with MOAB

Examining the queue

As we learned above, MOAB is the software application that actually decides what resources your job will run on. You can look at the queue by either using the PBS qstat command, or by using the MOAB showq command. qstat will display the queue ordered by JobID, whereas showq will display jobs grouped by their state ("running," "idle," or "hold") then ordered by priority. Therefore, showq is often more useful.
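If you do want the plain PBS view, a couple of common qstat invocations are shown below; after that, here is some typical showq output.

 qstat -a            # one-line summary of every job in the queue
 qstat -u richardc   # only the jobs belonging to a particular user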

[richardc@athena0 ill]$ showq

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

34695              gardnerj    Running   128    00:08:53  Thu Apr 10 18:22:21
34681              wdetmold    Running   128     2:51:14  Thu Apr 10 13:34:42
34730              wdetmold    Running   128     4:46:07  Thu Apr 10 15:29:35
34731              wdetmold    Running   128     4:47:08  Thu Apr 10 15:30:36
34732              wdetmold    Running   128     5:20:58  Thu Apr 10 16:04:26
34729              wdetmold    Running   128     7:45:08  Thu Apr 10 18:28:36
31503                cbrook    Running   112  3:05:46:16  Sun Apr  6 16:29:44
32610                cbrook    Running   112  5:07:29:28  Tue Apr  8 18:12:56
34720                becker    Running     1 12:08:19:33  Thu Apr 10 15:03:01
34733                becker    Running     1 12:09:16:37  Thu Apr 10 16:00:05
34734                becker    Running     1 12:09:18:18  Thu Apr 10 16:01:46

11 active jobs         995 of 1017 processors in use by local jobs (97.84%)
                        127 of 128 nodes active      (99.22%) 

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME 

34698              gardnerj       Idle   128    00:30:00  Thu Apr 10 14:07:13
34737              wdetmold       Idle   128     8:00:00  Thu Apr 10 18:28:41

2 eligible jobs   

 blocked jobs-----------------------
 JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


 0 blocked jobs    

Total jobs:  13

showq also has some useful options. To see a detailed list of all the "idle" jobs, try showq -i:

[richardc@athena0 ill]$ showq -i

JOBID        PRIORITY  XFACTOR  Q  USERNAME    GROUP  PROCS     WCLIMIT     CLASS      SYSTEMQUEUETIME

68080           11314      1.4 as    cbrook    astro    128  2:18:00:00   default  Thu Aug 14 12:43:56
68671           11154     24.7 ph  wdetmold  physics     32     1:00:00   default  Thu Aug 14 15:58:00
68673           11154     24.7 ph  wdetmold  physics     32     1:00:00   default  Thu Aug 14 15:58:00
68674           11154     24.7 ph  wdetmold  physics     32     1:00:00   default  Thu Aug 14 15:58:00
68675           11154     24.7 ph  wdetmold  physics     32     1:00:00   default  Thu Aug 14 15:58:00
68676           11154     24.7 ph  wdetmold  physics     32     1:00:00   default  Thu Aug 14 15:58:00
71087              28      1.1 no       trq    astro     32    12:00:00   default  Fri Aug 15 15:05:58

"XFACTOR" is not used for our scheduling strategy, so ignore it. "Q" is the first 2 letters of the "qos=" argument that was requested ("no" means none). "CLASS" is the queue to which the job was submitted (yeah, I know this would have made more sense to be "Q"...).

To see a detailed list of all the "running" jobs, use showq -r:

[richardc@athena0 ill]$ showq -r

JOBID      S  PAR  EFFIC  XFACTOR  Q  USERNAME    GROUP        MHOST PROCS   REMAINING            STARTTIME

70454      R  bat 2576.79      6.8 sc  wdetmold  physics compute-3-25    32    00:44:45  Fri Aug 15 15:30:39
70455      R  bat 2600.16      6.8 sc  wdetmold  physics compute-5-30    32    00:45:29  Fri Aug 15 15:31:23
70457      R  deb 2469.31      6.8 sc  wdetmold  physics compute-5-28    32    00:46:27  Fri Aug 15 15:32:21
70456      R  bat 2392.50      6.8 sc  wdetmold  physics  compute-4-3    32    00:46:59  Fri Aug 15 15:32:53
70458      R  bat 2782.63      6.8 sc  wdetmold  physics  compute-6-1    32    00:47:11  Fri Aug 15 15:33:05
70460      R  deb 2645.04      6.8 sc  wdetmold  physics compute-5-24    32    00:47:33  Fri Aug 15 15:33:27

Once again, ignore "XFACTOR". "PAR" is the partition on which the job is running. "Q" is the qos. Documentation on showq is available at the MOAB website.

Asking MOAB when your job will probably start and finish

If you want to see an estimate of when your job will start, use the showstart command:

[richardc@athena0 ill]$ showstart 71087

job 71087 requires 32 procs for 12:00:00

Estimated Rsv based start in                00:00:00 on Fri Aug 15 15:50:50
Estimated Rsv based completion in           12:00:00 on Sat Aug 16 03:50:50 

Best Partition: batch_nodes

See the Documentation on the MOAB website for more options.

Other useful commands

A full list of MOAB commands is available on the MOAB documentation website. Here are a few useful suggestions.

To see the details of a specific job, use checkjob on it:

 checkjob 34695

Shows the reservations on the system. Note that we have infinite reservations for the four primary groups on this cluster:

 showres

Shows what nodes are busy and with what jobs:

 showstate

Tells you more information about your job.

 checkjob -v <jobid>

Shows the current priorities in the queue

 mdiag -p

Sorts load on nodes by cabinet and then by rack position

 ganglia load_one | sort -n -t- -k2 -k3

Walk-through example

Let's walk through an example of submitting a PBS script.

Since we're good stewards of resources, this next command says that we want 3 hours and 10 minutes on the reservation for CENPA. You can set this on the command line:

[richardc@athena0 ill]$ qsub -l walltime=3:10:00,qos=cenpa myscript.sh
33154.athena0.npl.washington.edu

The command returns the job id (in this example, 33154). Did it actually get submitted? We can check...

[richardc@athena0 ill]$ showq

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

33154              richardc    Running     1     3:04:24  Sat Mar  8 17:40:20

1 active job             1 of 1017 processors in use by local jobs (0.10%)
                          1 of 128 nodes active      (0.78%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 eligible jobs   

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME


0 blocked jobs    

Total jobs:  1


Now you want to know things like where your job is running and what not:

[richardc@athena0 ill]$ checkjob 33154
job 33154 (RM job '33154.athena0.npl.washington.edu')

AName: STDIN
State: Running 
Creds:  user:richardc  group:cenpa  class:default  qos:cenpa
WallTime:   00:02:01 of 3:10:00
SubmitTime: Sat Mar  8 17:40:20
  (Time Queued  Total: 00:00:00  Eligible: 00:00:00)

StartTime: Sat Mar  8 17:40:20
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: base  
Memory >= 0  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
Dedicated Resources Per Task: PROCS: 1
Utilized Resources Per Task:  PROCS: 0.97  MEM: 3M  SWAP: 114M
Avg Util Resources Per Task:  PROCS: 0.97
Max Util Resources Per Task:  PROCS: 1.00  MEM: 3M  SWAP: 114M
Average Utilized Memory: 0.77 MB
Average Utilized Procs: 0.26
NodeAccess: ---
NodeCount:  1

Allocated Nodes:
[compute-6-32.local:1]

SystemID:   athena0
SystemJID:  athena0.15
Task Distribution: compute-6-32.local

IWD:            $HOME/ill
UMask:          0022 
Executable:     /opt/moab/spool/moab.job.DyOrwm

OutputFile:     - (athena0.npl.washington.edu:/share/home/richardc/ill/STDIN.o13)
ErrorFile:      - (athena0.npl.washington.edu:/share/home/richardc/ill/STDIN.e13)
StartCount:     1
User Specified Partition Mask:   [base][athena0]
System Available Partition Mask: [base][athena0]
Partition Mask: [athena0]
SrcRM:          internal  DstRM: athena0  DstRMJID: 13.athena0.npl.washington.edu
Flags:          ADVRES:cenpa,RESTARTABLE,GLOBALQUEUE
Attr:           checkpoint
StartPriority:  1
PE:             1.00
Reservation '33154' (-00:01:05 -> 3:08:55  Duration: 3:10:00)

NEW: PBS Job Dependencies (advanced)

There may be times when you wish to have PBS jobs depend on one another. For example, you might wish to submit multiple scripts, but want to ensure that they are executed in a specific order.

The wrong way (recursive scripts)

It is possible to submit a PBS script from within another PBS script. This can be very dangerous and lead to race conditions if done incorrectly. You should never ever have a script submit itself! For example, consider the following script that submits itself:

my_runaway_script.pbs:

#!/bin/csh
# 
#PBS -N runaway_job_script
#PBS -l walltime=1:00
#
cd $PBS_O_WORKDIR
./my_program_that_crashes_in_1_millisecond
qsub my_runaway_script.pbs

If there is a node that is completely free, it will start running your script. But after 1 millisecond, the script terminates and resubmits itself! Then it starts running again, then after another millisecond it terminates and resubmits. Etc etc etc. There have already been circumstances like this where Athena has processed over 100 jobs per second. This overloads the resource manager and messes life up for everyone. Therefore, please do not submit recursive job scripts! If we ever catch somebody doing this, we will delete their account.

The right way (job chaining)

It is totally fine to have jobs that execute one after another, just so long as a race condition is impossible. There are two ways to accomplish job chaining in PBS.

Method 1: Script 1 submits Script 2, etc

The first way to chain PBS jobs is the more obvious method, where one script directly submits the next. Here is an example that does not lead to a race condition:

my_OK_script1.pbs:

#!/bin/csh
#PBS -N OK_script1
#PBS -l walltime=1:00
cd $PBS_O_WORKDIR
./my_program_that_crashes_in_1_millisecond
ssh athena0 "(cd $PBS_O_WORKDIR; qsub my_OK_script2.pbs)"

my_OK_script2.pbs:

#!/bin/csh
#PBS -N OK_script2
#PBS -l walltime=1:00
echo Hi!

In the above example, my_OK_script2.pbs does not submit anything. Therefore, my_OK_script1.pbs will only execute once (in 1 millisecond) but otherwise have no ill effects. You can chain an arbitrarily large number of jobs together, so long as there is one script at the end that does not submit anything.

Note that the compute nodes do not have the PBS command set installed. Therefore, to submit a job from a script that executes on a compute node, you must ssh back to the head node to actually run "qsub".

Method 2: Using PBS job dependencies

A much better way to chain jobs together is to use PBS's built-in dependency capability, via qsub's -W depend= option. The ARSC newsletter has an excellent article explaining these dependencies in detail.
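A minimal sketch of the idea (the script names here are placeholders): submit the first job, capture its job ID, and make the second job depend on it.

 # Submit the first job and remember the job ID that qsub prints
 set first_job = `qsub step1.pbs`
 # The second job is held until the first finishes successfully (afterok)
 qsub -W depend=afterok:$first_job step2.pbs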

EXAMPLE: Cleaning up your files on local node storage, even if your job is terminated

One problem people have had when using local storage is that their data can be left on the local disk if their job is aborted early. Consider the following script:

#!/bin/csh
#PBS -N My_job_that_cleans_up_after_itself
if (! -d /state/partition1/my_directory) then
    mkdir /state/partition1/my_directory
endif
cd /state/partition1/my_directory
cp $HOME/data/input.dat .
$HOME/myprogram < input.dat > output.dat
cp output.dat $HOME/data/output.dat
cd $HOME
rm -r /state/partition1/my_directory

This is a nice, considerate script that cleans up after itself. However, what happens if "myprogram" hangs and is terminated by the scheduler? In that case, the rest of the script is not executed and the files remain on /state/partition1. One way to safeguard against this is to chain a second job to your first one that only runs if the first one exits incorrectly. Here are two example scripts that I have tested and run on Athena. You are welcome to cut and paste them from this page and modify them for your own usage. You may also download them directly from www.phys.washington.edu/~gardnerj/Athena/examples.

You submit the first script, "test_cleanup.pbs", by hand. This is the PBS job that does the actual work. It submits, as a second job, a script "cleanup_files.csh" that will ssh to the correct compute node and remove the correct directory. The submission uses the PBS job-dependency capability to specify that the cleanup job will only be run if the original script fails for some reason. As is, the scripts are configured in "test mode"; you will need to read through them and change them where appropriate. For testing, you should have a file named "input.dat" in your $PBS_O_WORKDIR.

test_cleanup.pbs:

#!/bin/csh
#
# Example PBS script that automatically cleans up its files on the compute node's local
# directory, even if the job is killed early.
#
#PBS -N test_cleanup
#PBS -j oe
#PBS -l walltime=5:00,nodes=1:ppn=1
#PBS -m abe

# Acquire useful job variables
set pbs_job_id = `echo $PBS_JOBID | awk -F . '{print $1}'`
set num_procs = `wc -l < $PBS_NODEFILE`
set master_node_id = `hostname`

# Print useful diagnostic information to stdout
echo This is job `echo $pbs_job_id`
echo The master node of this job is `echo $master_node_id`
echo The directory from which this job was submitted is `echo $PBS_O_WORKDIR`
echo This job is running on `echo $num_procs` processors

# Create temporary subdirectory for this job
cd /state/partition1
if (! -d gardnerj) then
   mkdir gardnerj
endif
cd gardnerj
mkdir tmp.$pbs_job_id
# Set local_scratch_dir to the temporary directory for this job
set local_scratch_dir = /state/partition1/gardnerj/tmp.$pbs_job_id

# Now submit the "cleanup job."  This will only be run if the current job exits abnormally
# (e.g. it was terminated).  Note that we must ssh back to the head node because the compute
# nodes do not have access to the PBS command set.  Also note that there is no way to
# provide command line arguments to a script submitted to PBS, but we can use a workaround
# by defining environment variables.
set pbs_output = `ssh athena0 "(cd $PBS_O_WORKDIR; qsub -N clean.$pbs_job_id -W depend=afternotok:$pbs_job_id \
 cleanup_files.csh -v CLEAN_MASTER_NODE_ID=$master_node_id,CLEAN_LOCAL_SCRATCH_DIR=$local_scratch_dir)"`
set cleanup_job_id = `echo $pbs_output | awk '{print $2}'`
echo I just submitted cleanup job ID $cleanup_job_id

# Copy input file(s) from the directory from which this job was submitted
# to local_scratch_dir
cp $PBS_O_WORKDIR/input.dat $local_scratch_dir

#
# "Run" my job (Replace the "sleep" command with your program)
#
cd $local_scratch_dir
sleep 60

# If we made it this far, we can just execute the cleanup script directly.
# Note that we could have submitted the above "cleanup job" so that it *always*
# ran after the current one, in which case we would not have needed to execute
# "cleanup_files.csh" below.  However, it is better to use our current
# time allotment on this node to do post-processing.  We only want to 
# rely on the cleanup job when absolutely necessary.
echo Executing cleanup script on this node:
cd $PBS_O_WORKDIR
cleanup_files.csh $master_node_id $local_scratch_dir

echo Job $pbs_job_id completed normally on node $master_node_id

# When this script exits normally here, the cleanup job will automatically
# be deleted by the resource manager.

The cleanup_files.csh script is currently configured to simply do an "ls -al" of the relevant directory rather than delete it. To execute "rm -r", simply comment and uncomment where indicated.

cleanup_files.csh:

#!/bin/csh
#
# Usage: cleanup_files.csh compute_node_id local_scratch_dir
#   -- Runs "rm -r" on directory "local_scratch_dir" on Athena compute node "compute_node_id"
#   -- If no command line arguments are given, arguments are taken from the following environment variables:
#         CLEAN_MASTER_NODE_ID    (sets compute_node_id)
#         CLEAN_LOCAL_SCRATCH_DIR (sets local_scratch_dir)
#
#   Example: "cleanup_files.csh compute-5-1 /state/partition1/username/tmp.12345"
#
#
# The following lines are useful when this script is submitted to PBS.  Otherwise,
# they are ignored.
#PBS -j oe
#PBS -l walltime=1:00,nodes=1:ppn=1
#PBS -m abe

# Get arguments from either the command-line or environment variables
if ($#argv == 2) then
  set master_node_id = $1
  set local_scratch_dir = $2
  echo Cleanup script received these arguments from the command line:
else
  set master_node_id = $CLEAN_MASTER_NODE_ID
  set local_scratch_dir = $CLEAN_LOCAL_SCRATCH_DIR
  echo Cleanup script received these arguments from environment variables:
endif
 echo "   master_node_id:    $master_node_id"
 echo "   local_scratch_dir: $local_scratch_dir"

# Comment the following line when you are finished testing...
ssh $master_node_id "ls -al $local_scratch_dir"
# ...and uncomment the following line for actual usage
# ssh $master_node_id "rm -r $local_scratch_dir"
echo Cleanup script complete!

Further Reading

MOAB Users Guide: www.clusterresources.com/products/mwm/docs/moabusers.shtml

MOAB Command Overview: www.clusterresources.com/products/mwm/docs/a.gcommandoverview.shtml

MOAB Admin Guide: www.clusterresources.com/products/mwm/moabdocs/index.shtml

TORQUE Admin Manual: www.clusterresources.com/torquedocs21/

TORQUE Documentation Wiki: http://www.clusterresources.com/wiki/doku.php?id=torque:torque_wiki