Job Routing and Priority
Partitions on Athena
The Athena cluster is divided into two partitions. A job is only allowed to run in one partition at a time. In our case, we impose this constraint because we have a set of 16 nodes that, although well connected to one another, are poorly connected to the other 112 compute nodes in the cluster. Therefore, these nodes (debug_nodes) are segregated from the rest (batch_nodes) by being in their own partition.
Batch partition
The main partition is batch_nodes and consists of 112 nodes. All of the nodes that are owned by specific groups (e.g. physics, astro, int, cenpa) are in the batch_nodes partition. (Currently, the inverse is also true: all nodes in the batch_nodes partition are owned by exactly one group.)
Jobs submitted to the default queue (see below) run only in the batch_nodes partition.
Debug partition
The debug_nodes partition consists of the 16 nodes that are not well connected to the rest of the cluster. Their primary purpose is for smaller debugging and interactive jobs.
Jobs submitted to the debug queue (see below) run only in the debug_nodes partition.
Queues on Athena
The Athena cluster has three queues: default, debug, and scavenge. The queue you submit your job to determines which partition(s) it is eligible to run in. You select a queue with the -q parameter. For example, to submit a job to the debug queue:
[richardc@athena0 ill] qsub -q debug -l nodes=2:ppn=8,walltime=30:00 myscript.csh
Default queue
The default queue is, obviously, the default queue for job submissions. Jobs submitted here run only in the batch_nodes partition. The default time limit is 3 hours.
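For example, to submit a job to the default queue with an explicitly requested walltime instead of relying on the 3-hour default, no -q option is needed (the resource request here is only illustrative):
[richardc@athena0 ill] qsub -l nodes=4:ppn=8,walltime=2:00:00 myscript.csh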
Debug queue
Jobs submitted to the debug queue will run only in the debug_nodes partition. They have a maximum time limit of 1 hour. If you submit a longer job to the debug queue, the batch system will reject it. Keeping "debug" jobs separate from "batch" jobs guarantees high availability of resources for short, interactive use.
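For example, a submission like the following would be rejected, since its requested two-hour walltime exceeds the 1-hour limit:
[richardc@athena0 ill] qsub -q debug -l nodes=2:ppn=8,walltime=2:00:00 myscript.csh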
Scavenge queue
Jobs submitted to the scavenge queue are eligible to run in either the batch_nodes partition or the debug_nodes partition (with the restriction that the job will not span partition boundaries when it runs). This flexibility comes at a price: if your job happens to be running in the debug_nodes partition and somebody else submits a job to the debug queue, that job will preempt yours if there are not enough free debug nodes. If your job is lucky and finds room to run in the batch_nodes partition, it will be identical in priority and preemption behavior to jobs submitted to the default queue.
Therefore, the scavenge queue is designed for jobs that want to take advantage of the high availability of the nodes in the debug_nodes partition, at the expense of being preemptable by debug queue jobs. They also compete on an equal footing with standard default jobs in the batch_nodes partition.
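To submit a job to the scavenge queue, use -q scavenge just as with the debug queue (the resource request below is only illustrative):
[richardc@athena0 ill] qsub -q scavenge -l nodes=2:ppn=8,walltime=1:00:00 myscript.csh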
Note that if you submit a scavenge job for more than 16 nodes (i.e. larger than the debug_nodes partition), it will be functionally identical to a default job.
Requesting Special Access to Nodes
If you belong to one of the groups "astro," "cenpa," "int," or "physics," you have the ability to request that your jobs get special handling on the nodes that are owned by that group. "astro," "int," and "physics" own 16 nodes each, and "cenpa" owns 8. There are two levels of special handling that each group can request: "priority" and "preemption."
Note that you can request priority handling only if you submit jobs to the default queue.
Priority access
If you request "priority" access, your job will be placed ahead of non-priority jobs for the nodes that your group owns. To request "priority" access, simply use the -l qos=<group> option for qsub, where <group> is the name of your group (e.g. astro, cenpa, int, physics). For example, if you are in the "physics" group and want priority access to the nodes owned by "physics":
[richardc@athena0 ill] qsub -l qos=physics,nodes=8:ppn=8,walltime=30:00 myImportantScript.csh
If two or more jobs of the same group request priority handling, the one submitted first will run first.
Preemptive access
Let's say you have an important deadline tomorrow and you really need your nodes now. You can opt to kick everyone else off of the nodes that your group owns (ah, that sweet thrill of ownership!). To do this, you format your job submission with -l qos=<group>_now. For example, if you are in the "astro" group and want to run a job on your nodes right now and preempt anybody else who may be using them:
[richardc@athena0 ill] qsub -l qos=astro_now,nodes=8:ppn=8,walltime=30:00 myVeryImportantScript.csh
Note that in this example, priority "astro" jobs will also be preempted (and "physics_now" preempts "physics," etc). So be prepared to be very very nice to your colleagues if you exercise this option.
This option should be used only when time is critical, as it will kill any jobs running on your group's nodes and cause them to be requeued.
What if my job got preempted?
If your job is preempted, it will be automatically requeued. It will also have a priority equal to the priority it would have had if it had actually waited in the queue the entire time (i.e. you do not have to return to the back of the line). Right now, there is no way to specify that you want your job canceled instead of requeued when it is preempted. We are working with the vendor to enable this flexibility and will notify users if it becomes available.
Therefore, it is a good idea to write your scripts so that they can detect if they have been restarted and compensate for that.
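One minimal sketch of how a job script might do this, assuming (as is standard in Torque/PBS) that a requeued job keeps its job ID and that $PBS_JOBID and $PBS_O_WORKDIR are set in the job's environment. The marker-file scheme is only an illustration, not a site-provided facility:
#!/bin/csh
#PBS -l nodes=2:ppn=8,walltime=3:00:00
# Marker file in the submission directory. It survives a preemption because
# the job keeps the same $PBS_JOBID when it is requeued.
set marker = "$PBS_O_WORKDIR/.started.$PBS_JOBID"
if ( -e "$marker" ) then
    echo "Job $PBS_JOBID was restarted; resuming from the last checkpoint."
    # ... load the most recent checkpoint instead of starting from scratch ...
else
    echo "First run of job $PBS_JOBID; starting fresh."
    touch "$marker"
endif
# ... main computation, writing periodic checkpoints ...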
Can I find out who the [expletive] was that preempted my job?
No. This is in the interest of minimizing violence within the research community. If too many users kill one another off, it may impact funding availability for the cluster.