Job Routing and Priority
Partitions on Athena
The Athena cluster is divided into two partitions. A job is only allowed to run in one partition at a time. In our case, we impose this constraint because we have a set of 16 nodes that, although well connected to one another, are poorly connected to the other 112 compute nodes in the cluster. Therefore, these nodes (debug_nodes) are segregated from the rest (batch_nodes) by being in their own partition.
Batch partition
The main partition is batch_nodes and consists of 112 nodes. All of the nodes that are owned by specific groups (e.g. physics, astro, int, cenpa) are in the batch_nodes partition. (Currently, the inverse is also true: all nodes in the batch_nodes partition are owned by exactly one group.)
Jobs submitted to the default queue (see below) run only in the batch_nodes partition.
Debug partition
The debug_nodes partition consists of the 16 nodes that are not well connected to the rest of the cluster. Their primary purpose is for smaller debugging and interactive jobs.
Jobs submitted to the debug queue (see below) run only in the debug_nodes partition.
Queues on Athena
The Athena cluster has three queues: default, debug, and scavenge. The queue you submit your job to determines which partition(s) it is eligible to run in. You select a queue with the -q parameter. For example, to submit a job to the debug queue:
[richardc@athena0 ill] qsub -q debug -l nodes=2:ppn=8,walltime=30:00 myscript.csh
Default queue
The default queue is, obviously, the default queue for job submissions. Jobs submitted here run only in the batch_nodes partition. The default time limit is 3 hours.
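For example, to submit a job to the default queue with an explicitly requested walltime instead of relying on the 3-hour default, no -q option is needed (the resource request here is only illustrative):
[richardc@athena0 ill] qsub -l nodes=4:ppn=8,walltime=2:00:00 myscript.csh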
Debug queue
Jobs submitted to the debug queue will run only in the debug_nodes partition. They have a maximum time limit of 1 hour. If you submit a longer job to the debug queue, the batch system will reject it. Keeping "debug" jobs separate from "batch" jobs guarantees high availability of resources for short, interactive use.
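For example, a submission like the following would be rejected, since its requested two-hour walltime exceeds the 1-hour limit:
[richardc@athena0 ill] qsub -q debug -l nodes=2:ppn=8,walltime=2:00:00 myscript.csh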
Scavenge queue
Jobs submitted to the scavenge queue are eligible to run in either the batch_nodes partition or the debug_nodes partition (with the restriction that the job will not span partition boundaries when it runs). This flexibility comes at a price: if your job happens to be running in the debug_nodes partition and somebody else submits a job to the debug queue, that job will preempt yours if there are not enough free debug nodes. If your job is lucky and finds room to run in the batch_nodes partition, it will be identical in priority and preemption behavior to jobs submitted to the default queue.
Therefore, the scavenge queue is designed for jobs that want to take advantage of the high availability of the nodes in the debug_nodes partition, at the expense of being preemptable by debug queue jobs. They also compete on an equal footing with standard default jobs in the batch_nodes partition.
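To submit a job to the scavenge queue, use -q scavenge just as with the debug queue (the resource request below is only illustrative):
[richardc@athena0 ill] qsub -q scavenge -l nodes=2:ppn=8,walltime=1:00:00 myscript.csh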
Note that if you submit a scavenge job for more than 16 nodes (i.e. larger than the debug_nodes partition), it will be functionally identical to a default job.
Requesting Special Access to Nodes
If you belong to one of the groups "astro," "cenpa," "int," or "physics," you have the ability to request that your jobs get special handling on the nodes that are owned by that group. "astro," "int," and "physics" own 16 nodes each, and "cenpa" owns 8. There are two levels of special handling that each group can request: "priority" and "preemption."
Note that you can request priority handling only if you submit jobs to the default queue.
Priority access
If you request "priority" access, your job will be placed ahead of non-priority jobs for the nodes that your group owns. To request "priority" access, simply use the -l qos=<group> option for qsub, where <group> is the name of your group (e.g. astro, cenpa, int, physics). For example, if you are in the "physics" group and want priority access to the nodes owned by "physics":
[richardc@athena0 ill] qsub -l qos=physics,nodes=8:ppn=8,walltime=30:00 myImportantScript.csh
If two or more jobs of the same group request priority handling, the one submitted first will run first.
Preemptive access
Let's say you have an important deadline tomorrow and you really need your nodes now. You can opt to kick everyone else off of the nodes that your group owns (ah, that sweet thrill of ownership!). To do this, you format your job submission with -l qos=<group>_now. For example, if you are in the "astro" group and want to run a job on your nodes right now and preempt anybody else who may be using them:
[richardc@athena0 ill] qsub -l qos=astro_now,nodes=8:ppn=8,walltime=30:00 myVeryImportantScript.csh
Note that in this example, priority "astro" jobs will also be preempted (and "physics_now" preempts "physics," etc). So be prepared to be very very nice to your colleagues if you exercise this option.
This option should be used only when time is critical, as it will kill any jobs running on your group's nodes and cause them to be requeued.
What if my job got preempted?
If your job is preempted, it will be automatically requeued. It will also have a priority equal to the priority it would have had if it had actually waited in the queue the entire time (i.e. you do not have to return to the back of the line). Right now, there is no way to specify that you want your job canceled instead of requeued when it is preempted. We are working with the vendor to enable this flexibility and will notify users if it becomes available.
Therefore, it is a good idea to write your scripts so that they can detect if they have been restarted and compensate for that.
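One minimal sketch of how a job script might do this, assuming (as is standard in Torque/PBS) that a requeued job keeps its job ID and that $PBS_JOBID and $PBS_O_WORKDIR are set in the job's environment. The marker-file scheme is only an illustration, not a site-provided facility:
#!/bin/csh
#PBS -l nodes=2:ppn=8,walltime=3:00:00
# Marker file in the submission directory. It survives a preemption because
# the job keeps the same $PBS_JOBID when it is requeued.
set marker = "$PBS_O_WORKDIR/.started.$PBS_JOBID"
if ( -e "$marker" ) then
    echo "Job $PBS_JOBID was restarted; resuming from the last checkpoint."
    # ... load the most recent checkpoint instead of starting from scratch ...
else
    echo "First run of job $PBS_JOBID; starting fresh."
    touch "$marker"
endif
# ... main computation, writing periodic checkpoints ...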
Can I find out who the [expletive] was that preempted my job?
No. This is in the interest of minimizing violence within the research community. If too many users kill one another off, it may impact funding availability for the cluster.