Millipede cluster user guide: Submitting jobs

The login nodes of the cluster should only be used for editing files, compiling programs and very small tests (about a minute). If you perform large calculations on a login node you will hinder other people in their work. Furthermore, you are limited to that single node, so you might as well run the calculation on your desktop machine.

In order to perform larger calculations you will have to run your work on one or more of the so-called ‘batch’ nodes. These nodes can only be reached through a workload management system. The task of the workload management system is to allocate resources (like processor cores and memory) to the jobs of the cluster users. A given core, and a given piece of memory, can be used by only one job at a time. When all cores are occupied, no new jobs can be started; new jobs then have to wait and are placed in a queue. The workload management system also monitors the compute nodes in the system, controls the jobs (starting and stopping them), and monitors job status.

The priority in the queue depends on the user's cluster usage in the recent past. Each user has a share of the cluster. When a user has not been using that share in the recent past, their priority for new jobs will be high. When a user has been doing a lot of work and has gone above their share, their priority will decrease. In this way no single user can occupy the whole cluster for a long period of time and prevent other users from doing their work. It also allows users to submit a lot of jobs in a short period of time without having to worry about the effect that may have on other users of the system. Note that, because the system has had difficulties handling thousands of jobs, we have put a limit on the number of jobs that the scheduler will evaluate for each user. More jobs can be submitted, but these will be put aside until the number of jobs for the user is below the threshold again.

The workload management and scheduling system used on the cluster is the combination of torque for the workload management and maui for the scheduling.

Note that you may first have to add torque and maui to your environment before you can use the commands described below. You can do this using:

$ module add torque maui

Job script

In order to run a job on the cluster a job script should be constructed first. This script contains the commands that you want to run. It also contains special lines starting with “#PBS”. These lines are interpreted by the torque workload management system. An example is given below:

#!/bin/bash
#PBS -N myjob
#PBS -l nodes=1:ppn=2
#PBS -l pmem=500mb
#PBS -l walltime=02:00:00
cd my_work_directory
myprog a b c

Here is a description of what it does:

#!/bin/bash	The interpreter used to run the script if it is run directly, /bin/bash in this case.
The lines starting with #PBS are instructions for the job scheduler on the system.
#PBS -N myjob	Attach a name to the job. This name will be displayed in the status listings.
#PBS -l nodes=1:ppn=2	Request 2 cores (ppn=2) on 1 computer (nodes=1).
#PBS -l pmem=500mb	Request 500 MB of memory for each core used by the job. This will be about 1 GB in total.
#PBS -l walltime=02:00:00	The job may take at most 2 hours. The format is hours:minutes:seconds. After this time has passed the job will be removed from the system, even if it has not finished! So please be sure to request enough time here. Note, however, that requesting much more time than necessary may lead to a longer waiting time in the queue when the scheduler is unable to find a free slot.
cd my_work_directory	Go to the directory where the input files are. It is useful to use the statement "cd $PBS_O_WORKDIR" here. $PBS_O_WORKDIR points to the directory from which the job was submitted. In most cases this is the location of the job script.
myprog a b c	Start the program called myprog with the parameters a, b and c.
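A job script like the one described above can also be produced from a small template. The following is only a minimal sketch: the function name make_job_script and its argument order are hypothetical and chosen for illustration. It prints a script with the same #PBS lines as the example, and uses $PBS_O_WORKDIR for the working directory as suggested in the table.

```shell
#!/bin/sh
# Hypothetical helper: print a single-node job script to standard output.
# Arguments: job name, cores on the node, memory per core, walltime,
# followed by the command line to run.
make_job_script() {
    name=$1; ppn=$2; pmem=$3; walltime=$4
    shift 4
    # \$ keeps $PBS_O_WORKDIR literal so it is expanded when the job runs.
    cat <<EOF
#!/bin/bash
#PBS -N $name
#PBS -l nodes=1:ppn=$ppn
#PBS -l pmem=$pmem
#PBS -l walltime=$walltime
cd "\$PBS_O_WORKDIR"
$*
EOF
}

# Reproduce the example job script from this section.
make_job_script myjob 2 500mb 02:00:00 myprog a b c
```

The output can be redirected to a file, which is then used as the job script.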

Submitting the job

The job script can be submitted to the scheduler using the qsub command, where job_script is the name of the script to submit:

$ qsub job_script

1421463.master

The command returns the id of the submitted job. In principle you do not have to remember this id, as it can easily be retrieved later on.
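If a script needs the job id, for example to cancel the job later, the output of qsub can be captured in a shell variable. The sketch below shows the idea; the helper job_number is hypothetical, and the id from the example above stands in for real qsub output so that the snippet also runs outside the cluster.

```shell
#!/bin/sh
# Hypothetical helper: strip the server suffix from a job id,
# e.g. "1421463.master" becomes "1421463".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# On the cluster the id could be captured at submission time:
#   jobid=$(qsub job_script)
# Here the id from the example above is used instead.
jobid=1421463.master
echo "submitted job $(job_number "$jobid") (full id: $jobid)"
```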

Checking job status

The status of the job can be requested using the commands qstat or showq. The difference between the commands is that showq shows running jobs in order of remaining time and waiting jobs in order of priority, while qstat shows the jobs in order of appearance in the system (by job id).

Here is a (shortened) example of qstat output:

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
195864.master             sam2             p123456         122:06:1 R nodeslong
293889.master             ...core_ion2.inp p234567         00:00:00 R quadslong
295494.master             ..._core_del.inp p234567         00:00:00 R quadslong
319979.master             ...er_d2h_vb.inp p234567         00:00:00 R quadslong
381076.master             Ligand           p345678         12243:10 R nodeslong
381507.master             no-Ligand        p345678         12169:37 R nodeslong
381515.master             dopc-rho-9       p456789         8064:17: R nodeslong
386704.master             ...ter-pol_PME_1 p567890         1957:36: R nodeslong
386714.master             ...ans-pol_PME_3 p567890         1957:16: R nodeslong
389836.master             pD0_TwoM4        p678901         1627:35: R quadslong
403177.master             pD1_TwoM2        p678901                0 Q nodeslong
403940.master             pD0_TwoMS8       p678901                0 Q quadslong
403946.master             pD1_TwoMS8       p678901                0 Q quadslong
404602.master             pbd3-1           p789012         146:38:1 R nodeslong
404604.master             pbd3-2           p789012         145:59:4 R nodeslong
404606.master             pbd3-3           p789012         146:19:5 R nodeslong
404608.master             pbd4-1           p789012         146:29:4 R nodeslong
404609.master             pbd4-2           p789012         161:30:3 R nodeslong
.....

The S field shows the status of the job. In this case R for running and Q for queued.
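Because the S field is a plain column, the listing is easy to summarise with standard tools. The following sketch counts the jobs in a given state; the helper name count_state is hypothetical, and it assumes the default qstat layout shown above, in which the state is the second-to-last column and job lines start with a numeric id.

```shell
#!/bin/sh
# Hypothetical helper: count the jobs in a given state (e.g. R or Q)
# in qstat output read from standard input.
count_state() {
    awk -v state="$1" '
        # Skip the header and separator lines: job lines start with "<number>."
        $1 ~ /^[0-9]+\./ && $(NF-1) == state { n++ }
        END { print n + 0 }
    '
}

# A few lines taken from the qstat listing above.
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
195864.master             sam2             p123456         122:06:1 R nodeslong
403177.master             pD1_TwoM2        p678901                0 Q nodeslong
403940.master             pD0_TwoMS8       p678901                0 Q quadslong'

printf '%s\n' "$sample" | count_state R
printf '%s\n' "$sample" | count_state Q
```

On the cluster the same function could be fed directly from the command, as in "qstat | count_state Q".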

Here is an (also shortened) example of output from showq:

ACTIVE JOBS--------------------
JOBNAME    USERNAME  STATE      PROC   REMAINING  STARTTIME

10754468   s9343231  Running      12    00:07:47  Thu May 8 15:23:39
10748690   p623296   Running       1    00:30:28  Tue May 6 23:56:20
10439469   p651143   Running       1    00:56:12  Mon Apr 28 16:22:04
10748693   p623296   Running       1     1:33:42  Wed May 7 00:59:34
10691587   p652081   Running       1     2:03:16  Sun May 4 13:29:08
10723344   p651167   Running       1     2:25:41  Mon May 5 18:51:33
10723345   p651167   Running       1     2:26:21  Mon May 5 18:52:13
10754207   p662281   Running       2     2:32:51  Thu May 8 08:58:43
10754208   p662281   Running       2     2:32:51  Thu May 8 08:58:43
10754209   p662281   Running       2     2:32:51  Thu May 8 08:58:43
10748701   p622296   Running       1     2:37:57  Wed May 7 02:03:49
10748706   p625296   Running       1     2:37:57  Wed May 7 02:03:49
.... ....

1138 Active Jobs    2659 of 3108 Processors Active (85.55%)
                     233 of  243 Nodes Active      (95.88%)

IDLE JOBS----------------------
JOBNAME    USERNAME  STATE      PROC     WCLIMIT  QUEUETIME

10662720   p634661   Idle         72  1:00:00:00  Wed Apr 30 06:59:01
10666757   p634661   Idle         72  1:00:00:00  Wed Apr 30 17:04:24
10666769   p634661   Idle         72  1:00:00:00  Wed Apr 30 17:04:43
10666771   p634661   Idle         72  1:00:00:00  Wed Apr 30 17:05:07
10666773   p634661   Idle         72  1:00:00:00  Wed Apr 30 17:05:37
10666775   p634661   Idle         72  1:00:00:00  Wed Apr 30 17:06:03
.... ....

BLOCKED JOBS----------------
JOBNAME    USERNAME  STATE      PROC     WCLIMIT  QUEUETIME

10663884   p751502   BatchHold    24 10:00:00:00  Wed Apr 30 12:36:33
10663886   p751502   BatchHold    24 10:00:00:00  Wed Apr 30 12:38:06
10663887   p751502   BatchHold    24 10:00:00:00  Wed Apr 30 12:38:22
10663889   p751502   BatchHold    24 10:00:00:00  Wed Apr 30 12:40:40
10663890   p751502   BatchHold    24 10:00:00:00  Wed Apr 30 12:41:11
.... ....

Total Jobs: 2429   Active Jobs: 1138   Idle Jobs: 1086   Blocked Jobs: 205

The most useful information here is the remaining time for running jobs. Idle jobs are shown in order of priority. Blocked jobs are jobs that cannot be run for the time being. This happens when certain limits (e.g. the maximum number of long jobs) have been reached. Jobs are also blocked when they do not fit the nodes that they have been submitted to. This can happen if you ask for more cores or memory than are present on a single node.

A useful option for both commands is the -u option which will only show jobs for the given user, e.g.

$ showq -u p123456

will only show the jobs of user p123456. It may also be useful to use less to view the output page by page. This can be done by piping the output to less using |. (On US-international keyboards this symbol is found on the same key as “\”, above it, and may be printed with a small gap in the middle.)

$ showq | less

The result of the command will be displayed page by page. The Page Up and Page Down keys can be used to scroll through the text, as well as the up and down arrow keys. Pressing q will exit less.

Cancelling jobs

If you discover that a job is not running, or will not run, as it should, you can remove it from the queuing system using the qdel command.

$ qdel jobid

Here jobid is the id of the job you want to cancel. You can easily find the ids of your jobs by using qstat or showq.

Queues

Because the cluster has three types of nodes available for jobs, queues have been created that match these node types. These queues are:

Queue	Wallclock time limit	Node type	Remarks
nodes	24:00:00	12 core nodes	default based on walltime
nodesmedium	72:00:00	12 core nodes	default based on walltime
nodeslong	240:00:00	12 core nodes	default based on walltime
quads	24:00:00	24 core nodes
quadsmedium	72:00:00	24 core nodes
quadslong	240:00:00	24 core nodes
smp	24:00:00	64 core node
smpmedium	72:00:00	64 core node
smplong	240:00:00	64 core node
short	00:30:00	12 core nodes	useful for testing, some reserved capacity for this

The jobs have a default maximum wallclock time limit of 30 minutes, which means that you have to set the correct limit yourself. Using a good estimate will improve the scheduling of your jobs.
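The relation between the requested walltime and the queue limits in the table can be checked with a few lines of shell. This is only an illustrative sketch: the function names are hypothetical, the limits are copied from the table above, and it assumes that a job whose walltime is exactly equal to a limit still fits that queue. It does not describe how the scheduler itself routes jobs.

```shell
#!/bin/sh
# Convert a walltime given as hours:minutes:seconds to seconds.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: print the first 12-core queue whose limit
# (24, 72 or 240 hours) is large enough for the given walltime.
queue_for_walltime() {
    secs=$(walltime_to_seconds "$1")
    if   [ "$secs" -le $((24 * 3600)) ];  then echo nodes
    elif [ "$secs" -le $((72 * 3600)) ];  then echo nodesmedium
    elif [ "$secs" -le $((240 * 3600)) ]; then echo nodeslong
    else
        echo "walltime $1 exceeds the 240 hour limit" >&2
        return 1
    fi
}

queue_for_walltime 02:00:00
queue_for_walltime 100:00:00
```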

The default queue you will be put into when submitting a job is the “nodes” queue. If you want to use a different type of machine, you will have to select the queue for these machines explicitly. This can be done using the -q option on the command line, or using a line like "#PBS -q <queuename>" in the job script:

$ qsub -q smp myjob

There is also a limit on how many jobs can be submitted to the long queues. To prevent the system from being fully occupied by long-running jobs, which would increase the turnaround time for new jobs, the 10-day (long) queues are limited to at most half of the cluster.

Parallel jobs

There are several ways to run parallel jobs that use more than a single core. They can be grouped into two main flavours: jobs that use a shared memory programming model, and jobs that use a distributed memory programming model. Since the former depend on memory shared between the cores, they can only run on a single node. The latter are able to run on multiple nodes.

Shared memory jobs

Jobs that need shared memory can only run on a single node. Because there are three types of nodes, the number of cores that you want to use and the amount of memory that you need determine which nodes are available for your job. To obtain a set of cores on a single node you will need the PBS directive:

#PBS -l nodes=1:ppn=n

where you have to replace n by the number of cores that you want to use. You will later have to submit to the queue of the node type that you want to use.

Distributed memory jobs

Jobs that do not depend on shared memory can run on more than a single node. This leads to a job requirement for nodes that looks like:

#PBS -l nodes=n:ppn=m

Here n is the number of nodes (computers) that you want to use and m is the number of cores per computer that you want to use. If you want to use full nodes, m should be equal to the number of cores per node.
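When a job should use full nodes, the values of n and m follow from the total number of cores wanted and the number of cores per node. The sketch below works this out; the function name full_node_request is hypothetical, and rounding up to whole nodes is an assumption about what is wanted, not a rule of the scheduler.

```shell
#!/bin/sh
# Hypothetical helper: print a "nodes=n:ppn=m" request that uses full nodes.
# Arguments: total number of cores wanted, cores per node
# (for example 12 or 24 on this cluster).
full_node_request() {
    total=$1; per_node=$2
    # Round up, so that a partly used node still counts as a whole node.
    nodes=$(( (total + per_node - 1) / per_node ))
    echo "nodes=${nodes}:ppn=${per_node}"
}

full_node_request 48 12
full_node_request 30 12
```

The second call asks for 30 cores but is rounded up to three full 12-core nodes, so 36 cores are requested in total.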

Memory requirements

By default a job will have a memory requirement per process that is equal to the available memory of a node divided by the number of cores. This default has been set to 1900 MB per core. This number is less than 2 GB because the operating system also needs some memory, which is not available for the jobs. The default is the same for all node types. If you need more (or less) than this amount of memory, you should specify this in your job requirements by adding a line:

#PBS -l pmem=xG

This means that you require x GByte of memory per core. You can also use the suffix M for megabytes. The total amount of memory available for your job is this number multiplied by the number of cores you request within a single node.
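The multiplication described above is easy to check before submitting. A minimal sketch, with a hypothetical function name and memory given in megabytes:

```shell
#!/bin/sh
# Total memory requested on one node: memory per core times cores per node.
# Arguments: pmem in MB, number of cores on the node (ppn).
total_mem_mb() {
    echo $(( $1 * $2 ))
}

total_mem_mb 500 2      # the example job script: 2 cores at 500 MB each
total_mem_mb 1900 12    # a full 12-core node at the default of 1900 MB per core
```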

Other PBS directives

There are several other #PBS directives one can use. A few of them are explained here.

`-l walltime=hh:mm:ss`	Specify the maximum wallclock time for the job. After this time the job will be removed from the system.
`-l nodes=n:ppn=m`	Specify the number of nodes and cores per node to use. n is the number of nodes and m the number of cores per node. The total number of cores will be n × m.
`-l mem=xmb`	Specify the amount of memory necessary for the job. The amount can be specified in mb (megabytes) or gb (gigabytes). In this case x megabytes.
`-j oe`	Merge standard output and standard error of the job script into the output file. (The option eo would combine the output into the error file.)
`-e filename`	Name of the file to which the standard error of the job script will be written.
`-o filename`	Name of the file to which the standard output of the job script will be written.
`-m events`	Mail job information to the user for the given events, where events is a combination of letters. These letters can be: n (no mail), a (mail when the job is aborted), b (mail when the job is started), e (mail when the job is finished). By default mail is only sent when the job is aborted.
`-M emails`	E-mail addresses to which the event mails are sent. emails is a comma-separated list of e-mail addresses.
`-q queue_name`	Submit to the queue given by queue_name.
`-S shell`	Change the interpreter for the job to shell.
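As an illustration, several of these directives can be combined in one job script. The script below is only a sketch: the job name, log file name and e-mail address are placeholders, and the body does nothing but report where it runs. It relies on the variables $PBS_O_WORKDIR and $PBS_JOBID that torque sets inside a job, and falls back to defaults when they are not set.

```shell
#!/bin/bash
#PBS -N myjob
#PBS -q short
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
#PBS -j oe
#PBS -o myjob.log
#PBS -m ae
#PBS -M user@example.org

# Outside a job the PBS variables are unset; the defaults below keep the
# script runnable for a quick test on the command line.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1
message="job ${PBS_JOBID:-none} started in $(pwd)"
echo "$message"
```

With these settings standard output and standard error end up together in myjob.log, and mail is sent when the job is aborted or has finished.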