Skip to ContentSkip to Navigation
Maatschappij/bedrijvenCentrum voor Informatie TechnologieResearch and Innovation SupportVirtual Reality en Visualisatie

Millipede cluster user guide: Submitting jobs

The login nodes of the cluster should only be used for editing files, compiling programs and very small tests (about a minute). If you perform large calculations on the login node you will hinder other people in their work. Furthermore you are limited to that single node and might therefore as well run the calculation on your desktop machine.

In order to perform larger calculations you will have to run your work on one or more of the so called ‘batch’ nodes. These nodes can only be reached through a workload management system. The task of the workload management system is to allocate resources (like processor cores and memory) to the jobs of the cluster users. Only one job can make use of a given core and a piece of memory at a time. When all cores are occupied no new jobs can be started and these will have to wait and are placed in a queue. The workload management system fulfils tasks like monitoring the compute nodes in the system, controlling the jobs (starting and stopping them), and monitoring job status.

The priority in the queue depends on the cluster usage of the user in the recent past. Each user has a share of the cluster. When the user has not been using that share in the recent past his priority for new jobs will be high. When the user has been doing a lot of work, and has gone above his share, his priority will decrease. In this way no single user can use the whole cluster for a long period of time, preventing other users from doing their work. It also allows users to submit a lot of jobs in a short period of time, without having to worry about the effect that may have on other users of the system. Note, that since the system has had difficulties handling thousands of jobs we have put limits on the number of jobs that the scheduler will evaluate for each user. More jobs can be submitted, but these will be put aside until the number of jobs for the user is below the threshold again.

The workload management and scheduling system used on the cluster is the combination of torque for the workload management and maui for the scheduling. More information about this software can be found at http://www.clusterresources.com/

Note that you may have to add torque and maui to your environment first, before you can use the commands described below. You can do this using:

$ module add torque maui

Job script

In order to run a job on the cluster a job script should be constructed first. This script contains the commands that you want to run. It also contains special lines starting with “#PBS”. These lines are interpreted by the torque workload management system. An example is given below:

#!/bin/bash #PBS -N myjob #PBS -l nodes=1:ppn=2 #PBS -l pmem=500mb #PBS -l walltime=02:00:00 cd my_work_directory myprog a b c

Here is a description of what it does:

#!/bin/bash The interpreter used to run the script if run directly. /bin/bash in this case
The lines starting with #PBS are instructions for the job scheduler on the system.
#PBS -N myjob This is used to attach a name to the job. This name will be displayed in the status listings.
#PBS -l nodes=1:ppn=2 Request 2 cores (ppn=2) on 1 computer (nodes).
#PBS -l pmem=500mb Request 500 MB of memory for each core used by the the job. This will be 1GB in total.
#PBS -l walltime=02:00:00 The job may take at most 2 hours. The format is hours:minutes:seconds. After this time has passed the job will be removed from the system, even when it was not finished! So please be sure to select enough time here. Note, however that giving much more time than necessary may lead to a longer waiting time in the queue when the scheduler is unable to find a free spot.
cd my_work_directory Go to the directory where my input files are. It is useful to use the statement "cd $PBS_O_WORKDIR" here. $PBS_O_WORKDIR points to the directory from which the job was submitted. In most cases this is the location of the jobsript.
myprog a b c Start my program called myprog with the parameters a b and c.

Submitting the job

The job script can be submitted to the scheduler using the qsub command, where job_script is the name of the script to submit:

$ qsub job_script

1421463.master

The command returns with the id of the submitted job. In principle you do not have to remember this id as it can be easily retrieved later on.

Checking job status

The status of the job can be requested using the commands qstat or showq. The difference between the commands is that showq shows jobs in order of remaining time when jobs are running or priority when jobs are still scheduled, while qstat will show the jobs in order of appearance in the system (by job id). Here are some examples:

Here an (shortened) example for qstat:

$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
195864.master             sam2             p123456         122:06:1 R nodeslong
293889.master             ...core_ion2.inp p234567         00:00:00 R quadslong
295494.master             ..._core_del.inp p234567         00:00:00 R quadslong
319979.master             ...er_d2h_vb.inp p234567         00:00:00 R quadslong
381076.master             Ligand           p345678         12243:10 R nodeslong
381507.master             no-Ligand        p345678         12169:37 R nodeslong
381515.master             dopc-rho-9       p456789         8064:17: R nodeslong
386704.master             ...ter-pol_PME_1 p567890         1957:36: R nodeslong
386714.master             ...ans-pol_PME_3 p567890         1957:16: R nodeslong
389836.master             pD0_TwoM4        p678901         1627:35: R quadslong
403177.master             pD1_TwoM2        p678901                0 Q nodeslong
403940.master             pD0_TwoMS8       p678901                0 Q quadslong
403946.master             pD1_TwoMS8       p678901                0 Q quadslong
404602.master             pbd3-1           p789012         146:38:1 R nodeslong
404604.master             pbd3-2           p789012         145:59:4 R nodeslong
404606.master             pbd3-3           p789012         146:19:5 R nodeslong
404608.master             pbd4-1           p789012         146:29:4 R nodeslong
404609.master             pbd4-2           p789012         161:30:3 R nodeslong
.....

The S field shows the status of the job. In this case R for running and Q for queued.

Here is also an (also shortened) example of output from showq:

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

10754468           s9343231    Running    12    00:07:47  Thu May  8 15:23:39
10748690            p623296    Running     1    00:30:28  Tue May  6 23:56:20
10439469            p651143    Running     1    00:56:12  Mon Apr 28 16:22:04
10748693            p623296    Running     1     1:33:42  Wed May  7 00:59:34
10691587            p652081    Running     1     2:03:16  Sun May  4 13:29:08
10723344            p651167    Running     1     2:25:41  Mon May  5 18:51:33
10723345            p651167    Running     1     2:26:21  Mon May  5 18:52:13
10754207            p662281    Running     2     2:32:51  Thu May  8 08:58:43
10754208            p662281    Running     2     2:32:51  Thu May  8 08:58:43
10754209            p662281    Running     2     2:32:51  Thu May  8 08:58:43
10748701            p622296    Running     1     2:37:57  Wed May  7 02:03:49
10748706            p625296    Running     1     2:37:57  Wed May  7 02:03:49
....
....

  1138 Active Jobs    2659 of 3108 Processors Active (85.55%)
                       233 of  243 Nodes Active      (95.88%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

10662720            p634661       Idle    72  1:00:00:00  Wed Apr 30 06:59:01
10666757            p634661       Idle    72  1:00:00:00  Wed Apr 30 17:04:24
10666769            p634661       Idle    72  1:00:00:00  Wed Apr 30 17:04:43
10666771            p634661       Idle    72  1:00:00:00  Wed Apr 30 17:05:07
10666773            p634661       Idle    72  1:00:00:00  Wed Apr 30 17:05:37
10666775            p634661       Idle    72  1:00:00:00  Wed Apr 30 17:06:03
....
....

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

10663884            p751502  BatchHold    24 10:00:00:00  Wed Apr 30 12:36:33
10663886            p751502  BatchHold    24 10:00:00:00  Wed Apr 30 12:38:06
10663887            p751502  BatchHold    24 10:00:00:00  Wed Apr 30 12:38:22
10663889            p751502  BatchHold    24 10:00:00:00  Wed Apr 30 12:40:40
10663890            p751502  BatchHold    24 10:00:00:00  Wed Apr 30 12:41:11
....
....

Total Jobs: 2429   Active Jobs: 1138   Idle Jobs: 1086   Blocked Jobs: 205

   

The useful information is the remaining time for runing jobs. Idle jobs are shown in order of priority. Blocked jobs arise when jobs cannot be run immediately. This happens when certain limits (e.g. maximum number of long jobs) have been reached. Jobs are also blocked when they do not fit the nodes that they have been submitted to. This can happen if you ask for more cores or memory than present on a single node.

A useful option for both commands is the -u option which will only show jobs for the given user, e.g.

$ showq -u p123456

will only show the jobs of user peter. It may also be useful to use less to list the output per page. This can be done by piping the output to less using |. (This symbol can on US-international keyboards be found above the symol “\”, and may have a small hole in it on the keyboard. )

$ showq | less

The result of the command will be displayed per page. and can be used to scroll through the text, as well as the up and down arrow. Pressing q will exit less.

Cancelling jobs

If you discover that a job is or will not be running as it should you can remove the job from the queuing system using the qdel command.

$ qdel jobid

Here jobid is the id of the job you want to cancel. You can easily find the ids of your jobs by using qstat or showq.

Queues

Because the cluster has three types of nodes available for jobs queues have been created that match these nodes. These queues are:

Queue Wallclock time limit Node type Remarks
nodes 24:00:00 12 core nodes default based on walltime
nodesmedium 72:00:00 12 core nodes default based on walltime
nodeslong 240:00:00 12 core nodes default based on walltime
quads 24:00:00 24 core nodes
quadsmedium 72:00:00 24 core nodes
quadslong 240:00:00 24 core nodes
smp 24:00:00 64 core node
smpmedium 72:00:00 64 core node
smplong 240:00:00 64 core node
short 00:30:00 12 core nodes useful for testing, some reserved capacity for this

The jobs have a default maximum wallclock time limit of 30 minutes, which means that you have to set the correct limit yourself. Using a good estimate will improve the scheduling of your jobs.

The default queue you will be put into when submitting a job is the “nodes” queue. If you want to use a different type of machine, you will have to select the queue for these machines explicitly. This can be done using the -q option on the commandline, or using a line like "#PBS -q <queuename>" in the jobscript:

$ qsub -q smp myjob

There is also a limit on how many jobs can be submitted to the long queues. To prevent the system from being fully occupied by long running jobs, reducing the turnaround time for new jobs, we limited the 10 day queues to maximum half of the cluster.

Parallel jobs

There are several ways to run parallel jobs that use more than a single core. They can be grouped in two main flavours. Jobs that use a shared memory programming model, and those that use a distributed memory programming model. Since the first depend on shared memory between the cores these can only be run on a single node. The latter are able to run using multiple nodes.

Shared memory jobs

Jobs that need shared memory can only run on a single node. Because there are three types of nodes the amount of cores that you want to use and the amount of memory that you need, determine the nodes that are available for your job. For obtaining a set of cores on a single node you will need the PBS directive:

#PBS -l nodes=1:ppn=n

where you have to replace n by the number of cores that you want to use. You will later have to submit to the queue of the node type that you want to use.

Distributed memory jobs

Jobs that do not depend on shared memory can run on more than a single node. This leads to a job requirement for nodes that looks like:

#PBS -l nodes=n:ppn= m

Where n is the number of nodes (computers) that you want to use and m is the number of cores per computer that you want to use. If you want to use full nodes the number m should be equal to the number of cores per node.

Memory requirements

By default a job will have a memory requirement per process that is equal to the available memory of a node divided by the number of cores. This default has been set to 1900MB per core. This number is less thant 2GB because the operating system also needs some memory, which is not available for the jobs. The default is set the same for all node types. If you need more (or less) than this amount of memory, you should specify this in you job requirement by adding a line:

#PBS -l pmem=xG

This means that you require x GByte of memory per core. You can also use the suffix M for megabytes. The total amount of memory available for your job is this number multiplied by the number of cores you request within a single node.

Other PBS directives

There are several other #PBS directives one can use. Here a few of them are explained.

-l walltime= hh:mm:ss

Specify the maximum wallclock time for the job. After this time the job will be removed from the system.

-l nodes=n:ppn=m

Specify the number of nodes and cores per node to use. n is the number of nodes and m the number of cores per node. The total number of cores will be n*m

-l mem=xmb

Specify the amount of memory necessary for the job. The amount can be specified in mb(Megabytes), or gb (Gigabytes). In this case x MBytes.

-j oe Merge standard output and standard error of the jobs script in to the output file. (The option eo would combine the output into the error file).
-e filename Name of the file where the standard error output of the job script will be written into.
-o filename

Name of the file where the standard output output of the job script will be written into.

-m events

Mail job information to the user for the given events, where events is a combination of letters. These letters can be: n (no mail), a (mail when the job is aborted), b (mail when the job is started), e ( mail when the job is finished). By default mail is only sent when the job is aborted.

-M emails e-mail adresses for e-mailing events. emails is a comma separated list of e-mail adresses.
-q queue_name Submit to the queue given by queue_name
-S shell

Change the interpreter for the job to shell

Laatst gewijzigd:02 oktober 2015 22:23