Running Batch Jobs

This section describes the batch processing environment in our facilities.

What is Batch Processing?

Batch processing is a procedure by which you submit a program for delayed execution. It lets you run multiple commands and functions without waiting for the results of one command before starting another, and without being present while they run. In this section, the terms process and job are used interchangeably.

The batch processing system at HMDC runs on a high-throughput cluster on which you can perform extensive, time-consuming calculations without the technical limitations imposed by a typical workstation.

Why Use Batch Processing?

HMDC provides a large, powerful pool of computers that are available for you to use to conduct research. This pool is extremely useful for the following applications:

  • Jobs that run for a long time - You can submit a batch processing job that executes for days or weeks and does not tie up your RCE session during that time. In fact, you do not need to run an RCE desktop session to submit a batch job. Batch jobs can be submitted from the command line via ssh.

  • Jobs that are too big to run on your desktop - You can submit batch processing that requires more infrastructure than your workstation provides. For example, you could use a dataset that is larger in size than the memory on your workstation.

  • Groups of dozens or hundreds of jobs that are similar - You can submit batch processing that entails multiple uses of the same program with different parameters or input data. Examples of these types of submission are simulations, sensitivity analysis, or parameterization studies.

To learn more about our batch cluster resource manager, continue reading below. To move on and learn how to submit a batch job, click the Batch Basics link in the left side menu.

Condor System for Batch Processing

The Condor system enables you to submit a program for execution as batch processing, which then does not require your attention until processing is complete. The Condor project website is located at the following URL:

http://www.cs.wisc.edu/condor/

To view the user manual for this software, go to the following URL and choose a viewing option:

http://www.cs.wisc.edu/condor/manual/

Condor System Components and Terminology

A Condor system comprises a central manager and a pool. The Condor central manager machine manages the execution of all jobs that you submit as batch processing. The pool of Condor machines associated with that central manager executes individual processes based on policies defined for each pool member. If a computing installation has multiple Condor pools or additional machine clusters dedicated to Condor system use, these pools and clusters can be associated as a flock.

Listed below are some common terms that have a specific meaning in Condor:

  • Cluster - A group of jobs or processes submitted together to Condor for batch processing is known as a cluster. Each job has a unique job identifier in a cluster, but shares a common cluster identifier.

  • Pool - A Condor pool comprises a single machine serving as a central manager, and an arbitrary number of machines that have joined the pool. Simply put, the pool is a collection of resources (machines) and resource requests (jobs).

  • Jobs - In a Condor system, jobs are unique processes submitted to a pool for execution and are tracked with a unique process ID number.

  • Flock - A Condor flock is a collection of Condor pools and clusters associated for managing jobs and clusters with varying priorities. A Condor flock functions in the same manner as a pool, but provides greater processing power.

When you submit batch processing to the Condor system, you use a submit description file (or submit file) to describe your jobs. This file results in a ClassAd for each job, which defines the requirements and preferences for running that job. Each pool machine has its own description, called the machine ClassAd, which states the requirements and preferences for the jobs that machine will run. The central manager matches job ClassAds with pool machine ClassAds to select the machine on which to execute a job.
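
The matching step described above can be illustrated with a small Python sketch. This is a toy model only: it does not use Condor's ClassAd language or any Condor API, and every name in it (matches, select_machine, the attribute keys, and the machine names) is a hypothetical chosen for illustration. It assumes that a match requires each side's requirements to accept the other side, and that the job's preference is used to choose among the machines that match.

```python
# Toy model of ClassAd-style matchmaking.
# NOT Condor's ClassAd language or API: all names and attributes are hypothetical.

def matches(job_ad, machine_ad):
    """Return True if the job accepts the machine and the machine accepts the job."""
    return job_ad["requirements"](machine_ad) and machine_ad["requirements"](job_ad)

def select_machine(job_ad, machine_ads):
    """Return the matching machine that the job ranks highest, or None if none match."""
    candidates = [m for m in machine_ads if matches(job_ad, m)]
    if not candidates:
        return None
    return max(candidates, key=job_ad["rank"])

# Hypothetical job ad: needs Linux and at least 4096 MB, prefers more memory.
job = {
    "owner": "alice",
    "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 4096,
    "rank": lambda m: m["memory_mb"],
}

# Hypothetical machine ads, each with its own policy about which jobs it accepts.
machines = [
    {"name": "node1", "os": "LINUX", "memory_mb": 2048,
     "requirements": lambda j: True},
    {"name": "node2", "os": "LINUX", "memory_mb": 8192,
     "requirements": lambda j: True},
    {"name": "node3", "os": "LINUX", "memory_mb": 16384,
     "requirements": lambda j: j["owner"] == "bob"},
]

chosen = select_machine(job, machines)
print(chosen["name"] if chosen else "no match")
```

In this sketch, node1 is rejected by the job (too little memory) and node3 rejects the job (its policy only accepts another owner), so node2 is selected. A real pool evaluates much richer expressions, but the two-sided nature of the match is the point being shown.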

Process Identification Numbers

For Condor batch processing, there are two identification numbers that are important to you:

  • Cluster number - The cluster number represents each set of executable jobs submitted to the Condor system. It is a cluster of jobs, or processes. A cluster can consist of a single job.

  • Process number - The process number represents each individual job (process) within a single cluster. Process numbers for a cluster always start at zero.

Each single job in a cluster is assigned a process identification number, called the process ID or job ID. This ID consists of both cluster and process number in the form <cluster>.<process>.

For example, if you submit a batch that consists of a single job, and your batch submission to the Condor queue is assigned cluster number 20, then your process ID is 20.0. If you submit a batch that consists of fifteen jobs that all use the same executable, and your batch submission to the Condor queue is assigned cluster number 8, then your process IDs range from 8.0 to 8.14.
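
The numbering in these examples can be reproduced with a short Python sketch. The helper names job_ids and split_job_id are hypothetical and are not part of Condor; the sketch only assumes the <cluster>.<process> form and the zero-based process numbers described above.

```python
def job_ids(cluster, count):
    """List the IDs of `count` jobs in one cluster, as <cluster>.<process> strings."""
    return [f"{cluster}.{process}" for process in range(count)]

def split_job_id(job_id):
    """Split an ID such as '8.14' into its (cluster, process) numbers."""
    cluster, process = job_id.split(".")
    return int(cluster), int(process)

single = job_ids(20, 1)    # a batch with a single job in cluster 20
fifteen = job_ids(8, 15)   # a batch with fifteen jobs in cluster 8

print(single)
print(fifteen[0], fifteen[-1])
print(split_job_id(fifteen[-1]))
```

Because process numbers start at zero, the last ID in a cluster of fifteen jobs ends in 14, not 15.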