
Comparing Tools for Running Large Batches of Independent Tasks

Version 12
Created on: Sep 26, 2013 10:16 AM by Jules Kouatchou - Last Modified: Oct 24, 2013 11:29 AM by Jules Kouatchou

We compare the Portable Distributed Scripts (PoDS) tool and GNU Parallel. Both tools allow users to execute independent tasks concurrently across nodes on multicore platforms. We want to learn how easy these tools are to use, what their advantages and limitations are, how they handle environment variables, how well they run independent tasks concurrently, etc.

 

We will carry out a series of experiments to test the tools under various configurations. Whenever possible, we also use the SLURM srun command line to submit independent tasks.

 

What is PoDS?

PoDS is an application that allows users to execute independent tasks concurrently across nodes on multicore clusters. It does not require any knowledge of the individual tasks and makes no assumptions about the underlying application. As a matter of fact, the tasks to be executed can come from different applications. It can be seen as a task-parallelism tool in which concurrent independent tasks are executed in parallel. PoDS was initially designed to run serial tasks, but it can also be used on parallel (MPI or OpenMP, for instance) tasks. See the link for additional information.

 

To utilize PoDS, users must:

 

  1. First create a simple ASCII "execution" file where each line contains a complete command (along with any parameters required to execute it, e.g., input files, switches, etc.) that they want to execute.
    • The command can itself be another script that issues subtasks (each of which is executed sequentially on the same processor on a node).
    • There can be some dependency among the subtasks (since they will run on the same processor) as long as the user lists them sequentially.
  2. Provide the list of available nodes to be used to execute the commands. The list is automatically determined by PoDS if a batch or an interactive session is employed.
  3. Provide the number of cores to be used per node.
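For illustration, step 1 can be sketched programmatically. The snippet below builds a hypothetical execution file; the executable and file names are made up, and the pods.py invocation is shown only as a comment:

```python
# Build a hypothetical "execution" file: one complete, independent
# command per line (the executable and file names are illustrative).
with open("myExecutableFile", "w") as f:
    for n in range(1, 301):
        f.write(f"./myExecutable.exe inputData_{n}.inp > outputData_{n}.out\n")

# PoDS would then be launched with the file and the cores per node, e.g.:
#   pods.py myExecutableFile 12
```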

 

PoDS determines the list of available nodes and distributes the tasks to them. As long as tasks remain, each node receives as many tasks as it has processors (if the user chooses to employ all the processors within the node). At the end of each run, PoDS provides a summary report containing:

 

  • The number of nodes used together with the requested number of cores per node,
  • The number of independent tasks executed, and
  • The total wallclock time.

 

What is GNU Parallel (GnuPa)?

GnuPa is a shell tool for executing jobs in parallel using one or more computers.

 

  1. Allows the execution of tasks in parallel using one or more nodes.
  2. Keeps the cores on each node busy as long as enough tasks are provided.
  3. A task can be a single command or a small script that has to be run for each line of the input.
  4. A typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables.
  5. By default, GnuPa runs as many tasks in parallel as there are cores available on each node, but with command-line options it can be made to detect the number of CPUs and use all of them.
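The scheduling idea behind GnuPa (and PoDS), a fixed pool of workers draining a list of independent tasks, can be sketched in plain Python. This is an illustration of the pattern only, not of GnuPa itself:

```python
# Illustration of the task-parallel pattern: a fixed pool of workers
# drains a list of independent tasks. Real tasks would be shell
# commands; here each "task" just squares a number.
from concurrent.futures import ThreadPoolExecutor

def run_task(n):
    return n * n          # stand-in for launching one command

tasks = range(1, 9)       # eight independent tasks
with ThreadPoolExecutor(max_workers=4) as pool:   # "4 cores"
    results = list(pool.map(run_task, tasks))

print(results)            # tasks run concurrently; map preserves input order
```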

 

What is SLURM srun?

 

  • Runs a parallel job on a cluster managed by SLURM
  • Can create a resource allocation in which to run the parallel job
  • Able to run sequential tasks in parallel by using all the cores on each available node

 

#!/bin/csh 
#SBATCH --job-name=YourJobname 
#SBATCH --account=YourProject 
#SBATCH --time=hh:mm:ss 
#SBATCH --nodes=2 
 
@ n = 1
while ( $n <= 300 )
     srun -n 1 -c 1  --exclusive ./myExecutable.exe  inputData_"$n".inp > outputData_"$n".out &
     @ n = $n + 1
end
 
wait

 

Performance Analysis

We want to compare the performance of PoDS and GnuPa when they are used to run large batches of tasks. We assume that all the tasks are listed in an execution file labelled myExecutableFile. The only requirement is that the tasks be independent of each other. Each task can be a simple command line, a script (containing subtasks), a parallel job (MPI or OpenMP), etc.

 

Here is how the tasks are submitted with PoDS through PBS or SLURM:

 

set coresPerNode = 1
pods.py myExecutableFile $coresPerNode

 

A similar setting in GnuPa will look like:

 

setenv parallel /usr/local/other/GnuParallel/parallel-20110722/bin/parallel
set coresPerNode = 1
$parallel --jobs $coresPerNode --sshloginfile $PBS_NODEFILE -W$PWD < myExecutableFile

 

Here coresPerNode refers to the selected number of cores (CPUs) per node. In the setting above, only one core will be used on each available node.

 

 

Experiment 1: Random Calculations

We wrote a simple Fortran 90 application that implements the Metropolis Monte Carlo method for numerical integration. The code generates a series of random numbers that are used to approximate the integral. The application was compiled using the command:

 

gfortran -o metropolisScheme.exe metropolisScheme.F90

 

 

We call the application 300 times with PoDS, GnuPa and SLURM srun.

 

The following command line was used for GnuPa:

 

setenv parallel /usr/local/other/GnuParallel/parallel-20110722/bin/parallel

seq 300 | $parallel -j $coresPerNode -u --sshloginfile $PBS_NODEFILE "cd $PWD;./metropolisScheme.exe"

 

The timing results are summarized in the following table:

 

 

Number of Nodes | Core/Node | PoDS Time (s) | GnuPa Time (s) | SLURM srun (s)
1 | 1  | 4505 | 3613 | -
1 | 2  | 2227 | 1807 | -
1 | 4  | 1125 | 910  | -
1 | 8  | 571  | 458  | -
1 | 12 | 375  | 341  | 300
2 | 1  | 2256 | 1707 | -
2 | 2  | 1130 | 990  | -
2 | 4  | 571  | 459  | -
2 | 8  | 286  | 254  | -
2 | 12 | 196  | 167  | 152

 

Each time the application is executed, it performs the same number of operations on different random numbers, so the amount of work done is exactly the same for all the runs presented in the table. We have a clear indication of how PoDS and GnuPa perform: GnuPa is faster than PoDS. We will later analyze why PoDS is slower than GnuPa.

 

Experiment 2: Processing netCDF Files

We have a set of netCDF files and we want to do the following (using Python, Numpy and netCDF4):

  • Open each file and read all the fields that have a time dimension
  • Compute the time average of the fields read in
  • Create a new netCDF file and write out the average values
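The averaging step can be sketched with Numpy alone (the actual processing used netCDF4 to read and write the files; the array below is a stand-in for a variable with a time dimension):

```python
# Sketch of the time-averaging step (numpy only; the real script used
# the netCDF4 module to read and write files). The array below stands
# in for a netCDF variable with dimensions (time, lat, lon).
import numpy as np

def time_average(var):
    """Average an array over its leading (time) axis."""
    return np.asarray(var).mean(axis=0)

field = np.arange(24.0).reshape(4, 2, 3)   # 4 time steps on a 2x3 grid
avg = time_average(field)                  # time dimension removed
print(avg.shape)                           # (2, 3)
```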

 

We use the GnuPa command line:

 

cat ncInputFiles.txt | $parallel --jobs $coresPerNode --sshloginfile $PBS_NODEFILE -W$PWD ./fileProc.sh {}

 

where ncInputFiles.txt is a file containing the list of netCDF files to be manipulated, and fileProc.sh is a simple C shell script that sets the environment variables and calls the Python code (which processes the input file).

 

We processed 121 netCDF files, each with more than 30 variables having a time dimension. The timing results are summarized in the following table:

 

 

Number of Nodes | Core/Node | PoDS Time (s) | GnuPa Time (s) | SLURM srun (s)
1 | 1  | 3185 | 2973 | -
1 | 2  | 1680 | 1544 | -
1 | 4  | 791  | 825  | -
1 | 8  | 407  | 490  | -
1 | 12 | 329  | TBD  | TBD
2 | 1  | 1627 | 1745 | -
2 | 2  | 800  | 806  | -
2 | 4  | 480  | 539  | -
2 | 8  | 260  | 321  | -
2 | 12 | 249  | TBD  | TBD

 

 

Experiment 4: mpi4Py Applications

We have two mpi4Py scripts:

  1. Do Mandelbrot calculations (here the root mpi4py script uses a Fortran executable to do calculations and communications)
  2. Compute the number Pi.
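As a rough illustration of the second task, Pi can be estimated by Monte Carlo sampling. The serial sketch below is ours, not the actual mpi4Py script; the MPI version would split the samples across ranks and reduce the partial hit counts on the root:

```python
# Serial sketch of the Pi task: sample points in the unit square and
# count those inside the quarter circle. In an mpi4py version, each
# rank would handle samples // size points and the root would sum the
# hit counts (e.g., with comm.reduce).
import random

def estimate_pi(samples, seed=0):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / samples

pi_est = estimate_pi(100_000)
print(pi_est)   # close to 3.14159 (exact value depends on the seed)
```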

 

We adopt the principles presented in the PoDS and MPI Tasks link. We created an execution file that contains the Mandelbrot and Pi tasks. The following table presents the timing information generated by using PoDS and GnuPa.

 

 

Number of Nodes | MPI Processes | PoDS Time (s) | GnuPa Time (s)
1 | 2  | 410 | 449
1 | 4  | 171 | 166
1 | 6  | 118 | 109
1 | 8  | 92  | 88
1 | 10 | 76  | 60
1 | 12 | 76  | 62
2 | 2  | 276 | 247
2 | 4  | 97  | 75
2 | 6  | 59  | 58
2 | 8  | 53  | 45
2 | 10 | 36  | 34
2 | 12 | 34  | 42

 

We observe that in this application GnuPa is faster than PoDS, and that both tools display acceptable parallel performance when handling MPI tasks. Since PoDS reports the time each task takes to complete, we record the computing times of the Mandelbrot and Pi tasks:

 

 

Number of Nodes | MPI Processes | Pi Task (s) | Mandelbrot Task (s)
1 | 2  | 272 | 134
1 | 4  | 93  | 72
1 | 6  | 55  | 52
1 | 8  | 44  | 42
1 | 10 | 33  | 37
1 | 12 | 28  | 40
2 | 2  | 273 | 133
2 | 4  | 94  | 73
2 | 6  | 56  | 52
2 | 8  | 49  | 45
2 | 10 | 33  | 33
2 | 12 | 28  | 30

 

The table shows that the parallel performance of individual tasks does not deteriorate with PoDS. Comparing the two tables, we can also note that PoDS adds some overhead that is more noticeable when we have too few "short" tasks. If the overhead is not taken into consideration, PoDS performs as well as GnuPa or even better.

 

Comments

 

  • PoDS and GnuPa can handle MPI applications as long as each MPI task is executed within a single node.
  • PoDS introduces overheads that cause it to be slower than GnuPa. The reason is the way PoDS monitors running tasks: it uses "sleep" commands to ensure tasks are complete before new ones are executed. Initial tests with a different approach showed that PoDS does slightly better than GnuPa.
  • PoDS and GnuPa do not automatically pass environment variables to remote processes. The environment variables have to be provided from the command line (a future version of PoDS will include such an option).
  • We had issues with GnuPa command lines. For instance, in Experiment 2 we had to try several methods to identify one that could run the tasks.
  • PoDS is an in-house tool that is flexible and can be modified to meet users' requirements. Work is currently being done on PoDS to make it more robust and reduce its overhead.