Parallel Programming

More than parallel For loops.

A Functional Style for Process Generation

A general way to parallelize many kinds of computation is to replace a function g occurring in a functional operation by Composition[Queue,g]. This new operation will cause all instances of calls of the function g to be queued for parallel evaluation. The result of this composition is a process id (pid) that will appear inside the structure constructed with the outer computation where g occurred. To put back the results of the computation of g, wrap your whole expression in Wait. Wait will replace any pid inside its expression by the result returned by the corresponding process.

Parallel Map

A parallel version of Map is easy to develop with this paradigm. The sequential Map wraps a function f around all elements in a list.

[Graphics:../Images/MSRI98_gr_192.gif]
[Graphics:../Images/MSRI98_gr_193.gif]

Simply use Composition[Queue,f] instead of f to schedule all mappings for parallel execution. The result is a list of pids.

[Graphics:../Images/MSRI98_gr_194.gif]
[Graphics:../Images/MSRI98_gr_195.gif]

Finally, simply wait for the processes to finish. Every pid will be replaced with the result of its associated process.

[Graphics:../Images/MSRI98_gr_196.gif]
[Graphics:../Images/MSRI98_gr_197.gif]
A Parallel Generalized Inner Product

Say, you want a generalized inner product,

[Graphics:../Images/MSRI98_gr_198.gif]
[Graphics:../Images/MSRI98_gr_199.gif]

where d is the common last dimension of a and first dimension of b.
(Think of p as plus and t as times.)

[Graphics:../Images/MSRI98_gr_200.gif]
[Graphics:../Images/MSRI98_gr_201.gif]

You can use Composition[Queue,p] in place of p to cause all calculcations of p[...] to be queued for concurrent execution.
The result is a tensor of process IDs.

[Graphics:../Images/MSRI98_gr_202.gif]
[Graphics:../Images/MSRI98_gr_203.gif]

Now, simply wait for all processes in this expression.

[Graphics:../Images/MSRI98_gr_204.gif]
[Graphics:../Images/MSRI98_gr_205.gif]

This scheduling is not optimal in terms of communication. One better possibility is to parallelize only along the dimensions of a and send the tensor b once to each processor before scheduling the jobs.

Parallel Tables and Sums

This code generates a 1010 matrix of random numbers, where each row is computed in parallel.

[Graphics:../Images/MSRI98_gr_206.gif]
[Graphics:../Images/MSRI98_gr_207.gif]

Here is a sum whose elements are computed in parallel. Each element of the sum is a numerical integration.

[Graphics:../Images/MSRI98_gr_208.gif]
[Graphics:../Images/MSRI98_gr_209.gif]

Here is the corresponding table of parallel exact results. The value of [Graphics:../Images/MSRI98_gr_210.gif] is [Graphics:../Images/MSRI98_gr_211.gif].

[Graphics:../Images/MSRI98_gr_212.gif]
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8

Explicit Scheduling: Gaxpy

Taken verbatim from Golub&van Loan, Matrix Computations, [Graphics:../Images/MSRI98_gr_213.gif]ed., The Johns Hopkins Univ. Press, 1996.

Y=Y+A.X, A an nn matrix, X, Y, n-vectors

[Graphics:../Images/MSRI98_gr_214.gif]

Sample (symbolic) matrices and vectors.

[Graphics:../Images/MSRI98_gr_215.gif]
[Graphics:../Images/MSRI98_gr_216.gif]
[Graphics:../Images/MSRI98_gr_217.gif]

Scheduling. Find a number of processors that is a divisor of n.

[Graphics:../Images/MSRI98_gr_218.gif]
[Graphics:../Images/MSRI98_gr_219.gif]

The workload (number of rows) for each processor.

[Graphics:../Images/MSRI98_gr_220.gif]
[Graphics:../Images/MSRI98_gr_221.gif]

The client code

[Graphics:../Images/MSRI98_gr_222.gif]

Export the code (not the data!).

[Graphics:../Images/MSRI98_gr_223.gif]

Start all remote processes.

[Graphics:../Images/MSRI98_gr_224.gif]
[Graphics:../Images/MSRI98_gr_225.gif]

Document converted by Mathematica of Wolfram Research