More than parallel For loops.
A general way to parallelize many kinds of computation is to replace a function g occurring in a functional operation by Composition[Queue,g]. This new operation will cause all instances of calls of the function g to be queued for parallel evaluation. The result of this composition is a process id (pid) that will appear inside the structure constructed with the outer computation where g occurred. To put back the results of the computation of g, wrap your whole expression in Wait. Wait will replace any pid inside its expression by the result returned by the corresponding process.
A parallel version of Map is easy to develop with this paradigm. The sequential Map wraps a function f around all elements in a list.
![[Graphics:../Images/MSRI98_gr_192.gif]](../Images/MSRI98_gr_192.gif)
Simply use Composition[Queue,f] instead of f to schedule all mappings for parallel execution. The result is a list of pids.
![[Graphics:../Images/MSRI98_gr_194.gif]](../Images/MSRI98_gr_194.gif)
Finally, simply wait for the processes to finish. Every pid will be replaced with the result of its associated process.
![[Graphics:../Images/MSRI98_gr_196.gif]](../Images/MSRI98_gr_196.gif)
Say, you want a generalized inner product,
![[Graphics:../Images/MSRI98_gr_198.gif]](../Images/MSRI98_gr_198.gif)
where d is the common last dimension of a and first dimension of b.
(Think of p as plus and t as times.)
![[Graphics:../Images/MSRI98_gr_200.gif]](../Images/MSRI98_gr_200.gif)
You can use Composition[Queue,p] in place of p to cause all calculcations of p[...] to be queued for concurrent execution.
The result is a tensor of process IDs.
![[Graphics:../Images/MSRI98_gr_202.gif]](../Images/MSRI98_gr_202.gif)
Now, simply wait for all processes in this expression.
![[Graphics:../Images/MSRI98_gr_204.gif]](../Images/MSRI98_gr_204.gif)
This scheduling is not optimal in terms of communication. One better possibility is to parallelize only along the dimensions of a and send the tensor b once to each processor before scheduling the jobs.
This code generates a 1010 matrix of random numbers, where each row is computed in parallel.
![[Graphics:../Images/MSRI98_gr_206.gif]](../Images/MSRI98_gr_206.gif)
Here is a sum whose elements are computed in parallel. Each element of the sum is a numerical integration.
![[Graphics:../Images/MSRI98_gr_208.gif]](../Images/MSRI98_gr_208.gif)
Here is the corresponding table of parallel exact results. The value of
is
.
![[Graphics:../Images/MSRI98_gr_212.gif]](../Images/MSRI98_gr_212.gif)
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
Taken verbatim from Golub&van Loan, Matrix Computations,
ed., The Johns Hopkins Univ. Press, 1996.
Y=Y+A.X, A an nn matrix, X, Y, n-vectors
![[Graphics:../Images/MSRI98_gr_214.gif]](../Images/MSRI98_gr_214.gif)
Sample (symbolic) matrices and vectors.
![[Graphics:../Images/MSRI98_gr_215.gif]](../Images/MSRI98_gr_215.gif)
![[Graphics:../Images/MSRI98_gr_216.gif]](../Images/MSRI98_gr_216.gif)
![[Graphics:../Images/MSRI98_gr_217.gif]](../Images/MSRI98_gr_217.gif)
Scheduling. Find a number of processors that is a divisor of n.
![[Graphics:../Images/MSRI98_gr_218.gif]](../Images/MSRI98_gr_218.gif)
The workload (number of rows) for each processor.
![[Graphics:../Images/MSRI98_gr_220.gif]](../Images/MSRI98_gr_220.gif)
The client code
![[Graphics:../Images/MSRI98_gr_222.gif]](../Images/MSRI98_gr_222.gif)
Export the code (not the data!).
![[Graphics:../Images/MSRI98_gr_223.gif]](../Images/MSRI98_gr_223.gif)
Start all remote processes.
![[Graphics:../Images/MSRI98_gr_224.gif]](../Images/MSRI98_gr_224.gif)
Document converted by Mathematica of Wolfram Research