- First three sections: text and figures by Jason Cooke
- Last three sections: original figures by Kevin Wright; text and revised figures by Dr. Steven R. Lantz

The Fortran code used for the modeling was originally written by David Bryan, a participant in SPUR '94, under the guidance of Dr. Lantz. To take a look at his project, click here.

By running simulations of different models, researchers can attempt to match their models to reality. Simulations also show how changes in parameters affect the behavior of the model. However, the more complicated the model becomes, the longer a program that simulates the model will take to run. Also, more complex programs tend to require more computer memory. These are both roadblocks to testing complex models.

Parallel computing offers an interesting solution to both problems posed above. By distributing the work load to many processors, the amount of time spent by each processor computing decreases. If the computer is a distributed memory machine, like the IBM SP2 at the Cornell Theory Center, the amount of data stored in each node can be reduced.

Because parallel machines open the possibility of running simulations that are too large or too time-consuming for a single computer, we attempted to write a parallel program that would serve as a modifiable basis for running different models of the solar convection zone on a parallel machine. To test our code, we used a basic model that did not harness the full power of running a job in parallel.

Below is the section of the convection zone we are considering, with the rest of the sun removed. We further approximate the fluid motion as two-dimensional; that is, no fluid moves parallel to the axis of the cylinder. This is a good approximation if the cylinder is rotating quickly enough about its axis (Busse and Or, JFM, 1986). Thus, the entire cylinder can be represented by one cross section.

Another assumption is that all fields will be periodic along the disk. Then, the full disk can be represented by a section containing a single period. This section can then be approximated by a rectangle. It is on this rectangle that all the calculations are done in our simulation.

The Navier-Stokes equation tells how a piece of fluid moves when there is a force acting on it; this is the basic equation of fluid flow.

The heat equation describes how heat flows into or out of a moving piece of fluid. Notice that this equation also involves the velocity.

The continuity equation says that the amount of fluid coming into a point is equal to the amount of fluid leaving that point.

The equation of state relates the density of a piece of fluid to its temperature. Notice that since the Navier-Stokes equation depends on the density, by this equation it also depends on the temperature.
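The equations themselves appeared as inline figures in the original page and are missing from this copy. For reference, a standard Boussinesq-approximation form of the four equations (the notation here is an assumption, not necessarily the one used in the project) is:

```latex
% Navier-Stokes (momentum) equation
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho_0}\nabla p + \nu\,\nabla^2\mathbf{u} + \frac{\rho}{\rho_0}\,\mathbf{g}
% Heat equation
\frac{\partial T}{\partial t} + (\mathbf{u}\cdot\nabla)T = \kappa\,\nabla^2 T
% Continuity equation (incompressible flow)
\nabla\cdot\mathbf{u} = 0
% Equation of state (linearized about a reference state)
\rho = \rho_0\,\bigl[\,1 - \alpha\,(T - T_0)\,\bigr]
```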

Since our equations are differential equations, we need to specify boundary conditions at the edges of the domain in order to solve them.

For the left and right edges of our rectangle, we require that the temperature and velocity fields must match on both sides. This is because the rectangle is a periodic section of the disk.

For the top and bottom edges, we require that the velocity of the fluid is parallel to the boundary. This keeps the fluid from flowing out of the top or bottom of the rectangle. For the temperature, we keep it fixed at the top and bottom corresponding to given outside and inside temperatures.

The temporal components of our problem are also discrete. We have the equations set up to compute what the fields are at given time step intervals. As with the spatial approximation, variations in the fields that occur in a shorter time period than the time step are not observed. Therefore, the time step size needs to be selected carefully, as with the control area.

Because our fields are discrete, taking derivatives poses a problem. Recall from calculus that the derivative is defined by taking a limit:

However, on a discrete grid there are no points arbitrarily close to each grid point, so the actual derivative cannot be calculated. Approximations to the actual derivatives can be computed using Taylor's Theorem.

By applying Taylor's Theorem, this formula can be shown to be second-order accurate in the grid spacing. The second derivative is found by taking the derivative of the derivative. While asymmetric finite difference approximations are used for both derivatives, the result is symmetric.
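As a concrete sketch of these approximations (in Python with NumPy, rather than the project's Fortran), the second-order central differences on a uniform grid with spacing h are:

```python
import numpy as np

def first_derivative(f, h):
    """Central difference (f[i+1] - f[i-1]) / (2h): second-order accurate."""
    return (f[2:] - f[:-2]) / (2.0 * h)

def second_derivative(f, h):
    """(f[i+1] - 2 f[i] + f[i-1]) / h^2: the symmetric formula obtained by
    composing a forward and a backward first difference, as noted above."""
    return (f[2:] - 2.0 * f[1:-1] + f[:-2]) / h**2
```

Both formulas are exact for quadratics, consistent with their second-order accuracy.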

Because the Navier-Stokes and heat equations give the explicit time derivatives of the fields, provided the values of the fields at the current time are known, this integration method can be used to advance the fields through time. The asymmetric derivative approximation used above (inside the second derivative) can be rearranged to give an approximation for integrating the fields forward in time:

The above is a Runge-Kutta method, accurate only to first order by Taylor's Theorem. More accuracy is available from a higher-order method, at the cost of more computation per step. By experimentation, we found that a third-order Runge-Kutta method was stable and accurate enough for our problem. Below are the general equations for a third-order Runge-Kutta method:

In addition, the following relations between coefficients must be satisfied for the method to be third-order accurate:

Notice that there are 8 variables and only 6 equations, which leaves a two-parameter family of solutions to these equations.

Using a method that needs fewer working arrays could help to optimize our program for memory usage. Dr. Lantz proposed a method that uses only the field array plus a single working array. The structure of an algorithm that allows this is:

The extra equations that have to be satisfied for this to work, besides the 6 requirements mentioned previously, are the following:

Using Maple, we solved the set of equations for the m's, a's, b's and c's after applying some extra conditions. Our extra conditions were to make as many coefficients as possible in the above Runge-Kutta algorithm equal to zero or one. When zero, no computation is required with the corresponding term. When one, no floating point multiplication is required. Each case could lead to increased program performance.

It turned out that some combinations gave valid solutions, while others did not. The results from solving the equations are given here.

But because of the various partial derivatives that must be computed, there is coupling between points on the grid. In most of the equations, derivatives are computed by finite difference formulas that depend only on neighboring points on the grid. So it seems that neighbors should be kept together as much as possible when assigning grid points to processors. However, there is also an elliptic equation (a Poisson equation) to be solved. When discretized, the Poisson equation becomes a large matrix equation, coupling together all the points along a row or column of the grid. This complicates the choices for how to divide up the grid among processors. We decided to cut our domain into horizontal strips, so that all data coupled horizontally would be present together on the same processor.

We set out to create a data-parallel program, in which processors are doing computations on different parts of the domain. So the first issue to address is: how should distinct strips of data be sent out to the different processors? It is very convenient to do this in MPI, because such strip initializations can be done with a single call to MPI_Scatter at the start of the job (figure). The inverse of this operation, MPI_Gather, is similarly useful for collecting data back to a single node. In our program, the gather operation occurs every time an output file is to be written; such output events occur at the end of a run and at regular intervals during the run.
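In serial Python (a stand-in to show the data movement, not an MPI program), the scatter and gather of horizontal strips amount to:

```python
import numpy as np

def scatter_strips(grid, nprocs):
    """Serial stand-in for MPI_Scatter: split the global grid into
    horizontal strips, one per processor (rows assumed to divide evenly)."""
    return np.split(grid, nprocs, axis=0)

def gather_strips(strips):
    """Serial stand-in for MPI_Gather: reassemble the global grid,
    as done whenever an output file is to be written."""
    return np.vstack(strips)
```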

Communication between processors must also occur at very frequent intervals during a run, in order to advance the code in time. There are two types of operations requiring communication. First, for computing derivatives in the time-evolution equations, data from neighboring points is needed; and it sometimes happens that these neighboring points reside on another processor, necessitating communication. Second, a communication pattern involving all the processors together is needed for solving the previously-mentioned global, elliptic equation. Parallelizing the elliptic solver is the tougher of the problems, so we will discuss it in detail in the next section. For the moment, we'll look only at the nearest-neighbor communications.

MPI provides a natural way for neighboring processors to communicate with each other, called MPI_Sendrecv. The sendrecv operation sends the data from a buffer in the local processor across to the designated processor, while at the same time receiving data from that processor (or another one) into a different buffer. With MPI_Sendrecv, data can be swapped between processors and received into guard cells, as we see in this figure:

The operation needs to be done for each of the computational arrays. In
practice, it is more efficient to pack a special buffer with all the
necessary data from different arrays (temperature, vorticity,
streamfunction) and do just one sendrecv operation (rather than two) in
order to cut down on *latency* overhead--where latency is defined to be
the cost to initiate a new message.
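A serial sketch of the guard-cell swap (plain array copies standing in for MPI_Sendrecv; periodic neighbors are assumed here for simplicity, while the physical-boundary special case is treated next):

```python
import numpy as np

def exchange_guard_cells(strips):
    """Serial stand-in for the MPI_Sendrecv guard-cell swap.
    Each strip carries one guard row above and one below its data rows;
    after the exchange, the guards hold copies of the neighboring strips'
    edge data rows (periodic neighbors assumed)."""
    n = len(strips)
    for p, s in enumerate(strips):
        s[0, :] = strips[(p - 1) % n][-2, :]   # top guard <- neighbor's last data row
        s[-1, :] = strips[(p + 1) % n][1, :]   # bottom guard <- neighbor's first data row
```

In the real code the buffers for temperature, vorticity, and streamfunction would be packed together so that a single sendrecv pays the latency cost only once.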

A special case arises when the strip has a physical boundary along one edge,
because no communication is required. Instead, guard cell values are
computed in the following way: imagine the physical boundary of the system
to lie just beyond the last row of grid points. In particular, fix the
location of the physical boundary to be exactly between the final row of
data points and the row of guard cells next to it. Then, each guard cell
value is defined such that the value *halfway in between* interpolates
to the proper, physical boundary value. (The boundary condition on the
temperature perturbation, *e.g.*, is such that its value must be zero
on the physical boundary.) The interpolating polynomial used for this
procedure is second-order, so values from the last two rows of data, as well
as the value at the physical boundary, are involved.
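Working out the interpolation just described (boundary midway between the last data row u1 and the guard cell, with u2 the next data row in) gives one explicit formula; a sketch:

```python
def guard_value(u1, u2, u_boundary):
    """Guard-cell value at a physical boundary.

    The boundary lies halfway between the last data row (u1) and the
    guard row; u2 is the second-to-last data row.  The guard value g is
    chosen so that the quadratic through u2, u1, and g interpolates to
    u_boundary at that halfway point:
        u_b = -u2/8 + 3*u1/4 + 3*g/8   =>   g = u2/3 - 2*u1 + 8*u_b/3
    """
    return u2 / 3.0 - 2.0 * u1 + 8.0 * u_boundary / 3.0
```

For the temperature perturbation, whose boundary value is zero, this reduces to g = u2/3 - 2*u1.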

In our model, the time evolution of the 2D velocity field is not computed in
the direct way from the Navier-Stokes equation. Rather, we manipulate our
fluid equations mathematically so that we can deal instead with a quantity
called the *vorticity* (*omega*), defined as the vector curl of
the velocity. In 2D, the vorticity becomes a scalar, and it is therefore a
more convenient quantity for computations than the two-component velocity
vector.

An evolution equation for the vorticity can be derived by taking the curl of
the Navier-Stokes equation. But the velocity is still needed in all the
equations, so it somehow must be determined from the vorticity. This is
done by exploiting the fact that velocity is divergenceless in the
Boussinesq approximation. This implies that velocity can be written as the
curl of yet another field, called the *streamfunction* (*phi*).
One can show that the vorticity and the streamfunction are related by

which is the Poisson equation. It is an elliptic partial differential
equation to be solved for *phi*, given *omega*. This *phi*
in turn determines the velocity: it is computed simply by taking derivatives
(or finite differences) of the streamfunction, once it becomes known.
Altogether, the computational procedure is as follows: first, the vorticity
is explicitly advanced in time; then, the Poisson equation for *phi* is
solved; then, velocities for the next time step are computed.

The first step is to Fourier-transform the above equation in the *x*
dimension. This is a natural enough thing to do, given the periodic
boundary conditions that apply in this direction. The transformed equation
is

where derivatives in *x* are simply replaced by multiples of *ik*.
This equation can be discretized very readily. It is already discrete in
*k* by virtue of the periodic boundary conditions, and the finite mesh
spacing has determined the maximum *k*. In the z-direction, the
remaining second derivative can be discretized using its finite-difference
approximation. The overall result is

which can be rewritten as


Notice that the equations are completely independent in *k* but are
coupled in the index *i*. This results in a set of independent
matrix problems to be solved, one for each *k*:

The corners of the matrix have been neglected here; in general they contain
special values that depend on the boundary conditions. The above matrix is
a special, sparse type of matrix called *tridiagonal*, for which very
fast algorithms exist (order N operations, to solve an N by N matrix).
Unfortunately, these fast algorithms are inherently sequential!
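The order-N method in question is presumably the standard Thomas algorithm (Gaussian elimination specialized to tridiagonal systems); a sketch, with the inherently sequential sweeps visible as loops:

```python
import numpy as np

def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system in O(N) operations.
    lower[i] multiplies x[i-1] (lower[0] unused); upper[i] multiplies
    x[i+1] (upper[-1] unused).  The forward elimination and the back
    substitution are both sequential loops, which is why this algorithm
    does not parallelize by itself."""
    n = len(diag)
    d = diag.astype(float)
    b = rhs.astype(float)
    for i in range(1, n):               # forward elimination
        w = lower[i] / d[i - 1]
        d[i] = d[i] - w * upper[i - 1]
        b[i] = b[i] - w * b[i - 1]
    x = np.empty(n)
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):      # back substitution
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x
```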

However, two facts make a parallel solution possible:

- we have a large number of these matrices to solve all at once (one for each *k*); and
- the problems are all completely independent of one another.

1. Take Fast Fourier Transforms (FFTs) of the vorticity in the x-direction. (Remember, the data all start out in horizontal strips.)
2. Transpose the transformed vorticity data across processors so that each processor now has all the *i*'s for just a subset of the *k*'s.
3. Solve the independent tridiagonal matrix equations, one for each *k*, by using the best available serial algorithm locally.
4. Transpose the streamfunction data back across processors, so that each processor has all the *k*'s for just a subset of the *i*'s (same as initially).
5. Take inverse FFTs to determine the streamfunction.
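A serial sketch of the transform-and-solve pipeline (the FFT and solve stages; the transposes disappear because all data are local here). The Dirichlet condition phi = 0 at top and bottom, and a dense solve standing in for the tridiagonal one, are simplifying assumptions:

```python
import numpy as np

def solve_poisson(omega, hx, hz):
    """Solve the discrete Poisson equation for phi, given omega, on a grid
    that is periodic in x with phi = 0 assumed at the top and bottom in z.
    omega has shape (nz, nx), interior points only."""
    nz, nx = omega.shape
    w_hat = np.fft.fft(omega, axis=1)           # step 1: FFT along x
    m = np.arange(nx)
    # eigenvalues of the periodic second-difference operator in x
    lam = (2.0 * np.cos(2.0 * np.pi * m / nx) - 2.0) / hx**2
    phi_hat = np.empty_like(w_hat)
    for j in range(nx):                         # step 2: one system per mode k
        # tridiagonal operator (d2/dz2 + lam) assembled densely for brevity;
        # a Thomas-type solver would be used in practice
        A = (np.diag(np.full(nz, -2.0 / hz**2 + lam[j]))
             + np.diag(np.full(nz - 1, 1.0 / hz**2), 1)
             + np.diag(np.full(nz - 1, 1.0 / hz**2), -1))
        phi_hat[:, j] = np.linalg.solve(A, w_hat[:, j])
    return np.fft.ifft(phi_hat, axis=1).real    # step 3: inverse FFT
```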

It seemed at first that the specified data movements could be done very simply by using the MPI_Alltoall subroutine. In MPI_Alltoall, each processor sends a message to, and receives a message from, all the other processors. In our case, this message should simply consist of a submatrix of the original matrix. But it turns out that there are additional, local transposes that need to be done on the submatrices in order to get the data into the correct places, both before and after the MPI_Alltoall operation. This is because MPI_Alltoall does not transpose any of the data within a submatrix; it merely transposes submatrices intact. Furthermore, MPI_Alltoall only works when each submatrix has its elements stored in contiguous memory locations. So, in place of the simple call to MPI_Alltoall (in steps 2 and 4 above), the following three-step procedure has to be performed:

1. Reorder the data so that the local column vector of submatrices becomes a row vector (assuming the data are ordered in memory in the standard Fortran or "column major" style). This operation can be called an "s-transpose", an abbreviation for either "submatrix transpose" or "stick transpose", because sticks of data are being moved around in memory.
2. Using MPI_Alltoall, do a global s-transpose of the submatrices across processors. Each processor now has all the *i*'s for just a subset of the *k*'s. But the data are still ordered so that the *k*'s are counted first.
3. Perform a normal matrix transpose on the entire matrix defined by the row vector of submatrices (*i.e.*, on the whole local piece of the distributed matrix), in order to have the *i*'s counted first.
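The equivalence of this three-step procedure to one global matrix transpose can be checked in a small serial model, with ordinary array slicing standing in for MPI_Alltoall (square submatrices assumed):

```python
import numpy as np

def distributed_transpose(strips):
    """Serial model of the distributed transpose.  strips[p] is the
    horizontal strip of a global N x N matrix held by processor p
    (r rows by N columns, with N = P * r)."""
    P = len(strips)
    r = strips[0].shape[0]
    # step 1: cut each strip into P contiguous r x r submatrices; this
    # stands in for the local "s-transpose" that makes each submatrix
    # contiguous in memory before MPI_Alltoall
    blocks = [[strips[p][:, q * r:(q + 1) * r].copy() for q in range(P)]
              for p in range(P)]
    # step 2: the MPI_Alltoall exchange -- processor p's q-th block is
    # delivered intact (untransposed) to processor q
    recv = [[blocks[p][q] for p in range(P)] for q in range(P)]
    # step 3: stack the received blocks and take a normal local transpose
    return [np.vstack(recv[q]).T for q in range(P)]
```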

The global matrix transpose scheme can best be explained by a picture. (Dr. Lantz has also created a color version.) Note that the transpose procedure works equally well if the order of the steps is reversed.

Overall, we succeeded in meeting the main goals of this project, which were:

- To parallelize Dr. Lantz's code for simulating thermal convection in a rotating frame of reference, and
- To improve its serial performance through more efficient Runge-Kutta time stepping.

- Trials of several tableaux from the new family of Runge-Kutta schemes showed that they all ran stably and well, proving to be about 10% faster than the original code.
- A working parallel version of the code was assembled, and tests with it were successful, in that they generated physically correct results. This was proved by setting the code's parameters to correspond to a regime where solutions to the fluid equations are expected to be stable thermal Rossby waves (Busse and Or, JFM, 1986). A plot of angular frequency vs. the rotation parameter (*eta*) shows excellent agreement between the results of our simulations and the theoretically predicted values for the phase speed of this type of traveling wave.
- Unfortunately, a plot of execution time vs. number of processors shows that the parallel version of the code did not produce a speedup for any number of processors. In other words, the original, single-node version always ran faster than the parallel version.
