Milestone 8
Parallelization
Learning goals
The student will…
- …learn how to work with the message passing interface (MPI).
- …be able to use domain decomposition as a parallelization strategy.
- …learn how to run parallel simulations on supercomputers.
Introduction
The system sizes that are accessible with our code are limited by the computational power of a single CPU. Parallelization allows us to use multiple CPUs for the computing task at hand. In this milestone, we will parallelize the code using a strategy called domain decomposition: the physical domain of our simulation is decomposed into subdomains that each “live” on a single CPU. More specifically, we will use a Cartesian decomposition of the overall domain. Since both potentials encountered so far (Lennard-Jones and EAM) are short-ranged, meaning the interaction only extends to a certain distance, each subdomain needs to “talk” only to its respective neighbors. This leads to a communication pattern that scales extremely well, easily beyond \(100\,000\) CPUs.
MPI prerequisites
Starting with this milestone, we will make use of the Message Passing Interface (MPI). For this, we need to set up the build system and we need to initialize our application correctly.
Build system
The skeleton repository that we provided for you needs to be augmented with the configuration for MPI. The toplevel meson.build needs the following statement:
mpi = dependency('mpi', language: 'cpp')
You then need to add mpi as a dependency to your build targets using the optional dependencies argument. This enables MPI for your project: from now on, all your code is compiled with the MPI compiler wrapper mpicc, mpicxx or mpic++, which takes care of linking your program to the correct MPI libraries. You can still run your program (milestone08) in the normal fashion (as you have done in the past milestones); this is equivalent to running on a single process via mpirun -n 1 ./milestone08.
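As an illustration, a build target with MPI enabled could look as follows. This is only a sketch: the target name, source file and the eigen dependency are assumptions here, so adapt them to your actual meson.build.

mpi = dependency('mpi', language: 'cpp')
eigen = dependency('eigen3')  # assumption: your project already uses Eigen

executable('milestone08',
           'milestone08/main.cpp',
           dependencies: [mpi, eigen])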
Running MPI codes
For running your code in parallel, you need to start the executable through mpirun. For example, to run on 4 processes, execute mpirun -n 4 ./milestone08 from the command line. Note that if you have fewer than 4 CPUs in your computer, mpirun may not allow you to run your code on 4 processes, although doing so may still be useful for testing purposes (you won’t see any speedup). You can force mpirun to execute your command by specifying the --oversubscribe option: mpirun -n 4 --oversubscribe ./milestone08
Preparing your code for MPI
Before you can execute any MPI commands in your own code, you need to initialize MPI. Your main function needs to have the following structure:
#include "mpi.h"
int main(int argc, char *argv[]) {
MPI_Init(&argc, &argv);
...
MPI_Finalize();
}
MPI communication can be used between MPI_Init and MPI_Finalize.
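As a quick sanity check (not part of the milestone code itself), a minimal program that reports its rank and the total number of processes could look like this:

#include <iostream>
#include <mpi.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // index of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes
    std::cout << "Hello from rank " << rank << " of " << size << std::endl;
    MPI_Finalize();
    return 0;
}

Running this through mpirun -n 4 should print four lines, one per process.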
MPI support module
MPI does not officially have C++ bindings. The C bindings are described, for example, here. You will have to use the C bindings for your code, but we provide a simple C++ support module that eases usage of the C library. One particular advantage is that we can overload functions in C++, providing functionality for different data types, while in C you have to specify the data type explicitly. The C++ support module can be downloaded here: mpi_support.h
As an example, mpi_support.h contains a wrapper for the MPI_Allreduce function that we will use to compute sums across all MPI processes, for example to sum up the energy contributions. A bare C implementation for summing the kinetic energy could look like this:
double ekin_total = 0; // sum of total kinetic energy
double ekin_local; // kinetic energy on current process
ekin_local = kinetic_energy(...);
MPI_Allreduce(&ekin_local, &ekin_total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
// ekin_total now contains the sum
The wrappers in the support module deduce the type automatically, allowing you to write
double ekin_total{MPI::allreduce(kinetic_energy(...), MPI_SUM, MPI_COMM_WORLD)};
// ekin_total now contains the sum
This leads to more compact (and more readable) code. It is also less error-prone, as the C++ wrapper makes sure the right MPI type (MPI_DOUBLE, MPI_INT, etc.) is passed to the underlying communication function.
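To see how such a wrapper can work, here is a minimal sketch of how the type deduction could be implemented via overloading; the actual mpi_support.h that we provide may be structured differently.

#include <mpi.h>

// Map C++ types to MPI datatypes via overload resolution.
inline MPI_Datatype mpi_type(double) { return MPI_DOUBLE; }
inline MPI_Datatype mpi_type(int) { return MPI_INT; }

namespace MPI {
    // Reduce a single value across all processes and return the result.
    template <typename T>
    T allreduce(T local_value, MPI_Op op, MPI_Comm comm) {
        T global_value;
        MPI_Allreduce(&local_value, &global_value, 1, mpi_type(local_value), op, comm);
        return global_value;
    }
}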
The header mpi_support.h also contains helper functions for serialization (or “packing”) of buffers for subsequent send operations. Inspect the file mpi_support.h and try to understand what it does.
Domain decomposition
We now parallelize the molecular dynamics code using domain decomposition. We provide a domain decomposition module for you that takes care of communicating atoms and ghost atoms. The module implements the Domain class and can be downloaded here: domain.h, domain.cpp. Inspect these files and try to understand how they work.
So far, our simulation code is agnostic of the spatial domain we are simulating. The gold cluster of Milestone 07 can be located anywhere in space. For the purpose of domain decomposition, we need to specify a bounding box (the simulation domain) in which our atoms live. When atoms leave this bounding box, they are either discarded or wrapped periodically back to the opposite end of this box. The Domain class is a representation of this bounding box or simulation domain.
The Domain class can also subdivide the global domain into an integer number of subdomains. The constructor takes four arguments: the MPI communicator, the lengths of the domain in the three Cartesian directions, the number of subdomains in these directions, and whether these directions should be treated periodically. For example,
Domain domain(MPI_COMM_WORLD, {3.5, 4.5, 5.5}, {1, 1, 4}, {0, 0, 1});
instantiates a domain of size \(3.5~Å\times 4.5~Å\times 5.5~Å\). The origin is located at position \((0,0,0)\). This domain is subdivided into \(4\) subdomains in the \(z\)-direction. (The code hence needs to be run with \(4\) MPI processes.) Only the \(z\)-direction is periodic. Note that the Domain class is useful even for serial calculations to model periodic domains. In this case, the Domain class simply replicates atoms from the simulation domain as ghost atoms.
The Domain class knows two states of the simulation. First, a replicated state in which each MPI process contains all atoms, i.e. the respective Atoms objects are simply replicated across the MPI processes. Second, a decomposed state in which each MPI process only contains the atoms within its subdomain (and possibly ghost atoms). After instantiation, the Domain object is in the replicated state. File I/O needs to happen in the replicated state. Files need to be read on all processes but written only on a single process.
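For writing output on a single process only, you can query the rank of the current process. A minimal sketch (the write_xyz call and the traj file handle are placeholders for your own I/O code) could look like this:

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    // write a trajectory frame on process 0 only (placeholder call)
    write_xyz(traj, atoms);
}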
The Domain class requires the Atoms class to have a resize method that resizes all arrays (positions, velocities, etc.) without losing existing data. You will need to implement this method for your Atoms class.
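A minimal sketch of such a resize method is shown below. It assumes that your Atoms class stores its per-atom data in Eigen arrays named positions, velocities and forces; adapt the member names and arrays to your own class.

#include <Eigen/Dense>

struct Atoms {
    Eigen::Array3Xd positions, velocities, forces;

    void resize(Eigen::Index n) {
        // conservativeResize keeps existing entries; newly added columns
        // are left uninitialized and must be filled by the caller.
        positions.conservativeResize(Eigen::NoChange, n);
        velocities.conservativeResize(Eigen::NoChange, n);
        forces.conservativeResize(Eigen::NoChange, n);
    }
};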
Enabling/disabling domain decomposition
The enable method of the Domain class switches into the decomposed state. Before starting the integrator loop, you would execute
domain.enable(atoms);
to switch to the decomposed state. You can switch back to the replicated state by issuing
domain.disable(atoms);
In an integrator loop, you need to disable domain decomposition before writing snapshots to trajectory files using write_xyz.
Moving atoms between subdomains
An atom may leave the current subdomain during an integration step. A call to
domain.exchange_atoms(atoms);
moves these atoms to the respective neighboring subdomains.
Populating ghost atoms
Before computing forces, you need to populate the Atoms object with ghost atoms from the neighboring subdomains. Use
domain.update_ghosts(atoms, border_width);
to do this. The parameter border_width specifies up to what distance from the subdomain boundary ghost atoms are required. Note that for an EAM potential, the border_width needs to be twice the cutoff of the potential. Also note that exchanging atoms, or disabling and subsequently re-enabling the domain decomposition, removes all ghost atoms.
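Putting the above calls together, the parallel part of a velocity-Verlet loop could be organized roughly as follows. This is only a sketch: names such as verlet_step1, verlet_step2, cutoff, timestep and nb_steps stand for your own routines and variables.

domain.enable(atoms);  // switch to the decomposed state
for (int step = 0; step < nb_steps; ++step) {
    verlet_step1(atoms, timestep);            // update positions, half-step velocities
    domain.exchange_atoms(atoms);             // move atoms that left the subdomain
    domain.update_ghosts(atoms, 2 * cutoff);  // EAM needs ghosts up to twice the cutoff
    // ... recompute forces (local atoms plus ghosts) ...
    verlet_step2(atoms, timestep);            // second half-step of the velocities
}
domain.disable(atoms);  // back to the replicated state for final I/O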
Back to the gold cluster
Incorporate domain decomposition into your MD code and run the gold clusters from Milestone 07 on \(1\), \(2\), \(4\) and \(8\) MPI processes. Show that a pure microcanonical (NVE) run conserves energy.
When modifying your code you need to think about how to compute the potential energy. The implementation of the EAM potential is entirely unaware of the decomposed state of your calculation; it will simply return the potential energy for your subdomain including the ghost atoms. You will need to compute the potential energy for just the atoms in the subdomain, excluding the ghosts. A way to achieve this is to introduce a per-atom potential energy and sum that energy only over the atoms local to the current subdomain. The global potential (and kinetic) energy then needs to be summed over all MPI processes using a reduce operation.
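With the MPI::allreduce wrapper from mpi_support.h, summing the per-process contributions could then look like the following sketch, where local_epot and local_ekin are assumed to hold the energies of the atoms local to the current subdomain (excluding ghosts):

// Sum the local contributions over all MPI processes.
double epot{MPI::allreduce(local_epot, MPI_SUM, MPI_COMM_WORLD)};
double ekin{MPI::allreduce(local_ekin, MPI_SUM, MPI_COMM_WORLD)};
double etot{epot + ekin};  // total energy that should be conserved in an NVE run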
Task Summary
This milestone requires the following tasks:
- Parallelize the code using the domain-decomposition module provided by us
- Optional: Develop your own domain-decomposition module
- Carry out first parallel runs on the gold clusters from Milestone 07
We ask you to provide and discuss the following figures in your final report:
- Plot of the total energy versus time to demonstrate energy conservation
We provide the following files for you: