How to "survive" OpenMP parallelism

OpenMP parallelism is active by default

OpenMP has now (beginning of 2010) been a widespread reality for a while, and OpenMP 3.0 is currently supported by a number of different compilers. For this reason the decision was made to make the OpenMP parallelization of some critical parts of the code active by default.

The most crucial changes are:

  • The build phase is parallel by default, that is, the ResidualBasedBuilderAndSolver now performs the FE assembly in a parallel manner.
  • The solution of the resulting linear system of equations is also parallel by default.


Understanding the Build Phase

The assembly phase is likely to create most of the problems for inexperienced programmers. The underlying idea of the parallelization of the build phase is in any case fairly simple: in the assembly, centralized in the class ResidualBasedEliminationBuilderAndSolver, a loop over all of the elements (and conditions) is performed, and each element calculates a LocalLHS and LocalRHS contribution. Finally, such local contributions are written to the correct positions in the global system, identified by the vector EquationId.
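
The following is only a conceptual sketch of that loop, not the actual ResidualBasedEliminationBuilderAndSolver code: AssembleLHS and AssembleRHS are schematic placeholders indicating where the local contributions are written to the global matrix A and vector b.

 Matrix LocalLHS;
 Vector LocalRHS;
 Element::EquationIdVectorType EquationId;
 for (ModelPart::ElementsContainerType::iterator it = rModelPart.ElementsBegin();
      it != rModelPart.ElementsEnd(); ++it)
 {
     it->CalculateLocalSystem(LocalLHS, LocalRHS, CurrentProcessInfo); // local contribution of this element
     it->EquationIdVector(EquationId, CurrentProcessInfo);             // its positions in the global system
     AssembleLHS(A, LocalLHS, EquationId);                             // write LocalLHS into the global matrix
     AssembleRHS(b, LocalRHS, EquationId);                             // write LocalRHS into the global vector
 }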

The OpenMP parallelization simply takes the loop over the elements and splits it among the various threads, taking care to ensure that no conflict is created when writing the resulting data to the global system matrix.

This system relies on one crucial assumption: elements and conditions MUST be "thread-safe", meaning that the code is guaranteed to work without conflicts when multiple threads run the same code at the same time.

Unfortunately, such an (admittedly vague) definition of thread safety does not provide any practical help to the novice programmer. In practice, thread safety of Kratos elements and conditions is guaranteed once the following points are met:

  • The element does not make use of any static variables
  • The element does not use global variables
  • The element does not write data to the nodal database nor to the elemental one


Warning. Please note that:

  • if any of these assumptions is not met, the element will not work in parallel

Dealing with static variables

When writing efficient elements, static storage is often important to provide a scratch space shared by all of the elements of the same type. By definition such a space is shared by all of the elements, which means that if two elements are processed at the same time, they will both write to the same storage at once, creating a conflict. OpenMP (3.0 or greater) provides the means to get rid of this problem by making static variables "threadprivate", in other words by guaranteeing that each thread has its own copy of the static storage.
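
As a minimal sketch of this approach (the element and variable names below are hypothetical, and a compiler with working OpenMP 3.0 support is assumed), the static scratch matrix is declared threadprivate inside the element class:

 #include <boost/numeric/ublas/matrix.hpp>

 class MyElement // hypothetical element used only for illustration
 {
     // scratch space shared by all elements of this type...
     static boost::numeric::ublas::matrix<double> msAuxMatrix;
     // ...but each thread now gets its own private copy of it
     #pragma omp threadprivate(msAuxMatrix)
     // ...
 };

 // the static member still has to be defined once in the corresponding .cpp file
 boost::numeric::ublas::matrix<double> MyElement::msAuxMatrix;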

Warning. Please note that:

  • unfortunately, this feature is still buggy in most compilers. Intel 11.1 still has a problem with it, and Intel 10 features OpenMP 2.5 and consequently cannot compile it. As a workaround, global threadprivate variables can be used, as described in the following section.


Using global variables in Kratos

We want to start this section by stating that global variables in the Kratos namespace are explicitly forbidden by the Kratos programming guidelines. Nevertheless, as a workaround for the use of threadprivate static variables, it is possible to define global variables in other namespaces. See "TotalLagrangian.cpp" in "structural_application/custom_elements/" for an example of usage; a simplified sketch of the same pattern follows.
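
The sketch below illustrates the pattern (the namespace and variable names are hypothetical; refer to TotalLagrangian.cpp for the real usage): the scratch matrix is defined in its own namespace, outside the Kratos namespace, and marked threadprivate so that each thread works on its own copy.

 #include <boost/numeric/ublas/matrix.hpp>

 namespace MyElementAuxiliaries // hypothetical namespace holding the scratch space
 {
     boost::numeric::ublas::matrix<double> msB; // global scratch matrix
     #pragma omp threadprivate(msB)             // one private copy per thread
 }

 // inside the element the scratch space is then accessed as MyElementAuxiliaries::msB,
 // exactly as if it were a (threadprivate) static member of the element.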

Warning. Please note that:

  • PERFORMANCE WARNING: using threadprivate variables implies a performance penalty. In some cases it may be more convenient to allocate the scratch space locally rather than defining it as a global threadprivate variable.


Writing to the nodal database

In some cases, the element is used directly to write to the nodal database. An example of this is the calculation of NODAL_MASS. The user should avoid doing this in the build phase. If it is absolutely necessary to perform such an operation, the user should "protect" the writes to the database, guaranteeing that they will be performed correctly in parallel. The two techniques that can be used for this are (see the examples below):

  • using #pragma omp atomic
  • using Kratos Locks.

Example of Atomic
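
A minimal sketch, assuming that NODAL_MASS is a scalar (double) nodal variable and that the element loops over its nodes to add a contribution (NodalMassContribution is a hypothetical value computed by the element). The atomic pragma guarantees that the addition is performed as a single, uninterruptible update, so two threads writing to the same node cannot corrupt the result:

 for (unsigned int i = 0; i < GetGeometry().size(); i++)
 {
     // reference to the value stored in the nodal database
     double& r_nodal_mass = GetGeometry()[i].FastGetSolutionStepValue(NODAL_MASS);

     #pragma omp atomic
     r_nodal_mass += NodalMassContribution; // protected update
 }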

Example of Nodal Locks
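
A minimal sketch assuming that the nodes expose SetLock() and UnSetLock() methods (the Kratos node locks). A lock is needed when the update cannot be expressed as a single atomic operation, for example when writing a vector nodal variable such as REACTION (LocalReaction is a hypothetical per-node contribution computed by the element):

 for (unsigned int i = 0; i < GetGeometry().size(); i++)
 {
     GetGeometry()[i].SetLock();   // only one thread at a time may write to this node
     GetGeometry()[i].FastGetSolutionStepValue(REACTION) += LocalReaction;
     GetGeometry()[i].UnSetLock(); // release the node so other threads can proceed
 }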

Basic OpenMP operations using OpenMPUtils

This section provides a practical example of how to divide an iteration over nodes, elements, conditions or other lists of Kratos objects between threads.

A simple example: parallel loop

This is a very simple example taken from vel_pr_criteria.h. The basic idea is that we want to copy the current values of the VELOCITY and PRESSURE nodal variables (obtained once the last iteration has converged) to the non-historical database, so we can use them as a reference to compute the relative error of the next iteration. To do this in parallel, we need to define a partition, that is, to decide which thread will take care of updating each node. The simplest way to do this is to divide the node list into contiguous blocks and assign a block to each thread. As creating a partition is a very common task, we will want to define this operation in its own function and reuse it in our code:

 inline void DivideInPartitions(
         const int NumTerms,
         const int NumThreads,
         std::vector<int>& Partitions)
 {
     Partitions.resize(NumThreads + 1);
     int PartitionSize = NumTerms / NumThreads;
     Partitions[0] = 0;
     Partitions[NumThreads] = NumTerms;
     for(int i = 1; i < NumThreads; i++)
         Partitions[i] = Partitions[i-1] + PartitionSize;
 }

This function takes as input the number of items in our list and the number of threads, and fills a vector with the position of the first term assigned to each thread. It also places the total number of terms at the end of the list, which will prove useful for iteration purposes.

To use this, we need to know how many threads our program will use and how many items the array we want to iterate over contains. Using OpenMP, we can get this information and generate the partition with this piece of code:

 #include <omp.h> // needed for the omp_* functions used below
 // ...
 int NodeNum = rModelPart.Nodes().size(); // Get number of nodes
 std::vector<int> NodeDivision; // Initialize an empty vector to store the partition
 int NumThreads = omp_get_max_threads(); // Get total number of threads
 DivideInPartitions(NodeNum,NumThreads,NodeDivision); // Generate the partition

Once the partition is created, each thread will iterate over its block of nodes using:

 #pragma omp parallel
 {
    int k = omp_get_thread_num(); // My thread id
    // My block of nodes begins at:
    typename ModelPart::NodesContainerType::iterator NodeBegin = rModelPart.NodesBegin() + NodeDivision[k];
    // And ends before:
    typename ModelPart::NodesContainerType::iterator NodeEnd = rModelPart.NodesBegin() + NodeDivision[k+1];
    // Loop over my nodes
    for(typename ModelPart::NodesContainerType::iterator itNode = NodeBegin; itNode != NodeEnd; itNode++)
    {
        itNode->GetValue(VELOCITY) = itNode->FastGetSolutionStepValue(VELOCITY);
        itNode->GetValue(PRESSURE) = itNode->FastGetSolutionStepValue(PRESSURE);
    }
 }

This kind of approach can be used to convert most loops over Kratos objects to OpenMP, but note that, due to the way the partition is defined and used, this kind of strategy will break if the loop modifies the array of nodes, adding or removing terms.

Using OpenMP enabled code

An important issue to keep in mind is that we want our code to work with and without OpenMP, so the lines in the example above would typically be enclosed in an #ifdef block:

 #ifndef _OPENMP
 // Serial code
 #else
 // Parallel code
 #endif

The preprocessor flag _OPENMP is only defined when we are using OpenMP, which means that the compiler will use the first block of code if OpenMP is not enabled and the second block otherwise. This is necessary to ensure that Kratos works properly in both scenarios, but it also means that we will need to write the same code twice and, what is worse, debug it twice!

To avoid this problem, we have defined the OpenMPUtils class, which provides access to the OpenMP functions we needed in the previous example to partition and loop over a node array, together with suitable serial alternatives. This allows us to rewrite the previous example as:

 #include <utilities/OpenMPUtils.h>
 // ...
 typedef OpenMPUtils::PartitionVector PartitionVector;
 // ...
 int NodeNum = rModelPart.Nodes().size(); // Get Number of nodes
 PartitionVector NodeDivision; // Initialize an empty vector to store the partition
 int NumThreads = OpenMPUtils::GetNumThreads(); // Get total number of threads
 OpenMPUtils::DivideInPartitions(NodeNum,NumThreads,NodeDivision); // Generate the partition
 #pragma omp parallel
 {
    int k = OpenMPUtils::ThisThread(); // My thread id
    // My block of nodes begins at:
    typename ModelPart::NodesContainerType::iterator NodeBegin = rModelPart.NodesBegin() + NodeDivision[k];
    // And ends before:
    typename ModelPart::NodesContainerType::iterator NodeEnd = rModelPart.NodesBegin() + NodeDivision[k+1];
    // Loop over my nodes
    for(typename ModelPart::NodesContainerType::iterator itNode = NodeBegin; itNode != NodeEnd; itNode++)
    {
        itNode->GetValue(VELOCITY) = itNode->FastGetSolutionStepValue(VELOCITY);
        itNode->GetValue(PRESSURE) = itNode->FastGetSolutionStepValue(PRESSURE);
    }
 }

This piece of code does exactly the same as the one shown in the previous block, but will also work when OpenMP is not enabled. The checks for _OPENMP are hidden inside the OpenMPUtils functions, which are prepared to return a suitable value when compiled without OpenMP.

Note that we defined a PartitionVector type. This is not really necessary, as it is just a std::vector<int>, but will be helpful if we decide to change it someday.
