Vector Fabrics home page Vector Fabrics Blog RSS View Vector Fabrics' Twitter account View Vector Fabrics' Facebook profile View Vector Farbics' LinkedIn profile Google+

Sharing insights in multicore programming

Multithreading examples for C programs, Part 1

Have you ever parallelized a C or C++ program? Then you got a slowdown first, didn’t you? Parallelization is a tough call because of data dependencies hiding behind pointers, unclear multithreading overheads, unexpected processor and OS behaviour, let alone potential starvation and deadlocks. Luckily, in the past 40 years computer engineers have been constantly accumulating best practices of industry-proven solutions to common concurrency challenges. They call them parallelization patterns.

It’s worth pointing out the difference between parallelization patterns for control-intensive and compute-intensive applications. The former are dominant in distributed systems such as web servers and network systems, where mostly independent coarse-grain processes are run in parallel on distributed machines. To dig deeper in this domain, check out the great book “Pattern-oriented software architecture, Volume II” by D. Schmidt et al. The latter compute-intensive parallelization patterns are often found in high performance computing and embedded system designs, packed nowadays with multimedia and communication software. For example, the data elements of multimedia streams flowing through multicores of embedded systems can often be processed concurrently by fine-grain threads, involving, though, a good deal of inter-thread dependencies. The “Patterns for parallel programming” book by Mattson et al. is an excellent read for those who want to master this domain in depth. In general, there exist many resources surveying parallel programming patterns, check out, for example, this collection of links created by Marc Snir at Illinois University.

At Vector Fabrics we are also constantly exploring ways to expand applicability of our parallelization tools to cover more parallelization patterns. In practice, we have built a collection of more than 20 minimalistic C programs and semantically-equivalent parallel implementations, capturing popular parallelization patterns from compute-intensive applications. Let’s run through three examples that can easily surprise a beginner and please the intelligence of an experienced parallelization hacker.

Pattern “ignored dependencies”

Can you see the difference between the figures below? I can’t. However, the Linux diff command claims they are different.

image inpainted by the original sequential code  Image inpainted in parallel after breaking up dependencies

The image on the left is the result of the original sequential OpenCV’s image restoration function cv::inpaint(). The image on the right came out from our modified parallel version of the sequential code, breaking inter-pixel data dependencies and dependencies on the kernel’s priority queue data structure. A development version of vfEmbedded shows me there are tons of dependencies blocking parallelization of a top-level loop in the kernel. However, if you ignore them in a parallel version the visual perception of the result is similar and... the parallel code runs nearly 4 times faster than the sequential version on an NVIDIA’s Tegra 3 chip. This is a recurring pattern, when some dependencies, for example, on standard output or even pixel data can be ignored to enable parallelism. In fact, this is a typical interaction between the algorithm and implementation development teams, involving trading off algorithmic quality for implementation's performance. Note, however, that ignoring dependencies on certain data structures (responsible, for example, for tracking allocated objects on the heap) may easily result in a segfault. So, it is crucial to know all dependencies involved in parallelization.

Pattern “linked list”

Traversing a linked list is a sequential process, because in each iteration we compute the address of the next element in the list. No concurrency there? Yes, there is! Modern multithreading language extensions such as OpenMP 3.0 support this case with the concept of tasks:

/* sequential code */
for(lmn = llist; lmn; lmn = lmn->next)
    lmn->x = process(lmn->x);
/* parallel implementation */
#pragma omp parallel
#pragma omp single private (lmn)
for(lmn = llist; lmn; lmn = lmn->next)
  #pragma omp task
  lmn->x = process(lmn->x);

Note, that in the parallel implementation the linked list is still traversed sequentially, to compute the next element's address lmn. However, the long process() functions can now be overlapped, enabling impressive speedups for this seemingly sequential loop with dependencies between multiple loop iterations also known as loop-carried dependencies.

Pattern “data partitioning with loop-carried dependencies”

In the linked list pattern processing of each linked list’s element was independent of the others. What if a loop iteration does depend on a previous one? Well, then we can still run the loop iterations in parallel, but the loop-carried dependencies need to be satisfied by proper synchronization.

Check out the code snippet below, in which calculation of lc_dep[​i+1] depends on lc_dep[​i] from the previous iteration:

for(i = 0; i < PROBLEM_SZ; i++)
 array[​i] = fun1(array[​i]);
 lc_dep[​i+1] = fun2(lc_dep[​i] + array[​i]);

We can parallelize this loop, but have to ensure that lc_dep[​i] is ready before computing lc_dep[​i+1]. Luckily, OpenMP provides a decent solution - the ordered construct:

#pragma omp parallel for ordered schedule(static, 1)
for(i = 0; i < PROBLEM_SZ; i++)
  array[​i] = fun1(array[​i]);
  #pragma omp ordered
  lc_dep[​i+1] = fun2(lc_dep[​i] + array[​i]);

The ordered construct enforces the original ordering of execution for a specified section of a loop, whereas the rest of the loop body may be scheduled concurrently. For example, below is an execution schedule of this loop on two cores with arrows indicating dependencies between functions fun1() and fun2(). Note, that fun1() and fun2() from iterations i=1 and i=0, respectively, execute in parallel.

Two threads schedule with inter-function dependencies

By the way, be careful with the ordered construct, if you use the GNU OpenMP library of the gcc compiler. The current implementation of this construct in gomp is blocking concurrency, if the ordered is specified in the beginning of the loop, see this gcc bug report. Interestingly, this loop also exposes a functional partitioning pattern, which suggests spawning threads running different functions (e.g. fun1() and fun2()). But this is yet another pattern...

All in all, we had a lot of fun coding many parallelization patterns in different languages, chasing deadlocks and race conditions, benchmarking examples on different hardware, as well as reading and discussing related books. The parallelization patterns guide the programmer towards a safe solution. However, even if you follow the patterns, without dependency-tracking and parallel performance estimation tools, your first multithreaded implementation is bound to either segfault or slowdown. ;-)

In part 2 of this post I will run some of the above examples through our vfEmbedded tool and show you how it can help analyze dependencies and create parallel programs using POSIX threads for platforms that don't support OpenMP, such as the Android OS.

Posted in category: Market & Skills on Friday, November 18, 2011 - 16:32


Add comment

Notify me of follow-up comments?

Please enter the word you see in the image below: