Parallelizing a sequential C program using vfEmbedded

In this blog post I will explain how to use vfEmbedded to parallelize a small sequential C program. The program in question calculates the electric potential in a two dimensional plane created by a discrete set of point charges. The analysis and implementation of the recipes provided by vfEmbedded were done in three hours, speeding up the program with a factor 3.4.

The electric potential at a point Vp is calculated with:

Where q is the set of point charges and r_i the distance from point p to the point charge in question. So, for one point V_p one has to loop over every discrete point charge. This results in three nested loops to calculate the electric potential for all points in a two dimensional plane. You can take a look at the calculate function in main.c for more details. This program is CPU bound so we would like to divide the work over multiple cores. Using vfEmbedded we will make an informed decision about which loop to parallelize, and we will use vfEmbedded to predict our speed up before we start implementing.

Analyze the program with vfEmbedded

First, start a new project in vfEmbedded and upload the source file main.c. Second, open the Settings dialog and change the target platform to Intel i5 & hardware functions. Now we are ready to build and analyze our program! Since vfEmbedded uses both static and dynamic code analysis it will take sometime for the analyze to finish. Wait for the analysis to complete and switch to the Profile tab. The Profile tab tells us how much time is spent in each function. This allows you to quickly identify the most interesting parts of your program to parallelize. For example, 99.99% of the execution time of our program is spent in the calculate function (see image).

Next, we would like to find out which loop to parallelize. Instead of writing multiple versions of our program and benchmarking the performance, we can use vfEmbedded to do the hard labor for us and predict our increase in performance. We will start naively and parallelize the innermost loop: just right click on it in the Profile tab and select Parallelize. This will open the Partitions tab where we are able to specify the number of threads. If we suggest to use 4 threads for example, vfEmbedded predicts a speed up of 3x:

So why is the predicted speed up 3x and not 4x? To find the answer we will take a look at the thread schedule. Apply the partitioning and switch back to the Profile tab. Here we can see that the loop is replaced by a new function which is being called in parallel. We can examine its execution schedule by right clicking the Parallel node and selecting Show Schedule. The schedule shows us that 13% of the time is spent on creating threads and 12% is spent on properly cleaning up threads. Besides this we can see that there is a dependency between the threads as depicted with the blue line. To learn more about this dependency, we select it and take a look in the bottom left corner to view its properties. There we find that the dependency is due to the summation of the electric potential. This corresponds to line 57 in main.c:

We have learned that significant amount of time is spent on thread creation and clean up, we can overcome this problem by spawning fewer threads but doing more work in each of them. In other words, we should parallelize one of the two other (outer) loops. First, right click the Parallel node in the Profile tab and select Remove Partition. Now repeat the process of paralleling a loop by right clicking one of the two other loops and selecting Parallelize. vfEmbedded predicts a speed up of 4x for both loops if we suggest to use 4 threads.

Implementing the changes

Once we are satisfied with the predicted speed up it is time the implement the changes necessary to achieve them. Implementing these changes is easy since vfEmbedded gives you a recipe: a step by step guide which tells you how to refactor your program. Each step on its own is small and you should be able to compile and run your code after each step. To view the step-by-step recipe simply go to the My changes tab:

After implementing the changes it is time to benchmark and see if the predicted speed ups are achieved in reality. For the purpose of simplicity I will use the Linux time command:

> time ./original

real	0m7.062s
user	0m6.970s
sys	0m0.000s

> time ./parallel

real	0m2.072s
user	0m7.850s
sys	0m0.000s

We see that vfEmbedded was a bit pessimistic about our performance increase since the actual measured performance increase is approximately 3.4x. Because of vfEmbedded we know we parallelized the right loop, we didn't waste any time on less favorable options. Besides this, implementing the vfEmbedded recipe is straight forward: analyzing a small program and implementing the recipe given by vfEmbedded took no more than three hours.

For those interested, the source code with Makefiles of both versions is available in the attached zip file.

 
Posted in category: Company News on Wednesday, November 9, 2011 - 07:45

Comments

No comments found.

Add comment

(required)
(required, will not be published)
(will not be published)
(will not be published)
Notify me of follow-up comments?

Please enter the word you see in the image below: