The tool just started. Showing main in the profile and 2D profile. Cheat sheets on the right.
Focusing on a loop inside the conway function. Showing iteration statistics and dependencies.
The profile information includes loops, not just functions. Here showing the cross-reference to source code.
Highlighting a loop (22-26), showing iteration stats, computation time, and cache/memory penalties
Parallelizing a loop. This loop can be parallelized, but threads turn out to mostly wait on each other.
Parallelizing a loop, achieving a 2.0x speedup of the loop, 1.8x overall on this quad-core architecture. We can do better.
Threads aren’t waiting on each other in this loop; however, thread creation and cleanup overhead is very large.
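One common way to shrink this kind of overhead, assuming it comes from creating threads every time the loop is entered, is to create the workers once and reuse them across invocations. The sketch below is a hypothetical illustration, not the tool's output or its recipe: `next_row`, `step_sequential`, and `run_parallel` are made-up names, and the grid update is a plain toroidal Game of Life standing in for the conway function. In CPython the global interpreter lock prevents pure-Python loops like this from speeding up with threads, so the example shows only the structure.

```python
from concurrent.futures import ThreadPoolExecutor


def next_row(grid, r):
    """Compute row r of the next generation (edges wrap around)."""
    rows, cols = len(grid), len(grid[0])
    out = []
    for c in range(cols):
        # Count the eight neighbours of cell (r, c).
        n = sum(
            grid[(r + dr) % rows][(c + dc) % cols]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
        )
        out.append(1 if n == 3 or (n == 2 and grid[r][c]) else 0)
    return out


def step_sequential(grid):
    """Reference version: one generation, computed row by row."""
    return [next_row(grid, r) for r in range(len(grid))]


def run_parallel(grid, generations, workers=4):
    """Run several generations, reusing one pool of worker threads."""
    # The pool is created once, outside the generation loop, so thread
    # creation and cleanup are paid a single time for the whole run.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Rows are independent within a generation: each task reads
            # the old grid and returns a new row, so no locking is needed.
            grid = list(pool.map(lambda r, g=grid: next_row(g, r),
                                 range(len(grid))))
    return grid
```

Creating the pool inside the `for` loop instead would repeat the thread setup and teardown once per generation, which is the pattern this sketch avoids.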
Parallelizing this loop achieves a 4.0x speedup of the loop, 3.2x globally. Thread creation and synchronization overhead is minimal.
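The gap between the loop speedup and the global speedup can be related through Amdahl's law. The following is a minimal sketch, assuming the rest of the program runs unchanged and the loop is the only part that gets faster; the function names are hypothetical and not part of the tool.

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: global speedup when only the loop gets faster.

    loop_fraction is the share of the original runtime spent in the loop.
    """
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)


def implied_loop_fraction(loop_speedup, overall):
    """Invert Amdahl's law to recover the loop's share of the runtime."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / loop_speedup)


# Figures quoted in the captions: 4.0x on the loop gives 3.2x globally,
# and 2.0x on the loop gives 1.8x globally.
print(round(implied_loop_fraction(4.0, 3.2), 3))  # 0.917
print(round(implied_loop_fraction(2.0, 1.8), 3))  # 0.889
```

Under that assumption, a 4.0x loop speedup yielding 3.2x globally implies the loop accounted for roughly 92% of the original runtime.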
A more complex loop that gets invoked twice, and has more dependencies.
Bottom left shows the recipes to achieve this 3.2x speedup.
Step-by-step instructions with clear descriptions that show how to refactor the code to implement the parallelism.