The profile information covers loops, not just functions. Each entry is cross-referenced to the source code.
Parallelizing a loop. This loop can be parallelized, but its threads turn out to spend most of their time waiting on each other.
Parallelizing this loop achieves a 2.0x speedup of the loop, 1.8x overall, on this quad-core architecture. We can do better.
Threads aren’t waiting on each other in this loop, but thread creation and cleanup overhead is very large.
Parallelizing this loop achieves a 4.0x speedup of the loop, 3.2x globally. Thread creation and synchronization overhead is minimal.
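The gap between the loop speedup and the overall speedup can be sanity-checked with Amdahl's law. The sketch below assumes that the reported figures follow Amdahl's law exactly and that only the one loop was sped up; the function names are hypothetical and are not part of any profiler API.

```python
def overall_speedup(parallel_fraction: float, loop_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when a fraction p of the
    original run time is accelerated by a factor of loop_speedup."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / loop_speedup)


def parallel_fraction(loop_speedup: float, overall: float) -> float:
    """Invert Amdahl's law to recover p, the share of the original
    run time that was spent in the parallelized loop."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / loop_speedup)


if __name__ == "__main__":
    # Figures quoted in the captions: (loop speedup, overall speedup).
    for loop, overall in [(2.0, 1.8), (4.0, 3.2)]:
        p = parallel_fraction(loop, overall)
        print(f"loop {loop}x, overall {overall}x -> loop share {p:.3f}")
```

Under these assumptions, the 2.0x/1.8x case implies the loop accounted for about 89% of the original run time, and the 4.0x/3.2x case about 92%. The remaining serial portion is what keeps the overall number below the loop's own speedup.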