Yesterday my co-worker Klaas van Gend published a first post on how I
parallelized the idTech4 game engine behind Doom3. This has led to a lot of
responses on Reddit's /r/programming and other news sites.
news. We got very valuable feedback (thank you!), including some
positive comments about my work but also quite a few questions about
our tool, methodology and results. Today, in this follow-up post I
will try to clarify our previous statements and provide some
answers. Please leave a comment below (or on any of the aforementioned
news sites) if anything needs further clarification; we will take great
care in answering them.
First of all, it is important to realize that the goal of this
exercise was to see how far I could get in optimizing the idTech4
engine for a multi-core architecture with the help of Vector Fabrics'
Pareon tool, in a limited amount of time. We chose idTech4 as a benchmark
because it represents a very typical use case: a large code base,
optimized for single core execution.
The approach that I took is probably not the one experts would have
taken. I took a top-down approach, looking for coarse-grain parallelism
rather than optimizing low-level rendering instructions; for this
approach Pareon was useful. I believe that the two methods could be
combined to achieve even better results. For example, the engine appears
to ship a very optimized math library, whose low-level optimizations
could be combined with the coarse-grain parallelism I found.
Our Pareon tool is not aimed at replacing parallelization experts. We
want the tool to be useful for any programmer who is looking for
parallelization opportunities, either in his own code or in code that he
is unfamiliar with. A domain expert (a game programmer, in this case)
can probably save some time because of his in-depth knowledge.
A detailed report on my method and results is available in this whitepaper and, to some extent, in
this presentation from
the Dutch T-DOSE conference. The summary is as follows:
My goal was to optimize the idTech4 game engine (which is mostly
single threaded) for a multicore machine. Therefore I looked at data
parallelism at a high level, as opposed to optimizing low-level
rendering instructions.
In three weeks' time, I parallelized part of the frame rendering code by
adding 183 lines of code and replacing 226 lines. This resulted in a 15%
increase in frame rate on a 4-core machine.
In order to achieve this, I split a large loop into four new loops and
then parallelized two of them. One loop was sped up by a factor of 3.8
and the other by a factor of 1.8; together, they were sped up by a
factor of 2. The other two loops isolated calls that could not be
parallelized (the OpenGL calls).
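To illustrate the kind of transformation involved, here is a minimal, hypothetical sketch of the pattern. The structure, function names and threading code below are my illustration, not the actual idTech4 sources: one large per-frame loop is split into independent phases, two of which are data parallel and can be spread over worker threads, while the phase issuing OpenGL calls stays on a single thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-ins for idTech4 data and per-surface work.
struct Surface { float dummy = 0.0f; };
void PrepareGeometry(Surface& s)     { s.dummy += 1.0f; }  // independent per surface
void ComputeInteractions(Surface& s) { s.dummy *= 2.0f; }  // independent per surface
void EmitDrawCalls(const Surface&)   { /* OpenGL calls in the real engine */ }

// Apply 'work' to every surface, split over 'numThreads' worker threads.
template <typename Fn>
void ParallelFor(std::vector<Surface>& surfaces, Fn work, unsigned numThreads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = (surfaces.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(begin + chunk, surfaces.size());
        if (begin >= end) break;
        workers.emplace_back([&surfaces, work, begin, end] {
            for (std::size_t i = begin; i < end; ++i) work(surfaces[i]);
        });
    }
    for (auto& w : workers) w.join();
}

// Originally one large loop did all the per-surface work. After splitting it
// ("loop fission"), the two independent phases run on all cores and only the
// phase that issues OpenGL calls stays sequential.
void RenderFrame(std::vector<Surface>& surfaces) {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    ParallelFor(surfaces, PrepareGeometry, n);
    ParallelFor(surfaces, ComputeInteractions, n);
    for (auto& s : surfaces) EmitDrawCalls(s);   // cannot be parallelized
}
```

The actual loops that were split and parallelized are of course the engine's own pre-render loops; the whitepaper describes them in detail.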
This speedup was observed in the performance benchmarking mode of
Doom3: the 'timedemoquit' mode runs a pre-recorded demo as fast as
possible. In this mode, there is no cap on the frame rate and no
synchronization with the screen refresh. This also implies that, for
my setup, frame rate and time to render a frame are inversely
proportional.
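As a concrete (hypothetical) example of that inverse relation: if a frame takes 20 ms to render, the engine runs at 1000/20 = 50 frames per second, and the measured 15% frame-rate increase corresponds to the frame time dropping to roughly 20/1.15 ≈ 17.4 ms.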
I used our Pareon tool to profile the engine and search for
parallelization opportunities. This way I was able to quickly focus on
the right pieces of code, where a speedup was likely. It also gave
me insight into the changes I had to make to the code to guarantee a
correct resulting program.
Because performance optimization requires reproducible measurements,
I only optimized the (pre-)renderer part. I did not look into
parallelizing the physics and AI calculations, which are also part of
rendering a frame, as I could not benchmark those parts in a
reproducible way.
Now for the specific feedback and questions we got. I will try to
address those in as much detail as I can below.
Some feedback states that the results are not that impressive. Indeed,
the three additional cores on my four-core machine are only used during
part of a rendering cycle; they are sleeping the rest of the time. If I
were to parallelize other pieces of code in the cycle as well, it would
increase utilization and I would expect the frame rate to improve even
further.
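As a back-of-the-envelope check (my own estimate, not a figure from the whitepaper): by Amdahl's law the overall speedup is 1 / ((1 - p) + p/s), where p is the fraction of frame time spent in the parallelized loops and s their combined speedup. With s ≈ 2 and an overall speedup of about 1.15, p works out to roughly 0.26, so only about a quarter of the frame time was parallelized, which is why parallelizing more of the cycle should keep improving the frame rate.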
The frame rate shown in the demo run in the previous blog post is not
related to my results. The frame rate shown is lower because while
the demo was running it was also recording itself, saving each frame
to disk. This extra delay results in a lower reported frame rate.
Regarding the way Pareon performs its analysis: it relies on
instrumentation, and therefore coverage is key. The parts of the code
of interest should be sufficiently covered. It is partly the user's
responsibility to make sure the right data sets are used, but Pareon
helps by showing the achieved coverage of the parts you want to
parallelize.
Some people wondered if the HUD textures are messed up in the
optimized version, because the demo shows that, as soon as you get hit
by a monster, the health and score numbers seem to blow up in your
face. The answer is that this is part of the gameplay and is
already present in the original, unoptimized version.
Yesterday, we submitted the patch to the iodoom3 project with bugzilla ID
The omission of specifying the exact graphics card was deliberate:
it is not your average card. In an office of parallelization tool
developers, real gaming graphics cards are superfluous. The only one I
had available at the time was an Nvidia Tesla C2050, a card mainly used
for high-performance computing (we also research CUDA and OpenCL) and
not meant for games. According to this Wikipedia article on the
Nvidia Tesla, this card is based on the GF100 core, which in
2010 also appeared in the GTX465, GTX470
and GTX480: high-end graphics cards at the time of launch but
totally outdated today. Hence I stuck with the term "midrange".
I hope this clarifies some of the comments and questions. If you have
more, please comment in the existing threads about this topic or use
the comment field below this post, and we will do our best to answer
them.