Yesterday my co-worker Klaas van Gend published a first post on how I accelerated the idTech4 game engine behind Doom3. This led to many responses on sites like Reddit's /r/programming and Hacker News. We got very valuable feedback (thank you!), including some positive comments about my work, but also quite a few questions about our tool, methodology and results. In this follow-up post I will try to clarify our previous statements and provide some answers. Please leave a comment below (or on any of the aforementioned news sites) if anything needs further clarification; we will take great care in answering.
First of all, it is important to realize that the goal of this exercise was to see how far I could get in optimizing the idTech4 engine for a multi-core architecture, with the help of Vector Fabrics' Pareon Profile tool, in a limited amount of time. We chose idTech4 as a benchmark because it represents a very typical use case: a large code base, optimized for single-core execution.
The approach I took is probably not the one experts would have taken. I took a top-down approach, looking for coarse-grain parallelism rather than optimizing low-level rendering instructions. For this approach Pareon was useful. I believe the two methods could be combined to achieve even better results: for example, the engine appears to ship a heavily optimized math library exploiting SSE vectorization.
Our Pareon tool is not aimed at replacing parallelization experts. We want the tool to be useful for any programmer who is looking for parallelization opportunities, either in his own code or in code that he is unfamiliar with. A domain expert (a game programmer, say) can probably save some time because of his in-depth knowledge.
A detailed report on my method and results is available in this whitepaper, and to some extent in this presentation from the Dutch T-DOSE conference. The summary is as follows:
- My goal was to optimize the idTech4 game engine (which is mostly single threaded) for a multicore machine. Therefore I looked at data parallelism at a high level as opposed to optimizing low-level rendering routines.
- In three weeks' time, I parallelized part of the frame rendering code by adding 183 lines of code and replacing 226 lines. This resulted in a 15% increase in frame rate on a 4-core machine.
- In order to achieve this, I split a large loop into four new loops and then parallelized two of them. One loop was sped up by a factor of 3.8 and the other by a factor of 1.8; together they were sped up by a factor of 2. The other two loops isolated OpenGL calls that could not be parallelized.
- This speedup was observed in the performance benchmarking mode of Doom3: the 'timedemoquit' mode runs a pre-recorded demo as fast as possible. In this mode there is no frame-rate cap and no synchronization with the screen refresh. This also implies that, for my setup, frame rate and time to render a frame are inversely related.
- I used our Pareon tool to profile the engine and search for parallelization opportunities. This way I was able to quickly focus on the pieces of code where a speedup was likely. It also gave me insight into the changes I had to make to the code to guarantee a correct resulting program.
- Because performance optimization requires reproducible measurements, I only optimized the (pre-)renderer part. I did not look into parallelizing the physics and AI calculations, which are also part of rendering a frame, as I could not benchmark those parts in a reproducible way.
So now for the specific feedback and questions we got. I will try to address them in as much detail as I can below.
- Some feedback states that the results are not that impressive. The three additional cores on my four-core machine are only used during part of a rendering cycle; they sleep the rest of the time. If I parallelized other pieces of code in the cycle as well, utilization would increase and I would expect the frame rate to improve even more.
- The frame rate shown in the demo run in the previous blog post is not related to my results. It is lower because the demo was recording itself while running, saving each frame to disk; this extra delay results in a lower reported frame rate.
- Regarding the way Pareon performs its analysis: it relies on instrumentation, and therefore coverage is key. The parts of the code of interest should be sufficiently covered. It is partly the user's responsibility to make sure the right data sets are used; however, Pareon helps by showing the achieved coverage of the parts you want to parallelize.
- Some people wondered if the HUD textures are messed up in the optimized version, because in the demo, as soon as you get hit by a monster, the health and score numbers seem to blow up in your face. The answer is that this is part of the gameplay and was already present in the original, unoptimized version.
- Yesterday, we submitted the patch to the iodoom3 project with Bugzilla ID 5790.
- The omission of the exact graphics card was for a reason: it is not your average card. In an office of parallelization tool developers, serious graphics cards are superfluous. The only one I had available at the time was an Nvidia Tesla C2050, a card mainly used for high-performance computing (we also research CUDA and OpenCL) and not meant for games. According to this Wikipedia article on Nvidia Tesla, this card is based on the GF100 core, which in early 2010 also appeared in the GTX465, GTX470 and GTX480: high-end graphics cards at the time of launch but totally outdated today. Hence I stuck with the term "midrange".
I hope this clarifies some of the comments and questions. If you have more, please comment in the existing threads about this topic or use the comment field below this post and we will try our best to answer them.