I’ve been programming next-gen consoles for a while now, and I have to say: it’s not getting any easier to write games that take advantage of multi-core systems. The conventional wisdom at my company is that writing multithreaded code is hard, so “ordinary programmers” should not be doing it – let’s leave it to the senior engineers who write the systems code to get it right. Ordinary gameplay/AI/animation/etc. engineers shouldn’t have to worry themselves with this stuff.
I’m currently a lead engineer (which puts me in the “senior” category – although technically at my company “lead” is not a job rank per se). Lead engineers typically have a long history in the industry, and usually have a background in tackling the hard tasks of engineering lower-level systems in games. Many leads have done rendering, some have physics experience, many have experience of architecting games, and a few (myself included) also have a networking background. Basically, these correspond pretty well to the skills that one can’t hire for love nor money. As the highest-ranking engineers who still actually write code, it falls to us to provide technical direction for the projects in our studio and our company.
I can partly see why “leave it to the senior engineers” is a reasonable stance – for example, dealing with the intricacies of memory barriers is a task I wouldn’t wish on anyone. This stuff is hard enough to get right that the fewer people who have to touch it, the better. But on the other hand, I don’t think we’re doing enough to get safe forms of concurrency into the hands of the engineers who write the game code.
We (both internally and, I think, in the wider industry) are making some progress on getting some systems code multithreaded. However, it’s not great progress. Of course, one problem is that PCs are still lagging behind next-gen consoles: the min spec target, even for games looking at shipping this time next year, is still a single-core P4-based system. Hyperthreaded if we’re lucky. Multicore PCs are out there, and are getting more numerous, but don’t yet have the market share to support dropping the lower end. And of course both Xbox 360 and PS3 still have more hardware threads than a current top-end PC (6 hardware threads across the 360’s three cores, and 2 PPU hardware threads plus 6 SPUs on the PS3).
For the most part, we are still moving single systems to other threads – there is nothing general about our approach to multithreading yet, and it’s not going to scale except in a very few cases. We have managed to separate “game” and “render” threads, but both of them are still fairly monolithic. There is some hope for physics simulation being threaded, and this is mostly being worked on by companies like Havok and Ageia. There is also some work being done on lower-level AI tasks like pathfinding. Some game tasks (like culling objects against the view frustum) can be relatively easily split onto another thread. And some game tasks continue to be threaded, the way they have been for years (audio and networking spring to mind).
But the major problem area for multithreading is where the actual gameplay happens: the intersection of animation, “high-level” AI, player control and to some extent physics. When I talk about “high-level” AI I mean the decision-making and glue code that controls what entities in the game do. The main game loop of object updates has multiple ordering dependencies among these systems, and updating one system for a single object often causes cascading requests to other systems. For example, animation must be posed before physics is resolved, because posing animation may cause an arm to stick through a wall that a character is standing close to, and the physics system then applies constraints to push the arm back outside the wall. Another example is that AI will decide to change what a character is doing and therefore will drive animation blending. A third example is that AI code will often want to know whether two characters can see each other in order to decide what to do. This involves asking the physics system to do a line-of-sight trace between the objects, and because the result is needed to make the AI decision, the call is typically synchronous.
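To make the shape of that coupling concrete, here is a hypothetical per-object update in the style I’m describing. None of these types or functions come from a real engine; PhysicsWorld, Character and the rest are purely illustrative stubs.

```cpp
// Hypothetical per-object update illustrating the coupling described above:
// the AI decision blocks on a synchronous physics query, the decision drives
// animation blending, and animation must be posed before physics resolves
// constraints. All names are illustrative, not from any real engine.
struct Vec3 { float x = 0, y = 0, z = 0; };

class PhysicsWorld {
public:
    // Synchronous line-of-sight trace: the caller waits for the result.
    bool line_of_sight(const Vec3& /*from*/, const Vec3& /*to*/) const {
        return true;  // stub
    }
};

class Character {
public:
    void update(PhysicsWorld& physics, const Character& enemy) {
        // AI: the decision needs the trace result *now*, so the call blocks.
        const bool can_see = physics.line_of_sight(position_, enemy.position_);
        if (can_see) {
            start_blend_to("aim");        // AI drives animation blending
        }
        pose_animation();                 // must happen before physics...
        resolve_constraints(physics);     // ...which pushes limbs out of walls
    }

private:
    void start_blend_to(const char* /*anim*/) {}      // stubs standing in for
    void pose_animation() {}                          // real animation and
    void resolve_constraints(PhysicsWorld& /*p*/) {}  // physics systems
    Vec3 position_;
};
```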
So, animation, AI and physics are often tightly coupled in the update of a single object, and the update of a single object will often touch the data of half a dozen other objects (e.g. the weapon I’m carrying, what my enemy is doing, what state my buddies are in, etc). It doesn’t take a genius to realise that this is a pretty bad state of affairs if we’re trying to split parts of these tasks onto multiple threads. And these are not lightweight tasks either: posing animation involves evaluating state machines and a bunch of matrix and quaternion calculations; a physics trace often involves a heavyweight data structure traversal (octree, BSP or similar); AI involves complex state machines, sometimes interpreted script, and a lot of branches (something, incidentally, that the PowerPC architecture of the Xbox 360 and PS3 is quite poor at).
So what’s to be done? Well, I have a couple of ideas designed to make it easier for “ordinary engineers” to take advantage of multiprocessing in at least a semi-safe way. The first idea is Software Transactional Memory, or STM. You’ll find the literature if you care to look around the web. I think this is the only practical solution to achieving some concurrency in updating the thousands of objects we need to update each frame, each of which touches half a dozen others. And I think we should apply it at the object level. This will require a bit of extra memory, but if done right, the performance gains could be significant. And importantly, it will scale. For the first time, I think we have the potential to update thousands of objects across as many cores as we have. Of course, I don’t yet have any practical experience, and we may find that a new bottleneck (e.g. memory access) springs up.
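To give a flavour of what I mean by applying it at the object level, here is a much-simplified sketch of optimistic, transaction-style updates on a single object: work on a private copy, then commit only if nobody else committed in the meantime, otherwise retry. A real STM also tracks read/write sets so a transaction can span the half dozen objects an update touches; VersionedObject and atomic_update are names I’ve invented for illustration.

```cpp
// Much-simplified sketch of optimistic, transaction-style updates on one
// object: snapshot, mutate a private copy, commit only if the version is
// unchanged, otherwise retry. A real STM generalises this to read/write sets
// spanning many objects. VersionedObject/atomic_update are invented names.
#include <functional>
#include <mutex>
#include <utility>

template <typename T>
class VersionedObject {
public:
    explicit VersionedObject(T initial) : value_(std::move(initial)) {}

    // Apply 'update' transactionally. 'update' must only touch the copy it is
    // given (no other side effects), because it may be re-run on conflict.
    void atomic_update(const std::function<void(T&)>& update) {
        for (;;) {
            unsigned seen = 0;
            // Take a consistent snapshot of the value and its version.
            T copy = [&] {
                std::lock_guard<std::mutex> lock(mutex_);
                seen = version_;
                return value_;
            }();

            update(copy);  // compute the new state off to the side

            std::lock_guard<std::mutex> lock(mutex_);
            if (version_ == seen) {       // nobody else committed in between
                value_ = std::move(copy);
                ++version_;
                return;                   // commit succeeded
            }
            // Conflict: another thread committed first; retry against its result.
        }
    }

    T read() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    unsigned version_ = 0;
    T value_;
};
```

The retry loop is the whole trick: gameplay code writes what looks like a normal update, and conflicting updates simply re-run instead of corrupting each other.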
The second idea I have is to enable more multiprocessing of mid- and systems-level code (and perhaps some higher-level code) through functional programming. I bet that more than 90% of the loops we write can be implemented in terms of map and fold. The challenge is finding a way to abstract this in C++ (much as I would like to write games in Haskell, I don’t see it being adopted any time soon – but I may well experiment with a scripting language embedded in Haskell). Imagine if AI and player control engineers could, instead of writing a synchronous for-loop to find the closest enemy, write that same functionality in terms of map and fold (which is easy) and automatically have it distributed across cores. The beauty of this idea is that the concurrency work is done once to make map and fold themselves multicore, and then (with appropriate forethought and data constraints) everyone can take advantage of that.
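As a sketch of what that could look like in C++, here is a hand-rolled parallel map-and-fold that chunks a container across std::async tasks, plus the closest-enemy query expressed on top of it. parallel_map_fold, Enemy and distance_to are all invented for illustration; a production version would sit on the engine’s own task scheduler rather than std::async.

```cpp
// Hypothetical sketch: a generic parallel map+fold, plus "find the distance to
// the closest enemy" written on top of it instead of as a hand-rolled loop.
// parallel_map_fold, Enemy and distance_to are invented for illustration.
#include <algorithm>
#include <cmath>
#include <future>
#include <limits>
#include <vector>

// map: T -> R, fold: (R, R) -> R. fold must be associative (with 'init' as its
// identity) so that per-chunk partial results can be combined in any order.
template <typename T, typename R, typename Map, typename Fold>
R parallel_map_fold(const std::vector<T>& items, R init, Map map, Fold fold,
                    std::size_t num_chunks = 4) {
    if (items.empty()) return init;
    num_chunks = std::min(num_chunks, items.size());
    const std::size_t chunk = (items.size() + num_chunks - 1) / num_chunks;

    std::vector<std::future<R>> partials;
    for (std::size_t begin = 0; begin < items.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, items.size());
        partials.push_back(std::async(std::launch::async, [&, begin, end] {
            R acc = init;
            for (std::size_t i = begin; i < end; ++i)
                acc = fold(acc, map(items[i]));   // map then fold, per chunk
            return acc;
        }));
    }

    R result = init;
    for (auto& partial : partials)
        result = fold(result, partial.get());     // combine the chunks
    return result;
}

// Usage: the closest-enemy query as a map (distance) and a fold (minimum).
struct Enemy { float x, y, z; };

float distance_to(const Enemy& e, float px, float py, float pz) {
    const float dx = e.x - px, dy = e.y - py, dz = e.z - pz;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

float closest_enemy_distance(const std::vector<Enemy>& enemies,
                             float px, float py, float pz) {
    return parallel_map_fold(
        enemies, std::numeric_limits<float>::max(),
        [&](const Enemy& e) { return distance_to(e, px, py, pz); },  // map
        [](float a, float b) { return std::min(a, b); });            // fold
}
```

The point is that the for-loop over enemies has disappeared from gameplay code: the distribution across cores lives inside parallel_map_fold and is written and debugged once, while the map and fold the gameplay engineer supplies just have to stay free of side effects on shared data.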
So the only thing I need to do now is actually write this code, get some preliminary use cases together, try it out, and then introduce it to my team. Wish me luck…