Utilizing Multicore Processors with OpenMP. Pete Isensee, Microsoft Corporation. Presented by Eric Cheng, CGGT 12
Why do we need to utilize multicore• CPU clock speeds are not improving at the rates we’ve become accustomed to over the past decade• Games today are more complex, with more objects in their scenes than ever• We have to leverage the power of multicore processors in our game engines
OpenMP• The industry often needs to schedule dozens and sometimes hundreds of independent tasks on processors that have a shared view of memory• OpenMP is one such solution• OpenMP is a portable, industry-standard API and programming model for C/C++ that supports parallel programming
OpenMP Support• OpenMP is managed by the non-proﬁt technology consortium OpenMP Architecture Review Board (or OpenMP ARB), jointly deﬁned by a group of major computer hardware and software vendors, including AMD, IBM, Intel, Cray, HP, Fujitsu, NVIDIA, NEC, Microsoft, Texas Instruments, VMware, Oracle Corporation, and more• The OpenMP Architecture Review Board (ARB) published its ﬁrst API speciﬁcation, OpenMP for Fortran 1.0, in October 1997. In October the following year they released the C/C++ standard. 2000 saw version 2.0 of the Fortran speciﬁcation, with version 2.0 of the C/C++ speciﬁcation being released in 2002. Version 2.5, a combined C/C++/Fortran speciﬁcation, was released in 2005• Version 3.0 was released in May 2008. Among the new features in 3.0 is the concept of tasks and the task construct. These new features are summarized in Appendix F of the OpenMP 3.0 speciﬁcation• Version 3.1 of the OpenMP speciﬁcation was released July 9, 2011
OpenMP Implementation• OpenMP has been implemented in many commercial compilers. For instance, Visual C++ 2005, 2008 and 2010 support it (in their Professional, Team System, Premium and Ultimate editions), as does Intel Parallel Studio for various processors. Sun Studio compilers and tools support the latest OpenMP speciﬁcation with productivity enhancements for Solaris OS (UltraSPARC and x86/x64) and Linux platforms. The Fortran, C and C++ compilers from The Portland Group also support OpenMP 2.5. GCC has supported OpenMP since version 4.2• A few compilers have early implementations of OpenMP 3.0, including• GCC 4.3.1, the Nanos compiler, the Intel Fortran and C/C++ Compilers versions 11.0 and 11.1, Intel C/C++ and Fortran Composer XE 2011 and Intel Parallel Studio, and the IBM XL C/C++ Compiler• Sun Studio 12 Update 1 has a full implementation of OpenMP 3.0
OpenMP Example: Particle System• Suppose your game has a particle system that recalculates the positions of all particles once per frame• If we have two threads, we want to split the loop iterations between them• OpenMP allows you to do that with a single directive
What the compiler does here• An OpenMP-enabled compiler generates code that automatically splits this for loop into multiple parallel sections, each of which executes independently. The number of parallel sections depends on the hardware, the OpenMP runtime, and conﬁguration settings established by the programmer
Performance• Suppose that numParticles is 100,000 and GetNewParticlePos takes a small, ﬁxed amount of time• The table below shows the performance metrics on two different systems using the Visual Studio 2005 compiler. The ﬁrst system is a desktop PC with dual Xeon processors, each with two hyperthreads. The second system is an Xbox 360, which has three CPUs, each with two hardware threads

Hardware                      | OpenMP Threads | OpenMP Perf Gain | Windows Threads Perf Gain
Dual-core 2.3-GHz Pentium     | 4              | 2.9x             | 2.9x
Triple-core 3.2-GHz Xbox 360  | 6              | 4.6x             | 4.5x
Comparison with Windows thread calls• Writing the previous example with Windows thread calls and synchronization primitives required over 60 lines of code, not to mention considerable debugging and tuning effort• The performance gain was virtually the same as with OpenMP. In fact, on Xbox 360, the overhead of calling Windows synchronization primitives was higher than using OpenMP, because the OpenMP runtime is tuned to call kernel exports directly• From this example, we learn that OpenMP can provide major beneﬁts with a very small investment
OpenMP Example: Collision Detection• Dynamic scheduling tells the compiler to hand out iterations to the thread team at runtime rather than simply dividing the iterations evenly between the threads
Performance• Suppose that numParticles is 1000, SphereIntersect takes a small, ﬁxed amount of time, and ObjectIntersect takes 100 times as long as SphereIntersect. Also assume a 10% sphere intersection rate and a 1% object intersection rate• OpenMP typically adds very little runtime overhead in terms of either size or speed. The Visual Studio OpenMP DLL is only 60 KB

Hardware                      | OpenMP Threads | OpenMP Perf Gain
Dual-core 2.3-GHz Pentium     | 4              | 3.3x
Triple-core 3.2-GHz Xbox 360  | 6              | 5.4x
Function Parallelism• Apart from data parallelism, OpenMP can also be used for function-level parallelism. Consider QuickSort• Each of the recursive calls to qsort is completely independent of the others
Function Parallelism• One way is to split the independent recursive calls into their own OpenMP parallel sections• To execute the calls in parallel, they’re wrapped in braces so that OpenMP knows which portions to run in parallel. On most platforms, however, the overhead of the recursive parallel sections is likely to outweigh the beneﬁts
A better solution• Calculate a handful of partitions, then do a high-level parallelization on the resulting partitions. Given a decent partition function, each partition will be roughly the same size, and the resulting performance gains can be worth the effort
Performance• Given an array of 1,000,000 random integers

Hardware                      | OpenMP Threads | OpenMP Perf Gain
Dual-core 2.3-GHz Pentium     | 4              | 1.5x
Triple-core 3.2-GHz Xbox 360  | 6              | 1.4x
OpenMP ﬂaws• OpenMP is not designed to handle every multithreading problem. Complex multithreading scenarios and synchronization are best done using native thread techniques. If you’re writing a new engine around such scenarios, OpenMP alone is not the way to go• OpenMP-enabled compilers typically don’t check to ensure that your code will parallelize correctly
OpenMP ﬂaws• Most compilers will compile the above code without errors. The problem is that the second and third sections rely on the ﬁrst section completing with a valid value for n. Very likely the program will crash. It’s up to the programmer to ensure that OpenMP is applied only to constructs that are not order-dependent• Another gotcha is debugging. When the compiler encounters an OpenMP block, it generates custom code that calls into the OpenMP runtime. Unfortunately, the internals of OpenMP are a black box. This can make debugging very difﬁcult• Finally, using OpenMP does not guarantee that you will improve performance on multiprocessor systems. Depending on your usage, the runtime overhead of OpenMP can dwarf any beneﬁts. For instance, when numParticles in the ﬁrst example was on the order of 100, the performance gain was negligible
Conclusion• OpenMP is a quick and useful technique for utilizing multicore and hyperthreaded processors. It’s easy enough to be used for last-minute optimizations, yet ﬂexible enough to use in cross-platform code. It’s not without ﬂaws, but used properly, OpenMP can be very beneﬁcial• Potential applications of OpenMP in games include particle systems, skinning, collision detection, simulations, pathﬁnding, vertex transforms, signal processing, procedural synthesis, and fractals• Even though OpenMP has been around for quite a while, it’s a new technology for most game developers. The best resource is the OpenMP speciﬁcation, available at http://www.openmp.org. The speciﬁcation is concise and very readable