Introduction to SIMD Programming
An introductory presentation for students regarding SIMD programming.

  1. Accelerating Real-time Program Performance
     An Introduction to SIMD and Hardware Intrinsics
     Koray Hagen
  2. Prerequisite Knowledge
     1. Exposure to C++
     2. Exposure to an assembly language
     And that’s it.
  3. The Agenda
     1. Terminology and an introduction to SIMD
     2. Examples and benefits of SIMD programming in C++
     3. Some tradeoffs, caveats, and insights
     4. Closing thoughts and further reading
  4. Instructions and Data
     Classification of computer architectures
  5. Terminology and Definitions
     1. Concurrent – events or processes that appear to occur or progress at the same time
        Related terms often heard: threads, mutexes, semaphores, monitors
     2. Parallel – events or processes that do occur or progress at the same time
        1. Parallel programming is not the same as concurrent programming; they are distinct
        2. It refers to techniques that provide parallel execution of operations
     SIMD is a computer architecture that makes use of parallel execution.
  6. Flynn’s Taxonomy
     A classification of computer architectures proposed by Michael Flynn in 1966
     1. Characterized by two criteria
        1. Parallelism exhibited in its instruction streams
        2. Parallelism exhibited in its data streams
     2. Four distinct classifications: SISD, SIMD, MISD, MIMD
     The instruction stream and data stream can each be either single (S) or multiple (M).
  7. SISD – Single Instruction, Single Data
     1. Single-CPU systems
        1. Specifically uniprocessors
        2. Co-processors do not count
     2. Concurrent processing
        1. Pipelined execution
        2. Prefetching
     3. Concurrent execution
        1. Independent concurrent tasks can execute different sequences of operations
     4. Example: personal computers
  8. SIMD – Single Instruction, Multiple Data
     1. One instruction stream is broadcast to all processors
        1. In practice, each such processor is simple, such as an ALU (Arithmetic Logic Unit)
     2. All active processors execute the same instruction synchronously, but on different data
     3. The data items are aligned in an array (or vector)
     4. An instruction can act on the complete array in one cycle
     5. Examples later on
  9. MISD – Multiple Instruction, Single Data
     1. Uncommon architecture type used for specialized purposes
        1. Redundancy
        2. Fault tolerance
        3. Task replication
     2. Example: the Space Shuttle flight control computer
  10. MIMD – Multiple Instruction, Multiple Data
      1. Processors are asynchronous
         1. They can independently execute on different data sets
      2. Communication is handled by either
         1. Shared or distributed memory
         2. Message passing
      3. Examples: computer-aided manufacturing, simulation, modeling, communication switches
  11. The importance of SIMD
      1. SIMD intrinsics are supported by most modern CPUs in commercial use
         1. As time passes, that support is ever increasing
         2. The Intel Haswell (2013) microarchitecture supports MMX, SSE 1 through SSE 4.2, AVX, AVX2, etc.
         3. The Intel Pentium (1996) microarchitecture supported MMX for the first time
      2. SIMD as an architecture is conceptually easier to leverage, understand, and debug for programmers
         1. It bears large similarities to sequential and concurrent programming
      3. SIMD usage has been successful in multimedia applications for decades
         1. Instruction- and data-level parallelism have many real-world applications
  12. SIMD Programming
      C++ parallel computation examples
  13. But first, some additional information
      1. I will be showing two real-world examples for this portion
         1. The first example is simple and will showcase SISD
         2. The second example is simple and will showcase SIMD
      2. All programs were written in C++ on Windows 8.1, using Visual Studio 2013
      3. Visual Studio 2013 features auto-vectorization, which I have turned off
         1. In other words, I have disabled the compiler’s ability to use SIMD automatically
  14. 14. SISD Simple Program Source
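      The slide’s source listing did not survive extraction. A minimal sketch of what a scalar (SISD) four-element addition of this kind typically looks like — the function name and signature are assumptions, not the slide’s actual code:

      ```cpp
      #include <cassert>

      // Scalar (SISD) addition of two four-element float arrays.
      // Each element is added in sequence: one add operation per element.
      void add(const float* a, const float* b, float* out)
      {
          for (int i = 0; i < 4; ++i)
              out[i] = a[i] + b[i];
      }
      ```

      Compiled without auto-vectorization, a loop like this produces the sequential addition instructions shown in the next slide’s disassembly.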
  15. SISD Simple Program Disassembly
      Disassembly of function add: SISD register usage; sequential addition operations
  16. SIMD Simple Program Source, Part 1
      Definition of vec4_t
      1. The xmmintrin.h header gives the programmer access to intrinsic types and intrinsic operations
      2. There is an alignment requirement: all four 32-bit floating-point values are stored in a 128-bit (16-byte) aligned intrinsic type
      3. A four-dimensional vector is composed of x, y, z, and w components
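      A sketch of what the vec4_t described above could look like — the exact layout is an assumption (the anonymous struct is a common compiler extension supported by MSVC, GCC, and Clang):

      ```cpp
      #include <xmmintrin.h> // SSE intrinsic types and operations
      #include <cassert>

      // Four 32-bit floats overlaid on a single 128-bit __m128 value.
      // The union must be 16-byte aligned to satisfy SSE requirements.
      union alignas(16) vec4_t
      {
          __m128 simd;                  // raw SIMD register value
          struct { float x, y, z, w; }; // named components
      };

      static_assert(sizeof(vec4_t) == 16, "vec4_t must occupy 128 bits");
      static_assert(alignof(vec4_t) == 16, "vec4_t must be 16-byte aligned");
      ```

      Exposing the raw `simd` member, rather than hiding it behind accessors, is what later slides mean by keeping the SIMD value available for manual optimization.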
  17. SIMD Simple Program Visualization
      These example programs use Streaming SIMD Extensions (SSE) intrinsics
      1. SSE originally provided access to eight 128-bit registers, known as XMM0 through XMM7
      2. SSE is not the only intrinsics instruction set; there are many others, such as AVX, AltiVec, F16C, etc.
  18. SIMD Simple Program Source, Part 2
      Both functions compute the addition of two four-dimensional vectors
      1. The first function uses normal SISD instructions, similar to our first SISD example
      2. The second function uses _mm_add_ps, an intrinsic function that consumes two __m128 values, performs parallelized component addition using SIMD, and returns the resulting __m128
      3. As shown in the first function, we typically pass objects by const reference for read-only parameters. With SIMD, however, the indirection through references carries a cost; it is cheaper to copy the values.
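      The two functions the slide describes might look like the following sketch (the names `Vector4`, `add_sisd`, and `add_simd` are assumptions, not the slide’s actual identifiers):

      ```cpp
      #include <xmmintrin.h>
      #include <cassert>

      struct Vector4 { float x, y, z, w; };

      // SISD version: four sequential floating-point additions,
      // with arguments passed by const reference as usual.
      Vector4 add_sisd(const Vector4& a, const Vector4& b)
      {
          return { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
      }

      // SIMD version: a single addps instruction adds all four lanes
      // at once. Arguments are passed by value, per the slide's advice.
      __m128 add_simd(__m128 a, __m128 b)
      {
          return _mm_add_ps(a, b);
      }
      ```

      Passing `__m128` by value lets the compiler keep both operands in XMM registers instead of spilling them to memory and reloading through a reference.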
  19. SIMD Simple Program Disassembly, Part 1
      Partial disassembly of function add (SISD): FPU (SISD) register usage; sequential addition operations
  20. SIMD Simple Program Disassembly, Part 2
      Disassembly of function add (SIMD): immediate storage into SIMD registers; a single, parallelized add operation using addps
  21. Challenges and Insights
      A primer to thinking about SIMD programming
  22. SIMD is not an automatic silver bullet
      1. SIMD and intrinsic usage has a steep learning curve
         1. Documentation and good tutorials that show best practice are rare
         2. Advanced usage requires expert-level knowledge
      2. It can be misused very easily
         1. Programmers may not fully understand or analyze the flow of data through SIMD registers
         2. This can result in ultimately slower programs
  23. Awareness when using SIMD
      1. Do not allow implicit casts between SIMD types and SISD scalar types
         1. To get scalar values into the right SIMD locations, values must be copied from SISD registers to SIMD registers
         2. There is a large performance penalty associated with this copying, since the FPU and SIMD pipelines must be flushed
      2. Do not create a class to abstract the SIMD type; instead, use a typedef
         1. This makes it easier for the compiler to perform platform-specific optimizations
      3. Exposing the raw SIMD value, as was shown with the Vector4 struct, benefits manual optimization
      4. Do not mix operator overloading with SIMD
         1. This can prevent the compiler from properly reordering and scheduling instructions, because it is now bound by mathematical operation rules
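      The typedef-over-class advice above can be sketched as follows — the names `simd4f`, `simd4f_add`, and `simd4f_mul` are illustrative assumptions:

      ```cpp
      #include <xmmintrin.h>
      #include <cassert>

      // Expose the raw SIMD register through a typedef rather than
      // wrapping it in a class, and use free functions rather than
      // overloaded operators, so the compiler remains free to reorder
      // and schedule the underlying instructions.
      typedef __m128 simd4f;

      inline simd4f simd4f_add(simd4f a, simd4f b) { return _mm_add_ps(a, b); }
      inline simd4f simd4f_mul(simd4f a, simd4f b) { return _mm_mul_ps(a, b); }
      ```

      On another platform the typedef could map to that platform’s native vector type while the call sites stay unchanged, which is the per-platform optimization benefit the slide refers to.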
  24. How to approach SIMD
      1. As with all things, jump in and start small
      2. “Premature optimization is the root of all evil” – Donald Knuth
         1. SIMD usage should be the result of identifying a potential and/or needed optimization within a program’s run-time
      3. Profile performance metrics as the program matures and SIMD usage is increased
      4. Ensure that the intrinsics used are compatible with all targeted hardware configurations
  25. Further Reading and References
      For those who seek knowledge regarding SIMD
  26. The definitive reference for this presentation
      Computer Architecture: A Quantitative Approach, Fifth Edition
      1. In-depth looks at modern CPU and GPU architectures
      2. Thorough analysis of memory hierarchy design
      3. Covers data-, instruction-, and thread-level parallelism
      This book is a highly recommended read.
  27. References
      1. Corden, Martyn. “Intel® Compiler Options for Intel® SSE and Intel® AVX Generation (SSE2, SSE3, SSSE3, ATOM_SSSE3, SSE4.1, SSE4.2, ATOM_SSE4.2, AVX, AVX2) and Processor-specific Optimizations.” Intel® Developer Zone. Web. 30 Sept. 2013.
      2. Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. Amsterdam: Elsevier, 2012. Print.
      3. Jha, Ashish, and Darren Yee. “Increasing Memory Throughput With Intel® Streaming SIMD Extensions 4 (Intel® SSE4) Streaming Load.” Intel® Developer Zone. Web. 30 Sept. 2013.
      4. Siewert, Sam. “Using Intel® Streaming SIMD Extensions and Intel® Integrated Performance Primitives to Accelerate Algorithms.” Intel® Developer Zone. Web. 30 Sept. 2013.