x86 SIMD Instructions
Flavius Anton
University Politehnica of Bucharest
MMX
• introduced in 1997 with the Pentium MMX (P5 microarchitecture)
• 8 new 64 bit registers, MM0 to MM7
• were aliases for the low 64 bits of the existing 80 bit x87 FPU registers, so MMX and FPU code could not be freely mixed
• only integer operations
SSE
• introduced in 1999 with Pentium III
• 8 new 128 bit registers, XMM0 to XMM7
• solved MMX problems: dedicated registers (no FPU aliasing) and floating point support
• response to AMD’s 3DNow!
SSE
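• requires support from the Operating System (it must save and restore the XMM registers on context switches)
• requires 16 byte data alignment for aligned accesses such as MOVAPS (Cell déjà vu); MOVUPS accepts unaligned data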
• offers cacheability control
• offers data prefetching
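The alignment, cacheability and prefetching points above can be sketched with the SSE intrinsics from <xmmintrin.h>. This is a minimal illustration, not code from the original demo: the helper names is_aligned16 and scale_stream, the prefetch distance of 16 floats, and the choice of the NTA hint are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: nonzero when p is 16-byte aligned, as MOVAPS requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: dst[i] = k * src[i] for n floats (n a multiple of 4).
   dst must be 16-byte aligned because the streaming store requires it. */
static void scale_stream(float *dst, const float *src, int n, float k)
{
    assert(n % 4 == 0 && is_aligned16(dst));
    __m128 factor = _mm_set1_ps(k);
    for (int i = 0; i < n; i += 4) {
        /* Data prefetching: hint that elements further ahead are needed soon.
           The distance of 16 floats is an arbitrary choice for this sketch. */
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);
        /* MOVAPS when the source is aligned, MOVUPS otherwise. */
        __m128 v = is_aligned16(src + i) ? _mm_load_ps(src + i)
                                         : _mm_loadu_ps(src + i);
        /* Cacheability control: MOVNTPS writes around the cache. */
        _mm_stream_ps(dst + i, _mm_mul_ps(v, factor));
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}
```

Streaming stores tend to pay off only for large buffers that will not be read again soon; for small arrays a regular _mm_store_ps is usually the better default.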
SSE instructions
• data movement:
  • MOVUPS (unaligned)
  • MOVAPS (aligned)
• arithmetic (packed / scalar):
  • ADDPS / ADDSS
  • SUBPS / SUBSS
• logical:
  • ANDPS
  • XORPS
• shuffle:
  • SHUFPS
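As a rough illustration of the logical and shuffle instructions listed above, the sketch below uses their intrinsic forms: _mm_and_ps (ANDPS), _mm_xor_ps (XORPS) and _mm_shuffle_ps (SHUFPS). The function names keep_positive4, negate4 and reverse4 are made up for this example, and the comparison intrinsic _mm_cmpgt_ps is brought in only to build a mask for ANDPS.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* ANDPS: keep lanes that are > 0, zero the others.
   The compare yields all-ones bits in lanes where the test is true. */
static void keep_positive4(const float in[4], float out[4])
{
    assert(in && out);
    __m128 v = _mm_loadu_ps(in);
    __m128 mask = _mm_cmpgt_ps(v, _mm_setzero_ps());
    _mm_storeu_ps(out, _mm_and_ps(v, mask));
}

/* XORPS: flip the sign bit of every lane, i.e. out[i] = -in[i].
   -0.0f has only the sign bit set. */
static void negate4(const float in[4], float out[4])
{
    assert(in && out);
    __m128 sign = _mm_set1_ps(-0.0f);
    _mm_storeu_ps(out, _mm_xor_ps(_mm_loadu_ps(in), sign));
}

/* SHUFPS: reverse the order of the four lanes. */
static void reverse4(const float in[4], float out[4])
{
    assert(in && out);
    __m128 v = _mm_loadu_ps(in);
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}
```

With the input {-1, 2, -3, 4}, keep_positive4 gives {0, 2, 0, 4}, negate4 gives {1, -2, 3, -4} and reverse4 gives {4, -3, 2, -1}.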
SSE Parallel
X1        X2        X3        X4
Y1        Y2        Y3        Y4
X1 + Y1   X2 + Y2   X3 + Y3   X4 + Y4
SSE Scalar
X1        X2        X3        X4
Y1        Y2        Y3        Y4
X1        X2        X3        X4 + Y4
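The difference between the packed and scalar forms can be checked with a short sketch using _mm_add_ps (ADDPS) and _mm_add_ss (ADDSS). The wrapper names add_packed and add_scalar are hypothetical. Note that element 0 of each array lands in the lowest lane of the register, which is the lane drawn rightmost (X4, Y4) on the slides.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* ADDPS: all four lanes are added, out[i] = x[i] + y[i]. */
static void add_packed(const float x[4], const float y[4], float out[4])
{
    assert(x && y && out);
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}

/* ADDSS: only the lowest lane is added, out[0] = x[0] + y[0];
   the upper three lanes are copied unchanged from x. */
static void add_scalar(const float x[4], const float y[4], float out[4])
{
    assert(x && y && out);
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}
```

For x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, add_packed produces {11, 22, 33, 44} while add_scalar produces {11, 2, 3, 4}.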
SSE usages
• massively parallel
• floating point based
• regular memory access patterns
• data independent flow
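A workload that fits all four criteria is an element-wise operation over float arrays. The sketch below is one possible shape for such a loop, assuming nothing about alignment: it processes four floats per iteration with unaligned loads and finishes the leftover elements with plain scalar code. The name add_arrays is an illustrative choice.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* dst[i] = a[i] + b[i] for every i in [0, n). */
static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst && a && b);
    size_t i = 0;
    /* Main loop: four independent additions per ADDPS, no branches on data. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    /* Tail: the remaining 0 to 3 elements, one at a time. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The sequential, unit-stride accesses are what make the loop a good match: each load pulls in four neighbouring floats, and no element's result depends on another's.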
Example
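// A 16 byte = 128 bit vector struct
struct vector4 {
    float x, y, z, w;
};

// Add two constant vectors and return the resulting vector
// (MSVC inline assembly syntax)
struct vector4 SSE_add(const struct vector4 *v1, const struct vector4 *v2)
{
    struct vector4 result;
    __asm {
        MOV EAX, v1             // EAX = address of the first vector
        MOV EBX, v2             // EBX = address of the second vector
        MOVUPS XMM0, [EAX]      // unaligned load of v1
        MOVUPS XMM1, [EBX]      // unaligned load of v2
        ADDPS XMM0, XMM1        // four additions at once
        MOVUPS [result], XMM0   // store the sum
    }
    return result;
}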
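The __asm block above is MSVC syntax and is only accepted in 32 bit x86 builds. As a more portable alternative, the same addition can be sketched with compiler intrinsics, which GCC, Clang and MSVC all provide through <xmmintrin.h>. The name SSE_add_intrin is an assumption for this example and is not part of the original demo.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* A 16 byte = 128 bit vector struct */
struct vector4 {
    float x, y, z, w;
};

/* Hypothetical intrinsics counterpart of SSE_add. */
static struct vector4 SSE_add_intrin(const struct vector4 *v1,
                                     const struct vector4 *v2)
{
    assert(v1 && v2);
    struct vector4 result;
    __m128 a = _mm_loadu_ps((const float *)v1);  /* MOVUPS */
    __m128 b = _mm_loadu_ps((const float *)v2);  /* MOVUPS */
    _mm_storeu_ps((float *)&result, _mm_add_ps(a, b));  /* ADDPS + MOVUPS */
    return result;
}
```

With intrinsics the compiler handles register allocation, so there is no need to pick EAX, EBX or specific XMM registers by hand.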
Thank you!
• Code snippet: http://swarm.cs.pub.ro/~fanton/sse-demo
