Lect10 sse introduction
 

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • SSE (Streaming SIMD Extensions) Introduction
    • The SSE technology enhances the Intel x86 architecture in four ways:
      • 8 new 128-bit SIMD floating-point registers that can be directly addressed;
      • 50 new instructions that work on packed floating-point data;
      • 8 new instructions designed to control the cacheability of all MMX and 32-bit x86 data types, including the ability to stream data to memory without polluting the caches and to prefetch data before it is actually used;
      • 12 new instructions that extend the MMX instruction set.
    • SSE Cacheability Control
      • Data referenced by a program can have temporal locality (the data will be used again) or spatial locality (the data will be in adjacent locations, such as the same cache line).
      • But some multimedia data types are non-temporal (referenced once and not reused in the immediate future).
      • Non-temporal data should not overwrite the application's cached code and data: the cacheability control instructions enable the programmer to control caching so that non-temporal accesses minimize cache pollution.
      • In addition, the execution engine needs to be fed so that it does not stall waiting for data. SSE allows the programmer to prefetch data long before its final use to minimize memory latency.
      • Prior to SSE, read-miss latency, execution, and the subsequent store-miss latency occurred serially, so together they made up the total execution time.
    • SSE lets read-miss latency overlap execution through prefetching, and it allows store-miss latency to be reduced and overlapped with execution through streaming stores.
    • SSE instructions for minimizing cache pollution
      • The following three instructions provide programmatic control for minimizing cache pollution when writing data to memory from either MMX or SSE registers.
        • MASKMOVQ stores data from an MMX register to the location specified by the EDI register. The most significant bit in each byte of the second MMX mask register is used to selectively write the data of the first register on a per-byte basis. This instruction does not write-allocate (i.e., the processor will not fetch the corresponding cache line into the cache hierarchy prior to performing the store), and so minimizes cache pollution.
        • MOVNTQ stores data from an MMX register to memory; this instruction is implicitly weakly ordered, does not write-allocate, and minimizes cache pollution.
        • MOVNTPS stores data from a SIMD floating-point register to memory. The memory address must be aligned to a 16-byte boundary; if it is not aligned, a general protection exception will occur. The instruction is implicitly weakly ordered, does not write-allocate, and minimizes cache pollution.
      • PREFETCH loads either non-temporal data or temporal data into the specified cache level. As this instruction merely provides a hint to the hardware, it will not generate exceptions or faults.
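
The streaming store and prefetch instructions described above can be reached from C through the intrinsics in `<xmmintrin.h>`. The following is a minimal sketch, not a tuned implementation: the helper name `stream_copy`, the requirement that the length be a multiple of 4, and the prefetch distance of 64 floats are all assumptions made for illustration. `_mm_stream_ps` corresponds to MOVNTPS, `_mm_prefetch` with `_MM_HINT_NTA` to a non-temporal prefetch, and `_mm_sfence` orders the weakly ordered stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: copies n floats from src to dst using non-temporal
 * stores. Assumptions: both pointers are 16-byte aligned and n is a
 * multiple of 4. */
static void stream_copy(float *dst, const float *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && ((uintptr_t)src % 16) == 0);
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* Hint only: request data 64 floats ahead (an arbitrary,
         * illustrative distance) with the non-temporal hint. */
        if (i + 64 < n)
            _mm_prefetch((const char *)(src + i + 64), _MM_HINT_NTA);

        __m128 v = _mm_load_ps(src + i);  /* aligned 128-bit load        */
        _mm_stream_ps(dst + i, v);        /* MOVNTPS: no write-allocate  */
    }
    _mm_sfence();  /* make the weakly ordered stores globally visible */
}
```

A call such as `stream_copy(dst, src, 128)` on two 16-byte-aligned buffers copies the data while leaving the destination lines out of the cache; whether this is faster than an ordinary copy depends on the buffer size and the processor.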
    • SSE for most media processing applications
      • The SSE instruction set enables the programmer to develop algorithms that mix packed single-precision floating-point and integer data, using SSE and MMX instructions respectively.
      • This approach was chosen because most media processing applications have the following characteristics:
        • inherently parallel
        • wide dynamic range, hence floating-point based
        • regular memory access patterns
        • data-independent control flow.
    • Features of SSE technology
      • Intel SSE provides eight 128-bit general-purpose registers, each of which can be directly addressed using the register names XMM0 to XMM7.
      • Each register holds four 32-bit single-precision floating-point numbers, numbered 0 through 3.
      • MMX registers are mapped onto the x87 floating-point registers, so the EMMS instruction is required when passing from MMX code to x87 floating-point code.
      • Since the SIMD floating-point registers are a separate register file, MMX or floating-point instructions can be mixed with SSE instructions without executing a special instruction such as EMMS.
      • On the downside, the new registers require support from the operating system, since they must be saved when switching tasks.
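
To make the element numbering concrete, here is a small sketch in C using the `<xmmintrin.h>` intrinsics. The helper `xmm_element` is hypothetical and exists only for illustration; it spills a register to memory so that individual elements can be inspected, and element 0 ends up at the lowest address.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: returns element idx (0..3) of an XMM value by
 * writing the register to memory with an unaligned store (MOVUPS). */
static float xmm_element(__m128 v, int idx)
{
    float lanes[4];
    assert(idx >= 0 && idx < 4);
    _mm_storeu_ps(lanes, v);  /* element 0 lands at the lowest address */
    return lanes[idx];
}
```

Note that `_mm_set_ps` takes its arguments from element 3 down to element 0, so `_mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f)` places 0.0 in element 0 and 3.0 in element 3.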
    • PS / SS SSE instructions
      [Figures: SSE scalar operation; SSE parallel (packed) operation]
      SSE instructions operate in parallel on either all pairs or only the least significant pair of elements of the packed data operands. The packed instructions (PS suffix) operate on all four pairs of elements, while the scalar instructions (SS suffix) always operate on the least significant pair of elements of the two operands; for scalar operations, the three upper components of the first operand are passed through to the destination.
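
The difference between the packed and scalar forms can be seen by running the same inputs through both. This is an illustrative sketch only; the helper name `add_packed_or_scalar` and its `packed` flag are assumptions, while `_mm_add_ps` and `_mm_add_ss` are the intrinsics for ADDPS and ADDSS.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: adds a and b with ADDPS (packed != 0) or ADDSS
 * (packed == 0) and writes the four result elements to out[0..3]. */
static void add_packed_or_scalar(const float a[4], const float b[4],
                                 int packed, float out[4])
{
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* Packed: all four pairs are added. Scalar: only element 0 is added,
     * elements 1..3 are passed through from the first operand. */
    __m128 r = packed ? _mm_add_ps(va, vb) : _mm_add_ss(va, vb);
    _mm_storeu_ps(out, r);
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the packed form yields {11, 22, 33, 44}, while the scalar form yields {11, 2, 3, 4}.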
    • SSE instructions
      • The SSE set consists of 70 instructions: the following sections give a brief overview of each group of instructions in the SSE set and the instructions within each group.
      • Data movement instructions
      • Arithmetic instructions
      • Reciprocal instructions
      • Comparison instructions
      • Conversion instructions
      • Logical instructions
      • Additional SIMD integer instructions (SSE Primer)
      • Shuffle instructions
      • State Management instructions
      • Cacheability Control instructions
    • SSE Arithmetic
      • ADDPS (parallel) and ADDSS (scalar) add the pair of operands.
      • SUBPS (parallel) and SUBSS (scalar) subtract the pair of operands.
      • MULPS (parallel) and MULSS (scalar) multiply the pair of operands.
      • DIVPS (parallel) and DIVSS (scalar) divide the pair of operands.
      • SQRTPS (parallel) and SQRTSS (scalar) return the square root of the source operand to the destination register.
      • MAXPS (parallel) and MAXSS (scalar) return the maximum of the pair of operands: DestReg[i] = Max(DestReg[i], SrcReg[i])
      • MINPS (parallel) and MINSS (scalar) return the minimum of the pair of operands: DestReg[i] = Min(DestReg[i], SrcReg[i])
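
One common use of MAXPS and MINPS is clamping four values to a range without branches. The sketch below assumes a hypothetical helper `clamp4`; `_mm_max_ps`, `_mm_min_ps`, and `_mm_set1_ps` are the standard intrinsics, and the inputs are assumed not to be NaN.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical helper: clamps x[0..3] in place to the range [lo, hi].
 * Assumption: no element of x is NaN. */
static void clamp4(float x[4], float lo, float hi)
{
    assert(lo <= hi);
    __m128 v = _mm_loadu_ps(x);
    v = _mm_max_ps(v, _mm_set1_ps(lo));  /* MAXPS: raise values below lo */
    v = _mm_min_ps(v, _mm_set1_ps(hi));  /* MINPS: lower values above hi */
    _mm_storeu_ps(x, v);
}
```

Clamping {-2, 0.5, 3, 1} to [0, 1] gives {0, 0.5, 1, 1}.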
    • ADDSS xmm1, xmm2/m32
      Adds the low single-precision floating-point values from the source operand (second operand) and the destination operand (first operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.
      DEST[31-0] ← DEST[31-0] + SRC[31-0];
      Intrinsic equivalent: ADDSS __m128 _mm_add_ss(__m128 a, __m128 b)
    • ADDPS xmm1, xmm2/m128
      Performs a SIMD add of the four packed single-precision floating-point values from the source operand (second operand) and the destination operand (first operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.
      DEST[31-0] ← DEST[31-0] + SRC[31-0];
      DEST[63-32] ← DEST[63-32] + SRC[63-32];
      DEST[95-64] ← DEST[95-64] + SRC[95-64];
      DEST[127-96] ← DEST[127-96] + SRC[127-96];
      Intrinsic equivalent: ADDPS __m128 _mm_add_ps(__m128 a, __m128 b)
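
In practice ADDPS is usually applied in a loop over arrays, four elements per iteration. The following sketch shows one way to do that; the function name `add_arrays` and the choice of unaligned loads with a plain scalar loop for the leftover elements are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
static void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(dst && a && b);

    /* Main loop: one ADDPS handles four elements at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    /* Tail: the remaining 0 to 3 elements are added one by one. */
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Using the unaligned load and store keeps the sketch free of alignment requirements; when the arrays are known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` can be used instead.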
    • SUBSS xmm1, xmm2/m32
      Subtracts the low single-precision floating-point value in the source operand (second operand) from the low single-precision floating-point value in the destination operand (first operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.
      DEST[31-0] ← DEST[31-0] - SRC[31-0];
      Intrinsic equivalent: SUBSS __m128 _mm_sub_ss(__m128 a, __m128 b)
    • SUBPS xmm1, xmm2/m128
      Performs a SIMD subtract of the four packed single-precision floating-point values in the source operand (second operand) from the four packed single-precision floating-point values in the destination operand (first operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.
      DEST[31-0] ← DEST[31-0] - SRC[31-0];
      DEST[63-32] ← DEST[63-32] - SRC[63-32];
      DEST[95-64] ← DEST[95-64] - SRC[95-64];
      DEST[127-96] ← DEST[127-96] - SRC[127-96];
      Intrinsic equivalent: SUBPS __m128 _mm_sub_ps(__m128 a, __m128 b)
    • MULSS xmm1, xmm2/m32
      Multiplies the low single-precision floating-point value from the source operand (second operand) by the low single-precision floating-point value in the destination operand (first operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.
      DEST[31-0] ← DEST[31-0] * SRC[31-0];
      Intrinsic equivalent: MULSS __m128 _mm_mul_ss(__m128 a, __m128 b)
    • MULPS xmm1, xmm2/m128
      Performs a SIMD multiply of the four packed single-precision floating-point values from the source operand (second operand) and the destination operand (first operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.
      DEST[31-0] ← DEST[31-0] * SRC[31-0];
      DEST[63-32] ← DEST[63-32] * SRC[63-32];
      DEST[95-64] ← DEST[95-64] * SRC[95-64];
      DEST[127-96] ← DEST[127-96] * SRC[127-96];
      Intrinsic equivalent: MULPS __m128 _mm_mul_ps(__m128 a, __m128 b)
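
MULPS combined with ADDPS gives the core of a dot product: multiply four pairs at once, accumulate four partial sums, and add the partial sums at the end. The sketch below is one possible arrangement, not a reference implementation; the name `dot` and the restriction that n be a multiple of 4 are assumptions. Because the partial sums are added in a different order than a plain scalar loop would use, the result can differ from the scalar result in the last bits.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: dot product of a[0..n) and b[0..n).
 * Assumption: n is a multiple of 4. */
static float dot(const float *a, const float *b, size_t n)
{
    float lanes[4];
    __m128 acc = _mm_setzero_ps();  /* four running partial sums */
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        __m128 prod = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, prod);
    }
    /* Horizontal step: spill the register and add its four elements. */
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

For a = {1, 2, ..., 8} and b filled with 2, the function returns 72.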
    • DIVSS xmm1, xmm2/m32
      Divides the low single-precision floating-point value in the destination operand (first operand) by the low single-precision floating-point value in the source operand (second operand), and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.
      DEST[31-0] ← DEST[31-0] / SRC[31-0];
      Intrinsic equivalent: DIVSS __m128 _mm_div_ss(__m128 a, __m128 b)
    • SQRTSS xmm1, xmm2/m32
      Computes the square root of the low single-precision floating-point value in the source operand (second operand) and stores the single-precision floating-point result in the destination operand. The source operand can be an XMM register or a 32-bit memory location. The destination operand is an XMM register. The three high-order doublewords of the destination operand remain unchanged.
      DEST[31-0] ← SQRT(SRC[31-0]);
      Intrinsic equivalent: SQRTSS __m128 _mm_sqrt_ss(__m128 a)
    • SQRTPS xmm1, xmm2/m128
      Performs a SIMD computation of the square roots of the four packed single-precision floating-point values in the source operand (second operand) and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.
      DEST[31-0] ← SQRT(SRC[31-0]);
      DEST[63-32] ← SQRT(SRC[63-32]);
      DEST[95-64] ← SQRT(SRC[95-64]);
      DEST[127-96] ← SQRT(SRC[127-96]);
      Intrinsic equivalent: SQRTPS __m128 _mm_sqrt_ps(__m128 a)
    • DIVPS xmm1, xmm2/m128
      Performs a SIMD divide of the four packed single-precision floating-point values in the destination operand (first operand) by the four packed single-precision floating-point values in the source operand (second operand), and stores the packed single-precision floating-point results in the destination operand. The source operand can be an XMM register or a 128-bit memory location. The destination operand is an XMM register.
      DEST[31-0] ← DEST[31-0] / SRC[31-0];
      DEST[63-32] ← DEST[63-32] / SRC[63-32];
      DEST[95-64] ← DEST[95-64] / SRC[95-64];
      DEST[127-96] ← DEST[127-96] / SRC[127-96];
      Intrinsic equivalent: DIVPS __m128 _mm_div_ps(__m128 a, __m128 b)
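
SQRTPS and DIVPS often appear together, for example when normalizing vectors. The sketch below normalizes four 2-D vectors per iteration. It assumes a hypothetical helper `normalize2d`, a layout with separate x and y arrays, a count that is a multiple of 4, and no zero-length vectors (a zero length would divide by zero).

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: scales each vector (x[i], y[i]) to unit length.
 * Assumptions: n is a multiple of 4 and no vector has zero length. */
static void normalize2d(float *x, float *y, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        /* len = sqrt(x*x + y*y), computed for four vectors at once */
        __m128 len2 = _mm_add_ps(_mm_mul_ps(vx, vx), _mm_mul_ps(vy, vy));
        __m128 len = _mm_sqrt_ps(len2);          /* SQRTPS */
        _mm_storeu_ps(x + i, _mm_div_ps(vx, len));  /* DIVPS */
        _mm_storeu_ps(y + i, _mm_div_ps(vy, len));
    }
}
```

For the vector (3, 4) the length is 5, so the normalized result is approximately (0.6, 0.8).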