
Lect.10.arm soc.4 neon



  1. 1. Slide 1: ARM Architecture & NEON
        Ian Rickards, Stanford University, 28 Apr 2010

        Slide 2: A brief history of ARM
        - First ARM prototype came alive on 26-Apr-1985: 3um technology, 24800 transistors, 50mm², consumed 120mW of power
        - Acorn’s commercial ARM2 processor: 8-MHz, 26-bit addressing, 3-stage pipeline
        - ARM founded in October 1990 as a separate company (Apple had 43% stake)
        - ARM610 for Newton in 1992, ARM7TDMI for Nokia in 1994

        Slide 3: ARM 25 years later: Cortex-A9 MP
        - 1-4 way MP with optimized MESI
        - 16KB, 32KB, 64KB I & D caches
        - 128KB-8MB L2
        - Multi-issue, Speculation, Renaming, OOO
        - High performance FPU option
        - NEON SIMD option
        - Thumb-2
        - AXI bus
        - Gatecount: 500K (32KB I/D L1’s), 600K (core), 500K (NEON)
        - 40G “Low Power” macro: ~5mm², 800MHz, 0.5W
        - 40G “High Performance” macro: ~7mm², 2GHz (typ), 2W

        Slide 4: Cortex-A9 Processor Microarchitecture
        - Introduces out-of-order instruction issue and completion
        - Register renaming to enable execution speculation
        - Non-blocking memory system with load-store forwarding
        - Fast loop mode in instruction pre-fetch to lower power consumption
  2. 2. Slide 5: Cortex-A9 MPCore Multicore Structure
        - Configurable between 1 and 4 CPUs with optional NEON and/or Floating-point Unit
        - Hardware coherence for cache, MMU and TLB maintenance operations
        - Coherent access to processor caches from accelerators and DMA
        - Flexible configuration and power-aware interrupt controller
        - Design flexibility over memory throughput and latency
        - Secure and virtualization aware interrupt and IPI communications
        - Block diagram labels: 4x Cortex-A9 CPU, each with FPU/NEON, TRACE, Instruction Cache and Data Cache; Snoop Control Unit (SCU); Generalized Interrupt Control and Distribution; Accelerator Coherence Port; Cache-2-Cache Transfers; Snoop Filtering; Timers; Advanced Bus Interface Unit; Primary AMBA 3 64bit Interface; Optional 2nd I/F with Address Filtering; L2 Cache Controller (PL310) 128K-8MB

        Slide 6: Hard Macro Configuration and Floorplan
        - Osprey configuration includes level 2 cache controller and Cortex A9 integration level
        - Top level includes Coresight PTM, CTI and CTM
        - Implementation using r1p1 version of Cortex A9
        - Dual core
        - 32k I$ and D$
        - NEON present on both cores
        - PTM interface present
        - 128 interrupts
        - ACP present
        - Two AXI master ports
        - Level 2 cache memories external (interface exposed)
        - Figures: falcon_cpu floorplan; Elba top level floorplan

        Slide 7: Why is ARM looking at “G” processes?
        - “G” can achieve around double the MHz of “LP”
        - Active power is lower on “G” than “LP”
        - Example: push 40LP to 800MHz, to compare with the 800MHz MID macro
        - The estimated LP numbers correlate to an accelerated implementation of an A8
        - G is close in terms of power if lowered to the same performance as on LP
        - G can scale much higher in terms of performance than LP
        - Figure label: Osprey

        Slide 8: Understanding power
        - Fundamental power parameters
        - Average power => battery life
        - Thermal power: sustained power @ max performance
        - Key requirement is “run and power off” quickly
        - Chart (power over time for GUI updates, web page render, music): traditional LP process compared with 40G process running a 2-3x faster clock and powering off between bursts
  3. 3. Slide 9: Power Domains
        - HiP and MID macros have the same power domains
        - Both use distributed coarse grain power switches
        - Power plan for CPUs is symmetric
        - A9 core and its L1 are power gated in lockstep
        - Note that all power domains are only ON or OFF; there is no hardware retention mode
        - Software routine enables retention to RAM
        - Diagram labels (“Osprey macro”): A9_PL310; A9_PL310_noram; Data Engine 0; Data Engine 1; PTM/Debug; A9 CORE 0 + 32K I/D; A9 CORE 1 + 32K I/D; SCU + PL310_noram; L2 Cache RAM 512/1024KB

        Slide 10: Single-thread Coremarks/MHz
        - Single-thread performance is key for GUI based applications

        Processor    Coremarks/MHz
        Atom         1.85
        Cortex-A9    2.95
        Cortex-A8    2.72
        1004K        2.33
        74K          2.30

        Slide 11: Floating Point Performance

        Slide 12: Higher Flash Actionscript from A9
        - Chart label: Intel
  4. 4. Slide 13: ARM Architecture evolution
        Some not-entirely-RISC features
        - LDM / STM
        - Full predicated execution (ADDEQ r0, r1, r2)
        - Carefully designed with customer/partner input considering gatecount
        Thumb
        - 16-bit instruction set (mostly using r0-r7) selected for compiler requirements
        - Design goals: performance from 16-bit wide ROM, codesize
        - Thumb-2 in Cortex-A extends original Thumb (allows 32-bit/16-bit mix)
        - Beneficial today – better performance from small caches
        Jazelle
        - CPU mode allows direct execution of Java bytecodes
        - ~60% of Java bytecodes directly executed by datapath
        - Top of Java stack stored in registers
        - Widely used in Nokia & DoCoMo handsets

        Slide 14: Dummies’ guide to Si implementation
        Basic Fab tech
        - 65nm, 40nm, 32nm, 28nm, etc.
        G vs. LP technology
        - 40G is 0.9V process, 40LP is 1.1V process
        - Much lower leakage with LP, but half the performance
        - Intermediate “LPG” from TSMC too! Island of G within LP
        Vt’s – each Vt requires additional mask step
        - HVt – lower leakage, but slower
        - RVt – regular Vt
        - LVt – faster, but high leakage esp. at high temperature
        Cell library track size
        - 9-track, 12-track, 15-track (bigger => more powerful)
        Backed off implementation vs. pushed implementation
        High-K metal gate
        Clock gating … Well biasing…

        Slide 15: ARM Architecture Evolution
        - Chart: Key Technology Additions by Architecture Generation (V5, V6, V7 A&R, V7 M; ARM9, ARM10, ARM11)
        - Chart labels: Jazelle®; VFPv2; SIMD; TrustZone™; Thumb®-2; VFPv3; NEON™ Adv SIMD; Improved Media and DSP; Thumb-EE Execution Environments: Improved memory use; Thumb-2 Only; Low Cost MCU

        Slide 16: What is NEON?
        - NEON is a wide SIMD data processing architecture
        - Extension of the ARM instruction set
        - 32 registers, 64-bits wide (dual view as 16 registers, 128-bits wide)
        - NEON instructions perform “Packed SIMD” processing
        - Registers are considered as vectors of elements of the same data type
        - Data types can be: signed/unsigned 8-bit, 16-bit, 32-bit, 64-bit, single prec. float
        - Instructions perform the same operation in all lanes
        - Diagram: source registers Dn and Dm, split into elements, feed one operation per lane into destination register Dd
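The "same operation in all lanes" idea from slide 16 can be sketched in portable C. This is a scalar model rather than NEON code: the type `d_reg_u16` and the function `lanewise_add_u16` are hypothetical names chosen for illustration, and the function only mimics what `VADD.I16 Dd, Dn, Dm` does to four 16-bit lanes.

```c
#include <stdint.h>

/* Hypothetical model of one 64-bit D register viewed as four 16-bit lanes. */
typedef struct {
    uint16_t lane[4];
} d_reg_u16;

/* Scalar model of VADD.I16 Dd, Dn, Dm.
 * Each lane is added independently; a result that does not fit wraps
 * around inside its own lane and never carries into the next lane. */
static d_reg_u16 lanewise_add_u16(d_reg_u16 n, d_reg_u16 m)
{
    d_reg_u16 d;
    for (int i = 0; i < 4; i++) {
        d.lane[i] = (uint16_t)(n.lane[i] + m.lane[i]);
    }
    return d;
}
```

On real hardware the four additions happen in a single instruction; the loop here only spells out the per-lane semantics.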
  5. 5. Slide 17: Data Types
        - NEON natively supports a set of common data types
        - Integer and fixed-point: 8-bit, 16-bit, 32-bit and 64-bit
        - 32-bit single-precision floating-point
        - Data types are represented using a bit-size and format letter
        - 8/16-bit signed and unsigned integers, polynomials: .8, .I8, .S8, .U8, .P8, .16, .I16, .S16, .U16, .P16
        - 32-bit signed and unsigned integers, floats: .32, .I32, .S32, .U32, .F32
        - 64-bit signed and unsigned integers: .64, .I64, .S64, .U64

        Slide 18: Registers
        - NEON provides a 256-byte register file
        - Distinct from the core registers
        - Extension to the VFPv2 register file (VFPv3)
        - Two explicitly aliased views: 32 x 64-bit registers (D0-D31), 16 x 128-bit registers (Q0-Q15)
        - Aliasing: Q0 = D0,D1; Q1 = D2,D3; ...; Q15 = D30,D31
        - Enables register trade-off: vector length against available registers
        - Also uses the summary flags in the VFP FPSCR
        - Adds a QC integer saturation summary flag
        - No per-lane flags, so ‘carry’ is handled using a wider result (16-bit + 16-bit -> 32-bit)

        Slide 19: Vectors and Scalars
        - Registers hold one or more elements of the same data type
        - Vn can be used to reference either a 64-bit Dn or 128-bit Qn register
        - A register, data type combination describes a vector of elements
        - Some instructions can reference individual scalar elements
        - Scalar elements are referenced using the array notation Vn[x]
        - Array ordering is always from the least significant bit
        - Diagram: 64-bit Dn (bits 63..0) holding 1 x I64, 2 x S32 or 8 x S8; 128-bit Qn (bits 127..0) holding 4 x F32 or 16 x S8; Q0 as F32 lanes Q0[3] Q0[2] Q0[1] Q0[0]

        Slide 20: NEON in Audio
        - FFT: 256-point, 16-bit signed complex numbers
        - FFT is a key component of AAC, voice/pattern recognition etc.
        - Hand optimized assembler in both cases

        FFT time                             No NEON (v6 SIMD asm)   With NEON (v7 NEON asm)
        Cortex-A8 500MHz, actual silicon     15.2 us                 3.8 us (x 4.0 performance)

        - Extreme example: FFT in ffmpeg: 12x faster
        - C code -> handwritten asm
        - Scalar -> vector processing
        - Unpipelined FPU -> pipelined NEON single precision FPU
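The aliased D/Q register views and the `Vn[x]` scalar notation from slides 18 and 19 can be modelled with a small C union. This is an illustrative sketch only: `q_reg_u16`, `set_q_lane` and `get_d_lane` are hypothetical names, and the model covers just the 16-bit lane view.

```c
#include <stdint.h>

/* Hypothetical model of one Q register and the two D registers it aliases,
 * all viewed as 16-bit lanes. Lane 0 is the least significant element. */
typedef union {
    uint16_t q[8];     /* Qn viewed as 8 x 16-bit lanes           */
    uint16_t d[2][4];  /* d[0] is D(2n), d[1] is D(2n+1), 4 lanes */
} q_reg_u16;

/* Write one scalar through the Q view, i.e. Qn[x] = v. */
static void set_q_lane(q_reg_u16 *r, int x, uint16_t v)
{
    r->q[x] = v;
}

/* Read one scalar through the D view: which = 0 selects D(2n),
 * which = 1 selects D(2n+1); x is the lane inside that D register. */
static uint16_t get_d_lane(const q_reg_u16 *r, int which, int x)
{
    return r->d[which][x];
}
```

Writing lane 5 through the Q view and reading lane 1 of the upper D register returns the same value, which is the register trade-off the slide describes: the same storage is either one long vector or two short ones.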
  6. 6. Slide 21: How to use NEON
        OpenMAX DL library
        - Library of common codec components and signal processing routines
        - Status: Released on http://www.arm.com/products/esd/openmax_home.html
        Vectorizing compilers
        - Exploit NEON SIMD automatically with existing source code
        - Status: Released (in RVDS 3.1 Professional and later)
        - Status: Codesourcery 2007q3 gcc and later
        C intrinsics
        - C function call interface to NEON operations
        - Supports all data types and operations supported by NEON
        - Status: Released (in RVDS 3.0+ and Codesourcery 2007q3 gcc)
        Assembler
        - For those who really want to optimize at the lowest level
        - Status: Released (in RVDS 3.0+ & Codesourcery 2007q3 gcc/gas)

        Slide 22: For NEON instruction reference
        - Official NEON instruction set reference is “Advanced SIMD” in ARM Architecture Reference Manual v7 A & R edition
        - Available to partners & www.arm.com request system

        Slide 23: ARM RVDS & gcc vectorising compiler
        Source (test.c):
            int a[256], b[256], c[256];
            void foo(void) {
                int i;
                for (i=0; i<256; i++){
                    a[i] = b[i] + c[i];
                }
            }
        armcc -S --cpu cortex-a8 -O3 -Otime --vectorize test.c
            |L1.16|
                VLD1.32  {d0,d1},[r0]!
                SUBS     r3,r3,#1
                VLD1.32  {d2,d3},[r1]!
                VADD.I32 q0,q0,q1
                VST1.32  {d0,d1},[r2]!
                BNE      |L1.16|
        gcc -S -O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -ftree-vectorizer-verbose=6 test.c
            .L2:
                add      r1, r0, ip
                add      r3, r0, lr
                add      r2, r0, r4
                add      r0, r0, #8
                cmp      r0, #1024
                fldd     d7, [r3, #0]
                fldd     d6, [r2, #0]
                vadd.i32 d7, d7, d6
                fstd     d7, [r1, #0]
                bne      .L2
        - armcc generates better NEON code (gcc can use Q-regs with ‘-mvectorize-with-neon-quad’)

        Slide 24: Intrinsics
        - Include intrinsics header file:
            #include <arm_neon.h>
        - Use special NEON data types which correspond to D and Q registers, e.g.
            int8x8_t    D-register containing 8x 8-bit elements
            int16x4_t   D-register containing 4x 16-bit elements
            int32x4_t   Q-register containing 4x 32-bit elements
        - Use special intrinsics versions of NEON instructions:
            vin1 = vld1q_s32(ptr);
            vout = vaddq_s32(vin1, vin2);
            vst1q_s32(ptr, vout);
        - Strongly typed! Use vreinterpret_s16_s32( ) to change the type
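The intrinsics from slide 24 can be combined into a version of the array addition from slide 23. The sketch below is one possible way to write it, not the code either compiler generates: the function name `add_i32` is hypothetical, and the scalar loop is an added fallback so that the same file also builds on machines without NEON.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* a[i] = b[i] + c[i] for n elements (hypothetical helper name). */
static void add_i32(int32_t *a, const int32_t *b, const int32_t *c, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four 32-bit lanes per iteration, using Q registers. */
    for (; i + 4 <= n; i += 4) {
        int32x4_t vb = vld1q_s32(b + i);
        int32x4_t vc = vld1q_s32(c + i);
        vst1q_s32(a + i, vaddq_s32(vb, vc));
    }
#endif
    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

Note the argument order of `vst1q_s32`: destination pointer first, vector second.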
  7. 7. Slide 25: NEON in opensource
        Bluez – official Linux Bluetooth protocol stack
        - NEON sbc audio encoder
        Pixman (part of cairo 2D graphics library)
        - Compositing/alpha blending
        - X.Org, Mozilla Firefox, fennec, & Webkit browsers
        - e.g. fbCompositeSolidMask_nx8x0565neon 8x faster using NEON
        ffmpeg – libavcodec
        - LGPL media player used in many Linux distros
        - NEON Video: MPEG-2, MPEG-4 ASP, H.264 (AVC), VC-1, VP3, Theora
        - NEON Audio: AAC, Vorbis, WMA
        x264 – Google Summer Of Code 2009
        - GPL H.264 encoder – e.g. for video conferencing
        Android – NEON optimizations
        - Skia library, S32A_D565_Opaque 5x faster using NEON
        - Available in Google Skia tree from 03-Aug-2009
        Eigen2 linear algebra library
        Ubuntu 09.04 – supports NEON
        - NEON versions of critical shared-libraries

        Slide 26: Many different levels of parallelism
        - Multi-issue parallelism
        - NEON SIMD parallelism
        - Multi-core parallelism

        Slide 27: ffmpeg (libavcodec) performance
        - git.ffmpeg.org snapshot 21-Sep-09
        - YouTube HQ video decode: 480x270, 30fps, including AAC audio
        - Real silicon measurements: OMAP3 Beagleboard, ARM A9TC
        - NEON ~2x overall performance

        Slide 28: Scalability with SMP on Cortex-A9
  8. 8. Slide 29: Skia library S32A_D565_Opaque

        Size   Reference (C)   Google v6 (asm)   NEON (asm)   RVDS
        60     100%            128%              24%          64%
        64     100%            128%              22%          68%
        68     100%            127%              23%          63%
        980    100%            73%               23%          58%
        986    100%            73%               23%          58%

        Slide 30: NEON optimization example

        Slides 31-32: Processing code / Cortex-A8 TRM
        The two listings appear side by side in the source.

        First listing:
            vmovn.u16   d4, q12
            vshr.u16    q11, q12, #5
            vshr.u16    q10, q12, #6+5
            vmovn.u16   d5, q11
            vmovn.u16   d6, q10
            vshl.u8     d4, d4, #3
            vshl.u8     d5, d5, #2
            vshl.u8     d6, d6, #3
            vqadd.u8    d6, d6, d0
            vqadd.u8    d5, d5, d1
            vqadd.u8    d4, d4, d2
            vshll.u8    q10, d6, #8
            vshll.u8    q3, d5, #8
            vshll.u8    q2, d4, #8
            vsri.u16    q10, q3, #5
            vsri.u16    q10, q2, #11

        Second listing:
            vshr.u16    q8, q14, #5
            vshr.u16    q9, q13, #6
            vaddhn.u16  d6, q14, q8
            vshr.u16    q8, q12, #5
            vaddhn.u16  d5, q13, q9
            vqadd.u8    d6, d6, d0
            vaddhn.u16  d4, q12, q8
            vmovl.u8    q14, d31
            vmovl.u8    q13, d31
            vmovl.u8    q12, d31
            vmvn.8      d30, d3
            vmlal.u8    q14, d30, d6
            vmlal.u8    q13, d30, d5
            vmlal.u8    q12, d30, d4
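For readers unfamiliar with what a routine named S32A_D565_Opaque has to compute, a per-pixel scalar reference can help when reading the NEON listings. The sketch below is not Skia's code and is not a translation of the assembly above; it is a minimal model that assumes a premultiplied 32-bit source with alpha in the top byte (then red, green, blue) blended "source over" onto an RGB565 destination. The names `div255`, `clamp255` and `blend_argb8888_over_rgb565` are hypothetical, and the rounding may differ from the real implementations.

```c
#include <stdint.h>

/* Rounded division by 255, for products of two 8-bit values. */
static unsigned div255(unsigned x)
{
    return (x + 127u) / 255u;
}

/* Saturate to 8 bits, like a saturating byte add would. */
static unsigned clamp255(unsigned x)
{
    return x > 255u ? 255u : x;
}

/* Assumed layout: src = 0xAARRGGBB, premultiplied; dst = RGB565. */
static uint16_t blend_argb8888_over_rgb565(uint32_t src, uint16_t dst)
{
    unsigned sa = (src >> 24) & 0xFFu;
    unsigned sr = (src >> 16) & 0xFFu;
    unsigned sg = (src >> 8) & 0xFFu;
    unsigned sb = src & 0xFFu;

    /* Unpack 5/6/5 fields and widen them to 8 bits. */
    unsigned dr = (dst >> 11) & 0x1Fu;
    unsigned dg = (dst >> 5) & 0x3Fu;
    unsigned db = dst & 0x1Fu;
    unsigned dr8 = (dr << 3) | (dr >> 2);
    unsigned dg8 = (dg << 2) | (dg >> 4);
    unsigned db8 = (db << 3) | (db >> 2);

    /* result = src + dst * (255 - alpha) / 255, per channel. */
    unsigned inv = 255u - sa;
    unsigned r = clamp255(sr + div255(dr8 * inv));
    unsigned g = clamp255(sg + div255(dg8 * inv));
    unsigned b = clamp255(sb + div255(db8 * inv));

    /* Repack to RGB565 by dropping the low bits. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

A fully transparent source leaves the destination pixel unchanged, and a fully opaque source replaces it with the source colour truncated to 5/6/5 bits.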
  9. 9. Slide 33: Multiple 1-Element Structure Access
        - VLD1, VST1 provide standard array access
        - An array of structures containing a single component is a basic array
        - List can contain 1, 2, 3 or 4 consecutive registers
        - Transfer multiple consecutive 8, 16, 32 or 64-bit elements
        - VLD1.16 {D7}, [R4], R3: loads x0..x3 from [R4], [R4]+2, [R4]+4, [R4]+6 into D7 (x3 x2 x1 x0), then adds R3 to R4
        - VST1.16 {D3,D4}, [R1]: stores D3 (x3 x2 x1 x0) and D4 (x7 x6 x5 x4) to [R1] .. [R1]+14

        Slide 34: Quick review of NEON instructions

        Slide 35: Addition: Basic
        - NEON supports various useful forms of basic addition
        Normal Addition - VADD, VSUB
        - Floating-point
        - Integer (8-bit to 64-bit elements)
        - 64-bit and 128-bit registers
            VADD.I16 D0, D1, D2
            VSUB.F32 Q7, Q1, Q4
            VADD.I8  Q15, Q14, Q15
            VSUB.I64 D0, D30, D5
        Long Addition - VADDL, VSUBL
        - Promotes both inputs before operation
        - Signed/unsigned (8-bit to 32-bit source elements)
            VADDL.U16 Q1, D7, D8
            VSUBL.S32 Q8, D1, D5
        Wide Addition - VADDW, VSUBW
        - Promotes one input before operation
        - Signed/unsigned (8-bit to 32-bit source elements)
            VADDW.U8  Q1, Q7, D8
            VSUBW.S16 Q8, Q1, D5

        Slide 36: Example – adding all lanes
        - Input in Q0 (D0 and D1): u16 input values
            VPADDL.U16 Q0, Q0
        - Now Q0 contains 4x u32 values (with 15 headroom bits)
            VPADD.U32 D0, D0, D1
        - Reducing/folding operation needs 1 bit of headroom
            VPADDL.U32 D0, D0
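The three-instruction reduction on slide 36 can be followed step by step with a scalar model. The functions below are hypothetical stand-ins for `VPADDL.U16`, `VPADD.U32` and `VPADDL.U32`, written in portable C so the widening at each step is explicit; they are a sketch of the semantics, not NEON code.

```c
#include <stdint.h>

/* Model of VPADDL.U16 Q0, Q0: add adjacent pairs of 8 x u16,
 * widening each sum to u32 (4 results). */
static void vpaddl_u16_model(const uint16_t in[8], uint32_t out[4])
{
    for (int i = 0; i < 4; i++) {
        out[i] = (uint32_t)in[2 * i] + in[2 * i + 1];
    }
}

/* Model of VPADD.U32 D0, D0, D1: pairwise add without widening.
 * in[0..1] plays the role of D0, in[2..3] the role of D1. */
static void vpadd_u32_model(const uint32_t in[4], uint32_t out[2])
{
    out[0] = in[0] + in[1];
    out[1] = in[2] + in[3];
}

/* Model of VPADDL.U32 D0, D0: add the last pair, widening to u64. */
static uint64_t vpaddl_u32_model(const uint32_t in[2])
{
    return (uint64_t)in[0] + in[1];
}

/* Sum of all eight u16 lanes, following the slide's sequence. */
static uint64_t sum8_u16(const uint16_t in[8])
{
    uint32_t four[4];
    uint32_t two[2];
    vpaddl_u16_model(in, four);
    vpadd_u32_model(four, two);
    return vpaddl_u32_model(two);
}
```

After the first step each 32-bit lane holds at most a 17-bit value, so the non-widening `VPADD.U32` cannot overflow; this is the headroom the slide refers to.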
  10. 10. Slide 37: Exercise 2 - summing a vector
        - Diagram: tree of pairwise additions reducing the lanes of D0 and D1 to a single sum in D0

        Slide 38: Some NEON clever features

        Slide 39: Data Movement: Table Lookup
        - Uses byte indexes to control byte look up in a table
        - Table is a list of 1, 2, 3 or 4 adjacent registers
        - Example: VTBL.8 D0, {D1, D2}, D3
            D3 (indexes)     11  4  8 13 26  8  0  3
            {D1,D2} (table)   p  o  n  m  l  k  j  i  h  g  f  e  d  c  b  a
            D0 (result)       l  e  i  n  0  i  a  d
        - VTBL: out of range indexes generate 0 result
        - VTBX: out of range indexes leave destination unchanged

        Slide 40: Element Load Store Instructions
        - All treat memory as an array of structures (AoS)
        - SIMD registers are treated as structure of arrays (SoA)
        - Enables interleaving/de-interleaving for efficient SIMD processing
        - Transfer up to 256-bits in a single instruction
        - Diagram: memory holding 3-element structures as elements x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 ...
        - Three forms of Element Load Store instructions are provided
        - Forms distinguished by type of register list provided:
            Multiple Structure Access             e.g. {D0, D1}
            Single Structure Access               e.g. {D0[2], D1[2]}
            Single Structure Load to all lanes    e.g. {D0[], D1[]}
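The table lookup rules from slide 39 are easy to check with a scalar model. The sketch below uses hypothetical function names `vtbl8_model` and `vtbx8_model`; index 0 of each array is the least significant byte lane, so the slide's example reads right to left.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of VTBL.8 Dd, {table}, Dm: table_len is 8, 16, 24 or 32 bytes
 * (1 to 4 D registers). Out-of-range indexes produce 0. */
static void vtbl8_model(uint8_t dst[8], const uint8_t *table,
                        size_t table_len, const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = (idx[i] < table_len) ? table[idx[i]] : 0;
    }
}

/* Model of VTBX.8: same lookup, but out-of-range indexes leave the
 * existing destination byte unchanged. */
static void vtbx8_model(uint8_t dst[8], const uint8_t *table,
                        size_t table_len, const uint8_t idx[8])
{
    for (int i = 0; i < 8; i++) {
        if (idx[i] < table_len) {
            dst[i] = table[idx[i]];
        }
    }
}
```

With the 16-byte table `a`..`p` and the slide's indexes (3, 0, 8, 26, 13, 8, 4, 11 from lane 0 upward), the result lanes are d, a, i, 0, n, i, e, l, matching the diagram.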
  11. 11. Slide 41: Multiple 2-Element Structure Access
        - VLD2, VST2 provide access to multiple 2-element structures
        - List can contain 2 or 4 registers
        - Transfer multiple consecutive 8, 16, or 32-bit 2-element structures
        - VLD2.16 {D2,D3}, [R1]: memory x0 y0 x1 y1 x2 y2 x3 y3 -> D2 = x3 x2 x1 x0, D3 = y3 y2 y1 y0
        - VLD2.16 {D0,D1,D2,D3}, [R3]!: memory x0 y0 ... x7 y7 -> D0 = x3 x2 x1 x0, D1 = x7 x6 x5 x4, D2 = y3 y2 y1 y0, D3 = y7 y6 y5 y4

        Slide 42: Multiple 3/4-Element Structure Access
        - VLD3/4, VST3/4 provide access to 3 or 4-element structures
        - Lists contain 3/4 registers; optional space for building 128-bit vectors
        - Transfer multiple consecutive 8, 16, or 32-bit 3/4-element structures
        - VST3.16 {D3,D4,D5}, [R1]: D3 = x3 x2 x1 x0, D4 = y3 y2 y1 y0, D5 = z3 z2 z1 z0 -> memory x0 y0 z0 x1 y1 z1 ... z3
        - VLD3.16 {D0,D2,D4}, [R1]!: memory x0 y0 z0 x1 y1 z1 ... z3 -> D0 = x3 x2 x1 x0, D2 = y3 y2 y1 y0, D4 = z3 z2 z1 z0

        Slide 43: Logical
        - NEON supports bitwise logical operations
        VAND, VBIC, VEOR, VORN, VORR
        - Bitwise logical operation
        - Independent of data type
        - 64-bit and 128-bit registers
            VAND D0, D0, D1
            VORR Q0, Q1, Q15
            VEOR Q7, Q1, Q15
            VORN D15, D14, D1
            VBIC D0, D30, D2
        VBIT, VBIF, VBSL
        - Bitwise multiplex operations
        - Insert True, Insert False, Select
        - 3 versions overwrite different registers
        - 64-bit and 128-bit registers
        - Used with masks to provide selection
            VBIT D1, D0, D2

        Slide 44: Alignment hints on NEON load/store
        - NEON data load/store: VLDn/VSTn
        - Full unaligned support for NEON data access
        - Instruction contains an ‘alignment hint’ which permits implementations to be faster when the address is aligned and the hint is specified
        - Usage: base address specified as [<Rn>:<align>]
        - Note it is a programming error to specify the hint but use an incorrectly aligned address
        - Alignment hint can be :64, :128, :256 (bits) depending on number of D-regs
            VLD1.8 {D0}, [R1:64]
            VLD1.8 {D0,D1}, [R4:128]!
            VLD1.8 {D0,D1,D2,D3}, [R7:256], R2
        - ARM ARM uses “@” but this is not recommended in source code
        - GNU gas currently only accepts “[Rn,:128]” syntax – note extra “,”
        - Applies to both Cortex-A8 and Cortex-A9 (see TRM for detailed instruction timing)
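The de-interleaving done by the structure loads on slides 41 and 42 can be illustrated with a scalar model of the two-register `VLD2.16` form and its `VST2.16` counterpart. The function names `vld2_16_model` and `vst2_16_model` are hypothetical; the code is a sketch of the data movement only and ignores addressing modes and alignment hints.

```c
#include <stdint.h>

/* Model of VLD2.16 {Da,Db}, [Rn]: memory holds four 2-element
 * structures (x0 y0 x1 y1 x2 y2 x3 y3). The first register receives
 * all x elements, the second all y elements (AoS -> SoA). */
static void vld2_16_model(const uint16_t mem[8],
                          uint16_t d_first[4], uint16_t d_second[4])
{
    for (int i = 0; i < 4; i++) {
        d_first[i] = mem[2 * i];
        d_second[i] = mem[2 * i + 1];
    }
}

/* Model of VST2.16 {Da,Db}, [Rn]: the inverse, interleaving the two
 * registers back into an array of 2-element structures (SoA -> AoS). */
static void vst2_16_model(uint16_t mem[8],
                          const uint16_t d_first[4], const uint16_t d_second[4])
{
    for (int i = 0; i < 4; i++) {
        mem[2 * i] = d_first[i];
        mem[2 * i + 1] = d_second[i];
    }
}
```

Once the x and y components sit in separate registers, one SIMD instruction can process four x values (or four y values) at a time, which is why the slides describe registers as a structure of arrays.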
  12. 12. Slide 45: Dual issue [Cortex-A8 only]
        - NEON can dual issue NEON instructions in the following circumstances:
        - No register operand/result dependencies
        - One NEON data processing (ALU) instruction
        - One NEON load/store or NEON byte permute instruction or MRC/MCR: VLDR/VSTR, VLDn/VSTn, VMOV, VTRN, VSWP, VZIP, VUZP, VEXT, VTBL, VTBX
            VLD1.8   {D0}, [R1]!
            VMLAL.S8 Q2, D3, D2
            VEXT.8   D0, D1, D2, #1
            SUBS     r12, r12, #1
        - Also can dual-issue NEON with ARM instructions
            VLD1.8   {D0}, [R1]!
            SUBS     r12, r12, #1

        Slide 46: Thank you!
        - ARM Architecture has evolved with a balance of pure RISC and customer driven input
        - NEON offers a clean architecture targeted at compiler code generation, offering:
            Unaligned access
            Structure load/store operations
            Dual D-register/Q-register view to optimize register bank
            Balance of performance vs. gatecount
        - Cortex-A9 and ARM’s hard macros provide a scalable low-power solution that is suitable for a wide range of high-performance consumer applications