Ti DSP Optimization over Jacinto
Hank
2015/06/09
Ti DSP optimization on Jacinto

Briefly introduces the Jacinto platform and summarizes the TI DSP optimization flow.

Published in: Engineering

  1. 1. Ti DSP Optimization over Jacinto Hank 2015/06/09
  2. 2. Generation ● 2002 OMAP ● 2006 Jacinto 1 ● 2008 Jacinto 3 ● 2010 Jacinto 4/5 ● 2016 Jacinto 6
  3. 3. OMAP – Application: ● CD-DA and CD-ROM/DVD-ROM/USB/SD with MP3, WMA, and AAC audio decoder support – Software platform: ● Cooperation with QNX Software Systems
  4. 4. Jacinto 1-DDR – Application: ● Compressed video playback, Bluetooth A2DP audio streaming – Improvement: ● C64x+ fixed-point DSP for graphics acceleration, compressed audio decoding, voice recognition
  5. 5. Jacinto 3 – Application: ● Compressed video playback, Bluetooth A2DP audio streaming – Hardware improvement: ● ARM Cortex A8 ● GPU PowerVR SGX
  6. 6. Jacinto 4/5 – Application: ● Full-HD 1080p video decode/encode ● QNX CAR 2 platform – Hardware improvement: ● Dual ARM Cortex-M3, used for decoding the video stream ● C674x DSP
  7. 7. Jacinto 6 – Application: ● Advanced Driver Assistance System (ADAS) – Hardware improvement: ● ARM Cortex A15 ● DSP C66x ● GPU SGX544
  8. 8. DSP Generation
  9. 9. C6000 DSP Optimization ● Code generation tool support languages: – ANSI C (C89) – ISO C++ (C++98) – C6000 DSP assembly – C6000 linear assembly
  10. 10. Optimization: five key concepts ● Core (architecture) – Parallel processing ● Pipeline – High throughput ● Software pipelining – Instruction scheduling ● Compiler optimization ● Optimized software library – Intrinsic operations in C6000, inlined functions
  11. 11. C6000 Core ● 8 parallel functional units – D: data load/store (.D1, .D2) – S: shift, branch (.S1, .S2) – M: multiply (.M1, .M2) – L: logic, arithmetic operations (.L1, .L2)
  12. 12. C6000 Core (cont.) ● 32 32-bit registers for each side of the functional units – A0-A31 (.D1, .S1, .M1, .L1) – B0-B31 (.D2, .S2, .M2, .L2) ● Separate program and data memory (L1P, L1D) ● 256-bit internal program bus: fetches 8 32-bit instructions from L1P every cycle ● 2 64-bit internal data buses that allow both .D1 and .D2 to fetch data from L1D every cycle
  13. 13. Core optimization (figure panels: C++, compiled parallel assembly, pseudo assembly)
  14. 14. Pipeline F: fetch D: decode E: execute
  15. 15. C6000 pipeline ● Divides fetch, decode, and execute into more substages: 4-stage fetch, 2-stage decode, 10-stage execute
  16. 16. Delay slots ● The pipeline cannot be kept full when – The current instruction depends on the result of a previous instruction that takes more than 1 cycle – A branch is performed ● Solution – Software scheduling (software pipelining) – Hardware enhancement (SPLOOP buffer)
  17. 17. Software pipelining ● Enable – For code written in C, add the compiler option -o3 to enable software pipelining ● Drawback – Assembly code size increases ● Solution – Software pipeline loop (SPLOOP) buffer
  18. 18. SPLOOP buffer ● Supported platforms – C64x+, C674x, and C66x ● The SPLOOP buffer stores a single scheduled iteration of the loop in a specialized buffer ● The C compiler uses SPLOOP automatically ● Cannot handle loops that exceed 14 execute packets (at most 8 instructions per execute packet) – Nested loops, conditional branches inside loops, function calls inside loops
  19. 19. Compiler Optimization ● Use the C compiler to generate assembly code that uses the C6000 functional units and pipeline as fully as possible – Add information and directives that help the compiler optimize your code as much as possible ● Compiler options, e.g. -o3 ● Keywords (C or C6000), e.g. restrict ● Pragma directives, e.g. MUST_ITERATE – Understand compiler feedback
  20. 20. Loop qualification (option -k -mw) Compiler feedback (option -k -mw)
  21. 21. Dependency & resource information ● Minimize the iteration interval – Loop-carried dependency bound ● Length of the longest loop-carried dependency path – Partitioned resource bound ● Maximum number of cycles any functional unit is used in a single iteration
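The relationship between the two bounds on this slide can be written compactly. This is a sketch of how TI's compiler feedback is usually read, not a formula from the deck; the symbols $\mathrm{II}_{\min}$, $B_{\mathrm{dep}}$, $B_{\mathrm{res}}$, and $c(u)$ are shorthand introduced here.

```latex
% B_dep : loop-carried dependency bound (cycles along the longest loop-carried path)
% B_res : partitioned resource bound
% c(u)  : cycles functional unit u is busy in one iteration
\[
  B_{\mathrm{res}} = \max_{u \in \{.D1,\,.D2,\,.S1,\,.S2,\,.M1,\,.M2,\,.L1,\,.L2\}} c(u),
  \qquad
  \mathrm{II}_{\min} = \max\left(B_{\mathrm{dep}},\; B_{\mathrm{res}}\right)
\]
```

Whichever bound is larger is the one worth attacking first: a dominant $B_{\mathrm{dep}}$ points at pointer aliasing or a true recurrence, a dominant $B_{\mathrm{res}}$ at an overloaded or unbalanced functional unit.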
  22. 22. Loop carried path
  23. 23. Explicit code optimization ● The previous approach works well, except for – Function calls in a loop – Complex, hard-to-implement operations ● Solutions: explicit code optimization – Intrinsic operations – Optimized C6000 DSP libraries – C inline functions
  24. 24. Intrinsic operations ● Sample – Shuffle operation separates even and odd bits of a 32-bit value into two variables ● Intrinsic operations – Function-like statements – Leading underscore, e.g. _shfl – Not a function call, no branch needed ● Listed in "TMS320C6000 Optimizing Compiler v7.6 User's Guide" ● Device dependent ● _abs can be used directly
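To make the sample on this slide concrete, here is a portable C sketch of the bit separation that an intrinsic would replace. The function name `split_even_odd_bits` is hypothetical and does not come from the deck. On a C6000 target the whole loop would be expected to collapse into one intrinsic; as far as the compiler guide describes them, `_deal` extracts the even and odd bits while `_shfl` interleaves them again, so check the guide for the exact one your device supports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference implementation (plain C, no TI intrinsics):
 * collects the even-numbered bits of src into *even and the
 * odd-numbered bits into *odd, 16 bits each. */
void split_even_odd_bits(uint32_t src, uint16_t *even, uint16_t *odd)
{
    int i;
    uint16_t e = 0;
    uint16_t o = 0;

    assert(even != NULL && odd != NULL);

    for (i = 0; i < 16; i++) {
        /* bit 2*i of src becomes bit i of the even result */
        e |= (uint16_t)(((src >> (2 * i)) & 1u) << i);
        /* bit 2*i+1 of src becomes bit i of the odd result */
        o |= (uint16_t)(((src >> (2 * i + 1)) & 1u) << i);
    }
    *even = e;
    *odd = o;
}
```

The loop costs 16 iterations of shifts and masks, which is the kind of overhead the slide's "not a function call, no branch needed" point is about.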
  25. 25. Optimized DSP software libraries ● Foundational math & signal processing – MathLIB – IQMath – FastRTS – DSPLIB ● Adaptive filtering, matrix computations ● Image & video processing – IMGLIB – Video Analytics & Vision Library (VLIB) – VICP Signal Processing Library
  26. 26. Inline functions ● Pros – Reduces the overhead of a function call – Lets the optimizer perform loop optimization ● Cons – Code size increases ● To use – Use -O2 or -O3 to make functions inline automatically – Use the explicit inline keyword
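A minimal sketch of the explicit `inline` route, assuming a small helper that is called from inside a loop. The names `sat_add16` and `mix_buffers` are hypothetical. Once the helper is inlined the loop body contains no call, which is what makes the loop a candidate for software pipelining.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: 16-bit saturating add. Small enough that
 * inlining it costs little code size. */
static inline short sat_add16(short x, short y)
{
    int s = (int)x + (int)y;

    if (s > 32767) {
        s = 32767;
    }
    if (s < -32768) {
        s = -32768;
    }
    return (short)s;
}

/* With sat_add16 inlined, this loop has no function call left in it. */
void mix_buffers(short *restrict out, const short *restrict a,
                 const short *restrict b, int n)
{
    int i;

    assert(out != NULL && a != NULL && b != NULL);

    for (i = 0; i < n; i++) {
        out[i] = sat_add16(a[i], b[i]);
    }
}
```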
  27. 27. Optimization flow
  28. 28. Profiling
  29. 29. Optimization practice ● Use -o3 and consider -mt for optimization; use -k and consider -mw for compiler feedback (-mt: assume all pointers in a loop are independent) ● Apply the restrict keyword to minimize the loop-carried dependency bound (alternative to -mt) ● Use the MUST_ITERATE and UNROLL pragmas to optimize pipeline usage ● Choose the smallest applicable data type and ensure proper data alignment to help the compiler use Single Instruction Multiple Data (SIMD) operations ● Use intrinsic operations and TI libraries in case major code modification is needed (avoid standard I/O functions)
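The `restrict` advice can be illustrated with a short sketch. The function `vec_add` is a hypothetical example, not code from the deck. Without the qualifier the compiler has to assume that a store to `out[i]` might change a later `a[i+1]` or `b[i+1]`, which creates a loop-carried dependency; with it, loads from the next iteration can be scheduled ahead of the current store.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: element-wise sum of two buffers.
 * `restrict` is a promise by the caller that out, a and b never
 * overlap; breaking that promise is undefined behaviour. */
void vec_add(short *restrict out, const short *restrict a,
             const short *restrict b, int n)
{
    int i;

    assert(out != NULL && a != NULL && b != NULL);

    for (i = 0; i < n; i++) {
        out[i] = (short)(a[i] + b[i]);
    }
}
```

Compared with -mt, which applies the no-aliasing assumption to the whole compilation unit, `restrict` scopes the promise to the pointers that actually need it.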
  30. 30. Using pragma ● Without a minimum iteration count, the compiler must assume the loop may iterate only once – Providing a factor gives the compiler freedom to unroll the loop
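The code on this slide did not survive extraction, so the following is only a sketch of what a MUST_ITERATE annotation could look like. The function `dot_product` and the chosen values (minimum 8, multiple of 4) are assumptions made for illustration. The pragma is TI-specific, so it is guarded with the TI compiler's predefined `__TI_COMPILER_VERSION__` macro to keep the file buildable with other compilers.

```c
#include <assert.h>

/* Hypothetical dot product over n 16-bit samples. */
int dot_product(const short *restrict a, const short *restrict b, int n)
{
    int i;
    int sum = 0;

    /* The pragma below is a promise to the compiler, so the
     * promise is checked here: at least 8 iterations, and the
     * count is a multiple of 4. */
    assert(n >= 8 && n % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* MUST_ITERATE(min, max, multiple); max is left blank. */
    #pragma MUST_ITERATE(8, , 4)
#endif
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Knowing the minimum lets the compiler skip generating a fallback non-pipelined copy of the loop, and knowing the multiple lets it unroll by that factor without a cleanup loop.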
  31. 31. Unbalanced resource partition
  32. 32. Manual unroll
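The figure for this slide is missing, so here is one possible shape of a manual unroll, offered as an assumption rather than a reconstruction of the author's code. The function `dot_product_unrolled2` is hypothetical. Unrolling by two with separate accumulators gives the compiler two independent multiply-accumulate chains, which it can place on the A and B sides of the core to balance the resource partition.

```c
#include <assert.h>

/* Hypothetical dot product, manually unrolled by two. */
int dot_product_unrolled2(const short *restrict a,
                          const short *restrict b, int n)
{
    int i;
    int sum0 = 0; /* accumulates even-indexed products */
    int sum1 = 0; /* accumulates odd-indexed products */

    /* The unrolled body consumes two elements per pass. */
    assert(n >= 2 && n % 2 == 0);

    for (i = 0; i < n; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    return sum0 + sum1;
}
```

The trade-off is larger code and a stricter contract on `n`; the UNROLL pragma or a MUST_ITERATE multiple can let the compiler do the same transformation without changing the source loop.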
  33. 33. Compiler unroll
  34. 34. Reference ● Texas Instruments, 『 Introduction to TMS320C6000 DSP Optimization 』 – Recommended to read first ● Texas Instruments, 『 In-Vehicle Connectivity is So Retro 』 ● Texas Instruments, 『 TMS320C6000 Programmer's Guide 』 ● Texas Instruments, 『 TMS320C6000 Optimizing Compiler v7.6 User's Guide 』
