Squeezing Blood From a Stone V1.2




  1. Getting Back Memory and Performance
     Jen Costillo
     While you wait, download: V1.2 release
  2. Why Optimize?
     • Lower memory -> cheaper BOM
        • Lower memory footprint
        • Low RAM footprint
     • Performance
        • Bottlenecks
        • Power considerations
     • Maintenance
     • Sometimes your compiler can’t do everything.
     • Sometimes you want a challenge.
     11/15/2015 Costillo Rebelbot
  3. Things Covered
     Basics of:
     • Profiling tools
     • Code optimization
     • RAM optimization
     Intro to tools:
     • Keil
     • IDA
     • Map files
     • Compiler/linker documentation secrets
  4. Things Not Covered (but are pretty cool)
     • Virtual memory
     • Caching for speed
     • Branch optimization
     • Processor pipeline considerations
     • Specifics of a particular processor family
     Instead, the focus is on where to find the info to accomplish the goal.
  5. Performance
     • Speed of execution
     • Resolve bottlenecks
     • Meeting design guidelines
     • Meet power consumption model
     (Figure: space versus time.)
  6. Maintenance
     • Refactoring
     • Code hygiene
     • Primarily an avoidance mechanism
  7. Lab 0
     • Install toolchain: Keil
     • Download project: Github
     • Install ST Link SW and driver: STM32L476 Discovery board
     • Hook up scope (optional: /Utilities/PC SW/Saleae): PB2, PE8, PA0
     • Make sure it compiles with “LAB0” defined in project Options -> C/C++ -> Preprocessor Symbols -> Define
     WARNING: Keil may decide additional packages are needed.
  8. Build and Load in Keil
     • Open workspace: Project menu -> Open Project. Select Optimizer.uvprojx
     • Change Options -> C/C++ -> Preprocessor Symbols -> Define
     • Rebuild: F7
     • Debug: Ctrl + F5
     • Run: F5
  9. Program Structure
     (Diagram)
     • MainThread
     • 500ms timer -> Signal Set -> thrLED: toggle LED4
     • Joystick Select press -> EXTI GPIO ISR -> Signal Set -> thrIsrLED: sample trigger
  10. Before You Dive In
     • Don’t optimize too early
     • Keep a baseline
     • Profiling
     • Keep track of memory utilization
     • Leverage tools with compiler toolchain/IDE
     • Create your own profiling systems
  12. Performance Measurements
     1. Model: what are you expecting them to be?
        a. Tasks
        b. RTOS task switching
        c. ISRs
     2. Measure
        1. Review compiler listings and map files
        2. Home-grown profiling tools
        3. Leverage simulators
        4. IDE/RTOS toolchain tools
     3. Modify (basic tools)
        1. Leverage toolchain intrinsics
        2. Utilize compiler optimizer
        3. Use Big-O to improve code structure
        4. Count and shrink instruction count
        5. Assembly based on processor pipeline knowledge
  13. Measurements Interface Tradeoffs
     | Type | Pros | Cons |
     |---|---|---|
     | Logging-based | Human readable | |
     | Serial or other live stream | Happens in “real” time | Serial port is slow; overhead is high; can disrupt execution order |
     | Circular buffer | Faster than serial | Need extraction tool; not reading in real time; limited data |
     | File system with extraction tool | Stays until you extract it | Size is limited by allocation; requires extraction tool; not reading in real time |
     | HW-based | Execution disruption is low | Need to decode for readability |
     | GPIOs | Setup is low | Potentially high pin count |
     | PWMs | Low pin count | Overhead can be high; can be painful without spectrum analyzer |
     | DAC | Low pin count; 2^n bits of event levels | Need oscilloscope |
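The circular buffer row in the table above can be sketched as a small in-RAM event log. This is a minimal illustration, not the lab's profiler: the names `evlog_t`, `evlog_put`, `evlog_get`, and the capacity of 8 are all hypothetical, and the timestamp is assumed to come from whatever tick or cycle counter the target provides.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capacity; kept a power of 2 so wrapping is a cheap mask. */
#define EVLOG_CAPACITY 8u

typedef struct {
    uint32_t timestamp;  /* tick or cycle count supplied by the caller */
    uint16_t event_id;   /* application-defined event number */
} evlog_entry_t;

typedef struct {
    evlog_entry_t entries[EVLOG_CAPACITY];
    uint32_t head;       /* total number of entries ever written */
} evlog_t;

static void evlog_init(evlog_t *log)
{
    log->head = 0;
}

/* Record one event, overwriting the oldest entry once the buffer is full. */
static void evlog_put(evlog_t *log, uint32_t timestamp, uint16_t event_id)
{
    evlog_entry_t *e = &log->entries[log->head & (EVLOG_CAPACITY - 1u)];
    e->timestamp = timestamp;
    e->event_id = event_id;
    log->head++;
}

/* Number of valid entries currently held. */
static uint32_t evlog_count(const evlog_t *log)
{
    return (log->head < EVLOG_CAPACITY) ? log->head : EVLOG_CAPACITY;
}

/* Fetch the i-th oldest entry (0 = oldest); NULL if i is out of range. */
static const evlog_entry_t *evlog_get(const evlog_t *log, uint32_t i)
{
    uint32_t n = evlog_count(log);
    if (i >= n) {
        return NULL;
    }
    uint32_t oldest = log->head - n;
    return &log->entries[(oldest + i) & (EVLOG_CAPACITY - 1u)];
}
```

Writing an entry costs a few stores, which is why this is faster than a serial stream; the price is the "limited data" and "need extraction tool" cons from the table, since something still has to walk the buffer with `evlog_get` and dump it after the fact.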
  14. Modification Improvement Strategies
     | Type | Method | Impact |
     |---|---|---|
     | Algorithm efficiency (function) | Review your Big O(n); leverage preprocessor intrinsics; count instructions and write new code | ROM, scale, speed |
     | Code size (function) | Utilize optimizer flags; C/Assembly based on processor knowledge | ROM, processor pipeline, target size or call frequency |
     | Memory usage | Leverage compiler intrinsics; utilize optimizer or linker flags | RAM: stack, heap |
     | Memory location | | RAM |
  15. Profiling LED Sample Task
     (Diagram)
     • MainThread
     • 500ms timer -> Signal Set -> thrLED: toggle LED4
     • EXTI GPIO ISR -> Signal Set -> thrIsrLED: sample trigger
  16. Lab 1 - Make Profiler Module
     • Go to project Options and in C/C++, change LAB0 -> LAB1
     • Load Saleae logic settings in /Saleae (optional)
     Exercise: How to improve
     Observe: Measurement accuracy
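A GPIO-based profiler of the kind this lab builds can be as small as a pair of macros that raise a pin before the code under test and lower it afterwards, so a scope or logic analyzer shows the duration as a pulse width. The sketch below is an assumption-laden stand-in, not the lab's module: `fake_gpio_odr` replaces the real port register so the code runs on a host, and the pin mask and the names `PROFILE_START`, `PROFILE_STOP`, and `work_under_test` are hypothetical.

```c
#include <stdint.h>

/* Stand-in for a GPIO output data register so the sketch runs on a host.
 * On the target this would be the memory-mapped register of the port
 * wired to the scope probe. */
static volatile uint32_t fake_gpio_odr;

#define PROFILE_PIN_MASK (1u << 2)   /* hypothetical pin number */

/* Raise the pin when the measured region starts, lower it when it ends. */
#define PROFILE_START() (fake_gpio_odr |= PROFILE_PIN_MASK)
#define PROFILE_STOP()  (fake_gpio_odr &= ~PROFILE_PIN_MASK)

/* Placeholder workload: sums 0..n-1. */
static uint32_t work_under_test(uint32_t n)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < n; i++) {
        sum += i;
    }
    return sum;
}

/* Wrap the workload so the pin is high exactly while it runs. */
static uint32_t profiled_work(uint32_t n)
{
    PROFILE_START();
    uint32_t result = work_under_test(n);
    PROFILE_STOP();
    return result;
}
```

The macros themselves cost a few cycles each, which is one source of the measurement inaccuracy the lab asks you to observe; a read-modify-write on the register can also be interrupted, so a dedicated set/reset register is often preferable where the hardware offers one.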
  17. Lab 2 - Improve Profiler Module
     • Go to project Options and in C/C++, change LAB1 -> LAB2
     Exercise: Count instruction cycles
     Observe:
     • 8MHz processor speed
     • ~56us interrupt delay (448 cycles)
     • ~12ms blip (~96k cycles)
  18. Estimate Instruction Count
     Exercise: Estimate the number of instructions, if you dare
  19. Intro to Reading a Map File
     • Cross reference: location of data/functions
     • Symbol table: size in section
     • Memory map: memory section
     • Image component sizes: by module
     • Callgraph: depth of stack usage *
     • Summary
     * Keil uses a separate .htm file
  20. Image Breakdown
     | Segment | Data type | Contains |
     |---|---|---|
     | .text/.code | READ ONLY, code | Functions, const, string literals and pre-defined values |
     | .bss/.zinit | READ WRITE, zero-init/uninitialized | Uninitialized global and static variables |
     | .data | READ WRITE, initialized | Global variables, static variables |
     | STACK | READ WRITE, call stack | Local function vars |
     | HEAP | READ WRITE | malloc() |
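To connect the segments above to source code, here is a small sketch showing where a typical toolchain places different kinds of C objects. The variable names are invented for illustration, and exact section names and placement depend on the compiler, linker script, and optimization settings, so treat the comments as the usual case rather than a guarantee.

```c
#include <stdint.h>
#include <stdlib.h>

/* Read-only data: usually placed alongside code in ROM/flash. */
const uint8_t lookup[4] = {1, 2, 4, 8};

/* No initializer: zero-initialized segment (.bss/.zinit), RAM only. */
uint32_t zeroed_counter;

/* Explicit non-zero initializer: .data, which costs RAM plus a copy of
 * the initial value in ROM. */
uint32_t seeded_counter = 42;

/* File-scope static without initializer: also zero-initialized RAM. */
static uint8_t scratch[16];

uint32_t sections_demo(void)
{
    uint32_t local = lookup[2];      /* local variable: stack (or a register) */
    uint8_t *dyn = malloc(8);        /* dynamic allocation: heap */
    if (dyn == NULL) {
        return 0;
    }
    dyn[0] = 1;
    scratch[0] = dyn[0];
    zeroed_counter += local + scratch[0];
    free(dyn);
    return zeroed_counter + seeded_counter;
}
```

Comparing the map file before and after moving a variable between these categories is a quick way to see which segment grows or shrinks.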
  21. Estimate Instruction Count
     Exercise: Estimate the number of instructions executed, both via profiling and via counting instructions. Are they close?
  22. Estimate Clocks Per Instruction (CPI)
     Method 1: CPI = (Execution Time * Clock Frequency) / Instruction Count
     • (12ms * 8MHz) / x = ???
     • The instruction count x is hard to obtain; a reasonable CPI is ~1-2.
     Method 2: CPI = weighted average of instruction types
     | Instruction (HAL_GPIO_TogglePin) | Cycle count |
     |---|---|
     | LDR R2, [R0,#0x14] | 2 |
     | EORS R2, R1 | 1 |
     | STR R2, [R0,#0x14] | 2 |
     | BX LR | 1 + (pipeline refill 1-3) |
     Total instructions: 4. Total cycles: 6-8. CPI: 1.5-2.
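The arithmetic behind both methods is short enough to check in code: cycles are execution time multiplied by clock frequency, and CPI is cycles divided by instruction count. The helper names below are hypothetical, and CPI is returned scaled by 1000 so the sketch stays in integer math, which is an illustration choice rather than anything the labs require.

```c
#include <stdint.h>

/* Cycles elapsed = time * clock frequency.
 * Time is in microseconds and frequency in Hz, hence the divide by 1e6. */
static uint64_t cycles_from_time_us(uint64_t time_us, uint64_t clock_hz)
{
    return (time_us * clock_hz) / 1000000u;
}

/* CPI = cycles / instruction count, scaled by 1000 to avoid floating
 * point (1500 means a CPI of 1.5). Returns 0 for a zero instruction count. */
static uint32_t cpi_milli(uint64_t cycles, uint64_t instruction_count)
{
    if (instruction_count == 0u) {
        return 0u;
    }
    return (uint32_t)((cycles * 1000u) / instruction_count);
}
```

With the deck's numbers, `cycles_from_time_us(56, 8000000)` gives the 448 cycles of interrupt delay and `cycles_from_time_us(12000, 8000000)` gives the 96k-cycle blip, while `cpi_milli(6, 4)` and `cpi_milli(8, 4)` reproduce the 1.5 to 2 range from the instruction table.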
  23. Instruction Counting versus Time
     Counting/CPI:
     • Tedious
     • Non-representative result
     • Not required in most cases
     • Works best in small encapsulated functions() with limited call stack depth
     • Use when concerned with nanoseconds
     Time:
     • Look at average execution times, not instructions
     • Break down only to the level of detail required
     • Best for large procedures and subsystems
     • Use when at microsecond and millisecond scope
  24. Lab 2B - More Granularity
     • Go to project Options and in C/C++, change LAB2 -> LAB2B
     Exercise: Determine where processing time is spent
     • Toggle on each f() call in task
     Observe: LCD calls are long
     • LCD Clear
     • LCD Display
  26. Now for Something Interesting
     thrIsrLED() includes:
     • Creates sliding averaging window structure
     • Collects a gyro data sample with each button press
     • Calculates magnitude of the 3 axes, just for fun
     • Adds it to a sliding averaging window
     • Prints out average of the window on LCD screen
     (Diagram: EXTI GPIO ISR -> Signal Set -> thrIsrLED: sample trigger)
  27. Lab 3 - Measure Data Processing Time
     • Go to project Options and in C/C++, change LAB2 -> LAB3
     Exercise: Measure code in terms of time and size
     • Utilize compiler listings under IDA
     Observations:
     • Profiler time goes up (8ms-14ms 11ms -15ms)
     • Algorithm choices are poor
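  28. Bytes By the Numbers
     Code:
     | Component | Lab0 | Lab1 | Lab2 | Lab3 | Lab4 | Lab5 |
     |---|---|---|---|---|---|---|
     | Main.o | 868 + 124 + 88 | 778 + 62 + 88 | 822 + 72 + 88 | 1008 + 136 + 88 | | |
     | thrIsrLED() | 18 | 24 | 54 | 140 | | |
     | Slidingwindow.o | | | | 154 | | |
     | SWInsert&Ave() | | | | 92 | | |
     | Total | 27572 | 27576 | 27648 | 27988 | | |
     RW Data:
     | Component | Lab0 | Lab1 | Lab2 | Lab3 | Lab4 | Lab5 |
     |---|---|---|---|---|---|---|
     | Main.o | 21 | 21 | 28 | 21 + 224 | | |
     | thrIsrLED() | | | | 224 | | |
     | Slidingwindow.o | | | | 0 | | |
     | Total | 240 | 240 | 244 | 240 | | |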
  29. Using IDA
     • Select “New” to disassemble a new file.
     • Open “Optimizer.axf”
     • Change Processor Type to ARM Little Endian
     • Select “Ok” and “yes” for everything else
  30. LAB 3 - thrIsrLED
     Disassembly (IDA):
         ER_IROM1:08005438 thrIsrLED
         ……
         ER_IROM1:08005446 loc_8005446                 ; CODE XREF: thrIsrLED+8Aj
         ER_IROM1:08005446 MOV.W R2, #0xFFFFFFFF       ; millisec
         ER_IROM1:0800544A MOVS R1, #1                 ; signals
         ER_IROM1:0800544C MOV R0, SP                  ; retstr
         ER_IROM1:0800544E BL osSignalWait
         ER_IROM1:08005452 ADD R0, SP, #0x30+GyroBuffer ; pfData
         ER_IROM1:08005454 BL BSP_GYRO_GetXYZ
         ER_IROM1:08005458 VLDR S0, [SP,#0x30+GyroBuffer]
         ER_IROM1:0800545C VMUL.F32 S0, S0, S0
         ER_IROM1:08005460 VCVTR.S32.F32 S0, S0
         ER_IROM1:08005464 VMOV R1, S0
         ER_IROM1:08005468 VLDR S0, [SP,#0x30+GyroBuffer+4]
         ER_IROM1:0800546C VMUL.F32 S0, S0, S0
         ER_IROM1:08005470 VCVTR.S32.F32 S0, S0
         ER_IROM1:08005474 VMOV R2, S0
         ER_IROM1:08005478 ADD R2, R1
         ER_IROM1:0800547A VLDR S0, [SP,#0x30+GyroBuffer+8]
         ER_IROM1:0800547E VMUL.F32 S0, S0, S0
         ER_IROM1:08005482 VCVTR.S32.F32 S0, S0
         ER_IROM1:08005486 VMOV R1, S0
         ER_IROM1:0800548A ADDS R0, R2, R1             ; val
         ER_IROM1:0800548C BL sq_rt
         ER_IROM1:08005490 SXTH R4, R0
         ER_IROM1:08005492 MOV R1, R4                  ; sample
         ER_IROM1:08005494 LDR R0, =average_gyro       ; pwindow
         ER_IROM1:08005496 BL SlidingWindowInsertSampleAndUpdateAverage16
         ER_IROM1:0800549A MOVS R0, #0
         ……
         ER_IROM1:080054C2 B loc_8005446
     C source:
         void thrIsrLED(void const *argument)
         {
             uint8_t strbuff[20];
             SlidingWindow16Init(&average_gyro, AVERAGE_WINDOW_SIZE, windowbuffer);
             float GyroBuffer[3];
             for (;;) {
                 osSignalWait(0x0001, osWaitForever);
                 //1. Get sample and calculate the magnitude
                 BSP_GYRO_GetXYZ(GyroBuffer);
                 int16_t sample = MAGNITUDE_3AXIS(GyroBuffer[AXIS_DIR__X], GyroBuffer[AXIS_DIR__Y], GyroBuffer[AXIS_DIR__Z]);
                 //2. Process sample
                 SlidingWindowInsertSampleAndUpdateAverage16(&average_gyro, sample);
                 int16_t ave = 0;
                 SlidingWindowGetAverage16(&average_gyro, &ave);
                 //3. Print Average result
                 BSP_LCD_GLASS_Clear();
                 /* Get the current menu */
                 sprintf((char *)strbuff, "%d", ave);
                 BSP_LCD_GLASS_DisplayString((uint8_t *)strbuff);
                 LED5_PROFILE__STOP;
             }
         }
  31. Revisit Estimate Instruction Count
     Exercise: Is utilizing IDA to count instructions effective?
  32. Quick Tips
     • Big O notation matters: iteration improvements pay off big.
     • Most compilers are smart enough to optimize if you tell them to.
     • Math considerations:
        • Data type matters.
        • Division by a power of 2 becomes a >> operation. Use powers of 2 for buffer sizes on averaging windows.
        • Skip % operations. They usually become some version of / and are expensive. While/subtraction loops can be faster in some cases.
        • pow(), sqrt(), and math.h are expensive. Focus on “good enough”.
     NOTE: some of these will appear in the next labs.
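The power-of-2 tips above can be illustrated with a sliding averaging window whose size is fixed at 8 samples, so the wrap-around uses a mask instead of % and the average uses a shift instead of /. This is a sketch under assumptions, not the lab's Slidingwindow module: `window_t`, `window_init`, and `window_insert` are hypothetical names, and the window starts filled with zeros, so the average ramps up until 8 samples have arrived.

```c
#include <stdint.h>

#define WIN_SHIFT 3u                   /* log2 of the window size */
#define WIN_SIZE  (1u << WIN_SHIFT)    /* 8 samples */
#define WIN_MASK  (WIN_SIZE - 1u)      /* replaces "% WIN_SIZE" */

typedef struct {
    int16_t  samples[WIN_SIZE];
    int32_t  sum;      /* running sum of everything in the window */
    uint32_t next;     /* count of samples inserted so far */
} window_t;

static void window_init(window_t *w)
{
    for (uint32_t i = 0; i < WIN_SIZE; i++) {
        w->samples[i] = 0;
    }
    w->sum = 0;
    w->next = 0;
}

/* Insert a sample and return the updated average.
 * The running sum makes this O(1) instead of re-adding the whole window. */
static int16_t window_insert(window_t *w, int16_t sample)
{
    uint32_t idx = w->next & WIN_MASK;       /* wrap without % */
    w->sum += sample - w->samples[idx];      /* drop oldest, add newest */
    w->samples[idx] = sample;
    w->next++;
    /* Shift instead of divide. For negative sums this rounds toward
     * negative infinity on common compilers, unlike "/". */
    return (int16_t)(w->sum >> WIN_SHIFT);
}
```

The tradeoff is flexibility: the window size can no longer be an arbitrary number, which is the "good enough" compromise the slide hints at.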
  34. RAM Optimization
     Symptoms:
     • You are out of space and can’t link
     • You keep blowing your stack/heap (i.e. things are suddenly in the weeds or weird values)
     • malloc() keeps failing
     Solutions:
     • Look at your local variables inside functions
     • Remove debug helper variables
     • Reduce stack size if possible
     • Alter your memory map
     • Inputs on stack versus send pointer to struct
     • Referencing globals and static variables
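The "inputs on stack versus send pointer to struct" item can be sketched as two versions of the same function. Passing a struct by value typically makes the caller copy it for the call, while passing a pointer moves only an address. The struct layout and function names here are hypothetical, and how much stack the by-value call really costs depends on the calling convention and optimizer.

```c
#include <stdint.h>

/* Hypothetical report structure: 12 bytes of axis data plus a 20-byte label. */
typedef struct {
    int32_t axis[3];
    uint8_t label[20];
} gyro_report_t;

/* By value: the whole structure is typically copied for the call. */
static int32_t axis_sum_by_value(gyro_report_t report)
{
    return report.axis[0] + report.axis[1] + report.axis[2];
}

/* By pointer: only an address is passed; const documents that the
 * callee does not modify the caller's data. */
static int32_t axis_sum_by_pointer(const gyro_report_t *report)
{
    return report->axis[0] + report->axis[1] + report->axis[2];
}
```

Both functions return the same result; the difference shows up in the call graph's stack depth, especially when such calls are nested inside a task with a small stack.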
  35. Lab 4 - Lower RAM Footprint
     • Go to project Options and in C/C++, change LAB3 -> LAB4
     Exercise:
     • Role of global, static, and local variables in RAM footprint: what happens as they shift attributions?
     • Tradeoffs in hiding variables in the call stacks
     • Select the right stack size for your task
     Observation:
     • Smaller .DATA segment
     • Decreased algorithm size
  36. Bytes By the Numbers
     Code:
     | Component | Lab0 | Lab1 | Lab2 | Lab3 | Lab4 | Lab5 |
     |---|---|---|---|---|---|---|
     | Main.o | 868 + 124 + 88 | 874 + 124 + 88 | 880 + 124 + 88 | 1008 + 136 + 88 | 1000 + 128 + 88 | |
     | thrIsrLED() | 18 | 24 | 54 | 140 | 140 | |
     | Slidingwindow.o | | | | 154 | 146 | |
     | SWInsert&Ave() | | | | 92 | 84 | |
     | Total | 27572 | 27576 | 27648 | 27988 | 27972 | |
     RW Data:
     | Component | Lab0 | Lab1 | Lab2 | Lab3 | Lab4 | Lab5 |
     |---|---|---|---|---|---|---|
     | Main.o | 21 | 21 | 28 | 21 + 224 | 21 | |
     | thrIsrLED() | | | | 224 | Stack! | |
     | Slidingwindow.o | | | | 0 | 0 | |
     | Total | 240 | 240 | 244 | 240 | 240 | |
  37. Deeper Code Space Optimization
     • Reduce number of instructions
     • Use listing file
     • Check stack and heap usage *
     • Use compiler flags
     • Use processor intrinsics
  38. Lab 5 - More Code Space Optimizer Through Toolchain
     • Go to project Options and in C/C++, change LAB4 -> LAB5
     Exercise:
     • Optimize only SlidingWindowAverage() with intrinsics.
     • Use smarter math operation selections.
     • Is there a trade off?
     Observation: Check current size on listing
  39. Bytes By the Numbers
     Code:
     | Component | Lab0 | Lab1 | Lab2 | Lab3 | Lab4 | Lab5 |
     |---|---|---|---|---|---|---|
     | Main.o | 868 + 124 + 88 | 874 + 124 + 88 | 880 + 124 + 88 | 1008 + 136 + 88 | 1000 + 128 + 88 | 1000 + 130 + 88 |
     | thrIsrLED() | 18 | 24 | 54 | 140 | 140 | 138 |
     | Slidingwindow.o | | | | 154 | 146 | 136 |
     | SWInsert&Ave() | | | | 92 | 84 | 72 |
     | Total | 27572 | 27576 | 27648 | 27988 | 27972 | 27960 |
     RW Data:
     | Component | Lab0 | Lab1 | Lab2 | Lab3 | Lab4 | Lab5 |
     |---|---|---|---|---|---|---|
     | Main.o | 21 | 21 | 28 | 21 + 224 | 21 | 21 |
     | thrIsrLED() | | | | 224 | Stack! | Stack! |
     | Slidingwindow.o | | | | 0 | 0 | 0 |
     | Total | 240 | 240 | 244 | 240 | 240 | 240 |
  40. Open Lab - Things to Try
     | Thing to try | Hint |
     |---|---|
     | Unroll loop (need to standardize buffer size) | #pragma unroll [(n)] |
     | Write a better square root function | Lookup table |
     | Actually turn on Optimizer for space or speed | Type –Os3 |
     | Check call graph for deepest stack usage | Optimizer.htm |
     | Turn on RTOS run time stat feature in FreeRTOSConfig.h file | configGENERATE_RUN_TIME_STATS |
     | Customize map file | Remove PADDING |
     Who can get the MOST EFFICIENT CODE?
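One possible take on the lookup-table square root hint is an integer square root for 16-bit inputs: a table of the squares of 0..255 plus a binary search for the largest entry that does not exceed the input. This is only a sketch of the idea, not the lab's `sq_rt`; the names `isqrt_table_init` and `isqrt16` are hypothetical, the input range is assumed to fit in 16 bits, and on a target the table would more likely be a `const` array in ROM than something filled at startup.

```c
#include <stdint.h>

/* squares[i] == i * i for i = 0..255 (255 * 255 = 65025 fits in 16 bits). */
static uint16_t squares[256];

/* Fill the table once at startup. */
static void isqrt_table_init(void)
{
    for (uint32_t i = 0; i < 256u; i++) {
        squares[i] = (uint16_t)(i * i);
    }
}

/* Floor of the square root of v: binary search for the largest i with
 * i * i <= v. Takes at most 8 table lookups and no multiply or divide. */
static uint8_t isqrt16(uint16_t v)
{
    uint32_t lo = 0;
    uint32_t hi = 255;
    while (lo < hi) {
        uint32_t mid = (lo + hi + 1u) / 2u;   /* round up so the loop terminates */
        if (squares[mid] <= v) {
            lo = mid;
        } else {
            hi = mid - 1u;
        }
    }
    return (uint8_t)lo;
}
```

The table costs 512 bytes, so this trades memory for speed; whether that is a win depends on which of the "Bytes By the Numbers" columns you are trying to shrink.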
  @rebelbotJen @rebelbots