Squeezing Blood From a Stone V1.2

Jen Costillo, Engineering Consultant at Rebelbot
Getting Back Memory and Performance
While you wait, download:
http://tinyurl.com/nha7853
V1.2 release
Why Optimize?
Lower memory -> cheaper BOM
Lower Memory footprint
Low RAM footprint
Performance
Bottlenecks
Power considerations
Maintenance
Sometimes your compiler can’t do everything.
Sometimes you want a challenge.
11/15/2015 2Costillo Rebelbot
Things Covered
Basics of:
  Profiling tools
  Code optimization
  RAM optimization
Intro to tools:
  Keil
  IDA
  Map files
  Compiler/linker documentation secrets
Things Not Covered (but are pretty cool)
Virtual memory
Caching for speed
Branch optimization
Processor pipeline considerations
Specifics of a particular processor family
Instead, the focus is on where to find the information needed to accomplish the goal.
Performance
Speed of execution
Resolve bottlenecks
Meeting design guidelines
Meet power consumption model
[Figure: plot with axes labeled Space and Time]
Maintenance
Refactoring
Code Hygiene
Primarily an avoidance mechanism
Lab 0
Install toolchain
  Keil
Download project
  GitHub
Install ST Link SW and Driver
  STM32L476 Discovery board
Hook up scope (optional: /Utilities/PC SW/Saleae)
  PB2
  PE8
  PA0
Make sure it compiles with “LAB0” defined in project Options -> C/C++ -> Preprocessor Symbols -> Define
WARNING: Keil may decide additional packages are needed
Build and Load in Keil
Open Workspace: Project menu -> Open Project. Select Optimizer.uvprojx
Change Options -> C/C++ -> Preprocessor Symbols -> Define
Rebuild / F7
Debug / Ctrl + F5
Run / F5
Program Structure
[Diagram: MainThread; 500ms timer and thrLED (Toggle LED4); Joystick Select Press, EXTI GPIO ISR, and thrIsrLED (Sample trigger); connections labeled Signal Set]
Before You Dive In
Don’t optimize too early
Keep a baseline
Profiling
Keep track of memory utilization
Leverage tools with compiler tool chain/IDE
Create your own profiling systems
Performance Measurements
1. Model – what do you expect the measurements to be?
  a. Tasks
  b. RTOS – task switching
  c. ISRs
2. Measure
  1. Review compiler listings and map files
  2. Home-grown profiling tools
  3. Leverage simulators
  4. IDE/RTOS toolchain tools
3. Modify – basic tools
  1. Leverage toolchain intrinsics
  2. Utilize compiler optimizer
  3. Use Big-O to improve code structure
  4. Count and shrink instruction count
  5. Assembly based on processor pipeline knowledge
Measurements Interface Tradeoffs
Type: Pro / Cons
Logging-based (human readable):
  Serial or other live stream
    Pro: Happens in “real” time
    Cons: Serial port is slow; overhead is high; can disrupt execution order
  Circular buffer
    Pro: Faster than serial
    Cons: Need extraction tool; not reading in real time; limited data
  File system with extraction tool
    Pro: Stays until you extract it
    Cons: Size is limited by allocation; requires extraction tool; not reading in real time
HW-based (execution disruption is low; need to decode for readability):
  GPIOs
    Pro: Setup is low
    Cons: Potentially high pin count
  PWMs
    Pro: Low pin count
    Cons: Overhead can be high; can be painful without spectrum analyzer
  DAC
    Pro: Low pin count; 2^n bits of event levels
    Cons: Need oscilloscope
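As a rough illustration of the circular-buffer option, here is a minimal host-runnable sketch. Every name in it (`event_log_t`, `event_log_put`, `LOG_CAPACITY`, and so on) is hypothetical and not taken from the lab project; the timestamp is assumed to come from whatever tick or cycle counter the target provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout. LOG_CAPACITY must be a power of 2 so the
 * index can wrap with a mask instead of a modulo. */
#define LOG_CAPACITY 8u

typedef struct {
    uint32_t timestamp;  /* tick or cycle count supplied by the caller */
    uint16_t event_id;   /* application-defined event number */
} log_entry_t;

typedef struct {
    log_entry_t entries[LOG_CAPACITY];
    uint32_t head;       /* total entries written (sketch ignores 2^32 wrap) */
} event_log_t;

/* Record one event; the oldest entry is overwritten once the buffer is full. */
static void event_log_put(event_log_t *log, uint32_t timestamp, uint16_t event_id)
{
    assert(log != NULL);
    log_entry_t *slot = &log->entries[log->head & (LOG_CAPACITY - 1u)];
    slot->timestamp = timestamp;
    slot->event_id = event_id;
    log->head++;
}

/* Number of valid entries currently held (at most LOG_CAPACITY). */
static uint32_t event_log_count(const event_log_t *log)
{
    return (log->head < LOG_CAPACITY) ? log->head : LOG_CAPACITY;
}

/* Fetch the i-th oldest surviving entry; an extraction tool would walk these. */
static log_entry_t event_log_get(const event_log_t *log, uint32_t i)
{
    assert(i < event_log_count(log));
    uint32_t oldest = log->head - event_log_count(log);
    return log->entries[(oldest + i) & (LOG_CAPACITY - 1u)];
}
```

The write path is a couple of stores and an increment, which is why it disturbs timing less than a serial print. The cost shows up as the "limited data" and "need extraction tool" entries in the table.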
Modification Improvement Strategies
Type: Method / Impact
Algorithm efficiency
  Method:
  • Review your Big O(n)
  • Leverage preprocessor intrinsics
  • Count instructions and write new code
  Impact: function ROM, scale, speed
Code size
  Method:
  • Utilize optimizer flags
  • C/Assembly based on processor knowledge
  Impact: function ROM, processor pipeline, target size or call frequency
Memory usage
  Method:
  • Leverage compiler intrinsics
  • Utilize optimizer or linker flags
  Impact: RAM (stack, heap), memory location (RAM)
Profiling LED Sample Task
[Diagram: MainThread; 500ms timer and thrLED (Toggle LED4); EXTI GPIO ISR and thrIsrLED (Sample trigger); connections labeled Signal Set]
Lab 1 - Make Profiler Module
Go to project Options and in C/C++, change LAB0 -> LAB1
Load Saleae logic settings in /Saleae (optional)
Exercise: How to improve
Observe:
Measurement accuracy
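A GPIO-based profiler of the kind this lab builds can be sketched as a pair of macros that raise a pin on entry and lower it on exit. The sketch below is an assumption, not the lab's code: `fake_gpio_odr` stands in for a real port output register so the example runs on a host, and `PROFILE_START`, `PROFILE_STOP`, and the pin mask are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a GPIO output data register so the sketch runs on a host.
 * On the target this would be the port's real output register. */
static volatile uint32_t fake_gpio_odr;

#define PROFILE_PIN_MASK (1u << 8)   /* hypothetical pin choice */

/* Drive the pin high on entry and low on exit; a scope or logic analyzer
 * then shows the section's duration as the pulse width. */
#define PROFILE_START() (fake_gpio_odr |= PROFILE_PIN_MASK)
#define PROFILE_STOP()  (fake_gpio_odr &= ~PROFILE_PIN_MASK)

/* Example section under measurement: sums 0..n-1 between the two markers. */
static uint32_t work_under_test(uint32_t n)
{
    uint32_t sum = 0;
    PROFILE_START();
    for (uint32_t i = 0; i < n; i++) {
        sum += i;
    }
    PROFILE_STOP();
    return sum;
}
```

Writing the register directly keeps the marker overhead to a few instructions. Going through a library call adds its own call and return time to every measurement, which is one thing to look at when judging measurement accuracy.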
Lab 2 - Improve Profiler Module
Go to project Options and in C/C++, change LAB1 -> LAB2
Exercise: Count instruction cycles
Observe:
8MHz processor speed
~56us interrupt delay (448 cycles)
~12ms blip (~96k cycles)
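The cycle figures above follow from cycles = time x clock frequency. A small sketch of that conversion, with `us_to_cycles` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured duration in microseconds to core clock cycles.
 * 64-bit math avoids overflow in the intermediate product. */
static uint64_t us_to_cycles(uint64_t duration_us, uint64_t clock_hz)
{
    return (duration_us * clock_hz) / 1000000u;
}
```

With an 8MHz clock, `us_to_cycles(56, 8000000)` gives 448 and `us_to_cycles(12000, 8000000)` gives 96000, matching the observed values.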
Estimate Instruction Count
Exercise: Estimate the number of instructions, if you dare
Intro to reading a map file
Cross reference – location of data/functions
Symbol Table – size in section
Memory Map – memory section
Image Component Sizes – by module
Callgraph – depth of stack usage *
Summary
*Keil uses a separate .htm file
Image Breakdown
Segment        Access       Data type                  Contains
.text/.code    READ ONLY    Code                       Functions, const, string literals and pre-defined values
.bss/.zinit    READ WRITE   Zero-init, uninitialized   Uninitialized global and static variables
.data          READ WRITE   Initialized                Global variables, static variables
STACK          READ WRITE   Call stack                 Local function vars
HEAP           READ WRITE   -                          malloc()
https://en.wikipedia.org/wiki/Data_segment
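To connect the segments to source code, here is a small sketch showing where typical C objects usually land. The variable and function names are invented for illustration, and the placements in the comments are the common case. A given toolchain, optimizer setting, or linker script may place things differently, so the map file remains the authority.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

const uint16_t lookup_table[4] = {1, 2, 4, 8}; /* read-only data, kept with .text/.code */
uint32_t sample_count;                          /* uninitialized global: .bss */
static uint8_t rx_buffer[64];                   /* uninitialized static: .bss */
uint32_t gain = 3;                              /* initialized global: .data */

/* The function body itself is code: .text */
uint32_t scale_sample(uint32_t x)
{
    uint32_t scaled = x * gain;                 /* local: STACK (or a register) */
    uint32_t *copy = malloc(sizeof *copy);      /* allocation: HEAP */
    if (copy == NULL) {
        return 0;
    }
    *copy = scaled + lookup_table[0];
    rx_buffer[0] = (uint8_t)*copy;              /* touches the .bss buffer */
    sample_count++;
    uint32_t result = *copy;
    free(copy);
    return result;
}
```

Changing `gain` to a `const` or dropping its initializer moves it between rows of the table, which is an easy way to watch the map file respond.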
Estimate Instruction Count
Exercise: Estimate the number of instructions executed, both by profiling and by counting instructions. Are they close?
Estimate Clocks Per Instruction (CPI)
Method 1:
CPI = (Execution Time * Clock Frequency) / Instruction Count
(12ms * 8MHz) / x = 96,000 / x = ???
The instruction count x is hard to obtain; a reasonable CPI is ~1-2
Method 2:
CPI = weighted average of instruction types
HAL_GPIO_TogglePin:
Instruction              Cycle Count
LDR R2, [R0,#0x14]       2
EORS R2, R1              1
STR R2, [R0,#0x14]       2
BX LR                    1 + (pipeline refill 1-3)
TOTAL instructions: 4    Total cycles: 6-8    CPI: 1.5 - 2
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html
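Both methods reduce to cycles divided by instructions, where cycles can come from a timing measurement (time x clock frequency) or from a per-instruction count. A minimal sketch, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* Method 1: CPI from a measured execution time.
 * cycles = time * frequency, then divide by the instruction count. */
static double cpi_from_time(double exec_time_s, double clock_hz,
                            double instruction_count)
{
    assert(instruction_count > 0.0);
    return (exec_time_s * clock_hz) / instruction_count;
}

/* Method 2: CPI from cycles counted over a known instruction sequence. */
static double cpi_from_counts(uint32_t total_cycles, uint32_t instruction_count)
{
    assert(instruction_count > 0u);
    return (double)total_cycles / (double)instruction_count;
}
```

For the four-instruction sequence on the slide, `cpi_from_counts(6, 4)` and `cpi_from_counts(8, 4)` give 1.5 and 2.0. For Method 1, the instruction count has to be estimated. For example, an assumed 64,000 instructions over the 12ms blip at 8MHz would give a CPI of about 1.5.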
Instruction Counting versus Time
Counting/CPI:
  Tedious counting
  Non-representative result
  Not required in most cases
  Works best in small encapsulated functions() with limited call stack depth
  Use when concerned with nanoseconds.
Time:
  Look at average execution times, not instructions
  Breakdown only to the level of detail required
  Best for large procedures and subsystems
  Use when at microsecond and millisecond scope
Lab 2B – More Granularity
Go to project Options and in C/C++, change LAB2 -> LAB2B
Exercise:
Determine where processing time is spent
Toggle on each f() call in task
Observe:
LCD calls are long
[Figure: capture annotated with LCD Clear and LCD Display]
Now for something interesting
thrIsrLED() includes:
Creates sliding averaging window structure
Collects a gyro data sample with each button press.
Calculates magnitude of the 3 axes – just for fun.
Adds it to a sliding averaging window
Prints out average of the window on LCD screen.
[Diagram: EXTI GPIO ISR, Signal Set, thrIsrLED (Sample trigger)]
Lab 3 – Measure Data Processing Time
Go to project Options and in C/C++, change LAB2 -> LAB3
Exercise:
Measure code in terms of time and size
Utilize compiler listings under IDA
Observations:
Profiler time goes up (8ms-14ms to 11ms-15ms)
Algorithm choices are poor
Bytes By the Numbers
Code
Component         Lab0         Lab1         Lab2         Lab3          Lab4   Lab5
Main.o            868+124+88   778+62+88    822+72+88    1008+136+88   -      -
thrIsrLED()       18           24           54           140           -      -
Slidingwindow.o   -            -            -            154           -      -
SWInsert&Ave()    -            -            -            92            -      -
Total             27572        27576        27648        27988         -      -

RW Data
Component         Lab0   Lab1   Lab2   Lab3     Lab4   Lab5
Main.o            21     21     28     21+224   -      -
thrIsrLED()       -      -      -      224      -      -
Slidingwindow.o   -      -      -      0        -      -
Total             240    240    244    240      -      -
Using IDA
Select “New” to disassemble a new file.
Open “Optimizer.axf”
Change Processor Type to ARM Little Endian
Select “Ok” and “yes” for everything else
LAB 3 - thrIsrLED

C source:
void thrIsrLED(void const *argument) {
    uint8_t strbuff[20];
    SlidingWindow16Init(&average_gyro, AVERAGE_WINDOW_SIZE, windowbuffer);
    float GyroBuffer[3];
    for (;;) {
        osSignalWait(0x0001, osWaitForever);
        //1. Get sample and calculate the magnitude
        BSP_GYRO_GetXYZ(GyroBuffer);
        int16_t sample = MAGNITUDE_3AXIS(GyroBuffer[AXIS_DIR__X], GyroBuffer[AXIS_DIR__Y], GyroBuffer[AXIS_DIR__Z]);
        //2. Process sample
        SlidingWindowInsertSampleAndUpdateAverage16(&average_gyro, sample);
        int16_t ave = 0;
        SlidingWindowGetAverage16(&average_gyro, &ave);
        //3. Print Average result
        BSP_LCD_GLASS_Clear();
        /* Get the current menu */
        sprintf((char *)strbuff, "%d", ave);
        BSP_LCD_GLASS_DisplayString((uint8_t *)strbuff);
        LED5_PROFILE__STOP;
    }
}

Disassembly (IDA):
ER_IROM1:08005438 thrIsrLED
……
ER_IROM1:08005446 loc_8005446 ; CODE XREF: thrIsrLED+8Aj
ER_IROM1:08005446 MOV.W R2, #0xFFFFFFFF ; millisec
ER_IROM1:0800544A MOVS R1, #1 ; signals
ER_IROM1:0800544C MOV R0, SP ; retstr
ER_IROM1:0800544E BL osSignalWait
ER_IROM1:08005452 ADD R0, SP, #0x30+GyroBuffer ; pfData
ER_IROM1:08005454 BL BSP_GYRO_GetXYZ
ER_IROM1:08005458 VLDR S0, [SP,#0x30+GyroBuffer]
ER_IROM1:0800545C VMUL.F32 S0, S0, S0
ER_IROM1:08005460 VCVTR.S32.F32 S0, S0
ER_IROM1:08005464 VMOV R1, S0
ER_IROM1:08005468 VLDR S0, [SP,#0x30+GyroBuffer+4]
ER_IROM1:0800546C VMUL.F32 S0, S0, S0
ER_IROM1:08005470 VCVTR.S32.F32 S0, S0
ER_IROM1:08005474 VMOV R2, S0
ER_IROM1:08005478 ADD R2, R1
ER_IROM1:0800547A VLDR S0, [SP,#0x30+GyroBuffer+8]
ER_IROM1:0800547E VMUL.F32 S0, S0, S0
ER_IROM1:08005482 VCVTR.S32.F32 S0, S0
ER_IROM1:08005486 VMOV R1, S0
ER_IROM1:0800548A ADDS R0, R2, R1 ; val
ER_IROM1:0800548C BL sq_rt
ER_IROM1:08005490 SXTH R4, R0
ER_IROM1:08005492 MOV R1, R4 ; sample
ER_IROM1:08005494 LDR R0, =average_gyro ; pwindow
ER_IROM1:08005496 BL SlidingWindowInsertSampleAndUpdateAverage16
ER_IROM1:0800549A MOVS R0, #0
……
ER_IROM1:080054C2 B loc_8005446
Revisit Estimate Instruction Count
Exercise: Is utilizing IDA to count instructions effective?
Quick Tips
Big O notation matters: iteration improvements pay off big
Most compilers are smart enough to optimize if you tell them to.
Math considerations:
  Data type matters
  Division by a power of 2 becomes a >> operation.
  Use powers of 2 for buffer sizes on averaging windows.
  Skip % operations. They usually become some version of / and are expensive.
  While/subtraction loops can be faster in some cases.
  pow(), sqrt(), and math.h are expensive. Focus on “good enough”
NOTE: some of these will appear in the next labs
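Several of these tips combine naturally in an averaging window. The sketch below is a hypothetical illustration, not the lab's SlidingWindow implementation: it assumes non-negative samples (such as a magnitude), a window size fixed at a power of 2, and invented names (`window_t`, `window_insert`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SHIFT 3u                       /* window holds 2^3 = 8 samples */
#define WINDOW_SIZE  (1u << WINDOW_SHIFT)
#define WINDOW_MASK  (WINDOW_SIZE - 1u)

typedef struct {
    uint16_t samples[WINDOW_SIZE];
    uint32_t sum;      /* running sum of everything in samples[] */
    uint32_t next;     /* index of the slot to overwrite next */
} window_t;

/* Insert a sample and return the average over the whole window.
 * Unfilled slots count as zero until the window has wrapped once. */
static uint16_t window_insert(window_t *w, uint16_t sample)
{
    assert(w != NULL);
    /* Running sum: O(1) per sample instead of re-adding the whole buffer. */
    w->sum = w->sum - w->samples[w->next] + sample;
    w->samples[w->next] = sample;
    w->next = (w->next + 1u) & WINDOW_MASK;   /* wrap with a mask, no % */
    return (uint16_t)(w->sum >> WINDOW_SHIFT); /* divide by 8 with a shift */
}
```

Keeping the data unsigned is what makes the shift an exact replacement for the division. With signed values the two round differently for negative sums, so the trade-off needs checking before applying the tip.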
RAM Optimization
Symptoms:
  You are out of space and can’t link
  Keep blowing your stack/heap (i.e. things are suddenly in the weeds or weird values)
  malloc() keeps failing.
Solutions:
  Look at your local variables inside functions.
  Remove debug helper variables.
  Reduce stack size if possible
  Alter your memory map
  Inputs on stack versus send pointer to struct
  Referencing globals and static variables
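The "inputs on stack versus send pointer to struct" point can be illustrated with two versions of the same function. The struct and function names below are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int16_t  samples[32];
    uint32_t count;      /* number of valid entries in samples[] */
} sample_block_t;

/* By value: the caller makes a full copy of the struct for every call,
 * which typically costs sizeof(sample_block_t) bytes of stack. */
static int32_t sum_by_value(sample_block_t block)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i < block.count; i++) {
        sum += block.samples[i];
    }
    return sum;
}

/* By pointer: only an address is passed; const documents read-only access. */
static int32_t sum_by_pointer(const sample_block_t *block)
{
    int32_t sum = 0;
    for (uint32_t i = 0; i < block->count; i++) {
        sum += block->samples[i];
    }
    return sum;
}
```

Both return the same result. The difference is the temporary copy, which matters most when the call sits deep in a task with a small stack.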
Lab 4 – Lower RAM footprint
Go to project Options and in C/C++, change LAB3 -> LAB4
Exercise:
Role of global, static, and local variables in RAM footprint: what happens as they shift attributions?
Tradeoffs in hiding variables in the call stacks
Select the right stack size for your task
Observation:
Smaller .DATA segment
Decreased algorithm size
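One common way to pick a stack size is to paint the stack with a known pattern and later check how much of the pattern survived. This technique is not described in the slides, so treat the sketch below as an assumption: it simulates a descending stack with a plain array, and the names and fill value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xA5A5A5A5u   /* hypothetical paint pattern */

/* Paint a task stack region before the task starts running. */
static void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Count words that still hold the paint pattern, scanning from index 0.
 * For a descending stack, index 0 is the lowest address and is used last. */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}
```

After exercising the worst-case paths, the unused count suggests how far the stack allocation could shrink, with some margin left for paths that were not exercised.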
Bytes By the Numbers
Code
Component         Lab0         Lab1         Lab2         Lab3          Lab4          Lab5
Main.o            868+124+88   874+124+88   880+124+88   1008+136+88   1000+128+88   -
thrIsrLED()       18           24           54           140           140           -
Slidingwindow.o   -            -            -            154           146           -
SWInsert&Ave()    -            -            -            92            84            -
Total             27572        27576        27648        27988         27972         -

RW Data
Component         Lab0   Lab1   Lab2   Lab3     Lab4     Lab5
Main.o            21     21     28     21+224   21       -
thrIsrLED()       -      -      -      224      Stack!   -
Slidingwindow.o   -      -      -      0        0        -
Total             240    240    244    240      240      -
Deeper Code Space Optimization
Reduce number of instructions
Use listing file
Check stack and heap usage *
Use compiler flags
Use processor intrinsics
Lab 5 – More Code Space Optimization Through Toolchain
Go to project Options and in C/C++, change LAB4 -> LAB5
Exercise:
Optimize only SlidingWindowAverage() with intrinsics.
Use smarter math operation selections. Is there a trade-off?
Observation:
Check current size on listing
Bytes By the Numbers
Code
Component         Lab0         Lab1         Lab2         Lab3          Lab4          Lab5
Main.o            868+124+88   874+124+88   880+124+88   1008+136+88   1000+128+88   1000+130+88
thrIsrLED()       18           24           54           140           140           138
Slidingwindow.o   -            -            -            154           146           136
SWInsert&Ave()    -            -            -            92            84            72
Total             27572        27576        27648        27988         27972         27960

RW Data
Component         Lab0   Lab1   Lab2   Lab3     Lab4     Lab5
Main.o            21     21     28     21+224   21       21
thrIsrLED()       -      -      -      224      Stack!   Stack!
Slidingwindow.o   -      -      -      0        0        0
Total             240    240    244    240      240      240
Open Lab - Things to try
Unroll loop (need to standardize buffer size)
  • #pragma unroll [(n)]
Write a better square root function using a lookup table.
  • Hint: Look up table
Actually turn on Optimizer for space or speed
  • Type –Os3
Check call graph for deepest stack usage
  • Optimizer.htm
Turn on RTOS run time stat feature in FreeRTOSConfig.h file
  • configGENERATE_RUN_TIME_STATS
Customize map file
  • HINT: remove PADDING
Who can get the MOST EFFICIENT CODE?
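As a starting point for the square root exercise, here is an integer-only routine. Note that this is the bitwise (digit-by-digit) method rather than the lookup-table approach the hint suggests, and `isqrt32` is a hypothetical name unrelated to the project's `sq_rt`. It uses no floating point and no math.h.

```c
#include <assert.h>
#include <stdint.h>

/* Integer square root: returns floor(sqrt(value)) using only shifts,
 * adds, and compares. */
static uint16_t isqrt32(uint32_t value)
{
    uint32_t result = 0;
    uint32_t bit = (uint32_t)1 << 30;   /* highest power of 4 in 32 bits */

    /* Start at the highest power of 4 that does not exceed the input. */
    while (bit > value) {
        bit >>= 2;
    }
    while (bit != 0) {
        if (value >= result + bit) {
            value -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return (uint16_t)result;
}
```

The loop runs at most 16 times for a 32-bit input. A lookup table can be faster for a narrow input range at the cost of ROM, which is the kind of trade-off to compare in the map file.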
@rebelbotJen
@rebelbots
www.rebelbot.com
1 of 41

Recommended

Designing An Android Sensor Subsystem and Solving Common Sensor Problems by
Designing An Android Sensor Subsystem and Solving Common Sensor ProblemsDesigning An Android Sensor Subsystem and Solving Common Sensor Problems
Designing An Android Sensor Subsystem and Solving Common Sensor ProblemsJen Costillo
9.9K views35 slides
Designing an android sensor subsystem costillo 20120214 by
Designing an android sensor subsystem costillo 20120214Designing an android sensor subsystem costillo 20120214
Designing an android sensor subsystem costillo 20120214Jen Costillo
2K views27 slides
Postdoc Symposium - Abram Hindle by
Postdoc Symposium - Abram HindlePostdoc Symposium - Abram Hindle
Postdoc Symposium - Abram HindleICSM 2011
485 views39 slides
Jonathan bromley doulos by
Jonathan bromley doulosJonathan bromley doulos
Jonathan bromley doulosObsidian Software
355 views24 slides
Track g test strategy - delta by
Track g   test strategy - deltaTrack g   test strategy - delta
Track g test strategy - deltachiportal
369 views18 slides
Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De... by
Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De...Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De...
Open Source Interactive CPU Preview Rendering with Pixar's Universal Scene De...Intel® Software
2.1K views52 slides

More Related Content

What's hot

Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S... by
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...Intel® Software
4.1K views38 slides
Pjproject su Android: uno scontro su più livelli by
Pjproject su Android: uno scontro su più livelliPjproject su Android: uno scontro su più livelli
Pjproject su Android: uno scontro su più livelliGiacomo Bergami
1.3K views11 slides
The Architecture of 11th Generation Intel® Processor Graphics by
The Architecture of 11th Generation Intel® Processor GraphicsThe Architecture of 11th Generation Intel® Processor Graphics
The Architecture of 11th Generation Intel® Processor GraphicsIntel® Software
12K views30 slides
Analog for all_preview by
Analog for all_previewAnalog for all_preview
Analog for all_previewAnand Udupa
1.2K views12 slides
Accelerated Android Development with Linaro by
Accelerated Android Development with LinaroAccelerated Android Development with Linaro
Accelerated Android Development with LinaroNational Cheng Kung University
1.9K views35 slides
Synopsys jul1411 by
Synopsys jul1411Synopsys jul1411
Synopsys jul1411Samsung Electronics Egypt
1.5K views27 slides

What's hot(8)

Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S... by Intel® Software
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...
Embree Ray Tracing Kernels | Overview and New Features | SIGGRAPH 2018 Tech S...
Intel® Software4.1K views
Pjproject su Android: uno scontro su più livelli by Giacomo Bergami
Pjproject su Android: uno scontro su più livelliPjproject su Android: uno scontro su più livelli
Pjproject su Android: uno scontro su più livelli
Giacomo Bergami1.3K views
The Architecture of 11th Generation Intel® Processor Graphics by Intel® Software
The Architecture of 11th Generation Intel® Processor GraphicsThe Architecture of 11th Generation Intel® Processor Graphics
The Architecture of 11th Generation Intel® Processor Graphics
Intel® Software12K views
Analog for all_preview by Anand Udupa
Analog for all_previewAnalog for all_preview
Analog for all_preview
Anand Udupa1.2K views
Brief Introduction To Teseda by Rhokanson
Brief Introduction To TesedaBrief Introduction To Teseda
Brief Introduction To Teseda
Rhokanson575 views
Android Internals (This is not the droid you’re loking for...) by Giacomo Bergami
Android Internals (This is not the droid you’re loking for...)Android Internals (This is not the droid you’re loking for...)
Android Internals (This is not the droid you’re loking for...)
Giacomo Bergami3.8K views

Similar to Squeezing Blood From a Stone V1.2

Dot Net Application Monitoring by
Dot Net Application MonitoringDot Net Application Monitoring
Dot Net Application MonitoringRavi Okade
1.4K views30 slides
May2010 hex-core-opt by
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
459 views101 slides
Works on My Machine Syndrome by
Works on My Machine SyndromeWorks on My Machine Syndrome
Works on My Machine SyndromeKamran Bilgrami
308 views18 slides
Performance and Power Profiling on Intel Android Devices by
Performance and Power Profiling on Intel Android DevicesPerformance and Power Profiling on Intel Android Devices
Performance and Power Profiling on Intel Android DevicesIntel® Software
1.3K views38 slides
Denis Nagorny - Pumping Python Performance by
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceSergey Arkhipov
626 views16 slides
Closed-Loop Platform Automation by Tong Zhong and Emma Collins by
Closed-Loop Platform Automation by Tong Zhong and Emma CollinsClosed-Loop Platform Automation by Tong Zhong and Emma Collins
Closed-Loop Platform Automation by Tong Zhong and Emma CollinsLiz Warner
89 views27 slides

Similar to Squeezing Blood From a Stone V1.2(20)

Dot Net Application Monitoring by Ravi Okade
Dot Net Application MonitoringDot Net Application Monitoring
Dot Net Application Monitoring
Ravi Okade1.4K views
May2010 hex-core-opt by Jeff Larkin
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
Jeff Larkin459 views
Performance and Power Profiling on Intel Android Devices by Intel® Software
Performance and Power Profiling on Intel Android DevicesPerformance and Power Profiling on Intel Android Devices
Performance and Power Profiling on Intel Android Devices
Intel® Software1.3K views
Denis Nagorny - Pumping Python Performance by Sergey Arkhipov
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python Performance
Sergey Arkhipov626 views
Closed-Loop Platform Automation by Tong Zhong and Emma Collins by Liz Warner
Closed-Loop Platform Automation by Tong Zhong and Emma CollinsClosed-Loop Platform Automation by Tong Zhong and Emma Collins
Closed-Loop Platform Automation by Tong Zhong and Emma Collins
Liz Warner89 views
Closed Loop Platform Automation - Tong Zhong & Emma Collins by Liz Warner
Closed Loop Platform Automation - Tong Zhong & Emma CollinsClosed Loop Platform Automation - Tong Zhong & Emma Collins
Closed Loop Platform Automation - Tong Zhong & Emma Collins
Liz Warner186 views
Python* Scalability in Production Environments by Intel® Software
Python* Scalability in Production EnvironmentsPython* Scalability in Production Environments
Python* Scalability in Production Environments
Intel® Software1.1K views
Larson and toubro by anoopc1998
Larson and toubroLarson and toubro
Larson and toubro
anoopc199896 views
Performance Verification for ESL Design Methodology from AADL Models by Space Codesign
Performance Verification for ESL Design Methodology from AADL ModelsPerformance Verification for ESL Design Methodology from AADL Models
Performance Verification for ESL Design Methodology from AADL Models
Space Codesign459 views
Coverage Solutions on Emulators by DVClub
Coverage Solutions on EmulatorsCoverage Solutions on Emulators
Coverage Solutions on Emulators
DVClub797 views
Using GitLab CI by Lingvokot
Using GitLab CIUsing GitLab CI
Using GitLab CI
Lingvokot426 views
Using GitLab CI by ColCh
Using GitLab CIUsing GitLab CI
Using GitLab CI
ColCh9.9K views
Five cool ways the JVM can run Apache Spark faster by Tim Ellison
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
Tim Ellison2.1K views

Recently uploaded

An approach of ontology and knowledge base for railway maintenance by
An approach of ontology and knowledge base for railway maintenanceAn approach of ontology and knowledge base for railway maintenance
An approach of ontology and knowledge base for railway maintenanceIJECEIAES
12 views14 slides
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th... by
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...ahmedmesaiaoun
12 views84 slides
FLOW IN PIPES NOTES.pdf by
FLOW IN PIPES NOTES.pdfFLOW IN PIPES NOTES.pdf
FLOW IN PIPES NOTES.pdfDearest Arhelo
90 views10 slides
LFA-NPG-Paper.pdf by
LFA-NPG-Paper.pdfLFA-NPG-Paper.pdf
LFA-NPG-Paper.pdfharinsrikanth
40 views13 slides
What is Whirling Hygrometer.pdf by
What is Whirling Hygrometer.pdfWhat is Whirling Hygrometer.pdf
What is Whirling Hygrometer.pdfIIT KHARAGPUR
11 views3 slides
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L... by
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...Anowar Hossain
10 views34 slides

Recently uploaded(20)

An approach of ontology and knowledge base for railway maintenance by IJECEIAES
An approach of ontology and knowledge base for railway maintenanceAn approach of ontology and knowledge base for railway maintenance
An approach of ontology and knowledge base for railway maintenance
IJECEIAES12 views
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th... by ahmedmesaiaoun
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
Performance of Back-to-Back Mechanically Stabilized Earth Walls Supporting th...
ahmedmesaiaoun12 views
What is Whirling Hygrometer.pdf by IIT KHARAGPUR
What is Whirling Hygrometer.pdfWhat is Whirling Hygrometer.pdf
What is Whirling Hygrometer.pdf
IIT KHARAGPUR 11 views
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L... by Anowar Hossain
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...
DevOps to DevSecOps: Enhancing Software Security Throughout The Development L...
Anowar Hossain10 views
A multi-microcontroller-based hardware for deploying Tiny machine learning mo... by IJECEIAES
A multi-microcontroller-based hardware for deploying Tiny machine learning mo...A multi-microcontroller-based hardware for deploying Tiny machine learning mo...
A multi-microcontroller-based hardware for deploying Tiny machine learning mo...
IJECEIAES10 views
Informed search algorithms.pptx by Dr.Shweta
Informed search algorithms.pptxInformed search algorithms.pptx
Informed search algorithms.pptx
Dr.Shweta12 views
Update 42 models(Diode/General ) in SPICE PARK(DEC2023) by Tsuyoshi Horigome
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
Update 42 models(Diode/General ) in SPICE PARK(DEC2023)
Multi-objective distributed generation integration in radial distribution sy... by IJECEIAES
Multi-objective distributed generation integration in radial  distribution sy...Multi-objective distributed generation integration in radial  distribution sy...
Multi-objective distributed generation integration in radial distribution sy...
IJECEIAES15 views
cloud computing-virtualization.pptx by RajaulKarim20
cloud computing-virtualization.pptxcloud computing-virtualization.pptx
cloud computing-virtualization.pptx
RajaulKarim2082 views
fakenews_DBDA_Mar23.pptx by deepmitra8
fakenews_DBDA_Mar23.pptxfakenews_DBDA_Mar23.pptx
fakenews_DBDA_Mar23.pptx
deepmitra812 views
CHI-SQUARE ( χ2) TESTS.pptx by ssusera597c5
CHI-SQUARE ( χ2) TESTS.pptxCHI-SQUARE ( χ2) TESTS.pptx
CHI-SQUARE ( χ2) TESTS.pptx
ssusera597c520 views
Machine Element II Course outline.pdf by odatadese1
Machine Element II Course outline.pdfMachine Element II Course outline.pdf
Machine Element II Course outline.pdf
odatadese16 views
Electronic Devices - Integrated Circuit.pdf by booksarpita
Electronic Devices - Integrated Circuit.pdfElectronic Devices - Integrated Circuit.pdf
Electronic Devices - Integrated Circuit.pdf
booksarpita11 views
9_DVD_Dynamic_logic_circuits.pdf by Usha Mehta
9_DVD_Dynamic_logic_circuits.pdf9_DVD_Dynamic_logic_circuits.pdf
9_DVD_Dynamic_logic_circuits.pdf
Usha Mehta21 views

Squeezing Blood From a Stone V1.2

  • 1. Getting Back Memory and Performance Jen Costillo While you wait, download: http://tinyurl.com/nha7853 V1.2 release
  • 2. Why Optimize? Lower memory -> cheaper BOM Lower Memory footprint Low RAM footprint PerformancePerformance Bottlenecks Power considerations Maintenance Sometime your compiler can’t do everything. Sometime you want a challenge 11/15/2015 2Costillo Rebelbot
  • 3. Things Covered Basics of: Intro to tools: Profiling tools Code Optimization RAM optimization Keil IDA RAM optimization Map files Compiler/Linker documentation secrets 11/15/2015 3Costillo Rebelbot
  • 4. Things Not Covered (but are pretty cool) Virtual memory Caching for speed Branch optimization Processor pipeline considerationsProcessor pipeline considerations Specifics of a particular processor family Instead focus on where to find the info to accomplish the goal 11/15/2015 4Costillo Rebelbot
  • 5. Performance Speed of execution Resolve bottlenecks S p a c eResolve bottlenecks Meeting design guidelines Meet power consumption model Time e 11/15/2015 5Costillo Rebelbot
  • 6. Maintenance Refactoring Code Hygiene Primarily an avoidance mechanism 11/15/2015 6Costillo Rebelbot
  • 7. Lab 0 Install toolchain Keil Download project Github Install ST Link SW and Driver STM32L476 Discovery board Hook Up Scope (Optional- /Utilities/PC SW/Saleae)Hook Up Scope (Optional- /Utilities/PC SW/Saleae) PB2 PE8 PA0 Make sure it compiles with “LAB0” defined in project options- > C/C++-> Preprocessor Symbols-> Define WARNING: Keil may decide additional packages needed 11/15/2015 7Costillo Rebelbot
  • 8. Build and Load in Keil Open Workspace: project menu -> Open Project. Select Optimizer.uvprojx Change Options->Change Options-> C/C++-> Preprocessor Symbols-> Define Rebuild / F7 Debug/ Ctrl + F5 Run/ F511/15/2015 8Costillo Rebelbot
  • 9. Program Structure MainThread 500ms timer thrLED Toggle LED4 thrIsrLED Sample trigger Signal Set EXTI GPIO ISR Signal Set Joystick Select Press 11/15/2015 9Costillo Rebelbot
  • 10. Before You Dive In Don’t optimize too early Keep a baseline Profiling Keep track of memory utilizationKeep track of memory utilization Leverage tools with compiler tool chain/IDE Create your own profiling systems 11/15/2015 10Costillo Rebelbot
  • 12. Performance Measurements 1. Model – what are you expecting them to be a. Tasks b. RTOS – task switching c. ISRs 2. Measure 1. Review compiler listings and map files1. Review compiler listings and map files 2. Home-grown profiling tools 3. Leverage simulators 4. IDE/RTOS toolchain tools 3. Modify – basic tools 1. Leverage toolchain intrinsics 2. Utilize compiler optimizer 3. Use Big-O to improve code structure 4. Count and shrink instruction count 5. Assembly based on processor pipeline knowledge11/15/2015 12Costillo Rebelbot
  • 13. Measurements Interface Tradeoffs Type Pro Cons Logging –based: Human readable Serial or other live stream Happens in “real” time Serial port is slow Overhead is high Can disrupt execution order Circular buffer Faster than serial Need extraction tool Not reading in real time Limited data File system with extraction tool Stays until you extract it Size is limited by allocation Requires extraction tool Not reading in real time HW-based: Execution disruption is low Need to decode for readability GPIOs Setup is low Potentially high pin count PWMs Low pin count Overhead can be high Can be painful without spectrum analyzer DAC Low pin count 2^n bits of event levels Need oscilloscope 11/15/2015 13Costillo Rebelbot
  • 14. Modification Improvement Strategies Type Method Impact Algorithm efficiency function • Review your Big O(n) • Leverage preprocessor intrinsics • Count instructions and write new code ROM, Scale, speed Code size function ROM, processor pipeline, write new code • Utilize optimizer flags • C/Assembly based on processor knowledge Code size function ROM, processor pipeline, Target size or call frequency Memory usage • Leverage compiler intrinsics • Utilize optimizer or linker flags RAM stack. heap Memory location RAM 11/15/2015 14Costillo Rebelbot
  • 15. Profiling LED Sample Task MainThread thrLED thrIsrLED Sample triggerSignal SetMainThread 500ms timer thrLED Toggle LED4 trigger EXTI GPIO ISR Signal Set Signal Set 11/15/2015 15Costillo Rebelbot
  • 16. Lab1 - Make Profiler Module Go to project Options and in C/C++, change LAB0 -> LAB1 Load Saleae logic settings in /Saleae (Optional) Exercise: How to improveExercise: How to improve Observe: Measurement accuracy 11/15/2015 16Costillo Rebelbot
  • 17. Lab2 -Improve Profiler Module Go to project Options and in C/C++, change LAB1 -> LAB2 Exercise: Count Instruction Cycles Observe: 8MHz processor speed ~56us Interrupt Delay (448cycles)~56us Interrupt Delay (448cycles) ~12ms blip (~96k cycles) 11/15/2015 17Costillo Rebelbot
  • 18. Estimate Instruction Count Exercise: Estimate the number of instructions, if you dare 11/15/2015 18Costillo Rebelbot
  • 19. Intro to reading a map file Cross reference – location of data/functions Symbol Table – size in section Memory Map – memory section Image Component Sizes – by moduleImage Component Sizes – by module Callgraph – depth of stack usage * Summary *Keil uses a separate .htm file 11/15/2015 19Costillo Rebelbot
  • 20. Image Breakdown
      Columns: Segment / Data type / Contains
      • .text/.code. READ ONLY, code. Functions, const, string literals and pre-defined values.
      • .bss/.zinit. READ WRITE, zero-init. Uninitialized global and static variables.
      • .data. READ WRITE, initialized. Global variables, static variables.
      • STACK. READ WRITE. Call stack, local function vars.
      • HEAP. READ WRITE. malloc()
      https://en.wikipedia.org/wiki/Data_segment
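As a rough illustration of the breakdown above, the snippet below annotates where a typical embedded toolchain would be expected to place each kind of object. All names (`gain_table`, `sample_count`, `threshold`, `process_sample`) are hypothetical, and exact section names and placement depend on the compiler and linker settings, so the map file remains the authority.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

const uint16_t gain_table[4] = {1, 2, 4, 8}; /* read-only data: lives with code/constants in ROM */
uint32_t sample_count;                       /* no initializer: zero-init (.bss), RAM only */
uint32_t threshold = 500;                    /* initialized and writable: .data (RAM plus init values in ROM) */

uint32_t process_sample(uint16_t raw)
{
    static uint32_t last_result;             /* static local: zero-init RAM, not on the stack */
    uint32_t scaled = (uint32_t)raw * gain_table[2]; /* local: stack or a register */

    uint8_t *scratch = malloc(16);           /* dynamic allocation: heap */
    assert(scratch != NULL);
    scratch[0] = (uint8_t)(scaled > threshold);
    last_result = scaled + scratch[0];
    free(scratch);

    sample_count++;
    return last_result;
}
```

Moving a variable between these categories (for example from a file-scope array to a local) shifts bytes between the RW data total and the stack, which is the effect the later RAM labs look at.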
  • 21. Estimate Instruction Count
      Exercise: Estimate the number of instructions executed, both via profiling and by counting instructions. Are they close?
  • 22. Estimate Clocks Per Instruction (CPI)
      Method 1: CPI = (Execution Time * Clock Frequency) / Instruction Count
          (12 ms * 8 MHz) / x = 96,000 cycles / x = ???
          Instructions are hard to count; a reasonable number is ~1-2.
      Method 2: CPI = weighted average of instruction types
          HAL_GPIO_TogglePin:
          Instruction             Cycle Count
          LDR  R2, [R0,#0x14]     2
          EORS R2, R1             1
          STR  R2, [R0,#0x14]     2
          BX   LR                 1 + (pipeline refill 1-3)
          Total instructions: 4
          Total cycles: 6-8
          CPI: 1.5 - 2
      http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0439b/CHDDIGAC.html
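Both CPI methods reduce to small arithmetic, sketched below under stated assumptions. The function names are hypothetical. Method 1 first converts a measured time to cycles (cycles = time * clock frequency) and then divides by an instruction count that you must supply; Method 2 averages per-instruction cycle counts taken from a listing, here without any pipeline-refill cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured duration in microseconds to clock cycles. */
uint64_t cycles_from_us(uint32_t duration_us, uint32_t clock_hz)
{
    return ((uint64_t)duration_us * clock_hz) / 1000000u;
}

/* Method 1: CPI = (execution time * clock frequency) / instruction count. */
double cpi_from_measurement(uint32_t duration_us, uint32_t clock_hz,
                            uint64_t instruction_count)
{
    assert(instruction_count > 0);
    return (double)cycles_from_us(duration_us, clock_hz) / (double)instruction_count;
}

/* Method 2: CPI = average of the per-instruction cycle counts in a listing. */
double cpi_from_listing(const uint8_t *cycles_per_instruction, uint32_t n)
{
    assert(n > 0);
    uint32_t total = 0;
    for (uint32_t i = 0; i < n; i++) {
        total += cycles_per_instruction[i];
    }
    return (double)total / (double)n;
}
```

At 8 MHz, `cycles_from_us(56, 8000000)` gives 448 cycles and `cycles_from_us(12000, 8000000)` gives 96,000 cycles, matching the Lab2 observations. Feeding the base counts {2, 1, 2, 1} for the four `HAL_GPIO_TogglePin` instructions into `cpi_from_listing` gives 1.5, the low end of the range quoted on the slide.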
  • 23. Instruction Counting versus Time
      Counting/CPI:
      • Tedious
      • Non-representative result
      • Not required in most cases
      • Works best in small encapsulated functions() with limited call stack depth
      • Use when concerned with nanoseconds
      Time:
      • Look at average execution times, not instructions
      • Break down only to the level of detail required
      • Best for large procedures and subsystems
      • Use when at microsecond and millisecond scope
  • 24. Lab2B - More Granularity
      Go to project Options and in C/C++, change LAB2 -> LAB2B
      Exercise: Determine where processing time is spent
      • Toggle on each f() call in task
      Observe: LCD calls are long
      • LCD Clear
      • LCD Display
  • 26. Now for something interesting
      thrIsrLED() includes:
      • Creates a sliding averaging window structure
      • Collects a gyro data sample with each button press
      • Calculates the magnitude of the 3 axes, just for fun
      • Adds it to a sliding averaging window
      • Prints out the average of the window on the LCD screen
      [Diagram: EXTI GPIO ISR -> Signal Set -> thrIsrLED (Sample trigger)]
  • 27. Lab 3 - Measure Data Processing Time
      Go to project Options and in C/C++, change LAB2 -> LAB3
      Exercise: Measure code in terms of time and size
      • Utilize compiler listings under IDA
      Observations:
      • Profiler time goes up (from 8 ms-14 ms to 11 ms-15 ms)
      • Algorithm choices are poor
  • 28. Bytes By the Numbers
      Code
      Component         Lab0          Lab1         Lab2         Lab3          Lab4   Lab5
      Main.o            868+124+88    778+62+88    822+72+88    1008+136+88   -      -
      thrIsrLED()       18            24           54           140           -      -
      Slidingwindow.o   -             -            -            154           -      -
      SWInsert&Ave()    -             -            -            92            -      -
      Total             27572         27576        27648        27988         -      -
      RW Data
      Component         Lab0          Lab1         Lab2         Lab3          Lab4   Lab5
      Main.o            21            21           28           21+224        -      -
      thrIsrLED()       -             -            -            224           -      -
      Slidingwindow.o   -             -            -            0             -      -
      Total             240           240          244          240           -      -
  • 29. Using IDA
      • Select "New" to disassemble a new file.
      • Open "Optimizer.axf"
      • Change Processor Type to ARM Little Endian
      • Select "Ok" and "yes" for everything else
  • 30. LAB 3 - thrIsrLED
      C source:
          void thrIsrLED(void const *argument)
          {
              uint8_t strbuff[20];
              SlidingWindow16Init(&average_gyro, AVERAGE_WINDOW_SIZE, windowbuffer);
              float GyroBuffer[3];
              for (;;) {
                  osSignalWait(0x0001, osWaitForever);
                  //1. Get sample and calculate the magnitude
                  BSP_GYRO_GetXYZ(GyroBuffer);
                  int16_t sample = MAGNITUDE_3AXIS(GyroBuffer[AXIS_DIR__X], GyroBuffer[AXIS_DIR__Y], GyroBuffer[AXIS_DIR__Z]);
                  //2. Process sample
                  SlidingWindowInsertSampleAndUpdateAverage16(&average_gyro, sample);
                  int16_t ave = 0;
                  SlidingWindowGetAverage16(&average_gyro, &ave);
                  //3. Print Average result
                  BSP_LCD_GLASS_Clear();
                  /* Get the current menu */
                  sprintf((char *)strbuff, "%d", ave);
                  BSP_LCD_GLASS_DisplayString((uint8_t *)strbuff);
                  LED5_PROFILE__STOP;
              }
          }
      IDA disassembly:
          ER_IROM1:08005438 thrIsrLED
          ……
          ER_IROM1:08005446 loc_8005446 ; CODE XREF: thrIsrLED+8Aj
          ER_IROM1:08005446 MOV.W R2, #0xFFFFFFFF ; millisec
          ER_IROM1:0800544A MOVS R1, #1 ; signals
          ER_IROM1:0800544C MOV R0, SP ; retstr
          ER_IROM1:0800544E BL osSignalWait
          ER_IROM1:08005452 ADD R0, SP, #0x30+GyroBuffer ; pfData
          ER_IROM1:08005454 BL BSP_GYRO_GetXYZ
          ER_IROM1:08005458 VLDR S0, [SP,#0x30+GyroBuffer]
          ER_IROM1:0800545C VMUL.F32 S0, S0, S0
          ER_IROM1:08005460 VCVTR.S32.F32 S0, S0
          ER_IROM1:08005464 VMOV R1, S0
          ER_IROM1:08005468 VLDR S0, [SP,#0x30+GyroBuffer+4]
          ER_IROM1:0800546C VMUL.F32 S0, S0, S0
          ER_IROM1:08005470 VCVTR.S32.F32 S0, S0
          ER_IROM1:08005474 VMOV R2, S0
          ER_IROM1:08005478 ADD R2, R1
          ER_IROM1:0800547A VLDR S0, [SP,#0x30+GyroBuffer+8]
          ER_IROM1:0800547E VMUL.F32 S0, S0, S0
          ER_IROM1:08005482 VCVTR.S32.F32 S0, S0
          ER_IROM1:08005486 VMOV R1, S0
          ER_IROM1:0800548A ADDS R0, R2, R1 ; val
          ER_IROM1:0800548C BL sq_rt
          ER_IROM1:08005490 SXTH R4, R0
          ER_IROM1:08005492 MOV R1, R4 ; sample
          ER_IROM1:08005494 LDR R0, =average_gyro ; pwindow
          ER_IROM1:08005496 BL SlidingWindowInsertSampleAndUpdateAverage16
          ER_IROM1:0800549A MOVS R0, #0
          ……
          ER_IROM1:080054C2 B loc_8005446
  • 31. Revisit Estimate Instruction Count
      Exercise: Is utilizing IDA to count instructions effective?
  • 32. Quick Tips
      • Big O notation matters: iteration improvements pay off big.
      • Most compilers are smart enough to optimize if you tell them.
      Math considerations:
      • Data type matters.
      • Division by a power of 2 becomes a >> operation. Use powers of 2 for buffer sizes on averaging windows.
      • Skip % operations. They usually become some version of / and are expensive. While/subtraction loops can be faster in some cases.
      • pow(), sqrt(), and math.h are expensive. Focus on "good enough".
      NOTE: some of these will appear in the next labs
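The power-of-2 tips can be illustrated with a small averaging window. This is a sketch only and not the workshop's SlidingWindow module: `Window16`, `window16_init`, `window16_insert`, and the 8-sample size are hypothetical. It assumes the window size is fixed at compile time so that the index wrap is a mask rather than `%` and the divisor is a constant power of 2 that the compiler can reduce to shift-based code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SHIFT 3u                     /* hypothetical: 2^3 = 8 samples */
#define WINDOW_SIZE  (1u << WINDOW_SHIFT)
#define WINDOW_MASK  (WINDOW_SIZE - 1u)

typedef struct {
    int16_t  samples[WINDOW_SIZE];
    int32_t  sum;    /* running sum, so inserting is O(1) instead of O(n) */
    uint32_t index;  /* next slot to overwrite */
} Window16;

void window16_init(Window16 *w)
{
    assert(w != NULL);
    for (uint32_t i = 0; i < WINDOW_SIZE; i++) {
        w->samples[i] = 0;
    }
    w->sum = 0;
    w->index = 0;
}

/* Insert a sample and return the average over the whole window.
   Until the window has filled, the zeroed slots pull the average down. */
int16_t window16_insert(Window16 *w, int16_t sample)
{
    w->sum += sample - w->samples[w->index];    /* replace the oldest sample in the sum */
    w->samples[w->index] = sample;
    w->index = (w->index + 1u) & WINDOW_MASK;   /* wrap with a mask, no % */
    return (int16_t)(w->sum / (int32_t)WINDOW_SIZE); /* constant power-of-2 divisor */
}
```

The running sum is the Big O point from the slide: each insert touches one slot instead of re-adding the whole buffer. The tradeoff is four extra bytes of state and a window size that can no longer be an arbitrary number.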
  • 34. RAM Optimization
      Symptoms:
      • You are out of space and can't link
      • You keep blowing your stack/heap (i.e. things are suddenly in the weeds or have weird values)
      • malloc() keeps failing
      Solutions:
      • Look at your local variables inside functions
      • Remove debug helper variables
      • Reduce stack size if possible
      • Alter your memory map
      • Inputs on stack versus sending a pointer to a struct
      • Referencing globals and static variables
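The "inputs on stack versus pointer to struct" item can be shown with two versions of the same function. `GyroFrame` and both function names are hypothetical; the point is only that a large struct passed by value is typically copied for every call, while a pointer costs one machine word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int16_t axis[3];
    uint8_t history[64];  /* padding payload so the struct is large enough to matter */
} GyroFrame;

/* By value: the whole GyroFrame is copied for each call, typically onto the stack. */
int32_t sum_axes_by_value(GyroFrame frame)
{
    return (int32_t)frame.axis[0] + frame.axis[1] + frame.axis[2];
}

/* By pointer: only the address is passed; const documents that it is read-only. */
int32_t sum_axes_by_pointer(const GyroFrame *frame)
{
    assert(frame != NULL);
    return (int32_t)frame->axis[0] + frame->axis[1] + frame->axis[2];
}
```

With deep call chains inside an RTOS task, the by-value copies add up against that task's stack allocation, so the pointer form is usually the safer default for anything larger than a few words.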
  • 35. Lab 4 - Lower RAM footprint
      Go to project Options and in C/C++, change LAB3 -> LAB4
      Exercise:
      • Role of global, static, and local variables in RAM footprint: what happens as they shift attributions?
      • Tradeoffs in hiding variables in the call stacks
      • Select the right stack size for your task
      Observation:
      • Smaller .DATA segment
      • Decreased algorithm size
  • 36. Bytes By the Numbers
      Code
      Component         Lab0          Lab1          Lab2          Lab3          Lab4          Lab5
      Main.o            868+124+88    874+124+88    880+124+88    1008+136+88   1000+128+88   -
      thrIsrLED()       18            24            54            140           140           -
      Slidingwindow.o   -             -             -             154           146           -
      SWInsert&Ave()    -             -             -             92            84            -
      Total             27572         27576         27648         27988         27972         -
      RW Data
      Component         Lab0          Lab1          Lab2          Lab3          Lab4          Lab5
      Main.o            21            21            28            21+224        21            -
      thrIsrLED()       -             -             -             224           Stack!        -
      Slidingwindow.o   -             -             -             0             0             -
      Total             240           240           244           240           240           -
  • 37. Deeper Code Space Optimization
      • Reduce number of instructions
      • Use listing file
      • Check stack and heap usage
      • Use compiler flags
      • Use processor intrinsics
  • 38. Lab 5 - More Code Space Optimization Through Toolchain
      Go to project Options and in C/C++, change LAB4 -> LAB5
      Exercise:
      • Optimize only SlidingWindowAverage() with intrinsics.
      • Use smarter math operation selections. Is there a tradeoff?
      Observation:
      • Check current size on listing
  • 39. Bytes By the Numbers
      Code
      Component         Lab0          Lab1          Lab2          Lab3          Lab4          Lab5
      Main.o            868+124+88    874+124+88    880+124+88    1008+136+88   1000+128+88   1000+130+88
      thrIsrLED()       18            24            54            140           140           138
      Slidingwindow.o   -             -             -             154           146           136
      SWInsert&Ave()    -             -             -             92            84            72
      Total             27572         27576         27648         27988         27972         27960
      RW Data
      Component         Lab0          Lab1          Lab2          Lab3          Lab4          Lab5
      Main.o            21            21            28            21+224        21            21
      thrIsrLED()       -             -             -             224           Stack!        Stack!
      Slidingwindow.o   -             -             -             0             0             0
      Total             240           240           244           240           240           240
  • 40. Open Lab - Things to try
      • Unroll loop (need to standardize buffer size). Hint: #pragma unroll [(n)]
      • Write a better square root function. Hint: lookup table
      • Actually turn on Optimizer for space or speed. Hint: Type –Os3
      • Check call graph for deepest stack usage. Hint: Optimizer.htm
      • Turn on RTOS run time stat feature in FreeRTOSConfig.h file. Hint: configGENERATE_RUN_TIME_STATS
      • Customize map file
      • Who can get the MOST EFFICIENT CODE?
      HINT: remove PADDING
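For the square-root item, one starting point is an integer-only routine. The sketch below deliberately swaps in the classic bit-by-bit (digit-by-digit) integer square root rather than the lookup table the slide hints at, because it needs no table storage; `isqrt32` is a hypothetical name and is not the project's `sq_rt`. It returns the floor of the square root of a 32-bit unsigned value.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-by-bit integer square root: returns floor(sqrt(value)). */
uint16_t isqrt32(uint32_t value)
{
    const uint32_t original = value;
    uint32_t root = 0;
    uint32_t bit = 1uL << 30;      /* highest power of 4 that fits in 32 bits */

    while (bit > value) {          /* start at the largest power of 4 <= value */
        bit >>= 2;
    }

    while (bit != 0) {
        if (value >= root + bit) { /* this result bit is set */
            value -= root + bit;
            root = (root >> 1) + bit;
        } else {                   /* this result bit is clear */
            root >>= 1;
        }
        bit >>= 2;
    }

    assert(root * root <= original);  /* debug-only postcondition */
    return (uint16_t)root;
}
```

The loop runs at most 16 iterations of shifts, adds, and compares, with no floating point or division. Whether it actually beats a table on the target is something to confirm with the profiler GPIO and the listing size, in the same way as the earlier labs.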