1
PRAGMATIC
OPTIMIZATION
IN MODERN PROGRAMMING
ORDERING OPTIMIZATION APPROACHES
Created by Marina (geek) Kolpakova for UNN / 2015-2016
2
COURSE TOPICS
Ordering optimization approaches
Demystifying a compiler
Mastering compiler optimizations
Modern computer architectures concepts
3
OUTLINE
What is optimization?
Pragmatic approach
Optimization trade-offs
Knowledge which is required
Where to get the performance?
Optimization cycle
Top-Down (High-low) approach
Optimization cycle (revised)
Optimization steps overview
How to learn optimization?
Recommended literature
Summary
4
WHAT IS OPTIMIZATION?
In computing, optimization is the process of modifying a system
to make some aspect of it work more efficiently or use fewer
resources; in particular, the process of transforming a piece of
code to make it more efficient without changing its output.
5
PRAGMATIC APPROACH
“Programmers waste enormous amounts of time thinking
about, or worrying about, the speed of non-critical parts of
their programs, and these attempts at efficiency actually have
a strong negative impact when debugging and maintenance
are considered. We should forget about small efficiencies, say
about 97% of the time; premature optimization is the root of
all evil. Yet we should not pass up our opportunities in that
critical 3%.”
-Donald Knuth, Structured Programming with go to Statements
1. Find what to start from (3%)
2. Know when to stop (97%)
6
OPTIMIZATION TRADE-OFFS
Code portability decreases when we go deeper
Performance portability decreases when we go deeper
The cost of maintenance and extensibility increases
when we go deeper
Optimizations are often not reusable
Optimizations become obsolete very quickly
...but still performance is a crucial requirement for most
applications.
7
KNOWLEDGE WHICH IS REQUIRED
1. The code
The problem it solves
The algorithm it implements
The algorithmic complexity
2. The compiler
Compilation trajectory
Compiler's capabilities and obstacles
3. The platform
Architecture capabilities
Instruction Set Architecture
Micro-architecture specifics
8
WHERE TO GET THE PERFORMANCE?
High-level: the programmer
Middle-level: the compiler
Low-level: the hardware
9
OPTIMIZATION CYCLE
10
TOP-DOWN (HIGH-LOW) APPROACH
1. Understand the code
2. Use appropriate algorithms
3. Optimize memory access patterns
4. Minimize number of operations
5. Shrink the critical path
6. Perform HW-specific optimizations
7. Dive into assembly
11
OPTIMIZATION CYCLE (REVISED)
12
STEP #1: UNDERSTAND THE CODE
Different people think differently
you'll need some time to get used to the code
Understand data flow
input/output parameters
data dependencies
Identify performance limiters
Time
Profile
Collect metrics
e.g. CPI, power consumption
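Timing is the cheapest of these measurements. A minimal sketch in C, assuming the standard `clock()` CPU-time counter is precise enough for the region of interest; `sum_squares` and `time_region` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical workload: the code region whose cost we want to know. */
static double sum_squares(const float *data, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (double)data[i] * data[i];
    return acc;
}

/* Runs the workload once and returns the CPU time it took, in seconds.
 * The result is handed back through `result` so the work stays observable
 * and is not removed as dead code. */
static double time_region(const float *data, size_t n, double *result)
{
    clock_t start = clock();
    *result = sum_squares(data, n);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

For short regions the resolution of `clock()` is usually too coarse, so in practice the region is repeated many times or a profiler is used instead.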
13 . 1
STEP #2: USE APPROPRIATE ALGORITHM
Consider and lower big O complexity
Choose data structures wisely
Look for optimized libraries
Find opportunities to scalarize & parallelize
13 . 2
STEP #2: USE APPROPRIATE ALGORITHM
Compilers are not aware of the semantics of the code, so
focus on the algorithmic aspect first.
Decrease big-O complexity
Use optimized libraries for subroutines
Restructure the code to use fewer resources
Split the problem into subtasks and organize them wisely
Parallelize
What if you need to sort 100 MB of numerical data...
WHAT SORTING ALGORITHM WOULD YOU CHOOSE?
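One possible answer, assuming the data are 32-bit unsigned integers and a scratch buffer of the same size is affordable: a least-significant-digit radix sort, which makes a fixed number of linear passes over the data instead of O(n log n) comparisons. This is a sketch, not the answer given in the course; for other key types or nearly sorted input a different choice may win. The name `radix_sort_u32` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* LSD radix sort of 32-bit unsigned keys, one byte (256 buckets) per pass.
 * `tmp` is caller-provided scratch space holding at least n elements. */
static void radix_sort_u32(uint32_t *keys, uint32_t *tmp, size_t n)
{
    uint32_t *src = keys, *dst = tmp;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};

        /* Histogram of the current byte. */
        for (size_t i = 0; i < n; i++)
            count[(src[i] >> shift) & 0xFF]++;

        /* Exclusive prefix sum turns counts into starting offsets. */
        size_t offset = 0;
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = offset;
            offset += c;
        }

        /* Stable scatter into the destination buffer. */
        for (size_t i = 0; i < n; i++)
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];

        /* The output of this pass is the input of the next one. */
        uint32_t *swap = src;
        src = dst;
        dst = swap;
    }
    /* After four passes (an even number) the sorted data is back in `keys`. */
}
```

Each pass reads and writes the whole array sequentially, which also fits the memory-access advice of the next step.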
14 . 1
STEP #3: OPTIMIZE MEMORY ACCESSES
You'll be surprised how many algorithms are memory bound!
Optimization for memory usually involves:
Data restructuring
to load only data that is really needed for computations.
Data packaging
to shrink the data in size
Loop transformations
to walk through the data in a more efficient way,
to increase temporal & spatial locality,
to perform cache-aware optimization
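Data restructuring can be sketched as follows, assuming a hypothetical particle record of which a hot loop reads only one field. In the array-of-structures layout every cache line fetched carries fields the loop never touches; the structure-of-arrays layout keeps the needed values contiguous. All type and function names are assumptions made for illustration.

```c
#include <stddef.h>

/* Array of structures: summing `mass` drags x, y, z and id through the cache. */
struct particle {
    float x, y, z;
    float mass;
    int   id;
};

static float total_mass_aos(const struct particle *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].mass;      /* stride of sizeof(struct particle) bytes */
    return sum;
}

/* Structure of arrays: each field lives in its own contiguous array. */
struct particles {
    float *x, *y, *z;
    float *mass;
    int   *id;
};

static float total_mass_soa(const struct particles *p, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p->mass[i];     /* unit stride: only mass values are loaded */
    return sum;
}
```

Both functions compute the same sum; only the amount of memory traffic per useful value differs.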
14 . 2
STEP #3: OPTIMIZE MEMORY ACCESSES
Compilers are quite good at local optimizations, such as
loop body transformations,
inlining of local functions,
simplification of arithmetic expressions,
so help the compiler rather than try to outfox it.
Work together with it on
enabling auto-vectorization,
optimizing critical loops,
vectorizing.
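One way to help the compiler, sketched under the assumption of a C99 compiler: without aliasing information it must assume that `dst` may overlap `a` or `b`, which can block or complicate vectorization of the loop. The `restrict` qualifier is the programmer's promise that the arrays do not overlap. The function name `add_arrays` is hypothetical.

```c
#include <stddef.h>

/* `restrict` promises that dst, a and b never overlap, so the compiler may
 * load several elements of a and b before storing to dst (i.e. vectorize). */
static void add_arrays(float *restrict dst,
                       const float *restrict a,
                       const float *restrict b,
                       size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Whether the loop was actually vectorized is best confirmed from the compiler's optimization report or the generated assembly; calling the function with overlapping arrays breaks the promise and is undefined behavior.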
14 . 3
STEP #3: OPTIMIZE MEMORY ACCESSES
for (int j = 0; j < height; j++)
    for (int i = 0; i < width; i++)
        if (img[j * width + i] > 0)
            count++;

for (int i = 0; i < width; i++)
    for (int j = 0; j < height; j++)
        if (img[j * width + i] > 0)
            count++;
WHICH IS FASTER ON A CONVENTIONAL CPU?
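Assuming `img` is stored row by row, as the index expression `j * width + i` implies, the first variant reads consecutive addresses, while the second jumps `width` elements between reads and, for wide images, touches a new cache line on almost every access. Both produce the same count. The sketch below wraps the two traversals in hypothetical functions `count_row_major` and `count_column_major` so they can be timed against each other; the `uint8_t` pixel type is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Inner loop over i: unit-stride reads, friendly to caches and prefetchers. */
static size_t count_row_major(const uint8_t *img, int width, int height)
{
    size_t count = 0;
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
            if (img[(size_t)j * width + i] > 0)
                count++;
    return count;
}

/* Inner loop over j: consecutive reads are `width` bytes apart. */
static size_t count_column_major(const uint8_t *img, int width, int height)
{
    size_t count = 0;
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            if (img[(size_t)j * width + i] > 0)
                count++;
    return count;
}
```

On a typical cached CPU the row-major traversal is expected to win once the image no longer fits in cache, but the gap depends on the image size and the memory hierarchy, so it is worth measuring.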
15 . 1
STEP #4: MINIMIZE NUMBER OF OPERATIONS
Reducing the number of operations in a program
doesn't necessarily decrease its runtime,
but it's a good heuristic.
The compiler usually helps a lot here:
Machine-independent optimizations
Common sub-expression elimination
Constant propagation
Redundancy elimination
...
Machine-dependent optimizations
Register allocation
Instruction selection
Instruction scheduling
Peephole optimization
...
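To make two of the machine-independent items concrete, the sketch below shows by hand what constant propagation and common sub-expression elimination do to a small function. The functions are hypothetical; an optimizing compiler is expected to perform such rewrites itself, which is why doing them manually rarely pays off.

```c
/* As written by the programmer. */
static int area_sum_naive(int w, int h)
{
    const int border = 2;
    int outer = (w + 2 * border) * (h + 2 * border);
    int frame = (w + 2 * border) * (h + 2 * border) - w * h;
    return outer + frame;
}

/* After constant propagation (2 * border becomes 4) and common
 * sub-expression elimination (the product is computed only once). */
static int area_sum_optimized(int w, int h)
{
    int outer = (w + 4) * (h + 4);
    return outer + (outer - w * h);
}
```

Both functions return the same value for every input that does not overflow; the second simply spells out what the compiler would normally generate from the first.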
15 . 2
STEP #4: MINIMIZE NUMBER OF OPERATIONS
#include <math.h>

float pows(float a, float b, float c, float d, float e, float f, float x)
{
    return
        a * powf(x, 5.f) +
        b * powf(x, 4.f) +
        c * powf(x, 3.f) +
        d * powf(x, 2.f) +
        e * x + f;
}
gcc -march=armv7-a -mfpu=neon-vfpv4 -mthumb \
    -mfloat-abi=softfp -O3 1.c -S -o 1.s
15 . 3
STEP #4: MINIMIZE NUMBER OF OPERATIONS
pows:
    push {r3, lr}
    flds s17, [sp, #56]
    fmsr s24, r1
    movs r1, #0
    fmsr s22, r0
    movt r1, 16544
    fmrs r0, s17
    fmsr s21, r2
    fmsr s20, r3
    flds s19, [sp, #48]
    flds s18, [sp, #52]
    bl powf(PLT)
    mov r1, #1082130432
    fmsr s23, r0
    fmrs r0, s17
    bl powf(PLT)
    movs r1, #0
    movt r1, 16448
    fmsr s16, r0
    fmrs r0, s17
    bl powf(PLT)
    fmuls s16, s16, s24
    vfma.f32 s16, s23, s22
    fmsr s15, r0
    vfma.f32 s16, s15, s21
    fmuls s15, s17, s17
    vfma.f32 s16, s20, s15
    vfma.f32 s16, s19, s17
    fadds s15, s16, s18
    fldmfdd sp!, {d8-d12}
    fmrs r0, s15
    pop {r3, pc}
... Let's apply Horner's rule.
15 . 4
STEP #4: MINIMIZE NUMBER OF OPERATIONS
float horner(float a, float b, float c, float d, float e, float f, float x)
{
    return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
}
horner:
    flds s15, [sp, #8]
    fmsr s11, r0
    fmsr s12, r1
    flds s14, [sp]
    vfma.f32 s12, s11, s15
    fmsr s11, r2
    flds s13, [sp, #4]
    vfma.f32 s11, s12, s15
    fcpys s12, s11
    fmsr s11, r3
    vfma.f32 s11, s12, s15
    vfma.f32 s14, s11, s15
    vfma.f32 s13, s14, s15
    fmrs r0, s13
    bx lr
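The operation counts behind this result can be checked with a generic version; a sketch assuming coefficients are stored from the highest degree down, with hypothetical names `poly_naive` and `poly_horner`. For a polynomial of degree n, Horner's rule needs n multiplications and n additions, while evaluating each power separately needs on the order of n²/2 multiplications. The two forms round differently in floating point, so results may differ in the last bits for general inputs.

```c
#include <stddef.h>

/* Naive evaluation: recomputes x^k from scratch for every term. */
static float poly_naive(const float *coef, size_t degree, float x)
{
    float sum = 0.0f;
    for (size_t k = 0; k <= degree; k++) {
        float power = 1.0f;
        for (size_t m = 0; m < degree - k; m++)
            power *= x;
        sum += coef[k] * power;
    }
    return sum;
}

/* Horner's rule: one multiply and one add per remaining coefficient. */
static float poly_horner(const float *coef, size_t degree, float x)
{
    float acc = coef[0];
    for (size_t k = 1; k <= degree; k++)
        acc = acc * x + coef[k];
    return acc;
}
```

Each multiply-add pair of the Horner loop maps onto one fused multiply-add, which matches the five `vfma.f32` instructions in the listing above.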
15 . 5
STEP #4: MINIMIZE NUMBER OF OPERATIONS
Unfortunately, a compiler sometimes handles some optimization
steps poorly (e.g. register allocation, scalarization) and harms
performance by introducing redundant operations.
Starting from this optimization step, it is worth looking at the
assembly code to check whether the compiler actually
performs a particular optimization.
16 . 1
STEP #5: SHRINK THE CRITICAL PATH
The critical path is the longest sequence of operations in a code
block that must be completed in order, usually because of
dependencies between steps or operations.
The critical path of a code block is hard to deduce from
high-level code and requires assembly inspection.
Knowledge about architecture capabilities is required to
estimate the critical path more precisely.
Some profilers are able to do critical path analysis.
The term can also refer to the longest sequence of
dependent steps in a pipeline that limits its parallelization.
A control-flow diagram is used to find the critical path.
16 . 2
STEP #5: SHRINK THE CRITICAL PATH
Let's look at the critical path of the following code block.
const uint8_t* p0 = src.ptr(row0);
const uint8_t* p1 = src.ptr(row1);
uint8_t* dptr = dst.ptr(row);
for (int col = 0; col < cols; ++col)
{
    dptr[col] = (p0[col*2] + p0[col*2+1]
               + p1[col*2] + p1[col*2+1] + 2) >> 2;
}
WHAT IS THE CRITICAL PATH OF THIS CODE LINE?
16 . 3
STEP #5: SHRINK THE CRITICAL PATH
Let's create a three-address representation of the code block
r0 = col*2 // 1
r1 = r0+1 // 2
r2 = load(p0, r0) // 3
r3 = load(p0, r1) // 4
r4 = load(p1, r0) // 5
r5 = load(p1, r1) // 6
r6 = r2+r3 // 7
r7 = r6+r4 // 8
r8 = r7+r5 // 9
r9 = r8+2 // 10
r10 = shr(r9, 2) // 11
11?
Let's construct the dependency graph...
16 . 4
STEP #5: SHRINK THE CRITICAL PATH
8 ?
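The length of 8 can be checked by computing the longest chain of dependent operations. A minimal sketch, assuming every operation takes one step and using the dependencies read off the three-address code above; `struct op`, `critical_path` and `downscale_ops` are hypothetical names.

```c
#define MAX_DEPS 2

/* One three-address operation; deps hold the indices of the operations
 * whose results it consumes (-1 marks an unused slot). */
struct op {
    int deps[MAX_DEPS];
};

/* Longest chain of dependent operations, assuming every operation takes one
 * step and operations are listed so that dependencies come first. */
static int critical_path(const struct op *ops, int n, int *depth)
{
    int longest = 0;
    for (int i = 0; i < n; i++) {
        depth[i] = 1;
        for (int d = 0; d < MAX_DEPS; d++) {
            int src = ops[i].deps[d];
            if (src >= 0 && depth[src] + 1 > depth[i])
                depth[i] = depth[src] + 1;
        }
        if (depth[i] > longest)
            longest = depth[i];
    }
    return longest;
}

/* Dependencies of the 11 operations (0-based: entry 0 is r0). */
static const struct op downscale_ops[11] = {
    {{-1, -1}},  /* r0  = col*2              */
    {{ 0, -1}},  /* r1  = r0+1               */
    {{ 0, -1}},  /* r2  = load at offset r0  */
    {{ 1, -1}},  /* r3  = load at offset r1  */
    {{ 0, -1}},  /* r4  = load at offset r0  */
    {{ 1, -1}},  /* r5  = load at offset r1  */
    {{ 2,  3}},  /* r6  = r2+r3              */
    {{ 6,  4}},  /* r7  = r6+r4              */
    {{ 7,  5}},  /* r8  = r7+r5              */
    {{ 8, -1}},  /* r9  = r8+2               */
    {{ 9, -1}},  /* r10 = r9 shifted by 2    */
};
```

Calling `critical_path(downscale_ops, 11, depth)` with an `int depth[11]` buffer returns 8: the chain r0, r1, r3, r6, r7, r8, r9, r10.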
16 . 5
STEP #5: SHRINK THE CRITICAL PATH
But the compiler reorders instructions,
since integer math is associative.
6?
16 . 6
STEP #5: SHRINK THE CRITICAL PATH
And let's assume that the hardware schedules
1 arithmetic and 1 memory operation per clock.
AND BACK TO 8 AGAIN
17 . 1
STEP #6: DO HW-SPECIFIC OPTIMIZATION
It requires a comprehensive understanding of the target HW,
which usually goes beyond the compiler's abilities
Using special hardware capabilities
Overcoming micro-architecture weaknesses
Using instructions that are specific to the concrete HW
Balancing usage of different instruction types
A classical example here is the question of recomputing
a temporary value vs. getting it from memory.
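A small sketch of that trade-off, using bit counting as a stand-in example that is not taken from the course: one variant fetches a precomputed value from a 256-entry table in memory, the other recomputes it with arithmetic only. Which one is faster depends on cache behavior and the instruction mix of the target, so it has to be measured. All names are hypothetical.

```c
#include <stdint.h>

static uint8_t bit_table[256];

/* Fills the table once; entry v holds the number of set bits in v. */
static void init_bit_table(void)
{
    for (int v = 0; v < 256; v++) {
        int bits = 0;
        for (int b = 0; b < 8; b++)
            bits += (v >> b) & 1;
        bit_table[v] = (uint8_t)bits;
    }
}

/* Memory variant: one load per byte, no arithmetic on the value itself. */
static int popcount8_lookup(uint8_t v)
{
    return bit_table[v];
}

/* Compute variant: no memory access, a handful of ALU operations. */
static int popcount8_compute(uint8_t v)
{
    v = (uint8_t)((v & 0x55) + ((v >> 1) & 0x55));  /* sums of bit pairs  */
    v = (uint8_t)((v & 0x33) + ((v >> 2) & 0x33));  /* sums of nibbles    */
    return (v & 0x0F) + (v >> 4);                   /* sum of two nibbles */
}
```

The lookup variant requires `init_bit_table()` to have been called once and competes with the rest of the program for cache space; the compute variant does not.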
17 . 2
STEP #6: DO HW-SPECIFIC OPTIMIZATION
Modern hardware is quite advanced:
deep pipelines,
out-of-order execution,
sophisticated branch prediction,
multi-level memory hierarchies,
processor specialization,
so utilize the unique properties of the hardware.
Peephole optimization is not as important
as it used to be 10 years ago.
18
STEP #7: DIVE INTO ASSEMBLY
Assembly is a must-have for checking the compiler,
but it is rarely used to write low-level code.
Raw assembly makes sense to:
Overcome compiler bugs & optimization limitations
addition of redundant instructions
suboptimal register allocation
Use specific hardware features
which are not expressed in a higher-level ISA
Keep in mind that:
Writing assembly is the least portable optimization
Inline assembly limits compiler optimizations
19
HOW TO LEARN OPTIMIZATION?
Optimization is a craft rather than a science.
Practice more
Do not make practical knowledge too theoretical.
Look at what other people do
Find real use cases of different optimization
approaches and techniques.
Dig into an architecture
HW evolves rapidly, hence devices become obsolete in a wink.
Comprehensive knowledge helps you see ahead.
20 . 1
RECOMMENDED LITERATURE
Computer Architecture, Fifth Edition: A Quantitative Approach
by John L. Hennessy and David A. Patterson
20 . 2
RECOMMENDED LITERATURE
The Mature Optimization Handbook
by Carlos Bueno
20 . 3
RECOMMENDED LITERATURE
Is Parallel Programming Hard, And, If So, What Can You Do About It?
by Paul E. McKenney
20 . 4
RECOMMENDED LITERATURE
Engineering a Compiler
by Keith Cooper and Linda Torczon
21
SUMMARY
Practice, look at what others do and dig into an architecture.
The main task of an optimizer is finding the critical part.
An optimizer's mastery is knowing where to stop.
Knowledge about the code, the compiler and the platform
is a must-have.
Optimization is a measure-analyze-optimize-check cycle.
Stick to the high-to-low approach.
Get the performance from algorithmic and data structure
choices first,
... ensure memory access patterns next,
... then go deeper.
22
THE END
MARINA KOLPAKOVA / 2015-2016

More Related Content

What's hot

TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition 艾鍗科技
 
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...Hsien-Hsin Sean Lee, Ph.D.
 
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
An Open Discussion of RISC-V BitManip, trends, and comparisons _ ClaireRISC-V International
 
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...Hsien-Hsin Sean Lee, Ph.D.
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMLinaro
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...Hsien-Hsin Sean Lee, Ph.D.
 
EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5PRADEEP
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontrollerRouyun Pan
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- MulticoreLec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- MulticoreHsien-Hsin Sean Lee, Ph.D.
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilersAnastasiaStulova
 
5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 艾鍗科技
 
RISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V International
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkAlexey Smirnov
 
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...Hsien-Hsin Sean Lee, Ph.D.
 
07 processor basics
07 processor basics07 processor basics
07 processor basicsMurali M
 
Introduction to Assembly Language
Introduction to Assembly LanguageIntroduction to Assembly Language
Introduction to Assembly LanguageMotaz Saad
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsLinaro
 

What's hot (20)

TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition
 
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...
Lec6 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Instruction...
 
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
An Open Discussion of RISC-V BitManip, trends, and comparisons _ Claire
 
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
Lec17 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Me...
 
Compilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVMCompilation of COSMO for GPU using LLVM
Compilation of COSMO for GPU using LLVM
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Quiz 9
Quiz 9Quiz 9
Quiz 9
 
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
Lec19 Intro to Computer Engineering by Hsien-Hsin Sean Lee Georgia Tech -- Pr...
 
EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5EMBEDDED SYSTEMS 4&5
EMBEDDED SYSTEMS 4&5
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontroller
 
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- MulticoreLec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
Lec13 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Multicore
 
FIFOPt
FIFOPtFIFOPt
FIFOPt
 
Challenges in GPU compilers
Challenges in GPU compilersChallenges in GPU compilers
Challenges in GPU compilers
 
5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron) 5.MLP(Multi-Layer Perceptron)
5.MLP(Multi-Layer Perceptron)
 
RISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentorRISC-V 30907 summit 2020 joint picocom_mentor
RISC-V 30907 summit 2020 joint picocom_mentor
 
GEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions FrameworkGEM - GNU C Compiler Extensions Framework
GEM - GNU C Compiler Extensions Framework
 
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
Lec8 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Dynamic Sch...
 
07 processor basics
07 processor basics07 processor basics
07 processor basics
 
Introduction to Assembly Language
Introduction to Assembly LanguageIntroduction to Assembly Language
Introduction to Assembly Language
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
 

Similar to Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches

What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-optJeff Larkin
 
Compiler optimizations based on call-graph flattening
Compiler optimizations based on call-graph flatteningCompiler optimizations based on call-graph flattening
Compiler optimizations based on call-graph flatteningCAFxX
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsIntel® Software
 
POLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAPOLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAAiman Hud
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Intel® Software
 
Practical C++ Generative Programming
Practical C++ Generative ProgrammingPractical C++ Generative Programming
Practical C++ Generative ProgrammingSchalk Cronjé
 
Compiler optimization techniques
Compiler optimization techniquesCompiler optimization techniques
Compiler optimization techniquesHardik Devani
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCinside-BigData.com
 
Target updated track f
Target updated   track fTarget updated   track f
Target updated track fAlona Gradman
 
Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensAlona Gradman
 
Cse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solutionCse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solutionShobha Kumar
 
Improving Code Quality Through Effective Review Process
Improving Code Quality Through Effective  Review ProcessImproving Code Quality Through Effective  Review Process
Improving Code Quality Through Effective Review ProcessDr. Syed Hassan Amin
 
Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPA B Shinde
 
PCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingPCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingDVClub
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesAhsan Javed Awan
 

Similar to Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches (20)

What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
May2010 hex-core-opt
May2010 hex-core-optMay2010 hex-core-opt
May2010 hex-core-opt
 
Compiler optimizations based on call-graph flattening
Compiler optimizations based on call-graph flatteningCompiler optimizations based on call-graph flattening
Compiler optimizations based on call-graph flattening
 
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ ProcessorsUnderstand and Harness the Capabilities of Intel® Xeon Phi™ Processors
Understand and Harness the Capabilities of Intel® Xeon Phi™ Processors
 
POLITEKNIK MALAYSIA
POLITEKNIK MALAYSIAPOLITEKNIK MALAYSIA
POLITEKNIK MALAYSIA
 
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
Fast Insights to Optimized Vectorization and Memory Using Cache-aware Rooflin...
 
Practical C++ Generative Programming
Practical C++ Generative ProgrammingPractical C++ Generative Programming
Practical C++ Generative Programming
 
Performance_Programming
Performance_ProgrammingPerformance_Programming
Performance_Programming
 
Compiler optimization techniques
Compiler optimization techniquesCompiler optimization techniques
Compiler optimization techniques
 
The Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACCThe Past, Present, and Future of OpenACC
The Past, Present, and Future of OpenACC
 
Target updated track f
Target updated   track fTarget updated   track f
Target updated track f
 
Chip Ex2010 Gert Goossens
Chip Ex2010 Gert GoossensChip Ex2010 Gert Goossens
Chip Ex2010 Gert Goossens
 
Introduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSPIntroduction to Blackfin BF532 DSP
Introduction to Blackfin BF532 DSP
 
Matopt
MatoptMatopt
Matopt
 
Cse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solutionCse viii-advanced-computer-architectures-06cs81-solution
Cse viii-advanced-computer-architectures-06cs81-solution
 
Improving Code Quality Through Effective Review Process
Improving Code Quality Through Effective  Review ProcessImproving Code Quality Through Effective  Review Process
Improving Code Quality Through Effective Review Process
 
Advanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILPAdvanced Techniques for Exploiting ILP
Advanced Techniques for Exploiting ILP
 
PCI Express Verification using Reference Modeling
PCI Express Verification using Reference ModelingPCI Express Verification using Reference Modeling
PCI Express Verification using Reference Modeling
 
pm1
pm1pm1
pm1
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 

Recently uploaded

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfSumit Tiwari
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfMr Bounab Samir
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxNirmalaLoungPoorunde1
 
CELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxCELL CYCLE Division Science 8 quarter IV.pptx
CELL CYCLE Division Science 8 quarter IV.pptxJiesonDelaCerna
 
MICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxMICROBIOLOGY biochemical test detailed.pptx
MICROBIOLOGY biochemical test detailed.pptxabhijeetpadhi001
 
MARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupMARGINALIZATION (Different learners in Marginalized Group
MARGINALIZATION (Different learners in Marginalized GroupJonathanParaisoCruz
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon AUnboundStockton
 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementmkooblal
 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17Celine George
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 

Recently uploaded (20)

Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdfEnzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
Enzyme, Pharmaceutical Aids, Miscellaneous Last Part of Chapter no 5th.pdf
 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
 
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...

Pragmatic Optimization in Modern Programming - Ordering Optimization Approaches

  • 1. PRAGMATIC OPTIMIZATION IN MODERN PROGRAMMING: ORDERING OPTIMIZATION APPROACHES. Created by Marina (geek) Kolpakova for UNN / 2015-2016
  • 2. COURSE TOPICS: Ordering optimization approaches; Demystifying a compiler; Mastering compiler optimizations; Modern computer architectures concepts.
  • 3. OUTLINE: What is optimization?; Pragmatic approach; Optimization trade-offs; Knowledge which is required; Where to get the performance?; Optimization cycle; Top-Down (High-low) approach; Optimization cycle (revised); Optimization steps overview; How to learn optimization?; Recommended literature; Summary.
  • 4. WHAT IS OPTIMIZATION? In computing, optimization is a process of modifying a system to make some aspect of it work more efficiently or use fewer resources; in particular, a process of transforming a piece of code to make it more efficient without changing its output.
  • 5. PRAGMATIC APPROACH. “Programmers waste enormous amounts of time thinking about, or worrying about, the speed of non-critical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small inefficiencies, say about 97% of the time; premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” (Donald Knuth, Structured Programming with go to Statements) 1. Find what to start from (3%). 2. Know when to stop (97%).
  • 6. OPTIMIZATION TRADE-OFFS: Code portability decreases when we go deeper; performance portability decreases when we go deeper; the cost of maintenance and extensibility increases when we go deeper; optimizations are often not reusable; optimizations become obsolete very quickly. ...but performance is still a crucial requirement for most applications.
  • 7. KNOWLEDGE WHICH IS REQUIRED: 1. The code (the problem it solves, the algorithm it implements, the algorithmic complexity). 2. The compiler (compilation trajectory, compiler's capabilities and obstacles). 3. The platform (architecture capabilities, Instruction Set Architecture, micro-architecture specifics).
  • 8. WHERE TO GET THE PERFORMANCE? High-level: programmer. Middle-level: compiler. Low-level: hardware.
  • 10. TOP-DOWN (HIGH-LOW) APPROACH: 1. Understand the code; 2. Use appropriate algorithms; 3. Optimize memory access patterns; 4. Minimize number of operations; 5. Shrink the critical path; 6. Perform HW-specific optimizations; 7. Dive into assembly.
  • 12. STEP #1: UNDERSTAND THE CODE. Different people think differently, so you'll need some time to get used to the code. Understand the data flow: input/output parameters, data dependencies. Identify performance limiters: take a time profile and collect metrics, e.g. CPI, power consumption.
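A time profile starts with a measurement. The sketch below is only an illustration of timing one candidate hot spot with the standard `clock()` function; `sum_array` and `time_sum_array` are hypothetical names that do not come from the slides, and a real profiler gives far more detail.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical workload used only for illustration: sums an array. */
long sum_array(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Returns the CPU time, in seconds, spent in `repeats` calls of sum_array.
 * Each result is accumulated into *sink so that the calls produce a
 * value the caller can observe. */
double time_sum_array(const int *data, size_t n, int repeats, long *sink)
{
    assert(data != NULL && sink != NULL && repeats > 0);
    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        *sink += sum_array(data, n);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Repeating the call many times and comparing runs before and after a change is one simple way to check that an optimization actually paid off.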
  • 13. STEP #2: USE APPROPRIATE ALGORITHM. Consider and lower big-O complexity; choose data structures wisely; look for optimized libraries; find opportunities to scalarize & parallelize.
  • 14. STEP #2: USE APPROPRIATE ALGORITHM. Compilers are not aware of the semantics of the code, so focus on the algorithmic aspect first: decrease big-O complexity; use optimized libraries for subroutines; restructure the code to use fewer resources; split the problem into subtasks and organize them wisely; parallelize. What if you need to sort 100 Mb of numerical data... WHAT SORTING ALGORITHM WOULD YOU CHOOSE?
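One possible answer to the sorting question depends on what is known about the data. As a sketch, and assuming (this is not stated on the slide) that the keys are 8-bit unsigned integers, a counting sort runs in linear time, which no comparison-based sort can match. The function name `counting_sort_u8` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sorts `n` bytes in place in O(n + 256) time by counting occurrences.
 * Assumption for this sketch: keys are 8-bit unsigned integers. */
void counting_sort_u8(uint8_t *data, size_t n)
{
    size_t histogram[256] = {0};
    assert(data != NULL || n == 0);

    /* Pass 1: count how often each value occurs. */
    for (size_t i = 0; i < n; i++)
        histogram[data[i]]++;

    /* Pass 2: rewrite the array in ascending order. */
    size_t out = 0;
    for (int value = 0; value < 256; value++)
        for (size_t k = 0; k < histogram[value]; k++)
            data[out++] = (uint8_t)value;
}
```

For wider keys the same idea generalizes to radix sort; for arbitrary comparable records a general-purpose library sort is usually the starting point.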
  • 15. STEP #3: OPTIMIZE MEMORY ACCESSES. You'll be surprised how many algorithms are memory bound! Optimization for memory usually involves: data restructuring, to load only the data that is really needed for computations; data packaging, to shrink the data in size; loop transformations, to walk through the data in a more efficient way, to increase temporal & spatial locality, and to perform cache-aware optimization.
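Data restructuring can be illustrated with the classic array-of-structures versus structure-of-arrays layout. This is a minimal sketch; the `particle` and `particles` types and both function names are hypothetical and chosen only for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Array of structures: summing `mass` also drags x, y and z through
 * the cache, although they are never used. */
struct particle { float x, y, z, mass; };

float total_mass_aos(const struct particle *p, size_t n)
{
    float total = 0.0f;
    assert(p != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += p[i].mass;
    return total;
}

/* Structure of arrays: the masses are contiguous, so the loop loads
 * only the bytes it needs. */
struct particles { const float *x, *y, *z, *mass; size_t count; };

float total_mass_soa(const struct particles *ps)
{
    float total = 0.0f;
    assert(ps != NULL && (ps->mass != NULL || ps->count == 0));
    for (size_t i = 0; i < ps->count; i++)
        total += ps->mass[i];
    return total;
}
```

Both functions compute the same sum; the difference is how much unused data each cache line carries.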
  • 16. STEP #3: OPTIMIZE MEMORY ACCESSES. Compilers are quite good at local optimizations, such as loop body transformations, local function inlining, and arithmetic expression simplification, so help the compiler rather than try to outfox it. Work cohesively with it on enabling auto-vectorization, optimizing critical loops, and vectorizing.
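One common way to help a compiler with auto-vectorization is to tell it that pointers do not alias. The sketch below uses the C99 `restrict` qualifier; the `saxpy` function is an illustrative example and not code from the slides, and whether a given compiler vectorizes it still depends on flags and target.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] += a * x[i] for i in [0, n).
 * `restrict` promises that x and y never overlap, which removes a
 * dependency the compiler would otherwise have to assume. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert((x != NULL && y != NULL) || n == 0);
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The loop body is unchanged from the naive version; only the promise about aliasing differs, which is the kind of cooperation with the compiler the slide recommends.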
  • 17. STEP #3: OPTIMIZE MEMORY ACCESSES
        for (int j = 0; j < height; j++)
            for (int i = 0; i < width; i++)
                if (img[j * width + i] > 0) count++;

        for (int i = 0; i < width; i++)
            for (int j = 0; j < height; j++)
                if (img[j * width + i] > 0) count++;
    WHICH IS BETTER FOR A CONVENTIONAL CPU?
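The two loop orders from the slide can be wrapped into functions and compared. In this sketch (function names are hypothetical) both return the same count, but on a conventional cached CPU the row-major walk touches consecutive addresses, while the column-major walk jumps `width` elements on every inner iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major walk: the inner loop touches consecutive addresses. */
int count_positive_rows(const int *img, int width, int height)
{
    int count = 0;
    assert(img != NULL && width > 0 && height > 0);
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
            if (img[(size_t)j * width + i] > 0)
                count++;
    return count;
}

/* Column-major walk: the inner loop strides by `width` elements,
 * so consecutive accesses usually land in different cache lines. */
int count_positive_cols(const int *img, int width, int height)
{
    int count = 0;
    assert(img != NULL && width > 0 && height > 0);
    for (int i = 0; i < width; i++)
        for (int j = 0; j < height; j++)
            if (img[(size_t)j * width + i] > 0)
                count++;
    return count;
}
```

Timing both on an image much larger than the last-level cache is a simple way to see the effect of spatial locality.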
  • 19. STEP #4: MINIMIZE NUMBER OF OPERATIONS. Reducing the number of operations in a program doesn't necessarily decrease its runtime, but it's a good heuristic. The compiler usually helps a lot here. Machine-independent optimizations: common sub-expression elimination, constant propagation, redundancy elimination, ... Machine-dependent optimizations: register allocation, instruction selection, instruction scheduling, peephole optimization, ...
  • 20. STEP #4: MINIMIZE NUMBER OF OPERATIONS
        float pows(float a, float b, float c, float d, float e, float f, float x)
        {
            return a * powf(x, 5.f) + b * powf(x, 4.f) + c * powf(x, 3.f)
                 + d * powf(x, 2.f) + e * x + f;
        }

        gcc -march=armv7-a -mfpu=neon-vfpv4 -mthumb -mfloat-abi=softfp -O3 1.c -S -o 1.s
  • 21. STEP #4: MINIMIZE NUMBER OF OPERATIONS
        pows:
            push {r3, lr}
            flds s17, [sp, #56]
            fmsr s24, r1
            movs r1, #0
            fmsr s22, r0
            movt r1, 16544
            fmrs r0, s17
            fmsr s21, r2
            fmsr s20, r3
            flds s19, [sp, #48]
            flds s18, [sp, #52]
            bl powf(PLT)
            mov r1, #1082130432
            fmsr s23, r0
            fmrs r0, s17
            bl powf(PLT)
            movs r1, #0
            movt r1, 16448
            fmsr s16, r0
            fmrs r0, s17
            bl powf(PLT)
            fmuls s16, s16, s24
            vfma.f32 s16, s23, s22
            fmsr s15, r0
            vfma.f32 s16, s15, s21
            fmuls s15, s17, s17
            vfma.f32 s16, s20, s15
            vfma.f32 s16, s19, s17
            fadds s15, s16, s18
            fldmfdd sp!, {d8-d12}
            fmrs r0, s15
            pop {r3, pc}
    ... Let's apply Horner's rule.
  • 22. STEP #4: MINIMIZE NUMBER OF OPERATIONS
        float horner(float a, float b, float c, float d, float e, float f, float x)
        {
            return ((((a * x + b) * x + c) * x + d) * x + e) * x + f;
        }

        horner:
            flds s15, [sp, #8]
            fmsr s11, r0
            fmsr s12, r1
            flds s14, [sp]
            vfma.f32 s12, s11, s15
            fmsr s11, r2
            flds s13, [sp, #4]
            vfma.f32 s11, s12, s15
            fcpys s12, s11
            fmsr s11, r3
            vfma.f32 s11, s12, s15
            vfma.f32 s14, s11, s15
            vfma.f32 s13, s14, s15
            fmrs r0, s13
            bx lr
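Horner's rule generalizes to a polynomial of any degree. The sketch below is an illustration with hypothetical function names: `poly_naive` rebuilds every power of `x` from scratch, while `poly_horner` needs only one multiplication and one addition per coefficient. Coefficients are assumed to be ordered from the highest power to the constant term.

```c
#include <assert.h>
#include <stddef.h>

/* Evaluates c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1] term by term.
 * Each power is recomputed, so the number of multiplications grows
 * quadratically with n. */
float poly_naive(const float *c, size_t n, float x)
{
    float sum = 0.0f;
    assert(c != NULL && n > 0);
    for (size_t k = 0; k < n; k++) {
        float power = 1.0f;
        for (size_t p = k + 1; p < n; p++)   /* builds x^(n-1-k) */
            power *= x;
        sum += c[k] * power;
    }
    return sum;
}

/* Horner's rule: (...((c[0]*x + c[1])*x + c[2])*x + ...) + c[n-1],
 * using n-1 multiplications and n-1 additions. */
float poly_horner(const float *c, size_t n, float x)
{
    float acc;
    assert(c != NULL && n > 0);
    acc = c[0];
    for (size_t k = 1; k < n; k++)
        acc = acc * x + c[k];
    return acc;
}
```

With floating-point inputs the two forms may differ in the last bits because the operations are rounded in a different order, which is one reason a compiler will not make this transformation on its own without relaxed floating-point flags.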
  • 23. STEP #4: MINIMIZE NUMBER OF OPERATIONS. Unfortunately, a compiler sometimes fails at some optimization steps (e.g. register allocation, scalarization) and harms performance by introducing redundant operations. Starting from this optimization step, it is worth looking at the assembly code to check whether the compiler is actually performing a particular optimization.
  • 24. STEP #5: SHRINK THE CRITICAL PATH. The critical path is the longest sequence of operations in a code block that must be completed in order, which is usually caused by dependencies between steps or operations. The critical path of a code block is hard to deduce from high-level code and requires assembly inspection. Knowledge about architecture capabilities is required to estimate the critical path more precisely. Some profilers are able to do critical path analysis. The term can also refer to the longest sequence of dependent steps in a pipeline that limits its parallelization. A control-flow diagram is used to find the critical path.
  • 25. STEP #5: SHRINK THE CRITICAL PATH. Let's look at the critical path of the following code block.
        const uint8_t* p0 = src.ptr(row0);
        const uint8_t* p1 = src.ptr(row1);
        uint8_t* dptr = dst.ptr(row);
        for (int col = 0; col < cols; ++col)
        {
            dptr[col] = (p0[col*2] + p0[col*2+1] + p1[col*2] + p1[col*2+1] + 2) >> 2;
        }
    WHAT IS THE CRITICAL PATH OF THIS CODE LINE?
  • 26. STEP #5: SHRINK THE CRITICAL PATH. Let's create a three-address representation of the code block:
        r0  = col*2        // 1
        r1  = r0+1         // 2
        r2  = load(p0, r0) // 3
        r3  = load(p0, r1) // 4
        r4  = load(p1, r0) // 5
        r5  = load(p1, r1) // 6
        r6  = r2+r3        // 7
        r7  = r6+r4        // 8
        r8  = r7+r5        // 9
        r9  = r8+2         // 10
        r10 = shr(r9, 2)   // 11
    Critical path: 11? Let's construct the dependency graph...
  • 27. STEP #5: SHRINK THE CRITICAL PATH. Critical path: 8?
  • 28. STEP #5: SHRINK THE CRITICAL PATH. But the compiler reorders instructions, since integer math is associative. Critical path: 6?
  • 29. STEP #5: SHRINK THE CRITICAL PATH. And let's assume that the hardware schedules 1 arithmetic and 1 memory operation per clock. AND BACK TO 8 AGAIN.
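The effect of reassociation on the dependency chain can be sketched on the 2x2 averaging expression from the slides. The function names below are hypothetical. In the chain form each addition waits for the previous one (four dependent additions before the shift); in the tree form `a + b` and `c + d` are independent, so the longest chain is three additions before the shift.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Chain form: every addition depends on the previous result. */
uint8_t average4_chain(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    int sum = a + b;     /* level 1 */
    sum = sum + c;       /* level 2 */
    sum = sum + d;       /* level 3 */
    sum = sum + 2;       /* level 4 */
    return (uint8_t)(sum >> 2);
}

/* Tree form: the first two additions have no mutual dependency. */
uint8_t average4_tree(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    int left = a + b;         /* level 1 */
    int right = c + d;        /* level 1, independent of `left` */
    int sum = left + right;   /* level 2 */
    sum = sum + 2;            /* level 3 */
    return (uint8_t)(sum >> 2);
}

/* Halves a pair of rows horizontally and vertically, as in the slide's
 * loop, using the tree form for each output pixel. */
void downsample_row(const uint8_t *p0, const uint8_t *p1, uint8_t *dst, int cols)
{
    assert(p0 != NULL && p1 != NULL && dst != NULL);
    for (int col = 0; col < cols; ++col)
        dst[col] = average4_tree(p0[col * 2], p0[col * 2 + 1],
                                 p1[col * 2], p1[col * 2 + 1]);
}
```

Because integer addition is associative, an optimizing compiler may perform this rewrite itself; whether the shorter chain turns into fewer cycles depends on how many operations the hardware can issue per clock, as the slides point out.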
  • 30. STEP #6: DO HW-SPECIFIC OPTIMIZATION. It requires a comprehensive understanding of the target HW, which usually goes beyond the compiler's abilities: using special hardware capabilities; overcoming micro-architecture weaknesses; using instructions which are specific to concrete HW; balancing the usage of different instruction types. A classical example here is the question of recomputing a temporary value vs. loading it from memory.
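The recompute-versus-load question can be made concrete with a bit count. This is a sketch with hypothetical names: one variant recomputes the result with arithmetic every time, the other loads a precomputed value from a 256-entry table. Which one is faster depends on the target's arithmetic throughput and on whether the table stays in cache.

```c
#include <assert.h>
#include <stdint.h>

/* Lookup table; filled by init_bit_count_table() before first use. */
uint8_t bit_count_table[256];

/* Recompute: clears the lowest set bit until none remain. */
int bit_count_computed(uint8_t v)
{
    int count = 0;
    while (v) {
        v &= (uint8_t)(v - 1);
        count++;
    }
    return count;
}

/* Fills the table once, reusing the computed variant. */
void init_bit_count_table(void)
{
    for (int v = 0; v < 256; v++)
        bit_count_table[v] = (uint8_t)bit_count_computed((uint8_t)v);
    assert(bit_count_table[255] == 8);   /* sanity check on the last entry */
}

/* Load: a single memory access replaces the loop. */
int bit_count_loaded(uint8_t v)
{
    return bit_count_table[v];
}
```

Measuring both variants on the actual target is the only reliable way to choose between them.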
  • 31. STEP #6: DO HW-SPECIFIC OPTIMIZATION. Modern hardware is quite advanced: deep pipelines, out-of-order execution, sophisticated branch prediction, multi-level memory hierarchies, processor specialization. So utilize the unique properties of the hardware. Peephole optimization is not as important as it used to be 10 years ago.
  • 32. STEP #7: DIVE INTO ASSEMBLY. Assembler is a must-have to check the compiler, but it is rarely used to write low-level code. Raw assembly makes sense to: overcome compiler bugs & optimization limitations (addition of redundant instructions, suboptimal register allocation); use specific hardware features which are not expressed in higher-level ISA. Keep in mind that: assembly writing is the least portable optimization; in-line assembly limits compiler optimizations.
  • 33. HOW TO LEARN OPTIMIZATION? Optimization is a craft rather than a science. Practice more: do not make practical knowledge too theoretical. Look at what other people do: find real use cases of different optimization approaches and techniques. Dig into an architecture: HW evolves rapidly, hence devices become obsolete in a wink. Comprehensive knowledge helps you see ahead.
  • 34. RECOMMENDED LITERATURE: Computer Architecture, Fifth Edition: A Quantitative Approach, by John L. Hennessy and David A. Patterson.
  • 35. RECOMMENDED LITERATURE: The Mature Optimization Handbook, by Carlos Bueno.
  • 36. RECOMMENDED LITERATURE: Is Parallel Programming Hard, And, If So, What Can You Do About It?, by Paul E. McKenney.
  • 37. RECOMMENDED LITERATURE: Engineering a Compiler, by Keith Cooper and Linda Torczon.
  • 38. SUMMARY. Practice, look at what others do, and dig into an architecture. The main task of an optimizer is finding the critical part. An optimizer's mastery is knowing where to stop. Knowledge about the code, the compiler and the platform is a must-have. Optimization is a measure-analyze-optimize-check cycle. Stick to the high-to-low approach: get the performance from algorithmic and data structure choices first, ensure good memory access patterns next, then go deeper.