SlideShare a Scribd company logo
MEMORY OPTIMIZATION Christer Ericson Sony Computer Entertainment, Santa Monica (christer_ericson@playstation.sony.com)
Talk contents 1/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Talk contents 2/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Problem statement ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Need more justification? 1/3 SIMD instructions consume data at 2-8 times the rate of normal instructions! Instruction parallelism:
Need more justification? 2/3 Improvements to compiler technology double program performance every  ~18 years ! Proebsting’s law: Corollary: Don’t expect the compiler to do it for you!
Need more justification? 3/3 ,[object Object],[object Object],[object Object],On Moore’s law:
Brief cache review ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The memory hierarchy Main memory L2 cache L1 cache CPU ~1-5 cycles ~5-20 cycles ~40-100 cycles 1 cycle Roughly:
Some cache specs ,[object Object],[object Object],L1 cache (I/D) L2 cache PS2 16K/8K †  2-way N/A GameCube 32K/32K ‡  8-way 256K 2-way unified XBOX 16K/16K 4-way 128K 8-way unified PC ~32-64K ~128-512K
Foes: 3 C’s of cache misses ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Friends: Introducing the 3 R’s ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Compulsory Capacity Conflict Rearrange X (x) X Reduce X X (x) Reuse (x) X
Measuring cache utilization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Code cache optimization 1/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Code cache optimization 2/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Data cache optimization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Prefetching and preloading ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Software prefetching const int kLookAhead = 4;  // Some elements ahead for (int i = 0; i < 4 * n; i += 4) { Prefetch(elem[i + kLookAhead]); Process(elem[i + 0]); Process(elem[i + 1]); Process(elem[i + 2]); Process(elem[i + 3]); } // Loop through and process all 4n elements for (int i = 0; i < 4 * n; i++) Process(elem[i]);
Greedy prefetching void PreorderTraversal(Node *pNode) { // Greedily prefetch left traversal path Prefetch(pNode->left); // Process the current node Process(pNode); // Greedily prefetch right traversal path Prefetch(pNode->right); // Recursively visit left then right subtree PreorderTraversal(pNode->left); PreorderTraversal(pNode->right); }
Preloading (pseudo-prefetch) (NB: This code reads one element beyond the end of the elem array.)  Elem a = elem[0]; for (int i = 0; i < 4 * n; i += 4) { Elem e = elem[i + 4];  // Cache miss, non-blocking Elem b = elem[i + 1];  // Cache hit Elem c = elem[i + 2];  // Cache hit Elem d = elem[i + 3];  // Cache hit Process(a); Process(b); Process(c); Process(d); a = e; }
Structures ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Field reordering ,[object Object],void Foo(S *p, void *key, int k) { while (p) { if (p->key == key) { p->count[k]++; break; } p = p->pNext; } } struct S { void *key; int count[20]; S *pNext; }; struct S { void *key; S *pNext; int count[20]; };
Hot/cold splitting ,[object Object],[object Object],[object Object],[object Object],Hot fields: Cold fields: struct S { void *key; S *pNext; S2 *pCold; }; struct S2 { int count[10]; };
Hot/cold splitting
Beware compiler padding ,[object Object],Decreasing size! struct X { int8 a; int64 b; int8 c; int16 d; int64 e; float f; }; struct Z { int64 b; int64 e; float f; int16 d; int8 a; int8 c; }; struct Y { int8 a, pad_a[7]; int64 b; int8 c, pad_c[1]; int16 d, pad_d[2]; int64 e; float f, pad_f[1]; };
Cache performance analysis ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tree data structures ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Breadth-first order ,[object Object],[object Object]
Depth-first order ,[object Object],[object Object]
van Emde Boas layout ,[object Object],[object Object]
A compact static k-d tree ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Linearization caching ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
 
Memory allocation policy ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The curse of aliasing ,[object Object],Aliasing is also missed opportunities for optimization What value is returned here? Who knows! Aliasing is multiple references to the same storage location int Foo(int *a, int *b) { *a = 1; *b = 2; return *a; } int n; int *p1 = &n; int *p2 = &n;
The curse of aliasing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
How do we do ‘anti-aliasing’? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],†  To be defined
Matrix multiplication 1/3 Consider optimizing a 2x2 matrix multiplication: How do we typically optimize it? Right, unrolling! Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { a[i][j] = 0.0f; for (int k = 0; k < 2; k++) a[i][j] += b[i][k] * c[k][j]; } } }
Matrix multiplication 2/3 ,[object Object],[object Object],[object Object],[object Object],[object Object],Staightforward unrolling results in this: // 16 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ a[0][0] = b[0][0]*c[0][0] + b[0][1]*c[1][0]; a[0][1] = b[0][0]*c[0][1] + b[0][1]*c[1][1];  //(1) a[1][0] = b[1][0]*c[0][0] + b[1][1]*c[1][0];  //(2) a[1][1] = b[1][0]*c[0][1] + b[1][1]*c[1][1];  //(3) }
Matrix multiplication 3/3 A correct approach is instead writing it as: … before producing outputs Consume inputs… // 8 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ float b00 = b[0][0], b01 = b[0][1]; float b10 = b[1][0], b11 = b[1][1]; float c00 = c[0][0], c01 = c[0][1]; float c10 = c[1][0], c11 = c[1][1]; a[0][0] = b00*c00 + b01*c10; a[0][1] = b00*c01 + b01*c11; a[1][0] = b10*c00 + b11*c10; a[1][1] = b10*c01 + b11*c11; }
Abstraction penalty problem ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
C++ abstraction penalty ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
C++ abstraction penalty Pointer members in classes may alias other members: Code likely to refetch  numVals  each iteration! numVals  not a local variable! May be aliased by  pBuf ! class Buf { public: void Clear() { for (int i = 0; i < numVals; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
C++ abstraction penalty We know that aliasing won’t happen, and can manually solve the aliasing issue by writing code as: class Buf { public: void Clear() { for (int i = 0, n = numVals; i < n; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
C++ abstraction penalty Since  pBuf[i]  can only alias  numVals  in the first iteration, a quality compiler can fix this problem by peeling the loop once, turning it into: Q: Does  your  compiler do this optimization?! void Clear() { if (numVals >= 1) { pBuf[0] = 0; for (int i = 1, n = numVals; i < n; i++) pBuf[i] = 0; } }
Type-based alias analysis ,[object Object],[object Object],Use language types to disambiguate memory references!
Type-based alias analysis ,[object Object],[object Object],[object Object],[object Object],[object Object]
Compatibility of C/C++ types ,[object Object],[object Object],[object Object],[object Object],[object Object]
What TBAA can do for you into this: It can turn this: Possible aliasing between v[i]  and  *n No aliasing possible so fetch  *n  once! void Foo(float *v, int *n) { for (int i = 0; i < *n; i++) v[i] += 1.0f; } void Foo(float *v, int *n) { int t = *n; for (int i = 0; i < t; i++) v[i] += 1.0f; }
What TBAA can also do ,[object Object],[object Object],Required by standard Allowed By gcc Illegal C/C++ code! uint32 i; float f; i = *((uint32 *)&f); uint32 i; union { float f; uchar8 c[4]; } u; u.f = f; i = (u.c[3]<<24L)+ (u.c[2]<<16L)+ ...; uint32 i; union { float f; uint32 i; } u; u.f = f; i = u.i;
Restrict-qualified pointers ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Using the restrict keyword Given this code: You really want the compiler to treat it as if written: But because of possible aliasing it cannot! void Foo(float v[], float *c, int n) { for (int i = 0; i < n; i++) v[i] = *c + 1.0f; } void Foo(float v[], float *c, int n) { float tmp = *c + 1.0f; for (int i = 0; i < n; i++) v[i] = tmp; }
Using the restrict keyword giving for the first version: and for the second version: For example, the code might be called as: The compiler must be conservative, and cannot perform the optimization! v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2 float a[10]; a[4] = 0.0f; Foo(a, &a[4], 10);
Solving the aliasing problem The fix? Declaring the output as  restrict : ,[object Object],[object Object],[object Object],[object Object],[object Object],void Foo(float * restrict v, float *c, int n) { for (int i = 0; i < n; i++)  v[i] = *c + 1.0f; }
‘ const’ doesn’t help Some might think this would work: ,[object Object],[object Object],[object Object],[object Object],Since  *c  is const,  v[i]  cannot write to it, right? void Foo(float v[], const float *c, int n) { for (int i = 0; i < n; i++)  v[i] = *c + 1.0f; }
SIMD + restrict = TRUE ,[object Object],Independent loads and stores. Operations can be performed in parallel! Stores may alias loads. Must perform operations sequentially. void VecAdd(int *a, int *b, int *c) { for (int i = 0; i < 4; i++)  a[i] = b[i] + c[i];  } void VecAdd(int * restrict a, int *b, int *c) { for (int i = 0; i < 4; i++)  a[i] = b[i] + c[i];  }
Restrict-qualified pointers ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tips for avoiding aliasing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
That’s it! – Resources 1/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Resources 2/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

More Related Content

What's hot

Precomputed Voxelized-Shadows for Large-scale Scene and Many lights
Precomputed Voxelized-Shadows for Large-scale Scene and Many lightsPrecomputed Voxelized-Shadows for Large-scale Scene and Many lights
Precomputed Voxelized-Shadows for Large-scale Scene and Many lights
Seongdae Kim
 
A whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerA whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizer
Nikita Popov
 
Screen Space Decals in Warhammer 40,000: Space Marine
Screen Space Decals in Warhammer 40,000: Space MarineScreen Space Decals in Warhammer 40,000: Space Marine
Screen Space Decals in Warhammer 40,000: Space Marine
Pope Kim
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
AMD Developer Central
 
Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well
Mark Kilgard
 
Built for performance: the UIElements Renderer – Unite Copenhagen 2019
Built for performance: the UIElements Renderer – Unite Copenhagen 2019Built for performance: the UIElements Renderer – Unite Copenhagen 2019
Built for performance: the UIElements Renderer – Unite Copenhagen 2019
Unity Technologies
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
Electronic Arts / DICE
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code Clinic
Mike Acton
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory Optimization
Redis Labs
 
Beyond porting
Beyond portingBeyond porting
Beyond porting
Cass Everitt
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
Electronic Arts / DICE
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overhead
Cass Everitt
 
Heap sort
Heap sortHeap sort
Heap sort
Mohd Arif
 
Instruction Combine in LLVM
Instruction Combine in LLVMInstruction Combine in LLVM
Instruction Combine in LLVM
Wang Hsiangkai
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
Johan Andersson
 
SEED - Halcyon Architecture
SEED - Halcyon ArchitectureSEED - Halcyon Architecture
SEED - Halcyon Architecture
Electronic Arts / DICE
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
mistercteam
 
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadOpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
Tristan Lorach
 
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Johan Andersson
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
Johan Andersson
 

What's hot (20)

Precomputed Voxelized-Shadows for Large-scale Scene and Many lights
Precomputed Voxelized-Shadows for Large-scale Scene and Many lightsPrecomputed Voxelized-Shadows for Large-scale Scene and Many lights
Precomputed Voxelized-Shadows for Large-scale Scene and Many lights
 
A whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizerA whirlwind tour of the LLVM optimizer
A whirlwind tour of the LLVM optimizer
 
Screen Space Decals in Warhammer 40,000: Space Marine
Screen Space Decals in Warhammer 40,000: Space MarineScreen Space Decals in Warhammer 40,000: Space Marine
Screen Space Decals in Warhammer 40,000: Space Marine
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well Modern OpenGL Usage: Using Vertex Buffer Objects Well
Modern OpenGL Usage: Using Vertex Buffer Objects Well
 
Built for performance: the UIElements Renderer – Unite Copenhagen 2019
Built for performance: the UIElements Renderer – Unite Copenhagen 2019Built for performance: the UIElements Renderer – Unite Copenhagen 2019
Built for performance: the UIElements Renderer – Unite Copenhagen 2019
 
FrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in FrostbiteFrameGraph: Extensible Rendering Architecture in Frostbite
FrameGraph: Extensible Rendering Architecture in Frostbite
 
#GDC15 Code Clinic
#GDC15 Code Clinic#GDC15 Code Clinic
#GDC15 Code Clinic
 
RedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory OptimizationRedisConf18 - Redis Memory Optimization
RedisConf18 - Redis Memory Optimization
 
Beyond porting
Beyond portingBeyond porting
Beyond porting
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
Approaching zero driver overhead
Approaching zero driver overheadApproaching zero driver overhead
Approaching zero driver overhead
 
Heap sort
Heap sortHeap sort
Heap sort
 
Instruction Combine in LLVM
Instruction Combine in LLVMInstruction Combine in LLVM
Instruction Combine in LLVM
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
 
SEED - Halcyon Architecture
SEED - Halcyon ArchitectureSEED - Halcyon Architecture
SEED - Halcyon Architecture
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
 
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver OverheadOpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
OpenGL NVIDIA Command-List: Approaching Zero Driver Overhead
 
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
 
Parallel Futures of a Game Engine
Parallel Futures of a Game EngineParallel Futures of a Game Engine
Parallel Futures of a Game Engine
 

Similar to Memory Optimization

Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
RichardWarburton
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
JAXLondon2014
 
Gdc03 ericson memory_optimization
Gdc03 ericson memory_optimizationGdc03 ericson memory_optimization
Gdc03 ericson memory_optimization
brettlevin
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
RichardWarburton
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
PyData
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
RichardWarburton
 
PPU Optimisation Lesson
PPU Optimisation LessonPPU Optimisation Lesson
PPU Optimisation Lesson
slantsixgames
 
Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)
RichardWarburton
 
04 Cache Memory
04  Cache  Memory04  Cache  Memory
04 Cache Memory
Jeanie Delos Arcos
 
Caching in
Caching inCaching in
Caching in
RichardWarburton
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
Wei-Ta Wang
 
C language
C languageC language
C language
spatidar0
 
DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright
Andrei Alexandrescu
 
C language
C languageC language
C language
VAIRA MUTHU
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
tidwellveronique
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docx
tidwellveronique
 
Coding for multiple cores
Coding for multiple coresCoding for multiple cores
Coding for multiple cores
Lee Hanxue
 
Cache recap
Cache recapCache recap
Cache recap
Tony Nguyen
 
Cache recap
Cache recapCache recap
Cache recap
Luis Goldster
 
Cache recap
Cache recapCache recap
Cache recap
Fraboni Ec
 

Similar to Memory Optimization (20)

Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
Gdc03 ericson memory_optimization
Gdc03 ericson memory_optimizationGdc03 ericson memory_optimization
Gdc03 ericson memory_optimization
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
PPU Optimisation Lesson
PPU Optimisation LessonPPU Optimisation Lesson
PPU Optimisation Lesson
 
Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)
 
04 Cache Memory
04  Cache  Memory04  Cache  Memory
04 Cache Memory
 
Caching in
Caching inCaching in
Caching in
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
C language
C languageC language
C language
 
DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright
 
C language
C languageC language
C language
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docx
 
Coding for multiple cores
Coding for multiple coresCoding for multiple cores
Coding for multiple cores
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 

Recently uploaded

Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
Neo4j
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
Edge AI and Vision Alliance
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
Antonios Katsarakis
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
Jakub Marek
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
ScyllaDB
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
Miro Wengner
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
Pablo Gómez Abajo
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
Alex Pruden
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
DianaGray10
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
Safe Software
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Precisely
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
Edge AI and Vision Alliance
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
Javier Junquera
 

Recently uploaded (20)

Leveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and StandardsLeveraging the Graph for Clinical Trials and Standards
Leveraging the Graph for Clinical Trials and Standards
 
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
“How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-eff...
 
Dandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity serverDandelion Hashtable: beyond billion requests per second on a commodity server
Dandelion Hashtable: beyond billion requests per second on a commodity server
 
Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)Main news related to the CCS TSI 2023 (2023/1695)
Main news related to the CCS TSI 2023 (2023/1695)
 
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyFreshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
 
JavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green MasterplanJavaLand 2024: Application Development Green Masterplan
JavaLand 2024: Application Development Green Masterplan
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Mutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented ChatbotsMutation Testing for Task-Oriented Chatbots
Mutation Testing for Task-Oriented Chatbots
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsConnector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectors
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Essentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation ParametersEssentials of Automations: Exploring Attributes & Automation Parameters
Essentials of Automations: Exploring Attributes & Automation Parameters
 
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframeDigital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
Digital Banking in the Cloud: How Citizens Bank Unlocked Their Mainframe
 
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
“Temporal Event Neural Networks: A More Efficient Alternative to the Transfor...
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 

Memory Optimization

  • 1. MEMORY OPTIMIZATION Christer Ericson Sony Computer Entertainment, Santa Monica (christer_ericson@playstation.sony.com)
  • 2.
  • 3.
  • 4.
  • 5. Need more justification? 1/3 SIMD instructions consume data at 2-8 times the rate of normal instructions! Instruction parallelism:
  • 6. Need more justification? 2/3 Improvements to compiler technology double program performance every ~18 years ! Proebsting’s law: Corollary: Don’t expect the compiler to do it for you!
  • 7.
  • 8.
  • 9. The memory hierarchy Main memory L2 cache L1 cache CPU ~1-5 cycles ~5-20 cycles ~40-100 cycles 1 cycle Roughly:
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. Software prefetching const int kLookAhead = 4; // Some elements ahead for (int i = 0; i < 4 * n; i += 4) { Prefetch(elem[i + kLookAhead]); Process(elem[i + 0]); Process(elem[i + 1]); Process(elem[i + 2]); Process(elem[i + 3]); } // Loop through and process all 4n elements for (int i = 0; i < 4 * n; i++) Process(elem[i]);
  • 19. Greedy prefetching void PreorderTraversal(Node *pNode) { // Greedily prefetch left traversal path Prefetch(pNode->left); // Process the current node Process(pNode); // Greedily prefetch right traversal path Prefetch(pNode->right); // Recursively visit left then right subtree PreorderTraversal(pNode->left); PreorderTraversal(pNode->right); }
  • 20. Preloading (pseudo-prefetch) (NB: This code reads one element beyond the end of the elem array.) Elem a = elem[0]; for (int i = 0; i < 4 * n; i += 4) { Elem e = elem[i + 4]; // Cache miss, non-blocking Elem b = elem[i + 1]; // Cache hit Elem c = elem[i + 2]; // Cache hit Elem d = elem[i + 3]; // Cache hit Process(a); Process(b); Process(c); Process(d); a = e; }
  • 21.
  • 22.
  • 23.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.  
  • 34.
  • 35.
  • 36.
  • 37.
  • 38. Matrix multiplication 1/3 Consider optimizing a 2x2 matrix multiplication: How do we typically optimize it? Right, unrolling! Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { a[i][j] = 0.0f; for (int k = 0; k < 2; k++) a[i][j] += b[i][k] * c[k][j]; } } }
  • 39.
  • 40. Matrix multiplication 3/3 A correct approach is instead writing it as: … before producing outputs Consume inputs… // 8 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ float b00 = b[0][0], b01 = b[0][1]; float b10 = b[1][0], b11 = b[1][1]; float c00 = c[0][0], c01 = c[0][1]; float c10 = c[1][0], c11 = c[1][1]; a[0][0] = b00*c00 + b01*c10; a[0][1] = b00*c01 + b01*c11; a[1][0] = b10*c00 + b11*c10; a[1][1] = b10*c01 + b11*c11; }
  • 41.
  • 42.
  • 43. C++ abstraction penalty Pointer members in classes may alias other members: Code likely to refetch numVals each iteration! numVals not a local variable! May be aliased by pBuf ! class Buf { public: void Clear() { for (int i = 0; i < numVals; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
  • 44. C++ abstraction penalty We know that aliasing won’t happen, and can manually solve the aliasing issue by writing code as: class Buf { public: void Clear() { for (int i = 0, n = numVals; i < n; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
  • 45. C++ abstraction penalty Since pBuf[i] can only alias numVals in the first iteration, a quality compiler can fix this problem by peeling the loop once, turning it into: Q: Does your compiler do this optimization?! void Clear() { if (numVals >= 1) { pBuf[0] = 0; for (int i = 1, n = numVals; i < n; i++) pBuf[i] = 0; } }
  • 46.
  • 47.
  • 48.
  • 49. What TBAA can do for you into this: It can turn this: Possible aliasing between v[i] and *n No aliasing possible so fetch *n once! void Foo(float *v, int *n) { for (int i = 0; i < *n; i++) v[i] += 1.0f; } void Foo(float *v, int *n) { int t = *n; for (int i = 0; i < t; i++) v[i] += 1.0f; }
  • 50.
  • 51.
  • 52. Using the restrict keyword Given this code: You really want the compiler to treat it as if written: But because of possible aliasing it cannot! void Foo(float v[], float *c, int n) { for (int i = 0; i < n; i++) v[i] = *c + 1.0f; } void Foo(float v[], float *c, int n) { float tmp = *c + 1.0f; for (int i = 0; i < n; i++) v[i] = tmp; }
  • 53. Using the restrict keyword giving for the first version: and for the second version: For example, the code might be called as: The compiler must be conservative, and cannot perform the optimization! v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2 float a[10]; a[4] = 0.0f; Foo(a, &a[4], 10);
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.