Caching In: Understand, Measure and Use your CPU Cache more effectively
Richard Warburton

Published in: Technology

  1. Caching In: Understand, Measure and Use your CPU Cache more effectively. Richard Warburton
  2. What are you talking about?
     ● Why do you care?
     ● Measurement
     ● Principles and Examples
  3. Why do you care? Perhaps why /should/ you care.
  4. The Problem (diagram labels: "Very Fast", "Relatively Slow")
  5. The Solution: CPU Cache
     ● Core demands data, looks at its cache
       ○ If present (a "hit") then the data is returned to a register
       ○ If absent (a "miss") then the data is looked up from memory and stored in the cache
     ● Fast memory is expensive, a small amount is affordable
  6. Multilevel Cache: Intel Sandybridge
     (diagram) Physical Core 0 .... Physical Core N, each with:
       ○ HT: 2 Logical Cores
       ○ Level 1 Data Cache and Level 1 Instruction Cache
       ○ Level 2 Cache
     Shared Level 3 Cache
  7. CPU Stalls
     ● Want your CPU to execute instructions
     ● Stall = CPU can't execute because it's waiting on memory
     ● Stalls displace potential computation
     ● Busy-work Strategies
       ○ Out-of-Order Execution
       ○ Simultaneous MultiThreading (SMT) / HyperThreading (HT)
  8. How bad is a miss?
     Location        Clock-cycle Cost
     Register        0
     L1 Cache        3
     L2 Cache        9
     L3 Cache        21
     Main Memory     400+
     NB: Sandybridge numbers
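
To make the table concrete, here is a rough sketch of the expected cost per access for a given mix of hit rates. The cycle costs are the Sandybridge numbers from the slide (with 400 taken for main memory); the hit fractions are made-up illustration values, not measurements, and the class name is hypothetical.

```java
/** Sketch: expected cycles per memory access from per-level hit fractions. */
final class AccessCost {
    // Cycle costs from the slide (Sandybridge); main memory taken as 400.
    static final int L1 = 3, L2 = 9, L3 = 21, MEMORY = 400;

    /**
     * Each argument is the fraction of all accesses served by that level.
     * Whatever is left over is assumed to go to main memory.
     */
    static double expectedCycles(double l1, double l2, double l3) {
        double memory = 1.0 - l1 - l2 - l3;
        return l1 * L1 + l2 * L2 + l3 * L3 + memory * MEMORY;
    }

    public static void main(String[] args) {
        // Hypothetical mixes, chosen only to show how quickly misses dominate.
        System.out.printf("1%% to memory:  %.2f cycles%n", expectedCycles(0.95, 0.03, 0.01));
        System.out.printf("10%% to memory: %.2f cycles%n", expectedCycles(0.80, 0.05, 0.05));
    }
}
```

Even when only 1% of accesses reach main memory, that 1% contributes more cycles than all the cache hits together.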
  9. Cache Lines
     ● Data transferred in cache lines
     ● Fixed size block of memory
     ● Usually 64 bytes in current x86 CPUs
     ● Purely hardware consideration
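
A toy model can show why line granularity matters. The sketch below simulates a direct-mapped cache with 64-byte lines and counts misses for a sequential and a strided walk over 4-byte ints. It is an illustration only: real caches are set-associative and have prefetchers, and the 512-slot size and the class name are assumptions.

```java
import java.util.Arrays;

/** Toy direct-mapped cache: counts line-granularity hits and misses. */
final class ToyCache {
    static final int LINE_BYTES = 64;   // line size from the slide
    private final long[] tags;          // memory line held by each slot, -1 = empty
    long hits, misses;

    ToyCache(int slots) {
        tags = new long[slots];
        Arrays.fill(tags, -1L);
    }

    /** Simulates one access to the given byte address. */
    void access(long address) {
        long line = address / LINE_BYTES;        // memory line containing the address
        int slot = (int) (line % tags.length);   // direct-mapped: one possible slot per line
        if (tags[slot] == line) {
            hits++;
        } else {
            misses++;
            tags[slot] = line;                   // fill the slot, evicting the old line
        }
    }

    /** Touches `count` 4-byte ints that are `strideInts` elements apart, from address 0. */
    static ToyCache run(int count, int strideInts) {
        ToyCache cache = new ToyCache(512);      // 512 slots * 64 B = 32 KB (assumed size)
        for (int i = 0; i < count; i++) {
            cache.access((long) i * strideInts * 4);
        }
        return cache;
    }

    public static void main(String[] args) {
        ToyCache sequential = run(1024, 1);   // 16 ints share each line
        ToyCache strided = run(1024, 16);     // every access lands on a new line
        System.out.println("sequential misses: " + sequential.misses);
        System.out.println("strided misses:    " + strided.misses);
    }
}
```

In this model the sequential walk misses once per 16 ints, while the 16-int stride misses on every access.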
  10. You don't need to know everything
  11. Architectural Takeaways
     ● Modern CPUs have large multilevel caches
     ● Cache misses cause stalls
     ● Stalls are expensive
  12. Measurement: Barometer, Sun Dial ... and CPU Counters
  13. Do you care?
     ● DON'T look at CPU caching behaviour first!
     ● Not all performance problems are CPU bound
       ○ Garbage Collection
       ○ Networking
       ○ Database or External Service
       ○ I/O
     ● Consider caching behaviour when you're execution bound and know your hotspot.
  14. CPU Performance Counters
     ● CPU registers which can count events
       ○ Eg: the number of Level 3 cache misses
     ● Model Specific Registers
       ○ not standardised by the x86 instruction set
       ○ differ by CPU (eg Penryn vs Sandybridge)
     ● Don't worry - leave details to tooling
  15. Measurement: Instructions Retired
     ● The number of instructions which executed until completion
       ○ ignores branch mispredictions
     ● When stalled you're not retiring instructions
     ● Aim to maximise instruction retirement when reducing cache misses
  16. Cache Misses
     ● Cache Level
       ○ Level 3
       ○ Level 2
       ○ Level 1 (Data vs Instruction)
     ● What to measure
       ○ Hits
       ○ Misses
       ○ Reads vs Writes
       ○ Calculate Ratio
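
The "calculate ratio" step is plain arithmetic on the counter readings. A minimal sketch, assuming the hit and miss counts have already been collected by some tool; the values in `main` are hypothetical, not measurements.

```java
/** Sketch: turning raw hit and miss counts into a miss ratio. */
final class CacheRatios {
    /** Fraction of accesses that missed; 0 when nothing was counted. */
    static double missRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) misses / total;
    }

    public static void main(String[] args) {
        // Hypothetical counter readings for two cache levels.
        long l1Hits = 9_500_000, l1Misses = 500_000;
        long l3Hits = 400_000, l3Misses = 100_000;
        System.out.printf("L1 miss ratio: %.3f%n", missRatio(l1Hits, l1Misses));
        System.out.printf("L3 miss ratio: %.3f%n", missRatio(l3Hits, l3Misses));
    }
}
```

Comparing the ratio before and after a change is usually more informative than comparing raw miss counts, since the total number of accesses may also change.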
  17. Cache Profilers
     ● Open Source
       ○ perf stat
       ○ linux (rdmsr/wrmsr)
       ○ Linux Kernel (via /proc)
     ● Proprietary
       ○ jClarity jMSR
       ○ Intel VTune
       ○ AMD CodeAnalyst
       ○ Visual Studio 2012
  18. Good Benchmarking Practice
     ● Warmups
     ● Measure and rule out other factors
     ● Specific Caching Ideas ...
  19. Take Care with Working Set Size
  20. Good Benchmark = Low Variance
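
The warmup and low-variance advice can be sketched as a tiny hand-rolled harness: run the task a few times untimed, then time several runs and report the mean and standard deviation. All names here are hypothetical and the workload is only a placeholder; a dedicated harness such as JMH handles these concerns far more rigorously.

```java
import java.util.function.LongSupplier;

/** Sketch of a benchmark loop with warmups and a variance check. */
final class MiniBench {
    /** Runs `task` untimed `warmups` times, then returns the timings (ns) of `runs` runs. */
    static long[] measure(LongSupplier task, int warmups, int runs) {
        long sink = 0;
        for (int i = 0; i < warmups; i++) {
            sink += task.getAsLong();            // let the JIT compile the hot path
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink += task.getAsLong();
            nanos[i] = System.nanoTime() - start;
        }
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink);            // keeps the results observably used
        }
        return nanos;
    }

    static double mean(long[] xs) {
        double sum = 0;
        for (long x : xs) sum += x;
        return sum / xs.length;
    }

    /** Population standard deviation. */
    static double stdDev(long[] xs) {
        double m = mean(xs), squares = 0;
        for (long x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / xs.length);
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 20];           // placeholder working set (4 MB)
        long[] nanos = measure(() -> {
            long total = 0;
            for (int value : data) total += value;
            return total;
        }, 20, 20);
        System.out.printf("mean %.0f ns, std dev %.0f ns%n", mean(nanos), stdDev(nanos));
    }
}
```

A standard deviation that is large relative to the mean suggests something other than the code under test (GC, other processes, frequency scaling) is influencing the numbers.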
  21. You can measure cache behaviour
  22. Principles & Examples
  23. Prefetching
     ● Prefetch = Eagerly load data
     ● Adjacent Cache Line Prefetch
     ● Data Cache Unit (Streaming) Prefetch
     ● Problem: CPU prediction isn't perfect
     ● Solution: Arrange data so accesses are predictable
  24. Types of locality
     ● Temporal Locality: repeatedly referring to the same data in a short time span
     ● Spatial Locality: referring to data that is close together in memory
     ● Sequential Locality: referring to data that is arranged linearly in memory
  25. General Principles
     ● Use smaller data types (-XX:+UseCompressedOops)
     ● Avoid big holes in your data
     ● Make accesses as linear as possible
  26. Primitive Arrays

         // Sequential Access = Predictable
         for (int i = 0; i < someArray.length; i++)
             someArray[i]++;
  27. Primitive Arrays - Skipping Elements

         // Holes Hurt
         for (int i = 0; i < someArray.length; i += SKIP)
             someArray[i]++;
  28. Primitive Arrays - Skipping Elements
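
One way to try the skipping loop yourself is to time it for several skip values and divide by the number of elements actually touched. This is only a sketch: the array size and skip values are assumptions, the single warm-up pass is minimal, and the results depend heavily on the machine.

```java
/** Sketch: cost per touched element as the gap between accesses grows. */
final class SkipDemo {
    /** Increments every `skip`-th element; returns how many elements were touched. */
    static int touch(int[] array, int skip) {
        int touched = 0;
        for (int i = 0; i < array.length; i += skip) {
            array[i]++;
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        int[] array = new int[1 << 24];   // 64 MB of ints, assumed larger than the L3 cache
        for (int skip : new int[] {1, 2, 4, 8, 16, 32}) {
            touch(array, skip);           // warm-up pass
            long start = System.nanoTime();
            int touched = touch(array, skip);
            long nanos = System.nanoTime() - start;
            System.out.printf("skip=%2d  %.2f ns per element touched%n",
                    skip, (double) nanos / touched);
        }
    }
}
```

With 4-byte ints and 64-byte lines, a skip of 16 or more means each touched element sits on its own cache line, so less of each fetched line is actually used.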
  29. Multidimensional Arrays
     ● Multidimensional arrays are really arrays of arrays in Java (unlike C)
     ● Some people realign their accesses:

         for (int col = 0; col < COLS; col++) {
             for (int row = 0; row < ROWS; row++) {
                 array[ROWS * col + row]++;
             }
         }
  30. Bad Alignment
     Strides the wrong way, bad locality:

         array[COLS * row + col]++;

     Strides the right way, good locality:

         array[ROWS * col + row]++;
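
The two index expressions can be compared directly by keeping the loop nest from the previous slide fixed and swapping only the index calculation. A minimal sketch, with hypothetical array dimensions and a single untimed warm-up run per variant:

```java
/** Sketch: same loop nest, two index formulas, different memory strides. */
final class StrideDemo {
    static final int ROWS = 2048, COLS = 2048;   // assumed sizes (16 MB of ints)

    /** Inner loop advances one element at a time. */
    static void goodStride(int[] array) {
        for (int col = 0; col < COLS; col++) {
            for (int row = 0; row < ROWS; row++) {
                array[ROWS * col + row]++;
            }
        }
    }

    /** Inner loop jumps COLS elements on every step. */
    static void badStride(int[] array) {
        for (int col = 0; col < COLS; col++) {
            for (int row = 0; row < ROWS; row++) {
                array[COLS * row + col]++;
            }
        }
    }

    public static void main(String[] args) {
        int[] array = new int[ROWS * COLS];
        goodStride(array);                        // warm-up
        badStride(array);                         // warm-up
        long start = System.nanoTime();
        goodStride(array);
        long good = System.nanoTime() - start;
        start = System.nanoTime();
        badStride(array);
        long bad = System.nanoTime() - start;
        System.out.printf("good stride: %d ms, bad stride: %d ms%n",
                good / 1_000_000, bad / 1_000_000);
    }
}
```

Both methods touch every element exactly once, so any timing difference comes from the access order rather than the amount of work.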
  31. Data Locality vs Java Heap Layout
     (diagram: heap slots numbered 0 to 3, labelled count, bar, baz)

         class Foo {
             Integer count;
             Bar bar;
             Baz baz;
         }

         // No alignment guarantees
         for (Foo foo : foos) {
             foo.count = 5;
             ...
             foo.bar.visit();
         }
  32. Data Locality vs Java Heap Layout
     ● Serious Java weakness
     ● Location of objects in memory is hard to guarantee
     ● Garbage Collection Impact
       ○ Mark-Sweep
       ○ Copying Collector
  33. General Principles
     ● Primitive Collections (GNU Trove, etc.)
     ● Avoid code bloat (Loop Unrolling)
     ● Reconsider Data Structures
       ○ eg: Judy Arrays, kD-Trees, Z-Order Curve
     ● Care with Context Switching
  34. Game Event Server

         class PlayerAttack {
             private long playerId;
             private int weapon;
             private int energy;
             ...
             public int getWeapon() {
                 return weapon;
             }
             ...
         }
  35. Flyweight Unsafe (1)

         static final Unsafe unsafe = ... ;

         // One off-heap block holding every PlayerAttack record back to back
         long space = OBJ_COUNT * ATTACK_OBJ_SIZE;
         address = unsafe.allocateMemory(space);

         static PlayerAttack get(int index) {
             // Start of the record at the given index
             long loc = address + (ATTACK_OBJ_SIZE * index);
             return new PlayerAttack(loc);
         }
  36. Flyweight Unsafe (2)

         class PlayerAttack {
             static final long WEAPON_OFFSET = 8;
             private long loc;   // start address of this record

             public int getWeapon() {
                 long address = loc + WEAPON_OFFSET;
                 return unsafe.getInt(address);
             }

             public void setWeapon(int weapon) {
                 long address = loc + WEAPON_OFFSET;
                 unsafe.putInt(address, weapon);
             }
         }
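
The same flyweight idea can be tried without `Unsafe`. The sketch below swaps in a direct `ByteBuffer` as the backing store, which is a different technique from the slides but keeps the records packed contiguously. The field layout (playerId at 0, weapon at 8, energy at 12, 16 bytes per record) is an assumption, since the slides elide the remaining fields, and a reusable cursor (`moveTo`) replaces the slide's `new PlayerAttack(loc)` per lookup.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Sketch: flyweight view over fixed-size records packed into one direct buffer. */
final class PlayerAttackFlyweight {
    // Assumed record layout; the slides only show WEAPON_OFFSET = 8.
    static final int PLAYER_ID_OFFSET = 0;
    static final int WEAPON_OFFSET = 8;
    static final int ENERGY_OFFSET = 12;
    static final int ATTACK_OBJ_SIZE = 16;

    private final ByteBuffer buffer;
    private int base;   // byte offset of the record this view currently points at

    PlayerAttackFlyweight(int objCount) {
        buffer = ByteBuffer.allocateDirect(objCount * ATTACK_OBJ_SIZE)
                .order(ByteOrder.nativeOrder());
    }

    /** Repositions this view on the record at `index` and returns it for chaining. */
    PlayerAttackFlyweight moveTo(int index) {
        base = index * ATTACK_OBJ_SIZE;
        return this;
    }

    long getPlayerId() { return buffer.getLong(base + PLAYER_ID_OFFSET); }
    void setPlayerId(long id) { buffer.putLong(base + PLAYER_ID_OFFSET, id); }

    int getWeapon() { return buffer.getInt(base + WEAPON_OFFSET); }
    void setWeapon(int weapon) { buffer.putInt(base + WEAPON_OFFSET, weapon); }

    int getEnergy() { return buffer.getInt(base + ENERGY_OFFSET); }
    void setEnergy(int energy) { buffer.putInt(base + ENERGY_OFFSET, energy); }

    public static void main(String[] args) {
        PlayerAttackFlyweight attacks = new PlayerAttackFlyweight(1024);
        attacks.moveTo(3).setWeapon(7);
        System.out.println(attacks.moveTo(3).getWeapon());
    }
}
```

Because one view object is reused, iterating over all records allocates nothing and walks memory linearly; the trade-off is that the view is mutable and must not be shared between threads.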
  37. False Sharing
     ● Data can share a cache line
     ● Not always good
       ○ Some data is accidentally next to other data
       ○ Causes cache lines to be evicted prematurely
     ● Serious concurrency issue
  38. Concurrent Cache Line Access (figure © Intel)
  39. Current Solution: Field Padding

         public volatile long value;
         public long pad1, pad2, pad3, pad4, pad5, pad6;

     8 bytes (long) * 7 fields + header = 64 bytes = cache line
     NB: fields are aligned to 8 bytes even if smaller
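
False sharing can be observed without relying on field layout, which the JVM is free to rearrange. The sketch below is a variation on the padding idea rather than the slide's technique: two threads each increment their own slot of an `AtomicLongArray`, first with the slots 8 bytes apart and then 128 bytes apart. The iteration count and slot indices are assumptions, and timings will vary by machine.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Sketch: two writers on neighbouring slots versus well-separated slots. */
final class FalseSharingDemo {
    /** Two threads each bump their own slot; returns elapsed nanoseconds. */
    static long run(AtomicLongArray counters, int slotA, int slotB, long iterations) {
        Thread a = new Thread(() -> bump(counters, slotA, iterations));
        Thread b = new Thread(() -> bump(counters, slotB, iterations));
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return System.nanoTime() - start;
    }

    static void bump(AtomicLongArray counters, int slot, long iterations) {
        for (long i = 0; i < iterations; i++) {
            counters.incrementAndGet(slot);
        }
    }

    public static void main(String[] args) {
        long iterations = 20_000_000L;
        AtomicLongArray counters = new AtomicLongArray(32);
        // Slots 0 and 1 are 8 bytes apart, so they very likely share a 64-byte line.
        long adjacent = run(counters, 0, 1, iterations);
        // Slots 0 and 16 are 128 bytes apart, so they cannot share a 64-byte line.
        long spaced = run(counters, 0, 16, iterations);
        System.out.printf("adjacent: %d ms, spaced: %d ms%n",
                adjacent / 1_000_000, spaced / 1_000_000);
    }
}
```

Neither thread ever reads the other's counter, so any slowdown in the adjacent case comes from the cache line moving between cores, not from logical contention.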
  40. Real Solution: JEP 142

         class UncontendedFoo {
             int someValue;
             @Contended volatile long foo;
             Bar bar;
         }

     http://openjdk.java.net/jeps/142
  41. False Sharing in Your GC
     ● Card Tables
       ○ Split RAM into cards
       ○ Table records which parts of RAM are written to
       ○ Avoids rescanning large blocks of RAM
     ● BUT ... optimised by writing to the card's byte on every write (no check first)
     ● -XX:+UseCondCardMark
       ○ Use with care: 15-20% sequential slowdown
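
The difference between the unconditional and the conditional card mark can be modelled in a few lines. This is a simplified illustration, not the JVM's actual barrier code: the 512-byte card size, the DIRTY value and all names are assumptions, and the store counter exists only to make the effect visible.

```java
/** Sketch: card marking with and without a "is it already dirty?" check. */
final class CardTableSketch {
    static final int CARD_SHIFT = 9;   // 512-byte cards (illustrative value)
    static final byte DIRTY = 1;

    private final byte[] cards;
    long tableStores;                  // how many times the table itself was written

    CardTableSketch(long heapBytes) {
        cards = new byte[(int) (heapBytes >>> CARD_SHIFT) + 1];
    }

    /** Unconditional mark: stores to the table on every write. */
    void markAlways(long address) {
        cards[(int) (address >>> CARD_SHIFT)] = DIRTY;
        tableStores++;
    }

    /** Conditional mark: reads first and stores only if the card is still clean. */
    void markIfClean(long address) {
        int card = (int) (address >>> CARD_SHIFT);
        if (cards[card] != DIRTY) {
            cards[card] = DIRTY;
            tableStores++;
        }
    }

    boolean isDirty(long address) {
        return cards[(int) (address >>> CARD_SHIFT)] == DIRTY;
    }

    public static void main(String[] args) {
        CardTableSketch always = new CardTableSketch(1 << 20);
        CardTableSketch conditional = new CardTableSketch(1 << 20);
        for (int i = 0; i < 100; i++) {
            always.markAlways(4096 + i * 4);        // 100 writes inside one card
            conditional.markIfClean(4096 + i * 4);
        }
        System.out.println("unconditional stores: " + always.tableStores);
        System.out.println("conditional stores:   " + conditional.tableStores);
    }
}
```

Fewer stores to the table means fewer invalidations of the cache line holding it when several threads write to nearby memory; the price is an extra load and branch on every write, which is the sequential slowdown the slide warns about.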
  42. Too Long, Didn't Listen
     1. Caching behaviour has a performance effect
     2. The effects are measurable
     3. There are common problems and solutions
  43. Questions?
     @RichardWarburto
     www.akkadia.org/drepper/cpumemory.pdf
     www.jclarity.com/friends/
     g.oswego.edu/dl/concurrency-interest/
  44. A Brave New World (hopefully, sort of)
  45. Arrays 2.0
     ● Proposed JVM feature
     ● Library definition of array types
     ● Incorporate a notion of flatness
     ● Semi-reified generics
     ● Loads of other goodies (final, volatile, resizing, etc.)
     ● Requires Value Types
  46. Value Types
     ● When you assign or pass the object, you copy the value and not the reference
     ● Allow control of the memory layout
     ● Control of layout = Control of caching behaviour
     ● Idea, initial prototypes in mlvm
  47. The future will solve all problems!
