Caching in (DevoxxUK 2013)

Modern computationally intensive tasks are rarely bottlenecked on the absolute performance of your processor cores; the real bottleneck in 2012 is getting data out of memory. CPU caches are designed to bridge the gap between CPU core clock speed and main memory speed, but developers rarely understand how this interaction works, or how to measure or tune their applications accordingly.

This Talk aims to solve that by:

1. Describing how the CPU caches work in the latest Intel Hardware.
2. Showing people what and how to measure in order to understand the caching behaviour of their software.
3. Giving examples of how this affects Java Program performance and what can be done to address things.


1. Caching In: understand, measure and use your CPU cache more effectively. Richard Warburton
2. What are you talking about?
   ● Why do you care?
   ● Hardware Fundamentals
   ● Measurement
   ● Principles and Examples
3. Why do you care? Perhaps why /should/ you care.
4. The Problem: Very Fast (CPU) vs Relatively Slow (main memory)
5. The Solution: CPU Cache
   ● Core demands data, looks at its cache
     ○ If present (a "hit") then data returned to register
     ○ If absent (a "miss") then data looked up from memory and stored in the cache
   ● Fast memory is expensive, a small amount is affordable
6. Multilevel Cache: Intel Sandy Bridge
   (diagram: each physical core, with 2 logical HT cores, has its own Level 1 Data Cache, Level 1 Instruction Cache and Level 2 Cache; the Level 3 Cache is shared across all cores)
7. CPU Stalls
   ● Want your CPU to execute instructions
   ● Stall = CPU can't execute because it's waiting on memory
   ● Stalls displace potential computation
   ● Busy-work Strategies
     ○ Out-of-Order Execution
     ○ Simultaneous MultiThreading (SMT) / HyperThreading (HT)
8. How bad is a miss?
   Location        Latency in clock cycles
   Register        0
   L1 Cache        3
   L2 Cache        9
   L3 Cache        21
   Main Memory     150-400
   NB: Sandy Bridge numbers
9. Cache Lines
   ● Data transferred in cache lines
   ● Fixed size block of memory
   ● Usually 64 bytes in current x86 CPUs
   ● Purely hardware consideration
10. You don't need to know everything
11. Architectural Takeaways
   ● Modern CPUs have large multilevel caches
   ● Cache misses cause stalls
   ● Stalls are expensive
12. Measurement: Barometer, Sun Dial ... and CPU Counters
13. Do you care?
   ● DON'T look at CPU caching behaviour first!
   ● Not all performance problems are CPU bound
     ○ Garbage Collection
     ○ Networking
     ○ Database or External Service
     ○ I/O
   ● Consider caching behaviour when you're execution bound and know your hotspot.
14. CPU Performance Counters
   ● CPU register which can count events
     ○ Eg: the number of Level 3 cache misses
   ● Model Specific Registers
     ○ not instruction set standardised by x86
     ○ differ by CPU (eg Penryn vs Sandy Bridge)
   ● Don't worry - leave details to tooling
15. Measurement: Instructions Retired
   ● The number of instructions which executed until completion
     ○ ignores branch mispredictions
   ● When stalled you're not retiring instructions
   ● Aim to maximise instruction retirement when reducing cache misses
16. Cache Misses
   ● Cache Level
     ○ Level 3
     ○ Level 2
     ○ Level 1 (Data vs Instruction)
   ● What to measure
     ○ Hits
     ○ Misses
     ○ Reads vs Writes
     ○ Calculate Ratio
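The ratio in the last bullet is just misses over total accesses; a minimal sketch, with made-up counter values standing in for numbers read from a profiler:

    // Hypothetical values read from a cache profiler, not real measurements
    long hits = 950_000_000L;
    long misses = 50_000_000L;

    // Miss ratio = misses / (hits + misses)
    double missRatio = (double) misses / (hits + misses);   // 0.05, i.e. 5%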
17. Cache Profilers
   ● Open Source
     ○ perfstat
     ○ linux (rdmsr/wrmsr)
     ○ Linux Kernel (via /proc)
   ● Proprietary
     ○ jClarity jMSR
     ○ Intel VTune
     ○ AMD Code Analyst
     ○ Visual Studio 2012
18. Good Benchmarking Practice
   ● Warmups
   ● Measure and rule out other factors
   ● Specific Caching Ideas ...
19. Take Care with Working Set Size
20. Good Benchmark = Low Variance
21. You can measure Cache behaviour
22. Principles & Examples: Sequential Code
23. Prefetching
   ● Prefetch = Eagerly load data
   ● Adjacent Cache Line Prefetch
   ● Data Cache Unit (Streaming) Prefetch
   ● Problem: CPU prediction isn't perfect
   ● Solution: Arrange data so accesses are predictable
24. Temporal Locality: repeatedly referring to the same data in a short time span
    Spatial Locality: referring to data that is close together in memory
    Sequential Locality: referring to data that is arranged linearly in memory
25. General Principles
   ● Use smaller data types (-XX:+UseCompressedOops)
   ● Avoid big holes in your data
   ● Make accesses as linear as possible
26. Primitive Arrays

    // Sequential Access = Predictable
    for (int i = 0; i < someArray.length; i++)
        someArray[i]++;

27. Primitive Arrays - Skipping Elements

    // Holes Hurt
    for (int i = 0; i < someArray.length; i += SKIP)
        someArray[i]++;

28. Primitive Arrays - Skipping Elements
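A self-contained sketch of the kind of comparison slides 27-28 are making; the array size and SKIP value are arbitrary assumptions, and a serious benchmark would add warmup and repeated runs:

    public class SkipBenchmark {
        // Assumed working set: 16M ints = 64MB, assumed to be bigger than L3
        static final int SIZE = 16 * 1024 * 1024;
        // 16 ints = 64 bytes, so each increment below touches a fresh cache line
        static final int SKIP = 16;

        public static void main(String[] args) {
            int[] someArray = new int[SIZE];

            long start = System.nanoTime();
            for (int i = 0; i < someArray.length; i++)
                someArray[i]++;                          // sequential: every loaded line fully used
            long sequential = System.nanoTime() - start;

            start = System.nanoTime();
            for (int i = 0; i < someArray.length; i += SKIP)
                someArray[i]++;                          // skipping: one int used per line loaded
            long skipping = System.nanoTime() - start;

            System.out.printf("sequential %d ms, skipping %d ms%n",
                    sequential / 1_000_000, skipping / 1_000_000);
        }
    }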
29. Multidimensional Arrays
   ● Multidimensional Arrays are really Arrays of Arrays in Java (unlike C)
   ● Some people realign their accesses:

    for (int col = 0; col < COLS; col++) {
        for (int row = 0; row < ROWS; row++) {
            array[ROWS * col + row]++;
        }
    }

30. Bad Access Alignment

    // Strides the wrong way, bad locality
    array[COLS * row + col]++;

    // Strides the right way, good locality
    array[ROWS * col + row]++;
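Spelling out the two variants from slide 30 inside the loop nest from slide 29; a minimal sketch with assumed dimensions:

    public class TraversalOrder {
        public static void main(String[] args) {
            final int ROWS = 4096, COLS = 4096;      // assumed sizes for illustration
            int[] array = new int[ROWS * COLS];

            // Bad: the inner loop advances 'row', so this index jumps COLS ints
            // (a whole row of cache lines) on every iteration.
            for (int col = 0; col < COLS; col++)
                for (int row = 0; row < ROWS; row++)
                    array[COLS * row + col]++;

            // Good: the same loop order, but this index moves to the adjacent int,
            // walking memory linearly and using each cache line fully.
            for (int col = 0; col < COLS; col++)
                for (int row = 0; row < ROWS; row++)
                    array[ROWS * col + row]++;
        }
    }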
31. Data Locality vs Java Heap Layout

    class Foo {
        Integer count;
        Bar bar;
        Baz baz;
    }

    // No alignment guarantees
    for (Foo foo : foos) {
        foo.count = 5;
        ...
        foo.bar.visit();
    }

   (diagram: the count, bar and baz objects end up in separate heap locations)

32. Data Locality vs Java Heap Layout
   ● Serious Java weakness
   ● Location of objects in memory hard to guarantee
   ● Garbage Collection impact
     ○ Mark-Sweep
     ○ Copying Collector
33. Data Layout Principles
   ● Primitive Collections (GNU Trove, etc.)
   ● Arrays > Linked Lists
   ● Hashtable > Search Tree
   ● Avoid code bloating (Loop Unrolling)
   ● Custom Data Structures
     ○ Judy Arrays
       ■ an associative array/map
     ○ kD-Trees
       ■ generalised Binary Space Partitioning
     ○ Z-Order Curve
       ■ multidimensional data in one dimension
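A minimal sketch of the first two principles, contrasting a boxed ArrayList with a primitive array (a primitive collection such as Trove's TIntArrayList behaves like the array case); the element count is an arbitrary assumption:

    import java.util.ArrayList;
    import java.util.List;

    public class BoxedVsPrimitive {
        public static void main(String[] args) {
            int n = 1_000_000;

            // Boxed: each element is a separate Integer object on the heap, so the
            // loop chases pointers and drags object headers through the cache.
            List<Integer> boxed = new ArrayList<>(n);
            for (int i = 0; i < n; i++) boxed.add(i);
            long boxedSum = 0;
            for (int value : boxed) boxedSum += value;

            // Primitive: one contiguous block of ints, linear and prefetch-friendly.
            int[] primitive = new int[n];
            for (int i = 0; i < n; i++) primitive[i] = i;
            long primitiveSum = 0;
            for (int value : primitive) primitiveSum += value;

            System.out.println(boxedSum + " " + primitiveSum);
        }
    }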
34. Game Event Server

    class PlayerAttack {
        private long playerId;
        private int weapon;
        private int energy;
        ...
        private int getWeapon() {
            return weapon;
        }
        ...

35. Flyweight Unsafe (1)

    static final Unsafe unsafe = ... ;

    long space = OBJ_COUNT * ATTACK_OBJ_SIZE;
    address = unsafe.allocateMemory(space);

    static PlayerAttack get(int index) {
        long loc = address + (ATTACK_OBJ_SIZE * index);
        return new PlayerAttack(loc);
    }

36. Flyweight Unsafe (2)

    class PlayerAttack {
        static final long WEAPON_OFFSET = 8;
        private long loc;

        public int getWeapon() {
            long address = loc + WEAPON_OFFSET;
            return unsafe.getInt(address);
        }

        public void setWeapon(int weapon) {
            long address = loc + WEAPON_OFFSET;
            unsafe.putInt(address, weapon);
        }
    }

37. SLAB
   ● Manually writing this is:
     ○ time consuming
     ○ error prone
     ○ boilerplate-ridden
   ● Prototype library:
     ○ http://github.com/RichardWarburton/slab
     ○ Maps an interface to a flyweight
38. Slab (2)

    public interface GameEvent extends Cursor {
        public int getId();
        public void setId(int value);
        public long getStrength();
        public void setStrength(long value);
    }

    Allocator<GameEvent> eventAllocator = Allocator.of(GameEvent.class);
    GameEvent event = eventAllocator.allocate(100);
    event.move(50);
    event.setId(6);
    assertEquals(6, event.getId());

39. Data Alignment
   ● An address a is n-byte aligned when:
     ○ n is a power of two
     ○ a is divisible by n
   ● Cache Line Aligned:
     ○ byte aligned to the size of a cache line
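The definition boils down to a mask test for power-of-two n; a small sketch, assuming the usual 64-byte x86 cache line:

    final class Alignment {
        // n-byte aligned: n is a power of two and the address is divisible by n,
        // i.e. the low log2(n) bits of the address are zero.
        static boolean isAligned(long address, long n) {
            return (address & (n - 1)) == 0;
        }

        // Cache-line aligned on current x86: 64-byte aligned.
        static boolean isCacheLineAligned(long address) {
            return isAligned(address, 64);
        }
    }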
40. Rules and Hacks
   ● Java (Hotspot) ♥ 8-byte alignment:
     ○ new Java Objects
     ○ Unsafe.allocateMemory
     ○ ByteBuffer.allocateDirect
   ● Get the object address:
     ○ Unsafe gives it to you directly
     ○ Direct ByteBuffer has a field called address
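A minimal sketch of reading that field; Buffer.address is JDK-internal, so this relies on reflection and may need --add-opens on recent JDKs:

    import java.lang.reflect.Field;
    import java.nio.Buffer;
    import java.nio.ByteBuffer;

    public class DirectBufferAddress {
        public static void main(String[] args) throws Exception {
            ByteBuffer buffer = ByteBuffer.allocateDirect(1024);

            // Direct ByteBuffers keep their native base address in Buffer.address
            Field addressField = Buffer.class.getDeclaredField("address");
            addressField.setAccessible(true);
            long address = addressField.getLong(buffer);

            System.out.printf("base address: 0x%x, 64-byte aligned: %b%n",
                    address, (address & 63) == 0);
        }
    }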
41. Cacheline Aligned Access
   ● Performance Benefits:
     ○ 3-6x on Core2Duo
     ○ ~1.5x on Nehalem Class
     ○ Measured using direct ByteBuffers
     ○ Mid Aligned vs Straddling a Cacheline
   ● http://psy-lob-saw.blogspot.co.uk/2013/01/direct-memory-alignment-in-java.html
42. Today I learned ... Access and Cache Aligned, Contiguous Data
43. Translation Lookaside Buffers: TLB, not TLAB!
44. Virtual Memory
   ● RAM is finite
   ● RAM is fragmented
   ● Multitasking is nice
   ● Programming easier with a contiguous address space
45. Page Tables
   ● Virtual Address Space split into pages
   ● Page has two Addresses:
     ○ Virtual Address: Process view
     ○ Physical Address: Location in RAM/Disk
   ● Page Table
     ○ Mapping between these addresses
     ○ OS managed
     ○ Page Fault when entry is missing
     ○ OS pages in from disk
46. Translation Lookaside Buffers
   ● TLB = Page Table Cache
   ● Avoids memory bottlenecks in page lookups
   ● CPU Cache is multi-level => TLB is multi-level
   ● Miss/Hit rates measurable via CPU Counters
     ○ Hit/Miss Rates
     ○ Walk Duration
47. TLB Flushing
   ● Process context switch => change address space
   ● Historically caused a TLB "flush"
   ● Avoided with an Address Space Identifier (ASID)
48. Page Size Tradeoff
   ● Bigger pages = fewer pages = quicker page lookups
   ● vs
   ● Bigger pages = wasted memory space = more disk paging
49. Page Sizes
   ● Page Size is adjustable
   ● Common Sizes: 4KB, 2MB, 1GB
   ● Easy Speedup: 10%-30%
   ● -XX:+UseLargePages
   ● Oracle DBs love Large Pages
50. Too Long, Didn't Listen
   1. Virtual Memory + Paging have an overhead
   2. Translation Lookaside Buffers used to reduce cost
   3. Page size changes important
51. Principles & Examples (2): Concurrent Code
52. Context Switching
   ● Multitasking systems context switch between processes
   ● Variants:
     ○ Thread
     ○ Process
     ○ Virtualized OS
53. Context Switching Costs
   ● Direct Cost
     ○ Save old process state
     ○ Schedule new process
     ○ Load new process state
   ● Indirect Cost
     ○ TLB Reload (Process only)
     ○ CPU Pipeline Flush
     ○ Cache Interference
     ○ Temporal Locality
54. Indirect Costs (cont.)
   ● "Quantifying The Cost of Context Switch" by Li et al.
     ○ 5 years old - principle still applies
     ○ Direct = 3.8 microseconds
     ○ Indirect = ~0 - 1500 microseconds
     ○ indirect dominates direct when working set > L2 Cache size
   ● Cooperative hits also exist
   ● Minimise context switching if possible
55. Locking
   ● Locks require kernel arbitration
     ○ Not a context switch, but a mode switch
     ○ Lots of lock contention causes context switching
   ● Has a cache-pollution cost
   ● Can try lock-free algorithms
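A minimal sketch of the last bullet: a compare-and-set retry loop on AtomicLong instead of a lock, so the fast path never needs kernel arbitration:

    import java.util.concurrent.atomic.AtomicLong;

    public class LockFreeCounter {
        private final AtomicLong value = new AtomicLong();

        // Retry the CAS instead of taking a lock; contention costs a few
        // retries rather than a mode switch or a context switch.
        public long incrementUpTo(long max) {
            while (true) {
                long current = value.get();
                long next = Math.min(current + 1, max);
                if (value.compareAndSet(current, next)) {
                    return next;
                }
            }
        }
    }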
56. Java Memory Model
   ● JSR 133 specs the memory model
   ● Turtles Example:
     ○ Java Level: volatile
     ○ CPU Level: lock
     ○ Cache Level: MESI
   ● Has a cost!
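For the Java level of that stack, a small sketch; on x86, HotSpot typically backs the volatile store below with a locked/fenced instruction, which is where the MESI traffic and the cost come from:

    public class StopFlag implements Runnable {
        // volatile is the Java-level construct from the slide: the JMM gives its
        // reads and writes visibility/ordering guarantees, paid for lower down
        // in fenced/locked instructions and cache-coherency traffic.
        private volatile boolean running = true;

        public void stop() {
            running = false;        // volatile write: not free
        }

        @Override
        public void run() {
            while (running) {       // volatile read on every iteration
                // do work
            }
        }
    }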
57. False Sharing
   ● Data can share a cache line
   ● Not always good
     ○ Some data accidentally next to other data
     ○ Causes cache lines to be evicted prematurely
   ● Serious Concurrency Issue
58. Concurrent Cache Line Access (diagram © Intel)
59. Current Solution: Field Padding

    public volatile long value;
    public long pad1, pad2, pad3, pad4, pad5, pad6;

   8 bytes (long) * 7 fields + header = 64 bytes = cache line
   NB: fields aligned to 8 bytes even if smaller
60. Real Solution: JEP 142

    class UncontendedFoo {
        int someValue;
        @Uncontended volatile long foo;
        Bar bar;
    }

   http://openjdk.java.net/jeps/142
61. False Sharing in Your GC
   ● Card Tables
     ○ Split RAM into cards
     ○ Table records which parts of RAM are written to
     ○ Avoid rescanning large blocks of RAM
   ● BUT ... optimised by writing to the card byte on every write
   ● -XX:+UseCondCardMark
     ○ Use with care: 15-20% sequential slowdown
62. Buyer Beware:
   ● Context Switching
   ● False Sharing
   ● Visibility/Atomicity/Locking Costs
63. Too Long, Didn't Listen
   1. Caching behaviour has a performance effect
   2. The effects are measurable
   3. There are common problems and solutions
64. Questions?
   ● @RichardWarburto
   ● www.akkadia.org/drepper/cpumemory.pdf
   ● www.jclarity.com/friends/
   ● g.oswego.edu/dl/concurrency-interest/
65. A Brave New World (hopefully, sort of)
66. Arrays 2.0
   ● Proposed JVM Feature
   ● Library Definition of Array Types
   ● Incorporate a notion of flatness
   ● Semi-reified Generics
   ● Loads of other goodies (final, volatile, resizing, etc.)
   ● Requires Value Types
67. Value Types
   ● Assigning or passing the object copies the value and not the reference
   ● Allow control of the memory layout
   ● Control of layout = Control of Caching Behaviour
   ● Idea, initial prototypes in mlvm
68. The future will solve all problems!