
Caching in


Published in: Technology


  1. Caching In: Understand, Measure and Use your CPU Cache more effectively. Richard Warburton
  2. What are you talking about?
     ● Why do you care?
     ● Measurement
     ● Principles and Examples
  3. Why do you care? Perhaps: why /should/ you care.
  4. The Problem: the CPU is very fast; main memory is relatively slow.
  5. The Solution: CPU Cache
     ● Core demands data, looks at its cache
       ○ If present (a "hit") then data is returned to a register
       ○ If absent (a "miss") then data is looked up from memory and stored in the cache
     ● Fast memory is expensive; a small amount is affordable
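The hit/miss logic above can be sketched as a tiny direct-mapped cache model. Everything here (the `CacheSim` name, the line count, the one-word-per-line simplification) is an illustrative toy, not how real hardware is built:

```java
// Toy direct-mapped cache modelling the hit/miss flow described above.
// Hypothetical sketch: names and sizes are illustrative only.
class CacheSim {
    static final int LINES = 8;          // number of cache lines in the toy
    final long[] tags = new long[LINES]; // which address each line currently holds
    final boolean[] valid = new boolean[LINES];
    long hits, misses;

    // Returns true on a hit; on a miss, "fetches" and installs the line.
    boolean access(long address) {
        int line = (int) (address % LINES);      // direct mapping: one candidate line
        if (valid[line] && tags[line] == address) {
            hits++;
            return true;                          // hit: served from the cache
        }
        misses++;                                 // miss: load from memory, cache it
        valid[line] = true;
        tags[line] = address;
        return false;
    }
}
```

A second access to the same address hits; an address that maps to the same line evicts the previous occupant, which is the conflict-miss behaviour real caches also exhibit.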
  6. Multilevel Cache: Intel Sandy Bridge
     Each physical core (2 logical cores via HyperThreading) has its own Level 1 Data Cache, Level 1 Instruction Cache and Level 2 Cache; all cores share a single Level 3 Cache.
  7. CPU Stalls
     ● Want your CPU to execute instructions
     ● Stall = CPU can't execute because it's waiting on memory
     ● Stalls displace potential computation
     ● Busy-work strategies:
       ○ Out-of-Order Execution
       ○ Simultaneous MultiThreading (SMT) / HyperThreading (HT)
  8. How bad is a miss?
     Location     | Clock-cycle Cost
     Register     | 0
     L1 Cache     | 3
     L2 Cache     | 9
     L3 Cache     | 21
     Main Memory  | 400+
     NB: Sandy Bridge numbers
  9. Cache Lines
     ● Data transferred in cache lines
     ● Fixed-size block of memory
     ● Usually 64 bytes in current x86 CPUs
     ● Purely a hardware consideration
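Some back-of-the-envelope arithmetic follows from that 64-byte figure (class and method names here are made up for illustration):

```java
// Cache-line arithmetic for 64-byte lines, as on current x86 CPUs.
// Illustrative helper, not part of any real API.
class CacheLineMath {
    static final int LINE_BYTES = 64;   // typical line size on current x86

    // How many array elements of a given size fit in one line.
    static int elementsPerLine(int elementBytes) {
        return LINE_BYTES / elementBytes;
    }

    // Which line (counting from the array base) element i falls into,
    // assuming the array starts on a line boundary.
    static int lineOfIndex(int i, int elementBytes) {
        return (i * elementBytes) / LINE_BYTES;
    }
}
```

So one line holds 16 ints or 8 longs: a linear int scan gets 15 "free" hits for every line actually fetched, which is why sequential access is so cheap.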
  10. You don't need to know everything
  11. Architectural Takeaways
     ● Modern CPUs have large multilevel caches
     ● Cache misses cause stalls
     ● Stalls are expensive
  12. Measurement: Barometer, Sun Dial ... and CPU Counters
  13. Do you care?
     ● DON'T look at CPU caching behaviour first!
     ● Not all performance problems are CPU bound:
       ○ Garbage Collection
       ○ Networking
       ○ Database or External Service
       ○ I/O
     ● Consider caching behaviour when you're execution bound and know your hotspot.
  14. CPU Performance Counters
     ● CPU registers which can count events
       ○ e.g. the number of Level 3 cache misses
     ● Model Specific Registers
       ○ not standardised by the x86 instruction set
       ○ differ by CPU (e.g. Penryn vs Sandy Bridge)
     ● Don't worry - leave the details to tooling
  15. Measurement: Instructions Retired
     ● The number of instructions which executed until completion
       ○ ignores branch mispredictions
     ● When stalled you're not retiring instructions
     ● Aim to maximise instruction retirement when reducing cache misses
  16. Cache Misses
     ● Cache level:
       ○ Level 3
       ○ Level 2
       ○ Level 1 (Data vs Instruction)
     ● What to measure:
       ○ Hits
       ○ Misses
       ○ Reads vs Writes
       ○ Calculate the ratio
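The ratio calculation is simple once a profiler has given you raw hit and miss counts. A minimal sketch, with made-up numbers purely for illustration:

```java
// Turning raw counter readings (hits and misses from a cache profiler)
// into a miss ratio. Illustrative helper; the counts are supplied by tooling.
class MissRatio {
    // Fraction of accesses that missed: misses / (hits + misses).
    static double missRatio(long hits, long misses) {
        long accesses = hits + misses;
        return accesses == 0 ? 0.0 : (double) misses / accesses;
    }
}
```

Comparing this ratio before and after a change tells you whether a data-layout tweak actually helped, independently of absolute run time.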
  17. Cache Profilers
     ● Open Source:
       ○ perfstat
       ○ linux (rdmsr/wrmsr)
       ○ Linux Kernel (via /proc)
     ● Proprietary:
       ○ jClarity jMSR
       ○ Intel VTune
       ○ AMD Code Analyst
       ○ Visual Studio 2012
  18. Good Benchmarking Practice
     ● Warmups
     ● Measure and rule out other factors
     ● Specific caching ideas ...
  19. Take Care with Working Set Size
  20. Good Benchmark = Low Variance
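The warmup-then-measure pattern can be sketched in a few lines. This is a deliberately naive harness to show the idea; for real measurements a framework such as JMH handles warmup, dead-code elimination and statistics far more carefully. All names here are illustrative:

```java
// Minimal warmup-then-measure harness with a variance check.
// Sketch only: real benchmarks should use a framework like JMH.
class Bench {
    // Run the workload after warmups and return one nanosecond sample per run.
    static long[] measure(Runnable workload, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) workload.run();   // let the JIT and caches settle
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            times[i] = System.nanoTime() - start;           // one sample per run
        }
        return times;
    }

    // Population variance of the samples: low variance = trustworthy benchmark.
    static double variance(long[] samples) {
        double mean = 0;
        for (long s : samples) mean += s;
        mean /= samples.length;
        double var = 0;
        for (long s : samples) var += (s - mean) * (s - mean);
        return var / samples.length;
    }
}
```

If the variance across runs is large relative to the mean, the measurement is telling you about noise (GC, scheduling, working-set effects), not about your code.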
  21. You can measure Cache behaviour
  22. Principles & Examples
  23. Prefetching
     ● Prefetch = eagerly load data
     ● Adjacent Cache Line Prefetch
     ● Data Cache Unit (Streaming) Prefetch
     ● Problem: CPU prediction isn't perfect
     ● Solution: arrange data so accesses are predictable
  24. Temporal Locality: repeatedly referring to the same data in a short time span.
      Spatial Locality: referring to data that is close together in memory.
      Sequential Locality: referring to data that is arranged linearly in memory.
  25. General Principles
     ● Use smaller data types (-XX:+UseCompressedOops)
     ● Avoid big holes in your data
     ● Make accesses as linear as possible
  26. Primitive Arrays

      // Sequential Access = Predictable
      for (int i = 0; i < someArray.length; i++)
          someArray[i]++;
  27. Primitive Arrays - Skipping Elements

      // Holes Hurt
      for (int i = 0; i < someArray.length; i += SKIP)
          someArray[i]++;
  28. Primitive Arrays - Skipping Elements
  29. Multidimensional Arrays
     ● Multidimensional arrays are really arrays of arrays in Java (unlike C)
     ● Some people realign their accesses:

      for (int col = 0; col < COLS; col++) {
          for (int row = 0; row < ROWS; row++) {
              array[ROWS * col + row]++;
          }
      }
  30. Bad Alignment

      Strides the wrong way, bad locality:
      array[COLS * row + col]++;

      Strides the right way, good locality:
      array[ROWS * col + row]++;
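The two stride patterns can be put side by side over one flat row-major array. Both loops below do identical work (every element incremented once); only the access order differs. Timing them on a large array shows the gap, but since the timings are machine-dependent, the sketch just exposes the two patterns (names are illustrative):

```java
// Unit stride vs bad stride over the same flat row-major ROWS x COLS array.
// Each method increments every element exactly once; only the order differs.
class Strides {
    // Consecutive addresses: each fetched cache line is fully used.
    static void unitStride(int[] array, int rows, int cols) {
        for (int row = 0; row < rows; row++)
            for (int col = 0; col < cols; col++)
                array[cols * row + col]++;
    }

    // Jumps cols ints per step: touches a new cache line almost every access.
    static void badStride(int[] array, int rows, int cols) {
        for (int col = 0; col < cols; col++)
            for (int row = 0; row < rows; row++)
                array[cols * row + col]++;
    }
}
```

On arrays much larger than the last-level cache, the unit-stride version is typically several times faster despite performing the same number of increments.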
  31. Data Locality vs Java Heap Layout

      class Foo {
          Integer count;
          Bar bar;
          Baz baz;
      }

      // No alignment guarantees
      for (Foo foo : foos) { foo.count = 5; ...; }
  32. Data Locality vs Java Heap Layout
     ● Serious Java weakness
     ● Location of objects in memory is hard to guarantee
     ● Garbage collection impact:
       ○ Mark-Sweep
       ○ Copying Collector
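One common workaround for the layout problem is "structure of arrays": instead of an array of Foo objects scattered across the heap, keep each field in its own primitive array so a scan over one field is perfectly sequential. A minimal sketch (all names illustrative, mirroring the Foo example above):

```java
// "Structure of arrays" layout for the Foo example: the hot field lives in
// one contiguous primitive array instead of being spread across objects.
// Illustrative sketch only.
class FooColumns {
    final int[] counts;   // Foo.count for every Foo, stored contiguously
    // bar/baz references would live in parallel arrays as needed

    FooColumns(int n) {
        counts = new int[n];
    }

    // The loop from the slide, now a linear, prefetch-friendly scan.
    void setAllCounts(int value) {
        for (int i = 0; i < counts.length; i++)
            counts[i] = value;
    }
}
```

The trade-off is ergonomics: you give up the object view of each Foo in exchange for guaranteed contiguity of the field you actually iterate over.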
  33. General Principles
     ● Primitive collections (GNU Trove, etc.)
     ● Avoid code bloat (loop unrolling)
     ● Reconsider data structures
       ○ e.g. Judy Arrays, kD-Trees, Z-Order Curves
     ● Care with context switching
  34. Game Event Server

      class PlayerAttack {
          private long playerId;
          private int weapon;
          private int energy;
          ...
          private int getWeapon() { return weapon; }
          ...
  35. Flyweight Unsafe (1)

      static final Unsafe unsafe = ... ;
      long space = OBJ_COUNT * ATTACK_OBJ_SIZE;
      address = unsafe.allocateMemory(space);

      static PlayerAttack get(int index) {
          long loc = address + (ATTACK_OBJ_SIZE * index);
          return new PlayerAttack(loc);
      }
  36. Flyweight Unsafe (2)

      class PlayerAttack {
          static final long WEAPON_OFFSET = 8;
          private long loc;

          public int getWeapon() {
              long address = loc + WEAPON_OFFSET;
              return unsafe.getInt(address);
          }

          public void setWeapon(int weapon) {
              long address = loc + WEAPON_OFFSET;
              unsafe.putInt(address, weapon);
          }
      }
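The same flyweight layout can be sketched without Unsafe by backing the objects with a direct ByteBuffer, which gives the same contiguous off-heap storage through a supported API. The class and the 16-byte object size are assumptions for illustration (playerId 8 bytes, weapon 4, energy 4, matching the fields on the previous slides):

```java
import java.nio.ByteBuffer;

// Flyweight over a direct ByteBuffer instead of sun.misc.Unsafe.
// Sizes and offsets mirror the slides' PlayerAttack and are illustrative.
class PlayerAttacks {
    static final int OBJ_SIZE = 16;       // playerId (8) + weapon (4) + energy (4)
    static final int WEAPON_OFFSET = 8;   // weapon sits after playerId
    final ByteBuffer buffer;

    PlayerAttacks(int count) {
        buffer = ByteBuffer.allocateDirect(count * OBJ_SIZE);  // contiguous, zeroed
    }

    int getWeapon(int index) {
        return buffer.getInt(index * OBJ_SIZE + WEAPON_OFFSET);
    }

    void setWeapon(int index, int weapon) {
        buffer.putInt(index * OBJ_SIZE + WEAPON_OFFSET, weapon);
    }
}
```

Accesses by increasing index walk the buffer linearly, so a scan over all attacks gets the sequential-locality benefit the slides are after, without depending on an unsupported API.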
  37. False Sharing
     ● Data can share a cache line
     ● Not always good:
       ○ Some data accidentally sits next to other data
       ○ Causes cache lines to be evicted prematurely
     ● Serious concurrency issue
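A minimal sketch of the problem: two counters owned by two different threads. Unpadded, they likely land on the same cache line and the cores ping-pong its ownership; padded, each sits on its own line. The results are identical either way; only the wall-clock time under two threads differs, so the sketch just sets up both layouts (names are illustrative):

```java
// Two single-writer counters, with and without padding between them.
// Illustrative sketch: correctness is identical, only contention differs.
class Counters {
    static class Unpadded {
        volatile long a, b;                  // adjacent: likely one cache line
    }

    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7;     // ~56 bytes pushes b off a's line
        volatile long b;
    }

    // Each thread is the sole writer of its own counter.
    static void bump(Padded c, int times) {
        Thread t1 = new Thread(() -> { for (int i = 0; i < times; i++) c.a++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < times; i++) c.b++; });
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note the increments need no atomics because each thread writes only its own volatile field; the cost difference between the two layouts is purely cache-coherence traffic.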
  38. Concurrent Cache Line Access (diagram © Intel)
  39. Current Solution: Field Padding

      public volatile long value;
      public long pad1, pad2, pad3, pad4, pad5, pad6;

      8 bytes (long) * 7 fields + header = 64 bytes = cache line
      NB: fields are aligned to 8 bytes even if smaller
  40. Real Solution: JEP 142

      class UncontendedFoo {
          int someValue;
          @Uncontended volatile long foo;
          Bar bar;
      }
  41. False Sharing in Your GC
     ● Card Tables
       ○ Split RAM into cards
       ○ Table records which parts of RAM are written to
       ○ Avoids rescanning large blocks of RAM
     ● BUT ... the optimisation writes to a card-table byte on every reference write, so cards for nearby memory can falsely share a cache line
     ● -XX:+UseCondCardMark
       ○ Use with care: 15-20% sequential slowdown
  42. Too Long, Didn't Listen
     1. Caching behaviour has a performance effect
     2. The effects are measurable
     3. There are common problems and solutions
  43. Questions?
  44. A Brave New World (hopefully, sort of)
  45. Arrays 2.0
     ● Proposed JVM feature
     ● Library definition of array types
     ● Incorporate a notion of flatness
     ● Semi-reified generics
     ● Loads of other goodies (final, volatile, resizing, etc.)
     ● Requires Value Types
  46. Value Types
     ● Assigning or passing the object copies the value, not the reference
     ● Allow control of the memory layout
     ● Control of layout = control of caching behaviour
     ● Idea; initial prototypes in mlvm
  47. The future will solve all problems!