Caching In
Understand, Measure and Use your CPU Cache more effectively




         Richard Warburton
What are you talking about?

● Why do you care?

● Measurement

● Principles and Examples
Why do you care?
Perhaps why /should/ you care.
The Problem

(diagram: the CPU is Very Fast; main memory is Relatively Slow)
The Solution: CPU Cache

● Core Demands Data, looks at its cache
  ○ If present (a "hit") then data returned to register
  ○ If absent (a "miss") then data looked up from
    memory and stored in the cache


● Fast memory is expensive, a small amount
  is affordable
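The hit/miss logic above can be sketched as a toy direct-mapped cache. This is an illustration of the lookup described on the slide, not how real hardware is implemented; all names and the 8-line size are made up for the example.

```java
public class ToyCache {
    static final int LINE_SIZE = 64;   // bytes per cache line, as on current x86
    static final int NUM_LINES = 8;    // tiny cache, purely for illustration

    long[] tags = new long[NUM_LINES];       // which memory block each slot holds
    boolean[] valid = new boolean[NUM_LINES];
    int hits, misses;

    // Returns true on a hit; on a miss, "loads" the line from memory.
    boolean access(long address) {
        long block = address / LINE_SIZE;       // which 64-byte line
        int slot = (int) (block % NUM_LINES);   // direct-mapped slot
        if (valid[slot] && tags[slot] == block) {
            hits++;
            return true;
        }
        misses++;                               // miss: fetch and cache the line
        tags[slot] = block;
        valid[slot] = true;
        return false;
    }

    public static void main(String[] args) {
        ToyCache cache = new ToyCache();
        // Sequential byte accesses: one miss per 64-byte line, then hits.
        for (long addr = 0; addr < 256; addr++) cache.access(addr);
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses);
    }
}
```

Sequentially touching 256 bytes costs only 4 misses (one per line), which is the whole argument for predictable access patterns later in the talk.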
Multilevel Cache: Intel Sandybridge

(diagram) Physical Cores 0 .. N, each running 2 logical cores via HyperThreading:
  ● Per physical core: Level 1 Data Cache and Level 1 Instruction Cache
  ● Per physical core: Level 2 Cache
  ● Shared between all cores: Level 3 Cache
CPU Stalls
● Want your CPU to execute instructions

● Stall = CPU can't execute because it's
  waiting on memory

● Stalls displace potential computation

● Busy-work Strategies
  ○ Out-of-Order Execution
  ○ Simultaneous MultiThreading (SMT) /
    HyperThreading (HT)
How bad is a miss?
          Location        Clock-cycle Cost

          Register        0

          L1 Cache        3

          L2 Cache        9

          L3 Cache        21

          Main Memory     400+




NB: Sandybridge Numbers
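The table compounds: a small fall in hit rate gets multiplied by that 400+ cycle main-memory cost. A back-of-envelope average-access-cost calculation using the Sandybridge numbers above (the hit rates are made-up inputs, not measurements):

```java
public class AccessCost {
    // Average clock cycles per access, given the hit rate at each level.
    // Costs are the Sandybridge numbers from the table: L1=3, L2=9, L3=21, RAM=400.
    static double averageCost(double l1Hit, double l2Hit, double l3Hit) {
        double cost = l1Hit * 3;
        double l2Reach = 1 - l1Hit;               // fraction missing L1
        cost += l2Reach * l2Hit * 9;
        double l3Reach = l2Reach * (1 - l2Hit);   // fraction missing L1 and L2
        cost += l3Reach * l3Hit * 21;
        cost += l3Reach * (1 - l3Hit) * 400;      // all the way to main memory
        return cost;
    }

    public static void main(String[] args) {
        // 95% L1 hits, 80% of the remainder in L2, 80% of that remainder in L3:
        System.out.printf("avg ~= %.1f cycles%n", averageCost(0.95, 0.80, 0.80));
    }
}
```

Even with 95% L1 hits, the 1-in-100 accesses that reach main memory dominate the average.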
Cache Lines
● Data transferred in cache lines

● Fixed size block of memory

● Usually 64 bytes in current x86 CPUs

● Purely hardware consideration
You don't need to
know everything
Architectural Takeaways



● Modern CPUs have large multilevel Caches

● Cache misses cause Stalls

● Stalls are expensive
Measurement
Barometer, Sun Dial ... and CPU Counters
Do you care?
● DON'T look at CPU Caching behaviour first!

● Not all Performance Problems are CPU Bound
  ○   Garbage Collection
  ○   Networking
  ○   Database or External Service
  ○   I/O


● Consider caching behaviour when you're
  execution bound and know your hotspot.
CPU Performance Counters
● CPU Register which can count events
   ○ Eg: the number of Level 3 Cache Misses


● Model Specific Registers
   ○ not standardised by the x86 instruction set
   ○ differ by CPU (eg Penryn vs Sandybridge)


● Don't worry - leave details to tooling
Measurement: Instructions Retired

● The number of instructions that executed
  to completion
   ○ ignores branch mispredictions


● When stalled you're not retiring instructions

● Aim to maximise instruction retirement when
  reducing cache misses
Cache Misses
● Cache Level
  ○ Level 3
  ○ Level 2
  ○ Level 1 (Data vs Instruction)


● What to measure
  ○   Hits
  ○   Misses
  ○   Reads vs Writes
  ○   Calculate Ratio
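The "Calculate Ratio" step is just misses over total accesses; the profiler gives you the raw counters and the arithmetic is yours. A minimal sketch (counter values are invented):

```java
public class MissRatio {
    // Fraction of accesses that missed: misses / (hits + misses).
    static double missRatio(long hits, long misses) {
        return (double) misses / (hits + misses);
    }

    public static void main(String[] args) {
        // e.g. 9M hits and 1M misses reported for the L3 cache:
        System.out.println(missRatio(9_000_000L, 1_000_000L));
    }
}
```

Track the ratio rather than raw miss counts: a longer run has more misses simply because it does more work.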
Cache Profilers
● Open Source
  ○ perfstat
  ○ Linux (rdmsr/wrmsr)
  ○ Linux Kernel (via /proc)


● Proprietary
  ○   jClarity jMSR
  ○   Intel VTune
  ○   AMD Code Analyst
  ○   Visual Studio 2012
Good Benchmarking Practice

● Warmups

● Measure and rule-out other factors

● Specific Caching Ideas ...
Take Care with Working Set Size
Good Benchmark = Low Variance
You can measure Cache behaviour
Principles & Examples
Prefetching


●   Prefetch = Eagerly load data
●   Adjacent Cache Line Prefetch
●   Data Cache Unit (Streaming) Prefetch
●   Problem: CPU Prediction isn't perfect
●   Solution: Arrange Data so accesses
    are predictable
Temporal Locality
            Repeatedly referring to same data in a short time span


Spatial Locality
                Referring to data that is close together in memory



Sequential Locality
              Referring to data that is arranged linearly in memory
General Principles

● Use smaller data types (-XX:+UseCompressedOops)

● Avoid 'big holes' in your data

● Make accesses as linear as possible
Primitive Arrays

// Sequential Access = Predictable

for (int i=0; i<someArray.length; i++)
   someArray[i]++;
Primitive Arrays - Skipping Elements

// Holes Hurt

for (int i=0; i<someArray.length; i += SKIP)
   someArray[i]++;
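The two loops above can be put side by side in a rough harness. Both touch the same number of elements, but with SKIP set to one cache line (16 ints = 64 bytes) the strided loop pulls in a new line on almost every access. This is a sketch, not a rigorous benchmark; use JMH and treat the timings as indicative only.

```java
public class StrideDemo {
    static final int SIZE = 1 << 24;   // 16M ints = 64 MB, well beyond L3
    static final int SKIP = 16;        // 16 ints = 64 bytes = one cache line

    static long touchSequential(int[] a, int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) sum += a[i]++;   // stride of 1
        return sum;
    }

    static long touchStrided(int[] a, int count) {
        long sum = 0;
        for (int i = 0, n = 0; n < count; i += SKIP, n++)
            sum += a[i]++;                               // one line per access
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[SIZE];
        int touches = SIZE / SKIP;     // same element count for both loops

        long t0 = System.nanoTime();
        touchSequential(a, touches);
        long t1 = System.nanoTime();
        touchStrided(a, touches);
        long t2 = System.nanoTime();

        System.out.printf("sequential %d ms, strided %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```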
Primitive Arrays - Skipping Elements
(benchmark chart)
Multidimensional Arrays
● Multidimensional Arrays are really Arrays of
  Arrays in Java. (Unlike C)

● Some people realign their accesses:

for (int col=0; col<COLS; col++) {
  for (int row=0; row<ROWS; row++) {
    array[ROWS * col + row]++;
  }
}
Bad Alignment

Strides the wrong way, bad
locality.

array[COLS * row + col]++;




Strides the right way, good
locality.

array[ROWS * col + row]++;
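The two strides above differ only in loop order against the same `ROWS * col + row` layout. A complete sketch (both methods do identical work, so any timing gap is purely locality):

```java
public class Traversal {
    // Layout: element (row, col) lives at index rows * col + row.

    static void goodOrder(int[] a, int rows, int cols) {
        for (int col = 0; col < cols; col++)
            for (int row = 0; row < rows; row++)
                a[rows * col + row]++;   // stride of 1: linear walk
    }

    static void badOrder(int[] a, int rows, int cols) {
        for (int row = 0; row < rows; row++)
            for (int col = 0; col < cols; col++)
                a[rows * col + row]++;   // stride of 'rows' ints per step
    }

    public static void main(String[] args) {
        int[] a = new int[2 * 3];
        goodOrder(a, 2, 3);
        badOrder(a, 2, 3);
        System.out.println(java.util.Arrays.toString(a)); // every element touched twice
    }
}
```

With large arrays the bad order jumps `rows * 4` bytes per access, defeating both the cache line and the prefetcher.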
Data Locality vs Java Heap Layout

class Foo {
   Integer count;
   Bar bar;
   Baz baz;
}

// No alignment guarantees
for (Foo foo : foos) {
   foo.count = 5;
   foo.bar.visit();
}

(diagram: count, bar and baz each live at separate, arbitrary heap locations)
Data Locality vs Java Heap Layout
● Serious Java Weakness

● Location of objects in memory hard to
  guarantee.

● Garbage Collection Impact
  ○ Mark-Sweep
  ○ Copying Collector
General Principles

● Primitive Collections (GNU Trove, etc.)

● Avoid Code bloating (Loop Unrolling)

● Reconsider Data Structures
  ○ eg: Judy Arrays, kD-Trees, Z-Order Curve


● Care with Context Switching
Game Event Server
class PlayerAttack {
  private long playerId;
  private int weapon;
  private int energy;
  ...
  private int getWeapon() {
    return weapon;
  }
  ...
Flyweight Unsafe (1)
static final Unsafe unsafe = ... ;

long space = OBJ_COUNT * ATTACK_OBJ_SIZE;

address = unsafe.allocateMemory(space);

static PlayerAttack get(int index) {
   long loc = address + (ATTACK_OBJ_SIZE * index);
   return new PlayerAttack(loc);
}
Flyweight Unsafe (2)
class PlayerAttack {
   static final long WEAPON_OFFSET = 8;
   private long loc;
   public int getWeapon() {
     long address = loc + WEAPON_OFFSET;
     return unsafe.getInt(address);
   }
   public void setWeapon(int weapon) {
     long address = loc + WEAPON_OFFSET;
     unsafe.putInt(address, weapon);
   }
}
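The same flyweight idea can be sketched without sun.misc.Unsafe using a direct ByteBuffer, which gives the same flat, cache-friendly layout with bounds checking. The 16-byte record layout below (playerId at 0, weapon at 8, energy at 12) is an assumption for illustration, matching the fields on the earlier slide.

```java
import java.nio.ByteBuffer;

public class PlayerAttacks {
    // Assumed flat record layout: long playerId @0, int weapon @8, int energy @12.
    static final int OBJ_SIZE = 16;
    static final int WEAPON_OFFSET = 8;

    final ByteBuffer buffer;

    PlayerAttacks(int count) {
        // One contiguous off-heap block for all records; GC never moves it.
        buffer = ByteBuffer.allocateDirect(count * OBJ_SIZE);
    }

    int getWeapon(int index) {
        return buffer.getInt(index * OBJ_SIZE + WEAPON_OFFSET);
    }

    void setWeapon(int index, int weapon) {
        buffer.putInt(index * OBJ_SIZE + WEAPON_OFFSET, weapon);
    }

    public static void main(String[] args) {
        PlayerAttacks attacks = new PlayerAttacks(1024);
        attacks.setWeapon(42, 7);
        System.out.println(attacks.getWeapon(42));
    }
}
```

Iterating the records in index order now walks memory linearly, exactly the access pattern the prefetcher rewards.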
False Sharing
● Data can share a cache line

● Not always good
  ○ Some data 'accidentally' next to other data.
  ○ Causes Cache lines to be evicted prematurely


● Serious Concurrency Issue
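A minimal sketch of the problem: two threads hammering counters that sit on the same cache line (adjacent array slots) versus slots 8 longs apart (8 × 8 bytes = 64 bytes, one line). The counts are deterministic because each thread owns its own slot; the timings vary by machine and are indicative only.

```java
public class FalseSharingDemo {
    static final int ITERATIONS = 10_000_000;

    // Each thread increments only its own slot; with slots on the same
    // cache line, every write invalidates the line in the other core.
    static long run(long[] counters, int slotA, int slotB) throws InterruptedException {
        Thread a = new Thread(() -> { for (int i = 0; i < ITERATIONS; i++) counters[slotA]++; });
        Thread b = new Thread(() -> { for (int i = 0; i < ITERATIONS; i++) counters[slotB]++; });
        long start = System.nanoTime();
        a.start(); b.start();
        a.join(); b.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        long shared = run(new long[16], 0, 1);   // adjacent: same cache line
        long padded = run(new long[16], 0, 8);   // 64 bytes apart: separate lines
        System.out.printf("same line %d ms, padded %d ms%n",
                shared / 1_000_000, padded / 1_000_000);
    }
}
```

Both runs do identical work and produce identical counts; only the cache line sharing differs, which is what makes false sharing so easy to miss.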
Concurrent Cache Line Access

(diagram: two cores repeatedly writing to the same cache line, © Intel)
Current Solution: Field Padding
public volatile long value;
public long pad1, pad2, pad3, pad4,
pad5, pad6;

8 bytes(long) * 7 fields + header = 64 bytes =
Cacheline

NB: fields aligned to 8 bytes even if smaller
Real Solution: JEP 142
class UncontendedFoo {

    int someValue;

    @Uncontended volatile long foo;

    Bar bar;
}


http://openjdk.java.net/jeps/142
False Sharing in Your GC
● Card Tables
   ○ Split RAM into 'cards'
   ○ Table records which parts of RAM are written to
   ○ Avoid rescanning large blocks of RAM


● BUT ... card marking is optimised by writing
  to the card byte on every reference write

● -XX:+UseCondCardMark
   ○ Use With Care: 15-20% sequential slowdown
Too Long, Didn't Listen

1. Caching Behaviour has a performance effect

2. The effects are measurable

3. There are common problems and solutions
Questions?
           @RichardWarburto


www.akkadia.org/drepper/cpumemory.pdf
www.jclarity.com/friends/
g.oswego.edu/dl/concurrency-interest/
A Brave New World
   (hopefully, sort of)
Arrays 2.0
● Proposed JVM Feature
● Library Definition of Array Types
● Incorporate a notion of 'flatness'
● Semi-reified Generics
● Loads of other goodies (final, volatile,
  resizing, etc.)
● Requires Value Types
Value Types
● When you assign or pass the object, you copy
  the value and not the reference

● Allow control of the memory layout

● Control of layout = Control of Caching
  Behaviour

● Idea, initial prototypes in mlvm
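Today's reference semantics versus the copy-the-value semantics described above can be shown with a mutable class; value types would extend the primitive behaviour to aggregates, and with it give control over flat memory layout. A sketch of the distinction as it stands:

```java
public class ValueSemantics {
    static class Point { int x; Point(int x) { this.x = x; } }

    public static void main(String[] args) {
        Point a = new Point(1);
        Point b = a;              // copies the reference: b IS a
        b.x = 2;
        System.out.println(a.x);  // 2 - a and b alias the same heap object

        int p = 1;
        int q = p;                // copies the value: q is independent
        q = 2;
        System.out.println(p);    // 1 - value types would give aggregates this
    }
}
```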
The future will solve all problems!