Caching in

Caching In
Understand, measure and use your CPU Cache more effectively
Richard Warburton
"The hardware really wants to run fast and you only need to avoid getting in the way"
- Luke Gorrie on Mechanical Sympathy
What are you talking about?
● Why do you care?
● Hardware Fundamentals
● Measurement
● Principles and Examples
Why do you care?
Perhaps why /should/ you care.
The Problem
[Figure: Very Fast vs Relatively Slow]
The Solution: CPU Cache
● Core Demands Data, looks at its cache
○ If present (a "hit") then data returned to register
○ If absent (a "miss") then data looked up from
memory and stored in the cache
● Fast memory is expensive, a small amount
is affordable
Multilevel Cache: Intel Sandybridge
● Shared Level 3 Cache
● Physical Core 0 ... Physical Core N, each with:
○ Level 2 Cache
○ Level 1 Data Cache
○ Level 1 Instruction Cache
○ HT: 2 Logical Cores
CPU Stalls
● Want your CPU to execute instructions
● Pipeline Stall = CPU can't execute because
it's waiting on memory
● Stalls displace potential computation
● Busy-work Strategies
○ Out-of-Order Execution
○ Simultaneous MultiThreading (SMT) /
How bad is a miss?
Location       Latency in Clock Cycles
Register       0
L1 Cache       3
L2 Cache       9
L3 Cache       21
Main Memory    150-400
NB: Sandybridge Numbers
Cache Lines
● Data transferred in cache lines
● Fixed size block of memory
● Usually 64 bytes in current x86 CPUs
○ Between 32 and 256 bytes
● Purely hardware consideration
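The hit/miss behaviour and the fixed-size lines described above can be modelled with a toy simulator. This is a minimal sketch, assuming a direct-mapped cache with 64-byte lines; the class name ToyCache and its sizes are hypothetical, and real CPU caches are set-associative and considerably more complex.

```java
/** Toy direct-mapped cache model; not a description of any real CPU. */
final class ToyCache {
    static final int LINE_SIZE = 64;      // assumed bytes per cache line

    private final long[] tags;            // which line currently occupies each slot
    private final boolean[] valid;        // whether the slot holds anything yet
    long hits;
    long misses;

    ToyCache(int lineCount) {
        tags = new long[lineCount];
        valid = new boolean[lineCount];
    }

    /** Simulates a load from the given byte address. */
    void access(long address) {
        long line = address / LINE_SIZE;           // line containing the address
        int slot = (int) (line % tags.length);     // direct-mapped: one possible slot
        if (valid[slot] && tags[slot] == line) {
            hits++;                                // data already cached
        } else {
            misses++;                              // whole line is fetched from memory
            tags[slot] = line;
            valid[slot] = true;
        }
    }

    public static void main(String[] args) {
        ToyCache cache = new ToyCache(512);        // 512 lines * 64 bytes = 32KB
        for (long addr = 0; addr < 4096; addr += 4) {
            cache.access(addr);                    // walk 1024 ints sequentially
        }
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses);
    }
}
```

In this model a sequential walk over ints misses once per 64-byte line and hits on the other 15 ints in that line, which is why data is moved a whole line at a time.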
You don't need to
know everything
Architectural Takeaways
● Modern CPUs have large multilevel Caches
● Cache misses cause Stalls
● Stalls are expensive
Measurement
Barometer, Sun Dial ... and CPU Counters
Do you care?
● DON'T look at CPU Caching behaviour first
● Many problems not Execution Bound
○ Networking
○ Database or External Service
○ I/O
○ Garbage Collection
○ Insufficient Parallelism
● Consider caching behaviour when you're
execution bound and know your hotspot.
CPU Performance Event Counters
● CPU Register which can count events
○ Eg: the number of Level 3 Cache Misses
● Model Specific Registers
○ not standardised as part of the x86 instruction set
○ differ by CPU (eg Penryn vs Sandybridge) and manufacturer
● Don't worry - leave details to tooling
Measurement: Instructions Retired
● The number of instructions which executed
until completion
○ ignores branch mispredictions
● When stalled you're not retiring instructions
● Aim to maximise instruction retirement when
reducing cache misses
Cache Misses
● Cache Level
○ Level 3
○ Level 2
○ Level 1 (Data vs Instruction)
● What to measure
○ Hits
○ Misses
○ Reads vs Writes
○ Calculate Ratio
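Turning raw counter values into ratios is simple arithmetic. A minimal sketch, assuming the hit, miss, instruction and cycle counts have already been read by one of the profilers; the class and method names are hypothetical.

```java
/** Derives ratios from raw counter values; the values themselves come from a profiler. */
final class CacheRatios {
    /** Fraction of accesses that missed, in the range 0.0 to 1.0. */
    static double missRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) misses / total;
    }

    /** Instructions retired per clock cycle; this falls when the core stalls. */
    static double instructionsPerCycle(long instructionsRetired, long cycles) {
        return cycles == 0 ? 0.0 : (double) instructionsRetired / cycles;
    }
}
```

Compare the ratios before and after a change rather than reading a single value in isolation.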
Cache Profilers
● Open Source
○ perf
○ likwid
○ Linux (rdmsr/wrmsr)
○ Linux Kernel (via /proc)
● Proprietary
○ jClarity jMSR
○ Intel VTune
○ AMD CodeAnalyst
○ Visual Studio 2012
○ Oracle Studio
Good Benchmarking Practice
● Warmups
● Measure and rule-out other factors
● Specific Caching Ideas ...
Take Care with Working Set Size
Good Benchmark has Low Variance
You can measure Cache behaviour
Principles & Examples
Sequential Code
● Prefetch = Eagerly load data
● Adjacent Cache Line Prefetch
● Data Cache Unit (Streaming) Prefetch
● Problem: CPU Prediction isn't perfect
● Solution: Arrange Data so accesses
are predictable
Prefetching
Sequential Locality
Referring to data that is arranged linearly in memory
Spatial Locality
Referring to data that is close together in memory
Temporal Locality
Repeatedly referring to same data in a short time span
General Principles
● Use smaller data types (-XX:+UseCompressedOops)
● Avoid 'big holes' in your data
● Make accesses as linear as possible
Primitive Arrays
// Sequential Access = Predictable
for (int i=0; i<someArray.length; i++)
    someArray[i]++;
Primitive Arrays - Skipping Elements
// Holes Hurt
for (int i=0; i<someArray.length; i += SKIP)
    someArray[i]++;
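To see the effect of skipping elements, the loop above can be wrapped in a small timing harness. This is a rough sketch only: the class name StrideDemo and the array size are assumptions, and it has no warmup or repetition, so treat the numbers as indicative rather than as a proper benchmark.

```java
final class StrideDemo {
    /** Increments every skip-th element and returns how many elements were touched. */
    static int touch(int[] array, int skip) {
        int touched = 0;
        for (int i = 0; i < array.length; i += skip) {
            array[i]++;
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        int[] data = new int[16 * 1024 * 1024];   // 64MB of ints, larger than typical caches
        for (int skip : new int[] {1, 2, 4, 8, 16, 32}) {
            long start = System.nanoTime();
            int touched = touch(data, skip);
            long elapsed = System.nanoTime() - start;
            // Report cost per element actually touched, not per loop.
            System.out.printf("skip=%d ns/element=%.2f%n", skip, (double) elapsed / touched);
        }
    }
}
```

Reporting the cost per touched element makes the different skip values comparable even though each loop does a different amount of work.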
Primitive Arrays - Skipping Elements
Multidimensional Arrays
● Multidimensional Arrays are really Arrays of
Arrays in Java. (Unlike C)
● Some people realign their accesses:
for (int col=0; col<COLS; col++) {
    for (int row=0; row<ROWS; row++) {
        array[ROWS * col + row]++;
    }
}
Bad Access Alignment
Strides the wrong way, bad locality.
array[COLS * row + col]++;
Strides the right way, good locality.
array[ROWS * col + row]++;
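Both index expressions visit every element exactly once; only the order differs. A minimal sketch, with a hypothetical FlatGrid class, that puts the two variants side by side on a flattened array:

```java
final class FlatGrid {
    /** Inner loop walks adjacent elements: the index advances by 1 each step. */
    static long sumGoodStride(int[] array, int rows, int cols) {
        long sum = 0;
        for (int col = 0; col < cols; col++) {
            for (int row = 0; row < rows; row++) {
                sum += array[rows * col + row];
            }
        }
        return sum;
    }

    /** Same loop nest, but the index jumps by cols elements each step. */
    static long sumBadStride(int[] array, int rows, int cols) {
        long sum = 0;
        for (int col = 0; col < cols; col++) {
            for (int row = 0; row < rows; row++) {
                sum += array[cols * row + col];
            }
        }
        return sum;
    }
}
```

The results are identical, so any timing difference between the two methods on a large array comes from the access pattern alone.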
Data Layout Principles
● Primitive Collections (GNU Trove, GS-Coll)
● Arrays > Linked Lists
● Hashtable > Search Tree
● Avoid Code bloating (Loop Unrolling)
● Custom Data Structures
○ Judy Arrays
■ an associative array/map
○ kD-Trees
■ generalised Binary Space Partitioning
○ Z-Order Curve
■ multidimensional data in one dimension
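As an illustration of the Z-Order Curve idea, the sketch below interleaves the bits of two coordinates into one index so that points close together in 2D tend to be close together in memory. The class name ZOrder is hypothetical and coordinates are assumed to fit in 16 bits.

```java
final class ZOrder {
    /** Spreads the low 16 bits of v so that they occupy the even bit positions. */
    private static int spread(int v) {
        v &= 0x0000FFFF;
        v = (v | (v << 8)) & 0x00FF00FF;
        v = (v | (v << 4)) & 0x0F0F0F0F;
        v = (v | (v << 2)) & 0x33333333;
        v = (v | (v << 1)) & 0x55555555;
        return v;
    }

    /** Interleaves x (even bits) and y (odd bits) into a single Z-order index. */
    static int encode(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }
}
```

For example, the four cells of the 2x2 block at the origin map to indices 0 to 3, so they sit next to each other in a one-dimensional array.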
Data Locality vs Java Heap Layout
class Foo {
    Integer count;
    Bar bar;
    Baz baz;
}
// No alignment guarantees
for (Foo foo : foos) {
    foo.count = 5;
    foo.bar.visit();
}
[Figure: array slots 0, 1, 2, 3, ... and the Foo, bar, baz and count objects on the heap]
Data Locality vs Java Heap Layout
● Serious Java Weakness
● Location of objects in memory hard to
guarantee.
● GC also interferes
○ Copying
○ Compaction
Alternative Approach
● Use sun.misc.Unsafe to directly allocate
memory
● Manage mapping yourself
● Use with care
○ Maybe slower
○ Maybe error prone
○ Definitely hard work
SLAB
● Manually writing this is:
○ time consuming
○ error prone
○ boilerplate-ridden
● Prototype library:
○ http://github.com/RichardWarburton/slab
○ Maps an interface to a flyweight
● Still not recommended!
Data Alignment
● An address a is n-byte aligned when a is
divisible by n
● Cache Line Aligned:
○ byte aligned to the size of a cache line
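The definition above translates directly into code. A minimal sketch, assuming a 64-byte cache line and non-negative addresses; the class and method names are hypothetical.

```java
final class Alignment {
    static final int CACHE_LINE = 64; // assumed line size in bytes

    /** True when address is divisible by n. */
    static boolean isAligned(long address, int n) {
        return address % n == 0;
    }

    /** Rounds address up to the next multiple of n; n must be a power of two. */
    static long alignUp(long address, int n) {
        return (address + n - 1) & ~((long) n - 1);
    }

    /** True when the range [address, address + length) crosses a cache line boundary. */
    static boolean straddlesLine(long address, int length) {
        return address / CACHE_LINE != (address + length - 1) / CACHE_LINE;
    }
}
```

An 8-byte value at address 60 straddles two lines, while the same value at address 56 fits inside one.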
Rules and Hacks
● Java (Hotspot) ♥ 8-byte alignment:
○ new Java Objects
○ Unsafe.allocateMemory
○ ByteBuffer.allocateDirect
● Get the object address:
○ Unsafe gives it to you directly
○ Direct ByteBuffer has a field called 'address'
Cacheline Aligned Access
● Performance Benefits:
○ 3-6x on Core2Duo
○ ~1.5x on Nehalem Class
○ Measured using direct ByteBuffers
○ Mid Aligned vs Straddling a Cacheline
● http://psy-lob-saw.blogspot.co.uk/2013/01/direct-memory-alignment-in-java.html
Today I learned ...
Access and Cache Aligned, Contiguous Data
Translation Lookaside Buffers
TLB, not TLAB!
Virtual Memory
● RAM is finite
● RAM is fragmented
● Multitasking is nice
● Programming easier with a contiguous
address space
Page Tables
● Virtual Address Space split into pages
● Page has two Addresses:
○ Virtual Address: Process' view
○ Physical Address: Location in RAM/Disk
● Page Table
○ Mapping Between these addresses
○ OS Managed
○ Page Fault when entry is missing
○ OS Pages in from disk
Translation Lookaside Buffers
● TLB = Page Table Cache
● Avoids memory bottlenecks in page lookups
● CPU Cache is multi-level => TLB is multi-level
● Miss/Hit rates measurable via CPU Counters
○ Hit/Miss Rates
○ Walk Duration
TLB Flushing
● Process context switch => change address
space
● Historically caused "flush" =
● Avoided with an Address Space Identifier
(ASID)
Page Size Tradeoff:
Bigger Pages
= fewer pages
= quicker page lookups
vs
Bigger Pages
= wasted memory space
= more disk paging
Page Sizes
● Page Size is adjustable
● Common Sizes: 4KB, 2MB, 1GB
● Potential Speedup: 10%-30%
● -XX:+UseLargePages
● Oracle DBs particularly
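The "bigger pages = fewer pages" side of the tradeoff is easy to put numbers on. A small sketch, assuming a hypothetical 4GB heap; the class name PageMath is made up for illustration.

```java
final class PageMath {
    static final long KB = 1024, MB = 1024 * KB, GB = 1024 * MB;

    /** Number of pages needed to map the given number of bytes (rounded up). */
    static long pagesNeeded(long bytes, long pageSize) {
        return (bytes + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        long heap = 4 * GB; // hypothetical heap size
        System.out.println("4KB pages: " + pagesNeeded(heap, 4 * KB)); // 1048576
        System.out.println("2MB pages: " + pagesNeeded(heap, 2 * MB)); // 2048
    }
}
```

Fewer pages means fewer translations competing for the limited number of TLB entries.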
Too Long, Didn't Listen
1. Virtual Memory + Paging have an overhead
2. Translation Lookaside Buffers used to
reduce cost
3. Page size changes can be important
Principles & Examples (2)
Concurrent Code
Context Switching
● Multitasking systems 'context switch'
between processes
● Variants:
○ Thread
○ Process
○ Virtualized OS
Context Switching Costs
● Direct Cost
○ Save old process state
○ Schedule new process
○ Load new process state
● Indirect Cost
○ TLB Reload (Process only)
○ CPU Pipeline Flush
○ Cache Interference
○ Temporal Locality
Indirect Costs (cont.)
● "Quantifying The Cost of Context Switch" by
Li et al.
○ 5 years old - principle still applies
○ Direct = 3.8 microseconds
○ Indirect = ~0 - 1500 microseconds
○ indirect dominates direct when working set > L2
Cache size
● Cooperative hits also exist
● Minimise context switching if possible
Locking
● Locks require kernel arbitration
○ Not a context switch, but a mode switch
○ Lots of lock contention causes context switching
● Has a cache-pollution cost
● Can try lock-free algorithms
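As one example of a lock-free algorithm, the sketch below increments a counter with a compare-and-swap retry loop instead of taking a lock. The class name LockFreeCounter and the hammer helper are hypothetical; in practice AtomicLong.incrementAndGet() provides this operation directly.

```java
import java.util.concurrent.atomic.AtomicLong;

final class LockFreeCounter {
    private final AtomicLong value = new AtomicLong();

    /** Retries a compare-and-swap until no other thread has raced with us. */
    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    long get() {
        return value.get();
    }

    /** Increments from several threads and returns the final count. */
    static long hammer(int threads, int perThread) {
        LockFreeCounter counter = new LockFreeCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }
}
```

No thread ever blocks in the kernel here, although heavily contended compare-and-swap still moves the cache line between cores.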
GC Safepointing
● GC needs to pause program
○ Safe points
● Safe Points cause a page fault
● Also generates Context Switch between
Mutator and GC
○ Not guaranteed to be scheduled back on the same
core
Java Memory Model
● JSR 133 Specs the memory model
● Turtles Example:
○ Java Level: volatile
○ CPU Level: lock
○ Cache Level: MESI
● Has a cost!
False Sharing
● Data can share a cache line
● Not always good
○ Some data 'accidentally' next to other data.
○ Causes Cache lines to be evicted prematurely
● Serious Concurrency Issue
Concurrent Cache Line Access
© Intel
Current Solution: Field Padding
public volatile long value;
public long pad1, pad2, pad3, pad4,
            pad5, pad6;
8 bytes (long) * 7 fields + header = 64 bytes = Cacheline
NB: fields aligned to 8 bytes even if smaller and re-ordered by size
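A sketch of how the padded field might be used: each thread owns one padded counter, so writes from different threads are less likely to land on the same cache line. The class names are hypothetical, and the JVM is free to lay out fields differently, so padding is a heuristic rather than a guarantee.

```java
final class PaddedCounters {
    /** One counter per thread, padded as described above. */
    static final class PaddedLong {
        public volatile long value;
        public long pad1, pad2, pad3, pad4, pad5, pad6;
    }

    /** Each thread increments only its own counter; returns the sum of all counters. */
    static long run(int threads, long iterations) {
        PaddedLong[] counters = new PaddedLong[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            PaddedLong mine = new PaddedLong();
            counters[t] = mine;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < iterations; i++) {
                    mine.value++;   // single writer per counter, so no lost updates
                }
            });
        }
        for (Thread worker : workers) {
            worker.start();
        }
        long sum = 0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            sum += counters[t].value;   // visible because join() has completed
        }
        return sum;
    }
}
```

Timing this against a version without the pad fields is one way to check whether false sharing is actually present on a given machine.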
Real Solution: JEP 142
class UncontendedFoo {
    int someValue;
    @Contended volatile long foo;
    Bar bar;
}
http://openjdk.java.net/jeps/142
False Sharing in Your GC
● Card Tables
○ Split RAM into 'cards'
○ Table records which parts of RAM are written to
○ Avoid rescanning large blocks of RAM
● BUT ... optimise by writing to the byte on
every write
● Solution: -XX:+UseCondCardMark
○ Use With Care: 15-20% sequential slowdown
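The difference between marking a card on every write and marking it conditionally can be shown with a simplified model. This is only a sketch, not HotSpot's actual write barrier: the class name, the 512-byte card size and the DIRTY value are assumptions for illustration.

```java
final class CardTable {
    static final int CARD_SHIFT = 9;   // assumed 512-byte cards
    static final byte DIRTY = 1;       // assumed marker value

    final byte[] cards;
    long cardWrites;                   // how many stores actually hit the table

    CardTable(long heapBytes) {
        cards = new byte[(int) ((heapBytes + (1L << CARD_SHIFT) - 1) >>> CARD_SHIFT)];
    }

    /** Unconditional barrier: always stores, even if the card is already dirty. */
    void markAlways(long address) {
        cards[(int) (address >>> CARD_SHIFT)] = DIRTY;
        cardWrites++;
    }

    /** Conditional barrier: reads first and only stores when the card is clean. */
    void markIfClean(long address) {
        int index = (int) (address >>> CARD_SHIFT);
        if (cards[index] != DIRTY) {
            cards[index] = DIRTY;
            cardWrites++;
        }
    }
}
```

Repeated stores to the same card byte from different threads keep invalidating each other's cache lines; the conditional version avoids those stores at the cost of an extra load and branch on every reference write.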
Buyer Beware:
● Context Switching
● False Sharing
● Visibility/Atomicity/Locking Costs
Too Long, Didn't Listen
1. Caching Behaviour has a performance effect
2. The effects are measurable
3. There are common problems and solutions
Useful Links
● My blog
○ insightfullogic.com
● Lots about memory
○ akkadia.org/drepper/cpumemory.pdf
●
○ jclarity.com/friends/
● Martin Thompson's blog
○ mechanical-sympathy.blogspot.co.uk
● Concurrency Interest
○ g.oswego.edu/dl/concurrency-interest/
Questions?
@RichardWarburto
insightfullogic.com
Garbage Collection Impact
● Young Gen Copying Collector
○ Copying changes location
○ Sequential Allocation helps
● Concurrent Mark Sweep
○ lack of compaction until Concurrent Mode Failure (CMF) hurts
● Parallel Old GC
○ Compaction at Full GC helps
Slab (2)
public interface GameEvent extends Cursor {
    public int getId();
    public void setId(int value);
    public long getStrength();
    public void setStrength(long value);
}
Allocator<GameEvent> eventAllocator =
    Allocator.of(GameEvent.class);
GameEvent event = eventAllocator.allocate(100);
event.move(50);
event.setId(6);
assertEquals(6, event.getId());
A Brave New World
(hopefully, sort of)
Value Types
● When you assign or pass the object, you copy the value and not the reference
● Allow control of the memory layout
● Idea, initial prototypes in mlvm
The future will solve all problems!
Arrays 2.0
● Proposed JVM Feature
● Library Definition of Array Types
● Incorporate a notion of 'flatness'
● Semi-reified Generics
● Loads of other goodies (final, volatile,
resizing, etc.)
● Requires Value Types
Game Event Server
class PlayerAttack {
    private long playerId;
    private int weapon;
    private int energy;
    ...
    public int getWeapon() {
        return weapon;
    }
    ...
Flyweight Unsafe (1)
static final Unsafe unsafe = ... ;
static final long space = OBJ_COUNT * ATTACK_OBJ_SIZE;
// Base address of the off-heap block holding every PlayerAttack
static final long address = unsafe.allocateMemory(space);
static PlayerAttack get(int index) {
    long loc = address +
        (ATTACK_OBJ_SIZE * index);
    return new PlayerAttack(loc);
}
Flyweight Unsafe (2)
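class PlayerAttack {
    static final long WEAPON_OFFSET = 8;
    private long loc;
    // Constructor used by get(index) on the previous slide
    PlayerAttack(long loc) {
        this.loc = loc;
    }
    public int getWeapon() {
        long address = loc + WEAPON_OFFSET;
        return unsafe.getInt(address);
    }
    public void setWeapon(int weapon) {
        long address = loc + WEAPON_OFFSET;
        // Unsafe's store method is putInt; there is no setInt
        unsafe.putInt(address, weapon);
    }
}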
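The same flyweight idea can be tried without sun.misc.Unsafe. The sketch below swaps in a direct ByteBuffer as the backing store, which is a different technique from the slides' Unsafe version; the class name AttackStore and the 16-byte record layout (playerId, weapon, energy) are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class AttackStore {
    // Hypothetical record layout: playerId (8 bytes), weapon (4), energy (4).
    static final int PLAYER_ID_OFFSET = 0;
    static final int WEAPON_OFFSET = 8;
    static final int ENERGY_OFFSET = 12;
    static final int ATTACK_OBJ_SIZE = 16;

    private final ByteBuffer memory;
    private int base;                       // byte offset of the current record

    AttackStore(int objectCount) {
        memory = ByteBuffer.allocateDirect(objectCount * ATTACK_OBJ_SIZE)
                           .order(ByteOrder.nativeOrder());
    }

    /** Repositions this flyweight over the record at index; nothing is allocated. */
    AttackStore move(int index) {
        base = index * ATTACK_OBJ_SIZE;
        return this;
    }

    long getPlayerId()         { return memory.getLong(base + PLAYER_ID_OFFSET); }
    void setPlayerId(long id)  { memory.putLong(base + PLAYER_ID_OFFSET, id); }
    int getWeapon()            { return memory.getInt(base + WEAPON_OFFSET); }
    void setWeapon(int weapon) { memory.putInt(base + WEAPON_OFFSET, weapon); }
    int getEnergy()            { return memory.getInt(base + ENERGY_OFFSET); }
    void setEnergy(int energy) { memory.putInt(base + ENERGY_OFFSET, energy); }
}
```

Records sit back to back in one contiguous block, so iterating over them by index walks memory linearly, and ByteBuffer adds bounds checks that raw Unsafe access does not have.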
Caching in

  • 1. Caching In Understand, measure and use your CPU Cache more effectively Richard Warburton
  • 2. - Luke Gorrie on Mechanical Sympathy the hardware really wants to run fast and you only need to avoid getting in the way
  • 3. What are you talking about? ● Why do you care? ● Hardware Fundamentals ● Measurement ● Principles and Examples
  • 4. Why do you care? Perhaps why /should/ you care.
  • 5. The Problem Very Fast Relatively Slow
  • 6. The Solution: CPU Cache ● Core Demands Data, looks at its cache ○ If present (a "hit") then data returned to register ○ If absent (a "miss") then data looked up from memory and stored in the cache ● Fast memory is expensive, a small amount is affordable
  • 7. Multilevel Cache: Intel Sandybridge Shared Level 3 Cache Level 2 Cache Level 1 Data Cache Level 1 Instruction Cache Physical Core 0 HT: 2 Logical Cores .... Level 2 Cache Level 1 Data Cache Level 1 Instruction Cache Physical Core N HT: 2 Logical Cores
  • 8. CPU Stalls ● Want your CPU to execute instructions ● Pipeline Stall = CPU can't execute because it's waiting on memory ● Stalls displace potential computation ● Busy-work Strategies ○ Out-of-Order Execution ○ Simultaneous MultiThreading (SMT) /
  • 9. How bad is a miss? Location Latency in Clockcycles Register 0 L1 Cache 3 L2 Cache 9 L3 Cache 21 Main Memory 150-400 NB: Sandybridge Numbers
  • 10. Cache Lines ● Data transferred in cache lines ● Fixed size block of memory ● Usually 64 bytes in current x86 CPUs ○ Between 32 and 256 bytes ● Purely hardware consideration
  • 11. You don't need to know everything
  • 12. Architectural Takeaways ● Modern CPUs have large multilevel Caches ● Cache misses cause Stalls ● Stalls are expensive
  • 13. Measurement Barometer, Sun Dial ... and CPU Counters
  • 14. Do you care? ● DON'T look at CPU Caching behaviour first ● Many problems not Execution Bound ○ Networking ○ Database or External Service ○ I/O ○ Garbage Collection ○ Insufficient Parallelism ● Consider caching behaviour when you're execution bound and know your hotspot.
  • 15. CPU Performance Event Counters ● CPU Register which can count events ○ Eg: the number of Level 3 Cache Misses ● Model Specific Registers ○ not instruction set standardised by x86 ○ differ by CPU (eg Penryn vs Sandybridge) and manufacturer ● Don't worry - leave details to tooling
  • 16. Measurement: Instructions Retired ● The number of instructions which executed until completion ○ ignores branch mispredictions ● When stalled you're not retiring instructions ● Aim to maximise instruction retirement when reducing cache misses
  • 17. Cache Misses ● Cache Level ○ Level 3 ○ Level 2 ○ Level 1 (Data vs Instruction) ● What to measure ○ Hits ○ Misses ○ Reads vs Writes ○ Calculate Ratio
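The "Calculate Ratio" step can be sketched in a few lines of Java. This is only an illustration: the class name, method names and counter readings are hypothetical, and the latencies are the Sandybridge figures from the "How bad is a miss?" slide, with main memory taken at the low end of its 150-400 cycle range.

```java
final class CacheMissRatio {
    // Sandybridge latencies in clock cycles, taken from the earlier slide.
    static final int L1_CYCLES = 3;
    static final int L2_CYCLES = 9;
    static final int L3_CYCLES = 21;
    static final int MEMORY_CYCLES = 150; // assumption: low end of 150-400

    /** Fraction of accesses that missed, from raw hit and miss counts. */
    static double missRatio(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) misses / total;
    }

    /**
     * Rough average cost per access: each level's latency weighted by the
     * number of accesses that level ended up serving.
     */
    static double averageCycles(long l1Hits, long l2Hits, long l3Hits, long memoryAccesses) {
        long total = l1Hits + l2Hits + l3Hits + memoryAccesses;
        if (total == 0) {
            return 0.0;
        }
        double cycles = (double) l1Hits * L1_CYCLES
                + (double) l2Hits * L2_CYCLES
                + (double) l3Hits * L3_CYCLES
                + (double) memoryAccesses * MEMORY_CYCLES;
        return cycles / total;
    }

    public static void main(String[] args) {
        // Hypothetical counter readings, not measurements.
        System.out.println("L1 miss ratio: " + missRatio(900, 100));
        System.out.println("Average cycles per access: " + averageCycles(900, 50, 30, 20));
    }
}
```

Even with only 2% of accesses reaching main memory in this made-up reading, those accesses account for a large share of the total cycles, which is why the ratio is worth tracking per cache level.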
  • 18. Cache Profilers ● Open Source ○ perf ○ likwid ○ Linux (rdmsr/wrmsr) ○ Linux Kernel (via /proc) ● Proprietary ○ jClarity jMSR ○ Intel VTune ○ AMD CodeAnalyst ○ Visual Studio 2012 ○ Oracle Studio
  • 19. Good Benchmarking Practice ● Warmups ● Measure and rule-out other factors ● Specific Caching Ideas ...
  • 20. Take Care with Working Set Size
  • 21. Good Benchmark has Low Variance
  • 22. You can measure Cache behaviour
  • 24. Prefetching ● Prefetch = Eagerly load data ● Adjacent Cache Line Prefetch ● Data Cache Unit (Streaming) Prefetch ● Problem: CPU Prediction isn't perfect ● Solution: Arrange Data so accesses are predictable
  • 25. Sequential Locality Referring to data that is arranged linearly in memory Spatial Locality Referring to data that is close together in memory Temporal Locality Repeatedly referring to same data in a short time span
  • 26. General Principles ● Use smaller data types (-XX:+UseCompressedOops) ● Avoid 'big holes' in your data ● Make accesses as linear as possible
  • 27. Primitive Arrays // Sequential Access = Predictable for (int i=0; i<someArray.length; i++) someArray[i]++;
  • 28. Primitive Arrays - Skipping Elements // Holes Hurt for (int i=0; i<someArray.length; i += SKIP) someArray[i]++;
  • 29. Primitive Arrays - Skipping Elements
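A rough way to see the effect of skipping elements is to time the same loop with different values of SKIP. The sketch below is a minimal, non-rigorous harness: the class name, array size and single warm-up pass are assumptions, and a serious measurement would use a benchmarking framework and follow the good-practice slides above.

```java
final class StrideAccess {
    /** Increments every skip-th element and returns how many elements were touched. */
    static int touch(int[] someArray, int skip) {
        int touched = 0;
        for (int i = 0; i < someArray.length; i += skip) {
            someArray[i]++;
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        // Hypothetical size: 16M ints (64MB), chosen to be larger than the caches.
        int[] someArray = new int[1 << 24];
        for (int skip = 1; skip <= 64; skip *= 2) {
            touch(someArray, skip); // crude warm-up pass
            long start = System.nanoTime();
            int touched = touch(someArray, skip);
            long elapsed = System.nanoTime() - start;
            System.out.printf("skip=%d: %.2f ns per touched element%n",
                    skip, (double) elapsed / touched);
        }
    }
}
```

With 4-byte ints and 64-byte cache lines, a skip of 16 or more lands every access on a different cache line, so the cost per touched element would be expected to rise as the skip grows.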
  • 30. Multidimensional Arrays ● Multidimensional Arrays are really Arrays of Arrays in Java. (Unlike C) ● Some people realign their accesses: for (int col=0; col<COLS; col++) { for (int row=0; row<ROWS; row++) { array[ROWS * col + row]++; } }
  • 31. Bad Access Alignment Strides the wrong way, bad locality. array[COLS * row + col]++; Strides the right way, good locality. array[ROWS * col + row]++;
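The two index expressions can be put side by side in a small sketch. The loop order matches the slide (columns outside, rows inside); the class and method names are hypothetical. Both methods visit every element exactly once and return the same sum, so only the order of the memory accesses differs.

```java
final class FlatMatrixTraversal {
    /** Inner loop advances one element at a time: good locality. */
    static long sumRightWay(int[] array, int rows, int cols) {
        long sum = 0;
        for (int col = 0; col < cols; col++) {
            for (int row = 0; row < rows; row++) {
                sum += array[rows * col + row];
            }
        }
        return sum;
    }

    /** Inner loop jumps cols elements per step: bad locality. */
    static long sumWrongWay(int[] array, int rows, int cols) {
        long sum = 0;
        for (int col = 0; col < cols; col++) {
            for (int row = 0; row < rows; row++) {
                sum += array[cols * row + col];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int rows = 2048, cols = 2048; // hypothetical dimensions
        int[] array = new int[rows * cols];
        java.util.Arrays.fill(array, 1);
        System.out.println(sumRightWay(array, rows, cols));
        System.out.println(sumWrongWay(array, rows, cols));
    }
}
```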
  • 32. Data Layout Principles ● Primitive Collections (GNU Trove, GS-Coll) ● Arrays > Linked Lists ● Hashtable > Search Tree ● Avoid Code bloating (Loop Unrolling) ● Custom Data Structures ○ Judy Arrays ■ an associative array/map ○ kD-Trees ■ generalised Binary Space Partitioning ○ Z-Order Curve ■ multidimensional data in one dimension
  • 33. Data Locality vs Java Heap Layout class Foo { Integer count; Bar bar; Baz baz; } // No alignment guarantees for (Foo foo : foos) { foo.count = 5; foo.bar.visit(); } [Figure: heap addresses 0, 1, 2, 3, ... showing where Foo, bar, baz and count are placed]
  • 34. Data Locality vs Java Heap Layout ● Serious Java Weakness ● Location of objects in memory hard to guarantee. ● GC also interferes ○ Copying ○ Compaction
  • 35. Alternative Approach ● Use sun.misc.Unsafe to directly allocate memory ● Manage mapping yourself ● Use with care ○ Maybe slower ○ Maybe error prone ○ Definitely hard work
  • 36. SLAB ● Manually writing this is: ○ time consuming ○ error prone ○ boilerplate-ridden ● Prototype library: ○ http://github.com/RichardWarburton/slab ○ Maps an interface to a flyweight ● Still not recommended!
  • 37. Data Alignment ● An address a is n-byte aligned when a is divisible by n ● Cache Line Aligned: ○ byte aligned to the size of a cache line
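The definition translates directly into arithmetic on addresses. The helper below is a sketch with hypothetical names; it assumes a 64-byte cache line (the usual x86 size mentioned earlier), non-negative addresses, and a power-of-two alignment for alignUp.

```java
final class Alignment {
    static final int CACHE_LINE_BYTES = 64; // assumption: typical x86 cache line

    /** An address a is n-byte aligned when a is divisible by n. */
    static boolean isAligned(long address, int n) {
        return address % n == 0;
    }

    /** Rounds address up to the next multiple of n; n must be a power of two. */
    static long alignUp(long address, int n) {
        return (address + n - 1) & ~((long) n - 1);
    }

    /** True if the bytes [address, address + length) cross a cache line boundary. */
    static boolean straddlesCacheLine(long address, int length) {
        return address / CACHE_LINE_BYTES != (address + length - 1) / CACHE_LINE_BYTES;
    }

    public static void main(String[] args) {
        long address = 130; // hypothetical address
        System.out.println(isAligned(address, CACHE_LINE_BYTES));
        System.out.println(alignUp(address, CACHE_LINE_BYTES));
        System.out.println(straddlesCacheLine(60, 8));
    }
}
```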
  • 38. Rules and Hacks ● Java (Hotspot) ♥ 8-byte alignment: ○ new Java Objects ○ Unsafe.allocateMemory ○ ByteBuffer.allocateDirect ● Get the object address: ○ Unsafe gives it to you directly ○ Direct ByteBuffer has a field called 'address'
  • 39. Cacheline Aligned Access ● Performance Benefits: ○ 3-6x on Core2Duo ○ ~1.5x on Nehalem Class ○ Measured using direct ByteBuffers ○ Mid Aligned vs Straddling a Cacheline ● http://psy-lob-saw.blogspot.co.uk/2013/01/direct-memory-alignment-in-java.html
  • 40. Today I learned ... Access and Cache Aligned, Contiguous Data
  • 42. Virtual Memory ● RAM is finite ● RAM is fragmented ● Multitasking is nice ● Programming easier with a contiguous address space
  • 43. Page Tables ● Virtual Address Space split into pages ● Page has two Addresses: ○ Virtual Address: Process' view ○ Physical Address: Location in RAM/Disk ● Page Table ○ Mapping Between these addresses ○ OS Managed ○ Page Fault when entry is missing ○ OS Pages in from disk
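To make the mapping concrete, here is a toy model of a single-level page table. It is a sketch, not how an OS or MMU implements it: the class is hypothetical, pages are assumed to be 4KB, the table is a plain HashMap, and a missing entry is reported as -1 where real hardware would raise a page fault.

```java
import java.util.HashMap;
import java.util.Map;

final class ToyPageTable {
    static final int PAGE_SHIFT = 12;               // assumption: 4KB pages
    static final long PAGE_SIZE = 1L << PAGE_SHIFT;

    // Maps virtual page number to physical page number.
    private final Map<Long, Long> virtualToPhysicalPage = new HashMap<>();

    /** The upper bits of a virtual address select the page. */
    static long pageNumber(long virtualAddress) {
        return virtualAddress >>> PAGE_SHIFT;
    }

    /** The lower bits are the offset within the page and are not translated. */
    static long pageOffset(long virtualAddress) {
        return virtualAddress & (PAGE_SIZE - 1);
    }

    void map(long virtualPage, long physicalPage) {
        virtualToPhysicalPage.put(virtualPage, physicalPage);
    }

    /** Returns the physical address, or -1 to stand in for a page fault. */
    long translate(long virtualAddress) {
        Long physicalPage = virtualToPhysicalPage.get(pageNumber(virtualAddress));
        if (physicalPage == null) {
            return -1;
        }
        return (physicalPage << PAGE_SHIFT) | pageOffset(virtualAddress);
    }

    public static void main(String[] args) {
        ToyPageTable table = new ToyPageTable();
        table.map(0x12, 0x7);
        System.out.println(Long.toHexString(table.translate(0x12345)));
        System.out.println(table.translate(0x99000));
    }
}
```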
  • 44. Translation Lookaside Buffers ● TLB = Page Table Cache ● Avoids memory bottlenecks in page lookups ● CPU Cache is multi-level => TLB is multi- level ● Miss/Hit rates measurable via CPU Counters ○ Hit/Miss Rates ○ Walk Duration
  • 45. TLB Flushing ● Process context switch => change address space ● Historically caused "flush" = ● Avoided with an Address Space Identifier (ASID)
  • 46. Page Size Tradeoff: Bigger Pages = fewer pages = quicker page lookups vs Bigger Pages = wasted memory space = more disk paging
  • 47. Page Sizes ● Page Size is adjustable ● Common Sizes: 4KB, 2MB, 1GB ● Potential Speedup: 10%-30% ● -XX:+UseLargePages ● Oracle DBs particularly
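The tradeoff between page sizes can be shown with simple arithmetic. The sketch below uses hypothetical names and example sizes; it counts how many pages are needed to cover a region and how much of the last page goes unused.

```java
final class PageCount {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;
    static final long GB = 1024L * MB;

    /** Number of pages needed to cover the given number of bytes (rounded up). */
    static long pagesNeeded(long bytes, long pageSize) {
        return (bytes + pageSize - 1) / pageSize;
    }

    /** Bytes allocated but unused because the region does not fill its last page. */
    static long wastedBytes(long bytes, long pageSize) {
        return pagesNeeded(bytes, pageSize) * pageSize - bytes;
    }

    public static void main(String[] args) {
        long workingSet = 1 * GB; // hypothetical working set
        for (long pageSize : new long[] {4 * KB, 2 * MB, 1 * GB}) {
            System.out.printf("page size %d bytes: %d pages%n",
                    pageSize, pagesNeeded(workingSet, pageSize));
        }
        System.out.println("waste for a 5KB region on 2MB pages: "
                + wastedBytes(5 * KB, 2 * MB) + " bytes");
    }
}
```

A 1GB working set needs 262,144 mappings with 4KB pages but only 512 with 2MB pages, so far fewer translations compete for TLB entries; the price is that a small region can waste almost a whole large page.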
  • 48. Too Long, Didn't Listen 1. Virtual Memory + Paging have an overhead 2. Translation Lookaside Buffers used to reduce cost 3. Page size changes important
  • 49. Principles & Examples (2) Concurrent Code
  • 50. Context Switching ● Multitasking systems 'context switch' between processes ● Variants: ○ Thread ○ Process ○ Virtualized OS
  • 51. Context Switching Costs ● Direct Cost ○ Save old process state ○ Schedule new process ○ Load new process state ● Indirect Cost ○ TLB Reload (Process only) ○ CPU Pipeline Flush ○ Cache Interference ○ Temporal Locality
  • 52. Indirect Costs (cont.) ● "Quantifying The Cost of Context Switch" by Li et al. ○ 5 years old - principle still applies ○ Direct = 3.8 microseconds ○ Indirect = ~0 - 1500 microseconds ○ indirect dominates direct when working set > L2 Cache size ● Cooperative hits also exist ● Minimise context switching if possible
  • 53. Locking ● Locks require kernel arbitration ○ Not a context switch, but a mode switch ○ Lots of lock contention causes context switching ● Has a cache-pollution cost ● Can try lock-free algorithms
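As one illustration of the lock-free option, the sketch below keeps a running maximum with a compare-and-set retry loop on java.util.concurrent.atomic.AtomicLong instead of a lock. The class name and the use case are hypothetical; the point is only that a thread which loses a race retries rather than blocking.

```java
import java.util.concurrent.atomic.AtomicLong;

final class LockFreeMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    /** Records candidate if it is larger than the current maximum, without locking. */
    void offer(long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return; // nothing to update
            }
            // compareAndSet fails if another thread changed max in between; retry.
        } while (!max.compareAndSet(current, candidate));
    }

    long get() {
        return max.get();
    }

    public static void main(String[] args) throws InterruptedException {
        LockFreeMax tracker = new LockFreeMax();
        Thread evens = new Thread(() -> {
            for (long i = 0; i < 100_000; i += 2) tracker.offer(i);
        });
        Thread odds = new Thread(() -> {
            for (long i = 1; i < 100_000; i += 2) tracker.offer(i);
        });
        evens.start();
        odds.start();
        evens.join();
        odds.join();
        System.out.println(tracker.get());
    }
}
```

Lock-free code still pays the cache-coherency cost of the contended cache line, so it avoids kernel arbitration rather than making sharing free.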
  • 54. GC Safepointing ● GC needs to pause program ○ Safe points ● Safe Points cause a page fault ● Also generates Context Switch between Mutator and GC ○ Not guaranteed to be scheduled back on the same core
  • 55. Java Memory Model ● JSR 133 Specs the memory model ● Turtles Example: ○ Java Level: volatile ○ CPU Level: lock ○ Cache Level: MESI ● Has a cost!
  • 56. False Sharing ● Data can share a cache line ● Not always good ○ Some data 'accidentally' next to other data. ○ Causes Cache lines to be evicted prematurely ● Serious Concurrency Issue
  • 57. Concurrent Cache Line Access © Intel
  • 58. Current Solution: Field Padding public volatile long value; public long pad1, pad2, pad3, pad4, pad5, pad6; 8 bytes(long) * 7 fields + header = 64 bytes = Cacheline NB: fields aligned to 8 bytes even if smaller and re-ordered by size
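A minimal sketch of the padding trick follows, using the field layout from the slide. Everything else (class names, the two-thread driver) is hypothetical. Note the caveats: the JVM is free to lay fields out differently, nothing here guarantees that the two counters end up on different cache lines, and the sketch only shows the pattern rather than measuring a speedup.

```java
final class PaddedCounters {
    /** Counter padded as on the slide: 8-byte value plus six 8-byte pad fields. */
    static final class PaddedLong {
        public volatile long value;
        public long pad1, pad2, pad3, pad4, pad5, pad6;
    }

    /**
     * Two threads each increment their own counter. Each counter has a single
     * writer, so the non-atomic value++ on a volatile field is safe here.
     */
    static long[] incrementInParallel(int iterations) {
        PaddedLong first = new PaddedLong();
        PaddedLong second = new PaddedLong();
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) first.value++;
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) second.value++;
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {first.value, second.value};
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long[] totals = incrementInParallel(10_000_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(totals[0] + " " + totals[1] + " in " + elapsedMs + " ms");
    }
}
```

Removing the pad fields and re-running gives a crude comparison, though the result depends on where the JVM happens to place the two objects.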
  • 59. Real Solution: JEP 142 class UncontendedFoo { int someValue; @Contended volatile long foo; Bar bar; } http://openjdk.java.net/jeps/142
  • 60. False Sharing in Your GC ● Card Tables ○ Split RAM into 'cards' ○ Table records which parts of RAM are written to ○ Avoid rescanning large blocks of RAM ● BUT ... optimise by writing to the byte on every write ● Solution: -XX:+UseCondCardMark ○ Use With Care: 15-20% sequential slowdown
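The difference between unconditional and conditional card marking can be modelled in plain Java. This is a toy model of the idea only, not HotSpot's write barrier: the class, the 512-byte card size and the DIRTY encoding are assumptions made for illustration, and a counter stands in for the stores that would hit the shared card table.

```java
final class CardTableSketch {
    static final int CARD_SHIFT = 9;  // assumption: 512-byte cards
    static final byte DIRTY = 1;      // hypothetical encoding

    private final byte[] cards;
    private long cardStores;          // how many times the card table was written

    CardTableSketch(int heapBytes) {
        cards = new byte[(heapBytes + (1 << CARD_SHIFT) - 1) >>> CARD_SHIFT];
    }

    /** Default behaviour on the slide: write the card byte on every heap write. */
    void markUnconditionally(int heapOffset) {
        cards[heapOffset >>> CARD_SHIFT] = DIRTY;
        cardStores++;
    }

    /** Conditional marking: read first, store only if the card is not already dirty. */
    void markConditionally(int heapOffset) {
        int card = heapOffset >>> CARD_SHIFT;
        if (cards[card] != DIRTY) {
            cards[card] = DIRTY;
            cardStores++;
        }
    }

    boolean isDirty(int heapOffset) {
        return cards[heapOffset >>> CARD_SHIFT] == DIRTY;
    }

    long cardStores() {
        return cardStores;
    }

    public static void main(String[] args) {
        CardTableSketch table = new CardTableSketch(4096);
        for (int offset = 0; offset < 512; offset += 8) {
            table.markConditionally(offset);
        }
        System.out.println("stores to card table: " + table.cardStores());
    }
}
```

The conditional version trades an extra read and branch on every write for fewer stores to a shared cache line, which matches the slide's warning that it can slow sequential code.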
  • 61. Buyer Beware: ● Context Switching ● False Sharing ● Visibility/Atomicity/Locking Costs
  • 62. Too Long, Didn't Listen 1. Caching Behaviour has a performance effect 2. The effects are measurable 3. There are common problems and solutions
  • 63. Useful Links ● My blog ○ insightfullogic.com ● Lots about memory ○ akkadia.org/drepper/cpumemory.pdf ● ○ jclarity.com/friends/ ● Martin Thompson's blog ○ mechanical-sympathy.blogspot.co.uk ● Concurrency Interest ○ g.oswego.edu/dl/concurrency-interest/
  • 65. Garbage Collection Impact ● Young Gen Copying Collector ○ Copying changes location ○ Sequential Allocation helps ● Concurrent Mark Sweep ○ lack of compaction until a concurrent mode failure (CMF) hurts ● Parallel Old GC ○ Compaction at Full GC helps
  • 66. Slab (2) public interface GameEvent extends Cursor { public int getId(); public void setId(int value); public long getStrength(); public void setStrength(long value); } Allocator<GameEvent> eventAllocator = Allocator.of(GameEvent.class); GameEvent event = eventAllocator.allocate(100); event.move(50); event.setId(6); assertEquals(6, event.getId());
  • 67. A Brave New World (hopefully, sort of)
  • 68. Value Types ● When you assign or pass the object, you copy the value and not the reference ● Allow control of the memory layout ● Idea, initial prototypes in mlvm
  • 69. The future will solve all problems!
  • 70. Arrays 2.0 ● Proposed JVM Feature ● Library Definition of Array Types ● Incorporate a notion of 'flatness' ● Semi-reified Generics ● Loads of other goodies (final, volatile, resizing, etc.) ● Requires Value Types
  • 71. Game Event Server class PlayerAttack { private long playerId; private int weapon; private int energy; ... public int getWeapon() { return weapon; } ...
  • 72. Flyweight Unsafe (1) static final Unsafe unsafe = ... ; long space = OBJ_COUNT * ATTACK_OBJ_SIZE; address = unsafe.allocateMemory(space); static PlayerAttack get(int index) { long loc = address + (ATTACK_OBJ_SIZE * index); return new PlayerAttack(loc); }
  • 73. Flyweight Unsafe (2) class PlayerAttack { static final long WEAPON_OFFSET = 8; private long loc; public int getWeapon() { long address = loc + WEAPON_OFFSET; return unsafe.getInt(address); } public void setWeapon(int weapon) { long address = loc + WEAPON_OFFSET; unsafe.putInt(address, weapon); } }
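The same flyweight idea can be tried without sun.misc.Unsafe by backing it with a direct ByteBuffer. This is a swapped-in technique, not the code from the slides: the class name, the field offsets other than WEAPON_OFFSET, and the 16-byte record size are assumptions. One cursor object is moved across contiguous records instead of allocating an object per attack.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class PlayerAttackFlyweight {
    // Hypothetical record layout: playerId (8 bytes), weapon (4), energy (4).
    static final int PLAYER_ID_OFFSET = 0;
    static final int WEAPON_OFFSET = 8;
    static final int ENERGY_OFFSET = 12;
    static final int ATTACK_OBJ_SIZE = 16;

    private final ByteBuffer memory; // off-heap, contiguous storage for all records
    private int loc;                 // byte position of the record the cursor points at

    PlayerAttackFlyweight(int objectCount) {
        memory = ByteBuffer.allocateDirect(objectCount * ATTACK_OBJ_SIZE)
                .order(ByteOrder.nativeOrder());
    }

    /** Points this cursor at the record with the given index. */
    PlayerAttackFlyweight move(int index) {
        loc = index * ATTACK_OBJ_SIZE;
        return this;
    }

    long getPlayerId() { return memory.getLong(loc + PLAYER_ID_OFFSET); }
    void setPlayerId(long playerId) { memory.putLong(loc + PLAYER_ID_OFFSET, playerId); }

    int getWeapon() { return memory.getInt(loc + WEAPON_OFFSET); }
    void setWeapon(int weapon) { memory.putInt(loc + WEAPON_OFFSET, weapon); }

    int getEnergy() { return memory.getInt(loc + ENERGY_OFFSET); }
    void setEnergy(int energy) { memory.putInt(loc + ENERGY_OFFSET, energy); }

    public static void main(String[] args) {
        PlayerAttackFlyweight attack = new PlayerAttackFlyweight(100);
        attack.move(50).setWeapon(6);
        System.out.println(attack.move(50).getWeapon());
    }
}
```

ByteBuffer accessors are bounds-checked, so this version gives up some raw speed compared with Unsafe in exchange for failing with an exception rather than corrupting memory on a bad index.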