SlideShare a Scribd company logo
1 of 73
Download to read offline
Caching In
Understand, measure and use your CPU
Cache more effectively
Richard Warburton
- Luke Gorrie on Mechanical Sympathy
the hardware really wants to run fast
and you only need to avoid getting in
the way
What are you talking about?
● Why do you care?
● Hardware Fundamentals
● Measurement
● Principles and Examples
Why do you care?
Perhaps why /should/ you care.
The Problem Very Fast
Relatively Slow
The Solution: CPU Cache
● Core Demands Data, looks at its cache
○ If present (a "hit") then data returned to register
○ If absent (a "miss") then data looked up from
memory and stored in the cache
● Fast memory is expensive, a small amount
is affordable
Multilevel Cache: Intel Sandybridge
Shared Level 3 Cache
Level 2 Cache
Level 1
Data
Cache
Level 1
Instruction
Cache
Physical Core 0
HT: 2 Logical Cores
....
Level 2 Cache
Level 1
Data
Cache
Level 1
Instruction
Cache
Physical Core N
HT: 2 Logical Cores
CPU Stalls
● Want your CPU to execute instructions
● Pipeline Stall = CPU can't execute because
its waiting on memory
● Stalls displace potential computation
● Busy-work Strategies
○ Out-of-Order Execution
○ Simultaneous MultiThreading (SMT) /
How bad is a miss?
Location Latency in Clockcycles
Register 0
L1 Cache 3
L2 Cache 9
L3 Cache 21
Main Memory 150-400
NB: Sandybridge Numbers
Cache Lines
● Data transferred in cache lines
● Fixed size block of memory
● Usually 64 bytes in current x86 CPUs
○ Between 32 and 256 bytes
● Purely hardware consideration
You don't need to
know everything
Architectural Takeaways
● Modern CPUs have large multilevel Caches
● Cache misses cause Stalls
● Stalls are expensive
Measurement
Barometer, Sun Dial ... and CPU Counters
Do you care?
● DON'T look at CPU Caching behaviour first
● Many problems not Execution Bound
○ Networking
○ Database or External Service
○ I/O
○ Garbage Collection
○ Insufficient Parallelism
● Consider caching behaviour when you're
execution bound and know your hotspot.
CPU Performance Event Counters
● CPU Register which can count events
○ Eg: the number of Level 3 Cache Misses
● Model Specific Registers
○ not instruction set standardised by x86
○ differ by CPU (eg Penryn vs Sandybridge) and
manufacturer
● Don't worry - leave details to tooling
Measurement: Instructions Retired
● The number of instructions which executed
until completion
○ ignores branch mispredictions
● When stalled you're not retiring instructions
● Aim to maximise instruction retirement when
reducing cache misses
Cache Misses
● Cache Level
○ Level 3
○ Level 2
○ Level 1 (Data vs Instruction)
● What to measure
○ Hits
○ Misses
○ Reads vs Writes
○ Calculate Ratio
Cache Profilers
● Open Source
○ perf
○ likwid
○ linux (rdmsr/wrmsr)
○ Linux Kernel (via /proc)
● Proprietary
○ jClarity jMSR
○ Intel VTune
○ AMD Code Analyst
○ Visual Studio 2012
○ Oracle Studio
Good Benchmarking Practice
● Warmups
● Measure and rule-out other factors
● Specific Caching Ideas ...
Take Care with Working Set Size
Good Benchmark has Low Variance
You can measure Cache behaviour
Principles & Examples
Sequential Code
● Prefetch = Eagerly load data
● Adjacent Cache Line Prefetch
● Data Cache Unit (Streaming) Prefetch
● Problem: CPU Prediction isn't perfect
● Solution: Arrange Data so accesses
are predictable
Prefetching
Sequential Locality
Referring to data that is arranged linearly in memory
Spatial Locality
Referring to data that is close together in memory
Temporal Locality
Repeatedly referring to same data in a short time span
General Principles
● Use smaller data types (-XX:+UseCompressedOops)
● Avoid 'big holes' in your data
● Make accesses as linear as possible
Primitive Arrays
// Sequential Access = Predictable
for (int i=0; i<someArray.length; i++)
someArray[i]++;
Primitive Arrays - Skipping Elements
// Holes Hurt
for (int i=0; i<someArray.length; i += SKIP)
someArray[i]++;
Primitive Arrays - Skipping Elements
Multidimensional Arrays
● Multidimensional Arrays are really Arrays of
Arrays in Java. (Unlike C)
● Some people realign their accesses:
for (int col=0; col<COLS; col++) {
for (int row=0; row<ROWS; row++) {
array[ROWS * col + row]++;
}
}
Bad Access Alignment
Strides the wrong way, bad
locality.
array[COLS * row + col]++;
Strides the right way, good
locality.
array[ROWS * col + row]++;
Data Layout Principles
● Primitive Collections (GNU Trove, GS-Coll)
● Arrays > Linked Lists
● Hashtable > Search Tree
● Avoid Code bloating (Loop Unrolling)
● Custom Data Structures
○ Judy Arrays
■ an associative array/map
○ kD-Trees
■ generalised Binary Space Partitioning
○ Z-Order Curve
■ multidimensional data in one dimension
Data Locality vs Java Heap Layout
0
1
2
3
...
class Foo {
Integer count;
Bar bar;
Baz baz;
}
// No alignment guarantees
for (Foo foo : foos) {
foo.count = 5;
foo.bar.visit();
}
Foo
bar
baz
count
Data Locality vs Java Heap Layout
● Serious Java Weakness
● Location of objects in memory hard to
guarantee.
● GC also interferes
○ Copying
○ Compaction
Alternative Approach
● Use sun.misc.Unsafe to directly allocate
memory
● Manage mapping yourself
● Use with care
○ Maybe slower
○ Maybe error prone
○ Definitely hard work
SLAB
● Manually writing this is:
○ time consuming
○ error prone
○ boilerplate-ridden
● Prototype library:
○ http://github.com/RichardWarburton/slab
○ Maps an interface to a flyweight
● Still not recommended!
Data Alignment
● An address a is n-byte aligned when a is
divisible by n
● Cache Line Aligned:
○ byte aligned to the size of a cache line
Rules and Hacks
● Java (Hotspot) ♥ 8-byte alignment:
○ new Java Objects
○ Unsafe.allocateMemory
○ Bytebuffer.allocateDirect
● Get the object address:
○ Unsafe gives it to you directly
○ Direct ByteBuffer has a field called 'address'
Cacheline Aligned Access
● Performance Benefits:
○ 3-6x on Core2Duo
○ ~1.5x on Nehalem Class
○ Measured using direct ByteBuffers
○ Mid Aligned vs Straddling a Cacheline
● http://psy-lob-saw.blogspot.co.uk/2013/01/direct-memory-alignment-in-java.html
Today I learned ...
Access and Cache Aligned, Contiguous Data
Translation Lookaside
Buffers
TLB, not TLAB!
Virtual Memory
● RAM is finite
● RAM is fragmented
● Multitasking is nice
● Programming easier with a contiguous
address space
Page Tables
● Virtual Address Space split into pages
● Page has two Addresses:
○ Virtual Address: Process' view
○ Physical Address: Location in RAM/Disk
● Page Table
○ Mapping Between these addresses
○ OS Managed
○ Page Fault when entry is missing
○ OS Pages in from disk
Translation Lookaside Buffers
● TLB = Page Table Cache
● Avoids memory bottlenecks in page lookups
● CPU Cache is multi-level => TLB is multi-
level
● Miss/Hit rates measurable via CPU Counters
○ Hit/Miss Rates
○ Walk Duration
TLB Flushing
● Process context switch => change address
space
● Historically caused "flush" =
● Avoided with an Address Space Identifier
(ASID)
Page Size Tradeoff:
Bigger Pages
= less pages
= quicker page lookups
vs
Bigger Pages
= wasted memory space
= more disk paging
Page Sizes
● Page Size is adjustable
● Common Sizes: 4KB, 2MB, 1GB
● Potential Speedup: 10%-30%
● -XX:+UseLargePages
● Oracle DBs particularly
Too Long, Didn't Listen
1. Virtual Memory + Paging have an overhead
2. Translation Lookaside Buffers used to
reduce cost
3. Page size changes important
Principles & Examples (2)
Concurrent Code
Context Switching
● Multitasking systems 'context switch'
between processes
● Variants:
○ Thread
○ Process
○ Virtualized OS
Context Switching Costs
● Direct Cost
○ Save old process state
○ Schedule new process
○ Load new process state
● Indirect Cost
○ TLB Reload (Process only)
○ CPU Pipeline Flush
○ Cache Interference
○ Temporal Locality
Indirect Costs (cont.)
● "Quantifying The Cost of Context Switch" by
Li et al.
○ 5 years old - principle still applies
○ Direct = 3.8 microseconds
○ Indirect = ~0 - 1500 microseconds
○ indirect dominates direct when working set > L2
Cache size
● Cooperative hits also exist
● Minimise context switching if possible
Locking
● Locks require kernel arbitration
○ Not a context switch, but a mode switch
○ Lots of lock contention causes context switching
● Has a cache-pollution cost
● Can use try lock-free algorithms
GC Safepointing
● GC needs to pause program
○ Safe points
● Safe Points cause a page fault
● Also generates Context Switch between
Mutator and GC
○ Not guaranteed to be scheduled back on the same
core
Java Memory Model
● JSR 133 Specs the memory model
● Turtles Example:
○ Java Level: volatile
○ CPU Level: lock
○ Cache Level: MESI
● Has a cost!
False Sharing
● Data can share a cache line
● Not always good
○ Some data 'accidentally' next to other data.
○ Causes Cache lines to be evicted prematurely
● Serious Concurrency Issue
Concurrent Cache Line Access
© Intel
Current Solution: Field Padding
public volatile long value;
public long pad1, pad2, pad3, pad4,
pad5, pad6;
8 bytes(long) * 7 fields + header = 64 bytes =
Cacheline
NB: fields aligned to 8 bytes even if smaller and
re-ordered by size
Real Solution: JEP 142
class UncontendedFoo {
int someValue;
@Contended volatile long foo;
Bar bar;
}
http://openjdk.java.net/jeps/142
False Sharing in Your GC
● Card Tables
○ Split RAM into 'cards'
○ Table records which parts of RAM are written to
○ Avoid rescanning large blocks of RAM
● BUT ... optimise by writing to the byte on
every write
● Solution: -XX:+UseCondCardMark
○ Use With Care: 15-20% sequential slowdown
Buyer Beware:
● Context Switching
● False Sharing
● Visibility/Atomicity/Locking Costs
Too Long, Didn't Listen
1. Caching Behaviour has a performance effect
2. The effects are measurable
3. There are common problems and solutions
Useful Links
● My blog
○ insightfullogic.com
● Lots about memory
○ akkadia.org/drepper/cpumemory.pdf
●
○ jclarity.com/friends/
● Martin Thompson's blog
○ mechanical-sympathy.blogspot.co.uk
● Concurrency Interest
○ g.oswego.edu/dl/concurrency-interest/
Questions?
@RichardWarburto
insightfullogic.com
Garbage Collection Impact
● Young Gen Copying Collector
○ Copying changes location
○ Sequential Allocation helps
● Concurrent Mark Sweep
○ lack of compaction until CMF hurts
● Parallel Old GC
○ Compaction at Full GC helps
Slab (2)
public interface GameEvent extends Cursor {
public int getId();
public void setId(int value);
public long getStrength();
public void setStrength(long value);
}
Allocator<GameEvent> eventAllocator =
Allocator.of(GameEvent.class);
GameEvent event = eventAllocator.allocate(100);
event.move(50);
event.setId(6);
assertEquals(6, event.getId());
A Brave New World
(hopefully, sort of)
Value Types
● Assign or pass the object you copy the value
and not the reference
● Allow control of the memory layout
● Control of layout
● Idea, initial prototypes in mlvm
The future will solve all problems!
Arrays 2.0
● Proposed JVM Feature
● Library Definition of Array Types
● Incorporate a notion of 'flatness'
● Semi-reified Generics
● Loads of other goodies (final, volatile,
resizing, etc.)
● Requires Value Types
Game Event Server
class PlayerAttack {
private long playerId;
private int weapon;
private int energy;
...
private int getWeapon() {
return weapon;
}
...
Flyweight Unsafe (1)
static final Unsafe unsafe = ... ;
long space = OBJ_COUNT * ATTACK_OBJ_SIZE;
address = unsafe.allocateMemory(space);
static PlayerAttack get(int index) {
long loc = address +
(ATTACK_OBJ_SIZE * index)
return new PlayerAttack(loc);
}
Flyweight Unsafe (2)
class PlayerAttack {
static final long WEAPON_OFFSET = 8
private long loc;
public int getWeapon() {
long address = loc + WEAPON_OFFSET
return unsafe.getInt(address);
}
public void setWeapon(int weapon) {
long address = loc + WEAPON_OFFSET
unsafe.setInt(address, weapon);
}

More Related Content

What's hot

Apache tajo configuration
Apache tajo configurationApache tajo configuration
Apache tajo configurationJihoon Son
 
Query optimization in Apache Tajo
Query optimization in Apache TajoQuery optimization in Apache Tajo
Query optimization in Apache TajoJihoon Son
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017HBaseCon
 
Ndb cluster 80_dbt2_5_tb
Ndb cluster 80_dbt2_5_tbNdb cluster 80_dbt2_5_tb
Ndb cluster 80_dbt2_5_tbmikaelronstrom
 
Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Alexey Rybak
 
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesSimon Lia-Jonassen
 
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)Alexey Rybak
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelJavaDayUA
 
Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Gruter
 
in-memory database system and low latency
in-memory database system and low latencyin-memory database system and low latency
in-memory database system and low latencyhyeongchae lee
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей РыбакOntico
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodbDeep Kapadia
 
Elasticsearch avoiding hotspots
Elasticsearch  avoiding hotspotsElasticsearch  avoiding hotspots
Elasticsearch avoiding hotspotsChristophe Marchal
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentationMichael Keane
 
Evolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops Team
Evolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops TeamEvolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops Team
Evolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops TeamMydbops
 

What's hot (20)

Apache tajo configuration
Apache tajo configurationApache tajo configuration
Apache tajo configuration
 
Query optimization in Apache Tajo
Query optimization in Apache TajoQuery optimization in Apache Tajo
Query optimization in Apache Tajo
 
OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017OpenTSDB: HBaseCon2017
OpenTSDB: HBaseCon2017
 
Ndb cluster 80_dbt2_5_tb
Ndb cluster 80_dbt2_5_tbNdb cluster 80_dbt2_5_tb
Ndb cluster 80_dbt2_5_tb
 
Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)Large-scale projects development (scaling LAMP)
Large-scale projects development (scaling LAMP)
 
Efficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search EnginesEfficient Query Processing in Web Search Engines
Efficient Query Processing in Web Search Engines
 
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
Sharding: patterns and antipatterns (Osipov, Rybak, HighLoad'2014)
 
MapDB - taking Java collections to the next level
MapDB - taking Java collections to the next levelMapDB - taking Java collections to the next level
MapDB - taking Java collections to the next level
 
Vaex pygrunn
Vaex pygrunnVaex pygrunn
Vaex pygrunn
 
Hadoop-2.6.0 Slides
Hadoop-2.6.0 SlidesHadoop-2.6.0 Slides
Hadoop-2.6.0 Slides
 
Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105Tajo case study bay area hug 20131105
Tajo case study bay area hug 20131105
 
in-memory database system and low latency
in-memory database system and low latencyin-memory database system and low latency
in-memory database system and low latency
 
In-memory Databases
In-memory DatabasesIn-memory Databases
In-memory Databases
 
Map db
Map dbMap db
Map db
 
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding -  patterns & antipatterns, Константин Осипов, Алексей РыбакSharding -  patterns & antipatterns, Константин Осипов, Алексей Рыбак
Sharding - patterns & antipatterns, Константин Осипов, Алексей Рыбак
 
Mongo nyc nyt + mongodb
Mongo nyc nyt + mongodbMongo nyc nyt + mongodb
Mongo nyc nyt + mongodb
 
Elasticsearch avoiding hotspots
Elasticsearch  avoiding hotspotsElasticsearch  avoiding hotspots
Elasticsearch avoiding hotspots
 
In memory databases presentation
In memory databases presentationIn memory databases presentation
In memory databases presentation
 
Evolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops Team
Evolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops TeamEvolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops Team
Evolution of MonogDB Sharding and Its Best Practices - Ranjith A - Mydbops Team
 
Geo data analytics
Geo data analyticsGeo data analytics
Geo data analytics
 

Viewers also liked

Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!Michael Barker
 
Input Output Operations
Input Output OperationsInput Output Operations
Input Output Operationskdisthere
 
#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015
#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015
#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015Matt Raible
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using HazelcastTaras Matyashovsky
 
Electronic governance steps in the right direction?
Electronic governance   steps in the right direction?Electronic governance   steps in the right direction?
Electronic governance steps in the right direction?Bozhidar Bozhanov
 
Selecting the right cache framework
Selecting the right cache frameworkSelecting the right cache framework
Selecting the right cache frameworkMohammed Fazuluddin
 
Cache memory
Cache memoryCache memory
Cache memoryAnuj Modi
 
An introduction to fundamental architecture concepts
An introduction to fundamental architecture conceptsAn introduction to fundamental architecture concepts
An introduction to fundamental architecture conceptswweinmeyer79
 
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...Matt Raible
 

Viewers also liked (18)

Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!
 
Computer Fundamentals
Computer FundamentalsComputer Fundamentals
Computer Fundamentals
 
Cache memory
Cache memoryCache memory
Cache memory
 
04 cache memory
04 cache memory04 cache memory
04 cache memory
 
Input Output Operations
Input Output OperationsInput Output Operations
Input Output Operations
 
Caching
CachingCaching
Caching
 
Chapter 8 - Main Memory
Chapter 8 - Main MemoryChapter 8 - Main Memory
Chapter 8 - Main Memory
 
#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015
#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015
#NoXML: Eliminating XML in Spring Projects - SpringOne 2GX 2015
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
 
Electronic governance steps in the right direction?
Electronic governance   steps in the right direction?Electronic governance   steps in the right direction?
Electronic governance steps in the right direction?
 
Cache memory
Cache memoryCache memory
Cache memory
 
Memory Organization
Memory OrganizationMemory Organization
Memory Organization
 
cache memory
cache memorycache memory
cache memory
 
Selecting the right cache framework
Selecting the right cache frameworkSelecting the right cache framework
Selecting the right cache framework
 
Cache memory
Cache memoryCache memory
Cache memory
 
Cache memory presentation
Cache memory presentationCache memory presentation
Cache memory presentation
 
An introduction to fundamental architecture concepts
An introduction to fundamental architecture conceptsAn introduction to fundamental architecture concepts
An introduction to fundamental architecture concepts
 
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...
Comparing Hot JavaScript Frameworks: AngularJS, Ember.js and React.js - Sprin...
 

Similar to Caching in

Elasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuningElasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuningPetar Djekic
 
Etl confessions pg conf us 2017
Etl confessions   pg conf us 2017Etl confessions   pg conf us 2017
Etl confessions pg conf us 2017Corey Huinker
 
M|18 Why Abstract Away the Underlying Database Infrastructure
M|18 Why Abstract Away the Underlying Database InfrastructureM|18 Why Abstract Away the Underlying Database Infrastructure
M|18 Why Abstract Away the Underlying Database InfrastructureMariaDB plc
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingScyllaDB
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartMukesh Singh
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in productionPingCAP
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache CassandraSaeid Zebardast
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokesGagan Bajpai
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterRed_Hat_Storage
 
strangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsstrangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsMatthew Dennis
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadKrivoy Rog IT Community
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesIn-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesHazelcast
 
If the data cannot come to the algorithm...
If the data cannot come to the algorithm...If the data cannot come to the algorithm...
If the data cannot come to the algorithm...Robert Burrell Donkin
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningFromDual GmbH
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)RichardWarburton
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonJAXLondon2014
 
Memory Bandwidth QoS
Memory Bandwidth QoSMemory Bandwidth QoS
Memory Bandwidth QoSRohit Jnagal
 

Similar to Caching in (20)

Elasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuningElasticsearch 101 - Cluster setup and tuning
Elasticsearch 101 - Cluster setup and tuning
 
Etl confessions pg conf us 2017
Etl confessions   pg conf us 2017Etl confessions   pg conf us 2017
Etl confessions pg conf us 2017
 
M|18 Why Abstract Away the Underlying Database Infrastructure
M|18 Why Abstract Away the Underlying Database InfrastructureM|18 Why Abstract Away the Underlying Database Infrastructure
M|18 Why Abstract Away the Underlying Database Infrastructure
 
Retaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate LimitingRetaining Goodput with Query Rate Limiting
Retaining Goodput with Query Rate Limiting
 
2013 05 ny
2013 05 ny2013 05 ny
2013 05 ny
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
The Dark Side Of Go -- Go runtime related problems in TiDB in production
The Dark Side Of Go -- Go runtime related problems in TiDB  in productionThe Dark Side Of Go -- Go runtime related problems in TiDB  in production
The Dark Side Of Go -- Go runtime related problems in TiDB in production
 
An Introduction to Apache Cassandra
An Introduction to Apache CassandraAn Introduction to Apache Cassandra
An Introduction to Apache Cassandra
 
SNIA SDC 2016 final
SNIA SDC 2016 finalSNIA SDC 2016 final
SNIA SDC 2016 final
 
Scalability broad strokes
Scalability   broad strokesScalability   broad strokes
Scalability broad strokes
 
Erasure codes and storage tiers on gluster
Erasure codes and storage tiers on glusterErasure codes and storage tiers on gluster
Erasure codes and storage tiers on gluster
 
strangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patternsstrangeloop 2012 apache cassandra anti patterns
strangeloop 2012 apache cassandra anti patterns
 
kranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High loadkranonit S06E01 Игорь Цинько: High load
kranonit S06E01 Игорь Цинько: High load
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
In-memory Data Management Trends & Techniques
In-memory Data Management Trends & TechniquesIn-memory Data Management Trends & Techniques
In-memory Data Management Trends & Techniques
 
If the data cannot come to the algorithm...
If the data cannot come to the algorithm...If the data cannot come to the algorithm...
If the data cannot come to the algorithm...
 
UKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL TuningUKOUG 2011: Practical MySQL Tuning
UKOUG 2011: Practical MySQL Tuning
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
Memory Bandwidth QoS
Memory Bandwidth QoSMemory Bandwidth QoS
Memory Bandwidth QoS
 

More from RichardWarburton

Fantastic performance and where to find it
Fantastic performance and where to find itFantastic performance and where to find it
Fantastic performance and where to find itRichardWarburton
 
Production profiling what, why and how technical audience (3)
Production profiling  what, why and how   technical audience (3)Production profiling  what, why and how   technical audience (3)
Production profiling what, why and how technical audience (3)RichardWarburton
 
Production profiling: What, Why and How
Production profiling: What, Why and HowProduction profiling: What, Why and How
Production profiling: What, Why and HowRichardWarburton
 
Production profiling what, why and how (JBCN Edition)
Production profiling  what, why and how (JBCN Edition)Production profiling  what, why and how (JBCN Edition)
Production profiling what, why and how (JBCN Edition)RichardWarburton
 
Production Profiling: What, Why and How
Production Profiling: What, Why and HowProduction Profiling: What, Why and How
Production Profiling: What, Why and HowRichardWarburton
 
Java collections the force awakens
Java collections  the force awakensJava collections  the force awakens
Java collections the force awakensRichardWarburton
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)RichardWarburton
 
Generics past, present and future
Generics  past, present and futureGenerics  past, present and future
Generics past, present and futureRichardWarburton
 
Jvm profiling under the hood
Jvm profiling under the hoodJvm profiling under the hood
Jvm profiling under the hoodRichardWarburton
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and FutureRichardWarburton
 
Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)RichardWarburton
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Twins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional ProgrammingTwins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional ProgrammingRichardWarburton
 
Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8RichardWarburton
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behaveRichardWarburton
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behaveRichardWarburton
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Simplifying java with lambdas (short)
Simplifying java with lambdas (short)Simplifying java with lambdas (short)
Simplifying java with lambdas (short)RichardWarburton
 

More from RichardWarburton (20)

Fantastic performance and where to find it
Fantastic performance and where to find itFantastic performance and where to find it
Fantastic performance and where to find it
 
Production profiling what, why and how technical audience (3)
Production profiling  what, why and how   technical audience (3)Production profiling  what, why and how   technical audience (3)
Production profiling what, why and how technical audience (3)
 
Production profiling: What, Why and How
Production profiling: What, Why and HowProduction profiling: What, Why and How
Production profiling: What, Why and How
 
Production profiling what, why and how (JBCN Edition)
Production profiling  what, why and how (JBCN Edition)Production profiling  what, why and how (JBCN Edition)
Production profiling what, why and how (JBCN Edition)
 
Production Profiling: What, Why and How
Production Profiling: What, Why and HowProduction Profiling: What, Why and How
Production Profiling: What, Why and How
 
Java collections the force awakens
Java collections  the force awakensJava collections  the force awakens
Java collections the force awakens
 
Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)Generics Past, Present and Future (Latest)
Generics Past, Present and Future (Latest)
 
Collections forceawakens
Collections forceawakensCollections forceawakens
Collections forceawakens
 
Generics past, present and future
Generics  past, present and futureGenerics  past, present and future
Generics past, present and future
 
Jvm profiling under the hood
Jvm profiling under the hoodJvm profiling under the hood
Jvm profiling under the hood
 
How to run a hackday
How to run a hackdayHow to run a hackday
How to run a hackday
 
Generics Past, Present and Future
Generics Past, Present and FutureGenerics Past, Present and Future
Generics Past, Present and Future
 
Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)Pragmatic functional refactoring with java 8 (1)
Pragmatic functional refactoring with java 8 (1)
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Twins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional ProgrammingTwins: Object Oriented Programming and Functional Programming
Twins: Object Oriented Programming and Functional Programming
 
Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8Pragmatic functional refactoring with java 8
Pragmatic functional refactoring with java 8
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behave
 
Introduction to lambda behave
Introduction to lambda behaveIntroduction to lambda behave
Introduction to lambda behave
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Simplifying java with lambdas (short)
Simplifying java with lambdas (short)Simplifying java with lambdas (short)
Simplifying java with lambdas (short)
 

Recently uploaded

How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxKaustubhBhavsar6
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosErol GIRAUDY
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveIES VE
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarThousandEyes
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0DanBrown980551
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Libraryshyamraj55
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 

Recently uploaded (20)

How to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptxHow to become a GDSC Lead GDSC MI AOE.pptx
How to become a GDSC Lead GDSC MI AOE.pptx
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenarios
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0LF Energy Webinar - Unveiling OpenEEMeter 4.0
LF Energy Webinar - Unveiling OpenEEMeter 4.0
 
How to release an Open Source Dataweave Library
How to release an Open Source Dataweave LibraryHow to release an Open Source Dataweave Library
How to release an Open Source Dataweave Library
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 

Caching in

  • 1. Caching In Understand, measure and use your CPU Cache more effectively Richard Warburton
  • 2. - Luke Gorrie on Mechanical Sympathy the hardware really wants to run fast and you only need to avoid getting in the way
  • 3. What are you talking about? ● Why do you care? ● Hardware Fundamentals ● Measurement ● Principles and Examples
  • 4. Why do you care? Perhaps why /should/ you care.
  • 5. The Problem Very Fast Relatively Slow
  • 6. The Solution: CPU Cache ● Core Demands Data, looks at its cache ○ If present (a "hit") then data returned to register ○ If absent (a "miss") then data looked up from memory and stored in the cache ● Fast memory is expensive, a small amount is affordable
  • 7. Multilevel Cache: Intel Sandybridge Shared Level 3 Cache Level 2 Cache Level 1 Data Cache Level 1 Instruction Cache Physical Core 0 HT: 2 Logical Cores .... Level 2 Cache Level 1 Data Cache Level 1 Instruction Cache Physical Core N HT: 2 Logical Cores
  • 8. CPU Stalls ● Want your CPU to execute instructions ● Pipeline Stall = CPU can't execute because its waiting on memory ● Stalls displace potential computation ● Busy-work Strategies ○ Out-of-Order Execution ○ Simultaneous MultiThreading (SMT) /
  • 9. How bad is a miss? Location Latency in Clockcycles Register 0 L1 Cache 3 L2 Cache 9 L3 Cache 21 Main Memory 150-400 NB: Sandybridge Numbers
  • 10. Cache Lines ● Data transferred in cache lines ● Fixed size block of memory ● Usually 64 bytes in current x86 CPUs ○ Between 32 and 256 bytes ● Purely hardware consideration
  • 11. You don't need to know everything
  • 12. Architectural Takeaways ● Modern CPUs have large multilevel Caches ● Cache misses cause Stalls ● Stalls are expensive
  • 13. Measurement Barometer, Sun Dial ... and CPU Counters
  • 14. Do you care? ● DON'T look at CPU Caching behaviour first ● Many problems not Execution Bound ○ Networking ○ Database or External Service ○ I/O ○ Garbage Collection ○ Insufficient Parallelism ● Consider caching behaviour when you're execution bound and know your hotspot.
  • 15. CPU Performance Event Counters ● CPU Register which can count events ○ Eg: the number of Level 3 Cache Misses ● Model Specific Registers ○ not instruction set standardised by x86 ○ differ by CPU (eg Penryn vs Sandybridge) and manufacturer ● Don't worry - leave details to tooling
  • 16. Measurement: Instructions Retired ● The number of instructions which executed until completion ○ ignores branch mispredictions ● When stalled you're not retiring instructions ● Aim to maximise instruction retirement when reducing cache misses
  • 17. Cache Misses ● Cache Level ○ Level 3 ○ Level 2 ○ Level 1 (Data vs Instruction) ● What to measure ○ Hits ○ Misses ○ Reads vs Writes ○ Calculate Ratio
  • 18. Cache Profilers ● Open Source ○ perf ○ likwid ○ linux (rdmsr/wrmsr) ○ Linux Kernel (via /proc) ● Proprietary ○ jClarity jMSR ○ Intel VTune ○ AMD Code Analyst ○ Visual Studio 2012 ○ Oracle Studio
  • 19. Good Benchmarking Practice ● Warmups ● Measure and rule-out other factors ● Specific Caching Ideas ...
  • 20. Take Care with Working Set Size
  • 21. Good Benchmark has Low Variance
  • 22. You can measure Cache behaviour
  • 24. ● Prefetch = Eagerly load data ● Adjacent Cache Line Prefetch ● Data Cache Unit (Streaming) Prefetch ● Problem: CPU Prediction isn't perfect ● Solution: Arrange Data so accesses are predictable Prefetching
  • 25. Sequential Locality Referring to data that is arranged linearly in memory Spatial Locality Referring to data that is close together in memory Temporal Locality Repeatedly referring to same data in a short time span
  • 26. General Principles ● Use smaller data types (-XX:+UseCompressedOops) ● Avoid 'big holes' in your data ● Make accesses as linear as possible
  • 27. Primitive Arrays // Sequential Access = Predictable for (int i=0; i<someArray.length; i++) someArray[i]++;
  • 28. Primitive Arrays - Skipping Elements // Holes Hurt for (int i=0; i<someArray.length; i += SKIP) someArray[i]++;
  • 29. Primitive Arrays - Skipping Elements
  • 30. Multidimensional Arrays ● Multidimensional Arrays are really Arrays of Arrays in Java. (Unlike C) ● Some people realign their accesses: for (int col=0; col<COLS; col++) { for (int row=0; row<ROWS; row++) { array[ROWS * col + row]++; } }
  • 31. Bad Access Alignment Strides the wrong way, bad locality. array[COLS * row + col]++; Strides the right way, good locality. array[ROWS * col + row]++;
  • 32. Data Layout Principles ● Primitive Collections (GNU Trove, GS-Coll) ● Arrays > Linked Lists ● Hashtable > Search Tree ● Avoid Code bloating (Loop Unrolling) ● Custom Data Structures ○ Judy Arrays ■ an associative array/map ○ kD-Trees ■ generalised Binary Space Partitioning ○ Z-Order Curve ■ multidimensional data in one dimension
  • 33. Data Locality vs Java Heap Layout 0 1 2 3 ... class Foo { Integer count; Bar bar; Baz baz; } // No alignment guarantees for (Foo foo : foos) { foo.count = 5; foo.bar.visit(); } Foo bar baz count
  • 34. Data Locality vs Java Heap Layout ● Serious Java Weakness ● Location of objects in memory hard to guarantee. ● GC also interferes ○ Copying ○ Compaction
  • 35. Alternative Approach ● Use sun.misc.Unsafe to directly allocate memory ● Manage mapping yourself ● Use with care ○ Maybe slower ○ Maybe error prone ○ Definitely hard work
  • 36. SLAB ● Manually writing this is: ○ time consuming ○ error prone ○ boilerplate-ridden ● Prototype library: ○ http://github.com/RichardWarburton/slab ○ Maps an interface to a flyweight ● Still not recommended!
  • 37. Data Alignment ● An address a is n-byte aligned when a is divisible by n ● Cache Line Aligned: ○ byte aligned to the size of a cache line
  • 38. Rules and Hacks ● Java (Hotspot) ♥ 8-byte alignment: ○ new Java Objects ○ Unsafe.allocateMemory ○ Bytebuffer.allocateDirect ● Get the object address: ○ Unsafe gives it to you directly ○ Direct ByteBuffer has a field called 'address'
  • 39. Cacheline Aligned Access ● Performance Benefits: ○ 3-6x on Core2Duo ○ ~1.5x on Nehalem Class ○ Measured using direct ByteBuffers ○ Mid Aligned vs Straddling a Cacheline ● http://psy-lob-saw.blogspot.co.uk/2013/01/direct-memory-alignment-in-java.html
  • 40. Today I learned ... Access and Cache Aligned, Contiguous Data
  • 42. Virtual Memory ● RAM is finite ● RAM is fragmented ● Multitasking is nice ● Programming easier with a contiguous address space
  • 43. Page Tables ● Virtual Address Space split into pages ● Page has two Addresses: ○ Virtual Address: Process' view ○ Physical Address: Location in RAM/Disk ● Page Table ○ Mapping Between these addresses ○ OS Managed ○ Page Fault when entry is missing ○ OS Pages in from disk
  • 44. Translation Lookaside Buffers ● TLB = Page Table Cache ● Avoids memory bottlenecks in page lookups ● CPU Cache is multi-level => TLB is multi- level ● Miss/Hit rates measurable via CPU Counters ○ Hit/Miss Rates ○ Walk Duration
  • 45. TLB Flushing ● Process context switch => change address space ● Historically caused "flush" = ● Avoided with an Address Space Identifier (ASID)
  • 46. Page Size Tradeoff: Bigger Pages = less pages = quicker page lookups vs Bigger Pages = wasted memory space = more disk paging
  • 47. Page Sizes ● Page Size is adjustable ● Common Sizes: 4KB, 2MB, 1GB ● Potential Speedup: 10%-30% ● -XX:+UseLargePages ● Oracle DBs particularly
  • 48. Too Long, Didn't Listen 1. Virtual Memory + Paging have an overhead 2. Translation Lookaside Buffers used to reduce cost 3. Page size changes important
  • 49. Principles & Examples (2) Concurrent Code
  • 50. Context Switching ● Multitasking systems 'context switch' between processes ● Variants: ○ Thread ○ Process ○ Virtualized OS
  • 51. Context Switching Costs ● Direct Cost ○ Save old process state ○ Schedule new process ○ Load new process state ● Indirect Cost ○ TLB Reload (Process only) ○ CPU Pipeline Flush ○ Cache Interference ○ Temporal Locality
  • 52. Indirect Costs (cont.) ● "Quantifying The Cost of Context Switch" by Li et al. ○ 5 years old - principle still applies ○ Direct = 3.8 microseconds ○ Indirect = ~0 - 1500 microseconds ○ indirect dominates direct when working set > L2 Cache size ● Cooperative hits also exist ● Minimise context switching if possible
  • 53. Locking ● Locks require kernel arbitration ○ Not a context switch, but a mode switch ○ Lots of lock contention causes context switching ● Has a cache-pollution cost ● Can use try lock-free algorithms
  • 54. GC Safepointing ● GC needs to pause program ○ Safe points ● Safe Points cause a page fault ● Also generates Context Switch between Mutator and GC ○ Not guaranteed to be scheduled back on the same core
  • 55. Java Memory Model ● JSR 133 Specs the memory model ● Turtles Example: ○ Java Level: volatile ○ CPU Level: lock ○ Cache Level: MESI ● Has a cost!
  • 56. False Sharing ● Data can share a cache line ● Not always good ○ Some data 'accidentally' next to other data. ○ Causes Cache lines to be evicted prematurely ● Serious Concurrency Issue
  • 57. Concurrent Cache Line Access © Intel
  • 58. Current Solution: Field Padding public volatile long value; public long pad1, pad2, pad3, pad4, pad5, pad6; 8 bytes(long) * 7 fields + header = 64 bytes = Cacheline NB: fields aligned to 8 bytes even if smaller and re-ordered by size
  • 59. Real Solution: JEP 142 class UncontendedFoo { int someValue; @Contended volatile long foo; Bar bar; } http://openjdk.java.net/jeps/142
  • 60. False Sharing in Your GC ● Card Tables ○ Split RAM into 'cards' ○ Table records which parts of RAM are written to ○ Avoid rescanning large blocks of RAM ● BUT ... optimise by writing to the byte on every write ● Solution: -XX:+UseCondCardMark ○ Use With Care: 15-20% sequential slowdown
  • 61. Buyer Beware: ● Context Switching ● False Sharing ● Visibility/Atomicity/Locking Costs
  • 62. Too Long, Didn't Listen 1. Caching Behaviour has a performance effect 2. The effects are measurable 3. There are common problems and solutions
  • 63. Useful Links ● My blog ○ insightfullogic.com ● Lots about memory ○ akkadia.org/drepper/cpumemory.pdf ● ○ jclarity.com/friends/ ● Martin Thompson's blog ○ mechanical-sympathy.blogspot.co.uk ● Concurrency Interest ○ g.oswego.edu/dl/concurrency-interest/
  • 65. Garbage Collection Impact ● Young Gen Copying Collector ○ Copying changes location ○ Sequential Allocation helps ● Concurrent Mark Sweep ○ lack of compaction until CMF hurts ● Parallel Old GC ○ Compaction at Full GC helps
  • 66. Slab (2) public interface GameEvent extends Cursor { public int getId(); public void setId(int value); public long getStrength(); public void setStrength(long value); } Allocator<GameEvent> eventAllocator = Allocator.of(GameEvent.class); GameEvent event = eventAllocator.allocate(100); event.move(50); event.setId(6); assertEquals(6, event.getId());
  • 67. A Brave New World (hopefully, sort of)
  • 68. Value Types ● Assign or pass the object you copy the value and not the reference ● Allow control of the memory layout ● Control of layout ● Idea, initial prototypes in mlvm
  • 69. The future will solve all problems!
  • 70. Arrays 2.0 ● Proposed JVM Feature ● Library Definition of Array Types ● Incorporate a notion of 'flatness' ● Semi-reified Generics ● Loads of other goodies (final, volatile, resizing, etc.) ● Requires Value Types
  • 71. Game Event Server class PlayerAttack { private long playerId; private int weapon; private int energy; ... private int getWeapon() { return weapon; } ...
  • 72. Flyweight Unsafe (1) static final Unsafe unsafe = ... ; long space = OBJ_COUNT * ATTACK_OBJ_SIZE; address = unsafe.allocateMemory(space); static PlayerAttack get(int index) { long loc = address + (ATTACK_OBJ_SIZE * index) return new PlayerAttack(loc); }
  • 73. Flyweight Unsafe (2) class PlayerAttack { static final long WEAPON_OFFSET = 8 private long loc; public int getWeapon() { long address = loc + WEAPON_OFFSET return unsafe.getInt(address); } public void setWeapon(int weapon) { long address = loc + WEAPON_OFFSET unsafe.setInt(address, weapon); }