This document discusses strategies for optimizing cache performance in Java applications. It begins by providing examples of different caching technologies like Coherence, Gemfire, Ehcache, Cassandra and Memcached. It then discusses key metrics for measuring cache performance like insert, read and update latencies. The document outlines concepts like data locality, hit ratios and expiration policies that impact cache performance. It also demonstrates visualizing cache usage and heatmaps. Finally, it discusses techniques for optimizing the Java virtual machine for big data workloads, including reducing object overhead, using non-blocking collections to avoid locks, tuning garbage collection and avoiding memory leaks.
Cache is King (Or How To Stop Worrying And Start Caching in Java) at Chicago Mercantile Group, 2010
1. Cache is King
(Or How to Stop Worrying and
Start Caching in Java)
SriSatish Ambati
Performance & Partner Engineer
Now: Azul Systems,
Upcoming: Riptano, Apache Cassandra
Twitter: @srisatish
2. The Trail
• Examples
• Elements of Cache Performance
– Theory
– Metrics
– Focus on Oracle Coherence, Apache Cassandra
• 200GB Cache Design
• JVM in BigData Land
– Overheads in Java – Objects, GC
– Locks
– Serialization
– JMX
3. Wer die Wahl hat, hat die Qual!
He who has the choice has the agony!
4. You are on the train
& when *asked* to pick a stock to invest in,
what does one do?
a) Pick all single letter stocks (Citi, Ford, Hyatt, Sprint)
b) Pick commodities
c) Pick penny stocks
d) Pick an index
e) Pick the nearest empty seat
5. If only we could pick an index of Cache Vendors!
6. Some example caches
• Homegrown caches – Surprisingly work well.
– (do it yourself! It’s a giant hash)
• Coherence, Gemstone/VMWare, GigaSpaces,
EhCache/Terracotta, Infinispan/JBoss etc
• NoSQL stores: Apache Cassandra, HBase,
SimpleDB
• Non java alternatives: MemCached & clones,
Redis, CouchDB, MongoDB
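The "giant hash" homegrown cache above can be sketched in a few lines with `LinkedHashMap`'s access-order mode. The `LruCache` class name is ours, not from the talk:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal homegrown cache: a bounded, access-ordered hash map that
// evicts the least-recently-used entry once capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // called after each put/putAll
    }
}
```

For multi-threaded use this would need external synchronization (`Collections.synchronizedMap`) or one of the cache products above.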
9. Elements of Cache Performance : Metrics
• Inserts: Puts/sec, Latencies
• Reads: Gets/sec, Latencies, Indexing
• Updates: mods/sec, latencies (Locate, Modify &
Notify)
• Replication
– Synchronous, Asynchronous (faster writes)
• Consistency – Eventual
• Persistence
• Size of Objects, Number of Objects/rows
• Size of Cache
• # of cacheserver Nodes (read only, read write)
• # of clients
10. Elements of Cache Performance:
“Think Locality”
• Hot or Not: The 80/20 rule.
– A small set of objects are very popular!
– Most popular commodity of the day?
• Hit or Miss: Hit Ratio
– How effective is your cache?
– LRU, LFU, FIFO.. Expiration
• Long-lived objects lead to better locality.
• Spikes happen
– Cascading events
– Cache Thrash: full table scans
11. A feather in the CAP
• Tunable Consistency
– Levels: 0, 1, ALL
– Doesn’t mean data loss
(journaled systems)
• SEDA
– Partitioning, Cluster-membership
& Failure detection, Storage
engines
– Event driven & non-blocking io
– Pure Java
12. NoSQL/Cassandra:
furiously fast writes
(Diagram: a client write goes to the partitioner, which finds the node; the write is appended to the commit log, then applied in memory, replicating from node n1 to n2.)
• Append only writes
– Sequential disk access
• No locks in critical path
• Key based atomicity
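The write path above can be sketched with a toy `AppendOnlyStore` class (our illustration, not Cassandra's actual implementation): every write is first appended sequentially to a commit log, then applied to an in-memory map.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an append-only write path: sequential commit-log append,
// then apply in memory. No locks in the critical path; the
// ConcurrentHashMap gives key-based atomicity.
public class AppendOnlyStore {
    private final Path commitLog;
    private final Map<String, String> memtable = new ConcurrentHashMap<>();

    public AppendOnlyStore(Path commitLog) {
        this.commitLog = commitLog;
    }

    public void write(String key, String value) throws IOException {
        String record = key + "=" + value + "\n";
        // Sequential disk access: append-only, never seek or rewrite in place.
        Files.write(commitLog, record.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        memtable.put(key, value); // apply to memory after the log append
    }

    public String read(String key) {
        return memtable.get(key);
    }
}
```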
13. Performance
• Facebook Inbox
– Writes: 0.12 ms, Reads: 15 ms @ 50 GB data
– More than 10x better than MySQL
• YCSB/PNUTS benchmarks
– 5 ms reads/writes @ 5k ops/s (50/50, update heavy)
– 8 ms reads / 5 ms writes @ 5k ops/s (95/5, read heavy)
• Lab environment
– ~5k writes/sec per node, <5 ms latencies
– ~10k reads/sec per node, <5 ms latencies
16. I/O considerations
• Asynchronous
• Sockets
• Persistence –
– File, DB (CacheLoaders)
– Dedicated disks: What happens in the cloud?
• Data Access Patterns of Doom,
– “Death by a million gets” – Batch your reads.
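"Batch your reads" can be illustrated with a hypothetical `Cache` interface; the `get`/`getAll` names are stand-ins, not any specific vendor's API:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// "Death by a million gets": N single gets pay N network round trips,
// while one batched getAll pays roughly one.
interface Cache {
    String get(String key);                               // one round trip per call
    Map<String, String> getAll(Collection<String> keys);  // one round trip total
}

public class BatchReads {
    // Anti-pattern: one network round trip per key.
    static List<String> fetchOneByOne(Cache cache, List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String k : keys) out.add(cache.get(k));
        return out;
    }

    // Preferred: a single bulk request for the whole key set.
    static List<String> fetchBatched(Cache cache, List<String> keys) {
        Map<String, String> hits = cache.getAll(keys);
        List<String> out = new ArrayList<>();
        for (String k : keys) out.add(hits.get(k));
        return out;
    }
}
```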
17. Partitioning & Distributed Caches
• Near Cache/L1 Cache
– Bring data close to the Logic that is using it. (HBase)
– Birds of a feather flock together: related data live closer
• Read-only nodes, Read-Write nodes
• Ranges, Bloom Filters
• Management nodes
• Communication Costs
• Balancing (buckets)
• Serialization (more later)
18. Birthdays, Collisions &
Hashing functions
• Birthday Paradox
– For N=21 people in a room
– the probability that at least 2 of them share the same
birthday is ~0.44
• Collisions are real!
• An unbalanced HashMap behaves like a list O(n) retrieval
• Chaining & Linear probing
• Performance Degrades
– with 80% table density
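The birthday numbers above can be checked directly:

```java
// Birthday paradox: P(at least two of N people share a birthday)
// = 1 - (365/365)(364/365)...((365-N+1)/365).
public class Birthday {
    static double collisionProbability(int n) {
        double pNoCollision = 1.0;
        for (int i = 0; i < n; i++) {
            pNoCollision *= (365.0 - i) / 365.0;
        }
        return 1.0 - pNoCollision;
    }

    public static void main(String[] args) {
        System.out.printf("P(collision, n=21) = %.3f%n",
                collisionProbability(21)); // ~0.44
    }
}
```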
19. Bloom Filter: in full bloom
• “constant” time
• size:compact
• false positives
• Single lookup
for key in file
• Deletion
• Improve
– Counting BF
– Bloomier filters
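A minimal Bloom filter sketch using double hashing to derive the k probes; the class and its constants are illustrative, not a production design:

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash probes into a bit array. Membership
// tests are "constant" time and compact, at the price of false
// positives (and no deletion without a counting variant).
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive k probe positions from two base hashes (double hashing).
    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(probe(key, i));
    }

    // "false" is definite; "true" may be a false positive.
    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(probe(key, i))) return false;
        return true;
    }
}
```

This is how a single lookup can rule out a key being in a file without touching disk.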
22. How many nodes to get a 200G cache?
• Who needs a 200G cache?
– Disk is the new Tape!
• 200 nodes @ 1GB heap each
• 2 nodes @ 100GB heap each
– (plus overhead)
24. JVM in BigData Land!
A few limits for scale
• Object overhead
– average enterprise collection has 3 elements!
– Use byte[ ], primitives where possible!
• Locks : synchronized
– Can’t use all my multi-cores!
– java.util.collections also hold locks
– Use non-blocking collections!
• (de) Serialization is expensive
– Hampers object portability, cluster-scaleability
– Use avro, thrift!
• Garbage Collection
– Can’t throw memory at the problem!?
– Mitigate, Monitor, Measure footprint
25. Tools
• What is the JVM doing:
– dtrace, hprof, introscope, dynatrace, jconsole,
visualvm, yourkit, azul zvision
• Invasive JVM observation tools
– bci, jvmti, jvmdi/pi agents, jmx, logging
• What is the OS doing:
– dtrace, oprofile, vtune
• What are the network & disk doing:
– Ganglia, iostat, lsof, netstat, nagios
26. Java Limits: Objects are not cheap!
• How many bytes for an 8-char String? (assume 32-bit)
A. 64 bytes
– String object (32 bytes): 16 bytes JVM overhead + 12 bytes bookkeeping fields + 4 bytes pointer to the char[] (31% overhead)
– char[] object (32 bytes): 16 bytes JVM overhead + 16 bytes of character data
– JVM overhead varies with the JVM; only 16 of the 64 bytes are actual data
• How many objects in an idle Tomcat instance?
27. Picking the right collection: Mozart or Bach?
• TreeMap<Double, Double>, 100 elements*
– Fixed overhead: 48 bytes; per-entry (TreeMap$Entry) overhead: 40 bytes
– 82% overhead, 88 bytes constant cost per element
– Enables updates while maintaining order
• double[], double[] (two parallel arrays)
– 2% overhead, amortized (16 bytes JVM overhead + 8 bytes per data element)
– con: load-then-use
• Sparse collections, empty collections
• Wrong collection for the problem
*From one 32-bit JVM; varies with JVM architecture
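The trade-off in code, a small sketch (the `CollectionChoice` class is ours): TreeMap stays updatable and ordered at a heavy per-entry cost, while two parallel `double[]` arrays hold the same mapping compactly but must be loaded and sorted before use.

```java
import java.util.Arrays;
import java.util.TreeMap;

public class CollectionChoice {
    public static void main(String[] args) {
        // Heavyweight but updatable, keeps keys ordered:
        TreeMap<Double, Double> map = new TreeMap<>();
        map.put(2.0, 20.0);
        map.put(1.0, 10.0);
        System.out.println(map.get(1.0)); // 10.0

        // Compact load-then-use alternative: sorted keys, aligned values.
        double[] keys = {1.0, 2.0};
        double[] values = {10.0, 20.0};
        int idx = Arrays.binarySearch(keys, 1.0);
        System.out.println(values[idx]); // 10.0
    }
}
```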
30. Garbage Collection
• Pause Times
if stop_the_world_FullGC > ttl_of_node
=> failed requests; node repair
=> node is declared dead
• Allocation Rate
– New object creation, insertion rate
• Live Objects (residency)
– if residency in heap > 50%
– GC overheads dominate.
• Overhead: space, cpu cycles spent GC
• 64-bit not addressing pause times
– Bigger is not better!
– 40-50% increase in heap sizes for same
workloads.
31. Too many free parameters!!
Tune GC:
• Entropy: the number of flags it takes to tune GC
• Workloads in the lab do not represent production
• Fragile: the meaning of flags changes across releases
Solution:
• Ask your VM vendor to provide a one-flag solution
• Azul’s PauselessGC (now in software)
⇒ Avoid OOM, configure node death if OOM
⇒ Azul’s Cooperative-Memory (swap space for your jvm
under spike: No more OOM!)
32. Memory Fragmentation
• Fragmentation
– Performance degrades over time
– Inducing “Full GC” makes problem go away
– Free memory that cannot be used
• Reduce occurrence
– Use a compacting collector
– Promote less often
– Use uniform sized objects
• Solution – unsolved
– Use latest CMS with CR:6631166
– Azul’s Zing JVM & Pauseless GC
33. Sizing: Young Generation
• Should we set –Xms == -Xmx ?
• Use –Xmn (fixed eden)
(Diagram: allocations via new Object() land in eden; the survivor ratio sizes the survivor spaces; the tenuring threshold governs promotion by the JVM into the old generation.)
34. Generations
• Don’t promote too often!
– Frequent promotion causes fragmentation
• Size the generations
– Min GC times are a function of Live Set
– Old Gen should host steady state comfortably
• Parallelize on multicores:
– -XX:ParallelCMSThreads=4
– -XX:ParallelGCThreads=4
• Avoid CMS Initiating heuristic
– -XX:+UseCMSInitiatingOccupancyOnly
• Use Concurrent for System.gc()
– -XX:+ExplicitGCInvokesConcurrent
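The sizing and CMS flags above combine into a command line along these lines. The heap and eden sizes are illustrative placeholders, and `-XX:CMSInitiatingOccupancyFraction=75` is one possible companion value, not a recommendation from the talk:

```shell
# Example HotSpot (CMS-era) command line combining the slide's flags.
# Heap/eden sizes here are placeholders, not recommendations.
java -Xms4g -Xmx4g -Xmn1g \
     -XX:+UseConcMarkSweepGC \
     -XX:ParallelCMSThreads=4 \
     -XX:ParallelGCThreads=4 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:CMSInitiatingOccupancyFraction=75 \
     -XX:+ExplicitGCInvokesConcurrent \
     -jar app.jar
```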
35. Memory Leaks
• Application takes all
memory you got!
• Live heap shows sawtooth
• Eventually throws OOM
Theory:
• Allocated, Live heap,
PermGen
Common sources:
• Finalizers, Classloaders,
ThreadLocal
36. synchronized:
Amdahl’s law trumps Moore’s!
• Coarse grained locks
• I/O under lock
• Stop signal on a highway
• java.util.concurrent does not mean no locks
• Non Blocking, Lock free, Wait free collections
37. Locks: Distributed Caching
• Schemes
– Optimistic, Pessimistic
• Consistency
– Eventual vs. ACID
• Contention, Waits
• java.util.concurrent, critical sections,
– Use Lock Striping
• MVCC, Lock-free, wait-free DataStructures.
(NBHM)
• Transactions are expensive
⇒Reduce JTA abuse, Set the right isolation levels.
39. UUID
Are you using UUID gen for messaging?
• java.util.UUID is slow
– static use leads to contention
SecureRandom
• Uses /dev/urandom for seed initialization
-Djava.security.egd=file:/dev/urandom
• A PRNG without file access is at least 20-40% faster.
• Use TimeUUIDs where possible – much faster
• JUG – java.uuid.generator
• http://github.com/cowtowncoder/java-uuid-generator
• http://jug.safehaus.org/
• http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.html
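One way to sidestep SecureRandom contention when the uniqueness requirements allow it, as a hedged sketch: `timeCounterId` below is our toy time-plus-counter scheme, not the TimeUUID format produced by the JUG library.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// UUID.randomUUID() goes through a shared SecureRandom, which can
// contend under many threads. A process-local time+counter ID avoids
// that contention at the cost of weaker global-uniqueness guarantees.
public class IdGen {
    private static final AtomicLong SEQ = new AtomicLong();

    // Hypothetical lightweight ID: wall-clock millis + local counter.
    static String timeCounterId() {
        return Long.toHexString(System.currentTimeMillis()) + "-"
                + Long.toHexString(SEQ.incrementAndGet());
    }

    public static void main(String[] args) {
        UUID u = UUID.randomUUID();   // SecureRandom-backed, may contend
        String t = timeCounterId();   // contention-free alternative
        System.out.println(u + " vs " + t);
    }
}
```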
40. Towards Non-blocking high scale collections!
• Big Array to hold Data
• Concurrent writes via: CAS & Finite State Machine
– No locks, no volatile
– Much faster than locking under heavy load
– Directly reach main data array in 1 step
• Resize as needed
– Copy Array to a larger Array on demand
– Use State Machine to help copy
– “ Mark” old Array words to avoid missing late updates
• Use Non-Blocking Hashmap, google collections
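The CAS-and-probe idea above in miniature: a toy lock-free set, not Cliff Click's NonBlockingHashMap (the resize state machine the slide describes is omitted here).

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Writers claim slots in a big array with compareAndSet instead of
// taking a lock: no locks, no blocking, losers simply retry or probe on.
public class CasSet {
    private final AtomicReferenceArray<String> slots;

    public CasSet(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    public boolean add(String key) {
        int start = Math.floorMod(key.hashCode(), slots.length());
        for (int i = 0; i < slots.length(); i++) {
            int idx = (start + i) % slots.length();
            while (true) {
                String cur = slots.get(idx);
                if (key.equals(cur)) return false;   // already present
                if (cur != null) break;              // slot taken: probe on
                if (slots.compareAndSet(idx, null, key))
                    return true;                     // claimed the slot
                // CAS lost to another writer: re-read this slot and retry
            }
        }
        return false; // table full (a real version resizes to a larger array)
    }

    public boolean contains(String key) {
        int start = Math.floorMod(key.hashCode(), slots.length());
        for (int i = 0; i < slots.length(); i++) {
            String cur = slots.get((start + i) % slots.length());
            if (key.equals(cur)) return true;
            if (cur == null) return false;
        }
        return false;
    }
}
```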
42. Inter-node communication
• TCP for mgmt & data: Infinispan
• TCP for mgmt, UDP for data: Coherence, Infinispan
• UDP for mgmt, TCP for data: Cassandra, Infinispan
• Instrumentation: EHCache/Terracotta
Bandwidth & Latency considerations
⇒ Ensure proper network configuration in the kernel
⇒ Run Datagram tests
⇒ Limit number of management nodes & nodes
47. Count what is countable, measure what is measurable, and what is not
measurable, make measurable
-Galileo
48. Latency:
Where have all the millis gone?
• Moore’s law amplifies bandwidth
– Latencies are still lagging!
• Measure. 90th percentile. Look for consistency.
• => JMX is great! JMX is also very slow.
• Fewer nodes means fewer MBeans!
• Monitor (network, memory, cpu), ganglia,
• Know thyself: Application Footprint, Trend data.