Cache is King ( Or How To Stop Worrying And Start Caching in Java) at Chicago Mercantile Group, 2010
Compendium of all my Cache talks in 2010 presented at Chicago Mercantile Exchange on Nov.2010.
1. Cache is King (Or How to Stop Worrying and Start Caching in Java)
SriSatish Ambati, Performance & Partner Engineer
Now: Azul Systems; Upcoming: Riptano, Apache Cassandra
Twitter: @srisatish
2. The Trail
• Examples
• Elements of Cache Performance
  – Theory
  – Metrics
  – Focus on Oracle Coherence, Apache Cassandra
• 200GB Cache Design
• JVM in BigData Land
  – Overheads in Java: Objects, GC
  – Locks
  – Serialization
  – JMX
3. Wer die Wahl hat, hat die Qual!
He who has the choice has the agony!
4. You are on the train, and *asked* to pick a stock to invest in. What does one do?
a) Pick all single-letter stocks (Citi, Ford, Hyatt, Sprint)
b) Pick commodities
c) Pick penny stocks
d) Pick an index
e) Pick the nearest empty seat
5. If only we could pick an index of Cache Vendors!
6. Some example caches
• Homegrown caches – surprisingly, they work well (do it yourself! It's a giant hash)
• Coherence, GemStone/VMware, GigaSpaces, EhCache/Terracotta, Infinispan/JBoss, etc.
• NoSQL stores: Apache Cassandra, HBase, SimpleDB
• Non-Java alternatives: Memcached & clones, Redis, CouchDB, MongoDB
7. Visualize Cache
• Simple example
• Replicated Cache vs. Distributed Cache
8. Example, RTView: Coherence heat map
9. Elements of Cache Performance: Metrics
• Inserts: puts/sec, latencies
• Reads: gets/sec, latencies, indexing
• Updates: mods/sec, latencies (locate, modify & notify)
• Replication: synchronous, asynchronous (faster writes)
• Consistency: eventual
• Persistence
• Size of objects, number of objects/rows
• Size of cache
• Number of cache-server nodes (read-only, read-write)
• Number of clients
10. Elements of Cache Performance: "Think Locality"
• Hot or Not: the 80/20 rule
  – A small set of objects is very popular!
  – Most popular commodity of the day?
• Hit or Miss: hit ratio
  – How effective is your cache?
  – LRU, LFU, FIFO … expiration
• Long-lived objects lead to better locality.
• Spikes happen
  – Cascading events
  – Cache thrash: full table scans
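The LRU expiration policy mentioned above fits in a few lines of Java. This is a minimal single-threaded sketch (not a production cache): LinkedHashMap in access-order mode plus a capacity check gives LRU eviction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch: LinkedHashMap in access-order mode evicts the
// least-recently-used entry once the capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so "b" becomes the eldest
        cache.put("c", "3"); // evicts "b"
        System.out.println(cache.keySet()); // [a, c]
    }
}
```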
11. A feather in the CAP
• Tunable consistency
  – Levels: 0, 1, ALL
  – Doesn't mean data loss (journaled systems)
• SEDA
  – Partitioning, cluster membership & failure detection, storage engines
  – Event-driven & non-blocking I/O
  – Pure Java
12. NoSQL/Cassandra: furiously fast writes
[Diagram: a client issues a write; the partitioner finds the node (n1, n2); the write goes to the commit log, then is applied to memory]
• Append-only writes – sequential disk access
• No locks in the critical path
• Key-based atomicity
13. Performance
• Facebook Inbox
  – Writes: 0.12 ms, reads: 15 ms @ 50GB data
  – More than 10x better than MySQL
• YCSB/PNUTS benchmarks
  – 5 ms reads/writes @ 5k ops/s (50/50, update-heavy)
  – 8 ms reads / 5 ms writes @ 5k ops/s (95/5, read-heavy)
• Lab environment
  – ~5k writes/sec per node, <5 ms latencies
  – ~10k reads/sec per node, <5 ms
14. Yahoo cloud store benchmark: 50/50 – update heavy
15. Yahoo cloud store benchmark: 95/5 – read heavy
16. I/O considerations
• Asynchronous
• Sockets
• Persistence
  – File, DB (CacheLoaders)
  – Dedicated disks: what happens in the cloud?
• Data access patterns of doom
  – "Death by a million gets"
  – Batch your reads.
17. Partitioning & Distributed Caches
• Near cache/L1 cache
  – Bring data close to the logic that is using it (HBase).
  – Birds of a feather flock together: related data live closer.
• Read-only nodes, read-write nodes
• Ranges, Bloom filters
• Management nodes
• Communication costs
• Balancing (buckets)
• Serialization (more later)
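One common way to get the balancing and range ownership above is a consistent-hash ring, the general technique behind Cassandra-style partitioning. A minimal sketch; the node names, the 3-virtual-node count, and the use of String.hashCode are illustrative choices, not any product's actual scheme:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent-hash ring sketch: keys and nodes hash onto the same ring;
// a key is owned by the first node at or after its hash position.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        // A few virtual nodes per physical node smooth the balance.
        for (int v = 0; v < 3; v++) {
            ring.put((node + "#" + v).hashCode(), node);
        }
    }

    public String nodeFor(String key) {
        // Walk clockwise: first ring position >= the key's hash, wrapping around.
        SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("node1");
        ring.addNode("node2");
        ring.addNode("node3");
        System.out.println("user:42 -> " + ring.nodeFor("user:42"));
    }
}
```

Adding or removing one node only remaps the keys in that node's arc of the ring, which is why rings beat modulo hashing for rebalancing.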
18. Birthdays, Collisions & Hashing Functions
• Birthday paradox
  – For N=21 people in a room, the probability that at least 2 of them share the same birthday is ~0.44.
• Collisions are real!
• An unbalanced HashMap behaves like a list: O(n) retrieval
• Chaining & linear probing
• Performance degrades with 80% table density
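The birthday figure is easy to verify directly from the standard product formula:

```java
// Birthday-paradox check: probability that at least two of n people share a
// birthday is 1 - (365/365)*(364/365)*...*((365-n+1)/365).
public class BirthdayParadox {
    static double collisionProbability(int n) {
        double pNoCollision = 1.0;
        for (int i = 0; i < n; i++) {
            pNoCollision *= (365.0 - i) / 365.0;
        }
        return 1.0 - pNoCollision;
    }

    public static void main(String[] args) {
        System.out.printf("n=21: %.3f%n", collisionProbability(21)); // ~0.444
        System.out.printf("n=23: %.3f%n", collisionProbability(23)); // ~0.507
    }
}
```

The same math is why a hash table's collision count climbs much faster than intuition suggests as density grows.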
19. Bloom Filter: in full bloom
• "Constant" time
• Size: compact
• False positives (never false negatives)
• Single lookup for a key in a file
• Deletion is not supported in the basic form
• Improvements
  – Counting Bloom filters
  – Bloomier filters
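A minimal Bloom filter illustrating the properties above; the bit-set size, probe count, and double-hashing scheme are illustrative choices, not taken from any particular cache:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes into a bit set. Membership tests
// can give false positives, never false negatives; plain deletion is unsupported.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int probe(String key, int i) {
        // Derive k probe positions from two base hashes (double hashing).
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(probe(key, i));
    }

    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(probe(key, i))) return false; // definitely absent
        }
        return true; // present, or a false positive
    }

    public static void main(String[] args) {
        BloomFilter bf = new BloomFilter(1024, 3);
        bf.add("row-key-1");
        System.out.println(bf.mightContain("row-key-1")); // true
    }
}
```

The "single lookup for a key in a file" point is exactly this: a negative answer skips the disk read entirely.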
20. How many nodes to get a 200G cache?
21. Imagine – John Lennon
22. How many nodes to get a 200G cache?
• Who needs a 200G cache?
  – Disk is the new tape!
• 200 nodes @ 1GB heap each, or
• 2 nodes @ 100GB heap each (plus overhead)
23. The devil's in the details
24. JVM in BigData Land! A few limits for scale
• Object overhead
  – The average enterprise collection has 3 elements!
  – Use byte[] and primitives where possible!
• Locks: synchronized
  – Can't use all my multi-cores!
  – java.util collections also hold locks
  – Use non-blocking collections!
• (De)serialization is expensive
  – Hampers object portability and cluster scalability
  – Use Avro, Thrift!
• Garbage collection
  – Can't throw memory at the problem!?
  – Mitigate, monitor, measure footprint
25. Tools
• What is the JVM doing:
  – dtrace, hprof, introscope, dynatrace, jconsole, visualvm, yourkit, azul zvision
• Invasive JVM observation tools:
  – bci, jvmti, jvmdi/pi agents, jmx, logging
• What is the OS doing:
  – dtrace, oprofile, vtune
• What are the network & disk doing:
  – Ganglia, iostat, lsof, netstat, nagios
26. Java Limits: Objects are not cheap!
• How many bytes for an 8-char String? (assume 32-bit)
  – A: 64 bytes, of which only 16 bytes are character data
  – String object: 16 bytes JVM overhead + 12 bytes of book-keeping fields + 4-byte pointer to the char[]
  – char[]: 16 bytes JVM overhead + 16 bytes of data
  – 31% overhead (varies with JVM)
• How many objects in an idle Tomcat instance?
27. Picking the right collection: Mozart or Bach?
• 100 elements of TreeMap<Double, Double>
  – TreeMap fixed overhead: 48 bytes; per-entry (TreeMap$Entry) overhead: 40 bytes; each Double: 16 bytes JVM overhead + 8 bytes of data
  – 82% overhead, 88 bytes constant cost per element
  – Enables updates while maintaining order
• double[], double[]
  – 2% overhead, amortized
  – [con: load-then-use]
• Sparse collections, empty collections, wrong collections for the problem
(*From one 32-bit JVM; varies with JVM architecture)
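The slide's double[], double[] alternative can be sketched as a sorted parallel-array lookup; `ParallelArrayMap` is a hypothetical name for illustration of the "load then use" pattern:

```java
import java.util.Arrays;

// Low-overhead alternative to TreeMap<Double, Double>: two parallel, sorted
// double[] arrays plus binary search. No per-entry objects, no boxing.
public class ParallelArrayMap {
    private final double[] keys;
    private final double[] values;

    // keys must already be sorted ascending ("load then use": no updates after build).
    public ParallelArrayMap(double[] keys, double[] values) {
        this.keys = keys;
        this.values = values;
    }

    public double get(double key) {
        int idx = Arrays.binarySearch(keys, key);
        if (idx < 0) throw new IllegalArgumentException("no such key: " + key);
        return values[idx];
    }

    public static void main(String[] args) {
        double[] k = {1.0, 2.0, 3.0};
        double[] v = {10.0, 20.0, 30.0};
        ParallelArrayMap m = new ParallelArrayMap(k, v);
        System.out.println(m.get(2.0)); // 20.0
    }
}
```

Lookups stay O(log n) like TreeMap's, but the 88-bytes-per-entry object cost collapses to 16 bytes of raw data.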
28. JEE is not cheap either!
Million objects (allocated / live): JBoss 5.1: 20 / 4; Apache Tomcat 6.0: 0.25 / 0.1

JBoss 5.1 (allocated):
Class name                       Size (B)        Count       Avg (B)
Total                            1,410,764,512   19,830,135  71.1
char[]                           423,372,528     4,770,424   88.7
byte[]                           347,332,152     1,971,692   176.2
int[]                            85,509,280      1,380,642   61.9
java.lang.String                 73,623,024      3,067,626   24
java.lang.Object[]               64,788,840      565,693     114.5
java.util.regex.Matcher          51,448,320      643,104     80
java.lang.reflect.Method         43,374,528      301,212     144
java.util.HashMap$Entry[]        27,876,848      140,898     197.9
java.util.TreeMap$Entry          22,116,136      394,931     56
java.util.HashMap$Entry          19,806,440      495,161     40
java.nio.HeapByteBuffer          17,582,928      366,311     48
java.nio.HeapCharBuffer          17,575,296      366,152     48
java.lang.StringBuilder          15,322,128      638,422     24
java.util.TreeMap$EntryIterator  15,056,784      313,683     48
java.util.ArrayList              11,577,480      289,437     40
java.util.HashMap                7,829,056       122,329     64
java.util.TreeMap                7,754,688       107,704     72

Apache Tomcat 6.0 (allocated):
Class name                       Size (B)        Count       Avg (B)
Total                            21,580,592      228,805     94.3
char[]                           4,215,784       48,574      86.8
byte[]                           3,683,984       5,024       733.3
Built-in VM methodKlass          2,493,064       16,355      152.4
Built-in VM constMethodKlass     1,955,696       16,355      119.6
Built-in VM constantPoolKlass    1,437,240       1,284       1,119.3
Built-in VM instanceKlass        1,078,664       1,284       840.1
java.lang.Class[]                922,808         45,354      20.3
Built-in VM constantPoolCacheK   903,360         1,132       798
java.lang.String                 753,936         31,414      24
java.lang.Object[]               702,264         8,118       86.5
java.lang.reflect.Method         310,752         2,158       144
short[]                          261,112         3,507       74.5
java.lang.Class                  255,904         1,454       176
int[][]                          184,680         2,032       90.9
java.lang.String[]               173,176         1,746       99.2
java.util.zip.ZipEntry           172,080         2,390       72
29. Another example: overhead in collections
30. Garbage Collection
• Pause times
  if stop_the_world_FullGC > ttl_of_node
    => failed requests; node repair
    => node is declared dead
• Allocation rate
  – New-object creation, insertion rate
• Live objects (residency)
  – If residency in heap > 50%, GC overheads dominate.
• Overhead: space, CPU cycles spent in GC
• 64-bit does not address pause times
  – Bigger is not better!
  – 40-50% increase in heap sizes for the same workloads.
31. Too many free parameters!!
Tune GC:
• Entropy is the number of flags it takes to tune GC.
• Workloads in the lab do not represent production.
• Fragile: the meaning of flags changes.
Solution:
• Ask your VM vendor to provide a one-flag solution.
• Azul's Pauseless GC (now in software)
⇒ Avoid OOM; configure node death on OOM
⇒ Azul's Cooperative-Memory (swap space for your JVM under spikes: no more OOM!)
32. Memory Fragmentation
• Fragmentation
  – Performance degrades over time.
  – Inducing a "Full GC" makes the problem go away.
  – Free memory that cannot be used.
• Reduce occurrence
  – Use a compacting collector.
  – Promote less often.
  – Use uniform-sized objects.
• Solution – unsolved
  – Use the latest CMS with CR 6631166.
  – Azul's Zing JVM & Pauseless GC
33. Sizing: Young Generation
• Should we set -Xms == -Xmx?
• Use -Xmn (fixed eden)
[Diagram: allocations {new Object();} land in eden; the survivor ratio sizes the survivor spaces; the tenuring threshold controls promotion by the JVM into the old generation]
34. Generations
• Don't promote too often!
  – Frequent promotion causes fragmentation.
• Size the generations
  – Min GC times are a function of the live set.
  – Old gen should host the steady state comfortably.
• Parallelize on multicores:
  – -XX:ParallelCMSThreads=4
  – -XX:ParallelGCThreads=4
• Avoid the CMS initiating heuristic
  – -XX:+UseCMSInitiatingOccupancyOnly
• Use concurrent collection for System.gc()
  – -XX:+ExplicitGCInvokesConcurrent
35. Memory Leaks
• Application takes all the memory you've got!
• Live heap shows a sawtooth.
• Eventually throws OOM.
Theory:
• Allocated, live heap, PermGen
Common sources:
• Finalizers, classloaders, ThreadLocal
36. synchronized: Amdahl's law trumps Moore's!
• Coarse-grained locks
• I/O under lock
• A stop signal on a highway
• java.util.concurrent does not mean no locks
• Non-blocking, lock-free, wait-free collections
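Amdahl's bound is worth computing: with serial fraction s, speedup on n cores is 1/(s + (1-s)/n), so even a 5% serial section under a coarse lock caps a 64-core machine around 15x.

```java
// Amdahl's law: with serial fraction s, speedup on n cores is 1/(s + (1-s)/n).
public class Amdahl {
    static double speedup(double serialFraction, int cores) {
        return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
    }

    public static void main(String[] args) {
        // 5% of the work under a coarse lock: ~15.4x on 64 cores, 20x ceiling ever.
        System.out.printf("5%% serial, 64 cores: %.1fx%n", speedup(0.05, 64));
        System.out.printf("5%% serial, infinite cores: max %.0fx%n", 1.0 / 0.05);
    }
}
```

This is the arithmetic behind "Amdahl's law trumps Moore's": adding cores cannot buy back time spent serialized.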
37. Locks: Distributed Caching
• Schemes
  – Optimistic, pessimistic
• Consistency
  – Eventual vs. ACID
• Contention, waits
• java.util.concurrent, critical sections
  – Use lock striping
• MVCC; lock-free, wait-free data structures (NBHM)
• Transactions are expensive
⇒ Reduce JTA abuse; set the right isolation levels.
38. Writes: monitors
39. UUID
Are you using UUID generation for messaging?
• java.util.UUID is slow
  – Static use leads to contention.
SecureRandom
• Uses /dev/urandom for seed initialization:
  -Djava.security.egd=file:/dev/urandom
• A PRNG without the file is at least 20%-40% better.
• Use TimeUUIDs where possible – much faster.
• JUG – java.uuid.generator
• http://github.com/cowtowncoder/java-uuid-generator
• http://jug.safehaus.org/
• http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.html
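A sketch of why time-based generation helps: java.util.UUID.randomUUID() funnels every thread through one shared SecureRandom, while a timestamp-plus-counter ID needs only an atomic increment. Note this time+counter scheme is a simplified illustration, not an RFC 4122 TimeUUID, and the node id is an assumed placeholder:

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Contrast: randomUUID() contends on a shared SecureRandom; a time+counter id
// (hypothetical scheme, not a standards-compliant TimeUUID) avoids that.
public class IdGen {
    private static final AtomicLong SEQ = new AtomicLong();
    private static final long NODE = 0x42L; // assumed per-node id

    // High word: millis + 20-bit sequence; low word: node id.
    static UUID timeBasedId() {
        long hi = (System.currentTimeMillis() << 20) | (SEQ.incrementAndGet() & 0xFFFFF);
        return new UUID(hi, NODE);
    }

    public static void main(String[] args) {
        System.out.println("random: " + UUID.randomUUID()); // contended under load
        System.out.println("time:   " + timeBasedId());
        System.out.println("time:   " + timeBasedId()); // sequence keeps ids distinct
    }
}
```

For real messaging, use a proper library like JUG (linked above) rather than hand-rolling; the point is only that time-ordered generation sidesteps the SecureRandom bottleneck.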
40. Towards Non-blocking High-scale Collections!
• Big array to hold data
• Concurrent writes via CAS & a finite state machine
  – No locks, no volatile
  – Much faster than locking under heavy load
  – Directly reach the main data array in 1 step
• Resize as needed
  – Copy the array to a larger array on demand.
  – Use the state machine to help copy.
  – "Mark" old array words to avoid missing late updates.
• Use Non-Blocking HashMap, Google Collections
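The CAS-into-a-big-array idea can be sketched with AtomicReferenceArray. This toy deliberately omits resizing and the copy state machine of the real NonBlockingHashMap; it only shows how a writer claims a slot without a lock:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch of "big array + CAS": writers claim a slot with compareAndSet
// instead of taking a lock. Toy version: fixed size, values only, no resize.
public class CasTable {
    private final AtomicReferenceArray<String> slots;

    public CasTable(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    // Linear probing: CAS null -> value claims the slot atomically.
    public boolean put(String value) {
        int start = Math.floorMod(value.hashCode(), slots.length());
        for (int i = 0; i < slots.length(); i++) {
            int idx = (start + i) % slots.length();
            if (value.equals(slots.get(idx))) return true;          // already present
            if (slots.compareAndSet(idx, null, value)) return true; // claimed the slot
        }
        return false; // table full: the real impl resizes via its state machine
    }

    public boolean contains(String value) {
        for (int i = 0; i < slots.length(); i++) {
            if (value.equals(slots.get(i))) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CasTable t = new CasTable(8);
        t.put("alpha");
        System.out.println(t.contains("alpha")); // true
    }
}
```

A losing CAS simply retries at the next slot; no thread ever blocks, which is what lets throughput keep climbing with thread count.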
41. Non-Blocking HashMap
[Chart: Azul Vega2, 768 CPUs; M-ops/sec vs. thread count (0-800) for 1K and 1M tables, comparing NonBlockingHashMap (NB) and ConcurrentHashMap (CHM) at 75% and 99% read ratios]
42. Inter-node communication
• TCP for mgmt & data: Infinispan
• TCP for mgmt, UDP for data: Coherence, Infinispan
• UDP for mgmt, TCP for data: Cassandra, Infinispan
• Instrumentation: EhCache/Terracotta
Bandwidth & latency considerations:
⇒ Ensure proper network configuration in the kernel.
⇒ Run datagram tests.
⇒ Limit the number of management nodes & nodes.
43. Example, Apache Cassandra
• Partition, ring, gateway, Bloom filters
• Gossip protocol
  – It's exponential (an epidemic algorithm).
• Failure detector
  – Accrual rate, phi
• Anti-entropy
  – Bringing replicas up to date
44. Coherence Communication Issues
45. Marshal Arts: Serialization/Deserialization
• java.io.Serializable is S.L.O.W
• Use "transient".
• jserial, avro, etc.
• + Google Protocol Buffers
• + PortableObjectFormat (Coherence)
• + JBossMarshalling
• + Externalizable + byte[]
• + Roll your own
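A minimal sketch of the "Externalizable + byte[]" and "transient" options above; the class and field names are illustrative. writeExternal/readExternal let you ship only the fields you choose, skipping default serialization's reflective field walk:

```java
import java.io.*;

// Externalizable sketch: write exactly the fields you need; mark fields that
// should not travel as transient.
public class CacheEntry implements Externalizable {
    private String key;
    private byte[] payload;
    private transient long lastAccess; // local book-keeping: not worth shipping

    public CacheEntry() {} // Externalizable requires a public no-arg constructor

    public CacheEntry(String key, byte[] payload) {
        this.key = key;
        this.payload = payload;
    }

    public String getKey() { return key; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(key);
        out.writeInt(payload.length);
        out.write(payload);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        key = in.readUTF();
        payload = new byte[in.readInt()];
        in.readFully(payload);
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(new CacheEntry("k1", new byte[] {1, 2, 3}));
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            CacheEntry back = (CacheEntry) ois.readObject();
            System.out.println(back.getKey()); // k1
        }
    }
}
```

For cross-language portability, the Avro/Thrift/Protocol Buffers routes above go further, replacing the Java object stream entirely.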
46. Serialization + Deserialization uBench
• http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
47. Count what is countable, measure what is measurable, and what is not measurable, make measurable. – Galileo
48. Latency: Where have all the millis gone?
• Moore's law amplifies bandwidth
  – Latencies are still lagging!
• Measure. 90th percentile. Look for consistency.
• => JMX is great! JMX is also very slow.
• Fewer nodes mean fewer MBeans!
• Monitor (network, memory, CPU); Ganglia
• Know thyself: application footprint, trend data.
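The percentile advice is simple enough to inline; a nearest-rank sketch with made-up latency samples shows why a single slow outlier barely moves the median yet dominates the tail:

```java
import java.util.Arrays;

// Report percentiles, not averages: one outlier hides in the mean but is
// exactly what your slowest users experience.
public class Percentile {
    // Nearest-rank percentile on a copy of the samples.
    static long percentile(long[] latenciesMicros, double p) {
        long[] sorted = latenciesMicros.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(rank, 0)];
    }

    public static void main(String[] args) {
        // Nine ~120us requests and one 5ms straggler (illustrative numbers).
        long[] micros = {120, 130, 110, 125, 118, 122, 121, 119, 5000, 117};
        System.out.println("p50: " + percentile(micros, 50)); // 120
        System.out.println("p90: " + percentile(micros, 90)); // 130
        System.out.println("p99: " + percentile(micros, 99)); // 5000
    }
}
```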
49. Optimization hinders evolution. – Alan Perlis
50. Q&A
References:
• Making Sense of Large Heaps, Nick Mitchell, IBM
• Oracle Coherence 3.5, Aleksandar Seovic
• Large Pages in Java: http://andrigoss.blogspot.com/2008/02/jvm-performance-tuning.html
• Patterns of Doom: http://3.latest.googtst23.appspot.com/
• Infinispan Demos: http://community.jboss.org/wiki/5minutetutorialonInfinispan
• RTView, Tom Lubinski: http://www.sl.com/pdfs/SL-BACSIG-100429-final.pdf
• Google Protocol Buffers: http://code.google.com/p/protobuf/
• Azul's Pauseless GC: http://www.azulsystems.com/technology/zing-virtual-machine
• Cliff Click's Non-Blocking HashMap: http://sourceforge.net/projects/high-scale-lib/
• JVM Serialization Benchmarks: http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
51. Cassandra links
• Werner Vogels, Eventually Consistent: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
• Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors"
• Avinash Lakshman: http://static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf
• Eric Brewer, CAP: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
• Tony Printezis, Charlie Hunt, JavaOne talk: http://www.scribd.com/doc/36090475/GC-Tuning-in-the-Java
• http://github.com/digitalreasoning/PyStratus/wiki/Documentation
• http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
• Cassandra on Cloud: http://www.coreyhulen.org/?p=326
• Cliff Click's Non-blocking HashMap: http://sourceforge.net/projects/high-scale-lib/
• Brian F. Cooper, Yahoo Cloud Storage Benchmark: http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
• www.riptano.com
52. Further questions, reach me:
Azul Systems: sris@azulsystems.com
Twitter: @srisatish, srisatish@riptano.com