Jvm goes big_data_sfjava
SF Java presentation: JVM goes to Big Data.
“Slowly yet surely the JVM is going to Big Data! In this fun-filled presentation we see what pieces of Java & JVM triumph or unravel in the battle for performance at high scale!”
Concurrency is the currency of scale on multi-core, and the new generation of collections and non-blocking hashmaps is well worth a deep dive. We take a quick look at next-gen serialization techniques as well as implementation pitfalls around UUID. The Achilles' heel of the JVM remains Garbage Collection: a deep dive into the internals of the memory model, common GC algorithms and their tuning knobs is always a big draw. EC2 & cloud present us with virtualized & uncharted territory for scaling the JVM.

We will leave some room for Q&A or fill it up with any asynchronous I/O that might queue up during the talk. A round of applause will be due to the various tools that are essential for Java performance debugging.

Published in: Technology

  1. JVM goes BigData
     srisatish.ambati AT gmail.com
     DataStax/OpenJDK
     4/12/2011
     @srisatish
  2. Motivation
     • A compendium of recent jvm scale issues while working with big data.
     • This talk will not have details on big data.
     • Thanks Sasa!
  3. Trail Ahead
     • synchronized
     • Non-blocking Hashmap
       - A state transition view
     • Collections
     • Serialization
     • UUID
     • Garbage Collection
       - The free parameters!
       - Generations, Promotion, Fragmentation
       - Offheap
     • Questions & asynchronous IO
  4. tools of trade
     • What the JVM is doing:
       - dtrace, hprof, introscope, jconsole, visualvm, yourkit, gchisto, zvision
     • Invasive JVM observation tools:
       - bci, jvmti, jvmdi/pi agents, logging
     • What the OS is doing:
       - dtrace, oprofile, vtune, perf
     • What the network/disk is doing:
       - ganglia, iostat, lsof, nagios, netstat, tcpdump
  5. synchronized: under the hood
       - Fast path for no-contention thin lock
       - Bias threads to lock or bulk revoke bias
       - Store free biasing
  6. JMM: happens-before, causality
     • Partial order
     • volatile
     • Piggybacking
     • FutureTask
     • BlockingQueue
     • jsr133
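The "piggybacking" item on the JMM slide can be illustrated with a small sketch. It assumes the usual reading of the term: a plain field is published by riding on the happens-before edge of a volatile write. The class and method names here are hypothetical and are not taken from the talk.

```java
final class PiggybackSketch {
    private int payload;              // plain field, deliberately not volatile
    private volatile boolean ready;   // volatile flag that publishes payload

    void publish(int value) {
        payload = value;   // (1) ordinary write
        ready = true;      // (2) volatile write: (1) happens-before any read that sees true
    }

    int awaitPayload() {
        while (!ready) {   // (3) volatile read, spins until (2) is observed
            Thread.yield();
        }
        return payload;    // (4) guaranteed by the JMM to see the value written in (1)
    }

    /** Runs one writer and one reader and returns what the reader observed. */
    static int demo() {
        PiggybackSketch box = new PiggybackSketch();
        int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = box.awaitPayload());
        reader.start();
        box.publish(42);
        try {
            reader.join();   // join() also gives a happens-before edge for seen[0]
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Without the volatile flag, the reader would have no guarantee of ever seeing the write to `payload`, or of leaving the loop at all.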
  7. * Java Concurrency in Practice, Brian Goetz
  8. java.util.concurrent also holds locks!
  9. Tomcat under concurrent load!
  10. Non-blocking collections: Amdahl's > Moore's!
      • State, Actions
        - key/value pairs!
        - get, put, delete, _resize
      • ByteArray to hold Data
      • Concurrent writes: using CAS
        - No locks, no volatile
        - Much faster than locking under heavy load
        - Directly reach main data array in 1 step
      • Resize as needed
        - Copy Array to a larger Array on demand. Post updates
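The "concurrent writes: using CAS" idea can be sketched with the JDK's own atomics. This is only an illustration of the compare-and-set state transition on one array slot. It is not the non-blocking hashmap discussed in the talk, and unlike that design `AtomicReferenceArray` does carry volatile semantics. The class name `CasSlotsSketch` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

final class CasSlotsSketch {
    private final AtomicReferenceArray<String> slots;

    CasSlotsSketch(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    /** Claims the slot only if it is still empty; returns whatever value ends up stored. */
    String putIfAbsent(int index, String value) {
        while (true) {
            String current = slots.get(index);
            if (current != null) {
                return current;                       // another writer already won
            }
            if (slots.compareAndSet(index, null, value)) {
                return value;                         // our CAS moved the slot: empty -> value
            }
            // CAS lost a race: loop, re-read the slot and decide again
        }
    }

    String get(int index) {
        return slots.get(index);
    }

    public static void main(String[] args) {
        CasSlotsSketch table = new CasSlotsSketch(4);
        System.out.println(table.putIfAbsent(1, "a"));
        System.out.println(table.putIfAbsent(1, "b"));
    }
}
```

No thread ever blocks here: a writer that loses the race simply observes the winner's value and returns.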
  11. Death & Taxes: Java Overheads!
      • Cost of an 8-char String?
          String:  8b hdr | 12b fields | 4b ptr
          char[]:  8b hdr | 4b len | 16b data | 4b pad
        A: 56 bytes, or a 7x blowup
      • Cost of 100-entry TreeMap<Double,Double>?
          48b TreeMap
          40b TreeMap$Entry | 16b Double | 16b Double
        A: 7248 bytes or a ~5x blowup
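The two answers on the overheads slide can be re-derived from the per-object sizes it lists. The sketch below just adds those numbers up. It assumes the slide's 32-bit-style layout, and the class and method names are hypothetical.

```java
final class OverheadSketch {
    /** String object (hdr + fields + ptr) plus its char[] (hdr + len + data + pad). */
    static int eightCharStringBytes() {
        int stringObject = 8 + 12 + 4;      // 24 bytes
        int charArray = 8 + 4 + 16 + 4;     // 32 bytes, 16 of them are the 8 chars
        return stringObject + charArray;
    }

    /** One TreeMap header plus, per entry, a TreeMap$Entry and two boxed Doubles. */
    static int treeMapBytes(int entries) {
        int perEntry = 40 + 16 + 16;        // 72 bytes per key/value pair
        return 48 + entries * perEntry;
    }

    public static void main(String[] args) {
        System.out.println(eightCharStringBytes());   // 8 chars of payload
        System.out.println(treeMapBytes(100));        // 100 * 2 * 8 = 1600 bytes of payload
    }
}
```

Against payloads of 8 bytes and 1600 bytes, these totals give the slide's 7x and roughly 5x (about 4.5x) blowups.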
  12. yourkit: memory profile
  13. Which collection: Mozart or Bach?
      • Concurrency:
        - Non-blocking HashMap
        - Google Collections
      • Overheads
        - Watch out for per-element costs!
        - Primitives can be hard to manage!
      • Sparse collections
        - Average collection size in enterprise is ~3
  14. serializable
      • java.io.Serializable is S.L.O.W
      • True to platform
        - Use “transient”
        - ObjectStreamField[]
      • Avro
      • Google Protocol Buffers
      • Externalizable + byte[]
      • Roll your own
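Two of the options on the serialization slide, "transient" and "Externalizable + byte[]", can be shown with the JDK alone. The sketch below is a minimal illustration, not a benchmark. `Session`, `Point` and the helper methods are hypothetical names.

```java
import java.io.*;

final class SerializationSketch {
    /** Default serialization, with one field opted out via transient. */
    static final class Session implements Serializable {
        private static final long serialVersionUID = 1L;
        final String user;
        transient String cachedToken;   // skipped on write, null after read

        Session(String user, String cachedToken) {
            this.user = user;
            this.cachedToken = cachedToken;
        }
    }

    /** Externalizable: the class writes and reads its own fields explicitly. */
    static final class Point implements Externalizable {
        int x;
        int y;

        public Point() { }              // public no-arg constructor is required

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            x = in.readInt();
            y = in.readInt();
        }
    }

    static byte[] toBytes(Object value) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Session s = (Session) fromBytes(toBytes(new Session("ann", "secret")));
        System.out.println(s.user + " " + s.cachedToken);
    }
}
```

With Externalizable the stream carries only what `writeExternal` emits plus a class descriptor, which is the usual reason it is lighter than default serialization.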
  15. ser+deser: smaller is better
      https://github.com/eishay/jvm-serializers.git
  16. avro
      • Schema
        - No per datum overheads
        - Optional code gen
      • Types are runtime
      • Untagged data
      • No manually-assigned field Ids
      Cons:
      • Schema mismatches
      • Runtime only checks
  17. google-proto-buffer
      • Define message format in .proto file
      • All data in key/value pairs
      • Generate sources
      • .builder for each class with getter/setter
  18. thrift
      • Type, Transport, Protocol, Version, Processors
      • Separation of structure from protocol & transport
      • TCompactProtocol, etc
        - tag/data, compression
      • TSocket, TFileTransport, etc
      • colocated clients & servers
  19. UUID
      java.util.UUID is slow
        ● dominated by sha_transform costs
        ● Leach-Salz (128-bit)
      Turns out that the default PRNG (via SecureRandom) uses /dev/urandom for seed initialization
          -Djava.security.egd=file:/dev/urandom
        ● PRNG without file is at least 20%-40% better.
      Use TimeUUIDs where possible: much faster
      Alternatives: JUG (java.uuid.generator), com.eaio.uuid: ~10x faster
        http://github.com/cowtowncoder/java-uuid-generator
        http://jug.safehaus.org/
        http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.htm
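To make "use TimeUUIDs" concrete, here is a minimal sketch of how a version 1 (time-based) UUID is laid out. It uses only `java.util.UUID` accessors and involves no SecureRandom. It is not the JUG or com.eaio implementation: the class name is hypothetical, and a real generator must also guarantee uniqueness within one millisecond and pick a proper node id.

```java
import java.util.UUID;

final class TimeUuidSketch {
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
    static final long UUID_EPOCH_OFFSET = 0x01B21DD213814000L;

    /** Packs a timestamp, clock sequence and node id into the version 1 layout. */
    static UUID timeBased(long unixMillis, int clockSeq, long node) {
        long ts = unixMillis * 10_000L + UUID_EPOCH_OFFSET;    // 60-bit count of 100-ns ticks
        long timeLow = ts & 0xFFFFFFFFL;
        long timeMid = (ts >>> 32) & 0xFFFFL;
        long timeHi = (ts >>> 48) & 0x0FFFL;
        long msb = (timeLow << 32) | (timeMid << 16) | 0x1000L | timeHi;   // 0x1000 = version 1
        long lsb = 0x8000000000000000L                          // variant bits "10" (Leach-Salz)
                | ((long) (clockSeq & 0x3FFF) << 48)
                | (node & 0xFFFFFFFFFFFFL);
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        // Node id and clock sequence are arbitrary illustrative values.
        System.out.println(timeBased(System.currentTimeMillis(), 7, 0x0123456789ABL));
    }
}
```

Building the id is a handful of shifts and masks on values the process already has. This fits the slide's point that time-based ids avoid the entropy-pool work behind `UUID.randomUUID()`.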
  20. Leach-Salz UUID
          /**
           * Returns a {@code String} object representing this {@code UUID}.
           *
           * <p> The UUID string representation is as described by this BNF:
           * <blockquote><pre>
           * {@code
           * UUID                   = <time_low> "-" <time_mid> "-"
           *                          <time_high_and_version> "-"
           *                          <variant_and_sequence> "-"
           *                          <node>
           * time_low               = 4*<hexOctet>
           * time_mid               = 2*<hexOctet>
           * time_high_and_version  = 2*<hexOctet>
           * variant_and_sequence   = 2*<hexOctet>
           * node                   = 6*<hexOctet>
           * hexOctet               = <hexDigit><hexDigit>
           * hexDigit               =
           *       "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
           *       | "a" | "b" | "c" | "d" | "e" | "f"
           *       | "A" | "B" | "C" | "D" | "E" | "F"
           * }</pre></blockquote>
           *
           * @return  A string representation of this {@code UUID}
           */
          public String toString() {
              return (digits(mostSigBits >> 32, 8) + "-" +
                      digits(mostSigBits >> 16, 4) + "-" +
                      digits(mostSigBits, 4) + "-" +
                      digits(leastSigBits >> 48, 4) + "-" +
                      digits(leastSigBits, 12));
          }
  21. PerfTop:    1485 irqs/sec  kernel:18.6%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)

          samples   pcnt  function                                                          DSO
          1882.00  26.3%  intel_idle                                                        [kernel.kallsyms]
          1678.00  23.5%  os::javaTimeMillis()                                              libjvm.so
           382.00   5.3%  SpinPause                                                         libjvm.so
           335.00   4.7%  Timer::ImplTimerCallbackProc()                                    libvcllx.so
           291.00   4.1%  gettimeofday                                                      /lib/libc-2.12.1.so
           268.00   3.7%  hpet_next_event                                                   [kernel.kallsyms]
           254.00   3.6%  ParallelTaskTerminator::offer_termination(TerminatorTerminator*)  libjvm.so

      PerfTop:    1656 irqs/sec  kernel:59.5%  exact:  0.0% [1000Hz cycles],  (all, 8 CPUs)

          samples   pcnt  function                                                          DSO
          6980.00  38.5%  sha_transform                                                     [kernel.kallsyms]
          2119.00  11.7%  intel_idle                                                        [kernel.kallsyms]
          1382.00   7.6%  mix_pool_bytes_extract                                            [kernel.kallsyms]
           437.00   2.4%  i8042_interrupt                                                   [kernel.kallsyms]
           416.00   2.3%  hpet_next_event                                                   [kernel.kallsyms]
           390.00   2.2%  extract_buf                                                       [kernel.kallsyms]
           376.00   2.1%  ThreadInVMfromNative::~ThreadInVMfromNative()                     libjvm.so
           321.00   1.8%  T.3542                                                            libjvm.so
           298.00   1.6%  __ticket_spin_lock                                                [kernel.kallsyms]
           296.00   1.6%  Timer::ImplTimerCallbackProc()                                    libvcllx.so
           255.00   1.4%  Unsafe_GetInt                                                     libjvm.so
  22. summary
      • Time-based UUIDs vs. UUIDs: use ~4 times less kernel time on creation!
      • No SHA library calls!
      • optimized toString()
      • Much faster than standard java.util.UUID
        - Better instructions per clock as well.
      • If on EC2: Watch out for non-cacheable file access to /dev/urandom!
  23. String theory of Java!
      • byte[] vs. char[]
      • If ver > jdk16u21 try -XX:+UseCompressedStrings
      • Append performance (gc) differs: Strings vs. StringBuffers
      • com.google.common.base.Joiner
        - Join text for cheap,
        - skipNulls or useForNulls()
      • com.google.common.base.Splitter
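The append-performance point can be sketched with the JDK alone: one `StringBuilder` reused across the loop instead of repeated `String` concatenation, with nulls skipped in the spirit of Joiner's `skipNulls`. This is a stand-in, not Guava's implementation, and `JoinSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;

final class JoinSketch {
    /** Joins the non-null parts with the separator, using a single growing buffer. */
    static String joinSkippingNulls(String separator, List<String> parts) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (String part : parts) {
            if (part == null) {
                continue;               // skip nulls rather than appending "null"
            }
            if (!first) {
                sb.append(separator);
            }
            sb.append(part);
            first = false;
        }
        return sb.toString();           // one final String instead of one per iteration
    }

    public static void main(String[] args) {
        System.out.println(joinSkippingNulls(",", Arrays.asList("a", null, "b")));
    }
}
```

Concatenating with `+` inside the loop would allocate a fresh String (and backing array) on every pass, which is the garbage the slide warns about.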
  24. “Null References: A billion dollar mistake” - C.A.R. Hoare
      “I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.” - qconlondon, 09
  25. Best Practices: Garbage Collection
  26. verbose:gc
      GC logs are cheap even in production
          -Xloggc:gc.log
          -XX:+PrintGCDetails
          -XX:+PrintGCTimeStamps
          -XX:+PrintTenuringDistribution
      A bit expensive/obscure ones:
          -XX:PrintFLSStatistics=2
          -XX:PrintCMSStatistics=1
          -XX:+PrintCMSInitiationStatistics
          -XX:+PrintFLSCensus
  27. Three free parameters
      • Allocation Rate: your workload!
      • Size: defines runway!
        - Live Set, memory
      • Pause times:
        - Stoppages!
  28. Four free parameters
      • Allocation Rate: your application load!
      • Size: defines runway!
        - Live Set, system memory
      • Pause times:
        - Stoppages!
      • (fourth: Overheads of GC: Space & CPU.)
  29. Part I: Sizing
      • to be -Xmx == -Xms or not?
      • Young generation: Use -Xmn for predictable performance
      (Diagram labels: new Object(), eden, survivor ratio, survivor spaces, promotion, Tenuring Threshold, old gen, jvm allocates)
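One way to check what the sizing flags actually produced is to ask the running JVM for its heap pools. The sketch below uses the standard `java.lang.management` API. Pool names depend on the collector in use, and `HeapPoolsSketch` is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class HeapPoolsSketch {
    /** Returns one line per heap pool (eden, survivor, old gen, ...) with its limits. */
    static List<String> heapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue;                          // skip code cache, metaspace/perm, etc.
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue;                          // pool is no longer valid
            }
            // getMax() is -1 when the pool has no defined maximum
            lines.add(pool.getName() + " committed=" + usage.getCommitted()
                    + " max=" + usage.getMax());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("max heap = " + Runtime.getRuntime().maxMemory());
        for (String line : heapPools()) {
            System.out.println(line);
        }
    }
}
```

Running it with different `-Xmx`, `-Xms` and `-Xmn` values shows how the young and old pools are carved out of the heap.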
  30. Part II: Pick a collector!
      • Serial GC: Serial New + Serial Old
      • Parallel GC (default): Parallel Scavenge + Serial Old
      • UseParallelOldGC: Parallel Scavenge + Parallel Old
      • UseConcMarkSweepGC: ParNew, CMS Old, Serial Old
      • G1: experimental
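To confirm which young/old collector pair a JVM actually picked, the standard `GarbageCollectorMXBean` API can be queried at runtime. A minimal sketch, with the hypothetical class name `CollectorsSketch`:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class CollectorsSketch {
    /** Names of the collectors registered in this JVM, typically one young and one old. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start; -1 means "not available".
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```

On HotSpot the names reflect the chosen pair, for example "ParNew" with "ConcurrentMarkSweep" under CMS, or "PS Scavenge" with "PS MarkSweep" under the parallel collector.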
  31. Reading GC logs: a topic/tool
      • Full GC is STW
      • Initial Mark, Rescan/WeakRef/Remark are STW
      • Look for promotion failures
      • Look for concurrent mode failures
  32. ...
      995.330: [CMS-concurrent-mark: 0.952/1.102 secs] [Times: user=3.69 sys=0.54, real=1.10 secs]
      995.330: [CMS-concurrent-preclean-start]
      995.618: [CMS-concurrent-preclean: 0.279/0.287 secs] [Times: user=0.90 sys=0.20, real=0.29 secs]
      995.618: [CMS-concurrent-abortable-preclean-start]
      995.695: [GC 995.695: [ParNew (promotion failed)
      Desired survivor size 41943040 bytes, new threshold 1 (max 1)
      - age   1:   29826872 bytes,   29826872 total
      : 720596K->703760K(737280K), 0.4710410 secs]996.166: [CMS996.317: [CMS-concurrent-abortable-preclean: 0.218/0.699 secs] [Times: user=1.39 sys=0.10, real=0.70 secs]
       (concurrent mode failure): 4100132K->784070K(5341184K), 4.7478300 secs] 4780154K->784070K(6078464K), [CMS Perm : 17033K->17014K(28400K)], 5.2191410 secs] [Times: user=5.70 sys=0.01, real=5.22 secs]
      ...
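The advice to "look for promotion failures" and "look for concurrent mode failures" can be mechanized with a few lines of text scanning. The sketch below assumes log lines in the format shown on this slide. The class and method names are hypothetical, and it is no substitute for tools such as GCHisto.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcLogScanSketch {
    // Matches the wall-clock part of "[Times: user=... sys=..., real=5.22 secs]".
    private static final Pattern REAL_SECS = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** Counts the lines that contain a marker such as "promotion failed". */
    static int countContaining(List<String> lines, String marker) {
        int count = 0;
        for (String line : lines) {
            if (line.contains(marker)) {
                count++;
            }
        }
        return count;
    }

    /** Largest "real=" value seen in the log, in seconds (0.0 if none is present). */
    static double maxRealSeconds(List<String> lines) {
        double max = 0.0;
        for (String line : lines) {
            Matcher m = REAL_SECS.matcher(line);
            while (m.find()) {
                max = Math.max(max, Double.parseDouble(m.group(1)));
            }
        }
        return max;
    }

    public static void main(String[] args) throws java.io.IOException {
        // Assumes the log was written with -Xloggc:gc.log; pass another path as args[0].
        String path = args.length > 0 ? args[0] : "gc.log";
        List<String> lines = java.nio.file.Files.readAllLines(java.nio.file.Paths.get(path));
        System.out.println("promotion failed: " + countContaining(lines, "promotion failed"));
        System.out.println("concurrent mode failure: " + countContaining(lines, "concurrent mode failure"));
        System.out.println("max real secs: " + maxRealSeconds(lines));
    }
}
```

The largest "real=" value includes concurrent phases, so a true pause analysis should look only at the stop-the-world entries.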
  33. Tuning CMS
      • Don't promote too often!
        - Frequent promotion causes fragmentation
        - TenuringThreshold (avoid never tenure)
      • Size the generations
        - Min GC times are a function of Live Set
        - Old Gen should host steady state comfortably
      • Avoid CMS Initiating heuristic
          -XX:+UseCMSInitiatingOccupancyOnly
      • Use Concurrent for System.gc()
          -XX:+ExplicitGCInvokesConcurrent
  34. GC Threads
      • Parallelize on multicores
          -XX:ParallelGCThreads=4
          (default: derived from # of cpus on system: n for n <= 8, else 8 + (n - 8) * 5/8)
          -XX:ParallelCMSThreads=4
          (default: derived from # of parallelgcthreads)
      • Strategy A:
        - Tune min gcs & let appl data in eden
  35. Did someone ask about defaults?
          if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
            assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
            // For very large machines, there are diminishing returns
            // for large numbers of worker threads.  Instead of
            // hogging the whole system, use a fraction of the workers for every
            // processor after the first 8.  For example, on a 72 cpu machine
            // and a chosen fraction of 5/8
            // use 8 + (72 - 8) * (5/8) == 48 worker threads.
            unsigned int ncpus = (unsigned int) os::active_processor_count();
            return (ncpus <= switch_pt) ?
                   ncpus :
                   (switch_pt + ((ncpus - switch_pt) * num) / den);
          } else {
            return ParallelGCThreads;
          }
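The default above is easy to evaluate for a given machine. Below is a small Java port of the same expression, assuming `switch_pt = 8`, `num = 5` and `den = 8` as the HotSpot comment describes ("first 8", "fraction of 5/8"). The class name is hypothetical.

```java
final class GcThreadsSketch {
    /** Mirrors the slide's formula: all CPUs up to 8, then 5/8 of each additional CPU. */
    static int defaultParallelGcThreads(int ncpus) {
        final int switchPt = 8;   // assumed from "after the first 8"
        final int num = 5;        // assumed from "fraction of 5/8"
        final int den = 8;
        return (ncpus <= switchPt)
                ? ncpus
                : switchPt + ((ncpus - switchPt) * num) / den;   // integer division, as in the C++
    }

    public static void main(String[] args) {
        int ncpus = Runtime.getRuntime().availableProcessors();
        System.out.println(ncpus + " cpus -> " + defaultParallelGcThreads(ncpus) + " GC threads");
    }
}
```

For the 72-CPU example in the comment this yields 48 threads, and a 16-CPU box gets 13.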
  36. Fragmentation
      • Performance degrades over time
      • Inducing “Full GC” makes problem go away
      • Free memory that cannot be used
        - Round off errors
      • Reduce occurrence
        - Use a compacting collector
        - Promote less often
        - Use uniform sized objects
  37. Not enough large contiguous space for promotion
      • Small objects still can fit in the holes!
      • Compaction: stop the world.
      • Unsolved on Oracle/Sun Hotspot
      • Azul Systems Pauseless JVM.
  38. JRockit Mission Control
  39. Example
      • Application suddenly transitions to back-to-back full gcs.
      • Cannot use free mem: too many holes!
  40. Tools
      • GCHisto
      • jconsole
      • VisualVM/VisualGC
      • Logs
      • Thread dumps
      • yourkit memory profile, snapshots
  41. GCSpy
  42. Gone 0xff the heap !!
      • ByteBuffer.allocateDirect(16 * 1024 * 1024)
      • Also can be mapped memory of a file region
      • Store long-lived objects outside jvm
      • Managed by native i/o ops.
      • JNA: dynamically load & call native libraries without compile time decl like JNI
      • Works for limited use cases in the lab.
        - Ex: Terracotta, Hbase, Cassandra
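The `ByteBuffer.allocateDirect` call on this slide can be exercised with a few lines. The sketch below allocates the same 16 MB direct buffer and stores a long outside the Java heap. `OffHeapSketch` and its methods are hypothetical names, and a real off-heap store would add its own layout and lifecycle management.

```java
import java.nio.ByteBuffer;

final class OffHeapSketch {
    /** Allocates 16 MB of native memory that the garbage collector does not scan or move. */
    static ByteBuffer allocate() {
        return ByteBuffer.allocateDirect(16 * 1024 * 1024);
    }

    /** Writes a long at an absolute byte offset and reads it back. */
    static long roundTrip(ByteBuffer buffer, int offset, long value) {
        buffer.putLong(offset, value);   // absolute put: does not change the buffer position
        return buffer.getLong(offset);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = allocate();
        System.out.println(buffer.isDirect() + " " + buffer.capacity());
        System.out.println(roundTrip(buffer, 0, 123456789L));
    }
}
```

Only the small `ByteBuffer` wrapper lives on the heap. The 16 MB behind it is released only after that wrapper is collected, which is why de-allocation needs care.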
  43. Gone 0xff the heap ?
      Issues to consider:
      • No clear api to de-allocate from this region
        ● See jbellis patch to JNA-179 for FreeableBuffer
      • Object cleanup relegated to finalization
      • Single finalizer thread, Bug ID: 4469299
      • Behind WeakReference processing in jdk16u21
      Workaround:
      • -XX:MaxDirectMemorySize=<size>
      • Manually trigger System.gc() to avoid “leak”
  44. Virtually there!
      • Ballooning driver for Memory: Disable it!
      • Time (TSC) issue! It's relative!
      • Scheduling when # of threads > # of vcpus..
        - Tickless _nohz kernel
      • GC Thread starvation = STW pauses
      • large ec2 instances are not all equal..
      • DirectPathIO & vt-d, rvi: Watch out for Sockets!
      • Tools: Performance counters still not virtualized!
  45. summary
      • JVM is still the most popular platform for deployment for the new languages!
      • JVM heartburn around scale!
        - Serialization
        - UUID
        - Object overhead
        - Garbage Collection
        - Hypervisor
  46. References
      • Chris Wimmer, http://wikis.sun.com/display/HotSpotInternals/Synchronization
      • Russell & Detlefs, http://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf
      • Google Protocol Buffers, http://code.google.com/p/protobuf
      • Thrift, http://incubator.apache.org/thrift/static/thrift-20070401.pdf
      • Leach-Salz Variant of UUID, http://www.upnp.org/resources/draft-leach-uuids-guids-00.txt
      • Hans Boehm, http://www.hpl.hp.com/personal/Hans_Boehm/gc/complexity.html
      • Brian Goetz, JSR-133, http://www.ibm.com/developerworks/java/library/j-jtp03304/
      • GCSpy, http://www.cs.kent.ac.uk/projects/gc/gcspy/
      • Understanding GC logs, http://blogs.sun.com/poonam/entry/understanding_cms_gc_logs
      • Cliff Click, http://sourceforge.net/projects/high-scale-lib/