jvm goes to big data
 


Invited Netflix talk: JVM issues in the age of scale! We take an under-the-hood look at Java locking, memory model, overheads, serialization, UUID, GC tuning, CMS, ParallelGC, java.


    jvm goes to big data: Presentation Transcript

    • JVM goes BigData srisatish.ambati AT gmail.com DataStax/OpenJDK 2/28/2011 @srisatish
    • Motivation
      • A compendium of recent jvm scale issues while working with big data.
      • This talk will not have details on big data.
      • Thanks Sid!
    • Trail Ahead
      • synchronized
      • Non-blocking Hashmap
      •      - A state transition view
      • Collections
      • Serialization
      • UUID
      • Garbage Collection
      •      - The free parameters!
      •      - Generations, Promotion, Fragmentation
      •      - Offheap
      • Questions & asynchronous IO
    • tools of trade
      • What the JVM is doing:
        • dtrace, hprof, introscope, jconsole, visualvm, yourkit, gchisto, zvision
      • Invasive JVM observation tools:
        • bci, jvmti, jvmdi/pi agents, logging
      • What the OS is doing:
        • dtrace, oprofile, vtune, perf
      • What the network/disk is doing:
        • ganglia, iostat, lsof, nagios, netstat, tcpdump
    • synchronized
      • under the hood
        • Fast path for no-contention thin lock
        • Bias threads to lock or bulk revoke bias
        • Store free biasing
    • JMM: happens-before, causality
      • Partial order
      • volatile
      • Piggybacking
      • FutureTask
      • BlockingQueue
      • jsr133
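The "piggybacking" bullet above can be sketched as follows: a plain (non-volatile) field is made visible to another thread by writing it before a volatile write and reading it after the matching volatile read. This is a minimal illustration under the JSR-133 happens-before rules; the class and method names (`Piggyback`, `publish`, `awaitPayload`) are hypothetical and not from the talk.

```java
class Piggyback {
    private int payload;            // plain field, deliberately not volatile
    private volatile boolean ready; // volatile write/read pair creates the happens-before edge

    void publish(int value) {
        payload = value;  // ordinary write...
        ready = true;     // ...made visible by the volatile write that follows it
    }

    /** Spins until the volatile flag is seen, then returns the piggybacked payload. */
    int awaitPayload() {
        while (!ready) {
            Thread.yield();
        }
        return payload;   // ordered after the volatile read, so it sees the published value
    }

    /** Publishes 42 from the calling thread and reads it from a second thread. */
    static int demo() {
        final Piggyback p = new Piggyback();
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> seen[0] = p.awaitPayload());
        reader.start();
        p.publish(42);
        try {
            reader.join(); // join() also gives a happens-before edge for seen[0]
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Without the volatile flag, the reader would have no guarantee of ever seeing either write.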
    • java.util.concurrent also holds locks!
    • Tomcat under concurrent load!
    • Non-blocking collections: Amdahl's > Moore's!
      • State, Actions – key/value pairs!
      • get, put, delete, _resize
      • ByteArray to hold Data
      • Concurrent writes: using CAS
      • No locks, no volatile
      • Much faster than locking under heavy load
      • Directly reach main data array in 1 step
      • Resize as needed
      • Copy Array to a larger Array on demand. Post updates
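As a rough illustration of "concurrent writes using CAS" into a single flat array, here is a minimal fixed-size sketch. It is not Cliff Click's implementation: it has no resize and no delete, and it relies on `AtomicReferenceArray`, which does use volatile semantics internally (the slide's design avoids even that). The class name `CasTable` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

class CasTable {
    // Keys and values interleaved in one array: [key0, val0, key1, val1, ...]
    private final AtomicReferenceArray<Object> slots;
    private final int capacity; // assumed to be a power of two

    CasTable(int capacityPow2) {
        capacity = capacityPow2;
        slots = new AtomicReferenceArray<>(2 * capacityPow2);
    }

    /** Lock-free put; returns the previous value or null. Throws when the table is full. */
    Object put(Object key, Object value) {
        int idx = key.hashCode() & (capacity - 1);
        for (int probes = 0; probes < capacity; probes++) {
            int k = 2 * idx;
            Object existing = slots.get(k);
            if (existing == null) {
                // Try to claim the empty key slot; on a lost race, re-read the winner's key.
                existing = slots.compareAndSet(k, null, key) ? key : slots.get(k);
            }
            if (key.equals(existing)) {
                while (true) { // CAS retry loop on the value slot
                    Object old = slots.get(k + 1);
                    if (slots.compareAndSet(k + 1, old, value)) {
                        return old;
                    }
                }
            }
            idx = (idx + 1) & (capacity - 1); // linear probe to the next slot
        }
        throw new IllegalStateException("table full (this sketch does not resize)");
    }

    /** Returns the value for key, or null if absent. */
    Object get(Object key) {
        int idx = key.hashCode() & (capacity - 1);
        for (int probes = 0; probes < capacity; probes++) {
            Object existing = slots.get(2 * idx);
            if (existing == null) {
                return null;
            }
            if (key.equals(existing)) {
                return slots.get(2 * idx + 1);
            }
            idx = (idx + 1) & (capacity - 1);
        }
        return null;
    }
}
```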
    • Death & Taxes: Java Overheads!
      • Cost of an 8-char String?
        • String object: 8b hdr + 12b fields + 4b ptr + 4b pad
        • char[]: 8b hdr + 4b len + 16b data
        • A: 56 bytes, or a 7x blowup
      • Cost of 100-entry TreeMap<Double,Double> ?
        • 48b TreeMap + 100 × (40b TreeMap$Entry + 16b Double + 16b Double)
        • A: 7248 bytes or a ~5x blowup
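The arithmetic behind the two answers can be checked with a short sketch. The per-object sizes are the slide's figures, not measurements; the class and method names are hypothetical, and alignment padding for other string lengths is ignored.

```java
class FootprintMath {
    /** Bytes for a String of the given length, using the slide's per-object figures. */
    static int stringBytes(int chars) {
        int stringObject = 8 + 12 + 4 + 4; // header + fields + char[] pointer + padding
        int charArray = 8 + 4 + 2 * chars; // header + length + 2 bytes per char
        return stringObject + charArray;
    }

    /** Bytes for a TreeMap<Double,Double> with the given number of entries. */
    static int treeMapBytes(int entries) {
        int perEntry = 40 + 16 + 16;       // TreeMap$Entry + boxed key + boxed value
        return 48 + entries * perEntry;    // 48 bytes for the TreeMap object itself
    }
}
```

For 8 chars this gives 56 bytes against 8 bytes of single-byte text (7x); for 100 entries it gives 7248 bytes against 1600 bytes of raw doubles (about 4.5x, the slide's "~5x").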
    • yourkit: memory profile
    • Which collection: Mozart or Bach?
      • Concurrency:
      • Non-blocking HashMap
      • Google Collections
      • Overheads
      • Watch out for per-element costs!
      • Primitives can be hard to manage!
      • Sparse collections
      • Average collection size in enterprise is ~3
    • serializable
      • java.io.Serializable is S.L.O.W
      • True to platform
      • Use “transient”
      • ObjectStreamField[] (serialPersistentFields)
      • Avro
      • Google Protocol Buffers
      • Externalizable + byte[]
      • Roll your own
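A minimal sketch of the "Externalizable" option: the class writes its own fields explicitly instead of relying on default reflective serialization. The `Reading` class and its fields are hypothetical, chosen only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

class Reading implements Externalizable {
    long id;
    double value;

    public Reading() { }  // Externalizable requires a public no-arg constructor

    Reading(long id, double value) {
        this.id = id;
        this.value = value;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeLong(id);      // fields are written by hand, in a fixed order
        out.writeDouble(value);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        id = in.readLong();     // must be read back in the same order
        value = in.readDouble();
    }
}

class ExternalizableDemo {
    /** Serializes to a byte[] and back, returning the reconstructed object. */
    static Reading roundTrip(Reading r) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(r);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Reading) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```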
    • ser+deser smaller is better https://github.com/eishay/jvm-serializers.git
    • avro
      • Schema
        • No per datum overheads
        • Optional code gen
      • Types are runtime
      • Untagged data
      • No manually-assigned field Ids
      • Cons:
      • Schema mismatches
      • Runtime only checks
    • google-proto-buffer
      • Define message format in .proto file
      • All data in key/value pairs
      • Generate sources
      • .builder for each class with getter/setter
    • thrift
      • Type, Transport, Protocol, Version, Processors
      • Separation of structure from protocol & transport
      • TCompactProtocol, etc
        • tag/data, compression
      • TSocket, TFileTransport, etc
      • colocated clients & servers
    • UUID
      • java.util.UUID is slow
        • dominated by sha_transform costs
        • Leach-Salz variant (128-bit)
      • Turns out that the default PRNG (via SecureRandom) uses /dev/urandom for seed initialization
      • -Djava.security.egd=file:/dev/urandom
          • PRNG without file is at least 20%-40% better.
      • Use TimeUUIDs where possible – much faster
      • Alternatives: JUG – java.uuid.generator, com.eaio.uuid
      • ~10x faster
      • http://github.com/cowtowncoder/java-uuid-generator
      • http://jug.safehaus.org/
      • http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.htm
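To show why time-based UUIDs need no SecureRandom or SHA work, here is a minimal sketch that packs a timestamp, clock sequence and node id into the version 1 layout and hands the two longs to `java.util.UUID`. It is not the JUG or com.eaio implementation; `TimeUuidSketch` and its parameters are hypothetical, and it does not guarantee uniqueness for several UUIDs created within the same millisecond, which a real generator must handle.

```java
import java.util.UUID;

class TimeUuidSketch {
    // 100-ns ticks between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01)
    static final long EPOCH_OFFSET_TICKS = 0x01B21DD213814000L;

    /** Packs a 60-bit tick count, 14-bit clock sequence and 48-bit node into a version 1 UUID. */
    static UUID fromTicks(long ticks, int clockSeq, long node) {
        long msb = ((ticks & 0xFFFFFFFFL) << 32)          // time_low
                 | (((ticks >>> 32) & 0xFFFFL) << 16)     // time_mid
                 | 0x1000L                                // version 1
                 | ((ticks >>> 48) & 0x0FFFL);            // time_high
        long lsb = 0x8000000000000000L                    // Leach-Salz variant bits "10"
                 | ((clockSeq & 0x3FFFL) << 48)           // clock sequence
                 | (node & 0xFFFFFFFFFFFFL);              // node id
        return new UUID(msb, lsb);
    }

    /** Builds a UUID from the current wall clock (millisecond resolution only). */
    static UUID now(int clockSeq, long node) {
        long ticks = System.currentTimeMillis() * 10_000L + EPOCH_OFFSET_TICKS;
        return fromTicks(ticks, clockSeq, node);
    }
}
```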
    • Leach-Salz UUID
      • /**
      • * Returns a {@code String} object representing this {@code UUID}.
      • *
      • * <p> The UUID string representation is as described by this BNF:
      • * <blockquote><pre>
      • * {@code
      • * UUID = <time_low> "-" <time_mid> "-"
      • * <time_high_and_version> "-"
      • * <variant_and_sequence> "-"
      • * <node>
      • * time_low = 4*<hexOctet>
      • * time_mid = 2*<hexOctet>
      • * time_high_and_version = 2*<hexOctet>
      • * variant_and_sequence = 2*<hexOctet>
      • * node = 6*<hexOctet>
      • * hexOctet = <hexDigit><hexDigit>
      • * hexDigit =
      • * "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
      • * | "a" | "b" | "c" | "d" | "e" | "f"
      • * | "A" | "B" | "C" | "D" | "E" | "F"
      • * }</pre></blockquote>
      • *
      • * @return A string representation of this {@code UUID}
      • */
      • public String toString() {
      • return (digits(mostSigBits >> 32, 8) + "-" +
      • digits(mostSigBits >> 16, 4) + "-" +
      • digits(mostSigBits, 4) + "-" +
      • digits(leastSigBits >> 48, 4) + "-" +
      • digits(leastSigBits, 12));
      • }
    • PerfTop: 1485 irqs/sec  kernel:18.6%  exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
        samples   pcnt   function                                                           DSO
        1882.00   26.3%  intel_idle                                                         [kernel.kallsyms]
        1678.00   23.5%  os::javaTimeMillis()                                               libjvm.so
         382.00    5.3%  SpinPause                                                          libjvm.so
         335.00    4.7%  Timer::ImplTimerCallbackProc()                                     libvcllx.so
         291.00    4.1%  gettimeofday                                                       /lib/libc-2.12.1.so
         268.00    3.7%  hpet_next_event                                                    [kernel.kallsyms]
         254.00    3.6%  ParallelTaskTerminator::offer_termination(TerminatorTerminator*)   libjvm.so
    • PerfTop: 1656 irqs/sec  kernel:59.5%  exact: 0.0% [1000Hz cycles], (all, 8 CPUs)
        samples   pcnt   function                                                           DSO
        6980.00   38.5%  sha_transform                                                      [kernel.kallsyms]
        2119.00   11.7%  intel_idle                                                         [kernel.kallsyms]
        1382.00    7.6%  mix_pool_bytes_extract                                             [kernel.kallsyms]
         437.00    2.4%  i8042_interrupt                                                    [kernel.kallsyms]
         416.00    2.3%  hpet_next_event                                                    [kernel.kallsyms]
         390.00    2.2%  extract_buf                                                        [kernel.kallsyms]
         376.00    2.1%  ThreadInVMfromNative::~ThreadInVMfromNative()                      libjvm.so
         321.00    1.8%  T.3542                                                             libjvm.so
         298.00    1.6%  __ticket_spin_lock                                                 [kernel.kallsyms]
         296.00    1.6%  Timer::ImplTimerCallbackProc()                                     libvcllx.so
         255.00    1.4%  Unsafe_GetInt                                                      libjvm.so
      • perf stat java -cp uuid-3.2.jar:. HWUUID eaio 100
      • b05c8260-42c8-11e0-aa90-005056c00008
      • time taken:27
      • Performance counter stats for 'java -cp uuid-3.2.jar:. HWUUID eaio 100':
      • 94.736851 task-clock-msecs # 1.094 CPUs
      • 76 context-switches # 0.001 M/sec
      • 34 CPU-migrations # 0.000 M/sec
      • 5325 page-faults # 0.056 M/sec
      • 274771865 cycles # 2900.369 M/sec
      • 265443567 instructions # 0.966 IPC
      • 48687760 branches # 513.926 M/sec
      • 3909886 branch-misses # 8.031 %
      • 2513511 cache-references # 26.532 M/sec
      • 307953 cache-misses # 3.251 M/sec
      • 0.086582043 seconds time elapsed
      • perf stat java -cp uuid-3.2.jar:. HWUUID std 100
      • 3b879285-9071-47c8-81e1-9da4564bacdd
      • time taken:402
      • Performance counter stats for 'java -cp uuid-3.2.jar:. HWUUID std 100':
      • 605.085434 task-clock-msecs # 1.324 CPUs
      • 158 context-switches # 0.000 M/sec
      • 38 CPU-migrations # 0.000 M/sec
      • 9254 page-faults # 0.015 M/sec
      • 1926240824 cycles # 3183.420 M/sec
      • 2828570697 instructions # 1.468 IPC
      • 300923225 branches # 497.324 M/sec
      • 22268770 branch-misses # 7.400 %
      • 6885501 cache-references # 11.379 M/sec
      • 617949 cache-misses # 1.021 M/sec
      • 0.457032941 seconds time elapsed
      IPC breakdown roll-your-own vs. standard java.util.UUID
    • summary
      • TimebasedUUIDs vs. UUIDs
      • use ~4 times less kernel time on creation!
      • No SHA library calls!
      • optimized toString()
      • Much faster than standard java.util.UUID
      • - Better instructions per clock as well.
      • If on EC2:
      • Watch out for non-cacheable file access to /dev/urandom!
    • String theory of Java!
      • byte[] vs. char[]
      • If version > JDK 6u21 try -XX:+UseCompressedStrings
      • Append performance (gc) differs:
      • Strings vs. StringBuffers
      • com.google.common.base.Joiner
          • Join text for cheap,
          • skipNulls() or useForNull()
      • “Null References: A billion dollar mistake”
      • - C.A.R. Hoare
      “I call it my billion-dollar mistake. It was the invention of the null reference in 1965. At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn't resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.” - QCon London, '09
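For the append-performance and null-handling points on this slide, here is a plain-JDK sketch of joining text through a single `StringBuilder` while skipping nulls. It mirrors the behaviour the slide attributes to Joiner's skipNulls but is not Guava's implementation; `JoinSketch` is a hypothetical name.

```java
class JoinSketch {
    /** Joins the non-null parts with the separator, reusing one StringBuilder. */
    static String joinSkippingNulls(String separator, String... parts) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (String part : parts) {
            if (part == null) {
                continue;             // skip nulls instead of appending "null"
            }
            if (!first) {
                sb.append(separator); // separator only between emitted parts
            }
            sb.append(part);
            first = false;
        }
        return sb.toString();
    }
}
```

Repeated `String` concatenation in a loop would instead allocate a new intermediate String per step, which is where the GC cost comes from.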
      • Best Practices: Garbage Collection
    • -verbose:gc
      • GC Logs are cheap even in production
      • -Xloggc:gc.log
      • -XX:+PrintGCDetails
      • -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
      • A bit expensive/obscure ones:
      • -XX:PrintFLSStatistics=2 -XX:PrintCMSStatistics=1
      • -XX:+PrintCMSInitiationStatistics -XX:PrintFLSCensus=1
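Alongside the log flags above, the same collection counts and times can be sampled from inside the process through the standard `java.lang.management` beans (the data jconsole displays). A minimal sketch, with the hypothetical class name `GcCounters`:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcCounters {
    /** Prints per-collector statistics and returns the total number of collections so far. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            long timeMs = gc.getCollectionTime(); // accumulated collection time in milliseconds
            System.out.println(gc.getName() + ": count=" + count + ", timeMs=" + timeMs);
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }
}
```

The bean names depend on the collector selected at startup, so the output differs between CMS, ParallelGC and others.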
    • Three free parameters
      • Allocation Rate: your workload!
      • Size: defines runway!
      • Live Set, memory
      • Pause times: Stoppages!
    • Four free parameters
      • Allocation Rate: your application load!
      • Size: defines runway!
      • Live Set, system memory
      • Pause times: Stoppages!
      • (fourth: Overheads of GC – Space & CPU.)
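One way to read "size defines runway": the young generation size divided by the allocation rate gives roughly how long the application can run before the next minor collection. A back-of-the-envelope sketch with hypothetical numbers and names:

```java
class Runway {
    /** Approximate seconds until eden fills, i.e. the interval between minor GCs. */
    static double secondsBetweenMinorGcs(long edenBytes, long allocBytesPerSec) {
        return (double) edenBytes / allocBytesPerSec;
    }
}
```

For example, a 1 GiB eden with a 256 MiB/s allocation rate gives about 4 seconds of runway; doubling the allocation rate halves it.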
    • Part I: Sizing to be -Xmx == -Xms or not?
      • Young generation:
      • Use -Xmn for predictable performance
      [Diagram: new Object() → JVM allocates in eden → survivor spaces (survivor ratio) → promotion at the Tenuring Threshold → old gen]
    • Part II: Pick a collector!
      • Serial GC – Serial new + Serial Old
      • Parallel GC (default): Parallel Scavenge + Serial Old
      • -XX:+UseParallelOldGC: Parallel Scavenge + Parallel Old
      • -XX:+UseConcMarkSweepGC: ParNew, CMS Old, Serial Old
      • G1/Experimental
    • Reading GC logs – a topic/tool
      • Full GC is STW
      • Initial Mark, Rescan/WeakRef/Remark are STW
      • Look for promotion failures
      • Look for concurrent mode failures
    • ...
      995.330: [CMS-concurrent-mark: 0.952/1.102 secs] [Times: user=3.69 sys=0.54, real=1.10 secs]
      995.330: [CMS-concurrent-preclean-start]
      995.618: [CMS-concurrent-preclean: 0.279/0.287 secs] [Times: user=0.90 sys=0.20, real=0.29 secs]
      995.618: [CMS-concurrent-abortable-preclean-start]
      995.695: [GC 995.695: [ParNew (promotion failed)
      Desired survivor size 41943040 bytes, new threshold 1 (max 1)
      - age 1: 29826872 bytes, 29826872 total
      : 720596K->703760K(737280K), 0.4710410 secs]996.166: [CMS996.317: [CMS-concurrent-abortable-preclean: 0.218/0.699 secs] [Times: user=1.39 sys=0.10, real=0.70 secs]
       (concurrent mode failure): 4100132K->784070K(5341184K), 4.7478300 secs] 4780154K->784070K(6078464K), [CMS Perm : 17033K->17014K(28400K)], 5.2191410 secs] [Times: user=5.70 sys=0.01, real=5.22 secs]
      ...
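Spotting the two failure markers in a log like the one above is easy to automate. The following is a minimal sketch that counts "promotion failed" and "concurrent mode failure" occurrences and finds the longest `real=` time; it assumes the log text format shown in this excerpt, and `GcLogScan` is a hypothetical helper, not an existing tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GcLogScan {
    private static final Pattern FAILURE =
        Pattern.compile("promotion failed|concurrent mode failure");
    private static final Pattern REAL_TIME =
        Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** Counts promotion failures and concurrent mode failures in the log text. */
    static int countFailures(String log) {
        Matcher m = FAILURE.matcher(log);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Returns the largest wall-clock ("real") time reported, or 0.0 if none is found. */
    static double maxRealSeconds(String log) {
        Matcher m = REAL_TIME.matcher(log);
        double max = 0.0;
        while (m.find()) {
            max = Math.max(max, Double.parseDouble(m.group(1)));
        }
        return max;
    }
}
```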
    • Tuning CMS
      • Don’t promote too often!
      • Frequent promotion causes fragmentation
      • TenuringThreshold (avoid “never tenure”)
      • Size the generations
      • Min GC times are a function of Live Set
      • Old Gen should host steady state comfortably
      • Avoid CMS Initiating heuristic
      • -XX:+UseCMSInitiatingOccupancyOnly
      • Use Concurrent for System.gc()
      • -XX:+ExplicitGCInvokesConcurrent
    • GC Threads
      • Parallelize on multicores
      • -XX:ParallelGCThreads=4
      • (default: derived from # of cpus on system)
      • n if n <= 8, else 8 + (n-8)*5/8
      • -XX:ParallelCMSThreads=4
      • (default: derived from # of parallelgcthreads)
      • Strategy A:
      • Tune minor GCs & let application data stay in eden
    • Did someone ask about defaults?
      if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
        assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
        // For very large machines, there are diminishing returns
        // for large numbers of worker threads. Instead of
        // hogging the whole system, use a fraction of the workers for every
        // processor after the first 8. For example, on a 72 cpu machine
        // and a chosen fraction of 5/8
        // use 8 + (72 - 8) * (5/8) == 48 worker threads.
        unsigned int ncpus = (unsigned int) os::active_processor_count();
        return (ncpus <= switch_pt) ? ncpus : (switch_pt + ((ncpus - switch_pt) * num) / den);
      } else {
        return ParallelGCThreads;
      }
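The default computed by the HotSpot snippet above can be reproduced in a few lines of Java. The values switch_pt = 8 and num/den = 5/8 are taken from the comment in the snippet and are assumed here, not read from a running JVM; the class name is hypothetical.

```java
class GcThreadDefault {
    /** Default ParallelGCThreads for a machine with the given CPU count, per the snippet above. */
    static int defaultParallelGcThreads(int ncpus) {
        final int switchPt = 8; // all CPUs are used up to this count
        final int num = 5;      // beyond it, 5/8 of each additional CPU
        final int den = 8;
        return (ncpus <= switchPt)
            ? ncpus
            : switchPt + ((ncpus - switchPt) * num) / den;
    }
}
```

On the 72-CPU machine from the comment this gives 8 + 64 * 5 / 8 = 48 threads; on a 16-CPU machine it gives 13.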
    • Fragmentation
      • Performance degrades over time
      • Inducing “Full GC” makes problem go away
      • Free memory that cannot be used
      • Round off errors
      • Reduce occurrence
      • Use a compacting collector
      • Promote less often
      • Use uniform sized objects
      • Not enough large contiguous space for promotion
      • Small objects still can fit in the holes!
      • Compaction – stop the world.
      • Unsolved on Oracle/Sun Hotspot
      • Azul Systems Pauseless JVM.
    • JRockit Mission Control
    • Example
      • Application suddenly transitions to back-to-back full gcs.
      • Cannot use free mem – too many holes!
    • Tools
      • GCHisto
      • jconsole
      • VisualVM/VisualGC
      • Logs
      • Thread dumps
      • yourkit memory profile, snapshots
    • GCSpy
    • Gone 0xff the heap !!
      • ByteBuffer.allocateDirect(16 * 1024 * 1024)
      • Also can be mapped memory of a file region
      • Store long-lived objects outside jvm
      • Managed by native i/o ops.
      • JNA: dynamically load & call native libraries without compile time decl like JNI
      • Works for limited use cases in the lab.
      • Ex: Terracotta, Hbase, Cassandra
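A minimal sketch of the off-heap idea: a direct `ByteBuffer` holds fixed-size records outside the Java heap, so the collector never scans or copies their contents. The slab size matches the slide's 16 MB example; the slot layout and the `OffHeapSlab` name are hypothetical.

```java
import java.nio.ByteBuffer;

class OffHeapSlab {
    private static final int SLOT_BYTES = 8; // one long per slot
    private final ByteBuffer buffer;

    OffHeapSlab(int bytes) {
        // Memory is allocated outside the Java heap; only this small wrapper object is on-heap.
        buffer = ByteBuffer.allocateDirect(bytes);
    }

    void put(int slot, long value) {
        buffer.putLong(slot * SLOT_BYTES, value); // absolute write, no position change
    }

    long get(int slot) {
        return buffer.getLong(slot * SLOT_BYTES); // absolute read
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    int slots() {
        return buffer.capacity() / SLOT_BYTES;
    }
}
```

Usage: `new OffHeapSlab(16 * 1024 * 1024)` yields 2,097,152 long slots. The native memory is only released when the buffer object itself is garbage collected.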
    • Gone 0xff the heap ?
      • Issues to consider:
      • No clear api to de-allocate from this region
        • See jbellis patch to JNA-179 for FreeableBuffer
      • Object cleanup relegated to finalization
      • Single finalizer thread, Bug ID: 4469299
      • Behind WeakReference processing in JDK 6u21
      • Workaround:
      • -XX:MaxDirectMemorySize=<size>
      • Manually Trigger System.gc() to avoid “leak”
    • Virtually there!
      • Ballooning driver for Memory: Disable it!
      • Time (TSC) issue! It's relative!
      • Scheduling when # of threads > # of vcpus..
      • Tickless _nohz kernel
      • GC Thread starvation = STW pauses
      • large ec2 instances are not all equal..
      • DirectPathIO & vt-d, rvi – Watch out for Sockets!
      • Tools: Performance counters still not virtualized!
    • summary
      • JVM is still the most popular platform for deployment for the new languages!
      • JVM heartburn around scale!
        • Serialization
        • UUID
        • Object overhead
        • Garbage Collection
        • Hypervisor
    • References
      • Chris Wimmer, http://wikis.sun.com/display/HotSpotInternals/Synchronization
      • Russell & Detlefs, http://www.oracle.com/technetwork/java/biasedlocking-oopsla2006-wp-149958.pdf
      • Google Protocol Buffers http://code.google.com/p/protobuf
      • Thrift http://incubator.apache.org/thrift/static/thrift-20070401.pdf
      • Leach-Salz Variant of UUID http://www.upnp.org/resources/draft-leach-uuids-guids-00.txt
      • Hans Boehm, http://www.hpl.hp.com/personal/Hans_Boehm/gc/complexity.html
      • Brian Goetz, JSR-133 http://www.ibm.com/developerworks/java/library/j-jtp03304/
      • GCSpy http://www.cs.kent.ac.uk/projects/gc/gcspy/
      • Understanding GC logs http://blogs.sun.com/poonam/entry/understanding_cms_gc_logs
      • Cliff Click's http://sourceforge.net/projects/high-scale-lib/