Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)

Published in: Technology

  1. Dealing with JVM limitations in Apache Cassandra. Jonathan Ellis / @spyced
  2. Pain points for Java databases ✤ GC ✤ GC ✤ GC
  3. Pain points for Java databases ✤ GC ✤ Platform-specific code
  4. GC ✤ Concurrent and compacting: choose one ✤ G1 ✤ Azul C4 / Zing?
  5. Fragmentation ✤ Bloom filter arrays ✤ Compression offsets
  6. Automatic mitigation? ✤ http://www.research.ibm.com/people/d/dfb/papers/Bacon03Controlling.pdf ✤ http://researcher.ibm.com/files/us-hirzel/pldi10-arraylets.pdf
  7. Fragmentation, 2 ✤ Arena allocation for memtables
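Arena allocation can be sketched roughly as below. This is an illustration under assumptions, not Cassandra's allocator: the class name `ArenaAllocator` and the region size are hypothetical. The idea is to copy many small values into a few large, equally sized regions, so the old generation holds uniform blocks instead of scattered small arrays that fragment the heap.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: hands out small slices of large fixed-size regions.
final class ArenaAllocator {
    private final int regionSize;
    private ByteBuffer region; // current region; replaced once it cannot fit a request

    ArenaAllocator(int regionSize) {
        if (regionSize <= 0) {
            throw new IllegalArgumentException("regionSize must be positive");
        }
        this.regionSize = regionSize;
    }

    // Returns a buffer of exactly `size` bytes carved out of the current region.
    synchronized ByteBuffer allocate(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must not be negative");
        }
        if (size > regionSize) {
            return ByteBuffer.allocate(size); // oversized values get their own array
        }
        if (region == null || region.remaining() < size) {
            region = ByteBuffer.allocate(regionSize); // start a fresh region
        }
        ByteBuffer view = region.duplicate();
        view.limit(view.position() + size);
        region.position(region.position() + size); // bump the allocation pointer
        return view.slice();
    }

    // Copies a value into the arena so the original small array can die young.
    ByteBuffer copy(ByteBuffer value) {
        ByteBuffer target = allocate(value.remaining());
        target.put(value.duplicate());
        target.flip();
        return target;
    }
}
```

When the memtable is flushed, every region becomes garbage at once, which is what makes the free space contiguous.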
  8. (Memtables?) write( k1 , c1:v1 ) ✤ Memory: Memtable ✤ Hard drive: Commit log
  9. write( k1 , c1:v ) ✤ Memory: Memtable [ k1 c1:v ] ✤ Hard drive: Commit log [ k1 c1:v ]
  10. write( k1 , c2:v ) ✤ Memory: Memtable [ k1 c1:v c2:v ] ✤ Hard drive: Commit log [ k1 c1:v | k1 c2:v ]
  11. write( k2 , c1:v c2:v ) ✤ Memory: Memtable [ k1 c1:v c2:v | k2 c1:v c2:v ] ✤ Hard drive: Commit log [ k1 c1:v | k1 c2:v | k2 c1:v c2:v ]
  12. write( k1 , c1:v c3:v ) ✤ Memory: Memtable [ k1 c1:v c2:v c3:v | k2 c1:v c2:v ] ✤ Hard drive: Commit log [ k1 c1:v | k1 c2:v | k2 c1:v c2:v | k1 c1:v c3:v ]
  13. Memory: flush ✤ index ✤ cleanup ✤ Hard drive: SSTable [ k1 c1:v c2:v c3:v | k2 c1:v c2:v ]
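The write path in the memtable slides above can be sketched as a toy model. Everything here is an assumption made for illustration: `MiniStore` is a hypothetical name, in-memory lists stand in for the on-disk commit log and SSTables, and reading "cleanup" as discarding the commit log after a flush is an interpretation of the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the write path: append to the commit log, update the sorted
// memtable, and on flush turn the memtable into an immutable-by-convention SSTable.
final class MiniStore {
    final List<String> commitLog = new ArrayList<>(); // stand-in for the on-disk log
    final List<NavigableMap<String, NavigableMap<String, String>>> sstables = new ArrayList<>();
    private NavigableMap<String, NavigableMap<String, String>> memtable = new TreeMap<>();

    void write(String key, String column, String value) {
        commitLog.add(key + " " + column + ":" + value); // durability first
        memtable.computeIfAbsent(key, k -> new TreeMap<>()).put(column, value);
    }

    void flush() {
        sstables.add(memtable);    // rows are already sorted by key, then column
        memtable = new TreeMap<>();
        commitLog.clear();         // flushed writes no longer need replaying
    }

    NavigableMap<String, NavigableMap<String, String>> memtable() {
        return memtable;
    }
}
```

Replaying the four writes from the slides and then calling `flush()` leaves one SSTable with rows k1 and k2 and an empty memtable and commit log.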
  14. “Java is a memory hog” ✤ Large overhead for typical objects and collections ✤ How large? ✤ java.lang.instrument.Instrumentation ✤ JAMM: Java Agent for Memory Measurements ✤ https://github.com/jbellis/jamm
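A minimal sketch of measuring an object through java.lang.instrument.Instrumentation follows. It is not JAMM's code: `SizeAgent` and `shallowSize` are hypothetical names. It assumes the class is packaged in a jar whose manifest names it as `Premain-Class` and that the JVM is started with `-javaagent:` pointing at that jar; without the agent the method refuses to guess.

```java
import java.lang.instrument.Instrumentation;

// Hypothetical agent: the JVM calls premain before main when started with -javaagent.
final class SizeAgent {
    private static volatile Instrumentation instrumentation;

    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size only: the object's own header and fields, not what it references.
    static long shallowSize(Object o) {
        Instrumentation inst = instrumentation;
        if (inst == null) {
            throw new IllegalStateException("agent not loaded; start the JVM with -javaagent");
        }
        return inst.getObjectSize(o);
    }
}
```

Measuring a whole collection means walking its object graph and summing shallow sizes, which is the part a tool like JAMM automates.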
  15. org.apache.cassandra.cache.SerializingCache ✤ Live objects are about 85% JVM bookkeeping ✤ org.apache.cassandra.cache.FreeableMemory using reference counting ✤ Considering doing reference-counted, off-heap memtables as well
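Reference counting for off-heap memory might look roughly like this. It is a sketch under assumptions, not Cassandra's FreeableMemory: `RefCountedMemory` is a hypothetical class, and dropping the direct buffer stands in for an explicit native free.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the owner holds one reference; readers take and release their own.
final class RefCountedMemory {
    private final AtomicInteger refs = new AtomicInteger(1); // the owner's reference
    private volatile ByteBuffer buffer;

    RefCountedMemory(int size) {
        buffer = ByteBuffer.allocateDirect(size); // lives outside the Java heap
    }

    // Takes a reference; returns false if the memory has already been freed.
    boolean reference() {
        while (true) {
            int n = refs.get();
            if (n <= 0) {
                return false;
            }
            if (refs.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    // Releases a reference; the last release frees the memory.
    void unreference() {
        if (refs.decrementAndGet() == 0) {
            buffer = null; // stand-in for an explicit native free
        }
    }

    boolean isFreed() {
        return buffer == null;
    }
}
```

The compare-and-set loop in `reference()` is what prevents a reader from resurrecting memory whose count has already reached zero.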
  16. Don’t forget about young gen ✤ Always stop-the-world for ~100ms
  17. Platform-specific code ✤ OS ✤ JVM
  18. m[un]map ✤ Log-structured storage wants to remove old files post-compaction; some platforms disallow deleting open files ✤ Old workaround (pre-1.0): ✤ use PhantomReference to tell when mmap’d file is GC’d (hence unmapped) ✤ Poor user experience and messy corner cases ✤ New workaround: ✤ Class.forName("sun.nio.ch.DirectBuffer").getMethod("cleaner")
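The reflective workaround named on the m[un]map slide can be sketched as a best-effort helper. `Unmapper` and `tryUnmap` are hypothetical names; `sun.nio.ch.DirectBuffer` is a JDK-internal interface, so newer JVMs may refuse the reflective call unless the package is opened explicitly, in which case this sketch simply reports failure.

```java
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

// Hypothetical helper: asks the JDK-internal cleaner to release a direct or mapped buffer now.
final class Unmapper {
    // Returns true only if the cleaner was found and invoked.
    static boolean tryUnmap(ByteBuffer buffer) {
        try {
            Class<?> directBuffer = Class.forName("sun.nio.ch.DirectBuffer");
            if (!directBuffer.isInstance(buffer)) {
                return false; // heap buffers have nothing to unmap
            }
            Method cleanerMethod = directBuffer.getMethod("cleaner");
            Object cleaner = cleanerMethod.invoke(buffer);
            if (cleaner == null) {
                return false; // e.g. a slice that does not own its memory
            }
            Method clean = cleaner.getClass().getMethod("clean");
            clean.invoke(cleaner);
            return true;
        } catch (ReflectiveOperationException | RuntimeException e) {
            return false; // internal API missing or not accessible on this JVM
        }
    }
}
```

After a successful call the buffer must never be touched again; reading it can crash the JVM, which is the segfault risk the later slides argue developers take on anyway.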
  19. mmap part 2 ✤ 2GB limit via ByteBuffer: public abstract byte get(int index) ✤ Workaround: MmappedSegmentedFile: public Iterator<DataInput> iterator(long position)
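The segmenting idea can be illustrated with a small sketch. This is not MmappedSegmentedFile itself: `SegmentedMapping` is a hypothetical class with a simplified byte-at-a-time API, and the segment size is a caller-supplied assumption. The file is mapped as several buffers, each small enough for an int index, and a long position is split into a segment number and an offset.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: addresses a file of any length with long positions.
final class SegmentedMapping {
    private final MappedByteBuffer[] segments;
    private final long segmentSize;
    private final long length;

    SegmentedMapping(Path file, long segmentSize) throws IOException {
        if (segmentSize <= 0 || segmentSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("segmentSize must fit in a positive int");
        }
        this.segmentSize = segmentSize;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            length = channel.size();
            int count = (int) ((length + segmentSize - 1) / segmentSize);
            segments = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = i * segmentSize;
                long size = Math.min(segmentSize, length - start);
                // Mappings stay valid after the channel is closed.
                segments[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
        }
    }

    byte get(long position) {
        if (position < 0 || position >= length) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        int segment = (int) (position / segmentSize);
        int offset = (int) (position % segmentSize);
        return segments[segment].get(offset);
    }
}
```

Multi-byte reads that straddle a segment boundary need extra care, which is one reason an iterator over DataInput segments is a friendlier interface than raw indexing.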
  20. link ✤ Used for snapshots ✤ Old workaround: JNA ✤ New workaround: supported directly by Java 7
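With Java 7, a hard link can be created through java.nio.file.Files.createLink. A minimal sketch of a link-based snapshot follows; the class name `Snapshots` and the directory layout are assumptions for illustration, and the call throws UnsupportedOperationException on file systems without hard links.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: snapshot an immutable data file by hard-linking it.
final class Snapshots {
    // Returns the path of the new link inside snapshotDir.
    static Path snapshot(Path dataFile, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        Path link = snapshotDir.resolve(dataFile.getFileName());
        // No bytes are copied; both names point at the same file contents.
        return Files.createLink(link, dataFile);
    }
}
```

Because the data file is never rewritten in place, the link keeps the contents alive even after compaction deletes the original name.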
  21. mlockall ✤ swappiness: pissing off database developers since 2001 (?) ✤ mlockall(MCL_CURRENT)
  22. Low-level i/o ✤ posix_fadvise ✤ mincore/fincore ✤ fcntl ✤ ... JNA
  23. A plug for JNA ✤ https://github.com/twall/jna ✤ static { try { Native.register("c"); ... private static native int mlockall(int flags) throws LastErrorException;
  24. The fallacy of choosing portability over power ✤ Applets have been dead for years ✤ Python gets it right ✤ import readline
  25. The fallacy of choosing safety over power ✤ Allowing munmap would expose developers to segfaults ✤ But, relying on the GC to clean up external resources is a well-known antipattern ✤ File.close ✤ We need munmap badly enough that we resort to unnatural and unportable code to get it ✤ You haven’t kept us from risking segfaults, you’ve just made us miserable
  26. Compatibility through obscurity? ✤ sun.misc.Unsafe ✤ Used by high-profile libraries like high-scale-lib
  27. ... even public options ✤ http://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
  28. Too negative?
  29. Still true ✤ "Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free." -- Cliff Click