Dealing with JVM limitations in Apache Cassandra (Fosdem 2012)

  1. Dealing with JVM limitations in Apache Cassandra
     Jonathan Ellis / @spyced
  2. Pain points for Java databases
     ✤ GC
     ✤ GC
     ✤ GC
  3. Pain points for Java databases
     ✤ GC
     ✤ Platform specific code
  4. GC
     ✤ Concurrent and compacting: choose one
       ✤ G1
       ✤ Azul C4 / Zing?
  5. Fragmentation
     ✤ Bloom filter arrays
     ✤ Compression offsets
  6. Automatic mitigation?
     ✤ http://www.research.ibm.com/people/d/dfb/papers/Bacon03Controlling.pdf
     ✤ http://researcher.ibm.com/files/us-hirzel/pldi10-arraylets.pdf
  7. Fragmentation, 2
     ✤ Arena allocation for memtables
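
The idea behind arena allocation can be sketched as follows: instead of allocating each small value separately, values are copied into a few large regions that are all discarded together, which should leave fewer scattered holes in the old generation. This is a minimal illustration only; the class name `ArenaSketch`, its methods, and the region size are hypothetical and not taken from Cassandra's code.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical slab-style arena: hands out slices of large fixed-size regions
 * so that many small values share a few big allocations.
 */
final class ArenaSketch {
    private final int regionSize;
    private ByteBuffer current;   // region currently being carved up
    private int regionsAllocated; // number of regions created so far

    ArenaSketch(int regionSize) {
        if (regionSize <= 0) {
            throw new IllegalArgumentException("regionSize must be positive");
        }
        this.regionSize = regionSize;
    }

    /** Returns a buffer of exactly {@code size} bytes backed by a shared region. */
    synchronized ByteBuffer allocate(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        if (size > regionSize) {
            // Oversized requests get their own buffer rather than wasting a region.
            return ByteBuffer.allocate(size);
        }
        if (current == null || current.remaining() < size) {
            current = ByteBuffer.allocate(regionSize);
            regionsAllocated++;
        }
        // Carve the next `size` bytes out of the current region.
        ByteBuffer window = current.duplicate();
        window.limit(window.position() + size);
        current.position(current.position() + size);
        return window.slice();
    }

    /** Copies {@code value} into the arena and returns the arena-backed copy. */
    ByteBuffer copyIn(byte[] value) {
        ByteBuffer dst = allocate(value.length);
        dst.put(value);
        dst.flip();
        return dst;
    }

    synchronized int regionsAllocated() {
        return regionsAllocated;
    }
}
```

With a 16-byte region, two small values land in the same backing array, and a request that no longer fits starts a new region. Dropping the arena object releases every region at once.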
  8. (Memtables?)
     write(k1, c1:v1)
     Memory: Memtable
     Hard drive: Commit log
  9. write(k1, c1:v)
     Memory (Memtable): k1 c1:v
     Hard drive (Commit log): k1 c1:v
  10. write(k1, c2:v)
      Memory (Memtable): k1 c1:v c2:v
      Hard drive (Commit log): k1 c1:v; k1 c2:v
  11. write(k2, c1:v c2:v)
      Memory (Memtable): k1 c1:v c2:v; k2 c1:v c2:v
      Hard drive (Commit log): k1 c1:v; k1 c2:v; k2 c1:v c2:v
  12. write(k1, c1:v c3:v)
      Memory (Memtable): k1 c1:v c2:v c3:v; k2 c1:v c2:v
      Hard drive (Commit log): k1 c1:v; k1 c2:v; k2 c1:v c2:v; k1 c1:v c3:v
  13. flush (diagram labels: index, cleanup)
      Hard drive (SSTable): k1 c1:v c2:v c3:v; k2 c1:v c2:v
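
The write path in slides 8 to 13 can be modelled with a toy example: every write is appended to a commit log, merged into a sorted in-memory table, and a flush turns that table into an immutable sorted result. All names here (`MemtableSketch`, `write`, `flush`) are hypothetical, and the "commit log" and "SSTable" are plain in-memory collections standing in for files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the slides' write path: commit log append, memtable merge, flush. */
final class MemtableSketch {
    // Stand-in for the on-disk commit log: one entry per write, in arrival order.
    final List<String> commitLog = new ArrayList<>();

    // Memtable: rows sorted by key, each row a sorted map of column -> value.
    private TreeMap<String, TreeMap<String, String>> memtable = new TreeMap<>();

    void write(String key, Map<String, String> columns) {
        // Append to the log first, then merge the columns into the in-memory row.
        commitLog.add(key + " " + columns);
        memtable.computeIfAbsent(key, k -> new TreeMap<>()).putAll(columns);
    }

    /** Hands back the sorted rows as the "SSTable" and resets the in-memory state. */
    TreeMap<String, TreeMap<String, String>> flush() {
        TreeMap<String, TreeMap<String, String>> sstable = memtable;
        memtable = new TreeMap<>();
        // Cleanup: once rows are flushed, their log entries are no longer needed for replay.
        commitLog.clear();
        return sstable;
    }
}
```

Replaying the four writes from the slides leaves four commit log entries and two merged rows; flushing returns the rows in key order and empties the log.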
  14. “Java is a memory hog”
      ✤ Large overhead for typical objects and collections
      ✤ How large?
      ✤ java.lang.instrument.Instrumentation
        ✤ JAMM: Java Agent for Memory Measurements
        ✤ https://github.com/jbellis/jamm
  15. org.apache.cassandra.cache.SerializingCache
      ✤ Live objects are about 85% JVM bookkeeping
      ✤ org.apache.cassandra.cache.FreeableMemory using reference counting
      ✤ Considering doing reference-counted, off-heap memtables as well
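
Reference counting for off-heap memory might look roughly like the sketch below: the creator holds the first reference, readers take and drop references around each use, and whoever drops the last one releases the memory. The class `RefCountedMemorySketch` and its methods are hypothetical and not the actual FreeableMemory API. As a further assumption, the "free" step only drops the direct buffer, where a real implementation would release native memory explicitly.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted wrapper around an off-heap (direct) buffer. */
final class RefCountedMemorySketch {
    // The creator holds the first reference.
    private final AtomicInteger references = new AtomicInteger(1);
    private volatile ByteBuffer buffer;

    RefCountedMemorySketch(int size) {
        this.buffer = ByteBuffer.allocateDirect(size); // allocated outside the Java heap
    }

    /** Takes a reference; returns false if the memory has already been released. */
    boolean reference() {
        while (true) {
            int n = references.get();
            if (n <= 0) {
                return false;
            }
            if (references.compareAndSet(n, n + 1)) {
                return true;
            }
        }
    }

    /** Drops a reference; the last one out releases the buffer. */
    void unreference() {
        if (references.decrementAndGet() == 0) {
            // Assumption: a real implementation would free native memory here.
            // This sketch only drops the buffer so it becomes collectable.
            buffer = null;
        }
    }

    boolean isFreed() {
        return buffer == null;
    }

    byte getByte(int offset) {
        ByteBuffer b = buffer;
        if (b == null) {
            throw new IllegalStateException("memory already freed");
        }
        return b.get(offset);
    }

    void setByte(int offset, byte value) {
        ByteBuffer b = buffer;
        if (b == null) {
            throw new IllegalStateException("memory already freed");
        }
        b.put(offset, value);
    }
}
```

A reader that fails to take a reference simply treats the entry as a cache miss, so a concurrent eviction cannot free memory out from under it.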
  16. Don’t forget about young gen
      ✤ Always stop-the-world for ~100ms
  17. Platform-specific code
      ✤ OS
      ✤ JVM
  18. m[un]map
      ✤ Log-structured storage wants to remove old files post-compaction; some platforms disallow deleting open files
      ✤ Old workaround (pre-1.0):
        ✤ use PhantomReference to tell when mmap’d file is GC’d (hence unmapped)
        ✤ Poor user experience and messy corner cases
      ✤ New workaround:
        ✤ Class.forName("sun.nio.ch.DirectBuffer").getMethod("cleaner")
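
The "new workaround" can be sketched as a best-effort helper built on the reflective lookup shown on the slide. The class and method names (`UnmapSketch`, `tryClean`) are hypothetical. The helper relies on JDK-internal classes, so it is assumed to work only on JVMs that still allow reflective access to `sun.nio.ch.DirectBuffer`; elsewhere it reports failure instead of throwing.

```java
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

final class UnmapSketch {
    /**
     * Best-effort explicit release of a direct or mapped buffer.
     * Returns true only if the buffer's cleaner was found and invoked.
     * The buffer must never be touched again after a successful call:
     * reading unmapped memory can crash the JVM.
     */
    static boolean tryClean(ByteBuffer buffer) {
        if (buffer == null || !buffer.isDirect()) {
            return false; // heap buffers have nothing to unmap
        }
        try {
            // Same lookup as on the slide; this is an internal, unsupported API.
            Method cleanerMethod = Class.forName("sun.nio.ch.DirectBuffer").getMethod("cleaner");
            Object cleaner = cleanerMethod.invoke(buffer);
            if (cleaner == null) {
                return false; // e.g. slices and duplicates carry no cleaner of their own
            }
            cleaner.getClass().getMethod("clean").invoke(cleaner);
            return true;
        } catch (Exception | LinkageError e) {
            // Internal API missing or inaccessible: fall back to waiting for the GC.
            return false;
        }
    }
}
```

Returning a boolean lets the caller decide what to do when the unmap did not happen, for example deferring deletion of the underlying file.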
  19. mmap part 2
      ✤ 2GB limit via ByteBuffer:
        public abstract byte get(int index)
      ✤ Workaround: MmappedSegmentedFile
        public Iterator<DataInput> iterator(long position)
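
Because `ByteBuffer` is indexed by `int`, one way around the 2GB limit is to map a file as several segments and translate a `long` position into a segment number plus an offset. The sketch below illustrates that translation only; `SegmentedMmapSketch` and its methods are hypothetical and do not mirror the MmappedSegmentedFile API.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical segmented mapping: several mapped regions addressed by one long position. */
final class SegmentedMmapSketch {
    private final long segmentSize;
    private final long length;
    private final MappedByteBuffer[] segments;

    private SegmentedMmapSketch(long segmentSize, long length, MappedByteBuffer[] segments) {
        this.segmentSize = segmentSize;
        this.length = length;
        this.segments = segments;
    }

    static SegmentedMmapSketch open(Path file, long segmentSize) throws IOException {
        if (segmentSize <= 0 || segmentSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("segmentSize must fit in a positive int");
        }
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long length = channel.size();
            int count = (int) ((length + segmentSize - 1) / segmentSize);
            MappedByteBuffer[] segments = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long start = i * segmentSize;
                long size = Math.min(segmentSize, length - start);
                // Mappings stay valid after the channel is closed.
                segments[i] = channel.map(FileChannel.MapMode.READ_ONLY, start, size);
            }
            return new SegmentedMmapSketch(segmentSize, length, segments);
        }
    }

    int segmentCount() {
        return segments.length;
    }

    /** Reads one byte at an absolute file position that may exceed Integer.MAX_VALUE. */
    byte get(long position) {
        if (position < 0 || position >= length) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        int segment = (int) (position / segmentSize);
        int offset = (int) (position % segmentSize);
        return segments[segment].get(offset);
    }
}
```

A small segment size makes the behaviour easy to check on a tiny file: ten bytes with 4-byte segments yield three mappings, and position 9 resolves to segment 2, offset 1.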
  20. link
      ✤ Used for snapshots
      ✤ Old workaround: JNA
      ✤ New workaround: supported directly by Java 7
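
With Java 7, hard links are available through `java.nio.file.Files.createLink`, so a snapshot can be approximated by linking each data file into a snapshot directory. The sketch below assumes a flat directory of data files on a filesystem that supports hard links; `SnapshotLinkSketch` and `snapshot` are hypothetical names, not Cassandra's.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class SnapshotLinkSketch {
    /**
     * Hard-links every regular file in dataDir into snapshotDir.
     * Returns the number of files linked.
     */
    static int snapshot(Path dataDir, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        int linked = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dataDir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    // createLink(link, existing): the new name shares the existing file's data.
                    Files.createLink(snapshotDir.resolve(file.getFileName()), file);
                    linked++;
                }
            }
        }
        return linked;
    }
}
```

Since a hard link is just another name for the same data, the snapshot costs no extra copy, and the content stays readable through the snapshot after the original name is deleted.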
  21. mlockall
      ✤ swappiness: pissing off database developers since 2001 (?)
      ✤ mlockall(MCL_CURRENT)
  22. Low-level i/o
      ✤ posix_fadvise
      ✤ mincore/fincore
      ✤ fcntl
      ✤ ... JNA
  23. A plug for JNA
      ✤ https://github.com/twall/jna
        static {
            try {
                Native.register("c");
        ...
        private static native int mlockall(int flags) throws LastErrorException;
  24. The fallacy of choosing portability over power
      ✤ Applets have been dead for years
      ✤ Python gets it right
        ✤ import readline
  25. The fallacy of choosing safety over power
      ✤ Allowing munmap would expose developers to segfaults
      ✤ But, relying on the GC to clean up external resources is a well-known antipattern
        ✤ File.close
      ✤ We need munmap badly enough that we resort to unnatural and unportable code to get it
        ✤ You haven’t kept us from risking segfaults, you’ve just made us miserable
  26. Compatibility through obscurity?
      ✤ sun.misc.Unsafe
      ✤ Used by high-profile libraries like high-scale-lib
  27. ... even public options
      http://blogs.oracle.com/dave/entry/false_sharing_induced_by_card
  28. Too negative?
  29. Still true
      ✤ "Many concurrent algorithms are very easy to write with a GC and totally hard (to down right impossible) using explicit free." -- Cliff Click