ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM)

Cache & Concurrency considerations for a high performance Cassandra deployment.
SriSatish Ambati

Cassandra has hit its stride as a distributed Java NoSQL database! It's fast, it's in-memory, it's scalable, it's SEDA-based; its eventually consistent model makes it practical for the large and growing volume of unstructured-data use cases. It is also time to run it through the filters of performance analysis. For starters, it runs on the Java virtual machine and inherits the capabilities and culpabilities of that platform. This presentation reviews the runtime architecture, cache behavior & performance of a real-world workload on Cassandra. We blend existing system & JVM tools to get a quick overview & a breakdown of hotspots in the get, put & update operations. We highlight the role played by garbage collection & fragmentation due to long-lived objects; we investigate lock contention in the data structures under concurrent usage. Cassandra uses UDP for management & TCP for data: we look at the robustness of the communication patterns during high spikes and cluster-wide events. We review Non-Blocking HashMap modifications to Cassandra that improve concurrency & amplify the performance of this frontrunner in the NoSQL space.

ApacheCon2010 NA
Wed, 03 November 2010 15:00
cassandra

  • Typical write operation involves a write into a commit log for durability and recoverability and an update into an in-memory data structure. The write into the in-memory data structure is performed only after a successful write into the commit log. We have a dedicated disk on each machine for the commit log since all writes into the commit log are sequential and so we can maximize disk throughput. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk. This write is performed on one of many commodity disks that machines are equipped with. All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file. Over time many such files could exist on disk and a merge process runs in the background to collate the different files into one file. This process is very similar to the compaction process that happens in the Bigtable system.
  • “A typical read operation first queries the in-memory data structure before looking into the files on disk. The files are looked at in the order of newest to oldest. When a disk lookup occurs we could be looking up a key in multiple files on disk. In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory. This bloom filter is first consulted to check if the key being looked up does indeed exist in the given file. A key in a column family could have many columns. Some special indexing is required to retrieve columns which are further away from the key. In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval. As the columns for a given key are being serialized and written out to disk we generate indices at every 256K chunk boundary. This boundary is configurable, but we have found 256K to work well for us in our production workloads.”
  • Description of Graph: Shows the average number of cache misses expected when inserting into a hash table with various collision resolution mechanisms; on modern machines, this is a good estimate of actual clock time required. This seems to confirm the common heuristic that performance begins to degrade at about 80% table density. It is based on a simulated model of a hash table where the hash function chooses indexes for each insertion uniformly at random. The parameters of the model were: You may be curious what happens in the case where no cache exists. In other words, how does the number of probes (number of reads, number of comparisons) rise as the table fills? The curve is similar in shape to the one above, but shifted left: it requires an average of 24 probes for an 80% full table, and you have to go down to a 50% full table for only 3 probes to be required on average. This suggests that in the absence of a cache, ideally your hash table should be about twice as large for probing as for chaining.
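As a rough, hedged illustration of the probe-count behavior described in the note above (a simulation sketch, not from the original deck), the following Java program inserts keys at uniformly random home slots into a linear-probing table and prints the running average probes per insert as density grows:

    import java.util.Random;

    // Simulates the uniform-hashing model from the note above: insert keys at
    // random home slots, resolve collisions by linear probing, and report the
    // running average number of probes per insertion as the table fills.
    public class ProbeSim {
        public static void main(String[] args) {
            int size = 1 << 20;
            boolean[] occupied = new boolean[size];
            Random rnd = new Random(42);
            long probes = 0;
            int reportEvery = size / 10;
            for (int inserted = 1; inserted <= (int) (size * 0.9); inserted++) {
                int slot = rnd.nextInt(size);      // uniformly random home slot
                probes++;                          // first probe at the home slot
                while (occupied[slot]) {           // collision: probe the next slot
                    slot = (slot + 1) % size;
                    probes++;
                }
                occupied[slot] = true;
                if (inserted % reportEvery == 0) {
                    System.out.printf("load %3.0f%%: %.2f avg probes/insert%n",
                        100.0 * inserted / size, (double) probes / inserted);
                }
            }
        }
    }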

ApacheCon2010: Cache & Concurrency Considerations in Cassandra (& limits of JVM) Presentation Transcript

  • 1. Cache & Concurrency considerations for a high performance Cassandra — SriSatish Ambati, Riptano (Cassandra), Azul Systems & OpenJDK; Twitter: @srisatish, [email_address]
  • 2. Trail ahead
    • Elements of Cache Performance
    • Metrics, Monitors
    • JVM goes to BigData Land!
    • Examples
    • Lucandra, Twissandra
    • Cassandra Performance with JVM Commentary
      • Runtime Views
      • Non Blocking HashMap
      • Locking: concurrency
      • Garbage Collection
  • 3. A feather in the CAP
    • Eventual Consistency
      • Levels
      • Doesn’t mean data loss (journaled)
    • SEDA
      • Partitioning, Cluster & Failure detection, Storage engine mod
      • Event driven & non-blocking io
      • Pure Java
  • 4. Count what is countable, measure what is measurable, and what is not measurable, make measurable -Galileo
  • 5. Elements of Cache Performance Metrics
    • Operations:
      • Ops/s: Puts/sec, Gets/sec, updates/sec
      • Latencies, percentiles
      • Indexing
    • # of nodes – scale, elasticity
    • Replication
      • Synchronous, Asynchronous (fast writes)
    • Tuneable Consistency
    • Durability/Persistence
    • Size & Number of Objects, Size of Cache
    • # of user clients
  • 6. Elements of Cache Performance: “Think Locality”
    • Hot or Not: The 80/20 rule.
      • A small set of objects are very popular!
      • What is the most RT tweet?
    • Hit or Miss: Hit Ratio
      • How effective is your cache?
      • LRU, LFU, FIFO.. Expiration
    • Long-lived objects lead to better locality.
    • Spikes happen
      • Cascading events
      • Cache Thrash: full table scans
  • 7. Real World Performance
    • Facebook Inbox
      • Writes:0.12ms, Reads:15ms @ 50GB data
    • Twitter performance
      • Twissandra (simulation)
    • Cassandra for Search & Portals
      • Lucandra, solandra (simulation)
    • YCSB/PNUTS benchmarks
      • 5ms read/writes @ 5k ops/s (50/50 Update heavy)
      • 8ms reads/5ms writes @ 5k ops/s (95/5 read heavy)
    • Lab environment
      • ~5k writes per sec per node, <5ms latencies
      • ~10k reads per sec per node, <5ms latencies
    • Performance has improved in newer versions
  • 8. Yahoo Cloud Serving Benchmark (YCSB): 50/50 – update heavy
  • 9. Yahoo Cloud Serving Benchmark (YCSB): 95/5 – read heavy
  • 10. JVM in BigData Land!
    • Limits for scale
    • Locks : synchronized
      • Can’t use all my multi-cores!
      • java.util.collections also hold locks
      • Use non-blocking collections! (see the sketch after this slide)
    • (de)Serialization is expensive
      • Hampers object portability
      • Use avro, thrift!
    • Object overhead
      • average enterprise collection has 3 elements!
      • Use byte[ ], primitives where possible!
    • Garbage Collection
      • Can’t throw memory at the problem!
      • Mitigate, Monitor, Measure footprint
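A minimal sketch of the "non-blocking collections" point above, contrasting a fully synchronized map with a concurrent one (the class names here are illustrative; NonBlockingHashMap is Cliff Click's high-scale-lib class shown later in this deck):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // A fully synchronized map serializes every access on one monitor, so extra
    // cores just queue up; a concurrent map lets readers and writers proceed in
    // parallel. high-scale-lib's NonBlockingHashMap is likewise a drop-in Map.
    public class MapChoices {
        static final Map<String, byte[]> locked =
            Collections.synchronizedMap(new HashMap<String, byte[]>());  // one lock

        static final Map<String, byte[]> concurrent =
            new ConcurrentHashMap<String, byte[]>();  // scales with cores
    }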
  • 11. Tools
    • What is the JVM doing:
      • dtrace, hprof, introscope, jconsole, visualvm, yourkit, azul zvision
    • Invasive JVM observation tools
      • bci, jvmti, jvmdi/pi agents, jmx, logging
    • What is the OS doing:
      • dtrace, oprofile, vtune
    • What is the network disk doing:
      • Ganglia, iostat, lsof, netstat, nagios
  • 12. furiously fast writes
    • Append only writes
      • Sequential disk access
    • No locks in critical path
    • Key based atomicity
    [Diagram: a client issues a write; the partitioner finds the owning node (n1, n2); that node appends to its commit log, then applies the mutation to memory.]
  • 13. furiously fast writes
    • Use separate disks for commitlog
      • Don’t forget to size them well
      • Isolation difficult in the cloud..
    • Memtable/SSTable sizes
      • Delicately balanced with GC
    • memtable_throughput_in_mb
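A hedged Java sketch (hypothetical class and method names, not Cassandra's actual code) of the append-only write path these two slides describe: append sequentially to the commit log first, then apply to the in-memory memtable.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Hypothetical sketch of the write path: durability first (sequential
    // commit-log append), then an in-memory update; no locks in the data path.
    public class WritePathSketch {
        private final FileChannel commitLog;
        private final ConcurrentSkipListMap<String, byte[]> memtable =
            new ConcurrentSkipListMap<String, byte[]>();  // stand-in for a memtable

        public WritePathSketch(String logPath) throws IOException {
            // Opened in append mode: every commit-log write is sequential disk I/O.
            commitLog = new FileOutputStream(logPath, true).getChannel();
        }

        public void write(String key, byte[] value) throws IOException {
            byte[] k = key.getBytes("UTF-8");
            ByteBuffer record = ByteBuffer.allocate(k.length + value.length);
            record.put(k).put(value);
            record.flip();
            commitLog.write(record);   // 1. append the mutation to the commit log
            memtable.put(key, value);  // 2. apply to memory only after a successful append
        }
    }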
  • 14. Cassandra on EC2 cloud *Corey Hulen, EC2
  • 15. Cassandra on EC2 cloud
  • 16.  
  • 17. Compactions
    [Diagram: several sorted SSTables (keys K1…K30 with serialized data) are merge-sorted in memory into one sorted data file; DELETED entries are dropped, and a new index file (offsets for K1, K5, K30) plus a bloom filter are written alongside the data file.]
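A hedged sketch of the merge step in that diagram (hypothetical names, not Cassandra's compaction code): a k-way merge over already-sorted SSTable key runs, emitting one sorted, de-duplicated run.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    // Merge-sort of several sorted SSTable key runs into one sorted output run,
    // keeping each key once, as in the compaction diagram above.
    public class CompactionMerge {
        static class Head implements Comparable<Head> {
            final String key;
            final Iterator<String> rest;
            Head(String key, Iterator<String> rest) { this.key = key; this.rest = rest; }
            public int compareTo(Head o) { return key.compareTo(o.key); }
        }

        public static List<String> merge(List<Iterator<String>> runs) {
            PriorityQueue<Head> heap = new PriorityQueue<Head>();
            for (Iterator<String> it : runs)
                if (it.hasNext()) heap.add(new Head(it.next(), it));
            List<String> out = new ArrayList<String>();
            String last = null;
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                if (!h.key.equals(last)) out.add(h.key);  // de-duplicate merged keys
                last = h.key;
                if (h.rest.hasNext()) heap.add(new Head(h.rest.next(), h.rest));
            }
            return out;
        }
    }

In the real system the merge also reconciles column timestamps and drops tombstones; the sketch keeps only the ordering logic.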
  • 18. Compactions
    • Intense disk io & mem churn
    • Triggers GC for tombstones
    • Minor/Major Compactions
    • Reduce priority for better reads
    • Other Parameters -
      • CompactionManager.minimumCompactionThreshold=xxxx
  • 19. Example: compaction in the real world, Cloudkick
  • 20. reads design
  • 21. reads performance
    • BloomFilter used to identify the right file
    • Maintain column indices to look up columns
      • Which can span different SSTables
    • Less I/O than a typical B-tree
    • Cold read: two seeks
      • One for the key lookup, another for the row lookup
    • Key Cache
      • Optimized in latest cassandra
    • Row Cache
      • Improves read performance
      • GC sensitive for large rows.
    • Most (google) applications require single row transactions*
    *Sanjay G, BigTable Design, Google .
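A hedged sketch of that read path (the interface and names are illustrative, not Cassandra's API): memtable first, then SSTables newest to oldest, with the bloom filter consulted before any disk seek.

    import java.util.List;
    import java.util.Map;

    // Read path sketch: check the in-memory table, then scan SSTables newest
    // to oldest, skipping any file whose bloom filter rules the key out.
    public class ReadPathSketch {
        interface SSTable {
            boolean mightContain(String key);  // bloom filter check, no disk I/O
            byte[] read(String key);           // index lookup plus one disk seek
        }

        public static byte[] get(Map<String, byte[]> memtable,
                                 List<SSTable> newestFirst, String key) {
            byte[] v = memtable.get(key);
            if (v != null) return v;
            for (SSTable t : newestFirst) {
                if (!t.mightContain(key)) continue;  // bloom filter says "definitely not here"
                v = t.read(key);
                if (v != null) return v;             // a rare false positive falls through
            }
            return null;
        }
    }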
  • 22. Client Performance Marshal Arts: Ser/Deserialization
    • Clients dominated by Thrift, Avro
      • Hector, Pelops
    • Thrift: upgrade to latest: 0.5, 0.4
    • No news: java.io.Serializable is S-L-O-W
    • Use “transient”
    • avro, thrift, proto-buf
    • Common Patterns of Doom:
      • Death by a million gets
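A small hedged illustration of the “transient” tip above (illustrative class, not from the deck): when java.io serialization is unavoidable, transient fields keep re-computable state out of the payload.

    import java.io.Serializable;

    // Fields marked transient are skipped by java.io serialization, so derived
    // or re-creatable state never inflates the wire format.
    public class CachedRow implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String rowKey;                 // serialized
        private transient byte[] decompressedValue;  // skipped: rebuilt on demand
        public CachedRow(String rowKey) { this.rowKey = rowKey; }
    }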
  • 23. Serialization + Deserialization uBench
    • http://code.google.com/p/thrift-protobuf-compare/wiki/BenchmarkingV2
  • 24. Adding Nodes
    • New nodes
      • Add themselves to busiest node
      • And then Split its Range
    • Busy Node starts transmit to new node
    • Bootstrap logic initiated from any node, cli, web
    • Each node capable of ~40MB/s
      • Multiple replicas to parallelize bootstrap
    • UDP for control messages
    • TCP for request routing
  • 25. inter-node comm
    • Gossip Protocol
      • It’s exponential
      • (epidemic algorithm)
    • Failure Detector
      • Accrual rate phi
    • Anti-Entropy
      • Bringing replicas up to date
  • 26. Bloom Filter: in full bloom
    • “constant” time
    • size:compact
    • false positives
    • Single lookup
      • for key in file
    • Deletion
    • Improve
      • Counting BF
      • Bloomier filters
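A minimal hedged bloom filter sketch (not Cassandra's implementation) showing the properties listed above: compact bits, constant-time add and lookup, and false positives but never false negatives. The k hash functions are simulated by mixing an index into one hash, a simplification.

    import java.util.BitSet;

    // Minimal bloom filter: add() sets k bits; mightContain() returns false
    // only when the key is definitely absent, true when present or on a
    // (tunable-rate) false positive.
    public class BloomFilterSketch {
        private final BitSet bits;
        private final int size, hashes;

        public BloomFilterSketch(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        private int slot(String key, int i) {
            // Simulate k independent hash functions by mixing in the index i.
            int h = key.hashCode() * 31 + i * 0x9E3779B9;
            return Math.abs(h % size);
        }

        public void add(String key) {
            for (int i = 0; i < hashes; i++) bits.set(slot(key, i));
        }

        public boolean mightContain(String key) {
            for (int i = 0; i < hashes; i++)
                if (!bits.get(slot(key, i))) return false;  // definitely absent
            return true;  // present, or a false positive
        }
    }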
  • 27. Birthdays, Collisions & Hashing functions
    • Birthday Paradox
      • For the N=21 people in this room
      • Probability that at least 2 of them share the same birthday is ~0.44
    • Collisions are real!
    • An unbalanced HashMap behaves like a list: O(n) retrieval
    • Chaining & linear probing
    • Performance degrades at about 80% table density
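A quick check of the birthday figure above: the probability that at least two of n people share a birthday is 1 − (365/365)·(364/365)·…·((365−n+1)/365).

    // Computes P(at least two of n people share a birthday).
    public class BirthdayParadox {
        public static void main(String[] args) {
            int n = 21;
            double allDistinct = 1.0;
            for (int i = 0; i < n; i++) {
                allDistinct *= (365.0 - i) / 365.0;  // i-th person avoids all prior birthdays
            }
            // Prints about 0.44 for n = 21 (and crosses 0.5 at n = 23).
            System.out.printf("P(shared birthday, n=%d) = %.2f%n", n, 1 - allDistinct);
        }
    }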
  • 28. the devil’s in the details
  • 29. CFS
    • All in the family!
    • denormalize
  • 30. Memtable
    • In-memory
    • ColumnFamily specific
    • throughput determines size before flush
    • Larger memtables can improve reads
  • 31. SSTable
    • MemTable “flushes” to a SSTable
    • Immutable after
    • Read: Multiple SSTable lookups possible
    • Chief Execs:
      • SSTableWriter
      • SSTableReader
  • 32. Write: Runtime threads
  • 33. Writes: runtime mem
  • 34. Example: Java Overheads
  • 35. writes: monitors
  • 36. U U I D
    • java.util.UUID is slow
      • static use leads to contention
    • SecureRandom
    • Uses /dev/urandom for seed initialization
      • -Djava.security.egd=file:/dev/urandom
    • A PRNG without the file is at least 20%–40% better.
    • Use TimeUUIDs where possible – much faster
    • JUG – Java UUID Generator
    • http://github.com/cowtowncoder/java-uuid-generator
    • http://jug.safehaus.org/
    • http://johannburkard.de/blog/programming/java/Java-UUID-generators-compared.html
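A hedged sketch of the TimeUUID tip, using the JUG library linked above; the Generators entry point is from JUG 3.x and should be treated as an assumption:

    import java.util.UUID;
    import com.fasterxml.uuid.Generators;  // JUG 3.x entry point (assumed API)

    // randomUUID() is backed by SecureRandom (shared state, possible contention
    // and entropy stalls); a version-1 TimeUUID is a cheap clock-plus-node read
    // and sorts by creation time.
    public class UuidSketch {
        public static void main(String[] args) {
            UUID random = UUID.randomUUID();                              // SecureRandom-backed
            UUID timeBased = Generators.timeBasedGenerator().generate();  // TimeUUID
            System.out.println(random + " vs " + timeBased);
        }
    }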
  • 37. synchronized
    • Coarse grained locks
    • io under lock
    • Stop signal on a highway
    • java.util.concurrent does not mean no locks
    • Non Blocking, Lock free, Wait free collections
  • 38. Scalable Lock-Free Coding Style
    • Big Array to hold Data
    • Concurrent writes via: CAS & Finite State Machine
      • No locks , no volatile
      • Much faster than locking under heavy load
      • Directly reach main data array in 1 step
    • Resize as needed
      • Copy Array to a larger Array on demand
      • Use State Machine to help copy
      • “Mark” old Array words to avoid missing late updates
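The CAS-plus-retry idiom behind this style, as a minimal hedged sketch using java.util.concurrent.atomic (this shows the general pattern, not the actual NonBlockingHashMap code):

    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Lock-free publication into a big array: threads race with CAS and retry
    // on interference; no thread ever blocks while holding a lock.
    public class CasSketch {
        private final AtomicReferenceArray<String> data =
            new AtomicReferenceArray<String>(1024);

        public String putIfAbsent(int slot, String value) {
            while (true) {
                String current = data.get(slot);
                if (current != null) return current;        // lost the race: keep the winner
                if (data.compareAndSet(slot, null, value))  // atomic compare-and-swap
                    return value;
                // CAS failed: another thread touched the slot; re-read and retry.
            }
        }
    }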
  • 39. Non-Blocking HashMap
  • 40. Cassandra uses High Scale Non-Blocking Hashmap
    public class BinaryMemtable implements IFlushable
    {
        private final Map<DecoratedKey, byte[]> columnFamilies =
            new NonBlockingHashMap<DecoratedKey, byte[]>();
        /* Lock and Condition for notifying new clients about Memtable switches */
        private final Lock lock = new ReentrantLock();
        Condition condition;
    }

    public class Table
    {
        private static final Map<String, Table> instances =
            new NonBlockingHashMap<String, Table>();
    }
  • 41. GC-sensitive elements within Cassandra
    • Compaction triggers System.gc()
      • Tombstones from files
    • “GCInspector”
    • Memtable Threshold, sizes
    • SSTable sizes
    • Low overhead collection choices
  • 42. Garbage Collection
    • Pause Times
      • if stop_the_world_FullGC > ttl_of_node
      • => failed requests; failure accrual & node repair
    • Allocation Rate
      • New object creation, insertion rate
    • Live Objects (residency)
      • if residency in heap > 50%
      • GC overheads dominate.
    • Overhead
      • space, cpu cycles spent GC
    • 64-bit not addressing pause times
      • Bigger is not better!
  • 43. Memory Fragmentation
    • Fragmentation
      • Performance degrades over time
      • Inducing a “Full GC” makes the problem go away
      • Free memory that cannot be used
    • Reduce occurrence
      • Use a compacting collector
      • Promote less often
      • Use uniform sized objects
    • Solution – unsolved
      • Use latest CMS with CR:6631166
      • Azul’s Zing JVM & Pauseless GC
  • 44. CASSANDRA-1014
  • 45. Best Practices: Garbage Collection
    • GC Logs are cheap even in production
    • -Xloggc:/var/log/cassandra/gc.log
    • -XX:+PrintGCDetails
    • -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
    • -XX:+PrintHeapAtGC
    • Slightly expensive ones:
    • -XX:PrintFLSStatistics=2 -XX:PrintCMSStatistics=1
    • -XX:+PrintCMSInitiationStatistics
  • 46. Sizing: Young Generation
    • Should we set -Xms == -Xmx?
    • Use -Xmn (fixed eden)
    [Diagram: allocation flow — {new Object();} allocations by the JVM land in eden, survive copies through the survivor spaces (survivor ratio, tenuring threshold), and are promoted into the old generation.]
  • 47. Tuning CMS
    • Don’t promote too often!
      • Frequent promotion causes fragmentation
    • Size the generations
      • Min GC times are a function of Live Set
      • Old Gen should host steady state comfortably
    • Parallelize on multicores:
      • -XX:ParallelCMSThreads=4
      • -XX:ParallelGCThreads=4
    • Avoid the CMS initiation heuristic
      • -XX:+UseCMSInitiatingOccupancyOnly
    • Use Concurrent for System.gc()
      • -XX:+ExplicitGCInvokesConcurrent
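Putting the flags from this slide and slide 45 together, a hedged example invocation; heap sizes, the occupancy fraction, and the daemon class are illustrative placeholders, not recommendations:

    java -Xms8g -Xmx8g -Xmn2g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:ParallelCMSThreads=4 -XX:ParallelGCThreads=4 \
         -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
         -XX:+ExplicitGCInvokesConcurrent \
         -Xloggc:/var/log/cassandra/gc.log \
         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution \
         org.apache.cassandra.thrift.CassandraDaemon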
  • 48. Summary
    • Design & implementation of Cassandra take advantage of the JVM's strengths while avoiding common JVM issues.
    • Locks:
      • Avoids locks in critical path
      • Uses non-blocking collections, TimeUUIDs!
      • Still Can’t use all my multi-cores..?
      • >> Other bottlenecks to find!
    • De/Serialization:
      • Uses avro, thrift!
    • Object overhead
      • Uses mostly byte[ ], primitives where possible!
    • Garbage Collection
      • Mitigate: Monitor, Measure footprint.
      • Work in progress by all jvm vendors!
    • Cassandra starts from a great footing from a JVM standpoint and will reap the benefits of the platform!
  • 49. Q&A
    • References
    • Werner Vogels, Eventually Consistent, http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
    • Bloom, Burton H. (1970), “Space/time trade-offs in hash coding with allowable errors”
    • Avinash Lakshman, http://static.last.fm/johan/nosql-20090611/cassandra_nosql.pdf
    • Eric Brewer, CAP http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
    • Tony Printzeis, Charlie Hunt, Javaone Talk http://www.scribd.com/doc/36090475/GC-Tuning-in-the-Java
    • http://github.com/digitalreasoning/PyStratus/wiki/Documentation
    • http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
    • Cassandra on Cloud, http://www.coreyhulen.org/?p=326
    • Cliff Click, Non-blocking HashMap, http://sourceforge.net/projects/high-scale-lib/
    • Brian F. Cooper et al., Yahoo Cloud Serving Benchmark, http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf
  • 50. DataModel: Know your use patterns
    • Alternative Twitter DataModel:
    <Keyspace Name="Multiblog">
        <ColumnFamily CompareWith="TimeUUIDType" Name="Blogs" />
        <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"/>
    </Keyspace>