4. What is Cassandra? Key-value store (with some structure) Highly scalable Eventually consistent Distributed Tunable Partitioning Replication
5. Where did it come from? Created at Facebook Dynamo: distribution architecture BigTable: data model Open-sourced in 2008 Apache incubator in early 2009 Graduation in March 2010
6. Who uses it? Rackspace Facebook (of course) Twitter Digg Reddit IBM Others…
7. What problems does it solve? Reliability at scale No single point of failure (all nodes are identical) Simple, linear scaling High write throughput Large data sets
8. What problems can’t it solve? No flexible indices No querying on non-PK values Not good for big binary data (>64 MB) unless you chunk Row contents must fit in available memory
11. Concepts: Replication & Consistency You specify replication factor You specify consistency level for read/write operations ZERO, ONE, QUORUM, ALL, ANY
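The interplay between replication factor and consistency level comes down to simple counting. The sketch below is illustrative only: the function names are hypothetical, and it assumes the usual majority rule for QUORUM and the rule of thumb that a read is guaranteed to see the latest write when the read and write replica counts add up to more than the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replica responses awaited for ONE, QUORUM or ALL (hypothetical helper)."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def read_sees_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    r = replicas_required(read_level, rf)
    w = replicas_required(write_level, rf)
    return r + w > rf


if __name__ == "__main__":
    for rf in (2, 3, 5):
        print("RF", rf, "quorum", quorum(rf))
```

ZERO and ANY are left out of the overlap check on purpose: neither waits for an acknowledgement from a replica that actually owns the key, so the counting argument does not apply to them.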
12. Ring Topology Storage ring Every node gets a token Defines its place in the storage ring And which keys it is responsible for (its ranges) RF=3 a j d g
13. Ring Topology Storage ring Every node gets a token Defines its place in the storage ring And which keys it is responsible for (its ranges) RF=2 a j d g
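A rough way to picture the two ring slides: each key goes to the first node whose token is at or after the key, and the following nodes clockwise hold the additional replicas. This is a minimal sketch assuming single-letter tokens (a, d, g, j, as on the slides) and a simple "next nodes on the ring" placement; `replicas_for` is a hypothetical name, not a Cassandra API.

```python
from bisect import bisect_left


def replicas_for(key_token, node_tokens, rf):
    """Return the node tokens that store key_token, primary replica first."""
    ring = sorted(node_tokens)
    # First node whose token is >= the key; wrap to the start past the last token.
    start = bisect_left(ring, key_token) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


if __name__ == "__main__":
    nodes = ["a", "d", "g", "j"]
    print(replicas_for("e", nodes, 3))
    print(replicas_for("e", nodes, 2))
```

With RF=3 a key that falls between d and g lands on g and is also copied to j and a; with RF=2 only g and j hold it.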
14. Ring: New Node New node Ranges are adjusted RF=3 a m j d g
15. Ring: New Node New node Ranges are adjusted RF=2 a m j d g
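To see what "ranges are adjusted" means in practice, the sketch below compares key ownership before and after node m joins. It assumes the same single-letter tokens as the slides and that a node owns the range ending at its own token; `primary_owner` and `moved_keys` are hypothetical helper names.

```python
from bisect import bisect_left


def primary_owner(key_token, node_tokens):
    """First node clockwise whose token is >= the key, wrapping around."""
    ring = sorted(node_tokens)
    return ring[bisect_left(ring, key_token) % len(ring)]


def moved_keys(keys, old_tokens, new_tokens):
    """Map each key whose primary owner changed to (old_owner, new_owner)."""
    moved = {}
    for key in keys:
        old = primary_owner(key, old_tokens)
        new = primary_owner(key, new_tokens)
        if old != new:
            moved[key] = (old, new)
    return moved


if __name__ == "__main__":
    before = ["a", "d", "g", "j"]
    after = before + ["m"]
    print(moved_keys(["b", "e", "h", "k", "l", "n"], before, after))
```

Only the keys between j and m change hands (from a to m); every other range is untouched, which is why adding a node disturbs just one neighbour's primary range.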
16. Ring Partition Node dies or becomes isolated from the ring Hinted handoff RF=3 a m j d g
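Hinted handoff can be pictured as "leave a note for the node that missed the write". The following toy model is a sketch under stated assumptions, not Cassandra's implementation: the `Node` class, the choice to keep hints on the coordinating node, and the `replay_hints` function are all hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target node name, key, value) held for down replicas


def write(key, value, replicas, coordinator):
    """Write to every live replica; store a hint for each one that is down."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((replica.name, key, value))
    return acks


def replay_hints(holder, recovered):
    """Deliver the hints meant for a node that has come back."""
    remaining = []
    for target, key, value in holder.hints:
        if target == recovered.name and recovered.up:
            recovered.data[key] = value
        else:
            remaining.append((target, key, value))
    holder.hints = remaining


if __name__ == "__main__":
    g, j, a = Node("g"), Node("j"), Node("a")
    j.up = False
    print("acks:", write("e", "value", [g, j, a], coordinator=g))
    j.up = True
    replay_hints(g, j)
    print("j now has:", j.data)
```

The write still succeeds at any consistency level the live replicas can satisfy, and the isolated node catches up once it rejoins the ring.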
17. Data Model Keyspace: contains column families ColumnFamily: Standard or Super Two levels of indexes (key and column name)
18. Data Model Column and subcolumn sorting Specify your own comparator: TimeUUID LexicalUUID UTF8 Long Bytes CreateYourOwn
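The choice of comparator changes the order in which columns come back. As an illustration only, the sketch below sorts the same 8-byte column names two ways: as raw bytes and as signed 64-bit integers. Python's own byte ordering stands in for the Bytes comparator here, and the function names are hypothetical.

```python
import struct


def as_long(name: bytes) -> int:
    """Interpret an 8-byte column name as a signed big-endian long."""
    return struct.unpack(">q", name)[0]


def as_bytes(name: bytes) -> bytes:
    """Compare column names byte by byte, unsigned."""
    return name


def sort_columns(names, comparator):
    """Return column names in the order the chosen comparator implies."""
    return sorted(names, key=comparator)


if __name__ == "__main__":
    names = [struct.pack(">q", n) for n in (5, -1, 300)]
    print([as_long(n) for n in sort_columns(names, as_long)])
    print([as_long(n) for n in sort_columns(names, as_bytes)])
```

The negative value sorts first under the long comparison but last under the byte comparison, because its leading byte is 0xFF. Slices over column ranges follow whichever order the column family was defined with.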
22. Inserting: Writes Commit log for durability Memtable – no disk access (no reads or seeks) SSTables are final (become read only) Index Bloom filter Raw data Atomic within a ColumnFamily Bottom line: FAST!!!
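The write path on this slide can be modelled in a few lines: append to a log, update an in-memory table, and occasionally freeze that table into a read-only structure with a Bloom filter in front of it. This is a toy sketch with hypothetical class names and an arbitrary flush threshold; it does not reflect Cassandra's file formats.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Store:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only, for durability
        self.memtable = {}     # in memory, no disk reads or seeks on write
        self.sstables = []     # immutable (bloom filter, sorted rows) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, columns):
        self.commit_log.append((key, dict(columns)))
        self.memtable.setdefault(key, {}).update(columns)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        rows = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append((bloom, rows))  # tuples: never modified again
        self.memtable = {}
```

Nothing in `write` reads from disk, which is the point of the slide: the only I/O on the hot path is a sequential append.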
23. Querying: Overview You need a key or keys: Single: key=‘a’ Range: key=‘a’ through ‘f’ And columns to retrieve: Slice: cols={bar through kite} By name: key=‘b’ cols={bar, cat, llama} Nothing like SQL “WHERE col=‘faz’” But secondary indices are being worked on (see CASSANDRA-749)
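The two ways of picking columns can be sketched over a plain dictionary standing in for one row. The helper names below are hypothetical and are not the Thrift API; the sketch assumes column names sort as plain strings.

```python
from bisect import bisect_left, bisect_right


def get_slice(row, start, finish):
    """Columns whose names fall in [start, finish], in sorted order."""
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_right(names, finish)
    return {name: row[name] for name in names[lo:hi]}


def get_by_names(row, names):
    """Only the named columns; names that are absent are skipped."""
    return {name: row[name] for name in names if name in row}


if __name__ == "__main__":
    row_b = {"apple": 1, "bar": 2, "cat": 3, "dog": 4, "kite": 5, "llama": 6}
    print(get_slice(row_b, "bar", "kite"))
    print(get_by_names(row_b, ["bar", "cat", "llama"]))
```

Both lookups start from a row key and use column names only. Nothing here can answer "which rows have a column equal to faz", which is the gap the slide points out.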
24. Querying: Reads Not as fast as writes Read repair when out of sync New in 0.6: Row cache (avoid sstable lookup) Key cache (avoid index scan)
25. Client API (Low level) Fat Client Maybe too low level, not well-tested Thrift (currently best-supported) Many language bindings Not much of a community No streaming Fast transport Avro Just getting started Shows promise
26. Client API (High Level) Rapidly changing, getting feature-rich Connection pools Load balancing/Failover Reduces the verbosity of working with Thrift For Java, see Hector http://github.com/rantav/hector Also Ruby, Python, C++, C#, Perl, PHP http://wiki.apache.org/cassandra/ClientExamples
27. Java Bits: JMX Relatively easy to expose objects and services as MBeans Simplifies aspects of cluster and node management Easy monitoring You choose the JMX-enabled system management tool (jconsole is alright)
28. Java Bits: available libraries Excellent: Google collections Multimap, BiMap, Iterators java.util.concurrent nio files (including mmap) Meh: nio sockets
30. Java Bits: code management Library versioning No standard way Mostly declarative Not readily queryable Must ship every dependency Or use ant/mvn. Now you have two (or more!) problems.
31. Java Bits: daemonization Java doesn’t make it easy re: stdout, stderr After setting up, System.out and System.err are close()d Windows: don’t ask
32. Future Direction Range delete (delete these cols from those keys) Vector clocks (including server-side conflict resolution) Altering keyspace/column family definitions on a live cluster Byte[] keys Compression Multi-tenant support Fewer memory restrictions
33. Linky wiki.apache.org/cassandra cassandra.apache.org Google BigTable labs.google.com/papers/bigtable.html Amazon Dynamo s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf Facebook Cassandra www.facebook.com/note.php?note_id=24413138919 Java tuning: java.sun.com/performance/reference/whitepapers/tuning.html java.sun.com/javase/technologies/hotspot/gc/index.jsp Me gdusbabek@gmail.com gdusbabek on twitter and just about everything else.
Editor's Notes
Hello World
RandomPartitioner – takes the key, uses its MD5 hash as the real key, then stores on the appropriate node. OrderPreservingPartitioner – get cheap range scans. Takes more work.
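The difference between the two partitioners shows up as soon as keys are turned into ring positions. The sketch below is a simplification: it treats the MD5 digest as an unsigned 128-bit integer, which may not match the exact token arithmetic Cassandra uses, and both function names are hypothetical.

```python
import hashlib


def random_partitioner_token(key: str) -> int:
    """Hash the key with MD5 and use the digest as its position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def order_preserving_token(key: str) -> str:
    """The key itself is the position, so adjacent keys stay adjacent."""
    return key


if __name__ == "__main__":
    keys = ["user1", "user2", "user3"]
    print(sorted(keys, key=random_partitioner_token))
    print(sorted(keys, key=order_preserving_token))
```

Hashing spreads keys evenly around the ring but scatters neighbouring keys, so a key range scan is only cheap when the key order is preserved. The "more work" is that the operator then has to pick tokens that balance the load.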
Eric Brewer
Need to describe hinted handoff better.
Keyspace == like a namespace. CF == like a table. Keyspace and Table are used interchangeably in the code.
Key cache: keys whose locations are kept in memory to avoid an index scan. Row cache: entire rows kept in memory.
Avro: Doug Cutting
Mmap – index and data files (read only)
java.sun.com/performance/reference/whitepapers/tuning.html
http://java.sun.com/javase/technologies/hotspot/gc/index.jsp
Goal is low pause times and high throughput:
-XX:TargetSurvivorRatio=90 Allows 90% of the survivor spaces to be occupied instead of the default 50%, allowing better utilization of the survivor space memory.
-XX:SurvivorRatio=128 Sets the survivor space ratio to 1:128, resulting in small survivor spaces. Smaller survivor spaces give short-lived objects less time in the young generation (they die faster).
-XX:+AggressiveOpts Turns on point optimizations that are expected to be on in later releases. Experimental and sometimes reveals JDK bugs.
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC Parallel young generation collector. Similar to -XX:+UseParallelGC except that it can be used with the concurrent collector. The benefits show up on multiway systems. Two short pauses instead of one long pause (initial mark, then remark). Initial mark: objects that are directly reachable (young). Remark: objects missed due to concurrent execution of threads.
-XX:+CMSParallelRemarkEnabled Works with UseParNewGC to decrease the remark pauses.