C* Summit 2013: Comparing Architectures: Cassandra vs the Field by Sameer Farooqui


Have you wondered what actually happens when you submit a write to Cassandra? This vendor-agnostic technical talk covers the internals of Cassandra's read and write paths and compares them to other NoSQL stores, especially HBase, so you can pick the right database for your project. Topics include consistency levels, memtables/memstores, SSTables/HFiles, bloom filters, block indexes, data distribution partitioners, and optimal use cases.


  1. 1. COMPARING ARCHITECTURES: CASSANDRA VS. THE FIELD
      By Sameer Farooqui (sameer@blueplastic.com)
      blueplastic.com/c.pdf
      linkedin.com/in/blueplastic/
      @blueplastic
      http://youtu.be/ziqx2hJY8Hg
      #Cassandra13
  2. 2. NoSQL Options
      - Key -> Value: Riak, Redis, Memcached DB, Berkeley DB, Hamster DB, Amazon Dynamo, Voldemort, FoundationDB, LevelDB, Tokyo Cabinet
      - Key -> Doc: MongoDB, CouchDB, Terrastore, OrientDB, RavenDB, Elasticsearch
      - Column Family: Cassandra, HBase, Hypertable, Amazon SimpleDB, Accumulo, HPCC, Cloudata
      - Graph: Neo4J, Infinite Graph, OrientDB, FlockDB, Gremlin, Titan
      - ~Real Time: Storm, Impala, Stinger/Tez, Drill, Solr/Lucene
  3. 3. Key -> Value
      Key (ID)   Value (Name)
      0001       Winston Smith
      0002       Julia
      0003       O’Brien
      0004       Emmanuel Goldstein
      (The value can also be an object, blob, JSON, XML, etc.)
      - Simple API: get, put, delete
      - K/V pairs are stored in containers called buckets
      - Consistency only for a single key
      - Very fast lookups and good scalability (sharding)
      - All access via primary key
      Use cases: content caching, web session info, user profiles, preferences, shopping carts
      Don’t use for: querying by data, multi-operation transactions, relationships between data
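The get/put/delete API and the bucket containers described on this slide can be sketched in a few lines of Python. This is only an in-memory illustration: the class name `KeyValueStore` and the bucket name `"names"` are hypothetical and do not correspond to any specific product's API.

```python
class KeyValueStore:
    """Toy in-memory key/value store with named buckets (illustrative only)."""

    def __init__(self):
        self._buckets = {}  # bucket name -> {key: value}

    def put(self, bucket, key, value):
        # The value is opaque to the store: a string, blob, JSON text, etc.
        self._buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key, default=None):
        # All access goes through the primary key; there is no query by value.
        return self._buckets.get(bucket, {}).get(key, default)

    def delete(self, bucket, key):
        # Returns True if the key existed and was removed.
        data = self._buckets.get(bucket, {})
        if key in data:
            del data[key]
            return True
        return False


store = KeyValueStore()
store.put("names", "0001", "Winston Smith")
store.put("names", "0002", "Julia")
print(store.get("names", "0001"))
```

Because every operation touches exactly one key, it is easy to shard such a store across machines, which is where the "good scalability" point comes from.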
  4. 4. Key -> Document
      - Like K/V, but the value is examinable
      - Documents: XML, JSON, BSON, etc.
      - The structure of stored docs should be similar, but doesn’t need to be identical
      - Tolerant of incomplete data
      Use cases: event logging, content management systems, blogging platforms, web analytics
      Don’t use for: complex transactions spanning different operations, strict-schema applications
      Key: 0001
      Value: {firstname: “Nuru”, lastname: “Abdalla”, location: “Uganda”, languages: [“English, Swahili”], mother: “Aziza”, father: “Mufa”, refugee_camp: “camp-10”, picture: “01010110”}
      Key: 0039
      Value: {firstname: “Dee”, location: “Uganda”, languages: “Swahili”, refugee_camp: “camp-54”, picture: “01010110”}
  5. 5. Graph Databases
      Use cases: connected data (social networks), shortest path, recommendation engines, routing/dispatch/location services (node = location/address)
      Don’t use for: workloads that must cluster/scale beyond one node (not easy), or queries that have to traverse the entire graph
      [Figure: example graph linking phone numbers (407-666-4012, 407-384-4924, 415-242-9492, 407-336-1193, two +44 numbers), GPS coordinates and an IMSI number]
  6. 6. ~ Real Time
      Storm, Impala, Stinger/Tez, Drill, Spark/Shark
      - Distributed, real time computation system / stream processing
      - For doing a continuous query on data streams and streaming the results into clients (continuous computation)
      - Still emerging, most are in alpha or beta stages
      [Figure: Storm example that counts hash tags (#), with Spout and Bolt labels]
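  7. 7. Column Family
      [Figure: a table named Table-Name-Alpha with rows ROW-1 to ROW-6, labels Region-1 and Region-2, and three column families: Col Fam 1 (columns C1 to C4), Col Fam 2 (columns A to D) and Col Fam 3 (columns 1 to 4). One cell holds two versions, v1=Z and v2=K.]
      (Table, Row Key, Col. Family, Column, Timestamp) → Value (Z)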
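The mapping (Table, Row Key, Col. Family, Column, Timestamp) → Value can be pictured as a nested lookup. The sketch below is a toy in-memory model of that addressing scheme only; the class `ColumnFamilyTable` and its methods are hypothetical names, and the timestamps 1 and 2 stand in for the versions v1 and v2 on the slide.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy model of a column-family table with versioned cells."""

    def __init__(self, name):
        self.name = name
        # (row_key, column_family, column) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row_key, family, column, value, timestamp):
        self._cells[(row_key, family, column)][timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        versions = self._cells.get((row_key, family, column))
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)  # default to the newest version
        return versions.get(timestamp)


table = ColumnFamilyTable("Table-Name-Alpha")
table.put("ROW-1", "Col Fam 1", "C1", "Z", timestamp=1)  # v1 = Z
table.put("ROW-1", "Col Fam 1", "C1", "K", timestamp=2)  # v2 = K
print(table.get("ROW-1", "Col Fam 1", "C1", timestamp=1))
print(table.get("ROW-1", "Col Fam 1", "C1"))
```

Reading with an explicit timestamp returns that version; reading without one returns the newest version of the cell.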
  8. 8. Column Family
      - Know your R + W queries up front
      - Design the data model and system architecture to optimally fulfil those queries
      - Important to understand the architecture fundamentals
  9. 9. How to pick a CF database
  10. 10. How to pick a CF database
      [Figure: Google Trends chart]
  11. 11. How to pick a CF database
      [Figure: Google Trends regional interest, shown as two top-five lists: (South Korea, India, USA, China, Russia) and (Netherlands, South Korea, Belgium, China, Taiwan)]
  12. 12. How to pick a CF database
      - Check activity on the Apache user mailing lists
      Date         Apache Cassandra   Apache HBase
      Jan 2013     739                783
      Feb 2013     714                797
      March 2013   837                692
      April 2013   730                741
      May 2013     567                636
  13. 13. [Figure: Cassandra's lineage from Dynamo (Oct, 2007) and BigTable (Nov, 2006); diagram labels: Storage Engine, Data Model]
  14. 14. [Figure: Google systems and their open-source counterparts: GFS (Oct, 2003) and HDFS; MapReduce (Dec, 2004) and MapReduce; BigTable (Nov, 2006); Chubby (Nov, 2006) and ZooKeeper]
  15. 15. Both:
      • Written in Java
      • Column Family Oriented Databases
      • Have reached 1,000+ nodes in production
      • Very low latency reads + writes
      • Use Log Structured Merge Trees
      • Atomic at row level
      • No support for joins, transactions, foreign keys
  16. 16. Cassandra vs. HBase
      Cassandra:
      • Peer to Peer architecture
      • Tunable consistency
      • Secondary Indexes available
      • Writes to ext3/4
      • Conflict resolution handled during reads
      • N-way writes
      • Random and ordered sorting of row keys supported
      HBase:
      • Master / Slave architecture
      • Strict consistency
      • No native secondary index support
      • Writes to HDFS
      • Conflict resolution handled during writes
      • Pipelined write
      • Ordered/Lexicographical sorting of row keys
  17. 17. Amazon.com’s Dynamo Use Cases
      Services that only need primary key access to the data store:
      - Best seller lists
      - Customer preferences
      - Sales rank
      - Product catalog
      - Session management
      No need for:
      - Complex SQL queries
      - Operations spanning multiple data items
      • Shopping cart service must always allow customers to add and remove items
      • If there are 2 conflicting versions of a write, the application should be able to pull both writes and merge them
      • Designed for apps that “need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance”
  18. 18. Google’s BigTable Use Cases
      60 products at Google once used BigTable:
      - Gmail
      - YouTube
      - Google Earth
      - Google Finance
      - Google Analytics
      - Personalized Search
      • Must be able to store the entire web crawl data
      • Relies on GFS for replication and data availability
      • Strong integration with MapReduce
  19. 19. [Figure: ring of 8 nodes (1 to 8) with a client]
      - Gossip runs every second on a node to 3 other nodes
      - Used to discover location and state information about other nodes
      - Phi Accrual Failure Detector used to detect failures via a suspicion level on a continuous scale
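The "suspicion level on a continuous scale" idea can be sketched as follows. This is a simplified illustration, not Cassandra's implementation: it assumes heartbeat inter-arrival times follow an exponential distribution, in which case phi reduces to `elapsed / (mean_interval * ln 10)`. The class name `PhiAccrualDetector` and the window size are hypothetical.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual failure detector (exponential-arrival assumption)."""

    def __init__(self, window=1000):
        self._intervals = deque(maxlen=window)  # recent heartbeat gaps, in seconds
        self._last = None                       # time of the last heartbeat

    def heartbeat(self, now):
        # Record the gap since the previous heartbeat (e.g. a gossip message).
        if self._last is not None:
            self._intervals.append(now - self._last)
        self._last = now

    def phi(self, now):
        # phi = -log10(P(no heartbeat for this long)); higher means more suspicious.
        if not self._intervals:
            return 0.0
        mean = sum(self._intervals) / len(self._intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self._last
        return elapsed / (mean * math.log(10))


detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.0):   # one heartbeat per second
    detector.heartbeat(t)
print(detector.phi(4.0), detector.phi(13.0))
```

Instead of a binary up/down verdict, the caller compares phi against a threshold of its choosing, so the same detector can be tuned to be more or less aggressive.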
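  20. 20. [Figure: Hadoop/HBase cluster layout with a client. Master machines: NameNode (NN), JobTracker (JT), HBase Master (HM), ZooKeeper, Standby NN, HBase Master Standby. Slave machines: each runs a DataNode (DN), TaskTracker (TT), map and reduce tasks (M, R) and a RegionServer (RS). Disk labels: OS, RAID, 2 TB each SATA.]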
  21. 21. Effort to deploy
      Cassandra:
      - One monolithic database install (1 JVM per node) + 1 log file and 1 config file (YAML)
      - No single points of failure, so no standby master nodes
      - Good default settings
      HBase:
      - More complex to deploy (multiple JVMs per node) + many log files and many config files (XMLs)
      - More moving parts: HDFS, HBase, MapReduce, Passive NameNode, Standby HBase Master, ZooKeeper
      - Default settings usually need tweaking
  22. 22. Where to write? (HBase)
      [Figure: the client asks ZooKeeper where -ROOT- is; -ROOT- says which .META. region to go to (.META.1, .META.2 or .META.3); .META. says which RegionServer (RS a, b or c) holds the region. Root & Meta locations are cached.]
      - Synchronous replication via HDFS
      - No control over replication or consistency for each write!
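  23. 23. Where to write?
      [Figure: ring of 8 nodes (1 to 8); the client sends the write to a coordinator node; replicas R1, R2 and R3]
      Replication F. = 3, Consistency = 1
  24. 24. Where to write?
      [Figure: ring of 8 nodes (1 to 8); the client sends the write to a coordinator node; replicas R1, R2 and R3]
      Replication F. = 3, Consistency = 2
  25. 25. Where to write?
      [Figure: ring of 8 nodes (1 to 8); the client sends the write to a coordinator node; replicas R1, R2, R3 and R4]
      Replication F. = 4, Consistency = 2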
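The three ring slides vary two knobs: how many replicas hold the row (replication factor) and how many of them must acknowledge a write (consistency). A rough sketch, under loud assumptions: the key is hashed straight to a starting node with a modulo (real token-range assignment is more involved), replicas are simply the next nodes clockwise, and `replicas_for`, `write` and `is_up` are hypothetical names.

```python
import hashlib

NODES = [1, 2, 3, 4, 5, 6, 7, 8]  # the 8-node ring from the slides


def replicas_for(key, replication_factor, nodes=NODES):
    """Pick the first replica by hashing the key, then walk the ring clockwise."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest, "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def write(key, replication_factor, consistency, is_up=lambda node: True):
    """Coordinator logic: send to all replicas, succeed once enough acknowledge."""
    targets = replicas_for(key, replication_factor)
    acks = [node for node in targets if is_up(node)]  # reachable replicas acknowledge
    return len(acks) >= consistency


targets = replicas_for("user-0001", 3)
print(targets)
# With one replica down, Consistency = 2 still succeeds but Consistency = 3 fails.
print(write("user-0001", 3, 2, is_up=lambda n: n != targets[0]))
print(write("user-0001", 3, 3, is_up=lambda n: n != targets[0]))
```

The write always goes to every replica; the consistency setting only controls how many acknowledgements the coordinator waits for before reporting success to the client.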
  26. 26. Strong Consistency Costs
      Cassandra:
      - Write to 3 nodes (RF = 3, C = 2)
      - Read from at least 2 nodes to guarantee strong consistency
      HBase:
      - Write to 3 nodes (RF = 3, C = 3)
      - Read from only 1 node to guarantee strong consistency
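Both cases on this slide follow from the usual quorum-overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read plus the number of replicas that acknowledged the write exceeds the replication factor. A small sketch (the function names are illustrative, not from any library):

```python
def is_strongly_consistent(replication_factor, write_cl, read_cl):
    """True when every read set must overlap every acknowledged write set."""
    return read_cl + write_cl > replication_factor


def min_read_replicas(replication_factor, write_cl):
    """Smallest number of replicas a read must contact for strong consistency."""
    return replication_factor - write_cl + 1


# RF = 3, writes acknowledged by 2 replicas: reads must contact at least 2.
print(min_read_replicas(3, 2))
# RF = 3, writes acknowledged by all 3 replicas: reading 1 replica is enough.
print(min_read_replicas(3, 3))
```

The trade-off is where the cost lands: acknowledging fewer writes makes reads more expensive, and acknowledging all writes makes reads cheap but writes less available.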
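  27. 27. Log Structured Merge Trees
      (Table, Row Key, Col. Family, Column, Timestamp) → Value (Z)
      [Figure: write path inside a node's JVM, with entries Z, A, B, C, D]
      - C*: CommitLog, Memtable, flush to SSTable
      - HBase: WAL, Memstore, flush to HFile
  28. 28. Log Structured Merge Trees
      [Figure: continuation of the previous diagram, showing the entries (Z, A, B, C, D) flushed from the Memtable to an SSTable (C*) and from the Memstore to an HFile (HBase)]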
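The write path in these two diagrams (append to a log, update an in-memory table, flush a sorted immutable file) can be sketched as below. This is a toy model only: the "commit log" and "SSTables" are Python lists rather than files, the flush threshold is arbitrary, and the class `LsmStore` is a hypothetical name.

```python
class LsmStore:
    """Toy log-structured merge store: commit log + memtable + flushed sorted tables."""

    def __init__(self, memtable_limit=4):
        self.commit_log = []   # stands in for the on-disk CommitLog / WAL
        self.memtable = {}     # in-memory Memtable / Memstore
        self.sstables = []     # flushed, immutable, sorted tables (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. durable append first
        self.memtable[key] = value            # 2. then the in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. write the memtable out as one sorted, immutable table
        if not self.memtable:
            return
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            for k, v in table:
                if k == key:
                    return v
        return None


store = LsmStore(memtable_limit=5)
for key in ["Z", "A", "B", "C", "D"]:
    store.put(key, key.lower())
print(store.sstables[0])
```

Writes never modify a flushed table in place, which is why both systems need compaction later to merge the accumulated files.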
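  29. 29. Flush Details
      [Figure: a flush writes the entries (Z, A, B, C, D) to an SSTable or an HFile, each with a Bloom Filter and a Block Index; Bloom filter labels: "R only" and "R + C"]
      - In HBase, BF and BI are stored in the HFile
      - In C*, there are separate data, BF and index files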
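A Bloom filter lets a read skip a flushed file that definitely does not contain the requested key. The sketch below is a generic textbook version, not the on-disk format of either system: the sizes, the use of SHA-256 to derive positions, and the class name `BloomFilter` are all assumptions. The key passed in could be a row key alone or a row key combined with a column name.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for row_key in ["ROW-1", "ROW-2", "ROW-3"]:
    bf.add(row_key)
print(bf.might_contain("ROW-1"))
```

Only when the filter answers "possibly present" does the reader consult the block index and then the data itself.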
  30. 30. Flush per Column Family
      Cassandra:
      - Supported
      HBase:
      - Flushes all Column Families together
      - Unnecessary flushing puts more network pressure on HBase, since HFiles have to be replicated to 2 other HDFS nodes
      - Flush per CF is under development via JIRA 3149
  31. 31. Secondary Indexes
      Cassandra:
      - Native support for Secondary Indexes
      HBase:
      - No native Secondary Indexes
      - But a trigger can be launched after a put to keep a secondary index (another CF) up to date, without putting the burden on the client
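The "trigger after a put keeps another CF up to date" approach can be illustrated with a toy table whose `put` calls a hook that maintains an index. This is a conceptual sketch in plain Python, not HBase coprocessor code; `IndexedTable`, `_after_put` and `find_by_index` are hypothetical names.

```python
class IndexedTable:
    """Toy table that keeps a secondary index in step with every put."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}    # row key -> {column: value}
        self.index = {}   # indexed column value -> set of row keys (the "other CF")

    def put(self, row_key, columns):
        old_row = self.rows.get(row_key, {})
        new_row = {**old_row, **columns}
        self.rows[row_key] = new_row
        self._after_put(row_key, old_row, new_row)  # the "trigger"

    def _after_put(self, row_key, old_row, new_row):
        old_value = old_row.get(self.indexed_column)
        new_value = new_row.get(self.indexed_column)
        if old_value == new_value:
            return
        if old_value is not None:
            self.index[old_value].discard(row_key)  # drop the stale index entry
        if new_value is not None:
            self.index.setdefault(new_value, set()).add(row_key)

    def find_by_index(self, value):
        return sorted(self.index.get(value, set()))


table = IndexedTable("refugee_camp")
table.put("0001", {"firstname": "Nuru", "refugee_camp": "camp-10"})
table.put("0039", {"firstname": "Dee", "refugee_camp": "camp-54"})
print(table.find_by_index("camp-10"))
```

The client only ever calls `put`; the index maintenance happens on the server side of the call, which is the point the slide makes about not burdening the client.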
  32. 32. SSD Support
      Cassandra:
      - It is possible to place just the SSTables on SSD: in the YAML file, set commitlog_directory to spinning disks and set data_file_directories to SSD
      - See Rick Branson’s talk: youtube.com/watch?v=zQdDi9pdf3I
      HBase:
      - Not possible to tell HDFS to only store WAL or HFiles on SSD
      - There is some support in MapR and Intel distributions for this
      - Apache HDFS JIRAs 2832 & 4672 have preliminary discussions
  33. 33. Compactions
      Cassandra:
      - Tiered and Leveled
      - For leveled, see J. Ellis’s blog post: datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
      HBase:
      - Only Tiered
      - Note: many new algorithms and improvements are coming in HBase 0.95, like Stripe Compactions (JIRA 7667): https://issues.apache.org/jira/secure/attachment/12575449/Stripe%20compactions.pdf
  34. 34. Reading after disk failure
      Cassandra:
      - Reads can just be fulfilled from another node natively
      HBase:
      - After a disk failure, the slave machine will read missing data from a remote disk until compaction happens, so region reads can be slow
  35. 35. Data Partitioning
      Cassandra:
      - Supports ordered partitioner and random partitioner
      HBase:
      - Only supports ordered partitioner
      - Row key range scans possible
      - It is possible to externally md5 hash the row key and add the hash to the row key: md5-rowkey
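The md5-rowkey trick from the last bullet can be sketched with Python's standard `hashlib`. The helper name `salted_row_key` and the sample keys are made up for illustration; the only idea taken from the slide is prefixing the row key with its own MD5 hash.

```python
import hashlib


def salted_row_key(row_key):
    """Prefix a row key with its MD5 hex digest, giving 'md5-rowkey'."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return f"{digest}-{row_key}"


# Sequential keys no longer sort next to each other once the hash is prefixed.
for key in ["user-0001", "user-0002", "user-0003"]:
    print(salted_row_key(key))
```

With an ordered partitioner this spreads otherwise adjacent keys across the key space, at the cost of meaningful range scans over the original keys, since the stored order is now the order of the hashes.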
  36. 36. Triggers / Coprocessors
      Cassandra:
      - Under development for C* 2.0, JIRA 1311
      HBase:
      - Supported by Coprocessors (after a get/put/del on a column family, a trigger can be executed)
      - Triggers are coded as Java classes
  37. 37. Compare & Set
      Cassandra:
      - Under development for C* 2.0
      HBase:
      - Supported
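For readers unfamiliar with the term, compare-and-set means "write the new value only if the current value is what I expect, atomically". The sketch below shows the semantics on a single in-process cell using a lock; it says nothing about how either database implements the operation across nodes, and `CasCell` is a hypothetical name.

```python
import threading


class CasCell:
    """Single value with an atomic compare-and-set operation."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._value

    def compare_and_set(self, expected, new):
        # The check and the write happen under one lock, so no other
        # writer can slip in between them.
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


cell = CasCell("v1")
print(cell.compare_and_set("v1", "v2"))  # succeeds: current value matched
print(cell.compare_and_set("v1", "v3"))  # fails: value is now "v2"
```

A caller whose compare-and-set fails typically re-reads the current value and retries, which avoids lost updates without a multi-operation transaction.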
  38. 38. Multi-Datacenter/DR Support
      Cassandra:
      - Very mature and well tested
      - Synchronous or Asynchronous replication to DR
      - Recovery Point Objective (RPO) can be 0
      HBase:
      - Not as robust
      - Only Asynchronous replication to DR
      - Recovery Point Objective (RPO) cannot be 0
  39. 39. Sameer Farooqui
      - Freelance Big Data consultant and trainer
      - Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack
      - Datastax authorized training partner
      - Ex: Hortonworks, Accenture R&D, Symantec