C* Summit 2013: Comparing Architectures: Cassandra vs the Field by Sameer Farooqui

Have you wondered what actually happens when you submit a write to Cassandra? This vendor agnostic technical talk will cover the internals of the read and write paths of Cassandra and compare it to other NoSQL stores, especially HBase so you can pick the right database for your project. Some of the topics mentioned are consistency levels, memtables/memstores, SSTables/HFiles, bloom filters, block indexes, data distribution partitioners and optimal use cases.


Presentation Transcript

  • COMPARING ARCHITECTURES: CASSANDRA VS. THE FIELD
    By Sameer Farooqui (sameer@blueplastic.com)
    blueplastic.com/c.pdf | linkedin.com/in/blueplastic/ | @blueplastic
    http://youtu.be/ziqx2hJY8Hg | #Cassandra13
  • NoSQL Options
    - Key -> Value: Riak, Redis, Memcached DB, Berkeley DB, Hamster DB, Amazon Dynamo, Voldemort, FoundationDB, LevelDB, Tokyo Cabinet
    - Key -> Doc: MongoDB, CouchDB, Terrastore, OrientDB, RavenDB, Elasticsearch
    - Column Family: Cassandra, HBase, Hypertable, Amazon SimpleDB, Accumulo, HPCC, Cloudata
    - Graph: Neo4J, Infinite Graph, OrientDB, FlockDB, Gremlin, Titan
    - ~Real Time: Storm, Impala, Stinger/Tez, Drill, Solr/Lucene
  • Key -> Value
    Key (ID)  Value (Name)
    0001      Winston Smith
    0002      Julia
    0003      O'Brien
    0004      Emmanuel Goldstein
    - Simple API: get, put, delete
    - K/V pairs are stored in containers called buckets
    - Value can also be an object, blob, JSON, XML, etc.
    - Consistency only for a single key
    - Very fast lookups and good scalability (sharding)
    - All access via primary key
    Use cases: content caching, web session info, user profiles, preferences, shopping carts
    Don't use for: querying by data, multi-operation transactions, relationships between data
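The simple get/put/delete API described on this slide can be sketched in a few lines. This is an in-memory stand-in to show the bucket-scoped access pattern, not the client API of any particular store; all names are illustrative.

```python
# Minimal sketch of a key-value store with buckets: all access is by
# primary key, and the value is opaque (string, blob, JSON, ...).
class KVStore:
    def __init__(self):
        self.buckets = {}  # bucket name -> {key: value}

    def put(self, bucket, key, value):
        self.buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key):
        return self.buckets.get(bucket, {}).get(key)

    def delete(self, bucket, key):
        self.buckets.get(bucket, {}).pop(key, None)

store = KVStore()
store.put("users", "0001", "Winston Smith")
store.put("users", "0002", "Julia")
store.delete("users", "0002")
```

Note what is missing from the interface: there is no way to query by value or to span multiple keys in one operation, which is exactly why the slide warns against those use cases.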
  • Key -> Document
    - Like K/V, but the value is examinable
    - Documents: XML, JSON, BSON, etc.
    - Structure of stored docs should be similar, but doesn't need to be identical
    - Tolerant of incomplete data
    Use cases: event logging, content management systems, blogging platforms, web analytics
    Don't use for: complex transactions spanning different operations, strict schema applications
    Key: 0001
    Value: {firstname: "Nuru", lastname: "Abdalla", location: "Uganda", languages: ["English", "Swahili"], mother: "Aziza", father: "Mufa", refugee_camp: "camp-10", picture: "01010110"}
    Key: 0039
    Value: {firstname: "Dee", location: "Uganda", languages: "Swahili", refugee_camp: "camp-54", picture: "01010110"}
  • Graph Databases
    Use cases: connected data (social networks), shortest path, recommendation engines, routing/dispatch/location services (node = location/address)
    Don't use for: workloads that must scale beyond one node; graph databases are not easy to cluster, and queries sometimes have to traverse the entire graph
    (Diagram: nodes for phone numbers 407-666-4012, 407-384-4924, 415-242-9492, 407-336-1193, two +44 numbers, GPS coordinates, and an IMSI #, connected by edges)
  • ~ Real Time: Storm, Impala, Stinger/Tez, Drill, Spark/Shark
    - Distributed, real-time computation systems / stream processing
    - For running a continuous query on data streams and streaming the results to clients (continuous computation)
    - Still emerging; most are in alpha or beta stages
    - Example: counting hash tags (Spout -> Bolt)
  • Column Family
    (Table, Row Key, Col. Family, Column, Timestamp) -> Value (Z)
    (Diagram: Table-Name-Alpha with rows ROW-1..ROW-6 split across Region-1 and Region-2; Col Fam 1 has columns C1-C4, Col Fam 2 has columns A-D, Col Fam 3 has columns 1-4; cell values include X, Y, and two versions of one cell, v1=Z and v2=K)
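The five-part cell address on this slide can be sketched as a map from (table, row key, column family, column) to timestamped versions, with reads returning the newest version. A toy illustration of the data model, not either database's storage code:

```python
# Column-family data model sketch:
# (table, row_key, col_family, column, timestamp) -> value
class ColumnFamilyStore:
    def __init__(self):
        self.cells = {}  # (table, row, cf, col) -> {timestamp: value}

    def put(self, table, row, cf, col, ts, value):
        self.cells.setdefault((table, row, cf, col), {})[ts] = value

    def get(self, table, row, cf, col):
        # A read returns the newest version of the cell
        versions = self.cells.get((table, row, cf, col), {})
        if not versions:
            return None
        return versions[max(versions)]

cfs = ColumnFamilyStore()
cfs.put("Table-Name-Alpha", "ROW-1", "ColFam1", "C1", 1, "Z")  # v1=Z
cfs.put("Table-Name-Alpha", "ROW-1", "ColFam1", "C1", 2, "K")  # v2=K
```

The versioned-cell behavior is what the slide's v1=Z / v2=K annotation depicts: both writes are kept, and the highest timestamp wins on read.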
  • Column Family
    - Know your read + write queries up front
    - Design the data model and system architecture to optimally fulfil those queries
    - Important to understand the architecture fundamentals
  • How to pick a CF database
  • How to pick a CF database: Google Trends
  • How to pick a CF database: Google Trends regional interest
    (Chart residue: one list of top regions reads South Korea, India, USA, China, Russia; the other reads Netherlands, South Korea, Belgium, China, Taiwan)
  • How to pick a CF database
    - Check activity on the Apache user mailing lists:
    Date        Apache Cassandra  Apache HBase
    Jan 2013    739               783
    Feb 2013    714               797
    March 2013  837               692
    April 2013  730               741
    May 2013    567               636
  • Cassandra's lineage: Dynamo (Oct 2007) + BigTable (Nov 2006) -> Cassandra (data model, storage engine)
  • HBase's lineage (Google paper -> Hadoop counterpart):
    - GFS (Oct 2003) -> HDFS
    - MapReduce (Dec 2004) -> MapReduce
    - BigTable (Nov 2006) -> HBase
    - Chubby (Nov 2006) -> ZooKeeper
  • Both Cassandra and HBase:
    - Written in Java
    - Column Family oriented databases
    - Have reached 1,000+ nodes in production
    - Very low latency reads + writes
    - Use Log Structured Merge Trees
    - Atomic at row level
    - No support for joins, transactions, foreign keys
  • Cassandra vs. HBase:
    Cassandra:
    - Peer to peer architecture
    - Tunable consistency
    - Secondary indexes available
    - Writes to ext3/4
    - Conflict resolution handled during reads
    - N-way writes
    - Random and ordered sorting of row keys supported
    HBase:
    - Master / slave architecture
    - Strict consistency
    - No native secondary index support
    - Writes to HDFS
    - Conflict resolution handled during writes
    - Pipelined write
    - Ordered/lexicographical sorting of row keys
  • Amazon.com's Dynamo use cases
    Services that only need primary key access to the data store:
    - Best seller lists
    - Customer preferences
    - Sales rank
    - Product catalog
    - Session management
    No need for:
    - Complex SQL queries
    - Operations spanning multiple data items
    - Shopping cart service must always allow customers to add and remove items
    - If there are 2 conflicting versions of a write, the application should be able to pull both writes and merge them
    - Designed for apps that "need tight control over the tradeoffs between availability, consistency, cost-effectiveness and performance"
  • Google's BigTable use cases
    60 products at Google once used BigTable:
    - Gmail
    - YouTube
    - Google Earth
    - Google Finance
    - Google Analytics
    - Personalized Search
    - Must be able to store the entire web crawl data
    - Relies on GFS for replication and data availability
    - Strong integration with MapReduce
  • Failure detection in Cassandra (diagram: client and a ring of nodes 1-8):
    - Gossip runs every second on a node, to 3 other nodes
    - Used to discover location and state information about other nodes
    - Phi Accrual Failure Detector used to detect failures via a suspicion level on a continuous scale
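The idea behind an accrual detector can be sketched briefly: instead of a binary alive/dead verdict, it outputs a suspicion level that grows the longer a heartbeat is overdue. This toy version assumes exponentially distributed heartbeat intervals, which is a simplification of Cassandra's actual implementation; the function name and threshold are illustrative.

```python
import math

# Toy accrual failure detector: phi = -log10(P(heartbeat arrives later
# than t)), under an exponential model of heartbeat inter-arrival times.
def phi(time_since_last_heartbeat, mean_interval):
    p_arrives_later = math.exp(-time_since_last_heartbeat / mean_interval)
    return -math.log10(p_arrives_later)

# Suspicion grows continuously with silence; a node is marked down only
# when phi crosses a configurable threshold (e.g. phi > 8).
```

The continuous scale is the point: transient network hiccups produce a small phi that decays when the next heartbeat lands, while a truly dead node's phi climbs without bound.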
  • HBase deployment (diagram):
    - Master machines: NameNode, JobTracker, HBase Master, ZooKeeper, plus Standby NameNode and HBase Master Standby
    - Slave machines: DataNode, TaskTracker (Map/Reduce slots), and RegionServer on each node; OS disk plus 2 TB SATA data disks (no RAID for data)
    - Client talks to the cluster
  • Effort to deploy
    Cassandra:
    - One monolithic database install (1 JVM per node) + 1 log file and 1 config file (YAML)
    - No single points of failure, so no standby master nodes
    - Good default settings
    HBase:
    - More complex to deploy (multiple JVMs per node) + many log files and many config files (XMLs)
    - More moving parts: HDFS, HBase, MapReduce, Passive NameNode, Standby HBase Master, ZooKeeper
    - Default settings usually need tweaking
  • Where to write? (HBase)
    - Client asks ZooKeeper for the -ROOT- region location
    - -ROOT- points at the .META. regions (.META.1, .META.2, .META.3), which say which RegionServer (RS a, b, c ...) owns the row
    - Root & Meta locations are cached by the client
    - Client then writes to the owning RegionServer; replication is synchronous via HDFS
    - No control over replication or consistency for each write!
  • Where to write? (Cassandra, diagram: client and a ring of nodes 1-8)
    - Client sends the write to any node, which acts as coordinator
    - Replication Factor = 3, Consistency = 1: the coordinator forwards the write to replicas R1, R2, R3 and acknowledges the client after 1 replica responds
  • Where to write? (Cassandra)
    - Replication Factor = 3, Consistency = 2: same three replicas, but the coordinator waits for 2 acknowledgements
  • Where to write? (Cassandra)
    - Replication Factor = 4, Consistency = 2: four replicas R1-R4, and the coordinator waits for 2 acknowledgements
  • Strong Consistency Costs
    With RF = 3, C = 2:
    - Write to 3 nodes
    - Read from at least 2 nodes to guarantee strong consistency
    With RF = 3, C = 3:
    - Write to 3 nodes
    - Read from only 1 node to guarantee strong consistency
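The arithmetic behind the two configurations above is the standard quorum-overlap rule: with replication factor RF, a read is guaranteed to see the latest write when the write count W and read count R overlap, i.e. W + R > RF. A one-line sketch:

```python
# Quorum-overlap rule: a read sees the latest write when the set of
# nodes written and the set of nodes read must share at least one node.
def strongly_consistent(rf, w, r):
    return w + r > rf

# RF=3, write at consistency 2: must read from at least 2 nodes
assert strongly_consistent(3, 2, 2)
assert not strongly_consistent(3, 2, 1)
# RF=3, write at consistency 3: reading from 1 node suffices
assert strongly_consistent(3, 3, 1)
```

This is the trade-off the slide is pricing: paying more acknowledgements at write time buys cheaper strongly consistent reads, and vice versa.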
  • Log Structured Merge Trees (C* vs HBase)
    (Table, Row Key, Col. Family, Column, Timestamp) -> Value (Z)
    - Cassandra (one JVM per node): a write goes to the CommitLog, then to the in-memory Memtable
  • Log Structured Merge Trees (C* vs HBase)
    - When full, the Memtable is flushed to an immutable SSTable on disk (sorted values Z A B C D)
    - HBase: same pattern with different names; a write goes to the WAL, then the Memstore, and the Memstore is flushed to an HFile
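The write path shared by both databases can be sketched end to end: append to a durable log, update an in-memory sorted structure, and flush it to an immutable sorted file when it grows past a threshold. This is an illustrative toy, not either system's code; the threshold and linear SSTable scan are simplifications.

```python
# LSM write path sketch: commit log (WAL) -> memtable -> SSTable/HFile.
class LSMTree:
    def __init__(self, flush_threshold=4):
        self.commit_log = []       # durable append-only log (WAL)
        self.memtable = {}         # in-memory structure, sorted on flush
        self.sstables = []         # immutable on-disk sorted files
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append for durability
        self.memtable[key] = value             # 2. update memtable
        if len(self.memtable) >= self.flush_threshold:
            self.flush()                       # 3. spill to disk

    def flush(self):
        # Memtable is written out sorted by key, then cleared
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

Because SSTables/HFiles are never modified in place, all random I/O happens in memory and disk writes are sequential, which is where the "very low latency writes" claim earlier comes from.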
  • Flush details
    - On flush, the sorted values (Z A B C D) are written out along with a Bloom Filter and a Block Index
    - Bloom Filters can cover the row key only (R) or row + column (R + C)
    - In HBase, the Bloom Filter and Block Index are stored inside the HFile
    - In C*, there are separate Data, Bloom Filter, and Index files
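The Bloom Filter mentioned above exists so a read can skip most SSTables/HFiles without touching disk: it answers "definitely not here" or "maybe here", with false positives possible but false negatives impossible. A toy version (sizes and hash construction are illustrative):

```python
import hashlib

# Toy Bloom filter: k hash functions set k bits per key; a lookup
# reports "maybe present" only if all k bits are set.
class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, key):
        # Derive k positions by salting the key; md5 used for illustration
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False -> definitely absent (skip this file without a disk seek)
        # True  -> maybe present (false positives possible)
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for row_key in ["ROW-1", "ROW-2", "ROW-3"]:
    bf.add(row_key)
```

On the read path, each on-disk file's filter is checked first; only files whose filter answers "maybe" get an actual Block Index lookup and disk seek.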
  • Flush per Column Family
    Cassandra: supported
    HBase:
    - Flushes all Column Families together
    - Unnecessary flushing puts more network pressure on HBase, since HFiles have to be replicated to 2 other HDFS nodes
    - Flush per CF is under development via JIRA 3149
  • Secondary Indexes
    Cassandra: native support for Secondary Indexes
    HBase:
    - No native Secondary Indexes
    - But a trigger can be launched after a put to keep a secondary index (another CF) up to date and not put the burden on the client
  • SSD Support
    Cassandra:
    - It is possible to place just the SSTables on SSD
    - In the YAML file, set commitlog_directory to spinning disks and data_file_directories to SSD
    - See Rick Branson's talk: youtube.com/watch?v=zQdDi9pdf3I
    HBase:
    - Not possible to tell HDFS to store only the WAL or HFiles on SSD
    - There is some support in the MapR and Intel distributions for this
    - Apache HDFS JIRAs 2832 & 4672 have preliminary discussions
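The Cassandra split described above might look like this in cassandra.yaml; the mount paths are illustrative, while the two setting names are the real ones referenced on the slide.

```yaml
# cassandra.yaml (excerpt): sequential commit-log writes stay on cheap
# spinning disks, while SSTable reads get SSD latency.
commitlog_directory: /mnt/spinning/cassandra/commitlog
data_file_directories:
    - /mnt/ssd/cassandra/data
```

This works because the commit log is append-only (sequential I/O, which spinning disks handle well), while SSTable reads are random I/O that benefits most from SSD.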
  • Compactions
    Cassandra:
    - Tiered and Leveled
    - For leveled, see J. Ellis's blog post: datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
    HBase:
    - Only Tiered
    - Note: many new algorithms and improvements coming in HBase 0.95, like Stripe Compactions (JIRA 7667): https://issues.apache.org/jira/secure/attachment/12575449/Stripe%20compactions.pdf
  • Reading after disk failure
    Cassandra: reads can just be fulfilled from another node natively
    HBase: after a disk failure, the slave machine will read missing data from a remote disk until compaction happens, so region reads can be slow
  • Data Partitioning
    Cassandra: supports ordered partitioner and random partitioner
    HBase:
    - Only supports ordered partitioner
    - Row key range scans possible
    - It is possible to externally md5 hash the row key and add the hash to the row key: md5-rowkey
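The md5-rowkey workaround above can be sketched in one function: prefix each row key with part of its hash so that sequential keys (timestamps, counters) spread across regions instead of hammering one RegionServer. Function name and prefix length are illustrative.

```python
import hashlib

# Salt a row key with an md5 prefix so lexicographic (ordered)
# partitioning distributes sequential keys across regions.
def salted_row_key(row_key, prefix_len=8):
    digest = hashlib.md5(row_key.encode()).hexdigest()
    return f"{digest[:prefix_len]}-{row_key}"
```

The cost, as the slide implies, is that simple row-key range scans no longer work on salted keys: a scan over a logical key range must fan out across all hash prefixes.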
  • Triggers / Coprocessors
    Cassandra: under development for C* 2.0, JIRA 1311
    HBase:
    - Supported via Coprocessors (after a get/put/delete on a column family, a trigger can be executed)
    - Triggers are coded as Java classes
  • Compare & Set
    Cassandra: under development for C* 2.0
    HBase: supported
  • Multi-Datacenter / DR Support
    Cassandra:
    - Very mature and well tested
    - Synchronous or asynchronous replication to DR
    - Recovery Point Objective (RPO) can be 0
    HBase:
    - Not as robust
    - Only asynchronous replication to DR
    - Recovery Point Objective (RPO) cannot be 0
  • Sameer Farooqui (sameer@blueplastic.com)
    blueplastic.com/c.pdf
    - Freelance Big Data consultant and trainer
    - Taught 50+ courses on Hadoop, HBase, Cassandra and OpenStack
    - Datastax authorized training partner
    - Ex: Hortonworks, Accenture R&D, Symantec
    linkedin.com/in/blueplastic/ | @blueplastic
    http://youtu.be/ziqx2hJY8Hg | #Cassandra13