Cassandra talk @JUG Lausanne, 2012.06.14
 

Cassandra talk @JUG Lausanne, 2012.06.14

on

  • 1,535 views

 

Statistics

Views

Total Views
1,535
Views on SlideShare
1,535
Embed Views
0

Actions

Likes
2
Downloads
13
Comments
0

0 Embeds 0

No embeds

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Presentation Transcript

    • Apache Cassandra
      http://cassandra.apache.org
      Benoit Perroud, Software Engineer @Verisign & Apache Committer
      JUG Lausanne, 14.06.2012
    • Agenda
      • NoSQL Quick Overview
      • Apache Cassandra Fundamentals
        – Design principles
        – Data & Query Model
      • Real Life Use Cases
        – Doodle clone
        – Heavy Write Load
        – Bulk Loading (write-once data)
      • Client side implementation
      • Q&A
    • NoSQL
      • [Wikipedia] NoSQL is a term used to designate database management systems that differ from classic relational database management systems (RDBMS) in some way. These data stores may not require fixed table schemas, usually avoid join operations, do not attempt to provide ACID properties, and typically scale horizontally.
      • Pioneers: Google BigTable, Amazon Dynamo, etc.
    • Scalability
      • [Wikipedia] Scalability is a desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner or to be readily enlarged.
      • Scalability in two dimensions:
        – Scale up → scale vertically (increase RAM in an existing node)
        – Scale out → scale horizontally (add a node to the cluster)
      • In summary: handle load and peaks.
    • Availability
      • [Wikipedia] Availability refers to the ability of the users to access and use the system. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
      • In summary: minimize downtime.
    • CAP Theorem
      • Consistency: all nodes see the same data at the same time
      • Availability: node failures do not prevent survivors from continuing to operate
      • Partition tolerance: the system continues to operate despite arbitrary message loss
      • According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.
    • NoSQL Promises
      • Scale horizontally
        – Double computational power or storage by doubling the size of the cluster (tight provisioning)
        – Adding nodes to the cluster in constant time
      • High availability
        – No / few / under-control SPoF
      • On commodity hardware
      • Let's see how Cassandra achieves all of these
    • Apache Cassandra
      • Apache Cassandra could be simplified as a scalable, distributed, sparse and eventually consistent hash map. But it's actually way more.
      • Originally developed by Facebook, hit the ASF incubator early 2008, version 1.0 in 2010
      • Inspired by Amazon Dynamo and Google BigTable
      • Versions at the time of speaking: 1.0.10, 1.1.1
      • Under heavy development by several companies: Datastax, Acunu, Netflix, Twitter, Rackspace, …
    • Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map
      • Gossip protocol (spreading state like a rumor)
      • Consistent hashing
        – Each node is responsible for a key range and replica sets
      • No single point of failure
      • Key space is 128 bits wide (2^128 possible tokens)
      [Diagram: ring covering 100% of the key space with nodes at tokens 0, 12, 25, 37, 50, 62, 75, 87. Explicitly set your node's token! A node without an explicit token takes half of the key range of the most loaded node.]
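      A minimal, non-authoritative sketch of "explicitly set your node's token": with RandomPartitioner the token space is commonly treated as 0 to 2^127, so evenly spaced initial tokens give each node an equal share of the ring (the slide's ring uses a 0–100 scale only for illustration).

        # Evenly spaced initial_token values for an N-node ring (assumed
        # RandomPartitioner token range 0 .. 2**127).
        def balanced_tokens(num_nodes):
            return [i * (2 ** 127 // num_nodes) for i in range(num_nodes)]

        print(balanced_tokens(4))  # [0, 2**125, 2**126, 3 * 2**125]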
    • Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map
      • Schemaless
        – A schema (metadata) may be defined for convenience
        – Column names are stored for every row
      • [Wikipedia] A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
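      A toy Bloom filter sketch (not Cassandra's implementation) to make the definition concrete: k hash functions set k bits per element, membership tests may give false positives but never false negatives, which is why it is safe to skip files the filter says cannot contain a key.

        import hashlib

        class BloomFilter:
            def __init__(self, size=1024, hashes=3):
                self.size, self.hashes, self.bits = size, hashes, 0

            def _positions(self, key):
                for i in range(self.hashes):
                    digest = hashlib.md5(('%d:%s' % (i, key)).encode()).hexdigest()
                    yield int(digest, 16) % self.size

            def add(self, key):
                for pos in self._positions(key):
                    self.bits |= 1 << pos

            def might_contain(self, key):
                return all(self.bits & (1 << pos) for pos in self._positions(key))

        bf = BloomFilter()
        bf.add('key1')
        print(bf.might_contain('key1'), bf.might_contain('other'))  # True, (probably) False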
    • Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map
      • [Wikipedia] A quorum is the minimum number of votes that a distributed transaction has to obtain in order to be allowed to perform an operation in a distributed system. A quorum-based technique is implemented to enforce consistent operation in a distributed system.
      • Quorum: R + W > N
        – N: number of replicas, R: number of nodes read, W: number of nodes written
        – R = 1, W = N
        – R = N, W = 1
        – R = N/2, W = N/2 (+1 if N is even)
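      A minimal sketch of the R + W > N rule: whenever it holds, any read quorum overlaps any write quorum, so at least one replica read has already seen the latest write.

        def is_strongly_consistent(n_replicas, r, w):
            # True when read and write quorums must overlap.
            return r + w > n_replicas

        print(is_strongly_consistent(3, 2, 2))  # QUORUM reads and writes, N=3 -> True
        print(is_strongly_consistent(3, 1, 1))  # ONE / ONE, N=3 -> False (eventual)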
    • Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map
      • Key space [0, 99], previously put(22, 1)
      • Replication factor 2
      • Consistency: ONE
      [Diagram: ring with nodes at 0, 20, 40, 60, 80. The coordinator receives put(22, 2), writes it to the owner of key 22 and asynchronously sends put(22, 2) to the replica.]
    • Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map
      • Key space [0, 49], previously put(13, 1)
      • Replication factor 3
      • Consistency: QUORUM (R = 2, W = 2)
      [Diagram: ring with nodes at 0, 20, 40, 60, 80. put(13, 2, t2) is written to two replicas; a later read returns (2, t2) from one replica and the stale (1, t1) from another, and read repair updates the stale replica.]
    • Apache Cassandra is a scalable, distributed, sparse, eventually consistent hash map
      • Can be seen as a multilevel hash map: a Hash of Hash (of Hash) Map
        – 2 (to 3) levels of keys
        – Let's focus on 2 levels; usage of the 3rd level (SuperColumn) is no longer recommended
      • Keyspace > column family > row > column name = value
        – # use Keyspace1;
        – # set ColumnFamily1[key1][columName1] = value1;
        – # get ColumnFamily1[key1][columName1];
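      The same set/get expressed with pycassa (one of the client libraries listed later in the deck); a hedged sketch that assumes a node on localhost:9160 and that Keyspace1 / ColumnFamily1 from the CLI example already exist.

        import pycassa

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'ColumnFamily1')

        cf.insert('key1', {'columName1': 'value1'})    # set ColumnFamily1[key1][columName1]
        print(cf.get('key1', columns=['columName1']))  # get ColumnFamily1[key1][columName1]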
    • Data Model: Keyspace
      • Equivalent to a database name in the SQL world
      • Defines replication factor and network topology
        – Network topology includes multi-datacenter topologies
        – Replication factor can be defined per datacenter
    • Data Model: Column Family
      • Equivalent to a table name in the SQL world
        – Term may change in upcoming releases to stop confusing users
      • Defines
        – Type of the keys
        – Column name comparator
        – Additional metadata (types of certain known columns)
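      A hedged sketch of defining a keyspace and a column family from pycassa's SystemManager; the names, replication factor and UTF8 types are illustrative only, and a multi-datacenter cluster would rather use NetworkTopologyStrategy with a replication factor per datacenter.

        from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY, UTF8_TYPE

        sys_mgr = SystemManager('localhost:9160')
        sys_mgr.create_keyspace('Keyspace1', SIMPLE_STRATEGY, {'replication_factor': '2'})
        sys_mgr.create_column_family('Keyspace1', 'ColumnFamily1',
                                     comparator_type=UTF8_TYPE,         # column name comparator
                                     key_validation_class=UTF8_TYPE,    # type of the keys
                                     default_validation_class=UTF8_TYPE)
        sys_mgr.close()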
    • Data Model: Row
      • Defined by the key
        – Eventually stored on a node and its replicas
      • Keys can be typed
      • 2 partitioning strategies for the key space
        – Random partitioner:
          • md5(key), evenly distributes keys across nodes
        – Byte-ordered partitioner:
          • Keeps order while iterating through the keys, may lead to hot spots
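      A simplified sketch (not the exact Cassandra code) of what the random partitioner does: the placement token is derived from the md5 digest of the row key, so keys spread evenly over the ring regardless of their natural order, which is also why key-range scans lose ordering.

        import hashlib

        def random_partitioner_token(key):
            # md5 of the key, folded into an assumed 0 .. 2**127 token space.
            return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16) % (2 ** 127)

        print(random_partitioner_token('key1'))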
    • Data Model: Column Name
      • Could be seen as a column in the SQL world
      • Not mandatory to declare them
        – If declared, their corresponding values have types
        – Or a secondary index
      • Ordered (!)
      • Column names are often used as values (!)
      [Diagram: Event1ColumnFamily — row key 24.04.2012 with column names 07:00 and 08:00 holding the values 239 and 255.]
    • Data Model: Value
      • Can be typed, seen as an array of bytes otherwise
      • Existing types include
        – Bytes
        – Strings (ASCII or UTF-8)
        – Integer, Long, Float, Double, Decimal
        – UUID, dates
        – Counters (of long)
      • Values can expire
      • No foreign keys (!)
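      A hedged sketch of an expiring value with pycassa: the ttl argument (seconds) makes the written column disappear after the given time. Keyspace, column family and key names are placeholders.

        import pycassa

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'ColumnFamily1')

        cf.insert('session-42', {'token': 'abc123'}, ttl=3600)  # expires after 1 hour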
    • Query Model
      • 2 interfaces to interact with Cassandra
        – Native API
          • Thrift, CLI
          • Higher-level third-party libraries: Hector, Pycassa, Phpyandra, Astyanax, Helenus
        – CQL (Cassandra Query Language)
    • Query Model
      • Cassandra is more than a key–value store
        – Get
        – Put
        – Delete
        – Update
        – But also various range queries
          • Key range
          • Column range (slice)
        – Secondary indexes
    • Query Model: Get
      • Get a single key
        – Give me key 'a'
      • Get multiple keys
        – Give me keys 'a', 'c', 'd' and 'f'
      [Example table: sparse rows 'a' to 'f' with columns '1' to '5'; rows are ordered according to the partitioner (ByteOrdered here), columns are ordered according to the column name comparator.]
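      A hedged pycassa sketch of the single- and multi-key gets above, reusing the slide's example row keys and assuming the same Keyspace1 / ColumnFamily1 placeholders.

        import pycassa

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'ColumnFamily1')

        row = cf.get('a')                         # give me key 'a'
        rows = cf.multiget(['a', 'c', 'd', 'f'])  # give me keys 'a', 'c', 'd' and 'f'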
    • Query Model: Get Range
      • Range
        – Query for a range of keys
          • Give me all keys between 'a' and 'c'
          • Mind the partitioner
      [Same example table as on the Get slide.]
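      The key-range query as a pycassa sketch; the ordering of returned keys is only meaningful with an order-preserving partitioner, as the slide warns.

        import pycassa

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'ColumnFamily1')

        for key, columns in cf.get_range(start='a', finish='c'):
            print(key, columns)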
    • Query Model: Get Slice
      • Slice
        – Query for a slice of columns
          • For key 'a', give me all columns between '3' and '5'
          • For key 'f', give me all columns between '3' and '5'
      [Same example table as on the Get slide.]
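      A column-slice sketch with pycassa, again against the assumed placeholder column family.

        import pycassa

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'ColumnFamily1')

        cols_a = cf.get('a', column_start='3', column_finish='5')
        cols_f = cf.get('f', column_start='3', column_finish='5')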
    • Query Model: Get Range Slice
      • Range and slice can be combined: rangeSliceQuery
        – For keys between 'b' and 'd', give me columns between '2' and '4'
      [Same example table as on the Get slide.]
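      rangeSliceQuery is Hector's name for this; the same combined range-and-slice query sketched with pycassa:

        import pycassa

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'ColumnFamily1')

        for key, columns in cf.get_range(start='b', finish='d',
                                         column_start='2', column_finish='4'):
            print(key, columns)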
    • Query Model: Secondary Index
      • Secondary index
        – Give me all rows where the value of column '2' is '12'
      [Same example table as on the Get slide.]
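      A hedged sketch of the secondary-index query with pycassa; it assumes a secondary index has been declared on column '2' of the placeholder column family.

        import pycassa
        from pycassa.index import create_index_expression, create_index_clause

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        cf = pycassa.ColumnFamily(pool, 'ColumnFamily1')

        expr = create_index_expression('2', '12')        # column '2' == '12'
        clause = create_index_clause([expr], count=100)
        for key, columns in cf.get_indexed_slices(clause):
            print(key, columns)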
    • cassandra-cli and nodetool
      • ./bin/cassandra-cli -p 9160 -h localhost
      • ./bin/nodetool -p 7199 -h localhost
      • Quick demo!
    • Write path
      1. Write to the commit log
      2. Update the MemTable
      3. Write is acked to the client
      4. If the MemTable reaches a threshold, flush it to disk as an SSTable
      [Diagram: in memory, one MemTable per column family (CF1 … CFn); on disk, the commit log and, per column family, SSTables each made of a Bloom filter, an index and data.]
    • Read path
      • Versions of the same column can exist in several places at the same time
        – In the MemTable
        – In a MemTable being flushed
        – In one or multiple SSTables
      • All versions are read, then resolved / merged using timestamps
        – Bloom filters allow skipping unnecessary files
        – SSTables are indexed
        – Compaction keeps things reasonable
      [Same memory/disk diagram as on the write path slide.]
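      A minimal sketch of the resolve/merge step: among the versions of a column found in the MemTable and the SSTables, the one with the highest timestamp wins.

        def resolve(versions):
            """versions: iterable of (value, timestamp) pairs for one column."""
            return max(versions, key=lambda vt: vt[1])

        print(resolve([('1', 100), ('2', 250), ('2', 180)]))  # -> ('2', 250)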
    • Compaction
      • Runs regularly as a background operation
      • Merges SSTables together
      • Removes expired and deleted values
      • Has an impact on general I/O availability (and thus performance)
        – This is where most of the tuning happens
        – Can be throttled
      • Two types of compaction
        – Size-tiered
          • Low I/O consumption → write-heavy workloads
        – Leveled
          • Guarantees reads from fewer SSTables → read-heavy workloads
      • See http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra for complete details
    • Other Advanced Features
      • Super Columns (no longer recommended)
      • Composite column names
      • Integration with Hadoop
      • Bulk loading
      • Compression
      • Multi-tenancy
    • Real Life Use Case: Doodle Clone
      • Live demo: http://doodle.noisette.ch
      • Naïve data model
        – Polls { id, label, [options], email, limit }
        – Subscribers (super) { polls.id { id, label, [options] } }
      • Id generation
        – TimeUUID is your friend
      • Avoid super column families
        – Use composites, or serialized/encoded subscribers
      • Subscriber.label uniqueness per poll?
        – Cassandra anti-pattern (read-after-write)
      • Limit to n subscribers per option?
        – Cassandra anti-pattern (read-after-write)
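      A hedged sketch of "TimeUUID is your friend": version-1 UUIDs embed a timestamp, so they make unique and roughly time-ordered poll ids. The keyspace and column family names ('DoodleClone', 'Polls') are hypothetical, not the demo's actual schema.

        import uuid
        import pycassa

        poll_id = uuid.uuid1()  # time-based UUID

        pool = pycassa.ConnectionPool('DoodleClone', ['localhost:9160'])
        polls = pycassa.ColumnFamily(pool, 'Polls')
        polls.insert(str(poll_id), {'label': 'Team lunch', 'limit': '10'})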
    • Real Life Use Case: Heavy Writes
      • Cassandra is a really good fit when the read / write ratio is close to 0
        – Event logging / redo logs
        – Time series
      • It's a best practice to write data in its raw format AND in aggregated forms at the same time
      • But it needs compaction tuning
        – {min,max}_compaction_threshold
        – memtable_flush_writers
        – … no magic solution here, only a pragmatic approach
          • Change the configuration on one node, and measure the difference (load, latency, …)
    • Real Life Use Case: Counters
      • Cassandra >= 0.8 (CASSANDRA-1072)
        – create column family counterCF with default_validation_class=CounterColumnType AND key_validation_class=UTF8Type AND comparator=UTF8Type;
        – INC counterCF['key']['columnName'] BY 1;
      • Example: query per entity (number of hits for 'entity1' between 18:30:00 and 19:00:00)
        – counterCF['entity1'][2012-06-14 18:30:00]
        – counterCF['entity1'][2012-06-14 18:30:05]
        – counterCF['entity1'][2012-06-14 18:30:10]
        – …
        – counterCF['entity2'][2012-06-14 18:30:05]
      • Example: query per date range (all entities being hit between 18:30:00 and 19:00:00)
        – counterCF[2012-06-14 18:30:00]['entity1']
        – counterCF[2012-06-14 18:30:00]['entity2']
        – counterCF[2012-06-14 18:30:00]['entity3']
        – …
        – counterCF[2012-06-14 18:30:05]['entity1']
        – ! needs complete date enumeration
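      A hedged sketch of the per-entity layout above with pycassa, whose add() increments counter columns; the keyspace name 'Keyspace1' is a placeholder, counterCF is the column family defined in the CLI snippet.

        import pycassa

        pool = pycassa.ConnectionPool('Keyspace1', ['localhost:9160'])
        counters = pycassa.ColumnFamily(pool, 'counterCF')

        counters.add('entity1', '2012-06-14 18:30:00', 1)   # one hit in this time bucket
        hits = counters.get('entity1',
                            column_start='2012-06-14 18:30:00',
                            column_finish='2012-06-14 19:00:00')
        print(sum(hits.values()))                            # hits for 'entity1' in the range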
    • Real Life Use Case: Bulk Loading
      • Data is transformed (e.g. using MapReduce)
      • Then bulk loaded into the cluster
        – ColumnFamilyOutputFormat (Cassandra 1.0)
          • Not real bulk loading
        – BulkOutputFormat (Cassandra 1.1)
          • SSTables are generated during the transformation, then streamed
      • Prefer the Leveled Compaction Strategy
        – Reduces read latency
        – Size sstable_size_in_mb to your data
    • Conclusion
      • Cassandra is not a general-purpose solution
      • But Cassandra does a really good job if used accordingly
        – Really good scalability
        – Low operational cost
        – Advanced data and query model
    • Thanks for your attention
      • Questions?
      • Next Swiss BigData User Group: July 16 in Zurich
        – More information to come, @SwissScale