Data  Presentations  Cassandra Sigmod
 


SIGMOD presentation by the Cassandra team

    Presentation Transcript

    • Cassandra: Structured Storage System over a P2P Network
      • Avinash Lakshman, Prashant Malik, Karthik Ranganathan
    • Why Cassandra?
      • Lots of data
        • Copies of messages, reverse indices of messages, per user data.
      • Many incoming requests resulting in a lot of random reads and random writes.
      • No existing production-ready solutions in the market meet these requirements.
    • Design Goals
      • High availability
      • Eventual consistency
        • trade off strong consistency in favor of high availability
      • Incremental scalability
      • Optimistic Replication
      • “Knobs” to tune tradeoffs between consistency, durability and latency
      • Low total cost of ownership
      • Minimal administration
    • Cassandra Architecture (diagram components)
      • Messaging Layer
      • Cluster Membership
      • Failure Detector
      • Storage Layer
      • Partitioner
      • Replicator
      • Cassandra API
      • Tools
    • Data Model (diagram)
      • Each KEY maps to a set of column families.
      • ColumnFamily1: Name: MailList, Type: Simple, Sort: Name
        • Name: tid1, Value: <Binary>, TimeStamp: t1
        • Name: tid2, Value: <Binary>, TimeStamp: t2
        • Name: tid3, Value: <Binary>, TimeStamp: t3
        • Name: tid4, Value: <Binary>, TimeStamp: t4
      • ColumnFamily2: Name: WordList, Type: Super, Sort: Time
        • Name: aloha, columns (C1, V1, T1), (C2, V2, T2), (C3, V3, T3), (C4, V4, T4)
        • Name: dude, columns (C2, V2, T2), (C6, V6, T6)
      • ColumnFamily3: Name: System, Type: Super, Sort: Name
        • Name: hint1, <Column List>
        • Name: hint2, <Column List>
        • Name: hint3, <Column List>
        • Name: hint4, <Column List>
      • Column Families are declared upfront.
      • SuperColumns are added and modified dynamically.
      • Columns are added and modified dynamically.
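
The nesting on the Data Model slide can be sketched with plain dictionaries. This is only an illustration: `Column`, `row`, and `insert_column` are hypothetical names, and the real storage format is not shown in the slides.

```python
from collections import namedtuple

# Illustrative stand-in for one column: (name, value, timestamp).
Column = namedtuple("Column", ["name", "value", "timestamp"])

# One row, addressed by a key. Column families are fixed up front;
# columns and super columns inside them can be added at any time.
row = {
    "MailList": {   # simple column family: column name -> Column
        "tid1": Column("tid1", b"...", 1),
        "tid2": Column("tid2", b"...", 2),
    },
    "WordList": {   # super column family: super column name -> columns
        "aloha": {"C1": Column("C1", b"V1", 1)},
    },
}


def insert_column(row, family, column, super_column=None):
    """Add or overwrite a column; only declared families are accepted."""
    if family not in row:
        raise KeyError(f"column family {family!r} was not declared")
    target = row[family]
    if super_column is not None:
        # Super columns are created on first use.
        target = target.setdefault(super_column, {})
    target[column.name] = column


insert_column(row, "MailList", Column("tid3", b"...", 3))
insert_column(row, "WordList", Column("C2", b"V2", 2), super_column="dude")
```

Writing to an undeclared column family raises an error here, which mirrors the "declared upfront" rule; new columns and super columns need no declaration.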
    • Write Operations
      • A client issues a write request to a random node in the Cassandra cluster.
      • The “Partitioner” determines the nodes responsible for the data.
      • Locally, write operations are logged and then applied to an in-memory version.
      • Commit log is stored on a dedicated disk local to the machine.
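
The local part of the write path (log first, then apply in memory) might look like the sketch below. The JSON log format, the `fsync` on every write, and the "newest timestamp wins" rule are assumptions made for the example, not details taken from the slides; `Node` is a hypothetical class.

```python
import json
import os
import tempfile


class Node:
    """Toy local write path: append to a commit log, then update a memtable."""

    def __init__(self, log_path):
        self.log = open(log_path, "a", encoding="utf-8")
        self.memtable = {}  # key -> {column name: (value, timestamp)}

    def write(self, key, column, value, timestamp):
        # 1. Log first and force it to disk, so the write survives a crash.
        record = {"key": key, "column": column, "value": value, "ts": timestamp}
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Then apply to the in-memory version (assumed: newest timestamp wins).
        columns = self.memtable.setdefault(key, {})
        current = columns.get(column)
        if current is None or timestamp >= current[1]:
            columns[column] = (value, timestamp)


log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
node = Node(log_path)
node.write("user42", "tid1", "hello", timestamp=1)
node.write("user42", "tid1", "hello again", timestamp=2)
```

Because the log is append-only, every write costs one sequential disk write plus an in-memory update.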
    • Write cont’d (diagram)
      • A write for Key (CF1, CF2, CF3) is appended to the Commit Log as a binary serialized Key (CF1, CF2, CF3). The Commit Log lives on a Dedicated Disk.
      • The write is then applied to one memtable per column family: Memtable (CF1), Memtable (CF2), Memtable (CF3).
      • Memtables FLUSH to a data file on disk based on:
        • Data size
        • Number of Objects
        • Lifetime
      • Data file on disk: a sequence of entries, each <Key name><Size of key Data><Index of columns/supercolumns><Serialized column family>
      • BLOCK Index: <Key Name> Offset, <Key Name> Offset (K128 Offset, K256 Offset, K384 Offset)
      • Bloom Filter (index in memory)
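
To illustrate how a Bloom filter and a sparse block index can cooperate on a lookup, here is a small sketch. It assumes one index entry per 128 keys (one reading of the K128/K256/K384 labels on the slide), models the data file as an in-memory sorted list, and uses MD5-derived hash positions; `BloomFilter`, `lookup`, and `BLOCK` are hypothetical names.

```python
import bisect
import hashlib


class BloomFilter:
    """Set membership with no false negatives and occasional false positives."""

    def __init__(self, num_bits=8192, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


# A sorted data file, modelled as a list of (key, value) entries.
data_file = [(f"K{i:04d}", f"value{i}") for i in range(1, 513)]

bloom = BloomFilter()
for key, _ in data_file:
    bloom.add(key)

# Sparse block index: the first key of every block and its offset.
BLOCK = 128
block_index = [(data_file[i][0], i) for i in range(0, len(data_file), BLOCK)]
index_keys = [key for key, _ in block_index]


def lookup(key):
    if not bloom.might_contain(key):
        return None  # definitely absent: skip the data file entirely
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first block
    start = block_index[slot][1]
    for k, value in data_file[start:start + BLOCK]:
        if k == key:
            return value
    return None  # Bloom filter false positive
```

The filter answers "definitely not here" from memory, and the index narrows a hit to a single block that has to be scanned.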
    • Compactions (diagram)
      • Input data files, each sorted by key:
        • K1 <Serialized data>, K2 <Serialized data>, K3 <Serialized data>, ...
        • K2 <Serialized data>, K10 <Serialized data>, K30 <Serialized data>, ...
        • K4 <Serialized data>, K5 <Serialized data>, K10 <Serialized data>, ...
      • MERGE SORT produces one sorted Data File: K1, K2, K3, K4, K5, K10, K30 (each with <Serialized data>)
      • Index File: K1 Offset, K5 Offset, K30 Offset
      • Bloom Filter: loaded in memory
      • The input files are marked DELETED.
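
The merge step of a compaction can be sketched with a k-way merge over the sorted inputs. The example assumes each entry is a `(key, timestamp, value)` tuple and that, when a key appears in several files, the entry with the newest timestamp is kept; the slide does not spell out the conflict rule, and `compact` is a hypothetical function name.

```python
import heapq


def compact(*sstables):
    """Merge sorted runs of (key, timestamp, value); keep the newest per key."""
    merged = []
    # heapq.merge yields entries ordered by key, then by timestamp.
    for key, timestamp, value in heapq.merge(*sstables):
        if merged and merged[-1][0] == key:
            # Same key seen again: the later entry has the newer timestamp.
            merged[-1] = (key, timestamp, value)
        else:
            merged.append((key, timestamp, value))
    return merged


# Keys mirror the slide (K1, K2, K3 / K2, K10, K30 / K4, K5, K10).
sstable1 = [(1, 100, "a1"), (2, 100, "b1"), (3, 100, "c1")]
sstable2 = [(2, 200, "b2"), (10, 200, "j2"), (30, 200, "z2")]
sstable3 = [(4, 300, "d3"), (5, 300, "e3"), (10, 300, "j3")]

compacted = compact(sstable1, sstable2, sstable3)
```

Since every input is already sorted, the merge reads each file once, front to back, so the disk access stays sequential.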
    • Write Properties
      • No locks in the critical path
      • Sequential disk access
      • Behaves like a write-through cache
      • Append support without read ahead
      • Atomicity guarantee for a key
      • “Always Writable”
        • accept writes during failure scenarios
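    • Read (diagram)
      • The Client sends a Query to the Cassandra Cluster.
      • The closest replica (Replica A) returns the Result.
      • Replica B and Replica C each receive a Digest Query and return a Digest Response.
      • The Result is returned to the Client.
      • Read repair if digests differ.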
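
The read flow above (full result from the closest replica, digests from the others, repair on mismatch) can be sketched as follows. The replicas are in-process objects, the digest is an MD5 of the stored value, and conflicts are resolved by picking the newest timestamp; all three are assumptions for the example, and `Replica` and `read_with_repair` are hypothetical names.

```python
import hashlib


def digest(entry):
    """Short fingerprint of a stored (value, timestamp) entry, or of None."""
    return hashlib.md5(repr(entry).encode()).hexdigest()


class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (value, timestamp)

    def read(self, key):
        return self.store.get(key)

    def read_digest(self, key):
        return digest(self.store.get(key))

    def write(self, key, value, timestamp):
        current = self.store.get(key)
        if current is None or timestamp > current[1]:
            self.store[key] = (value, timestamp)


def read_with_repair(key, closest, others):
    result = closest.read(key)
    mismatch = any(r.read_digest(key) != digest(result) for r in others)
    if mismatch:
        # Digests differ: fetch full entries, pick the newest, push it everywhere.
        replicas = [closest] + others
        entries = [e for e in (r.read(key) for r in replicas) if e is not None]
        result = max(entries, key=lambda entry: entry[1])
        for r in replicas:
            r.write(key, *result)
    return result


a, b, c = Replica("A"), Replica("B"), Replica("C")
a.write("k", "old", 1)
b.write("k", "new", 2)
c.write("k", "old", 1)
latest = read_with_repair("k", a, [b, c])
```

Sending digests instead of full values keeps the common case cheap: only one replica ships the data unless the replicas disagree.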
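    • Partitioning And Replication (diagram)
      • Nodes A, B, C, D, E, F are placed on a ring covering the hash space from 0 to 1 (with 1/2 marked halfway).
      • h(key1) and h(key2) map keys to positions on the ring.
      • Replication factor N=3.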
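
A ring like the one on the slide can be sketched with consistent hashing. The example assumes node positions come from hashing the node name with MD5 and that a key is stored on the first N nodes found clockwise from h(key); the slide gives neither detail, and `h`, `Ring`, and `replicas_for` are hypothetical names.

```python
import bisect
import hashlib


def h(value):
    """Map a string to a position on the ring [0, 1)."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Ring:
    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.tokens = sorted((h(node), node) for node in nodes)
        self.positions = [position for position, _ in self.tokens]

    def replicas_for(self, key):
        """Return the N nodes responsible for key, walking clockwise."""
        start = bisect.bisect_left(self.positions, h(key))
        count = min(self.replicas, len(self.tokens))
        return [
            self.tokens[(start + i) % len(self.tokens)][1]
            for i in range(count)
        ]


ring = Ring(["A", "B", "C", "D", "E", "F"], replicas=3)
owners = ring.replicas_for("key1")
```

Adding or removing a node only changes ownership for keys between that node and its ring neighbours, which is what makes the scheme incrementally scalable.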
    • Cluster Membership and Failure Detection
      • Gossip protocol is used for cluster membership.
      • Super lightweight with mathematically provable properties.
      • State disseminated in O(log N) rounds where N is the number of nodes in the cluster.
      • Every T seconds each member increments its heartbeat counter and selects one other member to send its list to.
      • A member merges the list with its own list.
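
The heartbeat exchange described above can be simulated in a few lines. This sketch assumes each member keeps a map from member name to the highest heartbeat seen, picks its gossip target uniformly at random, and merges by taking the larger counter; `Member`, `tick`, and `merge` are hypothetical names, and the round cap is only a safety limit for the simulation.

```python
import random


class Member:
    def __init__(self, name):
        self.name = name
        self.heartbeats = {name: 0}  # member name -> highest heartbeat seen

    def tick(self, peers, rng):
        """One gossip round: bump own heartbeat, send the list to one peer."""
        self.heartbeats[self.name] += 1
        target = rng.choice([p for p in peers if p is not self])
        target.merge(self.heartbeats)

    def merge(self, remote):
        # Keep the freshest heartbeat known for every member.
        for name, beat in remote.items():
            if beat > self.heartbeats.get(name, -1):
                self.heartbeats[name] = beat


rng = random.Random(0)
members = [Member(f"node{i}") for i in range(8)]
rounds = 0
# Gossip until every member has heard of every other member.
while rounds < 1000 and any(len(m.heartbeats) < len(members) for m in members):
    for member in members:
        member.tick(members, rng)
    rounds += 1
```

Each round, every member that already knows a piece of state passes it on, so knowledge spreads roughly exponentially across the cluster.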
    • Accrual Failure Detector
      • Valuable for system management, replication, load balancing etc.
      • Defined as a failure detector that outputs a value, PHI, associated with each process.
      • Also known as Adaptive Failure detectors - designed to adapt to changing network conditions.
      • The value output, PHI, represents a suspicion level.
      • Applications set an appropriate threshold, trigger suspicions and perform appropriate actions.
      • In Cassandra the average time taken to detect a failure is 10-15 seconds with the PHI threshold set at 5.
    • Properties of the Failure Detector
      • If a process p is faulty, the suspicion level Φ(t) → ∞ as t → ∞.
      • If a process p is faulty, there is a time after which Φ(t) is monotonically increasing.
      • A process p is correct ⇔ Φ(t) has an upper bound over an infinite execution.
      • If process p is correct, then for any time T, Φ(t) = 0 for some t >= T.
    • Implementation
      • PHI estimation is done in three phases
        • Inter-arrival times for each member are stored in a sampling window.
        • The distribution of these inter-arrival times is estimated; gossip inter-arrival times follow an exponential distribution.
        • The value of PHI is then computed as follows:
          • Φ(t) = -log10( P(t_now - t_last) )
              • where P(t) denotes the probability that a heartbeat will arrive more than t units after the previous one. For an exponential distribution with rate λ and CDF F(t) = 1 - e^(-λt), P(t) = 1 - F(t) = e^(-λt).
      • The overall mechanism is described in the figure below.
    • Information Flow in the Implementation (figure)
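
The three phases of the PHI estimation can be sketched as below. The sketch assumes P(t) is the probability that no heartbeat has arrived t units after the previous one, e^(-λt), with λ estimated as the inverse of the mean inter-arrival time in the window; under that assumption -log10(P(t)) reduces to t / (mean · ln 10). `AccrualFailureDetector` and its method names are hypothetical.

```python
import math
from collections import deque


class AccrualFailureDetector:
    def __init__(self, window_size=1000):
        self.intervals = deque(maxlen=window_size)  # sampling window
        self.last_arrival = None

    def heartbeat(self, now):
        """Phase 1: record the inter-arrival time of a heartbeat."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Phases 2 and 3: fit an exponential distribution, compute PHI."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)  # estimate of 1/lambda
        elapsed = now - self.last_arrival
        # P(t) = e^(-t/mean), so -log10(P(t)) = t / (mean * ln 10).
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in range(6):            # heartbeats arrive at t = 0, 1, ..., 5
    detector.heartbeat(float(t))

suspected = detector.phi(20.0) > 5   # compare against a PHI threshold of 5
```

PHI is zero right after a heartbeat and grows linearly with the silence that follows, so a higher threshold trades slower detection for fewer false suspicions.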
    • Performance Benchmark
      • Random and sequential writes - limited by network bandwidth.
      • Read performance for Inbox Search in production:
                  Search Interactions   Term Search
        Min       7.69 ms               7.78 ms
        Median    15.69 ms              18.27 ms
        Average   26.13 ms              44.41 ms
    • Lessons Learnt
      • Add fancy features only when absolutely required.
      • Many types of failures are possible.
      • Big systems need proper systems-level monitoring.
      • Value simple designs
    • Future work
      • Atomicity guarantees across multiple keys
      • Distributed transactions
      • Compression support
      • Granular security via ACLs
    • Questions?