SVCCG NoSQL 2011: Sri - Cassandra

Silicon Valley Cloud Computing Group and Netflix: Cassandra talk

Slide notes
  • Typical write operation involves a write into a commit log for durability and recoverability and an update into an in-memory data structure. The write into the in-memory data structure is performed only after a successful write into the commit log. We have a dedicated disk on each machine for the commit log since all writes into the commit log are sequential and so we can maximize disk throughput. When the in-memory data structure crosses a certain threshold, calculated based on data size and number of objects, it dumps itself to disk. This write is performed on one of many commodity disks that machines are equipped with. All writes are sequential to disk and also generate an index for efficient lookup based on row key. These indices are also persisted along with the data file. Over time many such files could exist on disk and a merge process runs in the background to collate the different files into one file. This process is very similar to the compaction process that happens in the Bigtable system.
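
A minimal Python sketch of the write path described in this note, under simplifying assumptions: `commit_log` is an already-open append-mode file, the `Memtable` class and flush threshold are invented names, and JSON lines stand in for the real on-disk formats.

```python
import json

class Memtable:
    """In-memory write buffer (illustrative stand-in, not Cassandra's)."""
    def __init__(self, flush_threshold_bytes=64 * 1024 * 1024):
        self.rows = {}                       # row key -> {column: value}
        self.size = 0
        self.flush_threshold = flush_threshold_bytes

    def put(self, row_key, columns):
        self.rows.setdefault(row_key, {}).update(columns)
        self.size += sum(len(k) + len(v) for k, v in columns.items())

    def should_flush(self):
        return self.size >= self.flush_threshold


def write(commit_log, memtable, row_key, columns):
    # 1. Append to the commit log first: sequential I/O on a dedicated disk,
    #    giving durability and recoverability.
    commit_log.write(json.dumps([row_key, columns]) + "\n")
    commit_log.flush()
    # 2. Only after the log append succeeds, update the in-memory structure.
    memtable.put(row_key, columns)
    # 3. Past a size/object threshold, dump the memtable to a sorted,
    #    indexed data file on one of the data disks.
    if memtable.should_flush():
        flush_to_data_file(memtable, "data-0001.db")


def flush_to_data_file(memtable, path):
    # Sequential write in sorted key order; a real flush also writes a
    # row-key index and bloom filter alongside the data file.
    with open(path, "w") as data_file:
        for key in sorted(memtable.rows):
            data_file.write(json.dumps([key, memtable.rows[key]]) + "\n")
```
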
  • “A typical read operation first queries the in-memory data structure before looking into the files on disk. The files are looked at in the order of newest to oldest. When a disk lookup occurs we could be looking up a key in multiple files on disk. In order to prevent lookups into files that do not contain the key, a bloom filter, summarizing the keys in the file, is also stored in each data file and also kept in memory. This bloom filter is first consulted to check if the key being looked up does indeed exist in the given file. A key in a column family could have many columns. Some special indexing is required to retrieve columns which are further away from the key. In order to prevent scanning of every column on disk we maintain column indices which allow us to jump to the right chunk on disk for column retrieval. As the columns for a given key are being serialized and written out to disk we generate indices at every 256K chunk boundary. This boundary is configurable, but we have found 256K to work well for us in our production workloads.”
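
A matching sketch of this read path; `memtable` is the structure from the write-path sketch above, and the `sstable` objects with a `bloom_filter` and `lookup()` are hypothetical stand-ins for the on-disk files and their indices.

```python
def read(row_key, memtable, sstables):
    """memtable: the in-memory structure from the write-path sketch;
    sstables: on-disk files ordered newest to oldest (hypothetical objects
    with a bloom_filter and a lookup() that uses the persisted indices)."""
    # 1. The in-memory data structure is consulted first.
    columns = memtable.rows.get(row_key)
    if columns is not None:
        return columns
    # 2. Then the files on disk, newest to oldest.
    for sstable in sstables:
        # The per-file bloom filter answers "definitely absent" or "maybe
        # present"; a negative answer skips a pointless disk lookup.
        if not sstable.bloom_filter.might_contain(row_key):
            continue
        columns = sstable.lookup(row_key)    # row-key index, then the column
        if columns is not None:              # index at ~256K chunk boundaries
            return columns
    return None
```
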
  • ConsistencyLevel: the per-request consistency level for reads and writes (e.g. ONE, QUORUM, ALL).
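
ConsistencyLevel is chosen by the client per column family or per request. A hedged example with the pycassa client (which the deck mentions later); the keyspace, column family, and row data are made up, and the calls reflect the 0.x-era pycassa API as best recalled.

```python
import pycassa

# Keyspace, column family, and row data here are invented for illustration.
pool = pycassa.ConnectionPool('DemoKeyspace', server_list=['127.0.0.1:9160'])

users = pycassa.ColumnFamily(
    pool, 'Users',
    write_consistency_level=pycassa.ConsistencyLevel.QUORUM,
    read_consistency_level=pycassa.ConsistencyLevel.QUORUM)

# With RF = 3, QUORUM writes plus QUORUM reads satisfy R + W > N.
users.insert('sri', {'city': 'san jose'})
print(users.get('sri'))

# Consistency can also be relaxed per call, e.g. a fast single-replica read:
print(users.get('sri', read_consistency_level=pycassa.ConsistencyLevel.ONE))
```
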

    1. HowStuffWorks version: Cassandra. SriSatish Ambati, engineer, DataStax (@srisatish)
    2. Bigtable, 2006; Dynamo, 2007; OSS, 2008; Incubator, 2009; TLP, 2010
    3. Cassandra in production: Digital Reasoning (NLP + entity analytics); OpenWave (enterprise messaging); OpenX (largest publisher-side ad network in the world); Cloudkick (performance data & aggregation); SimpleGEO (location-as-API); Ooyala (video analytics and business intelligence); ngmoco (massively multiplayer game worlds)
    4. Furiously fast writes: append-only writes; sequential disk access; no locks in the critical path; key-based atomicity. (Diagram: client issues write → partitioner finds the node among n1/n2/n3 → commit log → apply to memory.)
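
The partitioner/find-node step on slide 4 is, in essence, consistent hashing on a token ring. A toy sketch (not Cassandra's RandomPartitioner; the node names and MD5 token scheme are illustrative only):

```python
import bisect
import hashlib

class ToyRingPartitioner:
    """Toy token ring: each node owns keys up to its token (illustrative only)."""
    def __init__(self, nodes):
        # Derive each node's token by hashing its name; real clusters assign
        # tokens explicitly or via the configured partitioner.
        self.ring = sorted((self._token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def find_node(self, row_key):
        # Walk clockwise to the first node whose token >= hash(key),
        # wrapping around the ring if necessary.
        i = bisect.bisect_left(self.tokens, self._token(row_key)) % len(self.ring)
        return self.ring[i][1]

ring = ToyRingPartitioner(["n1", "n2", "n3"])
print(ring.find_node("user:42"))   # the node that receives this write
```
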
    5. Tuneable reads
    6. Read Internals (@r39132, #netflixcloud)
    7. A feather in the CAP: eventual consistency; R + W > N (N is RF, T is total nodes); ex: RDBMS with backup is R=1, W=2, N=2, T=2. Read performance: R=1 across 100s of nodes; R=1, W=N for consistency. Write performance: W=1, R=N; quorum (fast writes!).
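
A quick check of the R + W > N arithmetic on slide 7, treating N as the replication factor:

```python
def overlap_guaranteed(r, w, n):
    """True when every read quorum must intersect every write quorum."""
    return r + w > n

print(overlap_guaranteed(r=1, w=2, n=2))   # True: the slide's RDBMS-with-backup example

n = 3                                      # three replicas
print(overlap_guaranteed(r=1, w=n, n=n))   # True: write ALL, read ONE
print(overlap_guaranteed(r=n, w=1, n=n))   # True: write ONE, read ALL
q = n // 2 + 1                             # quorum = 2 of 3
print(overlap_guaranteed(r=q, w=q, n=n))   # True: QUORUM reads and writes
print(overlap_guaranteed(r=1, w=1, n=n))   # False: eventual consistency only
```
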
    8. Client Marshal Arts: roll your own (C, Thrift); pycassa, phpcassa; Ruby, Scala; ready-made for Java: Hector, Pelops. Common patterns of doom: death by a million gets; turn off Nagle; manage your connections.
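
On slide 8's "death by a million gets": batching keys into one request is the usual fix. A pycassa illustration (keyspace and column family invented; API as best recalled from the 0.x clients):

```python
import pycassa

pool = pycassa.ConnectionPool('DemoKeyspace', server_list=['127.0.0.1:9160'])
users = pycassa.ColumnFamily(pool, 'Users')
user_ids = ['u1', 'u2', 'u3']

# Anti-pattern: one network round trip per key ("death by a million gets").
rows = {uid: users.get(uid) for uid in user_ids}

# Better: a single multiget batches all the keys into one request.
rows = users.multiget(user_ids)
```
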
    9. Adding nodes: new nodes add themselves to the busiest node and then split its range; the busy node starts transmitting data to the new node; bootstrap logic can be initiated from any node, the CLI, or the web interface.
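
A rough sketch of the range split on slide 9, using "widest token range" as a stand-in for "busiest node" and ignoring the actual data streaming:

```python
def split_busiest_range(ring):
    """ring: dict of node -> (range_start, range_end) token interval.
    Returns (node_to_split, token_for_new_node); illustrative only."""
    # Pick the node owning the widest token range as a stand-in for "busiest".
    busiest = max(ring, key=lambda node: ring[node][1] - ring[node][0])
    start, end = ring[busiest]
    # The new node bisects that range; the busy node keeps the upper half
    # and streams the lower half to the newcomer.
    return busiest, start + (end - start) // 2

ring = {"n1": (0, 100), "n2": (100, 400), "n3": (400, 1000)}
print(split_busiest_range(ring))   # ('n3', 700)
```
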
    10. Cassandra on EC2 cloud
    11. Cassandra on EC2 cloud (*Corey Hulen, EC2)
    12. Inter-node comm: gossip protocol (it's exponential: an epidemic algorithm); failure detector (accrual rate phi); anti-entropy (bringing replicas up to date); UDP for control messages; TCP for request routing.
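
The "accrual rate phi" failure detector on slide 12 reports a continuous suspicion value rather than a binary up/down verdict. A simplified sketch that assumes exponentially distributed heartbeat intervals (the real detector fits the observed arrival history):

```python
import math
import time

class SimplePhiDetector:
    """Simplified phi accrual failure detector (illustrative, not Cassandra's)."""
    def __init__(self):
        self.intervals = []       # observed gaps between heartbeats, in seconds
        self.last_heartbeat = None

    def heartbeat(self, now=None):
        now = now if now is not None else time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        now = now if now is not None else time.time()
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Assuming exponential inter-arrival times, P(next heartbeat arrives
        # later than `elapsed`) = exp(-elapsed / mean); phi is -log10 of that.
        return elapsed / (mean * math.log(10))

# A node is typically treated as down once phi crosses a threshold (e.g. 8).
```
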
    13. Compactions (diagram): several sorted data files, e.g. [K1, K2, K3, …], [K2, K10, K30, …], and [K4, K5, K10, …], are merge-sorted in memory into one sorted data file [K1, K2, K3, K4, K5, K10, K30], with a bloom filter and an index file of key offsets (K1, K5, K30) written alongside it.
    14. Compactions, continued (diagram): the same merge-sort, with the source files marked DELETED once the merged data file, index, and bloom filter have been written.
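
A compact sketch of the merge pictured on slides 13-14, assuming each input file is a list of (key, value) pairs sorted by key, files are ordered newest first, and None marks a deleted entry:

```python
import heapq

def compact(sstables):
    """sstables: list of [(key, value), ...] lists, each sorted by key,
    ordered newest file first. value=None marks a deletion (tombstone)."""
    # Tag every entry with its file's age so, for equal keys, the newest
    # file's entry sorts first in the merged stream.
    tagged = ([(key, age, value) for key, value in table]
              for age, table in enumerate(sstables))
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged):
        if key == last_key:
            continue                       # older copy of a key already decided
        last_key = key
        if value is not None:              # drop deleted entries here; a real
            merged.append((key, value))    # compaction keeps tombstones for a
    return merged                          # grace period before purging them

newest = [(2, "v2-new"), (10, None), (30, "x")]
older  = [(1, "a"), (2, "v2-old"), (3, "c")]
oldest = [(4, "d"), (5, "e"), (10, "f")]
print(compact([newest, older, oldest]))
# [(1, 'a'), (2, 'v2-new'), (3, 'c'), (4, 'd'), (5, 'e'), (30, 'x')]
```
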
    16. Availability in Action (diagram): a ring of nodes A, L, T, W, F, P, Y, U handling a write for key “C”.
    17. Availability in Action (diagram, continued): the same ring with one replica for key “C” marked down (X); the write still succeeds and a hint is stored for the failed node.
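
Slides 16-17 depict hinted handoff: a replica for key “C” is down (the X), the write still succeeds, and a hint is parked for the failed node. A minimal sketch with a made-up Node stub:

```python
class Node:
    """Hypothetical replica stub for this sketch."""
    def __init__(self, name, up=True):
        self.name, self.up, self.data = name, up, {}
    def is_up(self):
        return self.up
    def apply_write(self, key, value):
        self.data[key] = value

def write_with_hints(replicas, key, value, hints):
    """Write to every replica; park a hint for any replica that is down."""
    acked = 0
    for node in replicas:
        if node.is_up():
            node.apply_write(key, value)
            acked += 1
        else:
            # The X in the slide: the node is down, so keep a hint and hand
            # the write off once the failure detector sees it come back.
            hints.append((node, key, value))
    return acked        # the coordinator compares this with the ConsistencyLevel

replicas = [Node("A"), Node("U", up=False), Node("W")]
hints = []
print(write_with_hints(replicas, "C", "value-for-C", hints), len(hints))  # 2 1
```
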
    18. JMX
    19. OpsCenter
    20. OpsCenter
    21. OpsCenter
