The Apache Cassandra storage engine
Sylvain Lebresne (sylvain@ .com)
FOSDEM ’12, Brussels

    1. The Apache Cassandra storage engine. Sylvain Lebresne (sylvain@ .com). FOSDEM ’12, Brussels
    2. (1) What is Apache Cassandra, (2) Data Model, (3) The storage engine
    4. about:project
        • Distributed data store aimed at big data
        • Apache project since 2010
        • Version 1.0 released last October
        • Proven in production (Netflix, Twitter, Reddit, Cisco, ...). Largest known cluster holds over 300TB across more than 400 machines.
    14. Apache Cassandra
        A database:
        • distributed / decentralized
        • replicated & durable
        • scalable / elastic
        • fault-tolerant / no SPOF
        • highly available
        • data center aware (diagram labels: US, Europe)
    15. (1) What is Apache Cassandra, (2) Data Model, (3) The storage engine
    16. Data Model
        • Not SQL (no transactions, no joins) but more than key/value
        • Inspired by Google BigTable
        • Based on column families
    18. Ex: user profiles
        “For each user, holds profile infos”
        Column family: Users
        Row 50e8-e29b: birth_year 1994 | fname Justin | lname Bieber
        Row 2ab1-f1b7: birth_year 1978 | email a@kutcher.com | fname Ashton | lname Kutcher
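The two user rows above can be modelled as a map from row key to a set of named columns, where each row carries only the columns it needs (only the second row has an `email`). The following is a minimal Python sketch of that idea; the `ColumnFamily` class and its methods are hypothetical illustrations, not Cassandra's client API, and returning columns sorted by name is an assumption made for readability.

```python
class ColumnFamily:
    """Hypothetical toy model: row key -> {column name: value}."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, columns):
        # Rows are sparse: each row stores only the columns it was given.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        # Return the row's columns sorted by column name.
        return sorted(self.rows.get(row_key, {}).items())


users = ColumnFamily("Users")
users.insert("50e8-e29b", {"birth_year": 1994, "fname": "Justin", "lname": "Bieber"})
users.insert("2ab1-f1b7", {"birth_year": 1978, "email": "a@kutcher.com",
                           "fname": "Ashton", "lname": "Kutcher"})

for key in ("50e8-e29b", "2ab1-f1b7"):
    print(key, users.get(key))
```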
    23. Ex: user’s Tweets
        “For each user, tweets he has made”
        Column family: Timeline, row 50e8-e29b (one column per tweet, newest on top)
        3 | still a little tired. back in the studio today with Timbaland
        2 | @MickyArison @miamiHEAT thanks for the gam tonight
        1 | @NickDeMoura happy bday my dude.
        0 | @LiveLoveKary glad you had a good birthday #muchlove
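The timeline example is a "wide row": one row per user, with one new column per tweet. A small Python sketch of that pattern follows. The helper names `add_tweet` and `latest` are hypothetical, and using a running sequence number as the column name simply mirrors the 0 to 3 labels on the slide.

```python
def add_tweet(row, text):
    """Append a tweet as a new column named by the next sequence number."""
    seq = len(row)
    row[seq] = text
    return seq


def latest(row, count):
    """Return the `count` most recent tweets, newest first."""
    newest_first = sorted(row, reverse=True)
    return [row[name] for name in newest_first[:count]]


timeline = {}  # the row for user 50e8-e29b: column name -> tweet text
add_tweet(timeline, "@LiveLoveKary glad you had a good birthday #muchlove")
add_tweet(timeline, "@NickDeMoura happy bday my dude.")
add_tweet(timeline, "still a little tired. back in the studio today with Timbaland")

print(latest(timeline, 2))
```

Because the columns within a row are ordered, reading "the latest N tweets" is a slice of one row rather than a scan over many rows.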
    24. There’s more
        • Secondary indexes
        • Distributed counters
        • Composite columns
    25. (1) What is Apache Cassandra, (2) Data Model, (3) The storage engine
    26. Goal
        • Writes are harder than reads to scale
        • Spinning disks aren’t good with random I/O
        • Goal: minimize random I/O
    27. A write’s journal: write( k1 , c1:v1 )
        Memory: Memtable (empty)
        Hard drive: Commit log (empty)
    28. A write’s journal: write( k1 , c1:v1 )
        Memory (Memtable): k1 c1:v1
        Hard drive (Commit log): k1 c1:v1
    29. A write’s journal: ack
        Memory (Memtable): k1 c1:v1
        Hard drive (Commit log): k1 c1:v1
    30. A write’s journal: write( k1 , c2:v2 )
        Memory (Memtable): k1 c1:v1 c2:v2
        Hard drive (Commit log): k1 c1:v1 | k1 c2:v2
    31. A write’s journal: write( k2 , c1:v1 c2:v2 )
        Memory (Memtable): k1 c1:v1 c2:v2 | k2 c1:v1 c2:v2
        Hard drive (Commit log): k1 c1:v1 | k1 c2:v2 | k2 c1:v1 c2:v2
    32. A write’s journal: write( k1 , c1:v4 c3:v3 )
        Memory (Memtable): k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
        Hard drive (Commit log): k1 c1:v1 | k1 c2:v2 | k2 c1:v1 c2:v2 | k1 c1:v4 c3:v3
    33. A write’s journal: flush, cleanup
        Memory: (memtable flushed)
        Hard drive: SSTable + index: k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
    34. A write’s journal: more updates
        Memory (Memtable): k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
        Hard drive (Commit log): k2 c1:v2 c3:v3 | k1 c1:v5 c4:v4
        Hard drive (SSTable + index): k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
    35. A write’s journal: flush
        Memory: (memtable flushed)
        Hard drive (SSTable + index): k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
        Hard drive (SSTable + index): k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
    36. Writes properties
        • No reads or seeks
        • Only sequential I/O
        • Immutable SSTables: easy snapshots
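The write sequence on slides 27 to 35 can be replayed with a toy model: append to the commit log, update the memtable, acknowledge, and later flush the memtable to an immutable, sorted SSTable. This is a rough Python sketch under stated assumptions. The `WritePath` class and its attributes are hypothetical, in-memory lists stand in for files on disk, and the "cleanup" step is taken to mean discarding the commit log entries that the flush made redundant.

```python
class WritePath:
    """Toy model of the write path on the slides; all names are hypothetical."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the on-disk log
        self.memtable = {}    # row key -> {column: value}, lives in memory
        self.sstables = []    # immutable, sorted snapshots of flushed memtables

    def write(self, key, columns):
        # 1. Sequential append to the commit log (durability).
        self.commit_log.append((key, dict(columns)))
        # 2. In-memory update; no read of existing on-disk data is needed.
        self.memtable.setdefault(key, {}).update(columns)
        return "ack"

    def flush(self):
        # Write the memtable out once, sorted by row key and column name.
        sstable = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(sstable)
        self.memtable = {}
        self.commit_log.clear()  # assumed meaning of "cleanup"


node = WritePath()
node.write("k1", {"c1": "v1"})
node.write("k1", {"c2": "v2"})
node.write("k2", {"c1": "v1", "c2": "v2"})
node.write("k1", {"c1": "v4", "c3": "v3"})
node.flush()
print(node.sstables[0])
```

Tuples are used for the flushed data so that the sketch reflects the "immutable SSTables" property: a flushed table is never modified, only replaced later by compaction.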
    37. A read’s journal: read( k1 )
        Memory: ?
        Hard drive (SSTable + index): k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
        Hard drive (SSTable + index): k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
    38. A read’s journal: merge
        Result in memory: k1 c1:v5 c2:v2 c3:v3 c4:v4
        Hard drive (SSTable + index): k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
        Hard drive (SSTable + index): k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
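The read on slides 37 and 38 has to combine the fragments of row k1 that are spread over both SSTables (and the memtable, if it holds anything). A minimal Python sketch of that merge is below. The `read` function is hypothetical, and it assumes that data from a newer source simply overrides older data; Cassandra itself reconciles columns by their write timestamps, which the slides do not show.

```python
def read(key, memtable, sstables):
    """Merge one row from the SSTables (oldest first), then the memtable."""
    merged = {}
    for sstable in sstables:  # oldest to newest
        merged.update(sstable.get(key, {}))
    merged.update(memtable.get(key, {}))  # the memtable holds the newest data
    return dict(sorted(merged.items()))


older = {"k1": {"c1": "v4", "c2": "v2", "c3": "v3"}, "k2": {"c1": "v1", "c2": "v2"}}
newer = {"k1": {"c1": "v5", "c4": "v4"}, "k2": {"c1": "v2", "c3": "v3"}}

print(read("k1", memtable={}, sstables=[older, newer]))
```

Every extra SSTable is one more place a read may have to look, which is the cost that compaction and the read-side optimizations are meant to reduce.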
    41. Compaction
        • Goal: keep the number of SSTables low
        • Merge sort across multiple SSTables
        • Sequential I/O
        Input SSTable (+ index): k1 c1:v4 c2:v2 c3:v3 | k2 c1:v1 c2:v2
        Input SSTable (+ index): k1 c1:v5 c4:v4 | k2 c1:v2 c3:v3
        Output SSTable (+ index): k1 c1:v5 c2:v2 c3:v3 c4:v4 | k2 c1:v2 c2:v2 c3:v3
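Since every SSTable is already sorted by row key, compaction can stream through its inputs in a single merge pass. The Python sketch below illustrates this with `heapq.merge` from the standard library. The `compact` function and the list-of-tuples layout are hypothetical, and "the newer SSTable wins" is again an assumption standing in for timestamp-based reconciliation.

```python
import heapq
from itertools import groupby


def compact(sstables):
    """Merge sorted SSTables (oldest first) into one sorted SSTable.

    Each SSTable is a list of (row key, {column: value}) sorted by row key.
    """
    # Tag every row with the age of its SSTable so that, for equal keys,
    # rows come out of the merge oldest first.
    tagged = (
        [(key, age, cols) for key, cols in table]
        for age, table in enumerate(sstables)
    )
    merged = heapq.merge(*tagged, key=lambda row: (row[0], row[1]))

    output = []
    for key, rows in groupby(merged, key=lambda row: row[0]):
        columns = {}
        for _, _, cols in rows:
            columns.update(cols)  # newer values overwrite older ones
        output.append((key, dict(sorted(columns.items()))))
    return output


sstable_1 = [("k1", {"c1": "v4", "c2": "v2", "c3": "v3"}), ("k2", {"c1": "v1", "c2": "v2"})]
sstable_2 = [("k1", {"c1": "v5", "c4": "v4"}), ("k2", {"c1": "v2", "c3": "v3"})]

for row in compact([sstable_1, sstable_2]):
    print(row)
```

The merge reads each input front to back and writes the output front to back, which is why the slide can list compaction as sequential I/O.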
    42. Optimizations
        • Row cache
        • Bloom filters: eliminate whole SSTables from a read
        • Key cache
        • Row and column indexes
        • ...
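A Bloom filter lets a read skip an SSTable that certainly does not contain the requested key: the filter may answer "maybe present" by mistake, but never "absent" for a key that was added. Here is a small self-contained Python sketch of the idea. The `BloomFilter` class, its default sizes, and the use of SHA-256 are illustrative assumptions and say nothing about how Cassandra implements or tunes its filters.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter; sizes and hash choice are arbitrary."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


sstable_keys = BloomFilter()
for key in ("k1", "k2"):
    sstable_keys.add(key)

if sstable_keys.might_contain("k1"):
    print("k1 may be in this SSTable: read it")
```

One such filter per SSTable, kept in memory, would let a read consult only the SSTables that might hold the key.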
    43. Other features
        • Compression
        • Checksums
        • Time to live
    44. Questions?
    45. • Cassandra 1.1 scheduled for next month
        • http://cassandra.apache.org/
        • http://wiki.apache.org/cassandra/
        • http://www.datastax.com/docs/1.0
    46. Data Model
        Keyspace name ➝ Column Family name ➝ Row key ➝ Column name ➝ Value
        • Keyspaces (1 per app)
        • Column Families (10’s ➝ 100’s)
        • Rows (∞)
        • Columns (up to 2B)
    47. Leveled Compaction (diagram of levels L0, L1, L2, L3)
