Apache Cassandra is a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model.
This presentation, given at FOSDEM in 2010, provides a brief summary of Cassandra's history, a high-level overview of the architecture and data model, and showcases some real-life use cases.
1.
The Cassandra Distributed Database
Eric Evans
eevans@rackspace.com
@jericevans
FOSDEM
February 7, 2010
2.
A prophetess in Troy during the Trojan War. Her predictions were
always true, but never believed.
3.
A massively scalable, decentralized, structured data store (aka
database).
4.
Outline
1 Project History
2 Description
3 Case Studies
4 Roadmap
5.
• 7 new committers added
• Dozens of contributors
• 100+ people on IRC
• Hundreds of closed issues (bugs, features, etc.)
• 3 major releases, 2 point releases
• Graduation to TLP?
6.
Outline
1 Project History
2 Description
3 Case Studies
4 Roadmap
11.
Querying
• get(): retrieve by column name
• multiget(): by column name for a set of keys
• get_slice(): by column name, or a range of names
  • returning columns
  • returning super columns
• multiget_slice(): a subset of columns for a set of keys
• get_count(): number of columns or sub-columns
• get_range_slice(): subset of columns for a range of keys
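To make the shape of these calls concrete, here is a minimal in-memory sketch. It assumes a single column family modeled as nested dictionaries, and the ToyColumnFamily class, its insert() helper, and every parameter name are hypothetical. It does not reproduce the signatures of Cassandra's client API, it ignores super columns, and it simply compares row keys as strings for the range query.

```python
from bisect import bisect_left, bisect_right


class ToyColumnFamily:
    """Toy model: row key -> {column name: value}. Not Cassandra's API."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, column, value):
        # Hypothetical helper used only to load data for the queries below.
        self.rows.setdefault(key, {})[column] = value

    def get(self, key, column):
        # One column of one row, by name (raises KeyError if absent).
        return self.rows[key][column]

    def multiget(self, keys, column):
        # The same named column for a set of keys; missing rows are skipped.
        return {k: self.rows[k][column]
                for k in keys
                if k in self.rows and column in self.rows[k]}

    def get_slice(self, key, start="", finish=None):
        # Columns of one row whose names fall in [start, finish], in name order.
        row = self.rows.get(key, {})
        names = sorted(row)
        lo = bisect_left(names, start)
        hi = len(names) if finish is None else bisect_right(names, finish)
        return [(name, row[name]) for name in names[lo:hi]]

    def multiget_slice(self, keys, start="", finish=None):
        # The same column slice for a set of keys.
        return {k: self.get_slice(k, start, finish) for k in keys}

    def get_count(self, key):
        # Number of columns in one row.
        return len(self.rows.get(key, {}))

    def get_range_slice(self, start_key, finish_key, start="", finish=None):
        # The same column slice for every row key in [start_key, finish_key].
        return {k: self.get_slice(k, start, finish)
                for k in sorted(self.rows)
                if start_key <= k <= finish_key}
```

In this toy, get_slice("alice", "b", "f") returns only the columns of row "alice" whose names sort between "b" and "f".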
18.
About writes...
• No reads
• No seeks
• Sequential disk access
• Atomic within a column family
• Fast
• Any node
• Always writeable (hinted hand-off)
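One way to get "no reads, no seeks, sequential disk access" on the write path is to append each write to a log file and update an in-memory table. The sketch below illustrates that idea only. ToyWritePath, the JSON record layout, and the timestamp rule are assumptions made for the example, not Cassandra's implementation, and replication and hinted hand-off are left out.

```python
import json


class ToyWritePath:
    """Append-only log plus in-memory table; a toy, not Cassandra's code."""

    def __init__(self, log_path):
        # Append mode: every write lands at the end of the file, so the
        # disk sees purely sequential access.
        self.log = open(log_path, "a", encoding="utf-8")
        self.memtable = {}  # key -> {column: (value, timestamp)}

    def write(self, key, column, value, timestamp):
        # 1. Record the write durably. Nothing is read back from disk.
        record = {"key": key, "column": column, "value": value, "ts": timestamp}
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        # 2. Update the in-memory table, keeping the newest timestamp.
        #    This comparison touches memory only.
        row = self.memtable.setdefault(key, {})
        current = row.get(column)
        if current is None or timestamp >= current[1]:
            row[column] = (value, timestamp)

    def close(self):
        self.log.close()
```

The write never inspects existing on-disk data, which is why its cost stays flat as the data set grows.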
20.
About reads...
• Any node
• Read repair
• Usual caching conventions apply
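Read repair can be pictured as follows: a read consults several replicas, returns the newest version, and pushes that version back to any replica that was stale or missing it. The sketch assumes each replica is a plain dictionary mapping a key to a (value, timestamp) pair. The function name and data layout are hypothetical, and the real system's consistency levels and digest reads are not modeled.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and repair stale replicas in place.

    replicas: list of dicts, each mapping key -> (value, timestamp).
    """
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    # The version with the highest timestamp wins.
    newest = max(versions, key=lambda version: version[1])
    # Repair: overwrite any replica that is missing the key or holds
    # something other than the winning version.
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest
    return newest[0]
```

After one such read, all three replicas in a call like read_with_repair([{"k": ("old", 1)}, {"k": ("new", 2)}, {}], "k") hold ("new", 2).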
21.
Outline
1 Project History
2 Description
3 Case Studies
4 Roadmap
22.
Case 1: Digg
Digg is a social news site that allows people to discover and share
content from anywhere on the Internet by submitting stories and
links, and voting and commenting on submitted stories and links.
Ranked 98th by Alexa.com.
24.
Problem
• Terabytes of data; high transaction rate (read-dominated)
• Multiple clusters; heavily sharded
• Management nightmare (high effort, error prone)
• Unsatisfied availability requirements (geographic isolation)
25.
Solution
• Currently in production on "Green Badges"
• Cassandra as primary data store RSN (real soon now)
• Datacenter and rack-aware replication
26.
Case 2: Twitter
Twitter is a social networking and microblogging service that
enables its users to send and read tweets, text-based posts of up to
140 characters.
Ranked 12th by Alexa.com.
28.
MySQL
• Terabytes of data, ~1,000,000 ops/s
• Calls for heavy sharding, light replication
• Schema changes are very difficult (if possible at all)
• Manual sharding is very high effort
• Automated sharding and replication is Hard
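To illustrate why manual sharding is high effort, consider one common scheme: route each key to shard hash(key) mod N. The sketch below is an assumption chosen for illustration, not a description of Twitter's setup, and shard_for and fraction_moved are hypothetical names. It measures how many keys change shards when N changes.

```python
import hashlib


def shard_for(key, shard_count):
    """Route a key to a shard by hashing it and taking the remainder."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def fraction_moved(keys, old_count, new_count):
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(1 for k in keys
                if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)
```

With uniformly distributed hashes, going from 4 to 5 shards leaves a key in place only when both remainders agree, which is about one key in five. The rest must be migrated by hand while the application keeps serving traffic.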
29.
Case 3: Facebook
Facebook is a social networking site where users can create a
profile, add friends, and send them messages. Users can also join
groups organized by location or other points of common interest.
Ranked #2 by Alexa.com.
30.
Inbox Search
• 100 TB
• 160 nodes
• 1/2 billion writes per day (2-year-old number?)
31.
Case 4: Mahalo
Mahalo.com is a web directory and knowledge exchange. It
differentiates itself by tracking and building hand-crafted result
sets for many of the popular search terms.
(It also means "thank you" in Hawaiian.)
32.
MySQL
• Partial deployment; 16 million video records (and growing)
• Writes (and storage) rapidly exceeding single box limitations
• Manageability suffering (clustering is painful)
• Concerns over availability
33.
Outline
1 Project History
2 Description
3 Case Studies
4 Roadmap