Hadoop and Cassandra at Rackspace
Transcript

  • Making Massive Manageable: Hadoop and Cassandra (at Rackspace)
      • Big Data Workshop
      • Stu Hood (@stuhood) – Technical Lead, Rackspace
      • April 23rd 2010
  • My, what a large dataset you have...
      • Processing 3 TB/day of logs
      • Using Hadoop/Pig
      • And the sticking points?
          • “How fast can we provision machines?”
          • “How do we get data on/off the cluster?”
          • “How do we add structure?”
  • MapReduce
      • Distributed processing methodology
          • Adapt a problem to MapReduce
          • Scale forever
          • Crunch almost anything
      • Typically adding structure to unstructured data
          • Logs
      • Also great for structured
          • Graph processing
          • Machine learning
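A minimal sketch of "adding structure to unstructured data" in the MapReduce style, written as plain single-process Python rather than against the Hadoop API. The log line format, the function names, and the choice of counting HTTP status codes are all assumptions made for illustration; a real job would distribute the map and reduce phases across a cluster.

```python
from collections import defaultdict

# Hypothetical log lines; the four-field format is an assumption.
LOG_LINES = [
    "10.0.0.1 GET /index.html 200",
    "10.0.0.2 GET /missing 404",
    "10.0.0.1 POST /login 200",
    "not a log line",
]


def map_line(line):
    """Map: turn one unstructured line into zero or more (key, value) pairs."""
    fields = line.split()
    if len(fields) == 4 and fields[3].isdigit():
        yield fields[3], 1  # key on the status code
    # Malformed lines yield nothing and are dropped.


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


status_counts = run_job(LOG_LINES)
print(status_counts)
```

Because each call to `map_line` and `reduce_counts` depends only on its own input, the two phases can be spread over many machines, which is the property the "scale forever" bullet relies on.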
  • “You want to use how many clients?”
      • Need to store structured inputs/outputs
      • Solution needs to
          • Support arbitrary number of clients
          • Preferably provide locality
          • Possibly provide 'web' latency
  • Solutions of varying quality
      • Sharding the RDBMS
          • shard n. - A horizontal partition in a database
              • Example: Sharding by userid
          • Provided by ORM?
              • Fixed partitions: manual rebalancing
          • Developing from scratch?
              • Adding/removing nodes
              • Handling failover
              • As a library? As a middle tier?
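A minimal sketch of why fixed partitions make adding nodes painful. It assumes the simplest possible scheme, sharding by `userid` modulo the shard count; the slides do not say which scheme any particular ORM uses, and the function names are hypothetical.

```python
def shard_for(user_id, num_shards):
    """Pick a shard by user id modulo the number of shards (assumed scheme)."""
    return user_id % num_shards


def moved_fraction(user_ids, old_shards, new_shards):
    """Fraction of users whose shard changes when the shard count changes."""
    user_ids = list(user_ids)
    moved = sum(
        1
        for user_id in user_ids
        if shard_for(user_id, old_shards) != shard_for(user_id, new_shards)
    )
    return moved / len(user_ids)


# Growing from 4 to 5 shards relocates 80% of these 10,000 users.
fraction = moved_fraction(range(10_000), old_shards=4, new_shards=5)
print(f"{fraction:.0%} of users must be moved")
```

Under this assumed scheme, adding a single shard forces most rows to move, which is the manual rebalancing cost the slide points at.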
  • Solutions of varying quality
      • Leaving data in Hadoop
          • Storage in Map/SequenceFile
              • Serialized with Thrift/Avro/Protocol Buffers
          • No random access
          • High latency
  • Solutions of varying quality
      • Storing in HBase/Hypertable
          • Column stores implemented on Hadoop
              • Modeled after Google's Bigtable
          • Multiple points of failure
              • Namenode
              • Master
          • High (almost non-web) latency
  • And the newest contender...
  • Standing on the shoulders of: Amazon Dynamo
      • No node in the cluster is special
          • No special roles
          • No scaling bottlenecks
          • No single point of failure
      • Techniques
          • Gossip
          • Eventual consistency
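A toy simulation of gossip leading to eventual consistency, assuming a simple push-pull exchange in which the highest version of each key wins. This is not the Dynamo or Cassandra gossip protocol; the node names, the `(version, value)` state layout, and the fixed random seed are assumptions chosen to keep the sketch small and repeatable.

```python
import random


def merge(local, remote):
    """Union of two state maps, keeping the higher version for each key."""
    merged = dict(local)
    for key, (version, value) in remote.items():
        if key not in merged or merged[key][0] < version:
            merged[key] = (version, value)
    return merged


def gossip_round(states, rng):
    """Every node contacts one random peer and both adopt the merged state."""
    nodes = list(states)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = merge(states[node], states[peer])
        states[node] = merged
        states[peer] = dict(merged)


def rounds_to_converge(num_nodes=8, seed=42, max_rounds=100):
    """Count rounds until every node holds the same state, or return None."""
    rng = random.Random(seed)
    states = {f"node{i}": {} for i in range(num_nodes)}
    # Only node0 knows this fact at first; no node plays a coordinator role.
    states["node0"]["status:node0"] = (1, "UP")
    for round_number in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(state == states["node0"] for state in states.values()):
            return round_number
    return None


rounds = rounds_to_converge()
print("converged after", rounds, "rounds")
```

Every node runs the same code and picks peers at random, so there is no special role to fail. The cost is that nodes may briefly disagree until the exchanges catch up.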
  • Standing on the shoulders of: Google Bigtable
      • “Column family” data model
      • Range queries for rows:
          • Scan rows in order
      • Memtable/SSTable structure
          • Always writes sequentially to disk
          • Bloom filters to minimize random reads
          • Trounces B-Trees for big data
              • Linear insert performance
              • Logarithmic growth for reads
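A minimal in-memory sketch of the memtable/SSTable idea with a Bloom filter in front of each table. The class names, the flush threshold, and the filter parameters are assumptions for illustration. Real implementations write SSTables to disk, compact them, and handle deletes, none of which is shown here.

```python
import bisect
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


class SSTable:
    """Immutable run of rows, written once in sorted key order."""

    def __init__(self, items):
        rows = sorted(items)
        self.keys = [key for key, _ in rows]
        self.values = [value for _, value in rows]
        self.filter = BloomFilter()
        for key in self.keys:
            self.filter.add(key)

    def get(self, key):
        if not self.filter.might_contain(key):
            return None  # definitely absent: skip the lookup entirely
        index = bisect.bisect_left(self.keys, key)  # binary search
        if index < len(self.keys) and self.keys[index] == key:
            return self.values[index]
        return None


class Store:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sequential write of a sorted, immutable table.
            self.sstables.append(SSTable(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            value = table.get(key)
            if value is not None:
                return value
        return None


store = Store(memtable_limit=2)
for i in range(6):
    store.put(f"row{i}", f"value{i}")
store.put("row0", "updated")
print(store.get("row0"), store.get("row3"), store.get("missing"))
```

Writes only touch the in-memory table and append whole sorted tables, so nothing is updated in place. Reads pay for that by checking several tables, which is where the Bloom filters and binary search earn their keep.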
  • Enter Cassandra
      • Hybrid of ancestors
          • Adopts listed features
      • And adds:
          • A sweet logo!
          • Pluggable partitioning
          • Multi datacenter support
              • Pluggable locality awareness
          • Datamodel improvements
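A sketch of what "pluggable partitioning" can mean: the ring logic stays the same while the function that turns a row key into a token is swapped. The `Ring` class, the two token functions, and the node tokens below are hypothetical and do not reproduce Cassandra's partitioner interface. SHA-256 stands in for whatever hash a hashing partitioner would use.

```python
import hashlib
from bisect import bisect_left


def hashed_token(key):
    """Spread keys evenly around the ring; row order is lost."""
    return hashlib.sha256(key.encode()).hexdigest()


def ordered_token(key):
    """Use the key itself as the token; neighbouring keys stay together."""
    return key


class Ring:
    def __init__(self, node_tokens, partitioner):
        # node_tokens maps node name -> token; each node owns the range
        # from the previous node's token (exclusive) up to its own (inclusive).
        self.partitioner = partitioner
        self.entries = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.entries]

    def node_for(self, key):
        index = bisect_left(self.tokens, self.partitioner(key))
        return self.entries[index % len(self.entries)][1]  # wrap past the end


ordered_ring = Ring({"a": "h", "b": "p", "c": "z"}, ordered_token)
hashed_ring = Ring({"a": "5" * 64, "b": "a" * 64, "c": "f" * 64}, hashed_token)

for key in ["apple", "banana", "melon", "tomato"]:
    print(key, ordered_ring.node_for(key), hashed_ring.node_for(key))
```

With the order-preserving function, "apple" and "banana" land on the same node, which is what makes range scans over rows cheap. The hashing function trades that away for a more even spread of load.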
  • Enter Cassandra
      • Project status
          • Open sourced by Facebook in 2008 (no longer active)
          • Apache License
          • Graduated to Apache TLP February 2010
          • Major releases: 0.3 through 0.6 (0.7 in two months)
      • cassandra.apache.org
  • Enter Cassandra
      • The code base
          • Java, Apache Ant, Git/SVN
          • 5+ committers from 3+ companies
      • Known deployments at:
          • Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  • Performance
  • Like peanut butter with jelly
      • Apache Cassandra 0.6:
      • MapReduce input support out of the box
          • Locality information partially exposed
          • Hadoop InputFormat
          • Pig LoadFunc
  • Hadoop + Cassandra at RAX
      • Multiple Hadoop clusters deployed
      • Smaller Cassandra deployments
      • Preparing for large scale Cassandra deployment
  • In the pipeline
      • MapReduce output support
          • Adding an OutputFormat with locality information
      • Improving locality for Hadoop inputs
  • Getting started
      • http://cassandra.apache.org/
      • Read "Getting Started"... Roughly:
          • Start one node
          • Test/develop app, editing node config as necessary
          • Launch cluster by starting more nodes with chosen config
  • Thanks! Big Data Workshop Participants!
  • Questions?
  • References
      • Brandon Williams' perf tests
          • http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
      • Hadoop/Cassandra Integration
          • http://issues.apache.org/jira/browse/CASSANDRA-342
