Hadoop and Cassandra at Rackspace

Transcript of "Hadoop and Cassandra at Rackspace"

  1. Making Massive Manageable: Hadoop and Cassandra (at Rackspace). Big Data Workshop. Stu Hood (@stuhood), Technical Lead, Rackspace. April 23rd 2010
  2. My, what a large dataset you have...
      - Processing 3 TB/day of logs
      - Using Hadoop/Pig
      - And the sticking points?
          - “How fast can we provision machines?”
          - “How do we get data on/off the cluster?”
          - “How do we add structure?”
  3. MapReduce
      - Distributed processing methodology
          - Adapt a problem to MapReduce
          - Scale forever
          - Crunch almost anything
      - Typically adding structure to unstructured data
          - Logs
      - Also great for structured data
          - Graph processing
          - Machine learning
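
Note: the "adding structure to unstructured data" case on the MapReduce slide can be illustrated with a minimal single-process sketch. This is not Hadoop code; the log layout (IP, path, status) and the function names are assumptions made for illustration only.

```python
from collections import defaultdict

def map_line(line):
    """Map step: parse one raw log line and emit (key, value) pairs."""
    parts = line.split()
    if len(parts) >= 3:       # silently skip malformed lines
        yield parts[2], 1     # key on the status field (hypothetical layout)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_key(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(lines):
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce_key(k, v) for k, v in shuffle(pairs).items())

logs = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "garbage",
]
counts = run_job(logs)
print(counts)
```

Because each map call sees one line and each reduce call sees one key, both steps can be spread over many machines; that independence is what the "scale forever" bullet relies on.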
  4. “You want to use how many clients?”
      - Need to store structured inputs/outputs
      - Solution needs to
          - Support arbitrary number of clients
          - Preferably provide locality
          - Possibly provide 'web' latency
  5. Solutions of varying quality
      - Sharding the RDBMS
          - shard (n.): a horizontal partition in a database
              - Example: Sharding by userid
          - Provided by ORM?
              - Fixed partitions: manual rebalancing
          - Developing from scratch?
              - Adding/removing nodes
              - Handling failover
              - As a library? As a middle tier?
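
Note: the "fixed partitions: manual rebalancing" point can be made concrete with a small sketch of sharding by userid. The hash-modulo scheme and the names `shard_for` and `moved_fraction` are assumptions for illustration; the slides do not say which scheme any particular ORM uses.

```python
import hashlib

def shard_for(user_id, num_shards):
    """Pick a shard by hashing the user id and taking it modulo the shard count."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def moved_fraction(user_ids, old_n, new_n):
    """Fraction of users whose shard changes when the shard count changes."""
    user_ids = list(user_ids)
    moved = sum(1 for u in user_ids if shard_for(u, old_n) != shard_for(u, new_n))
    return moved / len(user_ids)

users = range(10_000)
fraction = moved_fraction(users, 4, 5)
print(f"users that must move when going from 4 to 5 shards: {fraction:.0%}")
```

With modulo placement, a key stays put only when its hash gives the same remainder under both shard counts, so going from 4 to 5 shards relocates about four fifths of the data in expectation. That is the rebalancing cost the slide alludes to.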
  6. Solutions of varying quality
      - Leaving data in Hadoop
          - Storage in MapFile/SequenceFile
              - Serialized with Thrift/Avro/Protocol Buffers
          - No random access
          - High latency
  7. Solutions of varying quality
      - Storing in HBase/Hypertable
          - Column stores implemented on Hadoop
              - Modeled after Google's Bigtable
          - Multiple points of failure
              - Namenode
              - Master
          - High (almost non-web) latency
  8. And the newest contender...
  9. Standing on the shoulders of: Amazon Dynamo
      - No node in the cluster is special
          - No special roles
          - No scaling bottlenecks
          - No single point of failure
      - Techniques
          - Gossip
          - Eventual consistency
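
Note: a toy simulation can show how peer-to-peer exchange with no special node still leads every replica to the same state. This sketch deliberately merges the two listed techniques into one loop and uses last-write-wins timestamps; in Dynamo-style systems gossip mainly carries membership and failure-detection state, and replicas are reconciled by other mechanisms, so treat it as an illustration and not as Cassandra's implementation. All names here are hypothetical.

```python
import random

def merge(a, b):
    """Last-write-wins: per key, keep the (timestamp, value) pair with the newer timestamp."""
    merged = dict(a)
    for key, versioned in b.items():
        if key not in merged or versioned[0] > merged[key][0]:
            merged[key] = versioned
    return merged

def gossip_round(states, rng):
    """Every node contacts one random peer and both adopt the merged state."""
    n = len(states)
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])
        combined = merge(states[i], states[j])
        states[i] = dict(combined)
        states[j] = dict(combined)

def rounds_to_converge(states, rng, max_rounds=1000):
    """Gossip until all nodes hold identical state; return the round count."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(states, rng)
        if all(s == states[0] for s in states):
            return round_no
    return None

rng = random.Random(0)                       # seeded for repeatability
states = [{} for _ in range(8)]              # eight nodes, none special
states[0]["user:1"] = (1, "alice")           # older write lands on node 0
states[5]["user:1"] = (2, "alice-renamed")   # newer write lands on node 5
states[3]["user:2"] = (1, "bob")
rounds = rounds_to_converge(states, rng)
print(rounds, states[7])
```

Between writes and convergence, different nodes can return different answers for `user:1`; that window is the price of having no coordinating master.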
  10. Standing on the shoulders of: Google Bigtable
      - “Column family” data model
      - Range queries for rows:
          - Scan rows in order
      - Memtable/SSTable structure
          - Always writes sequentially to disk
          - Bloom filters to minimize random reads
          - Trounces B-Trees for big data
              - Linear insert performance
              - Log growth for reads
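
Note: the memtable/SSTable structure and the role of Bloom filters can be sketched in memory as follows. `TinyStore`, `SSTable`, the flush threshold and the filter sizes are all hypothetical; a real store also needs a commit log, deletes (tombstones) and compaction, none of which appear here.

```python
import bisect
import hashlib

class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

class SSTable:
    """Immutable sorted run of (key, value) pairs plus its Bloom filter."""
    def __init__(self, items):
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the lookup entirely
        i = bisect.bisect_left(self.keys, key)  # binary search in sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

class TinyStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # flushed runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out in one sorted, sequential pass."""
        self.sstables.append(SSTable(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            value = table.get(key)
            if value is not None:
                return value
        return None

store = TinyStore()
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.put(k, v)
print(store.get("a"), store.get("b"), store.get("d"), store.get("z"))
```

Writes never modify an existing run, which is why they stay sequential; reads pay for that by checking several runs, and the Bloom filter lets most of those checks be skipped.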
  11. Enter Cassandra
      - Hybrid of ancestors
          - Adopts listed features
      - And adds:
          - A sweet logo!
          - Pluggable partitioning
          - Multi datacenter support
              - Pluggable locality awareness
          - Datamodel improvements
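
Note: "pluggable partitioning" can be pictured as a token ring whose key-to-token function is swappable. The sketch below is an assumption-laden illustration, not Cassandra's code: the two functions only mimic the idea behind a hash-based and an order-preserving partitioner, and the `Ring` class, node names and tokens are hypothetical.

```python
import bisect
import hashlib

def random_partitioner(key):
    """Hash-based tokens: spreads keys evenly but loses key order."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def order_preserving_partitioner(key):
    """Key is its own token: keeps order, so row range scans stay contiguous."""
    return key

class Ring:
    def __init__(self, partitioner, node_tokens):
        self.partitioner = partitioner
        self.tokens = sorted(node_tokens)  # list of (token, node_name)

    def node_for(self, key):
        """A node owns every token up to and including its own; wrap past the last node."""
        token = self.partitioner(key)
        tokens_only = [t for t, _ in self.tokens]
        i = bisect.bisect_left(tokens_only, token)
        return self.tokens[i % len(self.tokens)][1]

ordered = Ring(order_preserving_partitioner, [("g", "A"), ("n", "B"), ("z", "C")])
hashed = Ring(
    random_partitioner,
    [((i + 1) * 2**128 // 3, name) for i, name in enumerate("ABC")],
)
for key in ["alice", "bob", "mallory", "trent"]:
    print(key, ordered.node_for(key), hashed.node_for(key))
```

The trade-off is visible in the output: the order-preserving ring puts adjacent keys such as "alice" and "bob" on the same node, which helps range queries but can create hot spots, while the hashed ring balances load at the cost of ordered scans.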
  12. Enter Cassandra
      - Project status
          - Open sourced by Facebook in 2008 (no longer active)
          - Apache License
          - Graduated to Apache TLP February 2010
          - Major releases: 0.3 through 0.6 (0.7 in two months)
      - cassandra.apache.org
  13. Enter Cassandra
      - The code base
          - Java, Apache Ant, Git/SVN
          - 5+ committers from 3+ companies
      - Known deployments at:
          - Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  14. Performance
  15. Like peanut butter with jelly
      - Apache Cassandra 0.6: MapReduce input support out of the box
          - Locality information partially exposed
          - Hadoop InputFormat
          - Pig LoadFunc
  16. Hadoop + Cassandra at RAX
      - Multiple Hadoop clusters deployed
      - Smaller Cassandra deployments
      - Preparing for large scale Cassandra deployment
  17. In the pipeline
      - MapReduce output support
          - Adding an OutputFormat with locality information
      - Improving locality for Hadoop inputs
  18. Getting started
      - http://cassandra.apache.org/
      - Read "Getting Started"... Roughly:
          - Start one node
          - Test/develop app, editing node config as necessary
          - Launch cluster by starting more nodes with chosen config
  19. Thanks! Big Data Workshop Participants!
  20. Questions?
  21. References
      - Brandon Williams's perf tests
          - http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
      - Hadoop/Cassandra Integration
          - http://issues.apache.org/jira/browse/CASSANDRA-342