
Hadoop and Cassandra at Rackspace


  Making Massive Manageable: Hadoop and Cassandra (at Rackspace)
      Big Data Workshop
      Stu Hood (@stuhood), Technical Lead, Rackspace
      April 23rd, 2010
  My, what a large dataset you have...
      • Processing 3 TB/day of logs
      • Using Hadoop/Pig
      • And the sticking points?
          • “How fast can we provision machines?”
          • “How do we get data on/off the cluster?”
          • “How do we add structure?”
  MapReduce
      • Distributed processing methodology
          • Adapt a problem to MapReduce
          • Scale forever
          • Crunch almost anything
      • Typically adding structure to unstructured data
          • Logs
      • Also great for structured data
          • Graph processing
          • Machine learning
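As a rough illustration of "adapting a problem to MapReduce", the sketch below counts HTTP status codes in log lines using in-memory map, shuffle, and reduce steps. The log layout (status code as the second whitespace-separated field) and all function names are assumptions made for the example; this is not Hadoop's API, only the shape of the computation.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (status, 1) per log line. The field layout is an assumption."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical log lines: "<client> <status> <path>"
logs = ["10.0.0.1 200 /index", "10.0.0.2 404 /missing", "10.0.0.1 200 /about"]
counts = run_job(logs)
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread across as many machines as the data needs, which is where the "scale forever" claim comes from.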
  “You want to use how many clients?”
      • Need to store structured inputs/outputs
      • Solution needs to
          • Support arbitrary number of clients
          • Preferably provide locality
          • Possibly provide 'web' latency
  Solutions of varying quality
      • Sharding the RDBMS
          • shard (n.): a horizontal partition in a database
              • Example: Sharding by userid
          • Provided by ORM?
              • Fixed partitions: manual rebalancing
          • Developing from scratch?
              • Adding/removing nodes
              • Handling failover
              • As a library? As a middle tier?
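To make the "fixed partitions: manual rebalancing" point concrete, here is a minimal sketch of sharding by userid with a modulo scheme. The modulo scheme and the function names are assumptions chosen for illustration; a real ORM or middle tier may place rows differently.

```python
def shard_for(user_id, num_shards):
    """Pick a shard by userid. Modulo placement is an assumed, simple scheme."""
    return user_id % num_shards


def moved_keys(user_ids, old_shards, new_shards):
    """Return the userids whose shard changes when the shard count changes."""
    return [
        u for u in user_ids
        if shard_for(u, old_shards) != shard_for(u, new_shards)
    ]


# Growing a hypothetical 4-shard deployment to 5 shards.
users = range(1000)
moved = moved_keys(users, 4, 5)
fraction_moved = len(moved) / len(users)
```

In this example most userids land on a different shard after adding a single node, so the data has to be migrated by hand. That is the rebalancing cost the slide is pointing at.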
  Solutions of varying quality
      • Leaving data in Hadoop
          • Storage in MapFile/SequenceFile
              • Serialized with Thrift/Avro/Protocol Buffers
          • No random access
          • High latency
  Solutions of varying quality
      • Storing in HBase/Hypertable
          • Column stores implemented on Hadoop
              • Modeled after Google's Bigtable
          • Multiple points of failure
              • Namenode
              • Master
          • High (almost non-web) latency
  And the newest contender...
  Standing on the shoulders of: Amazon Dynamo
      • No node in the cluster is special
          • No special roles
          • No scaling bottlenecks
          • No single point of failure
      • Techniques
          • Gossip
          • Eventual consistency
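A toy sketch of the gossip idea: every node starts out knowing only about itself, repeatedly swaps state with one random peer, and the views converge without any coordinator. The versioned `(version, status)` entries, the push-pull exchange, and all names here are assumptions for illustration, not Dynamo's or Cassandra's actual protocol.

```python
import random


def merge(local, remote):
    """Keep, per key, whichever entry carries the higher version."""
    for key, (version, value) in remote.items():
        if key not in local or local[key][0] < version:
            local[key] = (version, value)


def gossip_round(states, rng):
    """Each node exchanges state (push and pull) with one random peer."""
    nodes = list(states)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merge(states[peer], states[node])
        merge(states[node], states[peer])


def rounds_until_converged(states, rng, max_rounds=100):
    """Gossip until every node holds the same view, or give up."""
    for i in range(1, max_rounds + 1):
        gossip_round(states, rng)
        views = list(states.values())
        if all(v == views[0] for v in views):
            return i
    return None


rng = random.Random(42)  # fixed seed so the run is repeatable
# Eight hypothetical nodes; each initially knows only its own entry.
states = {f"node{i}": {f"node{i}": (1, "up")} for i in range(8)}
rounds = rounds_until_converged(states, rng)
```

No node plays a special role in the exchange, and the views agree only after some rounds have passed, which is the "eventual" part of eventual consistency.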
  Standing on the shoulders of: Google Bigtable
      • “Column family” data model
      • Range queries for rows:
          • Scan rows in order
      • Memtable/SSTable structure
          • Always writes sequentially to disk
          • Bloom filters to minimize random reads
          • Trounces B-Trees for big data
              • Linear insert performance
              • Logarithmic growth for reads
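The memtable/SSTable structure can be sketched in a few dozen lines: writes go to an in-memory table, a full memtable is flushed as one sorted, immutable run, and each run carries a Bloom filter so reads can skip runs that cannot hold the key. Class names, the flush threshold, and the filter sizing are all assumptions for illustration. The sketch leaves out the commit log and compaction, and it is not Cassandra's implementation.

```python
import bisect
import hashlib


class BloomFilter:
    """Small Bloom filter; size and hash count are arbitrary assumptions."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable sorted run, as if written to disk in one sequential pass."""

    def __init__(self, items):
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the (simulated) disk read
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class Store:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # Flush: sort once, then append a new immutable run.
            self.sstables.append(SSTable(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            value = table.get(key)
            if value is not None:
                return value
        return None


store = Store()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("d", 5)]:
    store.put(k, v)
```

Updates never rewrite an existing run: the second write to `"a"` lands in a newer run, and reads consult runs from newest to oldest.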
  Enter Cassandra
      • Hybrid of ancestors
          • Adopts listed features
      • And adds:
          • A sweet logo!
          • Pluggable partitioning
          • Multi-datacenter support
              • Pluggable locality awareness
          • Data model improvements
  Enter Cassandra
      • Project status
          • Open sourced by Facebook in 2008 (Facebook no longer active)
          • Apache License
          • Graduated to an Apache top-level project (TLP) in February 2010
          • Major releases: 0.3 through 0.6 (0.7 in two months)
      • cassandra.apache.org
  Enter Cassandra
      • The code base
          • Java, Apache Ant, Git/SVN
          • 5+ committers from 3+ companies
      • Known deployments at:
          • Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  Performance
  Like peanut butter with jelly
      • Apache Cassandra 0.6:
      • MapReduce input support out of the box
          • Locality information partially exposed
          • Hadoop InputFormat
          • Pig LoadFunc
  Hadoop + Cassandra at RAX
      • Multiple Hadoop clusters deployed
      • Smaller Cassandra deployments
      • Preparing for large scale Cassandra deployment
  In the pipeline
      • MapReduce output support
          • Adding an OutputFormat with locality information
      • Improving locality for Hadoop inputs
  Getting started
      • http://cassandra.apache.org/
      • Read "Getting Started"... Roughly:
          • Start one node
          • Test/develop app, editing node config as necessary
          • Launch cluster by starting more nodes with chosen config
  Thanks! Big Data Workshop Participants!
  Questions?
  References
      • Brandon Williams' perf tests
          • http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
      • Hadoop/Cassandra Integration
          • http://issues.apache.org/jira/browse/CASSANDRA-342
