Hadoop and Cassandra at Rackspace

Hadoop and Cassandra at Rackspace Presentation Transcript

  • 1. Making Massive Manageable: Hadoop and Cassandra (at Rackspace) Big Data Workshop Stu Hood (@stuhood) – Technical Lead, Rackspace April 23rd 2010
  • 2. My, what a large dataset you have...
    • Processing 3 TB/day of logs
    • Using Hadoop/Pig
    • And the sticking points?
      • “How fast can we provision machines?”
      • “How do we get data on/off the cluster?”
      • “How do we add structure?”
  • 7. MapReduce
    • Distributed processing methodology
      • Adapt a problem to MapReduce
      • Scale forever
      • Crunch almost anything
    • Typically adding structure to unstructured data
      • Logs
    • Also great for structured data
      • Graph processing
      • Machine learning
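To make "adding structure to unstructured data" concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not Hadoop or Pig code; the log format (client, path, status) and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: turn one raw log line into zero or more (key, value) pairs.

    Assumes a hypothetical 'client path status' line format.
    Malformed lines are simply skipped.
    """
    parts = line.split()
    if len(parts) >= 3:
        yield parts[2], 1  # key = status code, value = one occurrence


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in one process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample input, including one unparseable line.
LOGS = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "garbage",
]
```

Calling `run_job(LOGS)` counts requests per status code. On a real cluster the map and reduce functions would run on many machines, with the shuffle handled by the framework.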
  • 11. “You want to use how many clients?”
    • Need to store structured inputs/outputs
    • Solution needs to
      • Support arbitrary number of clients
      • Preferably provide locality
      • Possibly provide 'web' latency
  • 15. Solutions of varying quality
    • Sharding the RDBMS
      • shard n. - A horizontal partition in a database
        • Example: Sharding by userid
      • Provided by ORM?
        • Fixed partitions: manual rebalancing
      • Developing from scratch?
        • Adding/removing nodes
        • Handling failover
        • As a library? As a middle tier?
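As a rough illustration of why fixed partitions imply manual rebalancing, the sketch below shards by userid with a hash modulo the shard count, then measures how many users change shard when one shard is added. The function names and the choice of MD5 are assumptions for the example, not something the slides specify.

```python
import hashlib


def shard_for(user_id, num_shards):
    """Map a userid to a shard with a stable hash modulo the shard count."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(user_ids, old_shards, new_shards):
    """Fraction of users whose shard changes when the shard count changes."""
    user_ids = list(user_ids)
    moved = sum(
        1
        for u in user_ids
        if shard_for(u, old_shards) != shard_for(u, new_shards)
    )
    return moved / len(user_ids)
```

With this naive scheme, `moved_fraction(range(10000), 4, 5)` shows that most users land on a different shard after growing from 4 to 5 shards, which is the rebalancing cost a from-scratch sharding layer has to manage.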
  • 18. Solutions of varying quality
    • Leaving data in Hadoop
      • Storage in MapFile/SequenceFile
        • Serialized with Thrift/Avro/Protocol Buffers
      • No random access
      • High latency
  • 20. Solutions of varying quality
    • Storing in HBase/Hypertable
      • Column stores implemented on Hadoop
        • Modeled after Google's Bigtable
      • Multiple points of failure
        • Namenode
        • Master
      • High (almost non-web) latency
  • 22. And the newest contender...
  • 23. Standing on the shoulders of: Amazon Dynamo
    • No node in the cluster is special
      • No special roles
      • No scaling bottlenecks
      • No single point of failure
    • Techniques
      • Gossip
      • Eventual consistency
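The two techniques above can be sketched as a toy simulation: every node is identical, each round every node exchanges state with one random peer, and conflicting values are resolved by "highest version wins". This is an illustrative assumption, not Dynamo's or Cassandra's actual protocol; the class and function names are hypothetical, and ties between equal versions are not handled.

```python
import random


class Node:
    """A peer with no special role: it only holds versioned key/value state."""

    def __init__(self, name):
        self.name = name
        self.state = {}  # key -> (version, value)

    def put(self, key, value, version):
        self.merge({key: (version, value)})

    def merge(self, other_state):
        """Keep, per key, whichever entry carries the higher version."""
        for key, (version, value) in other_state.items():
            if key not in self.state or version > self.state[key][0]:
                self.state[key] = (version, value)


def gossip_round(nodes, rng):
    """Each node picks one random peer and the two exchange state both ways."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.merge(peer.state)
        peer.merge(node.state)


def converged(nodes):
    return all(n.state == nodes[0].state for n in nodes)


def run_until_converged(nodes, rng, max_rounds=100):
    """Gossip until every node holds the same state; return rounds used or None."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if converged(nodes):
            return round_no
    return None
```

If two nodes accept conflicting writes for the same key, the replicas disagree for a few rounds and then settle on the higher-versioned value everywhere, which is the "eventual" in eventual consistency.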
  • 27. Standing on the shoulders of: Google Bigtable
    • “Column family” data model
    • Range queries for rows:
      • Scan rows in order
    • Memtable/SSTable structure
      • Always writes sequentially to disk
      • Bloom filters to minimize random reads
      • Trounces B-Trees for big data
        • Linear insert performance
        • Logarithmic growth for reads
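A small in-memory sketch of the memtable/SSTable idea may help: writes go to a memtable, a full memtable is flushed as an immutable sorted run, and each run carries a Bloom filter so reads can skip runs that cannot contain the key. Everything here (class names, sizes, the SHA-256 based hashing) is an assumption for illustration, not Bigtable's or Cassandra's on-disk format.

```python
import bisect
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    """Immutable sorted run of (key, value) pairs, standing in for a disk file."""

    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.bloom = BloomFilter()
        for k in self.keys:
            self.bloom.add(k)
        self.reads = 0  # how many times the "file" was actually searched

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # skipped without touching the sorted run
        self.reads += 1
        i = bisect.bisect_left(self.keys, key)  # binary search in sorted keys
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class Store:
    """Memtable in front of a list of SSTables (newest last)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable run."""
        self.sstables.append(SSTable(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest run wins
            value = table.get(key)
            if value is not None:
                return value
        return None
```

Note the simplifications: `None` cannot be stored as a value, there is no commit log, and runs are never compacted. Because each run is sorted, the same structure also supports scanning rows in key order.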
  • 32. Enter Cassandra
    • Hybrid of ancestors
      • Adopts listed features
    • And adds:
      • A sweet logo!
      • Pluggable partitioning
      • Multi datacenter support
        • Pluggable locality awareness
      • Datamodel improvements
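One way to picture "pluggable partitioning" is a token ring whose key-to-token function can be swapped. The sketch below is a simplified assumption, not Cassandra's partitioner interface: `Ring`, `random_token`, and `order_preserving_token` are hypothetical names, and replication is ignored.

```python
import bisect
import hashlib


def random_token(key):
    """Hash-based token: spreads keys evenly but destroys key order."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def order_preserving_token(key):
    """Identity token: keeps key order, so adjacent keys share a node."""
    return key


class Ring:
    """Nodes own the token range ending at their own token (wrapping around)."""

    def __init__(self, node_tokens, partitioner):
        self.nodes = dict(node_tokens)      # token -> node name
        self.tokens = sorted(self.nodes)    # ring positions in order
        self.partitioner = partitioner      # the pluggable part

    def node_for(self, key):
        token = self.partitioner(key)
        # First node token >= key token; past the end wraps to the first node.
        i = bisect.bisect_left(self.tokens, token)
        return self.nodes[self.tokens[i % len(self.tokens)]]
```

With `order_preserving_token`, keys that sort next to each other usually land on the same node, which suits range queries over rows. With `random_token`, load is spread more evenly at the cost of ordered scans.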
  • 35. Enter Cassandra
    • Project status
      • Open sourced by Facebook in 2008 (no longer active)
      • Apache License
      • Graduated to Apache TLP February 2010
      • Major releases: 0.3 through 0.6 (0.7 in two months)
    • cassandra.apache.org
  • 39. Enter Cassandra
    • The code base
      • Java, Apache Ant, Git/SVN
      • 5+ committers from 3+ companies
    • Known deployments at:
      • Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  • 41. Performance
  • 42. Like peanut butter with jelly
    • Apache Cassandra 0.6:
    • MapReduce input support out of the box
      • Locality information partially exposed
      • Hadoop InputFormat
      • Pig LoadFunc
  • 46. Hadoop + Cassandra at RAX
    • Multiple Hadoop clusters deployed
    • Smaller Cassandra deployments
    • Preparing for large scale Cassandra deployment
  • 49. In the pipeline
    • MapReduce output support
      • Adding an OutputFormat with locality information
    • Improving locality for Hadoop inputs
  • 50. Getting started
    • http://cassandra.apache.org/
    • Read "Getting Started"... Roughly:
      • Start one node
      • Test/develop app, editing node config as necessary
      • Launch cluster by starting more nodes with chosen config
  • 54. Thanks! Big Data Workshop Participants!
  • 55. Questions?
  • 56. References
    • Brandon Williams's perf tests
      • http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
    • Hadoop/Cassandra Integration
      • http://issues.apache.org/jira/browse/CASSANDRA-342