Hadoop and Cassandra at Rackspace
Presentation Transcript

  • Making Massive Manageable: Hadoop and Cassandra (at Rackspace)
    • Big Data Workshop
    • Stu Hood (@stuhood) – Technical Lead, Rackspace
    • April 23rd 2010
  • My, what a large dataset you have...
    • Processing 3 TB/day of logs
    • Using Hadoop/Pig
    • And the sticking points?
      • “How fast can we provision machines?”
      • “How do we get data on/off the cluster?”
      • “How do we add structure?”
  • MapReduce
    • Distributed processing methodology
      • Adapt a problem to MapReduce
      • Scale forever
      • Crunch almost anything
    • Typically adding structure to unstructured data
      • Logs
    • Also great for structured
      • Graph processing
      • Machine learning
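The MapReduce slide above can be made concrete with a minimal single-process sketch in plain Python. It is not Hadoop or Pig code. The log layout (`<host> <path> <status>`) and every function name are assumptions chosen for illustration: the map step adds structure to raw log lines, and the reduce step aggregates per key.

```python
from collections import defaultdict


def map_log_line(line):
    """Map step: parse one raw log line and emit (status_code, 1)."""
    parts = line.split()
    # Hypothetical log layout: <host> <path> <status>
    if len(parts) >= 3:
        yield parts[2], 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in one process."""
    pairs = (pair for line in lines for pair in map_log_line(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


SAMPLE_LOGS = [
    "web1 /index.html 200",
    "web2 /missing 404",
    "web1 /about.html 200",
    "garbage",  # malformed lines are simply skipped by the mapper
]

if __name__ == "__main__":
    print(run_job(SAMPLE_LOGS))
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data between them; the structure of the job stays the same.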
  • “You want to use how many clients?”
    • Need to store structured inputs/outputs
    • Solution needs to
      • Support arbitrary number of clients
      • Preferably provide locality
      • Possibly provide 'web' latency
  • Solutions of varying quality
    • Sharding the RDBMS
      • shard n. - A horizontal partition in a database
        • Example: Sharding by userid
      • Provided by ORM?
        • Fixed partitions: manual rebalancing
      • Developing from scratch?
        • Adding/removing nodes
        • Handling failover
        • As a library? As a middle tier?
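To illustrate why fixed partitions imply manual rebalancing, here is a small sketch of sharding by userid with a hash-modulo scheme. The scheme and the function names are assumptions for illustration; the slides do not say which partitioning function any particular ORM uses.

```python
import hashlib


def shard_for(user_id, num_shards):
    """Return the shard index for a user id using a stable hash."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(user_ids, old_shards, new_shards):
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1
        for user_id in user_ids
        if shard_for(user_id, old_shards) != shard_for(user_id, new_shards)
    )
    return moved / len(user_ids)


if __name__ == "__main__":
    users = list(range(1000))
    print("user 42 lives on shard", shard_for(42, 4))
    print("fraction moved going from 4 to 5 shards:", moved_fraction(users, 4, 5))
```

With modulo placement, adding a single shard relocates most keys, which is the "adding/removing nodes" problem listed above.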
  • Solutions of varying quality
    • Leaving data in Hadoop
      • Storage in MapFile/SequenceFile
        • Serialized with Thrift/Avro/Protocol Buffers
      • No random access
      • High latency
  • Solutions of varying quality
    • Storing in HBase/Hypertable
      • Column stores implemented on Hadoop
        • Modeled after Google's Bigtable
      • Multiple points of failure
        • Namenode
        • Master
      • High (almost non-web) latency
  • And the newest contender...
  • Standing on the shoulders of: Amazon Dynamo
    • No node in the cluster is special
      • No special roles
      • No scaling bottlenecks
      • No single point of failure
    • Techniques
      • Gossip
      • Eventual consistency
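The two techniques named above can be sketched as a toy simulation: every node is identical, each round a node exchanges state with one random peer, and the highest version of each entry wins. This is a generic illustration, not Dynamo's or Cassandra's actual protocol. The `Node` class, the integer version numbers and the round structure are all assumptions.

```python
import random


class Node:
    """A peer with no special role; all nodes run the same code."""

    def __init__(self, name):
        self.name = name
        self.state = {}  # key -> (version, value)

    def put(self, key, value, version):
        """Accept an entry only if it is newer than what we already hold."""
        current = self.state.get(key)
        if current is None or version > current[0]:
            self.state[key] = (version, value)

    def gossip_with(self, peer):
        """Exchange state both ways; the higher version wins on each side."""
        for key, (version, value) in list(self.state.items()):
            peer.put(key, value, version)
        for key, (version, value) in list(peer.state.items()):
            self.put(key, value, version)


def run_rounds(nodes, rng, max_rounds=100):
    """Gossip until all nodes agree; return the number of rounds taken."""
    for round_no in range(1, max_rounds + 1):
        for node in nodes:
            peer = rng.choice([n for n in nodes if n is not node])
            node.gossip_with(peer)
        if all(n.state == nodes[0].state for n in nodes):
            return round_no
    return None  # did not converge within max_rounds


if __name__ == "__main__":
    cluster = [Node(f"n{i}") for i in range(8)]
    cluster[0].put("k", "v1", version=1)
    cluster[3].put("k", "v2", version=2)  # a newer, conflicting write
    print("converged after rounds:", run_rounds(cluster, random.Random(0)))
```

Nodes may briefly disagree about `k`, but repeated exchanges drive them all to the newest version, which is the sense of "eventual" here.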
  • Standing on the shoulders of: Google Bigtable
    • “Column family” data model
    • Range queries for rows:
      • Scan rows in order
    • Memtable/SSTable structure
      • Always writes sequentially to disk
      • Bloom filters to minimize random reads
      • Trounces B-Trees for big data
        • Linear insert performance
        • Logarithmic growth in read cost
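The memtable/SSTable structure above can be sketched in memory as follows. This is a simplified illustration, not Cassandra's or Bigtable's implementation: the class names, the flush threshold and the Bloom filter sizing are assumptions, and the "disk" is just a sorted Python list.

```python
import bisect
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class SSTable:
    """Immutable sorted run, written once when a memtable is flushed."""

    def __init__(self, sorted_items):
        self.keys = [k for k, _ in sorted_items]
        self.values = [v for _, v in sorted_items]
        self.bloom = BloomFilter()
        for key in self.keys:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip the lookup entirely
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class Store:
    """Writes go to the memtable; full memtables are flushed as SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # oldest first, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write of sorted data; existing tables are untouched.
        self.sstables.append(SSTable(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            value = table.get(key)
            if value is not None:
                return value
        return None
```

Reads check the memtable first and then the SSTables from newest to oldest, with each table's Bloom filter letting the store skip tables that cannot contain the key.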
  • Enter Cassandra
    • Hybrid of ancestors
      • Adopts listed features
    • And adds:
      • A sweet logo!
      • Pluggable partitioning
      • Multi datacenter support
        • Pluggable locality awareness
      • Datamodel improvements
  • Enter Cassandra
    • Project status
      • Open sourced by Facebook in 2008 (Facebook is no longer an active contributor)
      • Apache License
      • Graduated to Apache TLP February 2010
      • Major releases: 0.3 through 0.6 (0.7 in two months)
    • cassandra.apache.org
  • Enter Cassandra
    • The code base
      • Java, Apache Ant, Git/SVN
      • 5+ committers from 3+ companies
    • Known deployments at:
      • Cloudkick, Digg, Mahalo, SimpleGeo, Twitter, Rackspace, Reddit
  • Performance
  • Like peanut butter with jelly
    • Apache Cassandra 0.6:
    • MapReduce input support out of the box
      • Locality information partially exposed
      • Hadoop InputFormat
      • Pig LoadFunc
  • Hadoop + Cassandra at RAX
    • Multiple Hadoop clusters deployed
    • Smaller Cassandra deployments
    • Preparing for large scale Cassandra deployment
  • In the pipeline
    • MapReduce output support
      • Adding an OutputFormat with locality information
    • Improving locality for Hadoop inputs
  • Getting started
    • http://cassandra.apache.org/
    • Read "Getting Started"... Roughly:
      • Start one node
      • Test/develop app, editing node config as necessary
      • Launch cluster by starting more nodes with chosen config
  • Thanks, Big Data Workshop participants!
  • Questions?
  • References
    • Brandon Williams's performance tests
      • http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png
    • Hadoop/Cassandra Integration
      • http://issues.apache.org/jira/browse/CASSANDRA-342