Cassandra/Hadoop Integration
Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.
Published in: Technology

  • Floating above the clouds
  • Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  • Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  • Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  • In other words, are people using this stuff in the real world? In production? Put some notes in here about raptr and Imagini's use cases.
  • Transcript

    • 1. Cassandra/Hadoop Integration
      OLTP + OLAP = Cassandra
    • 2. BigTable + Dynamo
      Semi-structured data model
      Decentralized – no special roles, no SPOF
      Horizontally scalable
      Ridiculously fast writes, fast reads
      Tunably consistent
      Cross-DC capable
      Cassandra (basic overview)
    • 3. Design your data model based on your query model
      Real-time ad-hoc queries aren’t viable
      Secondary indexes help
      What about analytics?
      Querying with Cassandra
    • 4. Hadoop brings analytics
      Pig/Hive and other tools built above MapReduce
      Configurable data sources/destinations
      Many already familiar with it
      Active community
      Enter Hadoop
    • 5. Basic Recipe
      Overlay Hadoop on top of Cassandra
      Separate server for name node and job tracker
      Co-locate task trackers with Cassandra nodes
      Data nodes for distributed cache
      Data locality
      Analytics engine scales with data
      Cluster Configuration
    • 6. Always tune Cassandra to taste
      For Hadoop workloads you might
      Have a separate analytics virtual datacenter
      Using the NetworkTopologyStrategy
      Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
      Tune the cassandra.range.batch.size
      See org.apache.cassandra.hadoop.ConfigHelper
      Cluster Tuning
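The two knobs above can be made concrete. The yaml key and the Hadoop property name come straight from this slide; the values below are illustrative assumptions, not recommendations:

```yaml
# cassandra.yaml on each analytics node: raise the client RPC timeout so that
# long range scans issued by task trackers don't time out mid-split.
# (30000 is an example value, not a recommendation.)
rpc_timeout_in_ms: 30000
```

The range batch size (rows fetched per range request) can likewise be lowered for wide rows, e.g. with -Dcassandra.range.batch.size=1024 on the job command line, or programmatically via org.apache.cassandra.hadoop.ConfigHelper.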
    • 7. All-in-one Configuration
      JobTracker and NameNode
      Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
    • 8. Separate Analytics Configuration
      Separated nodes for analytics
      Nodes for real-time random access
      A single Cassandra cluster with different virtual data centers
    • 9. Cassandra specific InputFormat
      Configuration – ConfigHelper, Hadoop variables
      InputSplits over the data – tunable
      Example usage in contrib/word_count
      MapReduce - InputFormat
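A sketch of the input wiring, following the shape of contrib/word_count. ColumnFamilyInputFormat and ConfigHelper live in org.apache.cassandra.hadoop, but the exact ConfigHelper setter names shifted between releases, so treat the calls below as assumptions:

```java
Job job = new Job(new Configuration(), "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);

// Read map input from Cassandra rather than HDFS.
job.setInputFormatClass(ColumnFamilyInputFormat.class);

Configuration conf = job.getConfiguration();
// What to read: keyspace, column family, and which columns of each row.
ConfigHelper.setColumnFamily(conf, "Keyspace1", "Standard1");
SlicePredicate predicate = new SlicePredicate()
    .setColumn_names(Arrays.asList(ByteBuffer.wrap("text".getBytes())));
ConfigHelper.setSlicePredicate(conf, predicate);
```

Each InputSplit the format hands back covers a token range and lists that range's replica endpoints as its locations, which is how the JobTracker schedules map tasks next to the data: the same extension point HDFS and HBase use.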
    • 10. OutputFormat
      Configuration – ConfigHelper, Hadoop variables
      Batches output – tunable
      Don’t have to use the Cassandra API
      Some optimizations (e.g. ConsistencyLevel.ONE)
      Uses Avro for output serialization (enables streaming)
      Example usage in contrib/word_count
      MapReduce - OutputFormat
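On the output side, the reducer emits (row key, list of mutations) pairs and ColumnFamilyOutputFormat batches them to Cassandra. A sketch under the same caveat about release-to-release naming; sumMutation below is a hypothetical helper that builds the column insert:

```java
// Reducer that writes word counts back to Cassandra: key = row key,
// value = mutations to apply to that row.
public static class Sum
        extends Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {
    public void reduce(Text word, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values)
            sum += v.get();
        // sumMutation (hypothetical) wraps the count in a column-insert Mutation;
        // in 0.7 Mutation is the Avro-generated class, which is what enables
        // streaming output.
        ctx.write(ByteBuffer.wrap(word.getBytes()),
                  Collections.singletonList(sumMutation(sum)));
    }
}
// Wiring: job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
//         ConfigHelper.setOutputColumnFamily(conf, "Keyspace1", "Standard1");
```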
    • 11. Visualizing
      Take vertical slices of columns
      Over the whole column family
    • 12. What about languages outside of Java?
      Build on what Hadoop uses - Streaming
      Output streaming as of 0.7.0
      Example in contrib/hadoop_streaming_output
      Input streaming in progress, hoping for 0.7.2
      Hadoop Streaming
    • 13. Developed at Yahoo!
      PigLatin/Grunt shell
      Powerful scripting language for analytics
      Configuration – Hadoop/Env variables
      Uses pig 0.7+
      Example usage in contrib/pig
    • 14. rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
      as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
      cols = FOREACH rows GENERATE flatten(cols) as (name, value);
      words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
      grouped = GROUP words BY word;
      counts = FOREACH grouped GENERATE group, COUNT(words) as count;
      ordered = ORDER counts BY count DESC;
      topten = LIMIT ordered 10;
      dump topten;
    • 15. ColumnFamilyInputFormat
      Hadoop Streaming Output
      Pig support – Cassandra LoadFunc
      Summary of Integration
    • 16.
      Home grown solution -> Cassandra + Hadoop
      Query time: hours -> minutes
      Pig obviated their need for multi-lingual MR
      Speed and ease are enabling
      Imagini/Visual DNA
      The Dachis Group
      US Government (Digital Reasoning)
      Users of Cassandra + Hadoop
    • 17. Hive support in progress (HIVE-1434)
      Hadoop Input Streaming (hoping for 0.7.2 - 1497)
      Pig Storage Func (CASSANDRA-1828)
      Row predicates (pending CASSANDRA-1600)
      MapReduce et al over secondary indexes (1600)
      Performance improvements (though already good)
    • 18. Performant OLTP + powerful OLAP
      Less need to shuttle data between storage systems
      Data locality for processing
      Scales with the cluster
      Can separate analytics load into virtual DC
    • 19. About Cassandra
      Search and subscribe to the user mailing list (very active)
      #Cassandra on freenode (IRC)
      ~150-200+ users from around the world
      Cassandra: The Definitive Guide
      About Hadoop Support in Cassandra
      Check out various <source>/contrib modules: README/code
      Learn More
    • 20. About me:
      @jeromatron on Twitter
      jeromatron on IRC in #cassandra