Cassandra/Hadoop Integration


Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.

Published in: Technology

  • Floating above the clouds
  • Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  • Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  • Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  • In other words, are people using this in the real world, in production? Put some notes in here about Raptr's and Imagini's use cases.
  • Transcript

    • 1. Cassandra/Hadoop Integration
      OLTP + OLAP = Cassandra
    • 2. BigTable + Dynamo
      Semi-structured data model
      Decentralized – no special roles, no SPOF
      Horizontally scalable
      Ridiculously fast writes, fast reads
      Tunably consistent
      Cross-DC capable
      Cassandra (basic overview)
    • 3. Design your data model based on your query model
      Real-time ad-hoc queries aren’t viable
      Secondary indexes help
      What about analytics?
      Querying with Cassandra
    • 4. Hadoop brings analytics
      Pig/Hive and other tools built above MapReduce
      Configurable data sources/destinations
      Many already familiar with it
      Active community
      Enter Hadoop
    • 5. Basic Recipe
      Overlay Hadoop on top of Cassandra
      Separate server for name node and job tracker
      Co-locate task trackers with Cassandra nodes
      Data nodes for distributed cache
      Data locality
      Analytics engine scales with data
      Cluster Configuration
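The data-locality point in this recipe can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's actual scheduler: the function name `pick_tracker`, the host names, and the fallback rule are all hypothetical.

```python
def pick_tracker(replica_hosts, idle_trackers):
    """Prefer an idle task tracker that runs on a host holding a replica.

    replica_hosts: hosts that store the data for one input split.
    idle_trackers: set of hosts whose task tracker has a free slot.
    Returns (host, is_local).
    """
    for host in replica_hosts:
        if host in idle_trackers:
            return host, True  # data-local: the task reads from its own node
    # No co-located tracker is free, so fall back to any idle one
    # (hypothetical rule: first host in sorted order).
    return sorted(idle_trackers)[0], False
```

Because task trackers are co-located with Cassandra nodes, the first branch is the common case, which is what lets the analytics engine scale along with the data.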
    • 6. Always tune Cassandra to taste
      For Hadoop workloads you might
      Have a separate analytics virtual datacenter
      Using the NetworkTopologyStrategy
      Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
      Tune the cassandra.range.batch.size
      See org.apache.cassandra.hadoop.ConfigHelper
      Cluster Tuning
    • 7. All-in-one Configuration
      JobTracker and NameNode
      Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
    • 8. Separate Analytics Configuration
      Separated nodes for analytics
      Nodes for real-time random access
      A single Cassandra cluster with different virtual data centers
    • 9. Cassandra specific InputFormat
      Configuration – ConfigHelper, Hadoop variables
      InputSplits over the data – tunable
      Example usage in contrib/word_count
      MapReduce - InputFormat
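As a rough picture of what "InputSplits over the data" means, the sketch below cuts a token range into pieces of about `split_size` rows and attaches the replica hosts to each piece. It is not the code of the Cassandra InputFormat: the `Split` class, `make_splits`, and the assumption that rows are spread evenly over the token range are all illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Split:
    start_token: int
    end_token: int
    locations: List[str]  # replica hosts; a scheduler may choose among them


def make_splits(start, end, estimated_rows, split_size, replicas):
    """Cut the token range (start, end] into pieces of about split_size rows."""
    pieces = max(1, -(-estimated_rows // split_size))  # ceiling division
    width = (end - start) / pieces
    # Assumes rows are evenly distributed over tokens (a simplification).
    bounds = [start + round(i * width) for i in range(pieces)] + [end]
    return [Split(bounds[i], bounds[i + 1], list(replicas)) for i in range(pieces)]
```

A smaller `split_size` yields more, shorter map tasks; a larger one yields fewer, longer tasks. That trade-off is what the "tunable" bullet refers to.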
    • 10. OutputFormat
      Configuration – ConfigHelper, Hadoop variables
      Batches output – tunable
      Don’t have to use Cassandra api
      Some optimizations (e.g. ConsistencyLevel.ONE)
      Uses Avro for output serialization (enables streaming)
      Example usage in contrib/word_count
      MapReduce - OutputFormat
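The "batches output" bullet can be illustrated with a small buffered writer. This is a generic sketch of batching, not the OutputFormat's implementation: the `BatchingWriter` class, its method names, and the `send` callback are hypothetical stand-ins for whatever actually ships mutations to the cluster.

```python
class BatchingWriter:
    """Buffer mutations and hand them to `send` in groups of batch_size."""

    def __init__(self, send, batch_size=3):
        self.send = send              # callable that receives a list of mutations
        self.batch_size = batch_size  # the tunable knob
        self.buffer = []

    def write(self, key, column, value):
        self.buffer.append((key, column, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(list(self.buffer))
            self.buffer.clear()

    def close(self):
        # Send whatever is left when the reduce task finishes.
        self.flush()
```

The reducer only calls `write`, so it never touches the Cassandra API directly, which matches the "don't have to use Cassandra api" bullet.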
    • 11. Visualizing
      Take vertical slices of columns
      Over the whole column family
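One way to picture a vertical slice is as a filter on column names applied to every row. The sketch below models a column family as a plain dictionary of rows; the function name `vertical_slice` and the sample data are made up for illustration.

```python
def vertical_slice(column_family, wanted):
    """Keep only the wanted column names from every row of the column family."""
    return {
        key: {name: value for name, value in columns.items() if name in wanted}
        for key, columns in column_family.items()
    }


users = {
    "alice": {"email": "a@example.com", "city": "Austin", "age": "30"},
    "bob": {"email": "b@example.com", "city": "Dallas"},
}
slice_ = vertical_slice(users, {"city"})
# slice_ holds only the "city" column for each row key
```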
    • 12. What about languages outside of Java?
      Build on what Hadoop uses - Streaming
      Output streaming as of 0.7.0
      Example in contrib/hadoop_streaming_output
      Input streaming in progress, hoping for 0.7.2
      Hadoop Streaming
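For readers new to Hadoop Streaming, the general contract is that a mapper and a reducer exchange tab-separated key/value lines, with the reducer's input sorted by key. The sketch below shows that contract for a word count in Python. It covers only the plain-text side of Streaming; it does not show the Avro serialization that Cassandra's output streaming uses, and the function names are illustrative.

```python
import itertools


def mapper(lines):
    """Emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda p: p[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

In a real job each function would live in its own script, reading `sys.stdin` and printing to standard output; sorting between the two stages is done by Hadoop.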
    • 13. Developed at Yahoo!
      PigLatin/Grunt shell
      Powerful scripting language for analytics
      Configuration – Hadoop/Env variables
      Uses pig 0.7+
      Example usage in contrib/pig
    • 14. rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
      as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
      cols = FOREACH rows GENERATE flatten(cols) as (name, value);
      words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
      grouped = GROUP words BY word;
      counts = FOREACH grouped GENERATE group, COUNT(words) as count;
      ordered = ORDER counts BY count DESC;
      topten = LIMIT ordered 10;
      dump topten;
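To make the data flow of this Pig script easier to follow, here is a rough Python equivalent that runs on an in-memory list instead of a cluster. It is an approximation: `top_words` is a made-up name, values are assumed to be strings already, and `str.split()` splits on whitespace only, whereas Pig's TOKENIZE also splits on a few other delimiters.

```python
from collections import Counter


def top_words(rows, limit=10):
    """rows: iterable of (key, cols), where cols is a list of (name, value) pairs."""
    # cols = FOREACH rows GENERATE flatten(cols)
    values = (value for _key, cols in rows for _name, value in cols)
    # words = FOREACH cols GENERATE flatten(TOKENIZE(value))
    words = (word for value in values for word in value.split())
    # grouped = GROUP words BY word; counts = ... COUNT(words)
    counts = Counter(words)
    # ordered = ORDER counts BY count DESC; topten = LIMIT ordered 10
    return counts.most_common(limit)
```

Each comment names the Pig statement that the Python line stands in for, so the two can be read side by side.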
    • 15. ColumnFamilyInputFormat
      Hadoop Streaming Output
      Pig support – Cassandra LoadFunc
      Summary of Integration
    • 16.
      Home grown solution -> Cassandra + Hadoop
      Query time: hours -> minutes
      Pig obviated their need for multi-lingual MR
      Speed and ease are enabling
      Imagini/Visual DNA
      The Dachis Group
      US Government (Digital Reasoning)
      Users of Cassandra + Hadoop
    • 17. Hive support in progress (HIVE-1434)
      Hadoop Input Streaming (hoping for 0.7.2, CASSANDRA-1497)
      Pig Storage Func (CASSANDRA-1828)
      Row predicates (pending CASSANDRA-1600)
      MapReduce et al. over secondary indexes (CASSANDRA-1600)
      Performance improvements (though already good)
    • 18. Performant OLTP + powerful OLAP
      Less need to shuttle data between storage systems
      Data locality for processing
      Scales with the cluster
      Can separate analytics load into virtual DC
    • 19. About Cassandra
      Search and subscribe to the user mailing list (very active)
      #Cassandra on freenode (IRC)
      ~150-200+ users from around the world
      Cassandra: The Definitive Guide
      About Hadoop Support in Cassandra
      Check out various <source>/contrib modules: README/code
      Learn More
    • 20. About me:
      @jeromatron on Twitter
      jeromatron on IRC in #cassandra