Cassandra/Hadoop Integration
Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.

Speaker Notes

  • Floating above the clouds
  • Mention how InputSplit works and how it can choose among replicas (an array of locations is returned).
  • Highlight that this is the same extension point used with HDFS, HBase, and any other data source/destination for MapReduce.
  • Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  • In other words, are people using this stuff in the real world? In production? Put some notes in here about raptr's and Imagini's use cases.

Cassandra/Hadoop Integration Presentation Transcript

  • 1. Cassandra/Hadoop Integration
    OLTP + OLAP = Cassandra
  • 2. BigTable + Dynamo
    Semi-structured data model
    Decentralized – no special roles, no SPOF
    Horizontally scalable
    Ridiculously fast writes, fast reads
    Tunably consistent
    Cross-DC capable
    Cassandra (basic overview)
  • 3. Design your data model based on your query model
    Real-time ad-hoc queries aren’t viable
    Secondary indexes help
    What about analytics?
    Querying with Cassandra
  • 4. Hadoop brings analytics
    Pig/Hive and other tools built above MapReduce
    Configurable data sources/destinations
    Many already familiar with it
    Active community
    Enter Hadoop
  • 5. Basic Recipe
    Overlay Hadoop on top of Cassandra
    Separate server for name node and job tracker
    Co-locate task trackers with Cassandra nodes
    Data nodes for distributed cache
    Data locality
    Analytics engine scales with data
    Cluster Configuration
  • 6. Always tune Cassandra to taste
    For Hadoop workloads you might
    Have a separate analytics virtual datacenter
    Using the NetworkTopologyStrategy
    Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
    Tune the cassandra.range.batch.size
    See org.apache.cassandra.hadoop.ConfigHelper
    Cluster Tuning
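The two knobs named above can be sketched as follows; the values are illustrative, not recommendations. rpc_timeout_in_ms lives in cassandra.yaml, while the range batch size is a Hadoop job property:

```yaml
# cassandra.yaml (0.7-era name): raise the RPC timeout so that long
# Hadoop range scans don't time out mid-split. 30000 ms is an
# illustrative value, not a recommendation.
rpc_timeout_in_ms: 30000
```

cassandra.range.batch.size is the companion Hadoop-side property (how many rows each range-slice call pulls per round trip); lowering it trades more round trips for smaller, timeout-friendlier responses. See org.apache.cassandra.hadoop.ConfigHelper for the setters.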
  • 7. All-in-one Configuration
    JobTracker and NameNode
    Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
  • 8. Separate Analytics Configuration
    Separated nodes for analytics
    Nodes for real-time random access
    A single Cassandra cluster with different virtual data centers
  • 9. Cassandra specific InputFormat
    Configuration – ConfigHelper, Hadoop variables
    InputSplits over the data – tunable
    Example usage in contrib/word_count
    MapReduce - InputFormat
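The "InputSplits over the data" point (and the speaker note about choosing among replicas) rests on Hadoop's InputSplit contract: each split reports every host that can serve its data, and the scheduler prefers a task slot co-located with one of them. A minimal self-contained sketch of that selection logic, using toy classes rather than the real org.apache.cassandra.hadoop API:

```java
import java.util.Arrays;
import java.util.List;

// Toy stand-in for Hadoop's InputSplit: a token range plus the
// replica hosts that can serve it (what getLocations() reports).
class ToySplit {
    final String range;
    final String[] locations;
    ToySplit(String range, String... locations) {
        this.range = range;
        this.locations = locations;
    }
}

public class LocalityDemo {
    // Mimics the scheduler's preference: run the task on a host that
    // is also a replica for the split; fall back to the first replica
    // (a remote read) when no task slot is local.
    static String pickHost(ToySplit split, List<String> taskTrackers) {
        for (String host : split.locations)
            if (taskTrackers.contains(host)) return host;
        return split.locations[0];
    }

    public static void main(String[] args) {
        ToySplit split = new ToySplit("(0, 8500]", "cass1", "cass2", "cass3");
        List<String> trackers = Arrays.asList("cass2", "cass5");
        System.out.println(pickHost(split, trackers)); // prints "cass2"
    }
}
```

This is the same locality mechanism HDFS and HBase splits use, which is why co-locating TaskTrackers with Cassandra nodes (slide 5) pays off.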
  • 10. OutputFormat
    Configuration – ConfigHelper, Hadoop variables
    Batches output – tunable
    Don’t have to use the Cassandra API
    Some optimizations (e.g. ConsistencyLevel.ONE)
    Uses Avro for output serialization (enables streaming)
    Example usage in contrib/word_count
    MapReduce - OutputFormat
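The "batches output" point means the output path buffers mutations and sends them in groups rather than issuing one RPC per reduced value (the group size is the tunable). A self-contained toy illustrating the buffering pattern; this is not the actual ColumnFamilyOutputFormat internals:

```java
import java.util.ArrayList;
import java.util.List;

// Toy write batcher: collects mutations and "sends" them in
// fixed-size batches, the way a batching OutputFormat would.
public class BatchWriter {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    BatchWriter(int batchSize) { this.batchSize = batchSize; }

    void write(String mutation) {
        buffer.add(mutation);
        if (buffer.size() >= batchSize) flush();
    }

    void flush() {
        if (buffer.isEmpty()) return;
        flushes++;              // stand-in for one batched RPC
        buffer.clear();
    }

    int flushCount() { return flushes; }

    public static void main(String[] args) {
        BatchWriter w = new BatchWriter(4);
        for (int i = 0; i < 10; i++) w.write("mutation-" + i);
        w.flush();              // drain the remainder on close
        System.out.println(w.flushCount()); // prints 3 (4 + 4 + 2)
    }
}
```

A bigger batch means fewer round trips but larger requests; pairing a modest batch size with a generous rpc_timeout_in_ms (slide 6) is the usual compromise.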
  • 11. Visualizing
    Take vertical slices of columns
    Over the whole column family
  • 12. What about languages outside of Java?
    Build on what Hadoop uses - Streaming
    Output streaming as of 0.7.0
    Example in contrib/hadoop_streaming_output
    Input streaming in progress, hoping for 0.7.2
    Hadoop Streaming
  • 13. Developed at Yahoo!
    PigLatin/Grunt shell
    Powerful scripting language for analytics
    Configuration – Hadoop/environment variables
    Uses Pig 0.7+
    Example usage in contrib/pig
  • 14. rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
    as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
    cols = FOREACH rows GENERATE flatten(cols) as (name, value);
    words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
    grouped = GROUP words BY word;
    counts = FOREACH grouped GENERATE group, COUNT(words) as count;
    ordered = ORDER counts BY count DESC;
    topten = LIMIT ordered 10;
    dump topten;
  • 15. ColumnFamilyInputFormat
    Hadoop Streaming Output
    Pig support – Cassandra LoadFunc
    Summary of Integration
  • 16. Home-grown solution -> Cassandra + Hadoop
    Query time: hours -> minutes
    Pig obviated their need for multi-lingual MR
    Speed and ease are enabling
    Imagini/Visual DNA
    The Dachis Group
    US Government (Digital Reasoning)
    Users of Cassandra + Hadoop
  • 17. Hive support in progress (HIVE-1434)
    Hadoop input streaming (hoping for 0.7.2 - CASSANDRA-1497)
    Pig StoreFunc (CASSANDRA-1828)
    Row predicates (pending CASSANDRA-1600)
    MapReduce et al. over secondary indexes (CASSANDRA-1600)
    Performance improvements (though already good)
  • 18. Performant OLTP + powerful OLAP
    Less need to shuttle data between storage systems
    Data locality for processing
    Scales with the cluster
    Can separate analytics load into virtual DC
  • 19. About Cassandra
    Search and subscribe to the user mailing list (very active)
    #Cassandra on freenode (IRC)
    ~150-200+ users from around the world
    Cassandra: The Definitive Guide
    About Hadoop Support in Cassandra
    Check out various <source>/contrib modules: README/code
    Learn More
  • 20. About me:
    @jeromatron on Twitter
    jeromatron on IRC in #cassandra