Cassandra/Hadoop Integration

Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.

Published in Technology
  • Floating above the clouds
  • Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  • Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  • Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  • IOW, are people using this stuff in the real world? In production? Put some notes in here about Raptr's and Imagini's use cases.


  • 1. Cassandra/Hadoop Integration
    OLTP + OLAP = Cassandra
  • 2. Cassandra (basic overview)
    BigTable + Dynamo
    Semi-structured data model
    Decentralized – no special roles, no SPOF
    Horizontally scalable
    Ridiculously fast writes, fast reads
    Tunably consistent
    Cross-DC capable
  • 3. Querying with Cassandra
    Design your data model based on your query model
    Real-time ad-hoc queries aren’t viable
    Secondary indexes help
    What about analytics?
  • 4. Enter Hadoop
    Hadoop brings analytics
    Pig/Hive and other tools built above MapReduce
    Configurable data sources/destinations
    Many already familiar with it
    Active community
  • 5. Cluster Configuration
    Basic Recipe
    Overlay Hadoop on top of Cassandra
    Separate server for name node and job tracker
    Co-locate task trackers with Cassandra nodes
    Data nodes for distributed cache
    Data locality
    Analytics engine scales with data
  • 6. Cluster Tuning
    Always tune Cassandra to taste
    For Hadoop workloads you might:
    Have a separate analytics virtual datacenter
    Using the NetworkTopologyStrategy
    Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
    Tune the cassandra.range.batch.size
    See org.apache.cassandra.hadoop.ConfigHelper
  • 7. All-in-one Configuration
    JobTracker and NameNode
    Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
  • 8. Separate Analytics Configuration
    Separated nodes for analytics
    Nodes for real-time random access
    A single Cassandra cluster with different virtual data centers
  • 9. MapReduce - InputFormat
    Cassandra-specific InputFormat
    Configuration – ConfigHelper, Hadoop variables
    InputSplits over the data – tunable
    Example usage in contrib/word_count
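
To make the InputSplit idea on the InputFormat slide concrete, here is a minimal Python sketch. It is not Cassandra's implementation: `Split`, `make_splits`, and `pick_host` are hypothetical names, tokens are simplified to plain integers, and the split size is counted in tokens rather than rows. It only illustrates that each split carries an array of replica locations and that a scheduler can prefer a task tracker running on one of those hosts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for a Hadoop InputSplit over a token range."""
    start_token: int
    end_token: int
    locations: List[str]  # replica hosts that hold this range


def make_splits(start: int, end: int, split_size: int,
                replicas: List[str]) -> List[Split]:
    """Cut the half-open range [start, end) into pieces of at most split_size tokens."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    lo = start
    while lo < end:
        hi = min(lo + split_size, end)
        splits.append(Split(lo, hi, list(replicas)))
        lo = hi
    return splits


def pick_host(split: Split, task_trackers: List[str]) -> Optional[str]:
    """Prefer a task tracker co-located with a replica; None means no local host."""
    for host in split.locations:
        if host in task_trackers:
            return host
    return None


# Assumed example data: three replicas, task trackers on node2 and node4.
splits = make_splits(0, 100, 40, ["node1", "node2", "node3"])
chosen = pick_host(splits[0], ["node2", "node4"])
```

A smaller `split_size` yields more, shorter map tasks; that is the "tunable" knob in this simplified model.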
  • 10. MapReduce - OutputFormat
    Cassandra-specific OutputFormat
    Configuration – ConfigHelper, Hadoop variables
    Batches output – tunable
    Don’t have to use the Cassandra API
    Some optimizations (e.g. ConsistencyLevel.ONE)
    Uses Avro for output serialization (enables streaming)
    Example usage in contrib/word_count
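
The "batches output" point can be illustrated with a small Python sketch of a buffering writer. This is an assumption-laden illustration, not the Cassandra OutputFormat: `BatchingWriter` and its `send` callback are hypothetical, and a real writer would also handle serialization, retries, and consistency levels.

```python
from typing import Callable, List, Tuple

Mutation = Tuple[str, str, str]  # (row key, column name, value)


class BatchingWriter:
    """Hypothetical writer that buffers mutations and sends them in batches."""

    def __init__(self, send: Callable[[List[Mutation]], None], batch_size: int):
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._send = send
        self._batch_size = batch_size
        self._buffer: List[Mutation] = []

    def write(self, key: str, name: str, value: str) -> None:
        """Buffer one mutation and flush once the batch is full."""
        self._buffer.append((key, name, value))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as one batch, if anything."""
        if self._buffer:
            self._send(list(self._buffer))
            self._buffer.clear()

    def close(self) -> None:
        """Flush the final, possibly partial, batch."""
        self.flush()


# Collect batches in a list instead of sending them over the network.
sent: List[List[Mutation]] = []
writer = BatchingWriter(sent.append, batch_size=2)
for i in range(5):
    writer.write(f"row{i}", "count", str(i))
writer.close()
```

With five writes and a batch size of two, the callback receives two full batches and one partial batch on `close()`.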
  • 11. Visualizing
    Take vertical slices of columns
    Over the whole column family
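
One way to picture a "vertical slice" is as a column-name range applied to every row of the column family. The Python sketch below models a column family as a dict of row key to a dict of columns; `slice_columns` is a hypothetical helper, and skipping rows with no matching columns is a choice made for this sketch only.

```python
from typing import Dict, Iterator, Tuple

ColumnFamily = Dict[str, Dict[str, str]]  # row key -> {column name: value}


def slice_columns(column_family: ColumnFamily, start: str,
                  end: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (row key, columns) keeping only columns with start <= name <= end."""
    for key, cols in column_family.items():
        selected = {name: value for name, value in sorted(cols.items())
                    if start <= name <= end}
        if selected:  # sketch choice: rows with no columns in range are skipped
            yield key, selected


# Assumed example data, not taken from the presentation.
users = {
    "user1": {"age": "30", "email": "a@example.com", "name": "Ann"},
    "user2": {"name": "Bob", "zip": "78701"},
}
sliced = dict(slice_columns(users, "email", "name"))
```

Each map task would see the same slice applied to its own share of the rows.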
  • 12. Hadoop Streaming
    What about languages outside of Java?
    Build on what Hadoop uses - Streaming
    Output streaming as of 0.7.0
    Example in contrib/hadoop_streaming_output
    Input streaming in progress, hoping for 0.7.2
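
Hadoop Streaming scripts exchange tab-separated key/value lines over stdin and stdout, with the reducer receiving its input sorted by key. The Python sketch below shows a word-count mapper and reducer written as functions over lines so they can be run without a cluster. The function names are illustrative, and the Avro-serialized output that the Cassandra streaming integration expects is not shown.

```python
import itertools
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; input must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation: sorted() stands in for Hadoop's shuffle-and-sort phase.
mapped = sorted(mapper(["to be or not", "to be"]))
reduced = list(reducer(mapped))
```

In a real streaming job each function would read from `sys.stdin` and print to `sys.stdout` in its own script.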
  • 13. Developed at Yahoo!
    PigLatin/Grunt shell
    Powerful scripting language for analytics
    Configuration – Hadoop/Env variables
    Uses pig 0.7+
    Example usage in contrib/pig
  • 14. rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
    as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
    cols = FOREACH rows GENERATE flatten(cols) as (name, value);
    words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
    grouped = GROUP words BY word;
    counts = FOREACH grouped GENERATE group, COUNT(words) as count;
    ordered = ORDER counts BY count DESC;
    topten = LIMIT ordered 10;
    dump topten;
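
For readers less familiar with Pig Latin, the same pipeline (flatten columns, tokenize values, group, count, order, limit) can be sketched in plain Python over in-memory rows. This is only an illustration of the data flow: `top_words` is a hypothetical helper, the sample rows are made up, and `str.split` on whitespace is a simplification of Pig's `TOKENIZE`, which also splits on some punctuation.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Column = Tuple[bytes, bytes]    # (name, value), both byte arrays as in the Pig schema
Row = Tuple[str, List[Column]]  # (key, cols)


def top_words(rows: Iterable[Row], limit: int = 10) -> List[Tuple[str, int]]:
    """Count words across all column values and return the most frequent ones."""
    counts: Counter = Counter()
    for _key, cols in rows:
        for _name, value in cols:
            # Cast bytes to text, then tokenize on whitespace.
            counts.update(value.decode("utf-8").split())
    # Order by count descending and keep the first `limit` entries.
    return counts.most_common(limit)


# Assumed sample data standing in for Keyspace1/Standard1.
sample_rows: List[Row] = [
    ("k1", [(b"c1", b"the quick fox"), (b"c2", b"the dog")]),
    ("k2", [(b"c1", b"the fox")]),
]
top_two = top_words(sample_rows, limit=2)
```

Pig compiles the script into MapReduce jobs that run across the cluster, whereas this sketch runs in a single process.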
  • 15. Summary of Integration
    ColumnFamilyInputFormat
    Hadoop Streaming Output
    Pig support – Cassandra LoadFunc
  • 16. Users of Cassandra + Hadoop
    Home grown solution -> Cassandra + Hadoop
    Query time: hours -> minutes
    Pig obviated their need for multi-lingual MR
    Speed and ease are enabling
    Imagini/Visual DNA
    The Dachis Group
    US Government (Digital Reasoning)
  • 17. Hive support in progress (HIVE-1434)
    Hadoop Input Streaming (hoping for 0.7.2 - 1497)
    Pig Storage Func (CASSANDRA-1828)
    Row predicates (pending CASSANDRA-1600)
    MapReduce et al over secondary indexes (1600)
    Performance improvements (though already good)
  • 18. Performant OLTP + powerful OLAP
    Less need to shuttle data between storage systems
    Data locality for processing
    Scales with the cluster
    Can separate analytics load into virtual DC
  • 19. Learn More
    About Cassandra
    Search and subscribe to the user mailing list (very active)
    #Cassandra on freenode (IRC)
    ~150-200+ users from around the world
    Cassandra: The Definitive Guide
    About Hadoop Support in Cassandra
    Check out various <source>/contrib modules: README/code
  • 20. About me:
    @jeromatron on Twitter
    jeromatron on IRC in #cassandra