Cassandra/Hadoop Integration

Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.

  • Floating above the clouds
  • Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  • Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  • Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  • In other words, are people using this stuff in the real world? In production? Put some notes in here about Raptr's and Imagini's use cases.
  • Cassandra/Hadoop Integration

    1. Cassandra/Hadoop Integration
         • OLTP + OLAP = Cassandra
    2. Cassandra (basic overview)
         • BigTable + Dynamo
         • Semi-structured data model
         • Decentralized: no special roles, no SPOF
         • Horizontally scalable
         • Ridiculously fast writes, fast reads
         • Tunably consistent
         • Cross-DC capable
    3. Querying with Cassandra
         • Design your data model based on your query model
         • Real-time ad-hoc queries aren’t viable
         • Secondary indexes help
         • What about analytics?
    4. Enter Hadoop
         • Hadoop brings analytics
         • MapReduce
         • Pig/Hive and other tools built above MapReduce
         • Configurable data sources/destinations
         • Many already familiar with it
         • Active community
    5. Cluster Configuration
         • Basic recipe
             • Overlay Hadoop on top of Cassandra
             • Separate server for name node and job tracker
             • Co-locate task trackers with Cassandra nodes
             • Data nodes for distributed cache
         • Voilà
             • Data locality
             • Analytics engine scales with data
    6. Cluster Tuning
         • Always tune Cassandra to taste
         • For Hadoop workloads you might
             • Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
             • Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
             • Tune the cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper)
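One way to picture what a range batch size controls is a scan that pages through a key range in fixed-size batches. The following is a minimal Python sketch of that idea only, not Cassandra's record reader: `fetch_range` and the in-memory sorted key list are hypothetical stand-ins for a range query, and rows are ordered by key here rather than by token.

```python
from bisect import bisect_left, bisect_right


def fetch_range(sorted_keys, start_key, count, inclusive=True):
    """Hypothetical stand-in for one range query: up to `count` keys from start_key."""
    if inclusive:
        i = bisect_left(sorted_keys, start_key)
    else:
        i = bisect_right(sorted_keys, start_key)
    return sorted_keys[i:i + count]


def scan_in_batches(sorted_keys, batch_size):
    """Read every key, one fetch per batch; return (keys_seen, number_of_fetches)."""
    seen, calls = [], 0
    start, inclusive = "", True
    while True:
        batch = fetch_range(sorted_keys, start, batch_size, inclusive)
        calls += 1
        seen.extend(batch)
        if len(batch) < batch_size:
            # A short batch means the range is exhausted.
            return seen, calls
        # Resume just after the last key of the previous batch.
        start, inclusive = batch[-1], False
```

In this model a larger batch size means fewer round trips, but each one returns more data, which is presumably why the slide lists the batch size next to a higher RPC timeout.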
    7. All-in-one Configuration
         • JobTracker and NameNode
         • Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
    8. Separate Analytics Configuration
         • Separated nodes for analytics
         • Nodes for real-time random access
         • A single Cassandra cluster with different virtual data centers
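The virtual data center idea can be illustrated with a toy replica placement routine. This is a simplified Python model, assuming a per-datacenter replica count and ignoring racks; it is not the NetworkTopologyStrategy implementation, and the node and datacenter names are hypothetical.

```python
def place_replicas(ring, token, replicas_per_dc):
    """Pick replica nodes for `token` by walking the ring clockwise.

    ring: list of (node_token, node_name, datacenter), sorted by node_token.
    replicas_per_dc: e.g. {"realtime": 2, "analytics": 1}.
    """
    # Start at the first node whose token is >= the key's token, wrapping to 0.
    start = next((i for i, (t, _, _) in enumerate(ring) if t >= token), 0)
    remaining = dict(replicas_per_dc)
    chosen = []
    for step in range(len(ring)):
        _, name, dc = ring[(start + step) % len(ring)]
        if remaining.get(dc, 0) > 0:
            chosen.append(name)
            remaining[dc] -= 1
    return chosen


ring = [
    (0, "rt1", "realtime"),
    (25, "an1", "analytics"),
    (50, "rt2", "realtime"),
    (75, "an2", "analytics"),
]
replicas = place_replicas(ring, 30, {"realtime": 2, "analytics": 1})
```

Because every key gets a replica in the analytics datacenter, task trackers placed only on those nodes can scan the data locally without loading the real-time nodes.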
    9. MapReduce - InputFormat
         • Cassandra specific InputFormat
         • ColumnFamilyInputFormat
         • Configuration: ConfigHelper, Hadoop variables
         • InputSplits over the data (tunable)
         • Example usage in contrib/word_count
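The notion of input splits that carry replica locations can be sketched as follows. This is an illustrative Python model under stated assumptions, not the ColumnFamilyInputFormat code: token ranges are cut evenly by a hypothetical `splits_per_range` parameter, and `Split`, `make_splits`, and `pick_local_host` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Split:
    start_token: int
    end_token: int
    locations: list  # hosts that hold a replica of this token range


def make_splits(token_ranges, splits_per_range):
    """Cut each (start, end, replica_hosts) range into equal sub-ranges."""
    splits = []
    for start, end, hosts in token_ranges:
        step = (end - start) // splits_per_range
        bounds = [start + i * step for i in range(splits_per_range)] + [end]
        for lo, hi in zip(bounds, bounds[1:]):
            splits.append(Split(lo, hi, list(hosts)))
    return splits


def pick_local_host(split, free_task_trackers):
    """Return a replica host with a free task tracker, or None if none is local."""
    for host in split.locations:
        if host in free_task_trackers:
            return host
    return None


splits = make_splits([(0, 100, ["n1", "n2"]), (100, 200, ["n2", "n3"])], 2)
```

Returning several locations per split is what lets the scheduler choose among replicas and still run the map task next to the data.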
    10. MapReduce - OutputFormat
         • OutputFormat
         • ColumnFamilyOutputFormat
         • Configuration: ConfigHelper, Hadoop variables
         • Batches output (tunable)
         • Don’t have to use Cassandra api
         • Some optimizations (e.g. ConsistencyLevel.ONE)
         • Uses Avro for output serialization (enables streaming)
         • Example usage in contrib/word_count
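Batched output can be sketched as a writer that buffers rows and hands them off in groups. This is a minimal Python illustration of the batching pattern only, not the ColumnFamilyOutputFormat implementation; `BatchingWriter` and its `send_batch` callback are hypothetical names.

```python
class BatchingWriter:
    """Buffer (key, columns) pairs and send them in batches of `batch_size`."""

    def __init__(self, send_batch, batch_size):
        self.send_batch = send_batch  # callable that receives a list of rows
        self.batch_size = batch_size
        self.pending = []

    def write(self, key, columns):
        self.pending.append((key, columns))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send_batch(self.pending)
            self.pending = []  # start a fresh buffer; the sent list is left intact

    def close(self):
        # Send whatever is left when the reducer finishes.
        self.flush()


sent = []
writer = BatchingWriter(sent.append, batch_size=2)
for word, count in [("be", 3), ("to", 2), ("or", 1)]:
    writer.write(word, {"count": count})
writer.close()
```

The reducer only calls `write`, which mirrors the slide's point that job code does not have to call the Cassandra API directly.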
    11. Visualizing
         • Take vertical slices of columns
         • Over the whole column family
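A vertical slice can be pictured as the same column-name range applied to every row. The Python sketch below models a column family as a plain dict of dicts, which is an assumption made for illustration; `vertical_slice` is a hypothetical helper, not part of any Cassandra client.

```python
def vertical_slice(column_family, start, finish):
    """Keep, for every row, only the columns whose name lies in [start, finish]."""
    return {
        key: {name: value for name, value in sorted(cols.items()) if start <= name <= finish}
        for key, cols in column_family.items()
    }


users = {
    "alice": {"age": "31", "city": "Austin", "email": "a@example.com"},
    "bob": {"city": "Dallas", "zip": "75201"},
}
sliced = vertical_slice(users, "b", "d")
```

Each map call would then see one row key together with just the selected columns for that row.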
    12. Hadoop Streaming
         • What about languages outside of Java?
         • Build on what Hadoop uses: Streaming
         • Output streaming as of 0.7.0
         • Example in contrib/hadoop_streaming_output
         • Input streaming in progress, hoping for 0.7.2
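For readers new to Hadoop Streaming, the contract is that a mapper and a reducer read lines and emit tab-separated key/value lines, with the framework sorting by key in between. The Python sketch below shows that contract with a word count; it is a generic illustration, not the contrib/hadoop_streaming_output example, and it does not produce the Avro-serialized records that Cassandra's output streaming relies on.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum counts for each word; input must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Simulate the shuffle/sort step that Hadoop performs between map and reduce.
mapped = sorted(map_lines(["to be or\n", "not to be\n"]))
reduced = list(reduce_lines(mapped))
```

In a real streaming job each function would be wrapped in a small script that reads `sys.stdin` and prints its output lines.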
    13. Pig
         • Developed at Yahoo!
         • PigLatin/Grunt shell
         • Powerful scripting language for analytics
         • Configuration: Hadoop/Env variables
         • Uses Pig 0.7+
         • Example usage in contrib/pig
    14. rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
            as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
        cols = FOREACH rows GENERATE flatten(cols) as (name, value);
        words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
        grouped = GROUP words BY word;
        counts = FOREACH grouped GENERATE group, COUNT(words) as count;
        ordered = ORDER counts BY count DESC;
        topten = LIMIT ordered 10;
        dump topten;
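To make the data flow of the Pig script easier to follow, here is the same pipeline written as a plain Python function over in-memory rows. It is only a reading aid under stated assumptions: the rows mimic the (key, bag of (name, value)) schema from the LOAD statement, `top_words` is a hypothetical name, and whitespace splitting stands in for Pig's TOKENIZE, which also splits on some punctuation.

```python
from collections import Counter


def top_words(rows, limit=10):
    """rows: iterable of (key, [(column_name, column_value), ...])."""
    # cols = FOREACH rows GENERATE flatten(cols)
    cols = [col for _, columns in rows for col in columns]
    # words = FOREACH cols GENERATE flatten(TOKENIZE(value))
    words = [word for _, value in cols for word in value.split()]
    # grouped = GROUP words BY word; counts = ... COUNT(words)
    counts = Counter(words)
    # ordered = ORDER counts BY count DESC; topten = LIMIT ordered 10
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ordered[:limit]


rows = [
    ("k1", [("c1", "to be or"), ("c2", "not to be")]),
    ("k2", [("c1", "be")]),
]
top_two = top_words(rows, limit=2)
```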
    15. Summary of Integration
         • ColumnFamilyInputFormat
         • ColumnFamilyOutputFormat
         • Hadoop Streaming Output
         • Pig support: Cassandra LoadFunc
    16. Users of Cassandra + Hadoop
         • Raptr.com
             • Home grown solution -> Cassandra + Hadoop
             • Query time: hours -> minutes
             • Pig obviated their need for multi-lingual MR
             • Speed and ease are enabling
         • Imagini/Visual DNA
         • The Dachis Group
         • US Government (Digital Reasoning)
             • See http://github.com/digitalreasoning/PyStratus
    17. Future
         • Hive support in progress (HIVE-1434)
         • Hadoop Input Streaming (hoping for 0.7.2; issue 1497)
         • Pig Storage Func (CASSANDRA-1828)
         • Row predicates (pending CASSANDRA-1600)
         • MapReduce et al over secondary indexes (1600)
         • Performance improvements (though already good)
    18. Conclusion
         • Performant OLTP + powerful OLAP
         • Less need to shuttle data between storage systems
         • Data locality for processing
         • Scales with the cluster
         • Can separate analytics load into virtual DC
    19. Learn More
         • About Cassandra
             • http://www.datastax.com/docs
             • http://wiki.apache.org/cassandra
             • Search and subscribe to the user mailing list (very active)
             • #Cassandra on freenode (IRC), ~150-200+ users from around the world
             • Cassandra: The Definitive Guide
         • About Hadoop Support in Cassandra
             • Check out various <source>/contrib modules: README/code
             • http://wiki.apache.org/cassandra/HadoopSupport
    20. Questions
         • About me:
             • jeremy.hanna@dachisgroup.com
             • @jeromatron on Twitter
             • jeromatron on IRC in #cassandra