Cassandra/Hadoop Integration


Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.

  • Floating above the clouds
  • Mention how InputSplit works and how it can choose among replicas – array of locations returned.
  • Highlight how this is the same extension point that is used with HDFS, HBase and any other data source/destination for MapReduce.
  • Mention Jeff Hodges, Johan, Stu, and Todd Lipcon.
  • IOW, are people using this stuff in the real world? In production? Put some notes in here about Raptr’s and Imagini’s use cases.
  • Transcript

    • 1. Cassandra/Hadoop Integration
        • OLTP + OLAP = Cassandra
    • 2. Cassandra (basic overview)
        • BigTable + Dynamo
        • Semi-structured data model
        • Decentralized – no special roles, no SPOF
        • Horizontally scalable
        • Ridiculously fast writes, fast reads
        • Tunably consistent
        • Cross-DC capable
    • 3. Querying with Cassandra
        • Design your data model based on your query model
        • Real-time ad-hoc queries aren’t viable
        • Secondary indexes help
        • What about analytics?
    • 4. Enter Hadoop
        • Hadoop brings analytics
        • MapReduce
        • Pig/Hive and other tools built above MapReduce
        • Configurable data sources/destinations
        • Many already familiar with it
        • Active community
    • 5. Cluster Configuration
        • Basic Recipe
        • Overlay Hadoop on top of Cassandra
        • Separate server for name node and job tracker
        • Co-locate task trackers with Cassandra nodes
        • Data nodes for distributed cache
        • Voilà
        • Data locality
        • Analytics engine scales with data
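The data-locality benefit of co-locating task trackers with Cassandra nodes can be sketched in a few lines of plain Python. This is only an illustration, not Hadoop's scheduler: `pick_tracker`, the host names, and the slot counts are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_tracker(replicas: List[str], free_slots: Dict[str, int]) -> Optional[str]:
    """Choose a host for a task, preferring hosts that hold the data."""
    # First choice: a replica host with a free task slot (data-local read).
    for host in replicas:
        if free_slots.get(host, 0) > 0:
            return host
    # Fallback: any host with a free slot; the rows then cross the network.
    for host, slots in free_slots.items():
        if slots > 0:
            return host
    return None  # every slot is busy


free = {"node1": 0, "node2": 1, "node3": 2}
print(pick_tracker(["node1", "node2"], free))  # node2
```

Because every Cassandra node also runs a task tracker in this layout, the first loop usually succeeds, which is why the analytics capacity grows along with the data.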
    • 6. Cluster Tuning
        • Always tune Cassandra to taste
        • For Hadoop workloads you might
        • Have a separate analytics virtual datacenter
        • Using the NetworkTopologyStrategy
        • Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
        • Tune the cassandra.range.batch.size
        • See org.apache.cassandra.hadoop.ConfigHelper
    • 7. All-in-one Configuration
        • JobTracker and NameNode
        • Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
    • 8. Separate Analytics Configuration
        • Separated nodes for analytics
        • Nodes for real-time random access
        • A single Cassandra cluster with different virtual data centers
    • 9. MapReduce - InputFormat
        • Cassandra-specific InputFormat
        • ColumnFamilyInputFormat
        • Configuration – ConfigHelper, Hadoop variables
        • InputSplits over the data – tunable
        • Example usage in contrib/word_count
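As a rough picture of what "InputSplits over the data – tunable" means, the sketch below cuts each range of rows into chunks of at most `split_size` rows and attaches the replica hosts to every chunk, so a scheduler can choose among them. It is a plain-Python stand-in, not the ColumnFamilyInputFormat code; `Split`, `make_splits`, and the range and host names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Split:
    keys: List[str]       # row keys covered by this split
    locations: List[str]  # hosts that hold a replica of these rows


def make_splits(keys_by_range: Dict[str, List[str]],
                replicas_by_range: Dict[str, List[str]],
                split_size: int) -> List[Split]:
    """Cut every range into splits of at most split_size rows."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    splits = []
    for range_name, keys in keys_by_range.items():
        for start in range(0, len(keys), split_size):
            splits.append(Split(keys[start:start + split_size],
                                replicas_by_range[range_name]))
    return splits


splits = make_splits({"r1": ["a", "b", "c"], "r2": ["d"]},
                     {"r1": ["n1", "n2"], "r2": ["n2", "n3"]},
                     split_size=2)
for split in splits:
    print(split.keys, split.locations)
```

A smaller `split_size` yields more, shorter map tasks; a larger one yields fewer, longer tasks.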
    • 10. MapReduce - OutputFormat
        • OutputFormat
        • ColumnFamilyOutputFormat
        • Configuration – ConfigHelper, Hadoop variables
        • Batches output – tunable
        • Don’t have to use Cassandra API
        • Some optimizations (e.g. ConsistencyLevel.ONE)
        • Uses Avro for output serialization (enables streaming)
        • Example usage in contrib/word_count
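The "batches output – tunable" idea can be illustrated with a small buffering writer: mutations accumulate until a configurable batch size is reached and are then sent in a single call. This is a generic sketch, not the ColumnFamilyOutputFormat implementation; `BatchingWriter`, `Mutation`, and the `send` callback are hypothetical names.

```python
from typing import Callable, List, Tuple

Mutation = Tuple[str, str, str]  # (row key, column name, value)


class BatchingWriter:
    """Buffer mutations and hand them to `send` in groups of batch_size."""

    def __init__(self, send: Callable[[List[Mutation]], None], batch_size: int):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._send = send
        self._batch_size = batch_size
        self._pending: List[Mutation] = []

    def write(self, key: str, name: str, value: str) -> None:
        self._pending.append((key, name, value))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered, then start a fresh buffer.
        if self._pending:
            self._send(self._pending)
            self._pending = []

    def close(self) -> None:
        self.flush()  # do not lose a final, partially filled batch


sent: List[List[Mutation]] = []
writer = BatchingWriter(sent.append, batch_size=2)
for i in range(3):
    writer.write(f"k{i}", "count", str(i))
writer.close()
print(len(sent))  # 2
```

Larger batches mean fewer round trips at the cost of more memory held per task.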
    • 11. Visualizing
        • Take vertical slices of columns
        • Over the whole column family
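A "vertical slice" over the whole column family amounts to visiting every row while keeping only the named columns. The sketch below models a column family as a dictionary of rows; `slice_columns` and the sample data are hypothetical, and no Cassandra API is involved.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, str]  # column name -> value


def slice_columns(rows: Dict[str, Row],
                  wanted: Iterable[str]) -> Iterator[Tuple[str, Row]]:
    """Yield every row key with only the requested columns."""
    wanted = set(wanted)
    for key, columns in rows.items():
        # Rows that lack the wanted columns still appear, with fewer entries.
        yield key, {name: value for name, value in columns.items()
                    if name in wanted}


users = {
    "u1": {"name": "ann", "email": "ann@example.com", "age": "30"},
    "u2": {"name": "bob"},
}
for key, columns in slice_columns(users, ["name", "age"]):
    print(key, columns)
```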
    • 12. Hadoop Streaming
        • What about languages outside of Java?
        • Build on what Hadoop uses - Streaming
        • Output streaming as of 0.7.0
        • Example in contrib/hadoop_streaming_output
        • Input streaming in progress, hoping for 0.7.2
    • 13. Pig
        • Developed at Yahoo!
        • PigLatin/Grunt shell
        • Powerful scripting language for analytics
        • Configuration – Hadoop/Env variables
        • Uses Pig 0.7+
        • Example usage in contrib/pig
    • 14. -- Load each row as a key plus a bag of (name, value) columns.
          rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
              as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
          -- Flatten to one (name, value) tuple per column, then one word per token.
          cols = FOREACH rows GENERATE flatten(cols) as (name, value);
          words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
          -- Count occurrences per word and keep the ten most frequent.
          grouped = GROUP words BY word;
          counts = FOREACH grouped GENERATE group, COUNT(words) as count;
          ordered = ORDER counts BY count DESC;
          topten = LIMIT ordered 10;
          dump topten;
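To see what the Pig script on this slide computes, here is a rough plain-Python equivalent over an in-memory stand-in for the column family. The `top_words` function and the sample rows are hypothetical. It also splits values on whitespace only, which is a simplification of Pig's TOKENIZE.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_words(rows: Dict[str, Dict[str, str]],
              limit: int = 10) -> List[Tuple[str, int]]:
    """Count the words in every column value and return the most frequent."""
    counts: Counter = Counter()
    for columns in rows.values():         # FOREACH rows ... flatten(cols)
        for value in columns.values():
            counts.update(value.split())  # tokenize, group, and count
    return counts.most_common(limit)      # ORDER ... DESC, then LIMIT


sample = {
    "row1": {"text": "a b a"},
    "row2": {"text": "a b c"},
}
print(top_words(sample, limit=2))  # [('a', 3), ('b', 2)]
```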
    • 15. Summary of Integration
        • ColumnFamilyInputFormat
        • ColumnFamilyOutputFormat
        • Hadoop Streaming Output
        • Pig support – Cassandra LoadFunc
    • 16. Users of Cassandra + Hadoop
        • Raptr.com
        • Home grown solution -> Cassandra + Hadoop
        • Query time: hours -> minutes
        • Pig obviated their need for multi-lingual MR
        • Speed and ease are enabling
        • Imagini/Visual DNA
        • The Dachis Group
        • US Government (Digital Reasoning)
        • See http://github.com/digitalreasoning/PyStratus
    • 17. Future
        • Hive support in progress (HIVE-1434)
        • Hadoop Input Streaming (hoping for 0.7.2 - 1497)
        • Pig Storage Func (CASSANDRA-1828)
        • Row predicates (pending CASSANDRA-1600)
        • MapReduce et al over secondary indexes (1600)
        • Performance improvements (though already good)
    • 18. Conclusion
        • Performant OLTP + powerful OLAP
        • Less need to shuttle data between storage systems
        • Data locality for processing
        • Scales with the cluster
        • Can separate analytics load into virtual DC
    • 19. Learn More
        • About Cassandra
        • http://www.datastax.com/docs
        • http://wiki.apache.org/cassandra
        • Search and subscribe to the user mailing list (very active)
        • #Cassandra on freenode (IRC)
        • ~150-200+ users from around the world
        • Cassandra: The Definitive Guide
        • About Hadoop Support in Cassandra
        • Check out various <source>/contrib modules: README/code
        • http://wiki.apache.org/cassandra/HadoopSupport
    • 20. Questions
        • About me:
        • jeremy.hanna@dachisgroup.com
        • @jeromatron on Twitter
        • jeromatron on IRC in #cassandra
