Cassandra/Hadoop Integration


Cassandra/Hadoop Integration presentation given at Data Day Austin on January 29, 2011.

Published in: Technology

  1. Cassandra/Hadoop Integration
       • OLTP + OLAP = Cassandra
  2. Cassandra (basic overview)
       • BigTable + Dynamo
       • Semi-structured data model
       • Decentralized: no special roles, no single point of failure (SPOF)
       • Horizontally scalable
       • Ridiculously fast writes, fast reads
       • Tunably consistent
       • Cross-DC capable
  3. Querying with Cassandra
       • Design your data model based on your query model
       • Real-time ad-hoc queries aren’t viable
       • Secondary indexes help
       • What about analytics?
  4. Enter Hadoop
       • Hadoop brings analytics
       • MapReduce
       • Pig/Hive and other tools built on top of MapReduce
       • Configurable data sources/destinations
       • Many people are already familiar with it
       • Active community
  5. Cluster Configuration
       • Basic recipe
           • Overlay Hadoop on top of Cassandra
           • Separate server for the NameNode and JobTracker
           • Co-locate TaskTrackers with Cassandra nodes
           • DataNodes for the distributed cache
       • Voilà
           • Data locality
           • Analytics engine scales with data
  6. Cluster Tuning
       • Always tune Cassandra to taste
       • For Hadoop workloads you might:
           • Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
           • Raise rpc_timeout_in_ms in cassandra.yaml
           • Tune cassandra.range.batch.size
           • See org.apache.cassandra.hadoop.ConfigHelper
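The range batch size mentioned on the tuning slide bounds how many rows a single range request asks for, so a smaller value means each request does less work and is less likely to run into the RPC timeout. The following is a minimal sketch of that paging idea over an in-memory list. It does not talk to Cassandra: `fetch_range` and `scan_in_batches` are hypothetical names invented for illustration.

```python
from typing import Dict, Iterator, List, Tuple

Row = Tuple[str, Dict[str, str]]


def fetch_range(rows: List[Row], start: int, batch_size: int) -> List[Row]:
    """Hypothetical stand-in for one range request: at most batch_size rows."""
    return rows[start:start + batch_size]


def scan_in_batches(rows: List[Row], batch_size: int) -> Iterator[List[Row]]:
    """Walk every row using a series of bounded requests."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    start = 0
    while True:
        batch = fetch_range(rows, start, batch_size)
        if not batch:
            return
        yield batch
        start += len(batch)


# Ten fake rows scanned four at a time.
rows = [(f"key{i:03d}", {"text": f"value {i}"}) for i in range(10)]
batches = list(scan_in_batches(rows, batch_size=4))
batch_sizes = [len(b) for b in batches]
```

The trade-off is the usual one: larger batches mean fewer round trips, smaller batches mean each round trip finishes sooner.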
  7. All-in-one Configuration
       • JobTracker and NameNode
       • Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)
  8. Separate Analytics Configuration
       • Separate nodes for analytics
       • Nodes for real-time random access
       • A single Cassandra cluster with different virtual data centers
  9. MapReduce - InputFormat
       • Cassandra-specific InputFormat: ColumnFamilyInputFormat
       • Configuration: ConfigHelper, Hadoop variables
       • InputSplits over the data (tunable)
       • Example usage in contrib/word_count
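The "InputSplits over the data (tunable)" point can be pictured as carving the ordered data into contiguous ranges, each handed to one map task. The sketch below groups a sorted list of tokens into ranges of at most `split_size` rows. It is a deliberately simplified model, not the logic inside ColumnFamilyInputFormat, and `make_splits` is a hypothetical helper.

```python
from typing import List, Tuple


def make_splits(sorted_tokens: List[int], split_size: int) -> List[Tuple[int, int]]:
    """Group sorted tokens into (first, last) ranges of at most split_size rows."""
    if split_size < 1:
        raise ValueError("split_size must be at least 1")
    splits = []
    for i in range(0, len(sorted_tokens), split_size):
        chunk = sorted_tokens[i:i + split_size]
        splits.append((chunk[0], chunk[-1]))
    return splits


# Ten rows with tokens 0, 10, ..., 90, split four rows at a time.
tokens = list(range(0, 100, 10))
splits = make_splits(tokens, split_size=4)
```

A smaller split size yields more, shorter map tasks. A larger one yields fewer, longer tasks.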
  10. MapReduce - OutputFormat
       • OutputFormat: ColumnFamilyOutputFormat
       • Configuration: ConfigHelper, Hadoop variables
       • Batches output (tunable)
       • Don’t have to use the Cassandra API
       • Some optimizations (e.g. ConsistencyLevel.ONE)
       • Uses Avro for output serialization (enables streaming)
       • Example usage in contrib/word_count
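The "batches output" point on the OutputFormat slide describes buffering writes and sending them in groups rather than one at a time. The following is a rough sketch of that pattern. `BatchingWriter` and its `sink` callback are hypothetical stand-ins for illustration and do not mirror ColumnFamilyOutputFormat's actual classes.

```python
from typing import Callable, Dict, List, Tuple

Mutation = Tuple[str, Dict[str, str]]


class BatchingWriter:
    """Buffer writes and hand them to a sink in groups of batch_size."""

    def __init__(self, sink: Callable[[List[Mutation]], None], batch_size: int):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._sink = sink
        self._batch_size = batch_size
        self._pending: List[Mutation] = []

    def write(self, key: str, columns: Dict[str, str]) -> None:
        self._pending.append((key, columns))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._sink(list(self._pending))
            self._pending.clear()

    def close(self) -> None:
        # Send whatever is left in the buffer when the task finishes.
        self.flush()


sent: List[List[Mutation]] = []
writer = BatchingWriter(sent.append, batch_size=3)
for i in range(7):
    writer.write(f"word{i}", {"count": str(i)})
writer.close()
sent_sizes = [len(batch) for batch in sent]
```

The final `close()` matters: without it the last partial batch would never be sent.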
  11. Visualizing
       • Take vertical slices of columns over the whole column family
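One way to picture a "vertical slice" is as selecting the same named columns from every row of a column family. The sketch below models a column family as a dictionary of rows. The data and the `vertical_slice` helper are made up for illustration and are not part of any Cassandra API.

```python
from typing import Dict, Iterable

ColumnFamily = Dict[str, Dict[str, str]]


def vertical_slice(column_family: ColumnFamily, column_names: Iterable[str]) -> ColumnFamily:
    """Keep only the requested columns from every row."""
    wanted = set(column_names)
    return {
        key: {name: value for name, value in columns.items() if name in wanted}
        for key, columns in column_family.items()
    }


# Rows need not share the same columns (semi-structured data).
users = {
    "alice": {"email": "a@example.com", "city": "Austin", "bio": "likes data"},
    "bob": {"email": "b@example.com", "city": "Dallas"},
    "carol": {"bio": "new here"},
}
sliced = vertical_slice(users, ["email", "city"])
```

In this sketch a row with none of the requested columns comes back with an empty column set rather than being dropped.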
  12. Hadoop Streaming
       • What about languages outside of Java?
       • Build on what Hadoop uses: Streaming
       • Output streaming as of 0.7.0
       • Example in contrib/hadoop_streaming_output
       • Input streaming in progress, hoping for 0.7.2
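For readers new to Hadoop Streaming, the general convention is that a mapper and a reducer are ordinary programs exchanging tab-separated key/value lines over standard input and output. The word-count sketch below follows that generic convention using plain iterables so it can be run without a cluster. It does not show the Avro-encoded output that Cassandra's streaming output expects, and the function names are illustrative only.

```python
import itertools
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# The sort between the two phases stands in for Hadoop's shuffle.
mapped = sorted(mapper(["the quick fox", "The lazy dog"]))
reduced = list(reducer(mapped))
```

In a real streaming job the two functions would live in separate scripts that read `sys.stdin` and print each result line.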
  13. Pig
       • Developed at Yahoo!
       • Pig Latin/Grunt shell
       • Powerful scripting language for analytics
       • Configuration: Hadoop/environment variables
       • Uses Pig 0.7+
       • Example usage in contrib/pig
  14. rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
          as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
      cols = FOREACH rows GENERATE flatten(cols) as (name, value);
      words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
      grouped = GROUP words BY word;
      counts = FOREACH grouped GENERATE group, COUNT(words) as count;
      ordered = ORDER counts BY count DESC;
      topten = LIMIT ordered 10;
      dump topten;
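The Pig word-count script on this slide can be traced step by step with a small in-memory equivalent. The sketch below mirrors each Pig relation with a Python variable of the same name. The sample rows are invented, whitespace splitting is only an approximation of Pig's TOKENIZE, and breaking count ties alphabetically is an assumption made so the result is deterministic.

```python
from collections import Counter

# (key, [(column name, column value), ...]) pairs standing in for the loaded rows.
rows = [
    ("row1", [("c1", "hadoop brings analytics"), ("c2", "cassandra brings fast writes")]),
    ("row2", [("c1", "cassandra and hadoop")]),
]

# cols = FOREACH rows GENERATE flatten(cols): one (name, value) pair per column.
cols = [(name, value) for _key, columns in rows for name, value in columns]

# words = FOREACH cols GENERATE flatten(TOKENIZE(value)): one entry per word.
words = [word for _name, value in cols for word in value.split()]

# grouped + counts: group identical words and count each group.
counts = Counter(words)

# ordered = ORDER counts BY count DESC (ties broken by word: an assumption).
ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# topten = LIMIT ordered 10.
topten = ordered[:10]
```

Each intermediate list corresponds to one relation in the script, which makes it easier to see what the `flatten` calls do to the nested bag of columns.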
  15. Summary of Integration
       • ColumnFamilyInputFormat
       • ColumnFamilyOutputFormat
       • Hadoop Streaming Output
       • Pig support: Cassandra LoadFunc
  16. Users of Cassandra + Hadoop
       • Home-grown solution -> Cassandra + Hadoop
       • Query time: hours -> minutes
       • Pig obviated their need for multi-lingual MR
       • Speed and ease are enabling
       • Imagini/Visual DNA
       • The Dachis Group
       • US Government (Digital Reasoning)
       • See
  17. Future
       • Hive support in progress (HIVE-1434)
       • Hadoop Input Streaming (hoping for 0.7.2 - 1497)
       • Pig Storage Func (CASSANDRA-1828)
       • Row predicates (pending CASSANDRA-1600)
       • MapReduce et al over secondary indexes (1600)
       • Performance improvements (though already good)
  18. Conclusion
       • Performant OLTP + powerful OLAP
       • Less need to shuttle data between storage systems
       • Data locality for processing
       • Scales with the cluster
       • Can separate analytics load into a virtual DC
  19. Learn More
       • About Cassandra
           • Search and subscribe to the user mailing list (very active)
           • #Cassandra on freenode (IRC)
           • ~150-200+ users from around the world
           • Cassandra: The Definitive Guide
       • About Hadoop Support in Cassandra
           • Check out various <source>/contrib modules: README/code
  20. Questions
       • About me:
           • @jeromatron on Twitter
           • jeromatron on IRC in #cassandra