Cassandra/Hadoop Integration

Cassandra/Hadoop Integration: OLTP + OLAP = Cassandra

Cassandra (basic overview)
- BigTable + Dynamo
- Semi-structured data model
- Decentralized: no special roles, no SPOF
- Horizontally scalable
- Ridiculously fast writes, fast reads
- Tunably consistent
- Cross-DC capable

Querying with Cassandra
- Design your data model based on your query model
- Real-time ad-hoc queries aren’t viable
- Secondary indexes help
- What about analytics?

Enter Hadoop
- Hadoop brings analytics
- MapReduce
- Pig/Hive and other tools built above MapReduce
- Configurable data sources/destinations
- Many already familiar with it
- Active community

Cluster Configuration
- Basic recipe
  - Overlay Hadoop on top of Cassandra
  - Separate server for name node and job tracker
  - Co-locate task trackers with Cassandra nodes
  - Data nodes for distributed cache
- Voilà
  - Data locality
  - Analytics engine scales with data

Cluster Tuning
- Always tune Cassandra to taste
- For Hadoop workloads you might:
  - Have a separate analytics virtual datacenter, using the NetworkTopologyStrategy
  - Tune the rpc_timeout_in_ms in cassandra.yaml (higher)
  - Tune the cassandra.range.batch.size (see org.apache.cassandra.hadoop.ConfigHelper)

All-in-one Configuration
- JobTracker and NameNode
- Each node has Cassandra, a TaskTracker, and a DataNode (for distributed cache)

Separate Analytics Configuration
- Separated nodes for analytics
- Nodes for real-time random access
- A single Cassandra cluster with different virtual data centers

MapReduce - InputFormat
- Cassandra-specific InputFormat: ColumnFamilyInputFormat
- Configuration: ConfigHelper, Hadoop variables
- InputSplits over the data (tunable)
- Example usage in contrib/word_count
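
The idea of tunable input splits can be sketched roughly as below. This is a conceptual illustration only, not the ColumnFamilyInputFormat implementation: the function name split_token_range, the integer tokens, and the estimated row count are all assumptions made for the example, and wrap-around ranges on the ring are ignored.

```python
from typing import List, Tuple


def split_token_range(start: int, end: int, estimated_rows: int,
                      split_size: int) -> List[Tuple[int, int]]:
    """Divide the token range (start, end] into sub-ranges that each
    cover roughly split_size rows (hypothetical helper)."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    if end <= start:
        raise ValueError("end must be greater than start")
    # Ceiling division gives the number of splits; always at least one.
    n = max(1, -(-estimated_rows // split_size))
    # Never create more splits than there are tokens in the range.
    n = min(n, end - start)
    width = end - start
    # Evenly spaced boundaries; adjacent pairs form the sub-ranges.
    bounds = [start + (width * i) // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

A smaller split_size yields more, shorter map tasks; a larger one yields fewer, longer tasks.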

MapReduce - OutputFormat
- OutputFormat: ColumnFamilyOutputFormat
- Configuration: ConfigHelper, Hadoop variables
- Batches output (tunable)
- Don’t have to use the Cassandra API
- Some optimizations (e.g. ConsistencyLevel.ONE)
- Uses Avro for output serialization (enables streaming)
- Example usage in contrib/word_count
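
Batched output can be pictured with a small sketch like the one below. It is not the ColumnFamilyOutputFormat code: BatchingWriter, the send_batch callback, and the (row key, column name, value) tuple are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple

Mutation = Tuple[str, str, str]  # (row key, column name, value), assumed shape


class BatchingWriter:
    """Buffers writes and hands them to send_batch in groups (hypothetical)."""

    def __init__(self, send_batch: Callable[[List[Mutation]], None],
                 batch_size: int = 2) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._pending: List[Mutation] = []

    def write(self, key: str, column: str, value: str) -> None:
        # Queue the mutation; flush once the batch is full.
        self._pending.append((key, column, value))
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is queued, then start a fresh buffer.
        if self._pending:
            self._send_batch(self._pending)
            self._pending = []

    def close(self) -> None:
        # Make sure a partial final batch is not lost.
        self.flush()
```

The batch size trades round trips against memory held per task, which is presumably why the slide calls it tunable.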

Visualizing
- Take vertical slices of columns
- Over the whole column family
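
One way to picture a vertical slice is with plain dictionaries standing in for a column family. The sketch below is an analogy, not Cassandra code: the vertical_slice function and the sample layout (row key mapped to a dict of columns) are assumptions.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, str]  # column name -> value, assumed shape


def vertical_slice(column_family: Dict[str, Row],
                   column_names: Iterable[str]) -> Iterator[Tuple[str, Row]]:
    """Visit every row, keeping only the requested columns (hypothetical)."""
    wanted = set(column_names)
    for key, columns in column_family.items():
        # Rows need not contain every requested column.
        yield key, {name: value for name, value in columns.items()
                    if name in wanted}
```

Every row key is visited, but each map task only sees the columns the job asked for.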

Hadoop Streaming
- What about languages outside of Java?
- Build on what Hadoop uses: Streaming
- Output streaming as of 0.7.0
- Example in contrib/hadoop_streaming_output
- Input streaming in progress, hoping for 0.7.2
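
For readers new to Hadoop Streaming, a generic word-count mapper and reducer in Python might look like the sketch below. It shows only the usual Streaming convention of tab-separated key/value lines; it does not reproduce the Avro records that the Cassandra output path expects (see contrib/hadoop_streaming_output for that). The function names are assumptions.

```python
from itertools import groupby
from typing import Callable, Iterable, Iterator, TextIO


def map_lines(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum counts per word. Input must already be sorted by key,
    as Hadoop guarantees between the map and reduce phases."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(step: Callable[[Iterable[str]], Iterator[str]],
        stream_in: Iterable[str], stream_out: TextIO) -> None:
    """Pipe one step from an input stream to an output stream."""
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

A mapper script would call run(map_lines, sys.stdin, sys.stdout), and a reducer script would do the same with reduce_lines.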

Pig
- Developed at Yahoo!
- PigLatin/Grunt shell
- Powerful scripting language for analytics
- Configuration: Hadoop/Env variables
- Uses Pig 0.7+
- Example usage in contrib/pig

rows = LOAD 'cassandra://Keyspace1/Standard1' USING CassandraStorage()
    as (key:chararray, cols:bag{col:tuple(name:bytearray, value:bytearray)});
cols = FOREACH rows GENERATE flatten(cols) as (name, value);
words = FOREACH cols GENERATE flatten(TOKENIZE((chararray) value)) as word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words) as count;
ordered = ORDER counts BY count DESC;
topten = LIMIT ordered 10;
dump topten;
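
To make the data flow of the script above concrete, here is a rough single-machine equivalent in Python. It is a sketch for understanding only: the top_words function and the in-memory row layout are assumptions, values are plain strings rather than byte arrays, and str.split() tokenizes on whitespace only, which is simpler than Pig's TOKENIZE.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Column = Tuple[str, str]          # (name, value)
Row = Tuple[str, List[Column]]    # (key, cols), mirroring the Pig schema


def top_words(rows: Iterable[Row], limit: int = 10) -> List[Tuple[str, int]]:
    """Count words across all column values and return the most frequent."""
    counts: Counter = Counter()
    for _key, cols in rows:
        # flatten(cols): visit every (name, value) pair of every row.
        for _name, value in cols:
            # TOKENIZE + GROUP + COUNT, collapsed into one counter update.
            counts.update(value.split())
    # ORDER ... DESC followed by LIMIT.
    return counts.most_common(limit)
```

The Pig version expresses the same steps declaratively and runs them as MapReduce jobs across the cluster.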

Summary of Integration
- ColumnFamilyInputFormat
- ColumnFamilyOutputFormat
- Hadoop Streaming Output
- Pig support: Cassandra LoadFunc

Users of Cassandra + Hadoop
- Raptr.com
  - Home grown solution -> Cassandra + Hadoop
  - Query time: hours -> minutes
  - Pig obviated their need for multi-lingual MR
  - Speed and ease are enabling
- Imagini/Visual DNA
- The Dachis Group
- US Government (Digital Reasoning)
  - See http://github.com/digitalreasoning/PyStratus

Future
- Hive support in progress (HIVE-1434)
- Hadoop Input Streaming (hoping for 0.7.2 - 1497)
- Pig Storage Func (CASSANDRA-1828)
- Row predicates (pending CASSANDRA-1600)
- MapReduce et al over secondary indexes (1600)
- Performance improvements (though already good)

Conclusion
- Performant OLTP + powerful OLAP
- Less need to shuttle data between storage systems
- Data locality for processing
- Scales with the cluster
- Can separate analytics load into virtual DC

Learn More
- About Cassandra
  - http://www.datastax.com/docs
  - http://wiki.apache.org/cassandra
  - Search and subscribe to the user mailing list (very active)
  - #Cassandra on freenode (IRC): ~150-200+ users from around the world
  - Cassandra: The Definitive Guide
- About Hadoop Support in Cassandra
  - Check out various <source>/contrib modules: README/code
  - http://wiki.apache.org/cassandra/HadoopSupport

Questions
- About me: jeremy.hanna@dachisgroup.com
- @jeromatron on Twitter
- jeromatron on IRC in #cassandra
