Hadoop Integration in Cassandra

Presented at Cassandra-London Meetup on 21st March, 2011.
Introduction to Hadoop support in Cassandra with a small example and some stats.

Hadoop Integration in Cassandra Presentation Transcript

  • 1. Hadoop + Cassandra
  • 2. Cassandra
       Distributed and decentralized data store.
       Very efficient for fast writes and reads (we ourselves run a website that reads/writes in real time to Cassandra).
       But what about analytics?
  • 3. Hadoop over Cassandra
       Useful for -
         Built-in support for Hadoop since 0.6
         Can use any language without having to understand the Thrift API
         Distributed analysis - massively reduces processing time
         Possible to use Pig/Hive
       What is supported -
         Read from Cassandra since 0.6
         Write to Cassandra since 0.7
         Support for Hadoop Streaming since 0.7 (only output streaming supported as of now)
  • 4. Cluster Configuration
       Ideal configuration -
         Overlay a Hadoop cluster over the Cassandra nodes
         Separate server for namenode/jobtracker
         Tasktracker on each Cassandra node
         At least one node needs to be a data node for house-keeping purposes
       What this achieves -
         Data locality
         Analytics engine scales with data
  • 5. Ideal is not always ideal enough
       A certain level of tuning is always required:
         Tune cassandra.range.batch.size - usually you will want to reduce it (see the sketch below).
         Tune rpc_timeout_in_ms in cassandra.yaml (storage-conf.xml on 0.6) to avoid time-outs.
         Use NetworkTopologyStrategy and custom snitches to separate the analytics load into a virtual data-center.
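    A rough illustration of the batch-size tuning above, assuming the 0.7-era ConfigHelper API. The class name, the job handle, and the value 1024 are placeholders, not recommendations from the talk:

        import org.apache.cassandra.hadoop.ConfigHelper;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        public class TuningSketch {
            // Shrink cassandra.range.batch.size so each get_range_slices call
            // fetches fewer rows and finishes well inside rpc_timeout_in_ms.
            static void tuneBatchSize(Job job) {
                Configuration conf = job.getConfiguration();
                ConfigHelper.setRangeBatchSize(conf, 1024); // 1024 is illustrative; the 0.7 default is 4096
                // rpc_timeout_in_ms is a server-side setting: raise it in cassandra.yaml
                // (storage-conf.xml on 0.6), not in the job configuration.
            }
        }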
  • 6. Sample cluster topology
       [Topology diagrams: real-time random access with an all-in-one topology vs. separate analytics nodes]
  • 7. Classes that make all this possible
       ColumnFamilyRecordReader and ColumnFamilyRecordWriter - read/write rows from/to Cassandra
       ColumnFamilySplit - creates splits over the Cassandra data
       ConfigHelper - helper to configure Cassandra-specific job information
       ColumnFamilyInputFormat and ColumnFamilyOutputFormat - inherit Hadoop classes so that Hadoop jobs can interact with the data (read/write)
       AvroOutputReader - streams output to Cassandra
  • 8. Example
  • 9. // (imports omitted on the slide)
       public class Lookalike extends Configured implements Tool {
           public static void main(String[] args) throws Exception {
               ToolRunner.run(new Configuration(), new Lookalike(), args);
               System.exit(0);
           }

           @Override
           public int run(String[] arg0) throws Exception {
               Job job = new Job(getConf(), "Lookalike Report");
               Configuration conf = job.getConfiguration();
               job.setJarByClass(Lookalike.class);
               job.setMapperClass(LookalikeMapper.class);
               job.setReducerClass(LookalikeReducer.class);
               job.setOutputKeyClass(TextPair.class); // 1
               job.setOutputValueClass(TextPair.class);
               FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_LOOKALIKE));
               job.setPartitionerClass(KeyPartitioner.class); // 2
               job.setGroupingComparatorClass(TextPair.GroupingComparator.class); // 2
               job.setInputFormatClass(ColumnFamilyInputFormat.class);
               ConfigHelper.setThriftContact(conf, host, 9160);
               ConfigHelper.setColumnFamily(conf, keyspace, columnFamily);
               ConfigHelper.setRangeBatchSize(conf, batchSize);
               List<byte[]> columnNames = Arrays.asList("properties".getBytes(), "personality".getBytes());
               SlicePredicate predicate = new SlicePredicate().setColumn_names(columnNames);
               ConfigHelper.setSlicePredicate(conf, predicate);
               job.waitForCompletion(true);
               return 0;
           }
       }
       1 - See this for more on TextPair - http://bit.ly/fCtaZA
       2 - See this for more on Secondary Sort - http://bit.ly/eNWbN8
  • 10. public static class LookalikeMapper
              extends Mapper<String, SortedMap<byte[], IColumn>, TextPair, TextPair> {
            private HashMap<String, String> targetUserMap;

            @Override
            protected void setup(Context context) {
                targetUserMap = loadTargetUserMap();
            }

            public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
                    throws IOException, InterruptedException {
                // Read the properties and personality columns
                IColumn propertiesColumn = columns.get("properties".getBytes());
                if (propertiesColumn == null)
                    return;
                String propertiesValue = new String(propertiesColumn.value()); // JSONObject
                IColumn personalityColumn = columns.get("personality".getBytes());
                if (personalityColumn == null)
                    return;
                String personalityValue = new String(personalityColumn.value()); // JSONObject

                for (Map.Entry<String, String> targetUser : targetUserMap.entrySet()) {
                    double score = scoreLookAlike(targetUser.getValue(), personalityValue);
                    if (score >= FILTER_SCORE) {
                        context.write(new TextPair(propertiesValue, String.valueOf(score)),
                                      new TextPair(targetUser.getKey(), String.valueOf(score)));
                    }
                }
            }
        }
  • 11. public class LookalikeReducer extends Reducer<TextPair, TextPair, Text, Text> {
            @Override
            public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                    throws IOException, InterruptedException {
                int counter = 0;
                for (TextPair value : values) {
                    if (counter >= USER_COUNT) // USER_COUNT = 100
                        break;
                    context.write(key.getFirst(), value.getFirst() + "\t" + value.getSecond());
                    counter++;
                }
            }
        }
        // Sample output:
        // TargetUser                            Lookalike User                        Score
        // 7f55fdd8-76dc-102e-b2e6-001ec9d506ae  de6fbeac-7205-ff9c-d74d-2ec57841fd0b  0.2602739
        // It is also possible to write this output to Cassandra (we don't do this currently).
        // It is quite straightforward. See the word_count example in the Cassandra contrib folder.
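    As the closing comment above notes, the output could also be written back to Cassandra via ColumnFamilyOutputFormat. A minimal sketch of what such a reducer might look like, modelled on the word_count contrib example the slide points to; the class name, the "lookalike" column name, and the 0.7-era Thrift types are assumptions, not code from the talk:

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.util.Collections;
        import java.util.List;

        import org.apache.cassandra.thrift.Column;
        import org.apache.cassandra.thrift.ColumnOrSuperColumn;
        import org.apache.cassandra.thrift.Mutation;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Sketch: a reducer that emits (row key, mutations) pairs for
        // ColumnFamilyOutputFormat instead of writing text to HDFS. Wire it up
        // with job.setOutputFormatClass(ColumnFamilyOutputFormat.class) and
        // point ConfigHelper at the output keyspace/column family.
        public class CassandraWriterReducer
                extends Reducer<Text, Text, ByteBuffer, List<Mutation>> {

            @Override
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text value : values) {
                    // One column per lookalike; "lookalike" is a made-up column name.
                    Column column = new Column();
                    column.setName(ByteBuffer.wrap("lookalike".getBytes()));
                    column.setValue(ByteBuffer.wrap(value.toString().getBytes()));
                    column.setTimestamp(System.currentTimeMillis());

                    ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
                    cosc.setColumn(column);
                    Mutation mutation = new Mutation();
                    mutation.setColumn_or_supercolumn(cosc);

                    // Row key = target user; the output format batches the mutations.
                    context.write(ByteBuffer.wrap(key.toString().getBytes()),
                                  Collections.singletonList(mutation));
                }
            }
        }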
  • 12. Some stats
        Cassandra cluster of 16 nodes
        Hadoop cluster of 5 nodes
        Over 120 million rows
        Over 600 GB of data
        Over 20 trillion computations
        Hadoop: just over 4 hours
        Serial PHP script: crossed 48 hours and was still chugging along
  • 13. Links
        Cassandra: The Definitive Guide
        Hadoop MapReduce in Cassandra cluster (DataStax)
        Cassandra and Hadoop MapReduce (DataStax)
        Cassandra Wiki - Hadoop Support
        Cassandra/Hadoop Integration (Jeremy Hanna)
        Hadoop: The Definitive Guide
  • 14. Questions