Hadoop Integration in Cassandra


Presented at Cassandra-London Meetup on 21st March, 2011.
Introduction to Hadoop support in Cassandra with a small example and some stats.

  1. Hadoop + Cassandra
  2. Cassandra
     - Distributed and decentralized data store
     - Very efficient for fast writes and reads (we ourselves run a website that reads/writes in real time to Cassandra)
     - But what about analytics?
  3. Hadoop over Cassandra
     Useful for:
     - Built-in support for Hadoop since 0.6
     - Can use any language without having to understand the Thrift API
     - Distributed analysis massively reduces time
     - Possible to use Pig/Hive
     What is supported:
     - Reading from Cassandra since 0.6
     - Writing to Cassandra since 0.7
     - Hadoop Streaming since 0.7 (only output streaming supported as of now)
  4. Cluster Configuration
     Ideal configuration:
     - Overlay a Hadoop cluster over the Cassandra nodes
     - Separate server for the namenode/jobtracker
     - Tasktracker on each Cassandra node
     - At least one node needs to be a datanode for house-keeping purposes
     What this achieves:
     - Data locality
     - The analytics engine scales with the data
  5. Ideal is not always ideal enough
     A certain level of tuning is always required:
     - Tune cassandra.range.batch.size; usually you will want to reduce it.
     - Tune rpc_timeout_in_ms in cassandra.yaml (or storage-conf.xml for 0.6) to avoid time-outs.
     - Use NetworkTopologyStrategy and custom snitches to separate the analytics nodes into a virtual data center.
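     As a minimal illustration of the timeout tuning above, a cassandra.yaml fragment (the value is an arbitrary example, not a recommendation; 0.6 uses RpcTimeoutInMillis in storage-conf.xml instead):

     ```yaml
     # cassandra.yaml (0.7): how long a coordinator waits on replicas before
     # timing out; Hadoop range scans with large batches may need more headroom
     rpc_timeout_in_ms: 30000
     ```

     On the Hadoop side, cassandra.range.batch.size can be set per job via ConfigHelper.setRangeBatchSize(conf, batchSize), as in the driver code below.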
  6. Sample cluster topology
     - Real-time random access: all-in-one topology
     - Separate analytics nodes
  7. Classes that make all this possible
     - ColumnFamilyRecordReader and ColumnFamilyRecordWriter: read/write rows from/to Cassandra
     - ColumnFamilySplit: creates splits over the Cassandra data
     - ConfigHelper: helper to configure Cassandra-specific information
     - ColumnFamilyInputFormat and ColumnFamilyOutputFormat: inherit Hadoop classes so that Hadoop jobs can interact with the data (read/write)
     - AvroOutputReader: stream output to Cassandra
  8. Example
  9. The job driver:

     ```java
     public class Lookalike extends Configured implements Tool {

         public static void main(String[] args) throws Exception {
             ToolRunner.run(new Configuration(), new Lookalike(), args);
             System.exit(0);
         }

         @Override
         public int run(String[] arg0) throws Exception {
             Job job = new Job(getConf(), "Lookalike Report");
             job.setJarByClass(Lookalike.class);
             job.setMapperClass(LookalikeMapper.class);
             job.setReducerClass(LookalikeReducer.class);
             job.setOutputKeyClass(TextPair.class);   // 1
             job.setOutputValueClass(TextPair.class);
             FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_LOOKALIKE));
             job.setPartitionerClass(KeyPartitioner.class);                      // 2
             job.setGroupingComparatorClass(TextPair.GroupingComparator.class);  // 2
             job.setInputFormatClass(ColumnFamilyInputFormat.class);

             Configuration conf = job.getConfiguration();
             ConfigHelper.setThriftContact(conf, host, 9160);
             ConfigHelper.setColumnFamily(conf, keyspace, columnFamily);
             ConfigHelper.setRangeBatchSize(conf, batchSize);
             List<byte[]> columnNames = Arrays.asList("properties".getBytes(),
                                                      "personality".getBytes());
             SlicePredicate predicate = new SlicePredicate().setColumn_names(columnNames);
             ConfigHelper.setSlicePredicate(conf, predicate);

             job.waitForCompletion(true);
             return 0;
         }
     }
     ```

     1 - See this for more on TextPair - http://bit.ly/fCtaZA
     2 - See this for more on Secondary Sort - http://bit.ly/eNWbN8
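     The TextPair used above can be sketched roughly as follows. This is a simplified stand-in, assuming a pair that orders on both fields while grouping only on the first; the real class described in the linked articles also implements Hadoop's WritableComparable for serialization.

     ```java
     // Minimal sketch of the TextPair idea behind secondary sort (assumption:
     // the real TextPair also implements WritableComparable for Hadoop I/O).
     class TextPair implements Comparable<TextPair> {
         private final String first;
         private final String second;

         TextPair(String first, String second) {
             this.first = first;
             this.second = second;
         }

         String getFirst()  { return first; }
         String getSecond() { return second; }

         // Full ordering: by first, then by second. The sort comparator uses
         // this, so values arrive at the reducer already ordered.
         @Override
         public int compareTo(TextPair other) {
             int cmp = first.compareTo(other.first);
             return cmp != 0 ? cmp : second.compareTo(other.second);
         }

         // Grouping: compare only the first field, so every pair sharing the
         // "natural" key reaches the same reduce() call.
         static int groupCompare(TextPair a, TextPair b) {
             return a.getFirst().compareTo(b.getFirst());
         }
     }
     ```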
  10. The mapper:

     ```java
     public static class LookalikeMapper
             extends Mapper<String, SortedMap<byte[], IColumn>, TextPair, TextPair> {

         private HashMap<String, String> targetUserMap;

         @Override
         protected void setup(Context context) {
             targetUserMap = loadTargetUserMap();
         }

         public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
                 throws IOException, InterruptedException {
             // Read the properties and personality columns
             IColumn propertiesColumn = columns.get("properties".getBytes());
             if (propertiesColumn == null) return;
             String propertiesValue = new String(propertiesColumn.value());   // JSONObject
             IColumn personalityColumn = columns.get("personality".getBytes());
             if (personalityColumn == null) return;
             String personalityValue = new String(personalityColumn.value()); // JSONObject

             for (Map.Entry<String, String> targetUser : targetUserMap.entrySet()) {
                 double score = scoreLookAlike(targetUser.getValue(), personalityValue);
                 if (score >= FILTER_SCORE) {
                     context.write(new TextPair(propertiesValue, String.valueOf(score)),
                                   new TextPair(targetUser.getKey(), String.valueOf(score)));
                 }
             }
         }
     }
     ```
  11. The reducer:

     ```java
     public class LookalikeReducer extends Reducer<TextPair, TextPair, Text, Text> {

         @Override
         public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                 throws IOException, InterruptedException {
             int counter = 1;
             for (TextPair value : values) {
                 if (counter >= USER_COUNT) { // USER_COUNT = 100
                     break;
                 }
                 context.write(key.getFirst(), value.getFirst() + "\t" + value.getSecond());
                 counter++;
             }
         }
     }
     ```

     Sample output:

     TargetUser                             Lookalike User                         Score
     7f55fdd8-76dc-102e-b2e6-001ec9d506ae   de6fbeac-7205-ff9c-d74d-2ec57841fd0b   0.2602739

     It is also possible to write this output to Cassandra (we don't do this currently).
     It is quite straightforward - see the word_count example in the Cassandra contrib folder.
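     The write-back to Cassandra mentioned above would be configured roughly like this. This is a sketch along the lines of the word_count contrib example, not code we run: the exact ConfigHelper method names vary between Cassandra versions, and "Keyspace1"/"Lookalikes" are placeholder names.

     ```java
     // Sketch only: send reducer output to Cassandra instead of HDFS.
     // Keys are row keys; values are the mutations to apply to that row.
     job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
     job.setOutputKeyClass(ByteBuffer.class);   // row key
     job.setOutputValueClass(List.class);       // List<Mutation> per row
     ConfigHelper.setOutputColumnFamily(job.getConfiguration(),
                                        "Keyspace1", "Lookalikes");
     ```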
  12. Some stats
     - Cassandra cluster of 16 nodes; Hadoop cluster of 5 nodes
     - Over 120 million rows, over 600 GB of data
     - Over 20 trillion computations
     - Hadoop: just over 4 hours
     - Serial PHP script: crossed 48 hours and was still chugging along
  13. Links
     - Cassandra: The Definitive Guide
     - Hadoop MapReduce in Cassandra cluster (DataStax)
     - Cassandra and Hadoop MapReduce (DataStax)
     - Cassandra Wiki - Hadoop Support
     - Cassandra/Hadoop Integration (Jeremy Hanna)
     - Hadoop: The Definitive Guide
  14. Questions