Your SlideShare is downloading. ×
Hadoop Integration in Cassandra
Upcoming SlideShare
Loading in...5

Thanks for flagging this SlideShare!

Oops! An error has occurred.

Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Hadoop Integration in Cassandra


Published on

Presented at Cassandra-London Meetup on 21st March, 2011. …

Presented at Cassandra-London Meetup on 21st March, 2011.
Introduction to Hadoop support in Cassandra with a small example and some stats.

Published in: Technology
  • Be the first to comment

No Downloads
Total Views
On Slideshare
From Embeds
Number of Embeds
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

No notes for slide


  • 1. Hadoop +Cassandra
  • 2. Cassandra Distributed and decentralized data store Very efficient for fast writes and reads (we ourselves run a website that reads/writes in real time to Cassandra) But what about analytics?
  • 3. Hadoop over Cassandra Useful for - Built-in support for hadoop since 0.6 Can use any language without having to understand the Thrift API Distributed analysis - massively reduces time Possible to use Pig/Hive What is supported - Read from Cassandra since 0.6 Write to Cassandra since 0.7 Support for Hadoop Streaming since 0.7 (only output streaming supported as of now)
  • 4. Cluster Configuration Ideal configuration - Overlay a Hadoop cluster over the Cassandra nodes Separate server for namenode/jobtracker Tasktracker on each Cassandra node At least one node needs to be a data node for house-keeping purposes What this achieves - Data locality Analytics engine scales with data
  • 5. Ideal is not always ideal enough Certain level of tuning always required Tune cassandra.range.batch.size. Usually would want to reduce it. Tune rpc_timeout_in_ms in cassandra.yaml (or storage-conf. xml for 0.6+) to avoid time-outs. Use NetworkTopologyStrategy and custom Snitches to separate the analytics as a virtual data-center.
  • 6. Sample cluster topology Real time random access - All-in-one topology Separate analytics nodes
  • 7. Classes that make all this possible ColumnFamilyRecordReader and ColumnFamilyRecordWriter To read/write rows from/to Cassandra ColumnFamilySplit Create splits over the Cassandra data ConfigHelper Helper to configure Cassandra specific information ColumnFamilyInputFormat and ColumnFamilyOutputFormat Inherit Hadoop classes so that Hadoop jobs can interact with data (read/write) AvroOutputReader Stream output to Cassandra
  • 8. Example
  • 9. public class Lookalike extends Configured implements Tool { public static void main(String[] args) throws Exception { Configuration(), new Lookalike(), args); System.exit(0); } @Override public int run(String[] arg0) throws Exception { Job job = new Job(getConf(), "Lookalike Report"); job.setJarByClass(Lookalike.class); job.setMapperClass(LookalikeMapper.class); job.setReducerClass(LookalikeReducer.class); TextPair.class1); job.setOutputKeyClass( job.setOutputValueClass(TextPair.class); FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_LOOKALIKE)); KeyPartitioner.class2); job.setPartitionerClass( job.setGroupingComparatorClass( extPair.GroupingComparator.class2); T job.setInputFormatClass(ColumnFamilyInputFormat.class); ConsetThriftContact(conf, host, 9160); ConfigHelper.setColumnFamily(conf, keyspace, columnFamily); ConfigHelper.setRangeBatchSize(conf, batchSize); List<byte[]> columnNames = Arrays.asList("properties".getBytes(), "personality".getBytes()) SlicePredicate predicate = new SlicePredicate().setColumn_names(columnNames); ConfigHelper.setSlicePredicate(conf, predicate); job.waitForCompletion(true); return 0; } 1 - See this for more on TextPair - 2 - See this for more on Secondary Sort -
  • 10. public static class LookalikeMapper extends Mapper<String, SortedMap<byte[], IColumn>,TextPair, TextPair>{@Override protected void setup(Context context) { HasMap<String, String> targetUserMap = loadTargetUserMap(); } public void map(String key, SortedMap<byte[], IColumn> columns, Contextcontext) throws IOException, InterruptedException { //Read the properties and personality columns IColumn propertiesColumn = columns.get("properties".getBytes()); if (propertiesColumn == null) return; String propertiesValue = new String(propertiesColumn.value()); //JSONObject IColumn personalityColumn = columns.get("personality".getBytes()); if (personalityColumn == null) return; String personalityValue = new String(personalityColumn.value());//JSONObject for(Map.Entry<String, String> targetUser : targetUserMap.entrySet()) { double score = scoreLookAlike(targetUser.getValue(), personalityValue); if(score>=FILTER_SCORE) { context.write(new TextPair(propertiesValue,score.toString()), new TextPair(targetUserMap.getKey(), score.toString)); } } }}
  • 11. public class LookalikeReducer extends Reducer<TextPair, TextPair, Text, Text> {{ @Override public void reduce(TextPair key, Iterable<TextPair> values, Context context) throws IOException, InterruptedException { { int counter = 1; for(TextPair value : values) { if(counter >= USER_COUNT) //USER_COUNT = 100 { break; } context.write(key.getFirst(), value.getFirst() + "t" + value.getSecond()); counter++; } }}//Sample Output//TargetUser Lookalike User Score//7f55fdd8-76dc-102e-b2e6-001ec9d506ae de6fbeac-7205-ff9c-d74d-2ec57841fd0b 0.2602739//It is also possible to write this output to Cassandra (we dont do this currently).//It is quite straight forward. See word_count example in Cassandra contrib folder
  • 12. Some stats Cassandra cluster of 16 nodes Hadoop cluster of 5 nodes Over 120 million rows Over 600 GB of data Over 20 Trillion computations Hadoop - Just over 4 hours Serial PHP script - crossed 48 hours and was still chugging along
  • 13. LinksCassandra : The Definitive GuideHadoop MapReduce in Cassandra cluster (DataStax)Cassandra and Hadoop MapReduce (Datastax)Cassandra Wiki - Hadoop SupportCassandra/Hadoop Integration (Jeremy Hanna)Hadoop : The Definitive Guide
  • 14. Questions