Hadoop Integration in Cassandra

Presented at Cassandra-London Meetup on 21st March, 2011.
Introduction to Hadoop support in Cassandra with a small example and some stats.

Transcript

  • 1. Hadoop + Cassandra
  • 2. Cassandra
    - Distributed and decentralized data store
    - Very efficient for fast writes and reads (we ourselves run a website that reads/writes in real time to Cassandra)
    - But what about analytics?
  • 3. Hadoop over Cassandra
    Useful for:
    - Built-in support for Hadoop since 0.6
    - Can use any language without having to understand the Thrift API
    - Distributed analysis, which massively reduces processing time
    - Possible to use Pig/Hive
    What is supported:
    - Read from Cassandra since 0.6
    - Write to Cassandra since 0.7
    - Support for Hadoop Streaming since 0.7 (only output streaming supported as of now)
  • 4. Cluster Configuration
    Ideal configuration:
    - Overlay a Hadoop cluster over the Cassandra nodes
    - Separate server for namenode/jobtracker
    - Tasktracker on each Cassandra node
    - At least one node needs to be a data node for housekeeping purposes
    What this achieves:
    - Data locality
    - Analytics engine scales with data
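The data-locality point on slide 4 can be illustrated with a small, Hadoop-free sketch. Hadoop input splits report the hosts that hold their data (`InputSplit.getLocations()`), and the scheduler prefers a tasktracker on one of those hosts. The `LocalityPicker` class and `pickLocalHost` method below are hypothetical names used only for illustration, not part of Hadoop or Cassandra.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of locality-aware task placement; not Hadoop's actual scheduler. */
public class LocalityPicker {

    /**
     * Returns the first host that both stores the split's data and runs a tasktracker.
     * An empty result means the task would have to read its rows over the network.
     */
    public static Optional<String> pickLocalHost(List<String> splitLocations,
                                                 Set<String> taskTrackerHosts) {
        return splitLocations.stream()
                .filter(taskTrackerHosts::contains)
                .findFirst();
    }
}
```

With a tasktracker on every Cassandra node, as the slide recommends, each split's replica hosts always overlap the set of tasktracker hosts, so a local placement is always available in this model.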
  • 5. Ideal is not always ideal enough
    - A certain level of tuning is always required
    - Tune cassandra.range.batch.size. Usually you would want to reduce it.
    - Tune rpc_timeout_in_ms in cassandra.yaml (or storage-conf.xml for 0.6.x) to avoid time-outs.
    - Use NetworkTopologyStrategy and custom snitches to separate the analytics nodes into a virtual data center.
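To make the batch-size trade-off on slide 5 concrete, here is a minimal sketch, with no Cassandra dependency, of reading a key range in fixed-size batches. `cassandra.range.batch.size` sets how many rows are requested per range call, so a smaller value means more round trips but less work per call (and less risk of hitting the RPC timeout). The `RangePager` class, its in-memory map, and its method names are hypothetical stand-ins, not Cassandra's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

/** Hypothetical sketch: pages through sorted rows in fixed-size batches. */
public class RangePager {
    private final NavigableMap<String, String> rows; // stand-in for the rows of one split
    private final int batchSize;
    private int requests = 0; // number of simulated round trips

    public RangePager(NavigableMap<String, String> rows, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.rows = rows;
        this.batchSize = batchSize;
    }

    /** Fetches up to batchSize rows with keys strictly after startKey (null = from the start). */
    private List<Map.Entry<String, String>> fetchBatch(String startKey) {
        requests++;
        NavigableMap<String, String> tail =
                (startKey == null) ? rows : rows.tailMap(startKey, false);
        List<Map.Entry<String, String>> batch = new ArrayList<>();
        for (Map.Entry<String, String> entry : tail.entrySet()) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(entry);
        }
        return batch;
    }

    /** Reads every row, one batch per request, and returns the keys in order. */
    public List<String> readAll() {
        List<String> keys = new ArrayList<>();
        String lastKey = null;
        while (true) {
            List<Map.Entry<String, String>> batch = fetchBatch(lastKey);
            for (Map.Entry<String, String> entry : batch) {
                keys.add(entry.getKey());
            }
            if (batch.size() < batchSize) {
                break; // a short batch means the range is exhausted
            }
            lastKey = batch.get(batch.size() - 1).getKey();
        }
        return keys;
    }

    public int getRequests() {
        return requests;
    }
}
```

Constructing the pager over the same rows with a smaller `batchSize` and comparing `getRequests()` shows the cost side of the trade-off: the row count is unchanged while the number of requests grows.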
  • 6. Sample cluster topology (diagram; labels: Real-time random access, All-in-one topology, Separate analytics nodes)
  • 7. Classes that make all this possible
    - ColumnFamilyRecordReader and ColumnFamilyRecordWriter: read/write rows from/to Cassandra
    - ColumnFamilySplit: create splits over the Cassandra data
    - ConfigHelper: helper to configure Cassandra-specific information
    - ColumnFamilyInputFormat and ColumnFamilyOutputFormat: inherit Hadoop classes so that Hadoop jobs can interact with data (read/write)
    - AvroOutputReader: stream output to Cassandra
  • 8. Example
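  • 9. Job driver
    import java.util.Arrays;
    import java.util.List;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class Lookalike extends Configured implements Tool {
        // host, keyspace, columnFamily, batchSize and OUTPUT_PATH_LOOKALIKE
        // are fields/constants that are not shown on the slide.

        public static void main(String[] args) throws Exception {
            // Propagate the job's status instead of always exiting with 0.
            System.exit(ToolRunner.run(new Configuration(), new Lookalike(), args));
        }

        @Override
        public int run(String[] args) throws Exception {
            Job job = new Job(getConf(), "Lookalike Report");
            // Job copies the Configuration, so configure the job's own copy.
            Configuration conf = job.getConfiguration();

            job.setJarByClass(Lookalike.class);
            job.setMapperClass(LookalikeMapper.class);
            job.setReducerClass(LookalikeReducer.class);
            job.setOutputKeyClass(TextPair.class); // [1]
            job.setOutputValueClass(TextPair.class);
            FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH_LOOKALIKE));

            // Secondary sort setup [2]
            job.setPartitionerClass(KeyPartitioner.class);
            job.setGroupingComparatorClass(TextPair.GroupingComparator.class);

            // Read the input rows from Cassandra.
            job.setInputFormatClass(ColumnFamilyInputFormat.class);
            ConfigHelper.setThriftContact(conf, host, 9160);
            ConfigHelper.setColumnFamily(conf, keyspace, columnFamily);
            ConfigHelper.setRangeBatchSize(conf, batchSize);

            // Fetch only the two columns the mapper needs.
            List<byte[]> columnNames =
                Arrays.asList("properties".getBytes(), "personality".getBytes());
            SlicePredicate predicate = new SlicePredicate().setColumn_names(columnNames);
            ConfigHelper.setSlicePredicate(conf, predicate);

            return job.waitForCompletion(true) ? 0 : 1;
        }
    }

    [1] See this for more on TextPair - http://bit.ly/fCtaZA
    [2] See this for more on Secondary Sort - http://bit.ly/eNWbN8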
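  • 10. Mapper (nested in Lookalike)
    // Needs these imports at the top of the file:
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.SortedMap;
    import org.apache.cassandra.db.IColumn;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class LookalikeMapper
            extends Mapper<String, SortedMap<byte[], IColumn>, TextPair, TextPair> {

        // Kept as a field so that map() can see what setup() loaded.
        private HashMap<String, String> targetUserMap;

        @Override
        protected void setup(Context context) {
            targetUserMap = loadTargetUserMap(); // helper not shown on the slide
        }

        @Override
        public void map(String key, SortedMap<byte[], IColumn> columns, Context context)
                throws IOException, InterruptedException {
            // Read the properties and personality columns; skip rows missing either.
            IColumn propertiesColumn = columns.get("properties".getBytes());
            if (propertiesColumn == null) return;
            String propertiesValue = new String(propertiesColumn.value()); // JSONObject

            IColumn personalityColumn = columns.get("personality".getBytes());
            if (personalityColumn == null) return;
            String personalityValue = new String(personalityColumn.value()); // JSONObject

            // Score this row against every target user and keep the good matches.
            // scoreLookAlike and FILTER_SCORE are not shown on the slide.
            for (Map.Entry<String, String> targetUser : targetUserMap.entrySet()) {
                double score = scoreLookAlike(targetUser.getValue(), personalityValue);
                if (score >= FILTER_SCORE) {
                    String scoreText = Double.toString(score);
                    context.write(new TextPair(propertiesValue, scoreText),
                                  new TextPair(targetUser.getKey(), scoreText));
                }
            }
        }
    }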
  • 11. Reducer
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LookalikeReducer extends Reducer<TextPair, TextPair, Text, Text> {

        private static final int USER_COUNT = 100;

        @Override
        public void reduce(TextPair key, Iterable<TextPair> values, Context context)
                throws IOException, InterruptedException {
            int counter = 0;
            for (TextPair value : values) {
                if (counter >= USER_COUNT) {
                    break; // emit at most USER_COUNT lookalikes per key
                }
                // Output value is "<user>\t<score>".
                context.write(key.getFirst(),
                              new Text(value.getFirst() + "\t" + value.getSecond()));
                counter++;
            }
        }
    }

    // Sample output
    // TargetUser                            Lookalike User                        Score
    // 7f55fdd8-76dc-102e-b2e6-001ec9d506ae  de6fbeac-7205-ff9c-d74d-2ec57841fd0b  0.2602739
    // It is also possible to write this output to Cassandra (we don't do this currently).
    // It is quite straightforward. See the word_count example in the Cassandra contrib folder.
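The example job relies on secondary sort: records carry a composite key, are fully sorted on it, and are then grouped on the first part only, so each group arrives already ordered and the reducer can stop after the first N values. Below is a minimal, Hadoop-free sketch of that idea. The `SecondarySortSketch` and `Match` names are hypothetical, and the highest-score-first ordering is an assumption, since the slides do not state the sort direction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, Hadoop-free sketch of secondary sort followed by a per-group top-N cut. */
public class SecondarySortSketch {

    /** One mapper output record: target user, candidate lookalike, similarity score. */
    public static final class Match {
        final String target;
        final String candidate;
        final double score;

        public Match(String target, String candidate, double score) {
            this.target = target;
            this.candidate = candidate;
            this.score = score;
        }
    }

    /** Full sort order: by target, then by score descending (assumed direction). */
    static final Comparator<Match> SORT_ORDER =
            Comparator.comparing((Match m) -> m.target)
                    .thenComparing(Comparator.comparingDouble((Match m) -> m.score).reversed());

    /** Groups on target only and keeps the first 'limit' candidates of each group. */
    public static Map<String, List<String>> topN(List<Match> matches, int limit) {
        List<Match> sorted = new ArrayList<>(matches);
        sorted.sort(SORT_ORDER); // plays the role of the shuffle's sort on the composite key

        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Match m : sorted) {
            List<String> group = result.computeIfAbsent(m.target, k -> new ArrayList<>());
            if (group.size() < limit) { // same early cut-off as the reducer's counter
                group.add(m.candidate);
            }
        }
        return result;
    }
}
```

Because ordering is done once by the sort, the per-group step never has to buffer or re-sort values, which is the reason to prefer secondary sort over sorting inside the reducer.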
  • 12. Some stats
    - Cassandra cluster of 16 nodes
    - Hadoop cluster of 5 nodes
    - Over 120 million rows
    - Over 600 GB of data
    - Over 20 trillion computations
    - Hadoop: just over 4 hours
    - Serial PHP script: crossed 48 hours and was still chugging along
  • 13. Links
    - Cassandra: The Definitive Guide
    - Hadoop MapReduce in Cassandra cluster (DataStax)
    - Cassandra and Hadoop MapReduce (DataStax)
    - Cassandra Wiki - Hadoop Support
    - Cassandra/Hadoop Integration (Jeremy Hanna)
    - Hadoop: The Definitive Guide
  • 14. Questions