MongoDB Hadoop Integration

MongoDB and Hadoop are usually seen as separate worlds. This presentation shows how to combine them and use MongoDB as a Hadoop-enabled filesystem.

  1. MongoDB – Hadoop Integration. Massimo Brignoli, Senior Solutions Architect, MongoDB Inc. #massimobrignoli
  2. We will Cover… • A quick briefing on what MongoDB and Hadoop are • The Mongo-Hadoop connector: – What it is – How it works – A tour of what it can do
  3. MongoDB and Hadoop Overview
  4. MongoDB • document-oriented database with dynamic schema
  5. MongoDB • document-oriented database with dynamic schema • stores data in JSON-like documents (a minimal Java sketch of working with such a document follows below): { _id : "mike", age : 21, location : { state : "NY", zip : "11222" }, favorite_colors : ["red", "green"] }
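For readers new to MongoDB, here is a minimal sketch of inserting and reading back a document shaped like the one above, using the 2.x-era MongoDB Java driver referenced later in the deck. The host, database, and collection names are illustrative assumptions, not part of the slides.

    import java.util.Arrays;
    import com.mongodb.BasicDBObject;
    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.DBObject;
    import com.mongodb.MongoClient;

    public class DocumentExample {
        public static void main(String[] args) throws Exception {
            MongoClient client = new MongoClient("localhost", 27017); // assumed host/port
            DB db = client.getDB("test");                             // illustrative database
            DBCollection people = db.getCollection("people");         // illustrative collection

            // Build a document shaped like the one on the slide; no schema declaration needed.
            DBObject doc = new BasicDBObject("_id", "mike")
                    .append("age", 21)
                    .append("location", new BasicDBObject("state", "NY").append("zip", "11222"))
                    .append("favorite_colors", Arrays.asList("red", "green"));
            people.insert(doc);

            // Read it back by _id.
            System.out.println(people.findOne(new BasicDBObject("_id", "mike")));
            client.close();
        }
    }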
  6. MongoDB • Scales horizontally
  7-9. MongoDB • Scales horizontally • With sharding to handle lots of data and load
  10. Hadoop • Java-based framework for Map/Reduce • Excels at batch processing on large data sets by taking advantage of parallelism
  11. Mongo-Hadoop Connector - Why • Lots of people using Hadoop and Mongo separately, but they need integration • Need to process data across multiple sources • Custom code or slow and hacky import/export scripts are often used to get data in and out • Scalability and flexibility with changes in Hadoop or MongoDB configurations
  12. Mongo-Hadoop Connector (diagram: input data flows from MongoDB or .BSON files into a Hadoop cluster, and results flow back out to MongoDB or .BSON) • Turn MongoDB into a Hadoop-enabled filesystem: use it as the input or output for Hadoop • New Feature: As of v1.1, also works with MongoDB backup files (.bson)
  13. Benefits and Features • Takes advantage of full multi-core parallelism to process data in Mongo • Full integration with Hadoop and JVM ecosystems • Can be used with Amazon Elastic MapReduce • Can read and write backup files from local filesystem, HDFS, or S3
  14. Benefits and Features • Vanilla Java MapReduce • Or, if you don't want to use Java, support for Hadoop Streaming lets you write MapReduce code in other languages
  15. Benefits and Features • Support for Pig – high-level scripting language for data analysis and building map/reduce workflows • Support for Hive – SQL-like language for ad-hoc queries + analysis of data sets on Hadoop-compatible file systems
  16. How It Works • Adapter examines the MongoDB input collection and calculates a set of splits from the data • Each split gets assigned to a node in the Hadoop cluster • In parallel, Hadoop nodes pull data for splits from MongoDB (or BSON) and process them locally • Hadoop merges results and streams output back to MongoDB or BSON
  17. Tour of Mongo-Hadoop, by Example
  18. Tour of Mongo-Hadoop, by Example • Using Java MapReduce with Mongo-Hadoop • Using Hadoop Streaming • Pig and Hive with Mongo-Hadoop • Elastic MapReduce + BSON
  19-21. Input Data: Enron e-mail corpus (501k records, 1.75 GB). The build slides highlight the sender (headers.From) and recipients (headers.To) fields.
     { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
       "body" : "Here is our forecast\n\n",
       "filename" : "1.",
       "headers" : {
         "From" : "phillip.allen@enron.com",
         "Subject" : "Forecast Info",
         "X-bcc" : "",
         "To" : "tim.belden@enron.com",
         "X-Origin" : "Allen-P",
         "X-From" : "Phillip K Allen",
         "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
         "X-To" : "Tim Belden ",
         "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
         "Content-Type" : "text/plain; charset=us-ascii",
         "Mime-Version" : "1.0" } }
  22. The Problem: Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair
  23. The Output Required (diagram: a small graph linking alice, bob, charlie, and eve, with edge weights matching the counts below)
     {"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
     {"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
     {"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
     {"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
     {"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
  24. Example 1 - Java MapReduce
  25-26. Map Phase - each input doc gets passed through a Mapper function; the MongoDB document is passed into Hadoop MapReduce as a BSONObject.
     @Override
     public void map(NullWritable key, BSONObject val, final Context context)
             throws IOException, InterruptedException {
         BSONObject headers = (BSONObject) val.get("headers");
         if (headers.containsKey("From") && headers.containsKey("To")) {
             String from = (String) headers.get("From");
             String to = (String) headers.get("To");
             String[] recips = to.split(",");
             for (int i = 0; i < recips.length; i++) {
                 String recip = recips[i].trim();
                 context.write(new MailPair(from, recip), new IntWritable(1));
             }
         }
     }
  27-30. Reduce Phase - outputs of Map are grouped together by key and passed to the Reducer. pKey is the {to, from} key, pValues is the list of all the values collected under that key, and the output is written back to MongoDB. (A sketch of the MailPair key class used here follows below.)
     public void reduce(final MailPair pKey,
                        final Iterable<IntWritable> pValues,
                        final Context pContext)
             throws IOException, InterruptedException {
         int sum = 0;
         for (final IntWritable value : pValues) {
             sum += value.get();
         }
         BSONObject outDoc = new BasicDBObjectBuilder().start()
                 .add("f", pKey.from)
                 .add("t", pKey.to)
                 .get();
         BSONWritable pkeyOut = new BSONWritable(outDoc);
         pContext.write(pkeyOut, new IntWritable(sum));
     }
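The Mapper and Reducer above use a custom MailPair key class that the slides never show. The following is only a sketch of what such a composite WritableComparable might look like, inferred from the pKey.from and pKey.to fields used above; the actual class in the mongo-hadoop examples may differ.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Composite {from, to} key so Hadoop can group message counts per sender/recipient pair.
    public class MailPair implements WritableComparable<MailPair> {
        String from;
        String to;

        public MailPair() { }                       // no-arg constructor required by Hadoop

        public MailPair(String from, String to) {
            this.from = from;
            this.to = to;
        }

        public void readFields(DataInput in) throws IOException {
            from = in.readUTF();
            to = in.readUTF();
        }

        public void write(DataOutput out) throws IOException {
            out.writeUTF(from);
            out.writeUTF(to);
        }

        public int compareTo(MailPair o) {          // ordering drives the shuffle/grouping
            int cmp = from.compareTo(o.from);
            return cmp != 0 ? cmp : to.compareTo(o.to);
        }

        @Override
        public int hashCode() {                     // used by the default HashPartitioner
            return from.hashCode() * 31 + to.hashCode();
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof MailPair)) return false;
            MailPair other = (MailPair) obj;
            return from.equals(other.from) && to.equals(other.to);
        }
    }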
  31. Read from MongoDB
     mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
     mongo.input.uri=mongodb://my-db:27017/enron.messages
  32. Read from BSON
     mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
     mapred.input.dir=file:///tmp/messages.bson (or hdfs:///tmp/messages.bson, or s3:///tmp/messages.bson)
  33. Write Output to MongoDB
     mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
     mongo.output.uri=mongodb://my-db:27017/enron.results_out
  34. Write Output to BSON (a sketch of a job driver wiring these properties together follows below)
     mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
     mapred.output.dir=file:///tmp/results.bson (or hdfs:///tmp/results.bson, or s3:///tmp/results.bson)
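As a bridge between the code on slides 25-30 and the properties on slides 31-34, here is a hedged sketch of a job driver that wires them together in Java. The class names EnronJobDriver, EnronMapper, and EnronReducer are placeholders, the Hadoop 2.x-style Job.getInstance call and the com.mongodb.hadoop.io package for BSONWritable are assumptions, and the real example in the mongo-hadoop repository may be structured differently.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.io.BSONWritable;

    public class EnronJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Property names and URIs from slides 31 and 33.
            conf.set("mongo.input.uri",  "mongodb://my-db:27017/enron.messages");
            conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

            Job job = Job.getInstance(conf, "enron-sender-recipient-counts");
            job.setJarByClass(EnronJobDriver.class);

            job.setMapperClass(EnronMapper.class);    // placeholder: the Mapper from slides 25-26
            job.setReducerClass(EnronReducer.class);  // placeholder: the Reducer from slides 27-30
            job.setMapOutputKeyClass(MailPair.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(BSONWritable.class);
            job.setOutputValueClass(IntWritable.class);

            // Read from and write back to MongoDB; swap in the BSON file formats from
            // slides 32 and 34 to work with .bson dumps instead.
            job.setInputFormatClass(MongoInputFormat.class);
            job.setOutputFormatClass(MongoOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }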
  35. Query the Results
     mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
     { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
     { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
     { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
     { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
     { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
     { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
     { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
     ... has more
  36. Example 2 - Hadoop Streaming
  37. Example 2 - Hadoop Streaming • Let's do the same Enron Map/Reduce job with Python instead of Java: $ pip install pymongo_hadoop
  38. Example 2 - Hadoop Streaming • Hadoop passes data to an external process via STDIN/STDOUT (diagram: the Hadoop JVM streams documents over STDIN to a Python / Ruby / JS interpreter running the mapper, which writes its results back over STDOUT)
  39. Mapper in Python
     import sys
     from pymongo_hadoop import BSONMapper
     def mapper(documents):
         i = 0
         for doc in documents:
             i = i + 1
             from_field = doc['headers']['From']
             to_field = doc['headers']['To']
             recips = [x.strip() for x in to_field.split(',')]
             for r in recips:
                 yield {'_id': {'f': from_field, 't': r}, 'count': 1}
     BSONMapper(mapper)
     print >> sys.stderr, "Done Mapping."
  40. Reducer in Python
     import sys
     from pymongo_hadoop import BSONReducer
     def reducer(key, values):
         print >> sys.stderr, "Processing from/to %s" % str(key)
         _count = 0
         for v in values:
             _count += v['count']
         return {'_id': key, 'count': _count}
     BSONReducer(reducer)
  41. Example 3 - Mongo-Hadoop with Pig and Hive
  42. Mongo-Hadoop and Pig • Let's do the same thing yet again, but this time using Pig • Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts • Can perform JOIN, GROUP, and execute user-defined functions (UDFs)
  43. Loading/Writing Data
     Pig directives for loading data: BSONLoader and MongoLoader
     data = LOAD 'mongodb://localhost:27017/db.collection'
         using com.mongodb.hadoop.pig.MongoLoader;
     Writing data out: BSONStorage and MongoInsertStorage
     STORE records INTO 'file:///output.bson'
         using com.mongodb.hadoop.pig.BSONStorage;
  44. Datatype Conversion • Pig has its own special datatypes: – Bags – Maps – Tuples • Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes
  45. Code in Pig
     raw = LOAD 'hdfs:///messages.bson'
         using com.mongodb.hadoop.pig.BSONLoader('', 'headers:[]');
     send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;
     send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
     send_recip_split = FOREACH send_recip_filtered GENERATE from as from,
         TRIM(FLATTEN(TOKENIZE(to))) as to;
     send_recip_grouped = GROUP send_recip_split BY (from, to);
     send_recip_counted = FOREACH send_recip_grouped GENERATE group, COUNT($1) as count;
     STORE send_recip_counted INTO 'file:///enron_results.bson'
         using com.mongodb.hadoop.pig.BSONStorage;
  46. Mongo-Hadoop and Hive • Similar idea to Pig - process your data without needing to write Map/Reduce code from scratch • ...but with SQL as the language of choice
  47. Sample Data: db.users
     db.users.find()
     { "_id": 1, "name": "Tom", "age": 28 }
     { "_id": 2, "name": "Alice", "age": 18 }
     { "_id": 3, "name": "Bob", "age": 29 }
     { "_id": 101, "name": "Scott", "age": 10 }
     { "_id": 104, "name": "Jesse", "age": 52 }
     { "_id": 110, "name": "Mike", "age": 32 }
  48. Declare Collection in Hive
     CREATE TABLE mongo_users (id int, name string, age int)
     STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
     WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
     TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
  49. Run SQL on it
     SELECT name, age FROM mongo_users WHERE id > 100;
     SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;
     SELECT * FROM mongo_users T1 JOIN user_emails T2 ON (T1.id = T2.id);
  50. Write The Results…
     DROP TABLE old_users;
     INSERT OVERWRITE TABLE old_users SELECT id, name, age FROM mongo_users WHERE age > 100;
  51. Example 4: Amazon Elastic MapReduce
  52. Example 4 • Usage with Amazon Elastic MapReduce • Run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster.
  53. Bootstrap • First, make a "bootstrap" script that fetches dependencies (mongo-hadoop jar and Java drivers)
     #!/bin/sh
     wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
     wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoop-code/mongo-hadoop-core_1.1.2-1.1.0.jar
  54. Bootstrap • Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it.
     s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
     s3mod s3://$S3_BUCKET/bootstrap.sh public-read
     s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
     s3mod s3://$S3_BUCKET/enron-example.jar public-read
  55. Launch the Job! • ...then launch the job from the command line, pointing to your S3 locations
     $ elastic-mapreduce --create --jobflow ENRON000 \
         --instance-type m1.xlarge \
         --num-instances 5 \
         --bootstrap-action s3://$S3_BUCKET/bootstrap.sh \
         --log-uri s3://$S3_BUCKET/enron_logs \
         --jar s3://$S3_BUCKET/enron-example.jar \
         --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat \
         --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson \
         --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT \
         --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
     # (any additional parameters here)
  56. So why Amazon? • Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster • Turn up the "num-instances" to make jobs complete faster • Logs get captured into S3 files • (Pig, Hive, and streaming work on EMR, too!)
  57. Example 5: MongoUpdateWritable
  58. Example 5: New Feature • In previous examples, we wrote job output data by inserting into a new collection • ...but we can also modify an existing output collection • Works by applying MongoDB update modifiers: $push, $pull, $addToSet, $inc, $set, etc. • Can be used to do incremental Map/Reduce or to "join" two collections
  59. Sample of Data • Let's say we have two collections: sensors and log events.
     // sensors
     { "_id": ObjectId("51b792d381c3e67b0a18d0ed"), "name": "730LsRkX", "type": "pressure", "owner": "steve" }
     // log events
     { "_id": ObjectId("51b792d381c3e67b0a18d678"), "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"), "value": 3328.5895416489802, "timestamp": ISODate("2013-05-18T13:11:38.709-0400"), "loc": [-175.13, 51.658] }
  60. Sample of Data • For each owner, we want to calculate how many events were recorded for each type of sensor that logged it. Bob's sensors for temperature have stored 1300 readings. Bob's sensors for pressure have stored 400 readings. Alice's sensors for humidity have stored 600 readings. Alice's sensors for temperature have stored 700 readings. etc...
  61. Stage 1 - Map/Reduce on the sensors collection (diagram: sensors collection → map/reduce → results collection). For each sensor, map() emits {key: owner+type, value: _id}; reduce() groups the data from map() under each key and outputs {key: owner+type, val: [list of _ids]} to a results collection.
  62-63. Stage 1 - Results • After stage one, the output docs look like:
     { "_id": "alice pressure", "sensors": [ ObjectId("51b792d381c3e67b0a18d475"), ObjectId("51b792d381c3e67b0a18d16d"), ObjectId("51b792d381c3e67b0a18d2bf"), … ] }
     Here "_id" is the sensor's owner and type, and "sensors" is the list of IDs of sensors with this owner and type. Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
  64. Stage 2 - Map/Reduce on the log events collection (diagram: log events collection → map/reduce → update() of existing records in the MongoDB results collection). For each log event, map() emits {key: sensor_id, value: 1}; reduce() groups the data from map() under each key and, for each value under that key, applies update({sensors: key}, {$inc: {logs_count: 1}}). (A fuller sketch of such a Reducer follows below.)
     context.write(null, new MongoUpdateWritable(
         query,    // which documents to modify
         update,   // how to modify ($inc)
         true,     // upsert
         false));  // multi
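The query and update documents passed to MongoUpdateWritable above are not shown being built. Below is only a sketch of how a stage-2 Reducer might construct them, mirroring the update({sensors: key}, {$inc: {logs_count: 1}}) pseudo-call on this slide. The class name LogCountReducer, the key type (sensor _id serialized as hex Text), and the com.mongodb.hadoop.io import path are assumptions; the per-key counts are summed here before a single $inc rather than incrementing once per value.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BasicBSONObject;
    import org.bson.types.ObjectId;
    import com.mongodb.hadoop.io.MongoUpdateWritable;

    // Assumed stage-2 Reducer: the Mapper emits (sensor _id as hex Text, 1) per log event.
    public class LogCountReducer
            extends Reducer<Text, IntWritable, NullWritable, MongoUpdateWritable> {

        @Override
        public void reduce(Text sensorId, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) {
                count += v.get();        // total log events seen for this sensor
            }

            // {sensors: ObjectId(...)} -- match every stage-1 result doc listing this sensor
            BasicBSONObject query =
                    new BasicBSONObject("sensors", new ObjectId(sensorId.toString()));
            // {$inc: {logs_count: <count>}} -- bump the counter on each matching doc
            BasicBSONObject update =
                    new BasicBSONObject("$inc", new BasicBSONObject("logs_count", count));

            context.write(null, new MongoUpdateWritable(
                    query,   // which documents to modify
                    update,  // how to modify them ($inc)
                    true,    // upsert
                    false)); // multi
        }
    }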
  65. Stage 2 - Results • After stage 2, the result doc is now populated with the correct count:
     { "_id": "1UoTcvnCTz temp", "sensors": [ ObjectId("51b792d381c3e67b0a18d475"), ObjectId("51b792d381c3e67b0a18d16d"), ObjectId("51b792d381c3e67b0a18d2bf"), ... ], "logs_count": 1050616 }
  66. Conclusions
  67. Recap • Mongo-Hadoop - use Hadoop to do massive computations on big data sets stored in Mongo/BSON • MongoDB becomes a Hadoop-enabled filesystem • Tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc.
  68. Examples • Examples can be found on GitHub: https://github.com/mongodb/mongo-hadoop/tree/master/examples
