Mongo-Hadoop Integration
Mike O'Brien, Software Engineer @ 10gen
We will cover:
A quick briefing on what Mongo and Hadoop are all about
The Mongo-Hadoop connector:
• what it is
• how it works
• a tour of what it can do
(Q+A at the end)
Choosing the Right Tool for the Task
Upcoming webinar: MongoDB and Hadoop - Essential Tools for Your Big Data Playbook
August 21st, 2013 - 10am PDT, 1pm EDT, 6pm BST
Register at 10gen.com/events/biz-hadoop
MongoDB: a document-oriented database with dynamic schema.
It stores data in JSON-like documents:

{
  _id : "mike",
  age : 21,
  location : {
    state : "NY",
    zip : "11222"
  },
  favorite_colors : ["red", "green"]
}
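As a quick illustration (not from the deck), inserting a document with that shape using the 2.x-era MongoDB Java driver needs no schema declared up front; the database and collection names below are made up:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import java.util.Arrays;

public class InsertExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("test");                      // hypothetical database name
        DBCollection people = db.getCollection("people");  // hypothetical collection name

        BasicDBObject doc = new BasicDBObject("_id", "mike")
                .append("age", 21)
                .append("location", new BasicDBObject("state", "NY").append("zip", "11222"))
                .append("favorite_colors", Arrays.asList("red", "green"));

        people.insert(doc);   // no schema to declare -- just insert the document
        client.close();
    }
}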
MongoDB scales horizontally with sharding to handle lots of data and load.
Hadoop: a Java-based framework for Map/Reduce.
It excels at batch processing on large data sets by taking advantage of parallelism.
Mongo-Hadoop Connector - Why
Lots of people are using Hadoop and Mongo separately, but need integration
Need to process data across multiple sources
Custom code or slow, hacky import/export scripts are often used to get data in and out
Scalability and flexibility with changes in Hadoop or MongoDB configurations
Mongo-Hadoop Connector
Turn MongoDB into a Hadoop-enabled filesystem: use it as the input or output for Hadoop
New feature: as of v1.1, also works with MongoDB backup files (.bson)
[Diagram: input data flows from MongoDB or .BSON files into the Hadoop cluster, and output results flow back out to MongoDB or .BSON files]
Mongo-Hadoop Connector: Benefits + Features
Takes advantage of full multi-core parallelism to process data in Mongo
Full integration with Hadoop and JVM ecosystems
Can be used with Amazon Elastic MapReduce
Can read and write backup files from local filesystem, HDFS, or S3
Mongo-Hadoop Connector: Benefits + Features
Vanilla Java MapReduce
...or, if you don't want to use Java, support for Hadoop Streaming:
write MapReduce code in Ruby or Python
Mongo-Hadoop Connector: Benefits + Features
Support for Pig: high-level scripting language for data analysis and building map/reduce workflows
Support for Hive: SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems
Mongo-Hadoop Connector: How it works
The adapter examines the MongoDB input collection and calculates a set of splits from the data
Each split gets assigned to a node in the Hadoop cluster
In parallel, Hadoop nodes pull data for their splits from MongoDB (or BSON) and process it locally
Hadoop merges the results and streams the output back to MongoDB or BSON
Tour of Mongo-Hadoop, by Example
- Using Java MapReduce with Mongo-Hadoop
- Using Hadoop Streaming
- Pig and Hive with Mongo-Hadoop
- Elastic MapReduce + BSON
Input Data: Enron e-mail corpus (501k records, 1.75 GB)
Each document is one email:

{
  "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
  "body" : "Here is our forecast\n\n ",
  "filename" : "1.",
  "headers" : {
    "From" : "phillip.allen@enron.com",        <- sender
    "Subject" : "Forecast Info",
    "X-bcc" : "",
    "To" : "tim.belden@enron.com",             <- recipients
    "X-Origin" : "Allen-P",
    "X-From" : "Phillip K Allen",
    "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "X-To" : "Tim Belden ",
    "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Mime-Version" : "1.0"
  }
}
Let's use Hadoop to build a graph of (senders → recipients) and the count of messages exchanged between each pair.

[Diagram: alice, bob, charlie, and eve as nodes, with edge weights 14, 9, 99, 48, and 20]

{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
Example 1 - Java MapReduce
Map phase - each input MongoDB document gets passed through a Mapper function:

@Override
public void map(NullWritable key, BSONObject val, final Context context)
        throws IOException, InterruptedException {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
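The MailPair key type used by this Mapper isn't shown in the deck. Here is a minimal sketch of what such a composite key might look like, assuming it only needs to carry the from/to strings as a Hadoop WritableComparable:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class MailPair implements WritableComparable<MailPair> {
    String from;
    String to;

    public MailPair() {}                          // no-arg constructor required by Hadoop
    public MailPair(String from, String to) { this.from = from; this.to = to; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }
    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }
    public int compareTo(MailPair o) {            // sort/group by (from, to)
        int c = from.compareTo(o.from);
        return c != 0 ? c : to.compareTo(o.to);
    }
    @Override public boolean equals(Object o) {
        return o instanceof MailPair
            && from.equals(((MailPair) o).from)
            && to.equals(((MailPair) o).to);
    }
    @Override public int hashCode() { return from.hashCode() * 31 + to.hashCode(); }
}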
Example 1 - Java MapReduce (cont)
Reduce phase - the outputs of Map are grouped together by the {to, from} key and passed to the Reducer, along with the list of all values collected under that key. The summed output is written back to MongoDB:

public void reduce(final MailPair pKey,
                   final Iterable<IntWritable> pValues,
                   final Context pContext)
        throws IOException, InterruptedException {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.from)
            .add("t", pKey.to)
            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}
Example 1 - Java MapReduce (cont)
Read from MongoDB:

mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

Read from BSON:

mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir=file:///tmp/messages.bson
  (or hdfs:///tmp/messages.bson, or s3:///tmp/messages.bson)
Example 1 - Java MapReduce (cont)
Write output to MongoDB:

mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

Write output to BSON:

mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir=file:///tmp/results.bson
  (or hdfs:///tmp/results.bson, or s3:///tmp/results.bson)
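For context, a job driver could wire these same settings up programmatically. This is a rough sketch, not the deck's code: the EnronMailJob/EnronMailMapper/EnronMailReducer class names (wrapping the map() and reduce() shown earlier) and the exact job wiring are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class EnronMailJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // same settings as the properties above, set in code instead
        conf.set("mongo.input.uri", "mongodb://my-db:27017/enron.messages");
        conf.set("mongo.output.uri", "mongodb://my-db:27017/enron.results_out");

        Job job = new Job(conf, "enron-mail-graph");
        job.setJarByClass(EnronMailJob.class);
        job.setMapperClass(EnronMailMapper.class);      // class containing the map() shown earlier
        job.setReducerClass(EnronMailReducer.class);    // class containing the reduce() shown earlier
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        job.setMapOutputKeyClass(MailPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(BSONWritable.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}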
Results: Output Data

mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
... has more
Example 2 - Hadoop Streaming
Let's do the same Enron Map/Reduce job with Python instead of Java:

$ pip install pymongo_hadoop
Example 2 - Hadoop Streaming (cont)
Hadoop passes data to an external process via STDOUT/STDIN:

[Diagram: the Hadoop JVM exchanges key/value data over STDIN/STDOUT with an external Python / Ruby / JS interpreter process, which implements the map(k, v) logic in a script, e.g. def mapper(documents): ...]
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONMapper

def mapper(documents):
    i = 0
    for doc in documents:
        i = i + 1
        from_field = doc['headers']['From']
        to_field = doc['headers']['To']
        recips = [x.strip() for x in to_field.split(',')]
        for r in recips:
            yield {'_id': {'f': from_field, 't': r}, 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Example 2 - Hadoop Streaming (cont)

import sys
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Processing from/to %s" % str(key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
Surviving Hadoop: making MapReduce easier with Pig + Hive
Example 3 - Mongo-Hadoop and Pig
Let's do the same thing yet again, but this time using Pig.
Pig is a powerful language that can generate sophisticated map/reduce workflows from simple scripts.
It can perform JOIN and GROUP operations, and can execute user-defined functions (UDFs).
Example 3 - Mongo-Hadoop and Pig (cont)
Pig directives for loading data: BSONLoader and MongoLoader

data = LOAD 'mongodb://localhost:27017/db.collection'
       using com.mongodb.hadoop.pig.MongoLoader;

Writing data out: BSONStorage and MongoInsertStorage

STORE records INTO 'file:///output.bson'
      using com.mongodb.hadoop.pig.BSONStorage;
Example 3 - Mongo-Hadoop and Pig (cont)
Pig has its own special datatypes: Bags, Maps, and Tuples
The Mongo-Hadoop Connector intelligently converts between Pig datatypes and MongoDB datatypes
Example 3 - Mongo-Hadoop and Pig (cont)

raw = LOAD 'hdfs:///messages.bson'
    using com.mongodb.hadoop.pig.BSONLoader('','headers:[]');

send_recip = FOREACH raw GENERATE $0#'From' as from, $0#'To' as to;

send_recip_filtered = FILTER send_recip BY to IS NOT NULL;
send_recip_split = FOREACH send_recip_filtered GENERATE
    from as from, TRIM(FLATTEN(TOKENIZE(to))) as to;

send_recip_grouped = GROUP send_recip_split BY (from, to);
send_recip_counted = FOREACH send_recip_grouped GENERATE
    group, COUNT($1) as count;

STORE send_recip_counted INTO 'file:///enron_results.bson'
    using com.mongodb.hadoop.pig.BSONStorage;
Hive with Mongo-Hadoop
Similar idea to Pig - process your data without needing to write Map/Reduce code from scratch
...but with SQL as the language of choice
Hive with Mongo-Hadoop
Sample data in db.users:

db.users.find()
{ "_id": 1, "name": "Tom", "age": 28 }
{ "_id": 2, "name": "Alice", "age": 18 }
{ "_id": 3, "name": "Bob", "age": 29 }
{ "_id": 101, "name": "Scott", "age": 10 }
{ "_id": 104, "name": "Jesse", "age": 52 }
{ "_id": 110, "name": "Mike", "age": 32 }
...

First, declare the collection to be accessible in Hive:

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES( "mongo.columns.mapping" = "_id,name,age" )
TBLPROPERTIES ( "mongo.uri" = "mongodb://localhost:27017/test.users");
Hive with Mongo-Hadoop
...then you can run SQL on it, like a table:

SELECT name, age FROM mongo_users WHERE id > 100;

You can use GROUP BY:

SELECT age, COUNT(*) FROM mongo_users WHERE id > 100 GROUP BY age;

Or JOIN multiple tables/collections together:

SELECT * FROM mongo_users T1 JOIN user_emails T2 ON (T1.id = T2.id);
Write the output of queries back into new tables:

INSERT OVERWRITE TABLE old_users SELECT id, name, age FROM mongo_users WHERE age > 100;

Drop a table in Hive to delete the underlying collection in MongoDB:

DROP TABLE mongo_users;
Usage with Amazon Elastic MapReduce
Run mongo-hadoop jobs without needing to set up or manage your own Hadoop cluster.
Usage with Amazon Elastic MapReduce
First, make a "bootstrap" script that fetches dependencies (the mongo-hadoop jar and Java drivers):

#!/bin/sh
wget -P /home/hadoop/lib http://central.maven.org/maven2/org/mongodb/mongo-java-driver/2.11.1/mongo-java-driver-2.11.1.jar
wget -P /home/hadoop/lib https://s3.amazonaws.com/mongo-hadoopcode/mongo-hadoop-core_1.1.2-1.1.0.jar

This will get executed on each node in the cluster that EMR builds for us.
Example 4 - Usage with Amazon Elastic MapReduce
Put the bootstrap script, and all your code, into an S3 bucket where Amazon can see it:

s3cp ./bootstrap.sh s3://$S3_BUCKET/bootstrap.sh
s3mod s3://$S3_BUCKET/bootstrap.sh public-read
s3cp $HERE/../enron/target/enron-example.jar s3://$S3_BUCKET/enron-example.jar
s3mod s3://$S3_BUCKET/enron-example.jar public-read
Example 4 - Usage with Amazon Elastic MapReduce
...then launch the job from the command line, pointing to your S3 locations.
Control the type and number of instances in the cluster:

$ elastic-mapreduce --create --jobflow ENRON000 \
    --instance-type m1.xlarge \
    --num-instances 5 \
    --bootstrap-action s3://$S3_BUCKET/bootstrap.sh \
    --log-uri s3://$S3_BUCKET/enron_logs \
    --jar s3://$S3_BUCKET/enron-example.jar \
    --arg -D --arg mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat \
    --arg -D --arg mapred.input.dir=s3n://mongo-test-data/messages.bson \
    --arg -D --arg mapred.output.dir=s3n://$S3_BUCKET/BSON_OUT \
    --arg -D --arg mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
    # (any additional parameters here)
Example 4 - Usage with Amazon Elastic MapReduce
Easy to kick off a Hadoop job, without needing to manage a Hadoop cluster
Turn up the "num-instances" knob to make jobs complete faster
Logs get captured into S3 files
(Pig, Hive, and streaming work on EMR, too!)
Example 5 - New feature: MongoUpdateWritable
In previous examples, we wrote job output data by inserting into a new collection
...but we can also modify an existing output collection
Works by applying MongoDB update modifiers: $push, $pull, $addToSet, $inc, $set, etc.
Can be used to do incremental Map/Reduce or "join" two collections
Example 5 - MongoUpdateWritable
Let's say we have two collections:

sensors:
{
    "_id": ObjectId("51b792d381c3e67b0a18d0ed"),
    "name": "730LsRkX",
    "type": "pressure",
    "owner": "steve"
}

log events:
{
    "_id": ObjectId("51b792d381c3e67b0a18d678"),
    "sensor_id": ObjectId("51b792d381c3e67b0a18d4a1"),   <- refers to which sensor logged the event
    "value": 3328.5895416489802,
    "timestamp": ISODate("2013-05-18T13:11:38.709-0400"),
    "loc": [-175.13, 51.658]
}

For each owner, we want to calculate how many events were recorded for each type of sensor that logged it.
In plain English:
Bob's sensors for temperature have stored 1300 readings
Bob's sensors for pressure have stored 400 readings
Alice's sensors for humidity have stored 600 readings
Alice's sensors for temperature have stored 700 readings
etc...
Stage 1 - Map/Reduce on the sensors collection

Read the sensors collection from MongoDB, then map/reduce:
  for each sensor, emit: {key: owner+type, value: _id}
  group data from map() under each key, output: {key: owner+type, val: [list of _ids]}
insert() the new records into a Results collection in MongoDB.
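A hypothetical sketch of what the stage-1 map half might look like in Java; the class name, input key type, and Text-based output are assumptions, not the deck's code:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class SensorMapper extends Mapper<Object, BSONObject, Text, Text> {
    @Override
    protected void map(Object key, BSONObject sensor, Context context)
            throws IOException, InterruptedException {
        // emit key = "owner type", value = the sensor's _id as a string;
        // the input key type depends on the configured input format
        String owner = (String) sensor.get("owner");
        String type  = (String) sensor.get("type");
        context.write(new Text(owner + " " + type),
                      new Text(sensor.get("_id").toString()));
    }
}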
After stage one, the output docs look like:

{
    "_id": "alice pressure",                     <- the sensor's owner and type
    "sensors": [                                 <- list of IDs of sensors with this owner and type
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        ...
    ]
}

Now we just need to count the total # of log events recorded for any sensors that appear in the list for each owner/type group.
Stage 2 - Map/Reduce on the log events collection

Read the log events collection from MongoDB, then map/reduce:
  for each log event, emit: {key: sensor_id, value: 1}
  group data from map() under each key; for each value in that key:
    update({sensors: key}, {$inc : {logs_count: 1}})
The update() modifies the existing Results records in MongoDB instead of inserting new ones:

context.write(null,
    new MongoUpdateWritable(
        query,   // which documents to modify
        update,  // how to modify ($inc)
        true,    // upsert
        false)); // multi
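Filling in around that snippet, a stage-2 Reducer might look like the following sketch; the class name, key/value types, the BasicBSONObject-based query and update, and the idea of summing the 1s before applying a single $inc are assumptions layered on the deck's fragment:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BasicBSONObject;
import org.bson.types.ObjectId;
import com.mongodb.hadoop.io.MongoUpdateWritable;

public class LogEventReducer
        extends Reducer<Text, IntWritable, NullWritable, MongoUpdateWritable> {
    @Override
    protected void reduce(Text sensorId, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();   // total log events seen for this sensor
        }
        // match any stage-1 result doc whose "sensors" array contains this sensor's _id
        BasicBSONObject query = new BasicBSONObject(
                "sensors", new ObjectId(sensorId.toString()));
        // one $inc by the sum is equivalent to applying $inc:1 once per value
        BasicBSONObject update = new BasicBSONObject(
                "$inc", new BasicBSONObject("logs_count", count));

        context.write(null, new MongoUpdateWritable(
                query,   // which documents to modify
                update,  // how to modify ($inc)
                true,    // upsert
                false)); // multi
    }
}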
Example 5 - MongoUpdateWritable
Result after stage 2:

{
    "_id": "1UoTcvnCTz temp",
    "sensors": [
        ObjectId("51b792d381c3e67b0a18d475"),
        ObjectId("51b792d381c3e67b0a18d16d"),
        ObjectId("51b792d381c3e67b0a18d2bf"),
        ...
    ],
    "logs_count": 1050616                        <- now populated with the correct count
}
Upcoming Features (v1.2 and beyond)
Performance improvements - lazy BSON
Full-featured Hive support
Support for multi-collection input sources
API for adding custom splitter implementations
...and more
Recap
Mongo-Hadoop: use Hadoop to do massive computations on big data sets stored in Mongo/BSON
MongoDB becomes a Hadoop-enabled filesystem
Tools and APIs make it easier: Streaming, Pig, Hive, EMR, etc.
Questions?
Examples can be found on GitHub:
https://github.com/mongodb/mongo-hadoop/tree/master/examples
