@h_ingo

Analytics with MongoDB
alone and with Hadoop Connector
Henrik Ingo
Solution Architect, MongoDB
The Science in Data Science
• Collect data
• Explore the data, use visualization
• Use math
• Make predictions
• Test predictions
– Collect even more data
• Repeat...
Why MongoDB?
When MongoDB?
5 NoSQL categories
• Key Value: Redis
• Wide Column: Cassandra
• Document
• Graph: Neo4j
• Map Reduce: Hadoop
MongoDB and Enterprise IT Stack
• Applications: CRM, ERP, Collaboration, Mobile, BI
• Data Management
– Online Data: RDBMS
– Offline Data: Hadoop, EDW
• Infrastructure: OS & Virtualization, Compute, Storage, Network
• Security & Auditing
• Management & Monitoring
How do we do it with MongoDB?
Collect data
Exponential Data Growth

http://www.worldwidewebsize.com/
Volume Velocity Variety
Volume Velocity Variety

• Upserts avoid unnecessary reads
• Asynchronous writes
• Spread writes over multiple shards
• Writes buffered in RAM and flushed to disk in bulk

[Diagram: several data sources writing to MongoDB]
Volume Velocity Variety
MongoDB

RDBMS

{
  _id : ObjectId("4c4ba5e5e8aabf3"),
  employee_name: "Dunham, Justin",
  department : "Marketing",
  title : "Product Manager, Web",
  report_up: "Neray, Graham",
  pay_band: "C",
  benefits : [
    { type : "Health", plan : "PPO Plus" },
    { type : "Dental", plan : "Standard" }
  ]
}
Visualization
Visualization

d3js.org, …
Use math
Data Processing in MongoDB
• Pre-aggregated documents

• Aggregation Framework
• Map/Reduce
• Hadoop Connector
Pre-aggregated documents
Design Pattern
Pre-Aggregation: data for URL / date
{
_id: "20101010/site-1/apache_pb.gif",
metadata: {
date: ISODate("2000-10-10T00:00:00Z"),
site: "site-1",
page: "/apache_pb.gif" },
daily: 5468426,
hourly: {
"0": 227850,
"1": 210231,
...
"23": 20457 },
minute: {
"0": 3612,
"1": 3241,
...
"1439": 2819 }
}
Pre-Aggregation: update for URL / date

query = { '_id': "20101010/site-1/apache_pb.gif" }
update = { '$inc': {
    'hourly.12' : 1,
    'minute.739': 1 } }
# PyMongo 4 removed Collection.update(); update_one() takes the same arguments here
db.stats.daily.update_one(query, update, upsert=True)
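The update above hard-codes hour 12 and minute 739 (12 × 60 + 19). A minimal sketch of how those keys could be derived from a request timestamp follows. The helper name `build_hit_update` is hypothetical, the `_id` layout is inferred from the example document, and incrementing `daily` as well is an assumption (the slide's update only touches `hourly` and `minute`).

```python
from datetime import datetime, timezone


def build_hit_update(site, page, ts):
    """Return (query, update) dicts recording one page hit at time ts."""
    # _id layout inferred from the slide: "<YYYYMMDD>/<site><page>"
    day_id = "{}/{}{}".format(ts.strftime("%Y%m%d"), site, page)
    minute_of_day = ts.hour * 60 + ts.minute  # 0..1439, as in the example document
    query = {"_id": day_id}
    update = {
        "$inc": {
            "daily": 1,  # assumption: the slide's update does not increment this
            "hourly.{}".format(ts.hour): 1,
            "minute.{}".format(minute_of_day): 1,
        }
    }
    return query, update


query, update = build_hit_update(
    "site-1", "/apache_pb.gif",
    datetime(2010, 10, 10, 12, 19, tzinfo=timezone.utc),
)
print(query)
print(update)
```

The resulting pair can be passed to an upsert as on the slide, so the first hit of the day creates the document and later hits only increment counters.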
Aggregation framework
Dynamic Queries
Find all logs for a URL:

db.logs.find( { 'path' : '/index.html' } )

Find all logs for a time range:

db.logs.find( {
  'time' : {
    '$gte': new Date(2013, 0),
    '$lt': new Date(2013, 1) }
} )

Find all logs for a host over a range of dates:

db.logs.find( {
  'host' : '127.0.0.1',
  'time' : {
    '$gte': new Date(2013, 0),
    '$lt': new Date(2013, 1) }
} )
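JavaScript's `new Date(2013, 0)` uses a zero-based month, so the range in these queries is January 2013. The sketch below builds the same kind of filter in Python, where months are 1-based, and checks it against in-memory documents. Both `month_filter` and `matches` are hypothetical helpers for illustration; `matches` only understands equality, `$gte` and `$lt`, and is not how the server evaluates queries.

```python
from datetime import datetime


def month_filter(year, month, host=None):
    """Build a find() filter covering one calendar month (month is 1-12)."""
    start = datetime(year, month, 1)
    # First day of the following month, rolling over the year in December.
    end = datetime(year + 1, 1, 1) if month == 12 else datetime(year, month + 1, 1)
    flt = {"time": {"$gte": start, "$lt": end}}
    if host is not None:
        flt["host"] = host
    return flt


def matches(doc, flt):
    """Tiny in-memory stand-in for query matching (equality, $gte, $lt only)."""
    for field, cond in flt.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            if value is None:
                return False
            if "$gte" in cond and value < cond["$gte"]:
                return False
            if "$lt" in cond and value >= cond["$lt"]:
                return False
        elif value != cond:
            return False
    return True


logs = [
    {"host": "127.0.0.1", "path": "/index.html", "time": datetime(2013, 1, 15)},
    {"host": "10.0.0.1", "path": "/index.html", "time": datetime(2013, 1, 20)},
    {"host": "127.0.0.1", "path": "/about.html", "time": datetime(2013, 2, 1)},
]
january_local = [d for d in logs if matches(d, month_filter(2013, 1, host="127.0.0.1"))]
print(january_local)
```

Using a half-open range (`$gte` start, `$lt` next month) avoids having to know the last day of the month.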
Aggregation Framework
Requests per day by URL

db.logs.aggregate( [
  { '$match': {
      'time': {
        '$gte': new Date(2013, 0),
        '$lt': new Date(2013, 1) } } },
  { '$project': {
      'path': 1,
      'date': {
        'y': { '$year': '$time' },
        'm': { '$month': '$time' },
        'd': { '$dayOfMonth': '$time' } } } },
  { '$group': {
      '_id': {
        'p': '$path',
        'y': '$date.y',
        'm': '$date.m',
        'd': '$date.d' },
      'hits': { '$sum': 1 } } },
])
Aggregation Framework
{
  'ok': 1,
  'result': [
    { '_id': {'p':'/index.html','y': 2013,'m': 1,'d': 1 }, 'hits': 124 },
    { '_id': {'p':'/index.html','y': 2013,'m': 1,'d': 2 }, 'hits': 245 },
    { '_id': {'p':'/index.html','y': 2013,'m': 1,'d': 3 }, 'hits': 322 },
    { '_id': {'p':'/index.html','y': 2013,'m': 1,'d': 4 }, 'hits': 175 },
    { '_id': {'p':'/index.html','y': 2013,'m': 1,'d': 5 }, 'hits': 94 }
  ]
}
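To make the three pipeline stages concrete, here is a rough in-memory equivalent in plain Python. It is only an illustration of what `$match`, `$project` and `$group` compute for this example, not how the server executes the pipeline; the function name `hits_per_day` and the sample data are made up.

```python
from collections import Counter
from datetime import datetime


def hits_per_day(logs, start, end):
    """Count requests per (path, year, month, day) within [start, end)."""
    counts = Counter()
    for doc in logs:
        t = doc["time"]
        if start <= t < end:  # $match on the time range
            # $project extracts y/m/d; $group uses them with the path as _id
            key = (doc["path"], t.year, t.month, t.day)
            counts[key] += 1  # 'hits': { '$sum': 1 }
    return [
        {"_id": {"p": p, "y": y, "m": m, "d": d}, "hits": n}
        for (p, y, m, d), n in sorted(counts.items())
    ]


sample_logs = [
    {"path": "/index.html", "time": datetime(2013, 1, 1, 9, 30)},
    {"path": "/index.html", "time": datetime(2013, 1, 1, 17, 5)},
    {"path": "/index.html", "time": datetime(2013, 1, 2, 8, 0)},
    {"path": "/index.html", "time": datetime(2013, 2, 1, 0, 0)},  # outside the range
]
for row in hits_per_day(sample_logs, datetime(2013, 1, 1), datetime(2013, 2, 1)):
    print(row)
```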
Aggregation Framework Benefits
• Real-time
• Simple yet powerful interface
• Scale-out
• Declared in JSON, executes in C++
• Runs inside MongoDB on local data
Map Reduce in MongoDB
MongoDB Map/Reduce
Map Reduce – Map Phase
Generate hourly rollups from log data

var map = function() {
  var key = {
    p: this.path,
    d: new Date(
      this.ts.getFullYear(),
      this.ts.getMonth(),
      this.ts.getDate(),
      this.ts.getHours(),
      0, 0, 0) };
  emit( key, { hits: 1 } );
}
Map Reduce – Reduce Phase
Generate hourly rollups from log data

var reduce = function(key, values) {
  var r = { hits: 0 };
  values.forEach(function(v) {
    r.hits += v.hits;
  });
  return r;
}
Map Reduce - Execution
query = { 'ts': {
  '$gte': new Date(2013, 0, 1),
  '$lte': new Date(2013, 0, 31) } }
db.logs.mapReduce( map, reduce, {
  'query': query,
  'out': {
    'reduce' : 'stats.monthly' }
} )
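The `reduce` output mode merges new results into the existing output collection by running the reduce function again over the stored value and the new one. That only works because the reduce function returns the same shape it consumes. The sketch below imitates the map, reduce and merge steps in plain Python; it is an in-memory illustration with made-up names (`map_fn`, `reduce_fn`, `map_reduce`), not the server's implementation.

```python
from collections import defaultdict
from datetime import datetime


def map_fn(doc):
    """Emit one {hits: 1} value per log entry, keyed by path and hour."""
    ts = doc["ts"]
    key = (doc["path"], ts.replace(minute=0, second=0, microsecond=0))
    yield key, {"hits": 1}


def reduce_fn(key, values):
    """Sum hits; the result has the same shape as each input value."""
    return {"hits": sum(v["hits"] for v in values)}


def map_reduce(docs, out):
    """Run map and reduce over docs, merging into `out` like out: {reduce: ...}."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    for key, values in grouped.items():
        reduced = reduce_fn(key, values)
        if key in out:
            # Re-reduce the stored value together with the new partial result.
            reduced = reduce_fn(key, [out[key], reduced])
        out[key] = reduced
    return out


stats = {}
batch_1 = [
    {"path": "/index.html", "ts": datetime(2013, 1, 1, 10, 5)},
    {"path": "/index.html", "ts": datetime(2013, 1, 1, 10, 45)},
]
batch_2 = [{"path": "/index.html", "ts": datetime(2013, 1, 1, 10, 59)}]
map_reduce(batch_1, stats)
map_reduce(batch_2, stats)
print(stats)
```

Running the job in batches like this gives the same totals as a single run, which is what makes incremental rollups possible.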
MongoDB Map/Reduce Benefits
• Runs inside MongoDB
• Sharding supported
• JavaScript
– Pro: functionality, expressiveness
– Con: overhead

• Input can be a collection or query!

• Output directly to document or collection
• Easy, when you don’t want overhead of Hadoop
Hadoop Connector
MongoDB with Hadoop
MongoDB with Hadoop
How it works
• Adapter examines the MongoDB input collection and calculates a set of splits from the data
• Each split is assigned to a Hadoop node
• In parallel, Hadoop pulls data from splits on MongoDB (or BSON) and starts processing locally
• Hadoop merges results and streams output back to the MongoDB (or BSON) output collection
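The four steps above follow the usual split, process in parallel, merge pattern. The sketch below imitates that flow in plain Python with a thread pool standing in for Hadoop nodes. It is only an analogy: the fixed-size chunking in `calculate_splits` is an assumption and says nothing about how the connector actually picks split boundaries, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def calculate_splits(docs, split_size):
    """Step 1: cut the input into independent chunks (assumed fixed size)."""
    return [docs[i:i + split_size] for i in range(0, len(docs), split_size)]


def process_split(split):
    """Step 3: each worker processes only its own split."""
    return Counter(doc["path"] for doc in split)


def run_job(docs, split_size=2, workers=4):
    splits = calculate_splits(docs, split_size)
    # Step 2: hand every split to a worker; the pool stands in for Hadoop nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_split, splits))
    # Step 4: merge the partial results into one output.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


docs = [{"path": p} for p in ["/a", "/b", "/a", "/c", "/a"]]
print(run_job(docs))
```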
Read From MongoDB (or BSON)
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir= file:///tmp/messages.bson
mapred.input.dir= hdfs:///tmp/messages.bson
mapred.input.dir= s3:///tmp/messages.bson
Write To MongoDB (or BSON)
mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir= file:///tmp/results.bson
mapred.output.dir= hdfs:///tmp/results.bson
mapred.output.dir= s3:///tmp/results.bson
Document Example
{
"_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
"body" : "Here is our forecast\n\n ",
"filename" : "1.",
"headers" : {
"From" : "phillip.allen@enron.com",
"Subject" : "Forecast Info",
"X-bcc" : "",
"To" : "tim.belden@enron.com",
"X-Origin" : "Allen-P",
"X-From" : "Phillip K Allen",
"Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
"X-To" : "Tim Belden ",
"Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",
"Content-Type" : "text/plain; charset=us-ascii",
"Mime-Version" : "1.0"
}
}
Graph Sketch
Receiver Sender Pairs
{"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14}
{"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9}
{"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99}
{"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48}
{"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
Map Phase – each document goes through the mapper function
@Override
public void map(NullWritable key, BSONObject val,
        final Context context)
        throws IOException, InterruptedException {
    BSONObject headers = (BSONObject) val.get("headers");
    if (headers.containsKey("From") && headers.containsKey("To")) {
        String from = (String) headers.get("From");
        String to = (String) headers.get("To");
        // A message may list several comma-separated recipients
        String[] recips = to.split(",");
        for (int i = 0; i < recips.length; i++) {
            String recip = recips[i].trim();
            // Emit one (sender, recipient) pair per recipient
            context.write(new MailPair(from, recip), new IntWritable(1));
        }
    }
}
Reduce Phase – mapper outputs are grouped by key and passed to the Reducer
public void reduce(final MailPair pKey,
        final Iterable<IntWritable> pValues,
        final Context pContext)
        throws IOException, InterruptedException {
    // Sum the 1s emitted by the mapper for this (sender, recipient) pair
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }

    // start() is a static factory, so no builder instance is needed
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.from)
            .add("t", pKey.to)
            .get();
    BSONWritable pkeyOut = new BSONWritable(outDoc);
    pContext.write(pkeyOut, new IntWritable(sum));
}
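The same sender and recipient counting can be tried on a handful of documents without a Hadoop cluster. The sketch below mirrors the map and reduce logic in plain Python; `mail_pairs` and `count_pairs` are hypothetical names, and the sample messages are made up in the shape of the Document Example.

```python
from collections import Counter


def mail_pairs(message):
    """Map step: yield one (sender, recipient) pair per recipient."""
    headers = message.get("headers", {})
    if "From" in headers and "To" in headers:
        sender = headers["From"]
        for recip in headers["To"].split(","):
            yield (sender, recip.strip())


def count_pairs(messages):
    """Reduce step: count how often each pair occurs."""
    counts = Counter()
    for msg in messages:
        counts.update(mail_pairs(msg))
    return [
        {"_id": {"t": to, "f": frm}, "count": n}
        for (frm, to), n in sorted(counts.items())
    ]


messages = [
    {"headers": {"From": "alice@enron.com", "To": "bob@enron.com, eve@enron.com"}},
    {"headers": {"From": "alice@enron.com", "To": "bob@enron.com"}},
    {"headers": {"From": "eve@enron.com"}},  # no "To" header, skipped
]
for row in count_pairs(messages):
    print(row)
```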
Query Data
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 }
{ "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
Hadoop Connector Benefits
• Full multi-core parallelism to process MongoDB data
• mongo.input.query
• Full integration w/ Hadoop and JVM ecosystem
– Mahout, et al.
• Can be used on Amazon Elastic MapReduce
• Read and write backup files to local, HDFS and S3
• Vanilla Java MapReduce, Hadoop Streaming, Pig, Hive
Make predictions & test
A/B testing
• Hey, it looks like teenage girls clicked a lot on that ad with a pink background...
• Hypothesis: Given otherwise the same ad, teenage girls are more likely to click on ads with pink backgrounds than white
• Test 50-50 pink vs white ads
• Collect click stream stats in MongoDB or Hadoop
• Analyze results
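The slides do not say how to analyze the results. One common choice, assumed here, is a two-proportion z-test on the click-through rates of the two variants. The sketch below is a plain-Python version with made-up counts; the function name is hypothetical, and it assumes both variants were shown at least once and that the pooled rate is strictly between 0 and 1.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, one-sided p-value) for 'A has a higher click rate than B'."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    # Pooled click rate under the null hypothesis that both ads perform equally.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_a - p_b) / se
    # P(Z >= z) for a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Made-up counts: pink ad vs white ad, 1000 impressions each.
z, p = two_proportion_z_test(clicks_a=120, views_a=1000, clicks_b=90, views_b=1000)
print("z = {:.2f}, one-sided p = {:.4f}".format(z, p))
```

A small p-value suggests the difference is unlikely to be noise, which is the cue to collect more data and repeat the test rather than a final answer.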
Recommendations – social filtering
• “Customers who bought this book also bought”
• Computed offline / nightly
• As easy as it sounds!

google it: Amazon item-to-item algorithm
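As a rough illustration of the "also bought" idea, the sketch below counts how often two items appear in the same order and ranks items by that count. This is a deliberately simplified co-occurrence count, not Amazon's published item-to-item algorithm, and all names and sample orders are made up. It is the kind of job that could run offline or nightly, with the results stored for fast lookup.

```python
from collections import defaultdict
from itertools import combinations


def co_purchase_counts(orders):
    """Count, for every pair of items, how many orders contain both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in orders:
        # set() ignores duplicates within one order; sorted() keeps pairs stable.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_bought(counts, item, top_n=3):
    """Return up to top_n items most often bought together with `item`."""
    related = counts.get(item, {})
    # Highest count first; ties broken alphabetically for a stable result.
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


orders = [
    ["book-a", "book-b"],
    ["book-a", "book-b", "book-c"],
    ["book-a", "book-c"],
    ["book-b", "book-d"],
]
counts = co_purchase_counts(orders)
print(also_bought(counts, "book-a"))
```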
Personalization
• “Even if you are a teenage girl, you seem to be 60% more likely to click on blue ads than pink.”
• User specific recommendations a hybrid of offline & online recommendations
• User profile in MongoDB
• May even be updated real time
@h_ingo

Questions?
Henrik Ingo
Solution Architect, MongoDB

Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Recently uploaded (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Analytics with MongoDB Aggregation Framework and Hadoop Connector

  • 1. @h_ingo Analytics with MongoDB alone and with Hadoop Connector Henrik Ingo Solution Architect, MongoDB
  • 2. The Science in Data Science • Collect data • Explore the data, use visualization • Use math • Make predictions • Test predictions – Collect even more data • Repeat...
  • 4. 5 NoSQL categories Redis Cassandra Key Value Graph Neo4j Wide Column Document Map Reduce Hadoop
  • 5. MongoDB and Enterprise IT Stack CRM, ERP, Collaboration, Mobile, BI Data Management Online Data Offline Data RDBMS RDBMS Hadoop EDW Infrastructure OS & Virtualization, Compute, Storage, Network Security & Auditing Management & Monitoring Applications
  • 6. How do we do it with MongoDB?
  • 10. Volume Velocity Variety Upserts avoid unnecessary reads Asynchronous writes Data Data Sources Data Sources Data Sources Sources Spread writes over multiple shards Writes buffered in RAM and flushed to disk in bulk
  • 11. Volume Velocity Variety MongoDB RDBMS { _id : ObjectId("4c4ba5e5e8aabf3"), employee_name: "Dunham, Justin", department : "Marketing", title : "Product Manager, Web", report_up: "Neray, Graham", pay_band: "C", benefits : [ { type : "Health", plan : "PPO Plus" }, { type : "Dental", plan : "Standard" } ] }
  • 15. Data Processing in MongoDB • Pre-aggregated documents • Aggregation Framework • Map/Reduce • Hadoop Connector
  • 17. Pre-Aggregation Data for URL / Date { _id: "20101010/site-1/apache_pb.gif", metadata: { date: ISODate("2010-10-10T00:00:00Z"), site: "site-1", page: "/apache_pb.gif" }, daily: 5468426, hourly: { "0": 227850, "1": 210231, ... "23": 20457 }, minute: { "0": 3612, "1": 3241, ... "1439": 2819 } }
  • 18. Pre-Aggregation Data for URL / Date query = { '_id': "20101010/site-1/apache_pb.gif" } update = { '$inc': { 'hourly.12' : 1, 'minute.739': 1 } } db.stats.daily.update(query, update, upsert=True)
  • 20. Dynamic Queries Find all logs for a URL: db.logs.find( { 'path' : '/index.html' } ) Find all logs for a time range: db.logs.find( { 'time' : { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } ) Find all logs for a host over a range of dates: db.logs.find( { 'host' : '127.0.0.1', 'time' : { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } )
  • 21. Aggregation Framework Requests per day by URL db.logs.aggregate( [ { '$match': { 'time': { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } }, { '$project': { 'path': 1, 'date': { 'y': { '$year': '$time' }, 'm': { '$month': '$time' }, 'd': { '$dayOfMonth': '$time' } } } }, { '$group': { '_id': { 'p': '$path', 'y': '$date.y', 'm': '$date.m', 'd': '$date.d' }, 'hits': { '$sum': 1 } } } ] )
  • 22. Aggregation Framework { 'ok': 1, 'result': [ { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 1 }, 'hits': 124 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 2 }, 'hits': 245 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 3 }, 'hits': 322 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 4 }, 'hits': 175 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 5 }, 'hits': 94 } ] }
  • 23. Aggregation Framework Benefits • Real-time • Simple yet powerful interface • Scale-out • Declared in JSON, executes in C++ • Runs inside MongoDB on local data
  • 24. Map Reduce in MongoDB
  • 26. Map Reduce – Map Phase Generate hourly rollups from log data var map = function() { var key = { p: this.path, d: new Date( this.ts.getFullYear(), this.ts.getMonth(), this.ts.getDate(), this.ts.getHours(), 0, 0, 0) }; emit( key, { hits: 1 } ); }
  • 27. Map Reduce – Reduce Phase Generate hourly rollups from log data var reduce = function(key, values) { var r = { hits: 0 }; values.forEach(function(v) { r.hits += v.hits; }); return r; }
  • 28. Map Reduce - Execution query = { 'ts': { '$gte': new Date(2013, 0, 1), '$lte': new Date(2013, 0, 31) } } db.logs.mapReduce( map, reduce, { 'query': query, 'out': { 'reduce' : 'stats.monthly' } } )
  • 29. MongoDB Map/Reduce Benefits • Runs inside MongoDB • Sharding supported • JavaScript – Pro: functionality, expressiveness – Con: overhead • Input can be a collection or query! • Output directly to document or collection • Easy, when you don’t want overhead of Hadoop
  • 34. How it works • Adapter examines the MongoDB input collection and calculates a set of splits from the data • Each split is assigned to a Hadoop node • In parallel, Hadoop pulls data from the splits on MongoDB (or BSON) and starts processing locally • Hadoop merges results and streams output back to the MongoDB (or BSON) output collection
  • 35. Read From MongoDB (or BSON) mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat mongo.input.uri=mongodb://my-db:27017/enron.messages mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat mapred.input.dir= file:///tmp/messages.bson mapred.input.dir= hdfs:///tmp/messages.bson mapred.input.dir= s3:///tmp/messages.bson
  • 36. Write To MongoDB (or BSON) mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat mongo.output.uri=mongodb://my-db:27017/enron.results_out mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat mapred.output.dir= file:///tmp/results.bson mapred.output.dir= hdfs:///tmp/results.bson mapred.output.dir= s3:///tmp/results.bson
  • 37. Document Example { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"), "body" : "Here is our forecast\n\n ", "filename" : "1.", "headers" : { "From" : "phillip.allen@enron.com", "Subject" : "Forecast Info", "X-bcc" : "", "To" : "tim.belden@enron.com", "X-Origin" : "Allen-P", "X-From" : "Phillip K Allen", "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "X-To" : "Tim Belden ", "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } }
  • 39. Receiver Sender Pairs {"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14} {"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9} {"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99} {"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48} {"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
  • 40. Map Phase – each document goes through the mapper function @Override public void map(NullWritable key, BSONObject val, final Context context) throws IOException, InterruptedException { BSONObject headers = (BSONObject) val.get("headers"); if (headers.containsKey("From") && headers.containsKey("To")) { String from = (String) headers.get("From"); String to = (String) headers.get("To"); String[] recips = to.split(","); for (int i = 0; i < recips.length; i++) { String recip = recips[i].trim(); context.write(new MailPair(from, recip), new IntWritable(1)); } } }
  • 41. Reduce Phase – map outputs are grouped by key and passed to the Reducer public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues, final Context pContext) throws IOException, InterruptedException { int sum = 0; for (final IntWritable value : pValues) { sum += value.get(); } BSONObject outDoc = BasicDBObjectBuilder.start() .add("f", pKey.from) .add("t", pKey.to) .get(); BSONWritable pkeyOut = new BSONWritable(outDoc); pContext.write(pkeyOut, new IntWritable(sum)); }
  • 42. Query Data mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/}) { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "15126-1267@m2.innovyx.com" }, "count" : 1 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "2586207@www4.imakenews.com" }, "count" : 1 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "40enron@enron.com" }, "count" : 2 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..davis@enron.com" }, "count" : 2 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..hughes@enron.com" }, "count" : 4 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..lindholm@enron.com" }, "count" : 1 } { "_id" : { "t" : "kenneth.lay@enron.com", "f" : "a..schroeder@enron.com" }, "count" : 1 }
  • 43. Hadoop Connector Benefits • Full multi-core parallelism to process MongoDB data • mongo.input.query • Full integration w/ Hadoop and JVM ecosystem • Mahout, et al. • Can be used on Amazon Elastic MapReduce • Read and write backup files to local, HDFS and S3 • Vanilla Java MapReduce, Hadoop Streaming, Pig, Hive
  • 45. A/B testing • Hey, it looks like teenage girls clicked a lot on that ad with a pink background... • Hypothesis: Given otherwise the same ad, teenage girls are more likely to click on ads with pink backgrounds than white • Test 50-50 pink vs white ads • Collect click stream stats in MongoDB or Hadoop • Analyze results
  • 46. Recommendations – social filtering • "Customers who bought this book also bought" • Computed offline / nightly • As easy as it sounds! Google it: Amazon item-to-item algorithm
  • 47. Personalization • "Even if you are a teenage girl, you seem to be 60% more likely to click on blue ads than pink." • User-specific recommendations: a hybrid of offline & online recommendations • User profile in MongoDB • May even be updated in real time
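
The pre-aggregation update on slides 17 and 18 increments the counters 'hourly.12' and 'minute.739' for one daily document. The sketch below shows one way those field names and the _id could be derived from a request timestamp. The function name preagg_upsert and its parameters are hypothetical and not part of the slides, and the _id layout ("<YYYYMMDD>/<site><page>") is inferred from the single example document on slide 17.

```python
from datetime import datetime


def preagg_upsert(ts, site, page):
    """Build the (query, update) pair for one page hit.

    Hypothetical helper: the slides show only the resulting documents,
    not how they are generated.
    """
    # _id layout inferred from slide 17: "<YYYYMMDD>/<site><page>"
    query = {"_id": f"{ts:%Y%m%d}/{site}{page}"}

    # Minute buckets run from 0 to 1439 across the whole day.
    minute_of_day = ts.hour * 60 + ts.minute

    # $inc creates missing counters on upsert and increments existing ones.
    update = {
        "$inc": {
            f"hourly.{ts.hour}": 1,
            f"minute.{minute_of_day}": 1,
        }
    }
    return query, update


# A hit at 12:19 lands in hour bucket 12 and minute bucket 12 * 60 + 19 = 739.
query, update = preagg_upsert(
    datetime(2010, 10, 10, 12, 19), "site-1", "/apache_pb.gif"
)
print(query)
print(update)
```

The resulting pair can be passed to the upsert call shown on slide 18. With a current PyMongo driver the equivalent call would be db.stats.daily.update_one(query, update, upsert=True).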