SlideShare a Scribd company logo
@h_ingo

Analytics with MongoDB
alone and with Hadoop Connector
Henrik Ingo
Solution Architect, MongoDB
The Science in Data Science
• Collect data
• Explore the data, use visualization
• Use math
• Make predictions
• Test predictions
– Collect even more data
• Repeat...
Why MongoDB?
When MongoDB?
5 NoSQL categories

Redis

Cassandra

Key Value
Graph
Neo4j

Wide Column

Document

Map Reduce
Hadoop
MongoDB and Enterprise IT Stack

CRM, ERP, Collaboration, Mobile, BI

Data Management
Online Data

Offline Data

RDBMS
RDBMS

Hadoop

EDW

Infrastructure
OS & Virtualization, Compute, Storage, Network

Security & Auditing

Management & Monitoring

Applications
How do we do it with MongoDB?
Collect data
Exponential Data Growth

http://www.worldwidewebsize.com/
Volume Velocity Variety
Volume Velocity Variety

Upserts avoid
unnecessary reads

Asynchronous writes

Data
Data
Sources
Data
Sources
Data
Sources
Sources

Spread writes over
multiple shards

Writes buffered in RAM
and flushed to disk in
bulk
Volume Velocity Variety
MongoDB

RDBMS

{
_id : ObjectId("4c4ba5e5e8aabf3"),
employee_name: "Dunham, Justin",
department : "Marketing",
title : "Product Manager, Web",
report_up: "Neray, Graham",
pay_band: “C",
benefits : [
{

type :

"Health",

plan : "PPO Plus" },
{

type :

"Dental",

plan : "Standard" }
]
}
Visualization
Visualization

d3js.org, …
Use math
Data Processing in MongoDB
• Pre-aggregated documents

• Aggregation Framework
• Map/Reduce
• Hadoop Connector
Pre-aggregated documents
Design Pattern
Pre-Aggregation
Data for
URL /
Date

{
_id: "20101010/site-1/apache_pb.gif",
metadata: {
date: ISODate("2000-10-10T00:00:00Z"),
site: "site-1",
page: "/apache_pb.gif" },
daily: 5468426,
hourly: {
"0": 227850,
"1": 210231,
...
"23": 20457 },
minute: {
"0": 3612,
"1": 3241,
...
"1439": 2819 }
}
Pre-Aggregation
Data for
URL /
Date

query = { '_id': "20101010/site-1/apache_pb.gif" }
update = { '$inc': {
'hourly.12' : 1,
'minute.739': 1 } }
db.stats.daily.update(query, update, upsert=True)
Aggregation framework
Dynamic Queries
Find all logs for a
URL

db.logs.find( { ‘path’ : ‘/index.html’ } )

Find all logs for a
time range

db.logs.find( {
‘time’ : {
‘$gte’: new Date(2013, 0),
‘$lt’: new Date(2013, s1) }
} )

Find all logs for a
host over a range of
dates

db.logs.find( {
‘host’ : ‘127.0.0.1’,
‘time’ : {
‘$gte’: new Date(2013, 0),
‘$lt’: new Date(2013, 1) }
} )
Aggregation Framework
Requests db.logs.aggregate( [
{ '$match': {
per day by
'time': {
URL

'$gte': new Date(2013, 0),
'$lt': new Date(2013, 1) } } },
{ '$project': {
'path': 1,
'date': {
'y': { '$year':
'$time' },
'm': { '$month':
'$time' },
'd': { '$dayOfMonth': '$time' } } } },
{ '$group': {
'_id': {
'p': '$path',
'y': '$date.y',
'm': '$date.m',
'd': '$date.d' },
'hits': { '$sum': 1 } } },
])
Aggregation Framework
{
‘ok’: 1,
‘result’: [
{ '_id': {'p':’/index.html’,'y':
{ '_id': {'p':’/index.html’,'y':
{ '_id': {'p':’/index.html’,'y':
{ '_id': {'p':’/index.html’,'y':
{ '_id': {'p':’/index.html’,'y':
]
}

2013,'m':
2013,'m':
2013,'m':
2013,'m':
2013,'m':

1,'d':
1,'d':
1,'d':
1,'d':
1,'d':

1
2
3
4
5

},
},
},
},
},

'hits’:
'hits’:
'hits’:
'hits’:
'hits’:

124 },
245 },
322 },
175 },
94 }
Aggregation Framework Benefits
• Real-time
• Simple yet powerful interface
• Scale-out
• Declared in JSON, executes in C++
• Runs inside MongoDB on local data
Map Reduce in MongoDB
MongoDB Map/Reduce
Map Reduce – Map Phase
Generate hourly
rollups from log
data

var map =

function() {

var key = {
p: this.path,
d: new Date(
this.ts.getFullYear(),
this.ts.getMonth(),
this.ts.getDate(),
this.ts.getHours(),
0, 0, 0) };
emit( key, { hits: 1 } );
}
Map Reduce – Reduce Phase
Generate hourly
rollups from log
data

var reduce = function(key, values) {
var r = { hits: 0 };
values.forEach(function(v) {
r.hits += v.hits;
});
return r;
}
)
Map Reduce - Execution
query = { 'ts': {
'$gte': new Date(2013, 0, 1),
'$lte': new Date(2013, 0, 31) } }
db.logs.mapReduce( map, reduce, {
‘query’: query,
‘out’: {
‘reduce’ : ‘stats.monthly’ }
} )
MongoDB Map/Reduce Benefits
• Runs inside MongoDB
• Sharding supported
• JavaScript
– Pro: functionality, expressiveness
– Con: overhead

• Input can be a collection or query!

• Output directly to document or collection
• Easy, when you don’t want overhead of Hadoop
Hadoop Connector
MongoDB with Hadoop
MongoDB with Hadoop
MongoDB

MongoDB with Hadoop
How it works
•

Adapter examines MongoDB input collection and
calculates a set of splits from data

•

Each split is assigned to a Hadoop node

•

In parallel hadoop pulls data from splits on
MongoDB (or BSON) and starts processing locally

•

Hadoop merges results and streams output back to
MongoDB (or BSON) output collection
Read From MongoDB (or BSON)
mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat
mongo.input.uri=mongodb://my-db:27017/enron.messages

mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat
mapred.input.dir= file:///tmp/messages.bson
mapred.input.dir= hdfs:///tmp/messages.bson
mapred.input.dir= s3:///tmp/messages.bson
Write To MongoDB

(or BSON)

mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat
mongo.output.uri=mongodb://my-db:27017/enron.results_out

mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat
mapred.output.dir= file:///tmp/results.bson
mapred.output.dir= hdfs:///tmp/results.bson
mapred.output.dir= s3:///tmp/results.bson
Document Example
{
"_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),
"body" : "Here is our forecastnn ",
"filename" : "1.",
"headers" : {
"From" : "phillip.allen@enron.com",
"Subject" : "Forecast Info",
"X-bcc" : "",
"To" : "tim.belden@enron.com",
"X-Origin" : "Allen-P",
"X-From" : "Phillip K Allen",
"Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
"X-To" : "Tim Belden ",
"Message-ID" :
"<18782981.1075855378110.JavaMail.evans@thyme>",
"Content-Type" : "text/plain; charset=us-ascii",
"Mime-Version" : "1.0"
}
}
Graph Sketch
Receiver Sender Pairs
{"_id": {"t":"bob@enron.com",

"f":"alice@enron.com"},

"count" : 14}

{"_id": {"t":"bob@enron.com",

"f":"eve@enron.com"},

"count" : 9}

{"_id": {"t":"alice@enron.com",

"f":"charlie@enron.com"},

"count" : 99}

{"_id": {"t":"charlie@enron.com",

"f":"bob@enron.com"},

"count" : 48}

{"_id": {"t":"eve@enron.com",

"f":"charlie@enron.com"},

"count" : 20}
Map Phase – each document get’s
through mapper function
@Override
public void map(NullWritable key, BSONObject val,
final Context context){
BSONObject headers = (BSONObject)val.get("headers");
if(headers.containsKey("From") && headers.containsKey("To")){
String from = (String)headers.get("From"); String to =
(String) headers.get("To"); String[] recips = to.split(",");
for(int i=0;i<recips.length;i++){
String recip = recips[i].trim();
context.write(new MailPair(from, recip), new
IntWritable(1));
}
}

}
Reduce Phase – output Maps are
grouped by key and passed to Reducer
public void reduce(final MailPair pKey,
final Iterable<IntWritable> pValues,
final Context pContext ){
int sum = 0;
for ( final IntWritable value : pValues ){
sum += value.get();
}

BSONObject outDoc = new BasicDBObjectBuilder().start()
.add( "f" , pKey.from)
.add( "t" , pKey.to )
.get();
BSONWritable pkeyOut = new BSONWritable(outDoc);
pContext.write( pkeyOut, new IntWritable(sum) ); }
Query Data
mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})
{ "_id" : { "t"
"f"
}
{ "_id" : { "t"
"f"
}
{ "_id" : { "t"
"f"
{ "_id" : { "t"
"f"
{ "_id" : { "t"
"f"
{ "_id" : { "t"
"f"
{ "_id" : { "t"
"f"

: "kenneth.lay@enron.com",
: "15126-1267@m2.innovyx.com" }, "count" : 1
: "kenneth.lay@enron.com",
: "2586207@www4.imakenews.com" }, "count" : 1
:
:
:
:
:
:
:
:
:
:

"kenneth.lay@enron.com",
"40enron@enron.com" }, "count" : 2 }
"kenneth.lay@enron.com",
"a..davis@enron.com" }, "count" : 2 }
"kenneth.lay@enron.com",
"a..hughes@enron.com" }, "count" : 4 }
"kenneth.lay@enron.com",
"a..lindholm@enron.com" }, "count" : 1 }
"kenneth.lay@enron.com",
"a..schroeder@enron.com" }, "count" : 1 }
Hadoop Connector Benefits
•

Full multi-core parallelism to process MongoDB data

•

mongo.input.query

•

Full integration w/ Hadoop and JVM ecosystem
•

Mahout, et.al.

•

Can be used on Amazon Elastic MapReduce

•

Read and write backup files to local, HDFS and S3

•

Vanilla Java MapReduce, Hadoop Streaming, Pig, Hive
Make predictions & test
A/B testing
• Hey, it looks like teenage girls clicked a lot on that ad

with a pink background...
• Hypothesis: Given otherwise the same ad, teenage

girls are more likely to click on ads with pink
backgrounds than white
• Test 50-50 pink vs white ads

• Collect click stream stats in MongoDB or Hadoop
• Analyze results
Recommendations – social filtering
• ”Customers who bought this book also bought”
• Computed offline / nightly
• As easy as it sounds!

google it: Amazon item-to-item algorithm
Personalization
• ”Even if you are a teenage girl, you seem to be 60%

more likely to click on blue ads than pink.”
• User specific recommendations a hybrid of offline &

online recommendations
• User profile in MongoDB
• May even be updated real time
@h_ingo

Questions?
Henrik Ingo
Solution Architect, MongoDB

More Related Content

What's hot

The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation Framework
MongoDB
 
MongoDB - Aggregation Pipeline
MongoDB - Aggregation PipelineMongoDB - Aggregation Pipeline
MongoDB - Aggregation Pipeline
Jason Terpko
 
MongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDBMongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation Framework
Caserta
 
MongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced AggregationMongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced Aggregation
Joe Drumgoole
 
Introduction to MongoDB and Hadoop
Introduction to MongoDB and HadoopIntroduction to MongoDB and Hadoop
Introduction to MongoDB and Hadoop
Steven Francia
 
Webinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDBWebinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDB
MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Nosh Petigara
 
MongoDB Aggregation
MongoDB Aggregation MongoDB Aggregation
MongoDB Aggregation
Amit Ghosh
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
MongoDB
 
Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1
Anuj Jain
 
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...MongoDB
 
2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky
Data Con LA
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014MongoDB
 
MongoDB for Analytics
MongoDB for AnalyticsMongoDB for Analytics
MongoDB for Analytics
MongoDB
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo dbMongoDB
 
Aggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days MunichAggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days Munich
Norberto Leite
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
CAPSiDE
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation FrameworkTyler Brock
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
MongoDB
 

What's hot (20)

The Aggregation Framework
The Aggregation FrameworkThe Aggregation Framework
The Aggregation Framework
 
MongoDB - Aggregation Pipeline
MongoDB - Aggregation PipelineMongoDB - Aggregation Pipeline
MongoDB - Aggregation Pipeline
 
MongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDBMongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDB
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation Framework
 
MongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced AggregationMongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced Aggregation
 
Introduction to MongoDB and Hadoop
Introduction to MongoDB and HadoopIntroduction to MongoDB and Hadoop
Introduction to MongoDB and Hadoop
 
Webinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDBWebinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDB
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
MongoDB Aggregation
MongoDB Aggregation MongoDB Aggregation
MongoDB Aggregation
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
 
Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1
 
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
Webinarserie: Einführung in MongoDB: “Back to Basics” - Teil 3 - Interaktion ...
 
2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky
 
Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014Hadoop - MongoDB Webinar June 2014
Hadoop - MongoDB Webinar June 2014
 
MongoDB for Analytics
MongoDB for AnalyticsMongoDB for Analytics
MongoDB for Analytics
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo db
 
Aggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days MunichAggregation Framework MongoDB Days Munich
Aggregation Framework MongoDB Days Munich
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
 
MongoDB Aggregation Framework
MongoDB Aggregation FrameworkMongoDB Aggregation Framework
MongoDB Aggregation Framework
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 

Similar to Analytics with MongoDB Aggregation Framework and Hadoop Connector

MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDBMongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
MongoDB
 
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB
 
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB
 
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
Gianfranco Palumbo
 
Webinar: General Technical Overview of MongoDB for Dev Teams
Webinar: General Technical Overview of MongoDB for Dev TeamsWebinar: General Technical Overview of MongoDB for Dev Teams
Webinar: General Technical Overview of MongoDB for Dev Teams
MongoDB
 
Building your first app with MongoDB
Building your first app with MongoDBBuilding your first app with MongoDB
Building your first app with MongoDB
Norberto Leite
 
Data Analytics with MongoDB - Jane Fine
Data Analytics with MongoDB - Jane FineData Analytics with MongoDB - Jane Fine
Data Analytics with MongoDB - Jane Fine
MongoDB
 
How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6
Maxime Beugnet
 
Eagle6 mongo dc revised
Eagle6 mongo dc revisedEagle6 mongo dc revised
Eagle6 mongo dc revisedMongoDB
 
Eagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessEagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational Awareness
MongoDB
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and Workshop
AhmedabadJavaMeetup
 
MongoDB Meetup
MongoDB MeetupMongoDB Meetup
MongoDB Meetup
Maxime Beugnet
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB
 
Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...
Maxime Beugnet
 
Mongodb intro
Mongodb introMongodb intro
Mongodb intro
christkv
 
Building a Cross Channel Content Delivery Platform with MongoDB
Building a Cross Channel Content Delivery Platform with MongoDBBuilding a Cross Channel Content Delivery Platform with MongoDB
Building a Cross Channel Content Delivery Platform with MongoDB
MongoDB
 
MongoDB and Ruby on Rails
MongoDB and Ruby on RailsMongoDB and Ruby on Rails
MongoDB and Ruby on Railsrfischer20
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
MongoDB
 
MongoDB 3.2 - Analytics
MongoDB 3.2  - AnalyticsMongoDB 3.2  - Analytics
MongoDB 3.2 - Analytics
Massimo Brignoli
 
MongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and Implications
MongoDB
 

Similar to Analytics with MongoDB Aggregation Framework and Hadoop Connector (20)

MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDBMongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
 
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
MongoDB Evenings Houston: What's the Scoop on MongoDB and Hadoop? by Jake Ang...
 
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & HadoopMongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
MongoDB Evenings Dallas: What's the Scoop on MongoDB & Hadoop
 
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's A...
 
Webinar: General Technical Overview of MongoDB for Dev Teams
Webinar: General Technical Overview of MongoDB for Dev TeamsWebinar: General Technical Overview of MongoDB for Dev Teams
Webinar: General Technical Overview of MongoDB for Dev Teams
 
Building your first app with MongoDB
Building your first app with MongoDBBuilding your first app with MongoDB
Building your first app with MongoDB
 
Data Analytics with MongoDB - Jane Fine
Data Analytics with MongoDB - Jane FineData Analytics with MongoDB - Jane Fine
Data Analytics with MongoDB - Jane Fine
 
How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6
 
Eagle6 mongo dc revised
Eagle6 mongo dc revisedEagle6 mongo dc revised
Eagle6 mongo dc revised
 
Eagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational AwarenessEagle6 Enterprise Situational Awareness
Eagle6 Enterprise Situational Awareness
 
Introduction to MongoDB and Workshop
Introduction to MongoDB and WorkshopIntroduction to MongoDB and Workshop
Introduction to MongoDB and Workshop
 
MongoDB Meetup
MongoDB MeetupMongoDB Meetup
MongoDB Meetup
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
 
Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...Simplifying & accelerating application development with MongoDB's intelligent...
Simplifying & accelerating application development with MongoDB's intelligent...
 
Mongodb intro
Mongodb introMongodb intro
Mongodb intro
 
Building a Cross Channel Content Delivery Platform with MongoDB
Building a Cross Channel Content Delivery Platform with MongoDBBuilding a Cross Channel Content Delivery Platform with MongoDB
Building a Cross Channel Content Delivery Platform with MongoDB
 
MongoDB and Ruby on Rails
MongoDB and Ruby on RailsMongoDB and Ruby on Rails
MongoDB and Ruby on Rails
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
MongoDB 3.2 - Analytics
MongoDB 3.2  - AnalyticsMongoDB 3.2  - Analytics
MongoDB 3.2 - Analytics
 
MongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and ImplicationsMongoDB Schema Design: Practical Applications and Implications
MongoDB Schema Design: Practical Applications and Implications
 

More from Henrik Ingo

Introduction to new high performance storage engines in mongodb 3.0
Introduction to new high performance storage engines in mongodb 3.0Introduction to new high performance storage engines in mongodb 3.0
Introduction to new high performance storage engines in mongodb 3.0
Henrik Ingo
 
Meteor - The next generation software stack
Meteor - The next generation software stackMeteor - The next generation software stack
Meteor - The next generation software stack
Henrik Ingo
 
MongoDB for Oracle Experts - OUGF Harmony 2014
MongoDB for Oracle Experts - OUGF Harmony 2014 MongoDB for Oracle Experts - OUGF Harmony 2014
MongoDB for Oracle Experts - OUGF Harmony 2014
Henrik Ingo
 
Building Your First MongoDB App
Building Your First MongoDB AppBuilding Your First MongoDB App
Building Your First MongoDB App
Henrik Ingo
 
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Henrik Ingo
 
Failover or not to failover
Failover or not to failoverFailover or not to failover
Failover or not to failoverHenrik Ingo
 
Spatial functions in MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and others
Spatial functions in  MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and othersSpatial functions in  MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and others
Spatial functions in MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and othersHenrik Ingo
 
Introducing Xtrabackup Manager
Introducing Xtrabackup ManagerIntroducing Xtrabackup Manager
Introducing Xtrabackup ManagerHenrik Ingo
 
Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)
Henrik Ingo
 
Froscon 2012 how big corporations play the open source game
Froscon 2012   how big corporations play the open source gameFroscon 2012   how big corporations play the open source game
Froscon 2012 how big corporations play the open source game
Henrik Ingo
 
Introduction to Galera
Introduction to GaleraIntroduction to Galera
Introduction to Galera
Henrik Ingo
 
Databases and the Cloud
Databases and the CloudDatabases and the Cloud
Databases and the CloudHenrik Ingo
 
Fixed in drizzle
Fixed in drizzleFixed in drizzle
Fixed in drizzleHenrik Ingo
 
Choosing a MySQL High Availability solution - Percona Live UK 2011
Choosing a MySQL High Availability solution - Percona Live UK 2011Choosing a MySQL High Availability solution - Percona Live UK 2011
Choosing a MySQL High Availability solution - Percona Live UK 2011
Henrik Ingo
 
Froscon2011: How i learned to use sql and then learned not to use it
Froscon2011:  How i learned to use sql and then learned not to use itFroscon2011:  How i learned to use sql and then learned not to use it
Froscon2011: How i learned to use sql and then learned not to use it
Henrik Ingo
 
How to grow your open source project 10x and revenues 5x OSCON2011
How to grow your open source project 10x and revenues 5x OSCON2011How to grow your open source project 10x and revenues 5x OSCON2011
How to grow your open source project 10x and revenues 5x OSCON2011Henrik Ingo
 

More from Henrik Ingo (16)

Introduction to new high performance storage engines in mongodb 3.0
Introduction to new high performance storage engines in mongodb 3.0Introduction to new high performance storage engines in mongodb 3.0
Introduction to new high performance storage engines in mongodb 3.0
 
Meteor - The next generation software stack
Meteor - The next generation software stackMeteor - The next generation software stack
Meteor - The next generation software stack
 
MongoDB for Oracle Experts - OUGF Harmony 2014
MongoDB for Oracle Experts - OUGF Harmony 2014 MongoDB for Oracle Experts - OUGF Harmony 2014
MongoDB for Oracle Experts - OUGF Harmony 2014
 
Building Your First MongoDB App
Building Your First MongoDB AppBuilding Your First MongoDB App
Building Your First MongoDB App
 
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
Whats new in mongoDB 2.4 at Copenhagen user group 2013-06-19
 
Failover or not to failover
Failover or not to failoverFailover or not to failover
Failover or not to failover
 
Spatial functions in MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and others
Spatial functions in  MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and othersSpatial functions in  MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and others
Spatial functions in MySQL 5.6, MariaDB 5.5, PostGIS 2.0 and others
 
Introducing Xtrabackup Manager
Introducing Xtrabackup ManagerIntroducing Xtrabackup Manager
Introducing Xtrabackup Manager
 
Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)Using and Benchmarking Galera in different architectures (PLUK 2012)
Using and Benchmarking Galera in different architectures (PLUK 2012)
 
Froscon 2012 how big corporations play the open source game
Froscon 2012   how big corporations play the open source gameFroscon 2012   how big corporations play the open source game
Froscon 2012 how big corporations play the open source game
 
Introduction to Galera
Introduction to GaleraIntroduction to Galera
Introduction to Galera
 
Databases and the Cloud
Databases and the CloudDatabases and the Cloud
Databases and the Cloud
 
Fixed in drizzle
Fixed in drizzleFixed in drizzle
Fixed in drizzle
 
Choosing a MySQL High Availability solution - Percona Live UK 2011
Choosing a MySQL High Availability solution - Percona Live UK 2011Choosing a MySQL High Availability solution - Percona Live UK 2011
Choosing a MySQL High Availability solution - Percona Live UK 2011
 
Froscon2011: How i learned to use sql and then learned not to use it
Froscon2011:  How i learned to use sql and then learned not to use itFroscon2011:  How i learned to use sql and then learned not to use it
Froscon2011: How i learned to use sql and then learned not to use it
 
How to grow your open source project 10x and revenues 5x OSCON2011
How to grow your open source project 10x and revenues 5x OSCON2011How to grow your open source project 10x and revenues 5x OSCON2011
How to grow your open source project 10x and revenues 5x OSCON2011
 

Recently uploaded

Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 

Recently uploaded (20)

Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

Analytics with MongoDB Aggregation Framework and Hadoop Connector

  • 1. @h_ingo Analytics with MongoDB alone and with Hadoop Connector Henrik Ingo Solution Architect, MongoDB
  • 2. The Science in Data Science • Collect data • Explore the data, use visualization • Use math • Make predictions • Test predictions – Collect even more data • Repeat...
  • 4. 5 NoSQL categories Redis Cassandra Key Value Graph Neo4j Wide Column Document Map Reduce Hadoop
  • 5. MongoDB and Enterprise IT Stack CRM, ERP, Collaboration, Mobile, BI Data Management Online Data Offline Data RDBMS RDBMS Hadoop EDW Infrastructure OS & Virtualization, Compute, Storage, Network Security & Auditing Management & Monitoring Applications
  • 6. How do we do it with MongoDB?
  • 10. Volume Velocity Variety Upserts avoid unnecessary reads Asynchronous writes Data Data Sources Data Sources Data Sources Sources Spread writes over multiple shards Writes buffered in RAM and flushed to disk in bulk
  • 11. Volume Velocity Variety MongoDB RDBMS { _id : ObjectId("4c4ba5e5e8aabf3"), employee_name: "Dunham, Justin", department : "Marketing", title : "Product Manager, Web", report_up: "Neray, Graham", pay_band: “C", benefits : [ { type : "Health", plan : "PPO Plus" }, { type : "Dental", plan : "Standard" } ] }
  • 15. Data Processing in MongoDB • Pre-aggregated documents • Aggregation Framework • Map/Reduce • Hadoop Connector
  • 17. Pre-Aggregation Data for URL / Date { _id: "20101010/site-1/apache_pb.gif", metadata: { date: ISODate("2000-10-10T00:00:00Z"), site: "site-1", page: "/apache_pb.gif" }, daily: 5468426, hourly: { "0": 227850, "1": 210231, ... "23": 20457 }, minute: { "0": 3612, "1": 3241, ... "1439": 2819 } }
  • 18. Pre-Aggregation Data for URL / Date query = { '_id': "20101010/site-1/apache_pb.gif" } update = { '$inc': { 'hourly.12' : 1, 'minute.739': 1 } } db.stats.daily.update(query, update, upsert=True)
  • 20. Dynamic Queries Find all logs for a URL db.logs.find( { ‘path’ : ‘/index.html’ } ) Find all logs for a time range db.logs.find( { ‘time’ : { ‘$gte’: new Date(2013, 0), ‘$lt’: new Date(2013, s1) } } ) Find all logs for a host over a range of dates db.logs.find( { ‘host’ : ‘127.0.0.1’, ‘time’ : { ‘$gte’: new Date(2013, 0), ‘$lt’: new Date(2013, 1) } } )
  • 21. Aggregation Framework Requests db.logs.aggregate( [ { '$match': { per day by 'time': { URL '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } }, { '$project': { 'path': 1, 'date': { 'y': { '$year': '$time' }, 'm': { '$month': '$time' }, 'd': { '$dayOfMonth': '$time' } } } }, { '$group': { '_id': { 'p': '$path', 'y': '$date.y', 'm': '$date.m', 'd': '$date.d' }, 'hits': { '$sum': 1 } } }, ])
  • 22. Aggregation Framework { ‘ok’: 1, ‘result’: [ { '_id': {'p':’/index.html’,'y': { '_id': {'p':’/index.html’,'y': { '_id': {'p':’/index.html’,'y': { '_id': {'p':’/index.html’,'y': { '_id': {'p':’/index.html’,'y': ] } 2013,'m': 2013,'m': 2013,'m': 2013,'m': 2013,'m': 1,'d': 1,'d': 1,'d': 1,'d': 1,'d': 1 2 3 4 5 }, }, }, }, }, 'hits’: 'hits’: 'hits’: 'hits’: 'hits’: 124 }, 245 }, 322 }, 175 }, 94 }
  • 23. Aggregation Framework Benefits • Real-time • Simple yet powerful interface • Scale-out • Declared in JSON, executes in C++ • Runs inside MongoDB on local data
  • 24. Map Reduce in MongoDB
  • 26. Map Reduce – Map Phase Generate hourly rollups from log data var map = function() { var key = { p: this.path, d: new Date( this.ts.getFullYear(), this.ts.getMonth(), this.ts.getDate(), this.ts.getHours(), 0, 0, 0) }; emit( key, { hits: 1 } ); }
  • 27. Map Reduce – Reduce Phase Generate hourly rollups from log data var reduce = function(key, values) { var r = { hits: 0 }; values.forEach(function(v) { r.hits += v.hits; }); return r; } )
  • 28. Map Reduce - Execution query = { 'ts': { '$gte': new Date(2013, 0, 1), '$lte': new Date(2013, 0, 31) } } db.logs.mapReduce( map, reduce, { ‘query’: query, ‘out’: { ‘reduce’ : ‘stats.monthly’ } } )
  • 29. MongoDB Map/Reduce Benefits • Runs inside MongoDB • Sharding supported • JavaScript – Pro: functionality, expressiveness – Con: overhead • Input can be a collection or query! • Output directly to document or collection • Easy, when you don’t want overhead of Hadoop
  • 34. How it works • Adapter examines MongoDB input collection and calculates a set of splits from data • Each split is assigned to a Hadoop node • In parallel hadoop pulls data from splits on MongoDB (or BSON) and starts processing locally • Hadoop merges results and streams output back to MongoDB (or BSON) output collection
  • 35. Read From MongoDB (or BSON) mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat mongo.input.uri=mongodb://my-db:27017/enron.messages mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat mapred.input.dir= file:///tmp/messages.bson mapred.input.dir= hdfs:///tmp/messages.bson mapred.input.dir= s3:///tmp/messages.bson
  • 36. Write To MongoDB (or BSON) mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat mongo.output.uri=mongodb://my-db:27017/enron.results_out mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat mapred.output.dir= file:///tmp/results.bson mapred.output.dir= hdfs:///tmp/results.bson mapred.output.dir= s3:///tmp/results.bson
  • 37. Document Example { "_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"), "body" : "Here is our forecastnn ", "filename" : "1.", "headers" : { "From" : "phillip.allen@enron.com", "Subject" : "Forecast Info", "X-bcc" : "", "To" : "tim.belden@enron.com", "X-Origin" : "Allen-P", "X-From" : "Phillip K Allen", "Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)", "X-To" : "Tim Belden ", "Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } }
  • 39. Receiver Sender Pairs {"_id": {"t":"bob@enron.com", "f":"alice@enron.com"}, "count" : 14} {"_id": {"t":"bob@enron.com", "f":"eve@enron.com"}, "count" : 9} {"_id": {"t":"alice@enron.com", "f":"charlie@enron.com"}, "count" : 99} {"_id": {"t":"charlie@enron.com", "f":"bob@enron.com"}, "count" : 48} {"_id": {"t":"eve@enron.com", "f":"charlie@enron.com"}, "count" : 20}
  • 40. Map Phase – each document get’s through mapper function @Override public void map(NullWritable key, BSONObject val, final Context context){ BSONObject headers = (BSONObject)val.get("headers"); if(headers.containsKey("From") && headers.containsKey("To")){ String from = (String)headers.get("From"); String to = (String) headers.get("To"); String[] recips = to.split(","); for(int i=0;i<recips.length;i++){ String recip = recips[i].trim(); context.write(new MailPair(from, recip), new IntWritable(1)); } } }
  • 41. Reduce Phase – output Maps are grouped by key and passed to Reducer public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues, final Context pContext ){ int sum = 0; for ( final IntWritable value : pValues ){ sum += value.get(); } BSONObject outDoc = new BasicDBObjectBuilder().start() .add( "f" , pKey.from) .add( "t" , pKey.to ) .get(); BSONWritable pkeyOut = new BSONWritable(outDoc); pContext.write( pkeyOut, new IntWritable(sum) ); }
  • 42. Query Data mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/}) { "_id" : { "t" "f" } { "_id" : { "t" "f" } { "_id" : { "t" "f" { "_id" : { "t" "f" { "_id" : { "t" "f" { "_id" : { "t" "f" { "_id" : { "t" "f" : "kenneth.lay@enron.com", : "15126-1267@m2.innovyx.com" }, "count" : 1 : "kenneth.lay@enron.com", : "2586207@www4.imakenews.com" }, "count" : 1 : : : : : : : : : : "kenneth.lay@enron.com", "40enron@enron.com" }, "count" : 2 } "kenneth.lay@enron.com", "a..davis@enron.com" }, "count" : 2 } "kenneth.lay@enron.com", "a..hughes@enron.com" }, "count" : 4 } "kenneth.lay@enron.com", "a..lindholm@enron.com" }, "count" : 1 } "kenneth.lay@enron.com", "a..schroeder@enron.com" }, "count" : 1 }
  • 43. Hadoop Connector Benefits • Full multi-core parallelism to process MongoDB data • mongo.input.query • Full integration w/ Hadoop and JVM ecosystem • Mahout, et.al. • Can be used on Amazon Elastic MapReduce • Read and write backup files to local, HDFS and S3 • Vanilla Java MapReduce, Hadoop Streaming, Pig, Hive
  • 45. A/B testing • Hey, it looks like teenage girls clicked a lot on that ad with a pink background... • Hypothesis: Given otherwise the same ad, teenage girls are more likely to click on ads with pink backgrounds than white • Test 50-50 pink vs white ads • Collect click stream stats in MongoDB or Hadoop • Analyze results
  • 46. Recommendations – social filtering • ”Customers who bought this book also bought” • Computed offline / nightly • As easy as it sounds! google it: Amazon item-to-item algorithm
  • 47. Personalization • ”Even if you are a teenage girl, you seem to be 60% more likely to click on blue ads than pink.” • User specific recommendations a hybrid of offline & online recommendations • User profile in MongoDB • May even be updated real time