#mongodb #mongodbdays #hadoop 
MongoDB and Hadoop: 
Driving Business Insights 
Sandeep Parikh 
@crcsmnky 
Senior Solutions Architect, MongoDB
Agenda 
• Introduction 
• Use Cases 
• Components 
• Connector 
• Demo 
• Questions
Introduction
Hadoop 
The Apache Hadoop software library is a framework 
that allows for the distributed processing of large data 
sets across clusters of computers using simple 
programming models. 
• Terabyte and Petabyte datasets 
• Data warehousing 
• Advanced analytics
Enterprise IT Stack 
• Applications – CRM, ERP, Collaboration, Mobile, BI 
• Data Management – Operational (RDBMS) alongside Analytical (RDBMS, EDW) 
• Infrastructure – OS & Virtualization, Compute, Storage, Network 
• Cross-cutting – Management & Monitoring, Security & Auditing
Operational vs. Analytical: Enrichment 
• Operational side: applications and interactions 
• Analytical side: warehouse and analytics 
• Each side enriches the other
Operational: MongoDB 
Workload spectrum (operational → analytical): First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis. MongoDB covers the operational end of this spectrum.
Analytical: Hadoop 
The same workload spectrum; Hadoop covers the analytical end, from warehouse & ETL through risk modeling, trade surveillance, predictive analytics, ad targeting, and sentiment analysis.
Operational vs. Analytical: Lifecycle 
The same workload spectrum; over the data lifecycle, records flow from the operational systems into analytical processing, and enriched results flow back.
Use Cases
Commerce 
Applications powered by MongoDB: 
• Products & Inventory 
• Recommended products 
• Customer profile 
• Session management 
Analysis powered by Hadoop: 
• Elastic pricing 
• Recommendation models 
• Predictive analytics 
• Clickstream history 
Data moves between the two sides via the MongoDB Connector for Hadoop.
Insurance 
Applications powered by MongoDB: 
• Customer profiles 
• Insurance policies 
• Session data 
• Call center data 
Analysis powered by Hadoop: 
• Customer action analysis 
• Churn analysis 
• Churn prediction 
• Policy rates 
Data moves between the two sides via the MongoDB Connector for Hadoop.
Fraud Detection 
Online payments processing writes payments to MongoDB. Nightly analysis in Hadoop combines those payments with 3rd-party data sources to build fraud models, moving data through the MongoDB Connector for Hadoop. Results are cached back in MongoDB, where the fraud detection service reads them (query only).
Components
Overview 
The Hadoop stack: HDFS for storage and YARN for resource management at the base, with MapReduce and Spark as processing engines and Pig and Hive layered on top.
HDFS and YARN 
• Hadoop Distributed File System 
– Distributed file-system that stores data on commodity 
machines in a Hadoop cluster 
• YARN 
– Resource management platform responsible for 
managing and scheduling compute resources in a 
Hadoop cluster
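To make the storage layer concrete, here is a minimal sketch that opens a file in HDFS through Hadoop's standard FileSystem API; the namenode URI and file path are assumptions for illustration. 

import java.io.IOException; 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FSDataInputStream; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 

public class HdfsPeek { 
    public static void main(String[] args) throws IOException { 
        // Connect to the cluster's namenode (URI is an assumption for this sketch) 
        Configuration conf = new Configuration(); 
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); 
        // Open a file whose blocks are stored across the cluster's datanodes 
        try (FSDataInputStream in = fs.open(new Path("/tmp/database.bson"))) { 
            byte[] header = new byte[4]; 
            in.readFully(header); // first 4 bytes of a BSON document encode its length 
        } 
    } 
}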
MapReduce 
• Parallel, distributed computation across a Hadoop cluster 
• Processes and/or generates large datasets 
• Simple programming model for individual tasks: 
Map(k1, v1) → list(k2, v2) 
Reduce(k2, list(v2)) → list(v3) 
(Complete Mapper and Reducer examples appear later in this deck.)
Pig 
• High-level platform for creating MapReduce jobs 
• Pig Latin abstracts Java MapReduce into easier-to-use notation 
• Scripts execute as a series of MapReduce applications 
• Supports user-defined functions (UDFs), sketched below
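Since UDFs come up here, a hedged sketch of what one looks like: Pig UDFs are plain Java classes extending EvalFunc. The class name and logic below are made up for illustration. 

import java.io.IOException; 
import org.apache.pig.EvalFunc; 
import org.apache.pig.data.Tuple; 

// Hypothetical UDF: upper-cases a chararray field. 
// In Pig Latin it is registered with REGISTER/DEFINE and then called like a built-in. 
public class UpperCase extends EvalFunc<String> { 
    @Override 
    public String exec(Tuple input) throws IOException { 
        if (input == null || input.size() == 0 || input.get(0) == null) { 
            return null; 
        } 
        return input.get(0).toString().toUpperCase(); 
    } 
}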
Hive 
• Data warehouse infrastructure built on top of 
Hadoop 
• Provides data summarization, query, and analysis 
• HiveQL is a subset of SQL 
• Support for user-defined functions (UDFs)
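For comparison with Pig, a minimal sketch of a Hive UDF: simple Hive UDFs extend org.apache.hadoop.hive.ql.exec.UDF and expose an evaluate method. The class name and logic are assumptions. 

import org.apache.hadoop.hive.ql.exec.UDF; 
import org.apache.hadoop.io.Text; 

// Hypothetical Hive UDF: lower-cases a string column. 
// Used from HiveQL after CREATE TEMPORARY FUNCTION, like any built-in function. 
public final class Lower extends UDF { 
    public Text evaluate(Text input) { 
        if (input == null) { 
            return null; 
        } 
        return new Text(input.toString().toLowerCase()); 
    } 
}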
Spark 
Spark is a fast and powerful engine for 
processing Hadoop data. It is designed to 
perform both general data processing (similar 
to MapReduce) and new workloads like 
streaming, interactive queries, and machine 
learning. 
• Powerful built-in transformations and actions 
– map, reduceByKey, union, distinct, sample, intersection, and 
more 
– foreach, count, collect, take, and many more
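A minimal sketch of that transformation/action split in Spark's Java API (the word list and app name are assumptions): mapToPair and reduceByKey are lazy transformations; collect is the action that triggers execution. 

import java.util.Arrays; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import scala.Tuple2; 

public class WordCount { 
    public static void main(String[] args) { 
        JavaSparkContext sc = new JavaSparkContext( 
            new SparkConf().setAppName("wordcount").setMaster("local")); 
        JavaRDD<String> words = sc.parallelize(Arrays.asList("hadoop", "mongodb", "hadoop")); 
        JavaPairRDD<String, Integer> counts = words 
            .mapToPair(w -> new Tuple2<>(w, 1)) // transformation: (word, 1) pairs 
            .reduceByKey((a, b) -> a + b);      // transformation: sum counts per word 
        counts.collect().forEach(System.out::println); // action: triggers the computation 
        sc.stop(); 
    } 
}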
MongoDB Connector for 
Hadoop
Connector Overview 
• Data – read/write live MongoDB collections, read/write static BSON dumps 
• Tools – MapReduce, Pig, Hive, Spark 
• Platforms – Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
Features and Functionality 
• MongoDB and BSON 
– Input and Output formats 
• Computes input splits for parallel reads 
• Support for 
– Filtering data with MongoDB queries (sketch below) 
– Authentication 
– Reading directly from shard primaries 
– ReadPreferences and replica set tags 
– Appending to existing collections
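For instance, the query filtering above is driven by a connector property; a hedged sketch follows, where the collection URI and the predicate are assumptions. 

import org.apache.hadoop.conf.Configuration; 

public class FilteredInputConfig { 
    public static void main(String[] args) { 
        Configuration conf = new Configuration(); 
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1"); 
        // Only documents matching this query become input records 
        conf.set("mongo.input.query", "{\"year\": {\"$gte\": 2000}}"); 
    } 
}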
MapReduce Configuration 
• MongoDB input 
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat 
– mongo.input.uri = mongodb://mydb:27017/db1.collection1 
• MongoDB output 
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat 
– mongo.output.uri = mongodb://mydb:27017/db1.collection2 
• BSON input/output 
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat 
– mapred.input.dir = hdfs:///tmp/database.bson 
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat 
– mapred.output.dir = hdfs:///tmp/output.bson
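A sketch of wiring these properties into a job driver, using the Mapper and Reducer shown on the next slides; the driver class and job name are assumptions. 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Job; 
import com.mongodb.hadoop.MongoInputFormat; 
import com.mongodb.hadoop.MongoOutputFormat; 

public class Driver { 
    public static void main(String[] args) throws Exception { 
        Configuration conf = new Configuration(); 
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1"); 
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2"); 

        Job job = Job.getInstance(conf, "genre-counts"); 
        job.setJarByClass(Driver.class); 
        job.setMapperClass(Map.class);     // Mapper from the following slide 
        job.setReducerClass(Reduce.class); // Reducer from the following slide 
        job.setMapOutputKeyClass(Text.class); 
        job.setMapOutputValueClass(IntWritable.class); 
        job.setInputFormatClass(MongoInputFormat.class); 
        job.setOutputFormatClass(MongoOutputFormat.class); 
        System.exit(job.waitForCompletion(true) ? 0 : 1); 
    } 
}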
Mapper Example 
public class Map extends Mapper<Object, BSONObject, Text, IntWritable> { 
    @Override 
    public void map(Object key, BSONObject doc, Context context) 
            throws IOException, InterruptedException { 
        // emit (genre, 1) for every genre listed in the movie document 
        List<String> genres = (List<String>) doc.get("genres"); 
        for (String genre : genres) { 
            context.write(new Text(genre), new IntWritable(1)); 
        } 
    } 
} 

Input documents: 
{ _id: ObjectId(…), title: "Toy Story", genres: ["Animation", "Children"] } 
{ _id: ObjectId(…), title: "Goldeneye", genres: ["Action", "Crime", "Thriller"] } 
{ _id: ObjectId(…), title: "Jumanji", genres: ["Adventure", "Children", "Fantasy"] }
Reducer Example 
public class Reduce extends Reducer<Text, IntWritable, NullWritable, BSONWritable> { 
    @Override 
    public void reduce(Text key, Iterable<IntWritable> values, Context context) 
            throws IOException, InterruptedException { 
        // sum the 1s emitted for this genre by the mappers 
        int sum = 0; 
        for (IntWritable value : values) { 
            sum += value.get(); 
        } 
        DBObject object = new BasicDBObjectBuilder().start() 
            .add("genre", key.toString()) 
            .add("count", sum) 
            .get(); 
        BSONWritable doc = new BSONWritable(object); 
        context.write(NullWritable.get(), doc); 
    } 
} 

Output documents: 
{ _id: ObjectId(…), genre: "Action", count: 1370 } 
{ _id: ObjectId(…), genre: "Adventure", count: 957 } 
{ _id: ObjectId(…), genre: "Animation", count: 258 }
Pig – Mappings 
Read: 
– BSONLoader and MongoLoader 
data = LOAD 'mongodb://mydb:27017/db.collection' 
    USING com.mongodb.hadoop.pig.MongoLoader(); 
– Map schema, _id, datatypes 
Insert: 
– BSONStorage and MongoInsertStorage 
STORE records INTO 'hdfs:///output.bson' 
    USING com.mongodb.hadoop.pig.BSONStorage(); 
– Map output id, schema
Update: 
– MongoUpdateStorage 
– Specify query, update operations, schema, update options
Pig Specifics 
• Fixed or dynamic schema with the Loader 
• Types auto-mapped 
– Embedded documents → Map 
– Arrays → Tuple 
• Supply an alias for "_id" 
– "_id" is not a legal Pig variable name (identifiers must start with a letter)
Hive – Tables 
CREATE TABLE mongo_users (id int, name string, age int) 
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" 
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age") 
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users"); 
• Access collections as Hive tables 
• Use with MongoStorageHandler or BSONStorageHandler
Hive Particulars 
• Queries are not (currently) pushed down to MongoDB 
• WHERE predicates are evaluated after reading data from MongoDB 
• Types auto-mapped 
– Embedded documents (mixed types) → STRUCT 
– Embedded documents (single type) → MAP 
– Arrays → ARRAY 
– ObjectId → STRUCT 
• Use EXTERNAL when creating tables; otherwise dropping the Hive table drops the underlying collection
Spark Usage 
• Use with MapReduce 
input/output formats 
• Create Configuration objects 
with input/output formats and 
data URI 
• Load/save data using 
SparkContext Hadoop file or 
RDD APIs
Spark Input Example 
// Reading from MongoDB 
Configuration inputDataConfig = new Configuration(); 
inputDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat"); 
inputDataConfig.set("mongo.input.uri", "mongodb://127.0.0.1/test.foo"); 
JavaPairRDD<Object, BSONObject> inputData = sc.newAPIHadoopRDD( 
    inputDataConfig, MongoInputFormat.class, Object.class, 
    BSONObject.class); 

// Reading BSON from HDFS 
Configuration bsonDataConfig = new Configuration(); 
bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat"); 
JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile( 
    "hdfs://namenode:9000/data/test/foo.bson", 
    BSONFileInputFormat.class, Object.class, 
    BSONObject.class, bsonDataConfig);
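The save side is symmetric, though only the input side is shown above; a hedged sketch that writes a pair RDD back to MongoDB follows. The output collection is an assumption, and the results RDD stands in for output computed earlier. 

import org.apache.hadoop.conf.Configuration; 
import org.bson.BSONObject; 
import com.mongodb.hadoop.MongoOutputFormat; 

// 'results' is assumed to be a JavaPairRDD<Object, BSONObject> computed earlier 
Configuration outputConfig = new Configuration(); 
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1/test.results"); 
results.saveAsNewAPIHadoopFile( 
    "file:///unused",                // path argument is ignored by MongoOutputFormat 
    Object.class, BSONObject.class, 
    MongoOutputFormat.class, outputConfig);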
Data Movement 
Dynamic queries to MongoDB vs. BSON snapshots in HDFS: 
• Dynamic queries operate on the most recent data, but put load on the operational database 
• Snapshots move processing load to Hadoop and add only predictable, snapshot-time load to MongoDB
Demo
MovieWeb
MovieWeb Components 
• MovieLens dataset 
– 10M ratings, 10K movies, 70K users 
• Python web app to browse movies, 
recommendations 
– Flask, PyMongo 
• Spark app computes recommendations 
– MLlib collaborative filtering 
• Predicted ratings are exposed in web app 
– New predictions collection
MovieWeb Web Application 
• Browse 
– Top movies by ratings count 
– Top genres by movie count 
• Log in to 
– See My Ratings 
– Rate movies 
• What’s missing? 
– Movies You May Like 
– Recommendations
Spark Recommender 
• Apache Hadoop 2.3.0 
– HDFS 
• Spark 1.0 
– Execute locally 
– Assign executor resources 
• Data 
– From HDFS 
– To MongoDB
MovieWeb Workflow 
1. Snapshot database as BSON 
2. Store BSON in HDFS 
3. Read BSON into Spark app 
4. Train model from existing ratings (see the sketch after this list) 
5. Create user-movie pairings 
6. Predict ratings for all pairings 
7. Write predictions to MongoDB collection 
8. Web application exposes recommendations 
9. Repeat the process
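A hedged sketch of steps 3 through 6, continuing from the bsonData RDD in the Spark input example and using MLlib's ALS collaborative filter; the BSON field names, rank, and iteration count are assumptions. 

import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.mllib.recommendation.ALS; 
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel; 
import org.apache.spark.mllib.recommendation.Rating; 
import scala.Tuple2; 

// Build Rating objects from the BSON snapshot (field names are assumptions) 
JavaRDD<Rating> ratings = bsonData.map(doc -> new Rating( 
    ((Number) doc._2().get("user_id")).intValue(), 
    ((Number) doc._2().get("movie_id")).intValue(), 
    ((Number) doc._2().get("rating")).doubleValue())); 

// Train a collaborative-filtering model (rank = 10, iterations = 10 are assumptions) 
MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10); 

// Predict ratings for user-movie pairings 
JavaRDD<Tuple2<Object, Object>> pairings = ratings.map( 
    r -> new Tuple2<Object, Object>(r.user(), r.product())); 
JavaRDD<Rating> predictions = model.predict(pairings.rdd()).toJavaRDD();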
Execution 
$ bin/spark-submit \ 
    --master local \ 
    --class com.mongodb.hadoop.demo.Recommender \ 
    --jars mongo-java-2.12.3.jar,mongo-hadoop-core-1.3.0.jar \ 
    --driver-memory 2G \ 
    --executor-memory 1G \ 
    demo-1.0.jar [insert job args here]
Questions? 
• MongoDB Connector for Hadoop 
– http://github.com/mongodb/mongo-hadoop 
• Getting Started with MongoDB and Hadoop 
– http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/ 
• MongoDB-Spark Demo 
– http://github.com/crcsmnky/mongodb-spark-demo
#mongodb #mongodbdays #hadoop 
Thank You 
Sandeep Parikh 
@crcsmnky 
Senior Solutions Architect, MongoDB
