Paris 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall
MongoDB & Hadoop 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall
Agenda 
Evolving Data Landscape 
MongoDB & Hadoop Use Cases 
MongoDB Connector Features 
Demo
Evolving Data Landscape
Hadoop 
“The Apache Hadoop software library is a framework 
that allows for the distributed processing of large 
data sets across clusters of computers using simple 
programming models.” 
• Terabyte and Petabyte datasets 
• Data warehousing 
• Advanced analytics 
http://hadoop.apache.org
Enterprise IT Stack
Operational vs. Analytical: Enrichment 
[Diagram: operational side — applications and interactions; analytical side — warehouse and analytics]
Operational: MongoDB 
[Diagram: use-case spectrum from operational to analytical — First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting — operational side highlighted for MongoDB]
Analytical: Hadoop 
[Diagram: the same use-case spectrum, analytical side highlighted for Hadoop]
Operational & Analytical: Lifecycle 
[Diagram: the same use-case spectrum, showing data cycling between the operational and analytical sides]
MongoDB & Hadoop Use Cases
Commerce
• Applications powered by MongoDB: Products & Inventory, Recommended Products, Customer Profile, Session Management
• Analysis powered by Hadoop: Elastic Pricing, Recommendation Models, Predictive Analytics, Clickstream History
• Data flows in both directions through the MongoDB Connector for Hadoop
Insurance
• Applications powered by MongoDB: Customer Profiles, Insurance Policies, Session Data, Call Center Data
• Analysis powered by Hadoop: Customer Action Analysis, Churn Analysis, Churn Prediction, Policy Rates
• Data flows in both directions through the MongoDB Connector for Hadoop
Fraud Detection
[Diagram: payment systems write to MongoDB; nightly analysis runs in Hadoop via the MongoDB Connector for Hadoop, combined with 3rd-party data sources; results are written back to a results cache in MongoDB; the fraud detection service reads that cache, query only]
MongoDB Connector for Hadoop
Connector Overview 
DATA 
• Read/Write MongoDB 
• Read/Write BSON 
TOOLS 
• MapReduce 
• Pig 
• Hive 
• Spark 
PLATFORMS 
• Apache Hadoop 
• Cloudera CDH 
• Hortonworks HDP 
• MapR 
• Amazon EMR
Connector Features and Functionality 
• Computes splits to read data 
• Single Node, Replica Sets, Sharded Clusters 
• Mappings for Pig and Hive 
• MongoDB as a standard data source/destination 
• Support for 
• Filtering data with MongoDB queries (see example below) 
• Authentication 
• Reading from Replica Set tags 
• Appending to existing collections
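For example, query filtering is driven by the mongo.input.query property alongside the input URI; a minimal sketch in the same key = value style as the configuration slide below (the collection and timestamp field are hypothetical):

mongo.input.uri = mongodb://mydb:27017/db1.events
mongo.input.query = {"ts": {"$gt": 1400000000}}

Only matching documents are read, which is how the timestamp filtering mentioned in the speaker notes is done without pulling the whole collection.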
MapReduce Configuration 
• MongoDB input/output 
mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat 
mongo.input.uri = mongodb://mydb:27017/db1.collection1 
mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat 
mongo.output.uri = mongodb://mydb:27017/db1.collection2 
• BSON input/output 
mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat 
mapred.input.dir = hdfs:///tmp/database.bson 
mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat 
mapred.output.dir = hdfs:///tmp/output.bson
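Wired into a job driver, the MongoDB case looks roughly like the Java sketch below; MongoConfigUtil sets the same mongo.input.uri/mongo.output.uri properties shown above, mongo-hadoop-core is assumed on the classpath, and the mapper/reducer wiring is omitted as job-specific:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting mongo.input.uri / mongo.output.uri by hand
        MongoConfigUtil.setInputURI(conf, "mongodb://mydb:27017/db1.collection1");
        MongoConfigUtil.setOutputURI(conf, "mongodb://mydb:27017/db1.collection2");

        Job job = Job.getInstance(conf, "mongo-hadoop example");
        job.setJarByClass(MongoJobDriver.class);
        // Read documents from MongoDB and write results back to MongoDB
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // job-specific
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}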
Pig Mappings 
• Input: BSONLoader and MongoLoader 
data = LOAD 'mongodb://mydb:27017/db.collection' 
       USING com.mongodb.hadoop.pig.MongoLoader; 
• Output: BSONStorage and MongoInsertStorage 
STORE records INTO 'hdfs:///output.bson' 
      USING com.mongodb.hadoop.pig.BSONStorage;
Hive Support 
• Access collections as Hive tables 
• Use with MongoStorageHandler or BSONStorageHandler 
CREATE TABLE mongo_users (id int, name string, age int) 
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" 
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age") 
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
Spark 
• Use with the MapReduce input/output formats 
• Create Configuration objects with the input/output formats and the data URI 
• Load/save data using the SparkContext Hadoop file API (see the sketch below)
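A minimal Java sketch of that pattern, assuming Spark 1.x with mongo-hadoop-core and the MongoDB Java driver on the classpath (the collection URI is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

public class SparkReadSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("mongo-spark-sketch"));

        // Configuration object carrying the MongoDB input URI
        Configuration mongoConfig = new Configuration();
        mongoConfig.set("mongo.input.uri",
            "mongodb://127.0.0.1:27017/movielens.ratings");

        // Each document arrives as an (id, BSONObject) pair
        JavaPairRDD<Object, BSONObject> ratings = sc.newAPIHadoopRDD(
            mongoConfig, MongoInputFormat.class, Object.class, BSONObject.class);

        System.out.println("documents read: " + ratings.count());
        sc.stop();
    }
}

Writing back goes through the same API in reverse, with saveAsNewAPIHadoopFile and MongoOutputFormat, as sketched in the recommender workflow later.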
Data Movement 
Dynamic queries to MongoDB vs. BSON snapshots in HDFS: 
• Dynamic queries work with the most recent data, but put load on the operational database 
• Snapshots move the load to Hadoop, and add predictable load to MongoDB
Demo: Recommendation Platform
Movie Web
MovieWeb Web Application 
• Browse 
- Top movies by ratings count 
- Top genres by movie count 
• Log in to 
- See My Ratings 
- Rate movies 
• Recommendations 
- Movies You May Like 
- Recommendations
MovieWeb Components 
• MovieLens dataset 
– 10M ratings, 10K movies, 70K users 
– http://grouplens.org/datasets/movielens/ 
• Python web app to browse movies, recommendations 
– Flask, PyMongo 
• Spark app computes recommendations 
– MLlib collaborative filtering 
• Predicted ratings are exposed in web app 
– New predictions collection
Spark Recommender 
• Apache Hadoop (2.3) 
- HDFS & YARN 
• Spark (1.0) 
- Execute within YARN 
- Assign executor resources 
• Data 
- From HDFS, MongoDB 
- To MongoDB
MovieWeb Workflow 
1. Snapshot the database as BSON 
2. Store the BSON in HDFS 
3. Read the BSON into the Spark app 
4. Train the model from existing ratings 
5. Create user/movie pairings 
6. Predict ratings for all pairings 
7. Write the predictions to a MongoDB collection 
8. Web application exposes the recommendations 
9. Repeat the process
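Steps 4-7 map onto Spark's MLlib roughly as follows; a sketch, not the workshop's exact code, assuming the ratings and user/movie pairings have already been loaded as RDDs (the ALS parameters mirror the speaker notes: rank 10, 10 iterations, lambda 0.01; the output URI and field names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import com.mongodb.hadoop.MongoOutputFormat;
import scala.Tuple2;

public class RecommendSketch {
    static void recommend(JavaRDD<Rating> ratings,
                          JavaRDD<Tuple2<Object, Object>> userMoviePairs) {
        // Train the collaborative filter: rank 10, 10 iterations, lambda 0.01
        MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);

        // Predict a rating for every user/movie pairing
        JavaRDD<Rating> predictions =
            model.predict(userMoviePairs.rdd()).toJavaRDD();

        // Write predictions back to MongoDB through the connector;
        // MongoOutputFormat ignores the file path argument
        Configuration outputConfig = new Configuration();
        outputConfig.set("mongo.output.uri",
            "mongodb://127.0.0.1:27017/movielens.predictions");
        predictions.mapToPair(r -> {
            BSONObject doc = new BasicBSONObject();
            doc.put("userid", r.user());
            doc.put("movieid", r.product());
            doc.put("rating", r.rating());
            return new Tuple2<Object, BSONObject>(null, doc);
        }).saveAsNewAPIHadoopFile("file:///unused", Object.class,
            BSONObject.class, MongoOutputFormat.class, outputConfig);
    }
}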
Execution 
$ spark-submit --master local \
    --driver-memory 2G --executor-memory 2G \
    --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
    --class com.mongodb.workshop.SparkExercise \
    ./target/spark-1.0-SNAPSHOT.jar \
    hdfs://localhost:9000 \
    mongodb://127.0.0.1:27017/movielens \
    predictions
Should I use MongoDB or Hadoop?
Business First! 
[Diagram: the use-case spectrum mapped from business need (what/why) to technology (how) — First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting]
The right tool for the task 
• Dataset size 
• Data processing complexity 
• Continuous improvement 
V1.0
The right tool for the task 
• Dataset size 
• Data processing complexity 
• Continuous improvement 
V2.0
Resources / Questions 
• MongoDB Connector for Hadoop 
- http://github.com/mongodb/mongo-hadoop 
• Getting Started with MongoDB and Hadoop 
- http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/ 
• MongoDB-Spark Demo 
- https://github.com/crcsmnky/mongodb-hadoop-workshop
MongoDB & Hadoop 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall


Editor's Notes

  • #6 Apache definition: a framework that enables many things. A distributed file system is one of the core components; another is MapReduce. Today it is more about YARN, the resource manager, with MapReduce just one type of job you can manage. MongoDB: gigabytes and terabytes. Hadoop: terabytes and petabytes.
  • #7 You have two places where you deal with data.
  • #8 You have to think about "enrichment": MongoDB is here to enrich data that live in Hadoop, and Hadoop is here to enrich data that live in MongoDB. Let's look at the different use cases, operational versus analytical.
  • #9 First-level analytics can be done in MongoDB: "what is your application talking to?"
  • #10 Hadoop is there to analyze the bigger problem and do heavier processing. We are talking about Hadoop when it is petabytes of data.
  • #11 We are trying to solve the bigger problem by connecting the two technologies where it makes sense.
  • #18 Split the data when reading (mapper). Also filtering queries, for example to take data from a specific timestamp and reduce the load on your cluster; reading from replica set tags; and, a new feature people asked for, appending results to an existing collection.
  • #22 Spark is a newer data-processing engine that works mostly in memory. It takes all the power of the connector: open a new "Hadoop file" that is loaded into an RDD (Resilient Distributed Dataset).
  • #29 Load data: read users from MongoDB (users collection), movies from BSON (HDFS), and ratings from MongoDB (ratings collection). Data processing: generate user/movie pairs with users.cartesian(movies); train the collaborative filter with ALS.train(ratings.rdd(), 10, 10, 0.01); predict/recommend; save the results into the MongoDB predictions collection.