Paris 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall
MongoDB & Hadoop 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall
Agenda 
Evolving Data Landscape 
MongoDB & Hadoop Use Cases 
MongoDB Connector Features 
Demo
Evolving Data Landscape
Hadoop 
“The Apache Hadoop software library is a framework 
that allows for the distributed processing of large 
data sets across clusters of computers using simple 
programming models.” 
• Terabyte and Petabyte datasets 
• Data warehousing 
• Advanced analytics 
http://hadoop.apache.org
Enterprise IT Stack
Operational vs. Analytical: Enrichment 
[Diagram: operational side — applications and interactions; analytical side — warehouse and analytics]
Operational: MongoDB 
[Diagram: use-case spectrum from operational to analytical — First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting — operational side highlighted for MongoDB]
Analytical: Hadoop 
[Diagram: the same use-case spectrum, analytical side highlighted for Hadoop]
Operational & Analytical: Lifecycle 
[Diagram: the same use-case spectrum, showing data cycling between the operational and analytical sides]
MongoDB & Hadoop Use Cases
Commerce
• Applications powered by MongoDB: Products & Inventory, Recommended Products, Customer Profile, Session Management
• Analysis powered by Hadoop: Elastic Pricing, Recommendation Models, Predictive Analytics, Clickstream History
• Data flows in both directions through the MongoDB Connector for Hadoop
Insurance
• Applications powered by MongoDB: Customer Profiles, Insurance Policies, Session Data, Call Center Data
• Analysis powered by Hadoop: Customer Action Analysis, Churn Analysis, Churn Prediction, Policy Rates
• Data flows in both directions through the MongoDB Connector for Hadoop
Fraud Detection
[Diagram: payment systems write to MongoDB; nightly analysis runs in Hadoop via the MongoDB Connector for Hadoop, combined with 3rd-party data sources; results are written back to a results cache in MongoDB; the fraud detection service reads that cache, query only]
MongoDB Connector for Hadoop
Connector Overview 
DATA 
• Read/Write MongoDB 
• Read/Write BSON 
TOOLS 
• MapReduce 
• Pig 
• Hive 
• Spark 
PLATFORMS 
• Apache Hadoop 
• Cloudera CDH 
• Hortonworks HDP 
• MapR 
• Amazon EMR
Connector Features and Functionality 
• Computes splits to read data 
• Single Node, Replica Sets, Sharded Clusters 
• Mappings for Pig and Hive 
• MongoDB as a standard data source/destination 
• Support for 
• Filtering data with MongoDB queries (see example below) 
• Authentication 
• Reading from Replica Set tags 
• Appending to existing collections
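For example, query filtering is driven by the mongo.input.query property alongside the input URI; a minimal sketch in the same key = value style as the configuration slide below (the collection and timestamp field are hypothetical):

mongo.input.uri = mongodb://mydb:27017/db1.events
mongo.input.query = {"ts": {"$gt": 1400000000}}

Only matching documents are read, which is how the timestamp filtering mentioned in the speaker notes is done without pulling the whole collection.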
MapReduce Configuration 
• MongoDB input/output 
mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat 
mongo.input.uri = mongodb://mydb:27017/db1.collection1 
mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat 
mongo.output.uri = mongodb://mydb:27017/db1.collection2 
• BSON input/output 
mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat 
mapred.input.dir = hdfs:///tmp/database.bson 
mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat 
mapred.output.dir = hdfs:///tmp/output.bson
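Wired into a job driver, the MongoDB case looks roughly like the Java sketch below; MongoConfigUtil sets the same mongo.input.uri/mongo.output.uri properties shown above, mongo-hadoop-core is assumed on the classpath, and the mapper/reducer wiring is omitted as job-specific:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting mongo.input.uri / mongo.output.uri by hand
        MongoConfigUtil.setInputURI(conf, "mongodb://mydb:27017/db1.collection1");
        MongoConfigUtil.setOutputURI(conf, "mongodb://mydb:27017/db1.collection2");

        Job job = Job.getInstance(conf, "mongo-hadoop example");
        job.setJarByClass(MongoJobDriver.class);
        // Read documents from MongoDB and write results back to MongoDB
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // job-specific
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}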
Pig Mappings 
• Input: BSONLoader and MongoLoader 
data = LOAD 'mongodb://mydb:27017/db.collection' 
       USING com.mongodb.hadoop.pig.MongoLoader; 
• Output: BSONStorage and MongoInsertStorage 
STORE records INTO 'hdfs:///output.bson' 
      USING com.mongodb.hadoop.pig.BSONStorage;
Hive Support 
• Access collections as Hive tables 
• Use with MongoStorageHandler or BSONStorageHandler 
CREATE TABLE mongo_users (id int, name string, age int) 
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" 
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age") 
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
Spark 
• Use with the MapReduce input/output formats 
• Create Configuration objects with the input/output formats and the data URI 
• Load/save data using the SparkContext Hadoop file API (see the sketch below)
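A minimal Java sketch of that pattern, assuming Spark 1.x with mongo-hadoop-core and the MongoDB Java driver on the classpath (the collection URI is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import com.mongodb.hadoop.MongoInputFormat;

public class SparkReadSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("mongo-spark-sketch"));

        // Configuration object carrying the MongoDB input URI
        Configuration mongoConfig = new Configuration();
        mongoConfig.set("mongo.input.uri",
            "mongodb://127.0.0.1:27017/movielens.ratings");

        // Each document arrives as an (id, BSONObject) pair
        JavaPairRDD<Object, BSONObject> ratings = sc.newAPIHadoopRDD(
            mongoConfig, MongoInputFormat.class, Object.class, BSONObject.class);

        System.out.println("documents read: " + ratings.count());
        sc.stop();
    }
}

Writing back goes through the same API in reverse, with saveAsNewAPIHadoopFile and MongoOutputFormat, as sketched in the recommender workflow later.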
Data Movement 
Dynamic queries to MongoDB vs. BSON snapshots in HDFS: 
• Dynamic queries work with the most recent data, but put load on the operational database 
• Snapshots move the load to Hadoop, and add predictable load to MongoDB
Demo: Recommendation Platform
Movie Web
MovieWeb Web Application 
• Browse 
- Top movies by ratings count 
- Top genres by movie count 
• Log in to 
- See My Ratings 
- Rate movies 
• Recommendations 
- Movies You May Like 
- Recommendations
MovieWeb Components 
• MovieLens dataset 
– 10M ratings, 10K movies, 70K users 
– http://grouplens.org/datasets/movielens/ 
• Python web app to browse movies, recommendations 
– Flask, PyMongo 
• Spark app computes recommendations 
– MLlib collaborative filtering 
• Predicted ratings are exposed in web app 
– New predictions collection
Spark Recommender 
• Apache Hadoop (2.3) 
- HDFS & YARN 
• Spark (1.0) 
- Execute within YARN 
- Assign executor resources 
• Data 
- From HDFS, MongoDB 
- To MongoDB
MovieWeb Workflow 
1. Snapshot the database as BSON 
2. Store the BSON in HDFS 
3. Read the BSON into the Spark app 
4. Train the model from existing ratings 
5. Create user/movie pairings 
6. Predict ratings for all pairings 
7. Write the predictions to a MongoDB collection 
8. Web application exposes the recommendations 
9. Repeat the process
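Steps 4-7 map onto Spark's MLlib roughly as follows; a sketch, not the workshop's exact code, assuming the ratings and user/movie pairings have already been loaded as RDDs (the ALS parameters mirror the speaker notes: rank 10, 10 iterations, lambda 0.01; the output URI and field names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import com.mongodb.hadoop.MongoOutputFormat;
import scala.Tuple2;

public class RecommendSketch {
    static void recommend(JavaRDD<Rating> ratings,
                          JavaRDD<Tuple2<Object, Object>> userMoviePairs) {
        // Train the collaborative filter: rank 10, 10 iterations, lambda 0.01
        MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);

        // Predict a rating for every user/movie pairing
        JavaRDD<Rating> predictions =
            model.predict(userMoviePairs.rdd()).toJavaRDD();

        // Write predictions back to MongoDB through the connector;
        // MongoOutputFormat ignores the file path argument
        Configuration outputConfig = new Configuration();
        outputConfig.set("mongo.output.uri",
            "mongodb://127.0.0.1:27017/movielens.predictions");
        predictions.mapToPair(r -> {
            BSONObject doc = new BasicBSONObject();
            doc.put("userid", r.user());
            doc.put("movieid", r.product());
            doc.put("rating", r.rating());
            return new Tuple2<Object, BSONObject>(null, doc);
        }).saveAsNewAPIHadoopFile("file:///unused", Object.class,
            BSONObject.class, MongoOutputFormat.class, outputConfig);
    }
}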
Execution 
$ spark-submit --master local \
    --driver-memory 2G --executor-memory 2G \
    --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
    --class com.mongodb.workshop.SparkExercise \
    ./target/spark-1.0-SNAPSHOT.jar \
    hdfs://localhost:9000 \
    mongodb://127.0.0.1:27017/movielens \
    predictions
Should I use MongoDB or Hadoop?
Business First! 
[Diagram: the use-case spectrum mapped from business need (what/why) to technology (how) — First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting]
The right tool for the task 
• Dataset size 
• Data processing complexity 
• Continuous improvement 
V1.0
The right tool for the task 
• Dataset size 
• Data processing complexity 
• Continuous improvement 
V2.0
Resources / Questions 
• MongoDB Connector for Hadoop 
- http://github.com/mongodb/mongo-hadoop 
• Getting Started with MongoDB and Hadoop 
- http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/ 
• MongoDB-Spark Demo 
- https://github.com/crcsmnky/mongodb-hadoop-workshop
MongoDB & Hadoop 
Tugdual Grall 
Technical Evangelist 
tug@mongodb.com 
@tgrall


Editor's Notes

  • #6 Apache definition: a framework that enables many things. A distributed file system is one of the core components; another is MapReduce. Today it is more about YARN, the resource manager, with MapReduce just one type of job you can manage. MongoDB: gigabytes and terabytes. Hadoop: terabytes and petabytes.
  • #7 You have two places where you deal with data.
  • #8 You have to think about "enrichment": MongoDB is here to enrich data that live in Hadoop, and Hadoop is here to enrich data that live in MongoDB. Let's look at the different use cases, operational versus analytical.
  • #9 First-level analytics can be done in MongoDB: "what is your application talking to?"
  • #10 Hadoop is there to analyze the bigger problem and do heavier processing. We are talking about Hadoop when it is petabytes of data.
  • #11 We are trying to solve the bigger problem by connecting the two technologies where it makes sense.
  • #18 Split the data when reading (mapper). Also filtering queries, for example to take data from a specific timestamp and reduce the load on your cluster; reading from replica set tags; and, a new feature people asked for, appending results to an existing collection.
  • #22 Spark is a newer data-processing engine that works mostly in memory. It takes all the power of the connector: open a new "Hadoop file" that is loaded into an RDD (Resilient Distributed Dataset).
  • #29 Load data: read users from MongoDB (users collection), movies from BSON (HDFS), and ratings from MongoDB (ratings collection). Data processing: generate user/movie pairs with users.cartesian(movies); train the collaborative filter with ALS.train(ratings.rdd(), 10, 10, 0.01); predict/recommend; save the results into the MongoDB predictions collection.