#mongodb #mongodbdays #hadoop 
MongoDB and Hadoop: 
Driving Business Insights 
Sandeep Parikh 
@crcsmnky 
Senior Solutions Architect, MongoDB
Agenda 
• Introduction 
• Use Cases 
• Components 
• Connector 
• Demo 
• Questions
Introduction
Hadoop 
The Apache Hadoop software library is a framework 
that allows for the distributed processing of large data 
sets across clusters of computers using simple 
programming models. 
• Terabyte and Petabyte datasets 
• Data warehousing 
• Advanced analytics
Enterprise IT Stack 
• Applications – CRM, ERP, Collaboration, Mobile, BI 
• Data Management – Operational (RDBMS) alongside Analytical (RDBMS, EDW) 
• Infrastructure – OS & Virtualization, Compute, Storage, Network 
• Cross-cutting – Management & Monitoring, Security & Auditing
Operational vs. Analytical: Enrichment 
• Operational side: applications and interactions 
• Analytical side: warehouse and analytics 
• Each side enriches the other
Operational: MongoDB 
Workload spectrum (operational → analytical): First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis. MongoDB covers the operational end of this spectrum.
Analytical: Hadoop 
The same workload spectrum; Hadoop covers the analytical end, from warehouse & ETL through risk modeling, trade surveillance, predictive analytics, ad targeting, and sentiment analysis.
Operational vs. Analytical: Lifecycle 
The same workload spectrum; over the data lifecycle, records flow from the operational systems into analytical processing, and enriched results flow back.
Use Cases
Commerce 
Applications powered by MongoDB: 
• Products & Inventory 
• Recommended products 
• Customer profile 
• Session management 
Analysis powered by Hadoop: 
• Elastic pricing 
• Recommendation models 
• Predictive analytics 
• Clickstream history 
Data moves between the two sides via the MongoDB Connector for Hadoop.
Insurance 
Applications powered by MongoDB: 
• Customer profiles 
• Insurance policies 
• Session data 
• Call center data 
Analysis powered by Hadoop: 
• Customer action analysis 
• Churn analysis 
• Churn prediction 
• Policy rates 
Data moves between the two sides via the MongoDB Connector for Hadoop.
Fraud Detection 
Online payments processing writes payments to MongoDB. Nightly analysis in Hadoop combines those payments with 3rd-party data sources to build fraud models, moving data through the MongoDB Connector for Hadoop. Results are cached back in MongoDB, where the fraud detection service reads them (query only).
Components
Overview 
The Hadoop stack: HDFS for storage and YARN for resource management at the base, with MapReduce and Spark as processing engines and Pig and Hive layered on top.
HDFS and YARN 
• Hadoop Distributed File System 
– Distributed file-system that stores data on commodity 
machines in a Hadoop cluster 
• YARN 
– Resource management platform responsible for 
managing and scheduling compute resources in a 
Hadoop cluster
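To make the storage layer concrete, here is a minimal sketch that opens a file in HDFS through Hadoop's standard FileSystem API; the namenode URI and file path are assumptions for illustration. 

import java.io.IOException; 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FSDataInputStream; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 

public class HdfsPeek { 
    public static void main(String[] args) throws IOException { 
        // Connect to the cluster's namenode (URI is an assumption for this sketch) 
        Configuration conf = new Configuration(); 
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf); 
        // Open a file whose blocks are stored across the cluster's datanodes 
        try (FSDataInputStream in = fs.open(new Path("/tmp/database.bson"))) { 
            byte[] header = new byte[4]; 
            in.readFully(header); // first 4 bytes of a BSON document encode its length 
        } 
    } 
}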
MapReduce 
• Parallel, distributed computation across a Hadoop cluster 
• Processes and/or generates large datasets 
• Simple programming model for individual tasks: 
Map(k1, v1) → list(k2, v2) 
Reduce(k2, list(v2)) → list(v3) 
(Complete Mapper and Reducer examples appear later in this deck.)
Pig 
• High-level platform for creating MapReduce jobs 
• Pig Latin abstracts Java MapReduce into easier-to-use notation 
• Scripts execute as a series of MapReduce applications 
• Supports user-defined functions (UDFs), sketched below
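Since UDFs come up here, a hedged sketch of what one looks like: Pig UDFs are plain Java classes extending EvalFunc. The class name and logic below are made up for illustration. 

import java.io.IOException; 
import org.apache.pig.EvalFunc; 
import org.apache.pig.data.Tuple; 

// Hypothetical UDF: upper-cases a chararray field. 
// In Pig Latin it is registered with REGISTER/DEFINE and then called like a built-in. 
public class UpperCase extends EvalFunc<String> { 
    @Override 
    public String exec(Tuple input) throws IOException { 
        if (input == null || input.size() == 0 || input.get(0) == null) { 
            return null; 
        } 
        return input.get(0).toString().toUpperCase(); 
    } 
}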
Hive 
• Data warehouse infrastructure built on top of 
Hadoop 
• Provides data summarization, query, and analysis 
• HiveQL is a subset of SQL 
• Support for user-defined functions (UDFs)
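For comparison with Pig, a minimal sketch of a Hive UDF: simple Hive UDFs extend org.apache.hadoop.hive.ql.exec.UDF and expose an evaluate method. The class name and logic are assumptions. 

import org.apache.hadoop.hive.ql.exec.UDF; 
import org.apache.hadoop.io.Text; 

// Hypothetical Hive UDF: lower-cases a string column. 
// Used from HiveQL after CREATE TEMPORARY FUNCTION, like any built-in function. 
public final class Lower extends UDF { 
    public Text evaluate(Text input) { 
        if (input == null) { 
            return null; 
        } 
        return new Text(input.toString().toLowerCase()); 
    } 
}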
Spark 
Spark is a fast and powerful engine for 
processing Hadoop data. It is designed to 
perform both general data processing (similar 
to MapReduce) and new workloads like 
streaming, interactive queries, and machine 
learning. 
• Powerful built-in transformations and actions 
– map, reduceByKey, union, distinct, sample, intersection, and 
more 
– foreach, count, collect, take, and many more
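A minimal sketch of that transformation/action split in Spark's Java API (the word list and app name are assumptions): mapToPair and reduceByKey are lazy transformations; collect is the action that triggers execution. 

import java.util.Arrays; 
import org.apache.spark.SparkConf; 
import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.JavaSparkContext; 
import scala.Tuple2; 

public class WordCount { 
    public static void main(String[] args) { 
        JavaSparkContext sc = new JavaSparkContext( 
            new SparkConf().setAppName("wordcount").setMaster("local")); 
        JavaRDD<String> words = sc.parallelize(Arrays.asList("hadoop", "mongodb", "hadoop")); 
        JavaPairRDD<String, Integer> counts = words 
            .mapToPair(w -> new Tuple2<>(w, 1)) // transformation: (word, 1) pairs 
            .reduceByKey((a, b) -> a + b);      // transformation: sum counts per word 
        counts.collect().forEach(System.out::println); // action: triggers the computation 
        sc.stop(); 
    } 
}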
MongoDB Connector for 
Hadoop
Connector Overview 
• Data – read/write live MongoDB collections, read/write static BSON dumps 
• Tools – MapReduce, Pig, Hive, Spark 
• Platforms – Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
Features and Functionality 
• MongoDB and BSON 
– Input and Output formats 
• Computes input splits for parallel reads 
• Support for 
– Filtering data with MongoDB queries (sketch below) 
– Authentication 
– Reading directly from shard primaries 
– ReadPreferences and replica set tags 
– Appending to existing collections
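For instance, the query filtering above is driven by a connector property; a hedged sketch follows, where the collection URI and the predicate are assumptions. 

import org.apache.hadoop.conf.Configuration; 

public class FilteredInputConfig { 
    public static void main(String[] args) { 
        Configuration conf = new Configuration(); 
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1"); 
        // Only documents matching this query become input records 
        conf.set("mongo.input.query", "{\"year\": {\"$gte\": 2000}}"); 
    } 
}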
MapReduce Configuration 
• MongoDB input 
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat 
– mongo.input.uri = mongodb://mydb:27017/db1.collection1 
• MongoDB output 
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat 
– mongo.output.uri = mongodb://mydb:27017/db1.collection2 
• BSON input/output 
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat 
– mapred.input.dir = hdfs:///tmp/database.bson 
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat 
– mapred.output.dir = hdfs:///tmp/output.bson
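A sketch of wiring these properties into a job driver, using the Mapper and Reducer shown on the next slides; the driver class and job name are assumptions. 

import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Job; 
import com.mongodb.hadoop.MongoInputFormat; 
import com.mongodb.hadoop.MongoOutputFormat; 

public class Driver { 
    public static void main(String[] args) throws Exception { 
        Configuration conf = new Configuration(); 
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1"); 
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2"); 

        Job job = Job.getInstance(conf, "genre-counts"); 
        job.setJarByClass(Driver.class); 
        job.setMapperClass(Map.class);     // Mapper from the following slide 
        job.setReducerClass(Reduce.class); // Reducer from the following slide 
        job.setMapOutputKeyClass(Text.class); 
        job.setMapOutputValueClass(IntWritable.class); 
        job.setInputFormatClass(MongoInputFormat.class); 
        job.setOutputFormatClass(MongoOutputFormat.class); 
        System.exit(job.waitForCompletion(true) ? 0 : 1); 
    } 
}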
Mapper Example 
public class Map extends Mapper<Object, BSONObject, Text, IntWritable> { 
    @Override 
    public void map(Object key, BSONObject doc, Context context) 
            throws IOException, InterruptedException { 
        // emit (genre, 1) for every genre listed in the movie document 
        List<String> genres = (List<String>) doc.get("genres"); 
        for (String genre : genres) { 
            context.write(new Text(genre), new IntWritable(1)); 
        } 
    } 
} 

Input documents: 
{ _id: ObjectId(…), title: "Toy Story", genres: ["Animation", "Children"] } 
{ _id: ObjectId(…), title: "Goldeneye", genres: ["Action", "Crime", "Thriller"] } 
{ _id: ObjectId(…), title: "Jumanji", genres: ["Adventure", "Children", "Fantasy"] }
Reducer Example 
public class Reduce extends Reducer<Text, IntWritable, NullWritable, BSONWritable> { 
    @Override 
    public void reduce(Text key, Iterable<IntWritable> values, Context context) 
            throws IOException, InterruptedException { 
        // sum the 1s emitted for this genre by the mappers 
        int sum = 0; 
        for (IntWritable value : values) { 
            sum += value.get(); 
        } 
        DBObject object = new BasicDBObjectBuilder().start() 
            .add("genre", key.toString()) 
            .add("count", sum) 
            .get(); 
        BSONWritable doc = new BSONWritable(object); 
        context.write(NullWritable.get(), doc); 
    } 
} 

Output documents: 
{ _id: ObjectId(…), genre: "Action", count: 1370 } 
{ _id: ObjectId(…), genre: "Adventure", count: 957 } 
{ _id: ObjectId(…), genre: "Animation", count: 258 }
Pig – Mappings 
Read: 
– BSONLoader and MongoLoader 
data = LOAD 'mongodb://mydb:27017/db.collection' 
    USING com.mongodb.hadoop.pig.MongoLoader(); 
– Map schema, _id, datatypes 
Insert: 
– BSONStorage and MongoInsertStorage 
STORE records INTO 'hdfs:///output.bson' 
    USING com.mongodb.hadoop.pig.BSONStorage(); 
– Map output id, schema
Update: 
– MongoUpdateStorage 
– Specify query, update operations, schema, update options
Pig Specifics 
• Fixed or dynamic schema with the Loader 
• Types auto-mapped 
– Embedded documents → Map 
– Arrays → Tuple 
• Supply an alias for "_id" 
– "_id" is not a legal Pig variable name (identifiers must start with a letter)
Hive – Tables 
CREATE TABLE mongo_users (id int, name string, age int) 
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" 
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age") 
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users"); 
• Access collections as Hive tables 
• Use with MongoStorageHandler or BSONStorageHandler
Hive Particulars 
• Queries are not (currently) pushed down to MongoDB 
• WHERE predicates are evaluated after reading data from MongoDB 
• Types auto-mapped 
– Embedded documents (mixed types) → STRUCT 
– Embedded documents (single type) → MAP 
– Arrays → ARRAY 
– ObjectId → STRUCT 
• Use EXTERNAL when creating tables; otherwise dropping the Hive table drops the underlying collection
Spark Usage 
• Use with MapReduce 
input/output formats 
• Create Configuration objects 
with input/output formats and 
data URI 
• Load/save data using 
SparkContext Hadoop file or 
RDD APIs
Spark Input Example 
// Reading from MongoDB 
Configuration inputDataConfig = new Configuration(); 
inputDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.MongoInputFormat"); 
inputDataConfig.set("mongo.input.uri", "mongodb://127.0.0.1/test.foo"); 
JavaPairRDD<Object, BSONObject> inputData = sc.newAPIHadoopRDD( 
    inputDataConfig, MongoInputFormat.class, Object.class, 
    BSONObject.class); 

// Reading BSON from HDFS 
Configuration bsonDataConfig = new Configuration(); 
bsonDataConfig.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat"); 
JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile( 
    "hdfs://namenode:9000/data/test/foo.bson", 
    BSONFileInputFormat.class, Object.class, 
    BSONObject.class, bsonDataConfig);
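The save side is symmetric, though only the input side is shown above; a hedged sketch that writes a pair RDD back to MongoDB follows. The output collection is an assumption, and the results RDD stands in for output computed earlier. 

import org.apache.hadoop.conf.Configuration; 
import org.bson.BSONObject; 
import com.mongodb.hadoop.MongoOutputFormat; 

// 'results' is assumed to be a JavaPairRDD<Object, BSONObject> computed earlier 
Configuration outputConfig = new Configuration(); 
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1/test.results"); 
results.saveAsNewAPIHadoopFile( 
    "file:///unused",                // path argument is ignored by MongoOutputFormat 
    Object.class, BSONObject.class, 
    MongoOutputFormat.class, outputConfig);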
Data Movement 
Dynamic queries to MongoDB vs. BSON snapshots in HDFS: 
• Dynamic queries operate on the most recent data, but put load on the operational database 
• Snapshots move processing load to Hadoop and add only predictable, snapshot-time load to MongoDB
Demo
MovieWeb
MovieWeb Components 
• MovieLens dataset 
– 10M ratings, 10K movies, 70K users 
• Python web app to browse movies, 
recommendations 
– Flask, PyMongo 
• Spark app computes recommendations 
– MLlib collaborative filtering 
• Predicted ratings are exposed in web app 
– New predictions collection
MovieWeb Web Application 
• Browse 
– Top movies by ratings count 
– Top genres by movie count 
• Log in to 
– See My Ratings 
– Rate movies 
• What’s missing? 
– Movies You May Like 
– Recommendations
Spark Recommender 
• Apache Hadoop 2.3.0 
– HDFS 
• Spark 1.0 
– Execute locally 
– Assign executor resources 
• Data 
– From HDFS 
– To MongoDB
MovieWeb Workflow 
1. Snapshot database as BSON 
2. Store BSON in HDFS 
3. Read BSON into Spark app 
4. Train model from existing ratings (see the sketch after this list) 
5. Create user-movie pairings 
6. Predict ratings for all pairings 
7. Write predictions to MongoDB collection 
8. Web application exposes recommendations 
9. Repeat the process
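A hedged sketch of steps 3 through 6, continuing from the bsonData RDD in the Spark input example and using MLlib's ALS collaborative filter; the BSON field names, rank, and iteration count are assumptions. 

import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.mllib.recommendation.ALS; 
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel; 
import org.apache.spark.mllib.recommendation.Rating; 
import scala.Tuple2; 

// Build Rating objects from the BSON snapshot (field names are assumptions) 
JavaRDD<Rating> ratings = bsonData.map(doc -> new Rating( 
    ((Number) doc._2().get("user_id")).intValue(), 
    ((Number) doc._2().get("movie_id")).intValue(), 
    ((Number) doc._2().get("rating")).doubleValue())); 

// Train a collaborative-filtering model (rank = 10, iterations = 10 are assumptions) 
MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10); 

// Predict ratings for user-movie pairings 
JavaRDD<Tuple2<Object, Object>> pairings = ratings.map( 
    r -> new Tuple2<Object, Object>(r.user(), r.product())); 
JavaRDD<Rating> predictions = model.predict(pairings.rdd()).toJavaRDD();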
Execution 
$ bin/spark-submit \ 
    --master local \ 
    --class com.mongodb.hadoop.demo.Recommender \ 
    --jars mongo-java-2.12.3.jar,mongo-hadoop-core-1.3.0.jar \ 
    --driver-memory 2G \ 
    --executor-memory 1G \ 
    demo-1.0.jar [insert job args here]
Questions? 
• MongoDB Connector for Hadoop 
– http://github.com/mongodb/mongo-hadoop 
• Getting Started with MongoDB and Hadoop 
– http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/ 
• MongoDB-Spark Demo 
– http://github.com/crcsmnky/mongodb-spark-demo
#mongodb #mongodbdays #hadoop 
Thank You 
Sandeep Parikh 
@crcsmnky 
Senior Solutions Architect, MongoDB
