MongoDB and Hadoop: Driving Business Insights

  1. MongoDB and Hadoop: Driving Business Insights
     Sandeep Parikh, Partner Technical Solutions, MongoDB
     #MongoDBWorld
  2. Agenda
     • Evolving Data Landscape
     • MongoDB & Hadoop Use Cases
     • MongoDB Connector Features
     • Demo
  3. Evolving Data Landscape
  4. Hadoop
     The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
     • Terabyte and petabyte datasets
     • Data warehousing
     • Advanced analytics
  5. Enterprise IT Stack (diagram)
     • Applications: CRM, ERP, Collaboration, Mobile, BI
     • Data Management: RDBMS (operational); RDBMS, EDW (analytical)
     • Infrastructure: OS & Virtualization, Compute, Storage, Network
     • Cross-cutting: Management & Monitoring, Security & Auditing
  6. Operational vs. Analytical: Enrichment (diagram)
     • Operational side: applications, interactions
     • Analytical side: warehouse, analytics
  7. Operational: MongoDB (workload diagram)
     Real-Time Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis
  8. Analytical: Hadoop (same workload diagram as slide 7)
  9. Operational vs. Analytical: Lifecycle (same workload diagram as slide 7)
  10. MongoDB & Hadoop Use Cases
  11. Commerce
      Applications powered by MongoDB:
      • Products & Inventory
      • Recommended products
      • Customer profile
      • Session management
      Analysis powered by Hadoop:
      • Elastic pricing
      • Recommendation models
      • Predictive analytics
      • Clickstream history
      Connected via the MongoDB Connector for Hadoop.
  12. Insurance
      Applications powered by MongoDB:
      • Customer profiles
      • Insurance policies
      • Session data
      • Call center data
      Analysis powered by Hadoop:
      • Customer action analysis
      • Churn analysis
      • Churn prediction
      • Policy rates
      Connected via the MongoDB Connector for Hadoop.
  13. Fraud Detection (architecture diagram)
      Online payments processing writes to the payments database; the MongoDB Connector for Hadoop feeds payments and 3rd-party data sources into nightly fraud modeling analysis, whose output lands in a results cache that the fraud detection service reads (query only).
  14. MongoDB Connector for Hadoop
  15. Connector Overview
      • Data: read/write MongoDB, read/write BSON
      • Tools: MapReduce, Pig, Hive, Spark
      • Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
  16. Connector Features and Functionality
      • Computes splits to read data
        – Single Node, Replica Sets, Sharded Clusters
      • Mappings for Pig and Hive
        – MongoDB as a standard data source/destination
      • Support for
        – Filtering data with MongoDB queries
        – Authentication
        – Reading from Replica Set tags
        – Appending to existing collections
  17. MapReduce Configuration
      • MongoDB input
        – mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
        – mongo.input.uri = mongodb://mydb:27017/db1.collection1
      • MongoDB output
        – mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
        – mongo.output.uri = mongodb://mydb:27017/db1.collection2
      • BSON input/output
        – mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
        – mapred.input.dir = hdfs:///tmp/database.bson
        – mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
        – mapred.output.dir = hdfs:///tmp/output.bson
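      The same settings can also be applied programmatically on a Hadoop Configuration. The Scala sketch below is illustrative only: it assumes the mongo-hadoop connector and MongoDB Java driver jars are on the classpath, and the URIs, the example query, and the omitted mapper/reducer classes are placeholders rather than code from the talk.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.mapreduce.Job
      import org.bson.BSONObject
      import com.mongodb.hadoop.{MongoInputFormat, MongoOutputFormat}

      // Sketch of a driver wiring the connector's input/output formats
      // for db1.collection1 -> db1.collection2.
      object MongoJobDriver {
        def main(args: Array[String]): Unit = {
          val conf = new Configuration()
          conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1")
          conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2")
          // Optional filter pushed down to MongoDB (hypothetical query).
          conf.set("mongo.input.query", "{\"age\": {\"$gte\": 18}}")

          val job = Job.getInstance(conf, "mongo-hadoop example")
          job.setInputFormatClass(classOf[MongoInputFormat])
          job.setOutputFormatClass(classOf[MongoOutputFormat[Object, BSONObject]])
          // A real job would also set mapper/reducer and output key/value classes here.
          job.waitForCompletion(true)
        }
      }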
  18. Pig Mappings
      • Input: BSONLoader and MongoLoader
        data = LOAD 'mongodb://mydb:27017/db.collection'
               USING com.mongodb.hadoop.pig.MongoLoader;
      • Output: BSONStorage and MongoInsertStorage
        STORE records INTO 'hdfs:///output.bson'
              USING com.mongodb.hadoop.pig.BSONStorage;
  19. Hive Support
      CREATE TABLE mongo_users (id int, name string, age int)
      STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
      WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
      TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");
      • Access collections as Hive tables
      • Use with MongoStorageHandler or BSONStorageHandler
  20. Spark Usage
      • Use with MapReduce input/output formats
      • Create Configuration objects with input/output formats and data URI
      • Load/save data using the SparkContext Hadoop file API
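      A minimal Scala sketch of that pattern, reading a collection into Spark and writing results back through the connector. The URIs and collection names are placeholders, and the mongo-hadoop jars are assumed to be on the classpath; this is the commonly documented newAPIHadoopRDD / saveAsNewAPIHadoopFile usage, not code taken from the talk.

      import org.apache.hadoop.conf.Configuration
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._
      import org.bson.BSONObject
      import com.mongodb.hadoop.{MongoInputFormat, MongoOutputFormat}

      object SparkMongoExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("spark-mongo-example"))

          // Configuration object carrying the MongoDB input URI.
          val inputConf = new Configuration()
          inputConf.set("mongo.input.uri", "mongodb://mydb:27017/db.collection")

          // Load documents as (id, BSONObject) pairs via the connector's input format.
          val docs = sc.newAPIHadoopRDD(
            inputConf, classOf[MongoInputFormat], classOf[Object], classOf[BSONObject])
          println(s"Read ${docs.count()} documents from MongoDB")

          // Configuration object carrying the MongoDB output URI.
          val outputConf = new Configuration()
          outputConf.set("mongo.output.uri", "mongodb://mydb:27017/db.output")

          // Save (key, BSONObject) pairs back to MongoDB; the path argument is
          // required by the Hadoop file API but unused by MongoOutputFormat.
          docs.saveAsNewAPIHadoopFile(
            "file:///unused",
            classOf[Object],
            classOf[BSONObject],
            classOf[MongoOutputFormat[Object, BSONObject]],
            outputConf)
        }
      }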
  21. Data Movement: dynamic queries to MongoDB vs. BSON snapshots in HDFS
      • Dynamic queries work with the most recent data but put load on the operational database
      • Snapshots move the processing load to Hadoop and add predictable load to MongoDB
  22. Demo
  23. MovieWeb
  24. MovieWeb Components
      • MovieLens dataset
        – 10M ratings, 10K movies, 70K users
      • Python web app to browse movies and recommendations
        – Flask, PyMongo
      • Spark app computes recommendations
        – MLlib collaborative filtering
      • Predicted ratings are exposed in the web app
        – New predictions collection
  25. MovieWeb Web Application
      • Browse
        – Top movies by ratings count
        – Top genres by movie count
      • Log in to
        – See My Ratings
        – Rate movies
      • What's missing?
        – Movies You May Like
        – Recommendations
  26. Spark Recommender
      • Apache Hadoop 2.3.0
        – HDFS and YARN
      • Spark 1.0
        – Executes within YARN
        – Assign executor resources
      • Data
        – From HDFS and MongoDB
        – To MongoDB
  27. MovieWeb Workflow
      1. Snapshot the database as BSON
      2. Store the BSON in HDFS
      3. Read the BSON into the Spark app
      4. Train a model from the existing ratings
      5. Create user-movie pairings
      6. Predict ratings for all pairings
      7. Write predictions to a MongoDB collection
      8. Web application exposes the recommendations
      9. Repeat the process weekly
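      A hedged Scala sketch of steps 3-7 of this workflow (read the BSON snapshot, train, pair, predict, write). The HDFS path, the field names userId/movieId/rating, the predictions collection name, and the ALS parameters are assumptions for illustration; the original demo code is not reproduced here.

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._
      import org.apache.spark.rdd.RDD
      import org.apache.spark.mllib.recommendation.{ALS, Rating}
      import org.bson.{BasicBSONObject, BSONObject}
      import com.mongodb.hadoop.{BSONFileInputFormat, MongoOutputFormat}

      object MovieWebRecommender {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("MovieWeb recommender"))

          // Step 3: read the BSON snapshot of the ratings collection from HDFS.
          val ratingsBson = sc.newAPIHadoopFile(
            "hdfs:///movieweb/ratings.bson",
            classOf[BSONFileInputFormat].asSubclass(classOf[FileInputFormat[Object, BSONObject]]),
            classOf[Object],
            classOf[BSONObject],
            new Configuration())

          // Step 4: train a collaborative-filtering model from the existing ratings.
          val ratings = ratingsBson.map { case (_, doc) =>
            Rating(doc.get("userId").toString.toInt,
                   doc.get("movieId").toString.toInt,
                   doc.get("rating").toString.toDouble)
          }.cache()
          val model = ALS.train(ratings, 10, 10, 0.01) // rank, iterations, lambda

          // Steps 5-6: create user-movie pairings ("all pairings", as on the slide)
          // and predict a rating for each one.
          val pairings = ratings.map(_.user).distinct().cartesian(ratings.map(_.product).distinct())
          val predictions = model.predict(pairings)

          // Step 7: write the predictions to a MongoDB collection the web app can query.
          val outConf = new Configuration()
          outConf.set("mongo.output.uri", "mongodb://mydb:27017/movieweb.predictions")
          val predictionDocs: RDD[(Object, BSONObject)] = predictions.map { p =>
            val doc = new BasicBSONObject()
            doc.put("userId", Int.box(p.user))
            doc.put("movieId", Int.box(p.product))
            doc.put("rating", Double.box(p.rating))
            (null, doc)
          }
          predictionDocs.saveAsNewAPIHadoopFile(
            "file:///unused",
            classOf[Object],
            classOf[BSONObject],
            classOf[MongoOutputFormat[Object, BSONObject]],
            outConf)
        }
      }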
  28. Execution
      $ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.3.0.jar
      $ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
      $ bin/spark-submit --master yarn-cluster \
          --class com.mongodb.hadoop.demo.Recommender \
          --jars mongo-java-2.12.2.jar,mongo-hadoop-1.2.1.jar \
          --driver-memory 1G --executor-memory 2G --num-executors 4 \
          demo-1.0.jar
  29. Questions?
