MongoDB and Hadoop: Driving Business Insights

Transcript

  • 1. MongoDB and Hadoop: Driving Business Insights – Sandeep Parikh, Partner Technical Solutions, MongoDB – #MongoDBWorld
  • 2. Agenda • Evolving Data Landscape • MongoDB & Hadoop Use Cases • MongoDB Connector Features • Demo
  • 3. Evolving Data Landscape
  • 4. Hadoop The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. • Terabyte and petabyte datasets • Data warehousing • Advanced analytics
  • 5. Enterprise IT Stack – diagram: Applications (CRM, ERP, Collaboration, Mobile, BI); Data Management (operational RDBMS; analytical RDBMS and EDW); Infrastructure (OS & Virtualization, Compute, Storage, Network); flanked by Management & Monitoring and Security & Auditing
  • 6. Operational vs. Analytical: Enrichment – diagram: applications and interactions on the operational side feed the warehouse and analytics on the analytical side
  • 7. Operational: MongoDB – use-case spectrum (Real-Time Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis), with the operational workloads served by MongoDB
  • 8. Analytical: Hadoop – same use-case spectrum, with the analytical workloads served by Hadoop
  • 9. Operational vs. Analytical: Lifecycle – same use-case spectrum; workloads move between the operational and analytical sides over their lifecycle
  • 10. MongoDB & Hadoop Use Cases
  • 11. Commerce – applications powered by MongoDB (products & inventory, recommended products, customer profile, session management); analysis powered by Hadoop (elastic pricing, recommendation models, predictive analytics, clickstream history); linked by the MongoDB Connector for Hadoop
  • 12. Insurance – applications powered by MongoDB (customer profiles, insurance policies, session data, call center data); analysis powered by Hadoop (customer action analysis, churn analysis, churn prediction, policy rates); linked by the MongoDB Connector for Hadoop
  • 13. Fraud Detection – diagram: online payments processing writes payments to MongoDB; nightly analysis runs fraud modeling in Hadoop, combined with 3rd-party data sources, via the MongoDB Connector for Hadoop; results flow into a results cache in MongoDB that the fraud-detection service accesses query-only
  • 14. MongoDB Connector for Hadoop
  • 15. Connector Overview – Data: read/write MongoDB, read/write BSON; Tools: MapReduce, Pig, Hive, Spark; Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
  • 16. Connector Features and Functionality • Computes splits to read data – Single Node, Replica Sets, Sharded Clusters • Mappings for Pig and Hive – MongoDB as a standard data source/destination • Support for – Filtering data with MongoDB queries – Authentication – Reading from Replica Set tags – Appending to existing collections
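
    To make the filtering and authentication bullets concrete, a minimal sketch of the relevant job-configuration keys follows (shown as a Python dict, the form also used in the Spark sketch after slide 20). The mongo.input.query key is the connector's query filter; the credentials, host, and filter values here are placeholders, not values from the talk.

      # Hypothetical connector configuration (placeholder values throughout).
      input_config = {
          # Credentials travel inside the standard MongoDB connection URI.
          "mongo.input.uri": "mongodb://appUser:secret@mydb:27017/db1.collection1",
          # Read only the documents matching this JSON-encoded MongoDB query.
          "mongo.input.query": '{"status": "active"}',
      }
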
  • 17. MapReduce Configuration • MongoDB input – mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat – mongo.input.uri = mongodb://mydb:27017/db1.collection1 • MongoDB output – mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat – mongo.output.uri = mongodb://mydb:27017/db1.collection2 • BSON input/output – mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat – mapred.input.dir = hdfs:///tmp/database.bson – mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat – mapred.output.dir = hdfs:///tmp/output.bson
  • 18. Pig Mappings • Input: BSONLoader and MongoLoader data = LOAD 'mongodb://mydb:27017/db.collection' using com.mongodb.hadoop.pig.MongoLoader • Output: BSONStorage and MongoInsertStorage STORE records INTO 'hdfs:///output.bson' using com.mongodb.hadoop.pig.BSONStorage
  • 19. Hive Support CREATE TABLE mongo_users (id int, name string, age int) STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler" WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age") TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users") • Access collections as Hive tables • Use with MongoStorageHandler or BSONStorageHandler
  • 20. Spark Usage • Use with MapReduce input/output formats • Create Configuration objects with input/output formats and data URI • Load/save data using the SparkContext Hadoop file API
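
    A minimal PySpark sketch of the pattern this slide describes: build a configuration carrying the connector's input/output formats and MongoDB URIs, then load and save through the SparkContext Hadoop-file API. The format class names come from the connector; the host, database, and collection names are placeholders, and depending on the connector version the key/value classes (and any converters) may need to differ.

      from pyspark import SparkContext

      sc = SparkContext(appName="mongo-hadoop-example")

      # Read a collection through the connector's Hadoop input format.
      read_conf = {"mongo.input.uri": "mongodb://mydb:27017/db1.collection1"}
      docs = sc.newAPIHadoopRDD(
          inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
          keyClass="org.apache.hadoop.io.Text",
          valueClass="org.apache.hadoop.io.MapWritable",
          conf=read_conf)

      # docs is an RDD of (_id, document) pairs; transform as needed.
      results = docs.mapValues(lambda doc: doc)  # placeholder transformation

      # Write the results to another collection via the output format.
      write_conf = {"mongo.output.uri": "mongodb://mydb:27017/db1.collection2"}
      results.saveAsNewAPIHadoopFile(
          path="file:///tmp/unused",  # required by the API, ignored by the connector
          outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
          keyClass="org.apache.hadoop.io.Text",
          valueClass="org.apache.hadoop.io.MapWritable",
          conf=write_conf)
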
  • 21. Data Movement • Dynamic queries to MongoDB – work against the most recent data – put load on the operational database • BSON snapshots in HDFS – move the load to Hadoop – add predictable load to MongoDB
  • 22. Demo
  • 23. MovieWeb
  • 24. MovieWeb Components • MovieLens dataset – 10M ratings, 10K movies, 70K users • Python web app to browse movies, recommendations – Flask, PyMongo • Spark app computes recommendations – MLlib collaborative filtering • Predicted ratings are exposed in web app – New predictions collection
  • 25. MovieWeb Web Application • Browse – Top movies by ratings count – Top genres by movie count • Log in to – See My Ratings – Rate movies • What’s missing? – Movies You May Like – Recommendations
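
    As a rough illustration of how the Flask/PyMongo side might serve these views, here is a sketch of the PyMongo queries only. The database, collection, and field names (movieweb, movies, predictions, ratings_count, user_id, predicted_rating) are assumptions for illustration, not taken from the actual demo.

      from pymongo import MongoClient, DESCENDING

      db = MongoClient("mongodb://localhost:27017")["movieweb"]  # placeholder names

      def top_movies(limit=20):
          # Browse: top movies by ratings count.
          return list(db.movies.find().sort("ratings_count", DESCENDING).limit(limit))

      def recommendations_for(user_id, limit=10):
          # "Movies You May Like": highest predicted ratings for this user,
          # read from the predictions collection written by the Spark job.
          return list(db.predictions.find({"user_id": user_id})
                      .sort("predicted_rating", DESCENDING)
                      .limit(limit))
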
  • 26. Spark Recommender • Apache Hadoop 2.3.0 – HDFS and YARN • Spark 1.0 – Execute within YARN – Assign executor resources • Data – From HDFS, MongoDB – To MongoDB
  • 27. MovieWeb Workflow – snapshot database as BSON → store BSON in HDFS → read BSON into Spark app → train model from existing ratings → create user-movie pairings → predict ratings for all pairings → write predictions to MongoDB collection → web application exposes recommendations → repeat the process weekly
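
    A rough PySpark/MLlib sketch of the middle steps of this workflow: read the BSON snapshot from HDFS, train a collaborative-filtering model, predict ratings for all user-movie pairings, and write the predictions back to MongoDB. This is not the demo's actual Recommender code: the HDFS path, field names, output collection, and ALS parameters are assumptions, and depending on the connector version the BSON values may need the connector's converters to arrive as Python dicts.

      from pyspark import SparkContext
      from pyspark.mllib.recommendation import ALS, Rating

      sc = SparkContext(appName="movieweb-recommender")

      # Read the ratings snapshot (BSON files in HDFS) via the connector.
      bson = sc.newAPIHadoopFile(
          "hdfs:///tmp/ratings.bson",                      # placeholder path
          "com.mongodb.hadoop.BSONFileInputFormat",
          "org.apache.hadoop.io.Text",
          "org.apache.hadoop.io.MapWritable")

      # Assume each value converts to a dict with user_id/movie_id/rating fields.
      ratings = bson.map(lambda kv: Rating(int(kv[1]["user_id"]),
                                           int(kv[1]["movie_id"]),
                                           float(kv[1]["rating"])))

      # Train the model from existing ratings (rank/iterations are guesses).
      model = ALS.train(ratings, rank=10, iterations=10)

      # Create user-movie pairings and predict a rating for each one.
      pairings = ratings.map(lambda r: r.user).distinct() \
                        .cartesian(ratings.map(lambda r: r.product).distinct())
      predictions = model.predictAll(pairings)

      # Write predictions to a MongoDB collection for the web app to expose.
      out = predictions.map(lambda p: (None, {"user_id": p.user,
                                              "movie_id": p.product,
                                              "predicted_rating": p.rating}))
      out.saveAsNewAPIHadoopFile(
          "file:///tmp/unused",                            # ignored by the connector
          "com.mongodb.hadoop.MongoOutputFormat",
          keyClass="org.apache.hadoop.io.Text",
          valueClass="org.apache.hadoop.io.MapWritable",
          conf={"mongo.output.uri": "mongodb://mydb:27017/movieweb.predictions"})
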
  • 28. Execution
      $ export SPARK_JAR=spark-assembly-1.0.0-hadoop2.3.0.jar
      $ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
      $ bin/spark-submit --master yarn-cluster \
          --class com.mongodb.hadoop.demo.Recommender \
          --jars mongo-java-2.12.2.jar,mongo-hadoop-1.2.1.jar \
          --driver-memory 1G --executor-memory 2G --num-executors 4 \
          demo-1.0.jar
  • 29. Questions?