Tweet tonight using #MongoDB to be entered to win a t-shirt!
Welcome to MongoDB Evenings Dallas!
Agenda
5:30pm: Pizza, Beer & Soft Drinks
6:00pm: Welcome
William Kent, Enterprise Account Executive, MongoDB
6:10pm: What’s the Scoop on MongoDB & Hadoop?
Jake Angerman, Senior Solutions Architect, MongoDB
7:00pm: Acxiom's Journey with MongoDB
John Riewerts, Director of Engineering - Marketing Services, Acxiom
7:45pm: Announcements
Q&A
MongoDB + Hadoop
{ Name: "Jake Angerman",
  Title: "Sr. Solutions Architect" }
MongoDB + Hadoop
Documents Support Modern Requirements
Relational vs. Document Data Structure
{ customer_id : 123,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  location : [37.7576792, -122.5078112],
  image : <binary data>,
  phones : [
    { number : "1-212-555-1212",
      dnc : true,
      type : "home" },
    { number : "1-212-555-1213",
      type : "cell" }
  ]
}
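As a hedged illustration (host, database, and collection names are assumptions, not from the deck), the same document can be stored and queried with PyMongo, reaching into the embedded phones array with dot notation:

from pymongo import MongoClient

# Sketch only: hypothetical host and namespace.
client = MongoClient("mongodb://localhost:27017")
customers = client.mydb.customers

customers.insert_one({
    "customer_id": 123,
    "first_name": "Mark",
    "last_name": "Smith",
    "city": "San Francisco",
    "location": [37.7576792, -122.5078112],
    # binary image field from the slide omitted for brevity
    "phones": [
        {"number": "1-212-555-1212", "dnc": True, "type": "home"},
        {"number": "1-212-555-1213", "type": "cell"},
    ],
})

# Query inside the embedded array directly -- no join needed.
for doc in customers.find({"phones.type": "cell"}):
    print(doc["first_name"], doc["last_name"])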
MongoDB Technical Capabilities
[Diagram: the application's driver connects through a mongos router to Shard 1 … Shard N; each shard is a replica set with one primary and two secondaries]

1. Dynamic document schema
   { name: "John Smith",
     date: "2013-08-01",
     address: "10 3rd St.",
     phone: {
       home: 1234567890,
       mobile: 1234568138 }
   }
   db.customer.insert({…})
   db.customer.find({ name: "John Smith" })
2. Native language drivers
3. High availability
   - Replica sets (see the driver sketch after this list)
4. High performance
   - Data locality
   - Indexes
   - RAM
5. Horizontal scalability
   - Sharding
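To make capabilities 2 and 3 concrete, here is a minimal PyMongo sketch of connecting to a replica set; the hostnames and set name are invented. The driver discovers the members from the seed list and fails over automatically if the primary steps down.

from pymongo import MongoClient
from pymongo.read_preferences import ReadPreference

# Sketch only: hypothetical hostnames and replica-set name.
client = MongoClient(
    "mongodb://node1:27017,node2:27017,node3:27017/?replicaSet=rs0")

# Writes always go to the primary; reads may be spread to secondaries.
db = client.get_database(
    "test", read_preference=ReadPreference.SECONDARY_PREFERRED)
db.customer.insert_one({"name": "John Smith"})
print(db.customer.find_one({"name": "John Smith"}))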
Hadoop
A framework for distributed processing of large data sets
•  Terabyte and petabyte datasets
•  Data warehousing
•  Advanced analytics
•  Not a database
•  No indexes
•  Batch processing
Data Management
Hadoop
•  Fault tolerance
•  Batch processing
•  Coarse-grained operations
•  Unstructured data
•  WORM (write once, read many)
MongoDB
•  High availability
•  Mutable data
•  Fine-grained operations
•  Flexible schemas
•  CRUD
Data Management
Hadoop
•  Offline processing
•  Analytics
•  Data warehousing
MongoDB
•  Online operations
•  Application
•  Operational
Commerce Use Case
Applications powered by MongoDB:
•  Products & inventory
•  Recommended products
•  Customer profile
•  Session management
Analysis powered by Hadoop:
•  Elastic pricing
•  Recommendation models
•  Predictive analytics
•  Clickstream history
Data moves between the two via the MongoDB Connector for Hadoop.
Insurance Use Case
Applications powered by MongoDB:
•  Customer profiles
•  Insurance policies
•  Session data
•  Call center data
Analysis powered by Hadoop:
•  Customer action analysis
•  Churn analysis
•  Churn prediction
•  Policy rates
Data moves between the two via the MongoDB Connector for Hadoop.
Fraud Detection Use Case
[Diagram: an online payments processing system records payments in MongoDB; a nightly analysis job performs fraud modeling in Hadoop over those payments plus 3rd-party data sources, linked by the MongoDB Connector for Hadoop; results are written back to a MongoDB results cache, which the fraud detection service accesses query-only]
Where MongoDB Fits in the Hadoop Ecosystem
[Diagram: the Hadoop stack, with HDFS at the base, YARN above it, and MapReduce on top; Pig, Hive, and Spark run over MapReduce]
[Diagram: the Spark stack, with Spark Streaming, Hive, the Spark shell, Pig, and Spark SQL over the Spark core, which runs on Mesos, Hadoop YARN, or standalone]
MongoDB Connector for Hadoop
Applications (MongoDB):
•  Low latency
•  Rich, fast querying
•  Flexible indexing
•  Aggregations in the database
•  Known data relationships
•  Great for any subset of data
Distributed analytics (Hadoop):
•  Longer jobs
•  Batch analytics
•  Highly parallel processing
•  Unknown data relationships
•  Great for looking at all data or large subsets
The MongoDB Connector for Hadoop moves data between the two.
db.tweets.aggregate([
  { $group: {
      _id: {
        hour:   { $hour:   "$date" },
        minute: { $minute: "$date" }
      },
      total:     { $sum: "$sentiment.score" },
      average:   { $avg: "$sentiment.score" },
      count:     { $sum: 1 },
      happyTalk: { $push: "$sentiment.positive" }
  }},
  { $unwind: "$happyTalk" },
  { $unwind: "$happyTalk" },
  { $group: {
      _id: "$_id",
      total:     { $first: "$total" },
      average:   { $first: "$average" },
      count:     { $first: "$count" },
      happyTalk: { $addToSet: "$happyTalk" }
  }},
  { $sort: { _id: -1 } }
])
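The same pipeline runs unchanged through any native driver. A minimal PyMongo version follows; the database name is assumed, and the double $unwind is kept because it suggests each $sentiment.positive value is itself an array:

from pymongo import MongoClient

tweets = MongoClient("mongodb://localhost:27017").test.tweets  # names assumed

pipeline = [
    {"$group": {
        "_id": {"hour": {"$hour": "$date"}, "minute": {"$minute": "$date"}},
        "total": {"$sum": "$sentiment.score"},
        "average": {"$avg": "$sentiment.score"},
        "count": {"$sum": 1},
        "happyTalk": {"$push": "$sentiment.positive"},
    }},
    {"$unwind": "$happyTalk"},  # flatten the pushed arrays...
    {"$unwind": "$happyTalk"},  # ...then the values inside each array
    {"$group": {
        "_id": "$_id",
        "total": {"$first": "$total"},
        "average": {"$first": "$average"},
        "count": {"$first": "$count"},
        "happyTalk": {"$addToSet": "$happyTalk"},
    }},
    {"$sort": {"_id": -1}},
]

for bucket in tweets.aggregate(pipeline):
    print(bucket["_id"], bucket["average"])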
But doesn't MongoDB have…?
•  An aggregation framework
   –  but no machine learning libraries
•  Map-reduce
   –  JavaScript only
   –  competes with operational workloads
MongoDB Data Operations Spectrum
•  Document retrieval – 1 ms if in cache, ~10 ms from spinning disk
•  .find() – per-document cost similar to single-document retrieval
   –  _id range
   –  any secondary index range; can be a composite key
   –  intersect two indexes
   –  covered queries are even faster (see the sketch after this list)
•  .count(), .distinct(), .group() – fast, may be covered
•  .aggregate() – retrieval cost like find, plus pipeline operations
   –  $match, $group
   –  $project, $redact
•  .mapReduce() – in-database JavaScript
•  Hadoop Connector
   –  mongo.input.query for an indexed partial scan
   –  full scan
(Operations above are ordered roughly from fastest to slowest.)
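A hedged sketch of the "covered" case above: when the filter and the projection both live in a single index and _id is excluded, the server can answer from the index alone. Collection and field names here are invented.

from pymongo import MongoClient, ASCENDING

customers = MongoClient("mongodb://localhost:27017").test.customers  # assumed

# Composite secondary index over the queried and returned fields.
customers.create_index([("last_name", ASCENDING), ("city", ASCENDING)])

# Covered query: filter and projection both stay inside the index.
cursor = customers.find(
    {"last_name": "Smith"},
    {"_id": 0, "last_name": 1, "city": 1},
)
# The winning plan should show an index scan with no document fetch.
print(cursor.explain()["queryPlanner"]["winningPlan"])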
Analytics Landscape
[Diagram: analytics workloads arranged by response time, each served by both modern and legacy tools: real-time dashboards / scoring (<30 ms), planned reporting (secs – mins), and batch / predictive / ad hoc analysis (mins – hours)]
Use Cases
MetLife – Single View
[Diagram: data from siloed source systems (Silo 1 … Silo N) flows via pub-sub/ETL into an operational data layer in MongoDB, which powers a single CSR application, a unified customer portal, and operational reporting. The operational data layer holds insurance policies, demographic data, customer web data, and call center data; the MongoDB Connector for Hadoop links it to a DW/data lake running churn prediction algorithms, customer clustering, churn analysis, and predictive analytics]
Foursquare
•  k-nearest-neighbor problems
   –  similarity of venues, people, or brands
•  MongoDB data has advantages over log files when used with MapReduce
   –  log files can be stale
   –  log files may not contain as much information
   –  you can scan much less data
[Diagram: a BSON dump of the MongoDB cluster feeds Hadoop via the MongoDB Connector for Hadoop]
MongoDB Connector for Hadoop
https://github.com/mongodb/mongo-hadoop
Connector Overview
•  Data: read/write MongoDB, read/write BSON
•  Tools: MapReduce, Pig, Hive, Spark
•  Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
MapReduce Configuration
•  MongoDB input
   –  mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
   –  mongo.input.uri = mongodb://mydb:27017/db1.collection1
•  MongoDB output
   –  mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
   –  mongo.output.uri = mongodb://mydb:27017/db1.collection2
•  BSON input/output
   –  mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
   –  mapred.input.dir = hdfs:///tmp/database.bson
   –  mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
   –  mapred.output.dir = hdfs:///tmp/output.bson
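These properties also drive Hadoop Streaming jobs, which lets the map and reduce steps be written in Python. Below is a hedged sketch assuming the pymongo_hadoop helpers that ship with the connector's streaming support; the deviceId/value document shape is invented for illustration.

# mapper.py -- sketch only; reads BSON documents from the configured input.
from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        yield {"_id": doc["deviceId"], "count": 1}

BSONMapper(mapper)

# reducer.py -- sketch only; sums the per-device counts for each key.
from pymongo_hadoop import BSONReducer

def reducer(key, values):
    return {"_id": key, "count": sum(v["count"] for v in values)}

BSONReducer(reducer)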
MongoDB Cluster
[Diagram: a client connects through mongos routers to Shard A, Shard B, Shard C, and Shard D]
Input splitting: a splitter extends the MongoSplitter class and implements List<InputSplit> calculateSplits() to divide the collection into input splits.
Pig
•  High-level platform for creating MapReduce programs
•  Pig Latin abstracts Java into easier-to-use notation
•  Executed as a series of MapReduce applications
•  Supports user-defined functions (UDFs)
samples = LOAD 'mongodb://127.0.0.1:27017/sensor.logs'
    USING com.mongodb.hadoop.pig.MongoLoader('deviceId:int,value:double');
grouped = GROUP samples BY deviceId;
sample_stats = FOREACH grouped {
    mean = AVG(samples.value);
    GENERATE group AS deviceId, mean AS mean;
}
STORE sample_stats INTO 'mongodb://127.0.0.1:27017/sensor.stats'
    USING com.mongodb.hadoop.pig.MongoStorage;
Hive
•  Data warehouse infrastructure built on top of Hadoop
•  Provides data summarization, query, and analysis
•  HiveQL is a subset of SQL
•  Supports user-defined functions (UDFs)
Hive Support
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")

•  Access collections as Hive tables
•  Use with MongoStorageHandler or BSONStorageHandler
Spark
[Diagram of the Spark ecosystem; image source: dzone.com]
An engine for processing Hadoop data that can perform MapReduce as well as streaming, interactive queries, and machine learning.
•  Powerful built-in transformations and actions
   –  map, reduceByKey, union, distinct, sample, intersection, and more
   –  foreach, count, collect, take, and many more
Create the Resilient Distributed Dataset (RDD). In Java:

rdd = sc.newAPIHadoopRDD(
    config, MongoInputFormat.class, Object.class, BSONObject.class)

In Scala, configuring the input/output URIs first:

config.set("mongo.input.uri",
    "mongodb://127.0.0.1:27017/marketdata.minbars")
config.set("mongo.input.query",
    """{"_id": {"$gt": {"$date": 1182470400000}}}""")
config.set("mongo.output.uri",
    "mongodb://127.0.0.1:27017/marketdata.fiveminutebars")

val minBarRawRDD = sc.newAPIHadoopRDD(
    config,
    classOf[com.mongodb.hadoop.MongoInputFormat],
    classOf[Object],
    classOf[BSONObject])
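The demo that follows calls sc.mongoRDD, which comes from the connector's pymongo-spark package. A hedged setup sketch, assuming the mongo-hadoop jars are on the Spark classpath:

from pyspark import SparkContext
import pymongo_spark

# activate() patches mongoRDD onto SparkContext (and saveToMongoDB onto RDD).
pymongo_spark.activate()

sc = SparkContext(appName="adsb-demo")
rdd = sc.mongoRDD("mongodb://localhost:27017/adsb.tincan")
print(rdd.first())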
Spark Demo
K-means Clustering
>>> mongo_rdd = sc.mongoRDD('mongodb://localhost:27017/adsb.tincan')
>>> parsed_rdd = mongo_rdd.map(parseData)
>>> clusters = KMeans.train(parsed_rdd, 10, maxIterations=10, runs=1,
...     initializationMode="random")
>>> cluster_sizes = parsed_rdd.map(lambda e: clusters.predict(e)).countByValue()
>>> cluster_sizes
defaultdict(<type 'int'>, {0: 70122, 1: 350890, 2: 118596, 3: 104609,
4: 254759, 5: 175840, 6: 166789, 7: 68309, 8: 147826, 9: 495102})
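The demo's parseData function is not shown in the deck; here is a hypothetical version, assuming each tincan document carries numeric position fields (the field names are invented):

from numpy import array

def parseData(doc):
    # Project each ADS-B document onto the numeric features to cluster;
    # 'lat', 'lon', and 'altitude' are invented field names.
    return array([float(doc["lat"]),
                  float(doc["lon"]),
                  float(doc["altitude"])])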
[Slides: K-means clustering result visualizations]
Summary
Data Flows
[Diagram: MongoDB exchanges data with MapReduce & HDFS via the Hadoop Connector or BSON files. MongoDB is the optimal location for providing operational response times & slices; governance determines where to load and process data]
More Complete EDM Architecture & Data Lake
[Diagram: siloed source databases, external feeds (batch), and streams flow through a data processing pipeline (pub-sub, ETL, file imports) and stream processing into MongoDB and a data lake. MongoDB (drivers & stacks) powers operational applications & reporting: a single CSR application, unified digital apps, and operational reporting. The data lake provides distributed processing (customer clustering, churn analysis, predictive analytics, analytic reporting), can run processing on all data or slices, and feeds downstream systems. Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png]
MongoDB World
June 28–29, 2016, New York, NY
www.mongodbworld.com
Code "JakeAngerman" gets 25% off
Super Early Bird registration ends March 25, 2016