Using MongoDB with Hadoop & Spark
1. MongoDB, Hadoop, & Spark
{ Name: 'Bryan Reinero',
Title: 'Developer Advocate',
Twitter: '@blimpyacht',
Email: 'bryan@mongodb.com',
Web: 'mongodb.com/bryan' }
9. 9
Yet Another Resource Negotiator
[Diagram: the client submits the job with "bin/hadoop jar MyJob.jar MongoDB_Hadoop_Connector.jar" to the Resource Manager; the Resource Manager starts an Application Master on one of the compute nodes, and the Application Master asks the Node Managers for containers in which the job's tasks run.]
10. 10
Running an Example
[Diagram: the same YARN layout as above; the client launches the example job on the cluster with "bin/hadoop jar MyJob.jar MongoDB_Hadoop_Connector.jar".]
14. 14
What You’re Gonna Need
A reducer class that extends org.apache.hadoop.mapreduce.Reducer
A mapper class that extends org.apache.hadoop.mapreduce.Mapper
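The deck doesn't show the driver that ties these two classes together. Below is a minimal sketch of what one could look like, assuming the MongoDB Hadoop connector's MongoInputFormat, MongoOutputFormat, and MongoConfigUtil helpers; the EnronMail class name and the enron.messages / enron.message_pairs URIs are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class EnronMail {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read the source e-mails from MongoDB and write the results back
        // to MongoDB (hypothetical URIs).
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/enron.messages");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/enron.message_pairs");

        Job job = Job.getInstance(conf, "enron map/reduce");
        job.setJarByClass(EnronMail.class);

        job.setMapperClass(EnronMailMapper.class);
        job.setReducerClass(EnronMailReducer.class);

        // Key/value types match the mapper and reducer shown on the next slides.
        job.setMapOutputKeyClass(MailPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(BSONWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // Read from and write to MongoDB instead of HDFS.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is then submitted exactly as on the earlier slides: bin/hadoop jar MyJob.jar with the connector jar on the classpath.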
15. 15
EnronMailMapper Class
public class EnronMailMapper
        extends org.apache.hadoop.mapreduce.Mapper<Object, BSONObject, MailPair, IntWritable> {

    // Reusable output key/value; intw holds the constant count (1) that the reducer sums.
    private final MailPair mp = new MailPair();
    private final IntWritable intw = new IntWritable(1);

    @Override
    public void map(final Object key, final BSONObject val, final Context context)
            throws IOException, InterruptedException {
        // Each input document is one e-mail; its "headers" subdocument holds the To: line.
        BSONObject headers = (BSONObject) val.get("headers");
        String to = (String) headers.get("To");
        if (null != to) {
            // Emit one (from, to) pair per recipient on the To: line.
            String[] recipients = to.split(",");
            for (final String recip1 : recipients) {
                String recip = recip1.trim();
                if (recip.length() > 0) {
                    mp.setFrom((String) key);
                    mp.setTo(recip);
                    context.write(mp, intw);
                }
            }
        }
    }
}
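MailPair, the composite key the mapper emits, isn't defined anywhere in the deck. Below is a rough sketch of what such a WritableComparable could look like; the field layout, serialization, and sort order are assumptions for illustration, not the deck's actual class.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A composite key holding (from, to), so all messages between the same pair
// of people are grouped into a single reduce() call.
public class MailPair implements WritableComparable<MailPair> {
    private String from = "";
    private String to = "";

    public String getFrom() { return from; }
    public String getTo() { return to; }
    public void setFrom(String from) { this.from = from; }
    public void setTo(String to) { this.to = to; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }

    @Override
    public int compareTo(MailPair other) {
        int cmp = from.compareTo(other.from);
        return cmp != 0 ? cmp : to.compareTo(other.to);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MailPair)) return false;
        MailPair p = (MailPair) o;
        return from.equals(p.from) && to.equals(p.to);
    }

    @Override
    public int hashCode() {
        return 31 * from.hashCode() + to.hashCode();
    }
}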
16. 16
EnronMailReducer Class
public class EnronMailReducer
        extends org.apache.hadoop.mapreduce.Reducer<MailPair, IntWritable, BSONWritable, IntWritable> {

    // Reusable output key/value; the BSONWritable key becomes the output document.
    private final BSONWritable reduceResult = new BSONWritable();
    private final IntWritable intw = new IntWritable();

    @Override
    public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues,
                       final Context pContext)
            throws IOException, InterruptedException {
        // Sum the counts emitted for this (from, to) pair.
        int sum = 0;
        for (final IntWritable value : pValues) {
            sum += value.get();
        }
        BSONObject outDoc = BasicDBObjectBuilder.start()
                .add("f", pKey.getFrom())
                .add("t", pKey.getTo()).get();
        reduceResult.setDoc(outDoc);
        intw.set(sum);
        pContext.write(reduceResult, intw);
    }
}
18. 18
MapReduce
map() {
    String to = (String) headers.get("To");
    if (null != to) {
        String[] recipients = to.split(",");
        for (final String recip1 : recipients) {
            String recip = recip1.trim();
            if (recip.length() > 0) {
                mp.setFrom((String) key);
                mp.setTo(recip);
                context.write(mp, intw);
            }
        }
    }
}
21. 21
MapReduce
{ from: "Ken", to: "Jeff" }
map() {
    String to = (String) headers.get("To");
    if (null != to) {
        String[] recipients = to.split(",");
        for (final String recip1 : recipients) {
            String recip = recip1.trim();
            if (recip.length() > 0) {
                mp.setFrom((String) key);
                mp.setTo(recip);
                context.write(mp, intw);
            }
        }
    }
}
22. 22
MapReduce
{ from: "Ken", to: "Jeff" }
map() {
    String to = (String) headers.get("To");
    if (null != to) {
        String[] recipients = to.split(",");
        for (final String recip1 : recipients) {
            String recip = recip1.trim();
            if (recip.length() > 0) {
                mp.setFrom((String) key);
                mp.setTo(recip);
                context.write(mp, intw);
            }
        }
    }
}
{ from: "Ken", to: "Andrew" }
{ from: "Ken", to: "Rebecca" }
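Reading the walkthrough: the mapper fans one of Ken's e-mails out into the (from, to) pairs shown above, writing each as a MailPair key with an IntWritable count of 1 (the value the summing reducer implies). In the deck's own notation, the emitted key/value pairs would look roughly like:

( { from: "Ken", to: "Jeff" }, 1 )
( { from: "Ken", to: "Andrew" }, 1 )
( { from: "Ken", to: "Rebecca" }, 1 )

Before reduce() runs, the framework groups values by key, so every 1 emitted for the (Ken, Jeff) pair arrives in a single reduce() call.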
31. 31
MapReduce
public void reduce() {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.getFrom())
            .add("t", pKey.getTo()).get();
    reduceResult.setDoc(outDoc);
    intw.set(sum);
    pContext.write(reduceResult, intw);
}
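Continuing the earlier example: if Ken had mailed Jeff three times, the reducer would receive the (Ken, Jeff) key with its grouped counts and sum them. A worked illustration (not slide content):

reduce( { from: "Ken", to: "Jeff" }, [1, 1, 1] )
    sum = 3
    writes { f: "Ken", t: "Jeff" } as the output key and 3 as the value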
44. Spark Integrations to Come…
New Spark Connector
• Filter source data with the Aggregation Framework
• Spark SQL
• DataFrames
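A minimal sketch of how these features could be used from Java, assuming the connector's MongoSpark.load / withPipeline / toDF API; exact class names and config keys vary by connector version, and the URI, field name, and e-mail address below are made up for illustration.

import java.util.Collections;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

public class SparkEnronExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mongodb-spark-sketch")
                // Hypothetical source collection.
                .config("spark.mongodb.input.uri", "mongodb://localhost:27017/enron.messages")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Filter source data with the Aggregation Framework: the $match stage runs
        // inside MongoDB, so only matching documents are shipped to Spark.
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc)
                .withPipeline(Collections.singletonList(
                        Document.parse("{ \"$match\": { \"headers.From\": \"ken@example.com\" } }")));

        // Spark SQL / DataFrames over the same collection.
        Dataset<Row> df = rdd.toDF();
        df.createOrReplaceTempView("messages");
        spark.sql("SELECT count(*) AS total FROM messages").show();

        spark.stop();
    }
}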
48. MongoDB + Hadoop + Spark
Benefits
• Access to machine learning libraries
• Closed-loop processing
• Reduced deployment complexity
• Leverage MongoDB caching
• Persist intermediary results
49. Thanks!
{ name: 'Bryan Reinero',
title: 'Developer Advocate',
twitter: '@blimpyacht',
code: 'github.com/breinero',
email: 'bryan@mongodb.com' }