Using MongoDB with Hadoop & Spark
1. MongoDB, Hadoop, & Spark
{ Name: 'Bryan Reinero',
Title: 'Developer Advocate',
Twitter: '@blimpyacht',
Email: 'bryan@mongodb.com',
Web: 'mongodb.com/bryan' }
9. 9
Yet Another Resource Negotiator
[Diagram: the client submits the job with "bin/hadoop jar MyJob.jar MongoDB_Hadoop_Connector.jar" to the Resource Manager; the Resource Manager starts an Application Master on one of the compute nodes, and the Application Master asks the Node Managers for containers in which the job's tasks run.]
10. 10
Running an Example
[Diagram: the same YARN layout as above; the client launches the example job on the cluster with "bin/hadoop jar MyJob.jar MongoDB_Hadoop_Connector.jar".]
14. 14
What You’re Gonna Need
A reducer class that extends org.apache.hadoop.mapreduce.Reducer
A mapper class that extends org.apache.hadoop.mapreduce.Mapper
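The deck doesn't show the driver that ties these two classes together. Below is a minimal sketch of what one could look like, assuming the MongoDB Hadoop connector's MongoInputFormat, MongoOutputFormat, and MongoConfigUtil helpers; the EnronMail class name and the enron.messages / enron.message_pairs URIs are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class EnronMail {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read the source e-mails from MongoDB and write the results back
        // to MongoDB (hypothetical URIs).
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/enron.messages");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/enron.message_pairs");

        Job job = Job.getInstance(conf, "enron map/reduce");
        job.setJarByClass(EnronMail.class);

        job.setMapperClass(EnronMailMapper.class);
        job.setReducerClass(EnronMailReducer.class);

        // Key/value types match the mapper and reducer shown on the next slides.
        job.setMapOutputKeyClass(MailPair.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(BSONWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // Read from and write to MongoDB instead of HDFS.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is then submitted exactly as on the earlier slides: bin/hadoop jar MyJob.jar with the connector jar on the classpath.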
15. 15
EnronMailMapper Class
public class EnronMailMapper
        extends org.apache.hadoop.mapreduce.Mapper<Object, BSONObject, MailPair, IntWritable> {

    // Reusable output key/value; intw holds the constant count (1) that the reducer sums.
    private final MailPair mp = new MailPair();
    private final IntWritable intw = new IntWritable(1);

    @Override
    public void map(final Object key, final BSONObject val, final Context context)
            throws IOException, InterruptedException {
        // Each input document is one e-mail; its "headers" subdocument holds the To: line.
        BSONObject headers = (BSONObject) val.get("headers");
        String to = (String) headers.get("To");
        if (null != to) {
            // Emit one (from, to) pair per recipient on the To: line.
            String[] recipients = to.split(",");
            for (final String recip1 : recipients) {
                String recip = recip1.trim();
                if (recip.length() > 0) {
                    mp.setFrom((String) key);
                    mp.setTo(recip);
                    context.write(mp, intw);
                }
            }
        }
    }
}
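MailPair, the composite key the mapper emits, isn't defined anywhere in the deck. Below is a rough sketch of what such a WritableComparable could look like; the field layout, serialization, and sort order are assumptions for illustration, not the deck's actual class.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A composite key holding (from, to), so all messages between the same pair
// of people are grouped into a single reduce() call.
public class MailPair implements WritableComparable<MailPair> {
    private String from = "";
    private String to = "";

    public String getFrom() { return from; }
    public String getTo() { return to; }
    public void setFrom(String from) { this.from = from; }
    public void setTo(String to) { this.to = to; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(from);
        out.writeUTF(to);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        from = in.readUTF();
        to = in.readUTF();
    }

    @Override
    public int compareTo(MailPair other) {
        int cmp = from.compareTo(other.from);
        return cmp != 0 ? cmp : to.compareTo(other.to);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MailPair)) return false;
        MailPair p = (MailPair) o;
        return from.equals(p.from) && to.equals(p.to);
    }

    @Override
    public int hashCode() {
        return 31 * from.hashCode() + to.hashCode();
    }
}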
16. 16
EnronMailReducer Class
public class EnronMailReducer
        extends org.apache.hadoop.mapreduce.Reducer<MailPair, IntWritable, BSONWritable, IntWritable> {

    // Reusable output key/value; the BSONWritable key becomes the output document.
    private final BSONWritable reduceResult = new BSONWritable();
    private final IntWritable intw = new IntWritable();

    @Override
    public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues,
                       final Context pContext)
            throws IOException, InterruptedException {
        // Sum the counts emitted for this (from, to) pair.
        int sum = 0;
        for (final IntWritable value : pValues) {
            sum += value.get();
        }
        BSONObject outDoc = BasicDBObjectBuilder.start()
                .add("f", pKey.getFrom())
                .add("t", pKey.getTo()).get();
        reduceResult.setDoc(outDoc);
        intw.set(sum);
        pContext.write(reduceResult, intw);
    }
}
18. 18
MapReduce
map() {
    String to = (String) headers.get("To");
    if (null != to) {
        String[] recipients = to.split(",");
        for (final String recip1 : recipients) {
            String recip = recip1.trim();
            if (recip.length() > 0) {
                mp.setFrom((String) key);
                mp.setTo(recip);
                context.write(mp, intw);
            }
        }
    }
}
21. 21
MapReduce
{ from: "Ken", to: "Jeff" }
map() {
    String to = (String) headers.get("To");
    if (null != to) {
        String[] recipients = to.split(",");
        for (final String recip1 : recipients) {
            String recip = recip1.trim();
            if (recip.length() > 0) {
                mp.setFrom((String) key);
                mp.setTo(recip);
                context.write(mp, intw);
            }
        }
    }
}
22. 22
MapReduce
{ from: "Ken", to: "Jeff" }
map() {
    String to = (String) headers.get("To");
    if (null != to) {
        String[] recipients = to.split(",");
        for (final String recip1 : recipients) {
            String recip = recip1.trim();
            if (recip.length() > 0) {
                mp.setFrom((String) key);
                mp.setTo(recip);
                context.write(mp, intw);
            }
        }
    }
}
{ from: "Ken", to: "Andrew" }
{ from: "Ken", to: "Rebecca" }
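Reading the walkthrough: the mapper fans one of Ken's e-mails out into the (from, to) pairs shown above, writing each as a MailPair key with an IntWritable count of 1 (the value the summing reducer implies). In the deck's own notation, the emitted key/value pairs would look roughly like:

( { from: "Ken", to: "Jeff" }, 1 )
( { from: "Ken", to: "Andrew" }, 1 )
( { from: "Ken", to: "Rebecca" }, 1 )

Before reduce() runs, the framework groups values by key, so every 1 emitted for the (Ken, Jeff) pair arrives in a single reduce() call.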
31. 31
MapReduce
public void reduce() {
    int sum = 0;
    for (final IntWritable value : pValues) {
        sum += value.get();
    }
    BSONObject outDoc = BasicDBObjectBuilder.start()
            .add("f", pKey.getFrom())
            .add("t", pKey.getTo()).get();
    reduceResult.setDoc(outDoc);
    intw.set(sum);
    pContext.write(reduceResult, intw);
}
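Continuing the earlier example: if Ken had mailed Jeff three times, the reducer would receive the (Ken, Jeff) key with its grouped counts and sum them. A worked illustration (not slide content):

reduce( { from: "Ken", to: "Jeff" }, [1, 1, 1] )
    sum = 3
    writes { f: "Ken", t: "Jeff" } as the output key and 3 as the value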
44. Spark Integrations to Come…
New Spark Connector
• Filter source data with the Aggregation Framework
• Spark SQL
• DataFrames
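A minimal sketch of how these features could be used from Java, assuming the connector's MongoSpark.load / withPipeline / toDF API; exact class names and config keys vary by connector version, and the URI, field name, and e-mail address below are made up for illustration.

import java.util.Collections;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.bson.Document;
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;

public class SparkEnronExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mongodb-spark-sketch")
                // Hypothetical source collection.
                .config("spark.mongodb.input.uri", "mongodb://localhost:27017/enron.messages")
                .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Filter source data with the Aggregation Framework: the $match stage runs
        // inside MongoDB, so only matching documents are shipped to Spark.
        JavaMongoRDD<Document> rdd = MongoSpark.load(jsc)
                .withPipeline(Collections.singletonList(
                        Document.parse("{ \"$match\": { \"headers.From\": \"ken@example.com\" } }")));

        // Spark SQL / DataFrames over the same collection.
        Dataset<Row> df = rdd.toDF();
        df.createOrReplaceTempView("messages");
        spark.sql("SELECT count(*) AS total FROM messages").show();

        spark.stop();
    }
}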
48. MongoDB + Hadoop + Spark
Benefits
• Access to machine learning libraries
• Closed-loop processing
• Reduced deployment complexity
• Leverage MongoDB caching
• Persist intermediary results
49. Thanks!
{ name: 'Bryan Reinero',
title: 'Developer Advocate',
twitter: '@blimpyacht',
code: 'github.com/breinero',
email: 'bryan@mongodb.com' }