MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

MongoDB Hadoop Connector
Luke Lovett
Maintainer, mongo-hadoop
https://github.com/mongodb/mongo-hadoop

Overview
• Hadoop Overview
• Why MongoDB and Hadoop
• Connector Overview
• Technical look into new features
• What’s on the horizon?
• Wrap-up

Hadoop Overview
• Distributed data processing
• Fulfills analytical requirements
• Jobs are infrequent, batch processes
• Use cases: Churn Analysis, Recommendation, Warehouse/ETL, Risk Modeling, Trade Surveillance, Predictive Analysis, Ad Targeting, Sentiment Analysis

MongoDB + Hadoop
• MongoDB backs application
• Satisfy queries in real-time
• MongoDB + Hadoop = application data analytics

Connector Overview
• Brings operational data into analytical lifecycle
• Supporting an evolving Hadoop ecosystem
– Apache Spark has made a huge entrance
• MongoDB interaction seamless, natural

Connector Examples
MongoInputFormat, MongoOutputFormat
BSONFileInputFormat, BSONFileOutputFormat
Pig
data =
LOAD 'mongodb://myhost/db.collection'
USING com.mongodb.hadoop.pig.MongoLoader;

Connector Examples
Hive
CREATE EXTERNAL TABLE mongo (
title STRING,
address STRUCT<from:STRING, to:STRING>)
STORED BY
'com.mongodb.hadoop.hive.MongoStorageHandler';

Connector Examples
Spark (Python)
import pymongo_spark
pymongo_spark.activate()
rdd = sc.MongoRDD('mongodb://host/db.coll')

New Features
• Hive predicate pushdown
• Pig projection
• Compression support for BSON
• PySpark support
• MongoSplitter improvements

PySpark
• Python shell
• Submit jobs written in Python
• Problem: How do we provide a natural Python syntax
for accessing the connector inside the JVM?
• What we want:
– Support for PyMongo’s objects
– Have a natural API for working with MongoDB inside
Spark’s Python shell

PySpark
We need to understand:
• How do the JVM and Python work together in Spark?
• What does data look like between these processes?
• How does the MongoDB Hadoop Connector fit into this?
We need to take a look inside PySpark.

What’s Inside PySpark?
• Uses py4j to connect to JVM running Spark
• Communicates objects to/from JVM using Python’s
pickle protocol
• org.apache.spark.api.python.Converter converts
Writables to Java Objects and vice-versa
• Special PythonRDD type encapsulates JVM gateway
and necessary Converters, Picklers, and Constructors
for un-pickling

JVM Gateway
[Slide shows the Python side and the Java side of the gateway; the code images are not reproduced]

PythonRDD
Python: keeps a reference to the SparkContext and the JVM gateway
Java: simply wraps a JavaRDD and does some conversions

Pickler/Unpickler – What is a Pickle, anyway?
• Pickle – a Python object serialized into a byte stream that can be saved to a file
• The pickle protocol defines a set of opcodes that operate as in a stack machine
• Pickling turns a Python object into a stream of opcodes
• Unpickling executes the opcodes, producing a Python object
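
The stack-machine behaviour described above can be observed with Python's standard pickle and pickletools modules. The following is a minimal sketch and not part of the original slides; the sample document and the variable names are illustrative assumptions.

```python
import pickle
import pickletools

# Illustrative document; any picklable Python object would do.
doc = {"hello": "world"}

# Protocol 0 is the text-based protocol, which keeps the opcode stream easy to read.
data = pickle.dumps(doc, protocol=0)

# genops yields (opcode, argument, position) for each instruction in the stream.
opcode_names = [op.name for op, arg, pos in pickletools.genops(data)]

# Unpickling replays the opcodes and leaves the rebuilt object on the stack.
restored = pickle.loads(data)
```

Calling pickletools.dis(data) prints the same stream as an annotated listing.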

Example (pickle version 2)
>>> pickletools.dis(pickletools.optimize(pickle.dumps(doc)))
0: ( MARK
1: d DICT (MARK at 0)
2: S STRING '_id'
9: c GLOBAL 'copy_reg _reconstructor'
34: ( MARK
35: c GLOBAL 'bson.objectid ObjectId'
59: c GLOBAL '__builtin__ object'
79: N NONE
80: t TUPLE (MARK at 34)
81: R REDUCE
82: S STRING 'VK\xc7ln2\xab`\x8fS\x14\xea'
113: b BUILD
114: s SETITEM
115: S STRING 'hello'
124: S STRING 'world'
133: s SETITEM
134: . STOP
{'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}

Pickle, implemented by Pyrolite library
Pyrolite - Python Remote Objects "light" and Pickle for Java/.NET
https://github.com/irmen/Pyrolite
• Pyrolite library allows Spark to use Python’s Pickle protocol to
serialize/deserialize Python objects across the gateway.
• Hooks available for handling custom types in each direction
– registerCustomPickler – define how to turn a Java object
into a Python Pickle byte stream
– registerConstructor – define how to construct a Java object
for a given Python type
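
Seen from the Python side, a non-builtin type is pickled by naming a callable by module and class (a GLOBAL opcode) and then calling it (a REDUCE opcode). A constructor hook on the Java side has to supply an equivalent for that module and class name. The sketch below uses fractions.Fraction from the standard library purely as a stand-in for a driver type such as ObjectId. That substitution is an assumption made for illustration, not something taken from the connector.

```python
import pickle
import pickletools
from fractions import Fraction

# Fraction is only a stand-in for a custom type such as bson.objectid.ObjectId.
value = Fraction(1, 3)
data = pickle.dumps(value, protocol=2)

# Collect (opcode name, argument) pairs from the pickle stream.
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(data)]

# GLOBAL pushes the callable named "module class" onto the stack;
# REDUCE then calls it with the argument tuple that follows.
global_targets = [arg for name, arg in ops if name == "GLOBAL"]
has_reduce = any(name == "REDUCE" for name, arg in ops)

# Python knows how to resolve fractions.Fraction, so the round trip succeeds.
restored = pickle.loads(data)
```

An unpickler running outside Python sees only the string "fractions Fraction" and must be told what to build for it, which is the role the constructor hook plays.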

BSONPickler – translates Java -> PyMongo
PyMongo – MongoDB Python driver
https://github.com/mongodb/mongo-python-driver
Special handling for
- Binary
- BSONTimestamp
- Code
- DBRef
- ObjectId
- Regex
- Min/MaxKey

“PySpark” – Before Picture
>>> config = {'mongo.input.uri': 'mongodb://host/db.input',
...           'mongo.output.uri': 'mongodb://host/db.output'}
>>> rdd = sc.newAPIHadoopRDD(
...     'com.mongodb.hadoop.MongoInputFormat',
...     'org.apache.hadoop.io.Text',
...     'org.apache.hadoop.io.MapWritable',
...     None, None, config)
>>> rdd.first()
({u'timeSecond': 1421872408, u'timestamp': 1421872408, u'__class__':
u'org.bson.types.ObjectId', u'machine': 374500293, u'time': 1421872408000, u'date':
datetime.datetime(2015, 1, 21, 12, 33, 28), u'new': False, u'inc': -1652246148}, {u'Hello':
u'World'})
>>> # do some processing with RDD
>>> processed_rdd = …
>>> processed_rdd.saveAsNewAPIHadoopFile(
...     'file:///unused',
...     'com.mongodb.hadoop.MongoOutputFormat',
...     None, None, None, None, config)

PySpark – After Picture
>>> import pymongo_spark
>>> pymongo_spark.activate()
>>> rdd = sc.MongoRDD('mongodb://host/db.input')
>>> rdd.first()
{u'_id': ObjectId('562e64ea6e32ab169586f9cc'), u'Hello': u'World'}
>>> processed_rdd = ...
>>> processed_rdd.saveToMongoDB(
...     'mongodb://host/db.output')

MongoSplitter
• splitting – cutting up data to distribute among worker nodes
• Hadoop InputSplits / Spark Partitions
• very important to get splitting right for optimum performance
• improvements in splitting for mongo-hadoop

MongoSplitter
Splitting Algorithms
• split per shard chunk
• split per shard
• split using splitVector command
[Diagram: connector, mongos, config servers, shard 0, shard 1]

MongoSplitter
Split per Shard Chunk
shards:
{ "_id" : "shard01", "host" : "shard01/llp:27018,llp:27019,llp:27020" }
databases:
{ "_id" : "customer", "partitioned" : true, "primary" : "shard01" }
customer.emails
shard key: { "headers.From" : 1 }
chunks:
shard01 21
shard02 21
shard03 20
{ "headers.From" : { "$minKey": 1}} -->>
{ "headers.From" : "charlie@foo.com" } on : shard01 Timestamp(42, 1)
{ "headers.From" : "charlie@foo.com" } -->>
{ "headers.From" : "mildred@foo.com" } on : shard02 Timestamp(42, 1)
{ "headers.From" : "mildred@foo.com" } -->>
{ "headers.From" : { "$maxKey": 1 }} on : shard01 Timestamp(41, 1)

MongoSplitter
• split per shard
[Diagram: connector, mongos, config server, shard 0, shard 1]

MongoSplitter
• split using splitVector command
_id_1
{"splitVector": "db.collection",
"keyPattern": {"_id": 1},
"maxChunkSize": 42}
_id: 0 _id: 25 _id: 50 _id: 75 _id: 100
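
Once split points are known, each pair of adjacent points can become one range query, with the first and last splits left open-ended. The following is a minimal sketch of that idea and not the connector's implementation; the function name and the choice of half-open ranges are assumptions.

```python
def ranges_from_split_points(points, key="_id"):
    """Turn sorted interior split points into half-open range filters on `key`."""
    # None marks an open end: no lower bound on the first split,
    # no upper bound on the last one.
    bounds = [None] + list(points) + [None]
    queries = []
    for lo, hi in zip(bounds, bounds[1:]):
        condition = {}
        if lo is not None:
            condition["$gte"] = lo
        if hi is not None:
            condition["$lt"] = hi
        # With no split points at all, a single unfiltered split covers everything.
        queries.append({key: condition} if condition else {})
    return queries

# Interior split points like those on the slide yield four splits.
splits = ranges_from_split_points([25, 50, 75])
```

Half-open ranges ensure that a document sitting exactly on a boundary lands in one split only.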

MongoSplitter
Problem: empty/unbalanced splits
Query
{"createdOn":
{"$lte": ISODate("2015-10-26T23:51:05.787Z")}}
• can use index on "createdOn"
• splitVector can’t split on a subset of the index
• some splits might be empty

MongoSplitter
Query
{"createdOn":
{"$lte": ISODate("2015-10-26T23:51:05.787Z")}}
Solutions
• Create a new collection with subset of data
• Create index over relevant documents only
• Learn to live with empty splits

MongoSplitter
Alternatives
Filtering out empty splits:
mongo.input.split.filter_empty=true
• create a cursor for each split and check whether it is empty
• empty splits are dropped from the final list
• saves the resources a task would spend processing an empty split
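
The effect of filtering can be sketched without a database. The example below uses an in-memory list as a hypothetical stand-in for a collection and `any()` as a stand-in for opening a cursor and checking for a first result. All names and data here are assumptions for illustration, not the connector's code.

```python
from datetime import datetime

# Hypothetical stand-in for a collection: 30 documents, one created per day.
documents = [{"_id": i, "createdOn": datetime(2015, 10, 1 + i)} for i in range(30)]

def matches(doc):
    """The input query: createdOn <= 2015-10-10."""
    return doc["createdOn"] <= datetime(2015, 10, 10)

def is_empty(split, docs, predicate):
    """Return True if no document in the split's _id range satisfies the query."""
    lo, hi = split
    # any() stops at the first hit, like checking whether a cursor has a result.
    return not any(lo <= d["_id"] < hi and predicate(d) for d in docs)

# Splits computed over the whole _id range, ignoring the query.
splits = [(0, 10), (10, 20), (20, 30)]

# Keep only splits that would actually return documents.
non_empty = [s for s in splits if not is_empty(s, documents, matches)]
```

Two of the three splits contain no matching documents, so only one task would be scheduled.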

MongoSplitter
Query
{"published": true}
• No index on "published" means splits are more likely to be unbalanced
• Query selects documents throughout the index used for the split pattern

MongoSplitter
Solution
MongoPaginatingSplitter
mongo.splitter.class=com.mongodb.hadoop.splitter.MongoPaginatingSplitter
• one-time collection scan, but splits have efficient queries
• no empty splits
• splits of equal size (except for last)
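
One way to picture pagination-based splitting: walk the matching documents in sorted key order and start a new split every N documents. The sketch below assumes that approach and works on a plain list of ids. It is an illustration of the properties listed above, not the splitter's actual implementation, and the function and variable names are hypothetical.

```python
def paginate(sorted_ids, docs_per_split):
    """Build (lower, upper) id ranges holding docs_per_split matching ids each.

    The upper bound is exclusive; None means the last split is open-ended.
    """
    # Every docs_per_split-th matching id starts a new split.
    boundaries = sorted_ids[::docs_per_split]
    uppers = boundaries[1:] + [None]
    return list(zip(boundaries, uppers))

# Hypothetical ids of documents that match the input query (every third one).
matching_ids = [i for i in range(30) if i % 3 == 0]

splits = paginate(matching_ids, 4)

# Count how many matching ids fall into each split.
sizes = [
    sum(1 for i in matching_ids if i >= lo and (hi is None or i < hi))
    for lo, hi in splits
]
```

Because the boundaries are taken from matching documents only, no split is empty and only the last one can be smaller than the rest.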

MongoSplitter
• choose the right splitting algorithm
• more efficient splitting with input query

Future Work – Data Locality
• Processing happens where the data lives
• Hadoop
– namenode (NN) knows locations of blocks
– InputFormat can specify split locations
– jobtracker collaborates with NN to schedule tasks to
take advantage of data locality
• Spark
– RDD.getPreferredLocations

https://jira.mongodb.org/browse/HADOOP-202
Idea:
• Data node/executor on same machine as shard
• Connector assigns work based on local chunks

• Set up Spark executors or Hadoop data nodes on machines
with shards running
• Mark each InputSplit or Partition with the shard host that
contains it
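
Marking a split with its shard host could look roughly like the sketch below. It parses a shard host string of the form shown in the earlier sharding status output and attaches the hostnames to each chunk-based split. The chunk bounds, dictionary layout, and function names are assumptions for illustration; this is an idea under discussion, not a shipped feature.

```python
def hosts_from_shard(host_string):
    """Parse 'replset/host1:port,host2:port' into a list of unique hostnames."""
    # Drop the replica set name if present.
    _, _, members = host_string.rpartition("/")
    hosts = []
    for member in members.split(","):
        name = member.rsplit(":", 1)[0]  # strip the port
        if name not in hosts:
            hosts.append(name)
    return hosts

def tag_splits(chunks, shards):
    """Attach the hostnames of the owning shard to each chunk-based split."""
    return [
        dict(chunk, hosts=hosts_from_shard(shards[chunk["shard"]]))
        for chunk in chunks
    ]

# Shard host string as printed in the sharding status; chunk bounds are hypothetical.
shards = {"shard01": "shard01/llp:27018,llp:27019,llp:27020"}
chunks = [{"min": "a@foo.com", "max": "charlie@foo.com", "shard": "shard01"}]

tagged = tag_splits(chunks, shards)
```

A scheduler could then compare each split's hosts with the machines running executors or data nodes and prefer a local match.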

Wrapping Up
• Investigating Python in Spark
• Understand splitting algorithms
• Data locality with MongoDB

Thank You!
Questions?
GitHub:
https://github.com/mongodb/mongo-hadoop
Issue Tracker:
https://jira.mongodb.org/browse/HADOOP

