MongoDB Hadoop Connector
Luke Lovett
Maintainer, mongo-hadoop
https://github.com/mongodb/mongo-hadoop
Overview
• Hadoop Overview
• Why MongoDB and Hadoop
• Connector Overview
• Technical look into new features
• What’s on the horizon?
• Wrap-up
Hadoop Overview
• Distributed data processing
• Fulfills analytical requirements
• Jobs are infrequent, batch processes
Use cases: Churn Analysis, Recommendation, Warehouse/ETL, Risk Modeling,
Trade Surveillance, Predictive Analysis, Ad Targeting, Sentiment Analysis
MongoDB + Hadoop
• MongoDB backs application
• Satisfy queries in real-time
• MongoDB + Hadoop = application data analytics
Connector Overview
• Brings operational data into analytical lifecycle
• Supporting an evolving Hadoop ecosystem
– Apache Spark has made a huge entrance
• MongoDB interaction seamless, natural
Connector Examples
MongoInputFormat MongoOutputFormat
BSONFileInputFormat BSONFileOutputFormat
Pig
data =
LOAD 'mongodb://myhost/db.collection'
USING com.mongodb.hadoop.pig.MongoLoader();
Connector Examples
Hive
CREATE EXTERNAL TABLE mongo (
title STRING,
address STRUCT<from:STRING, to:STRING>)
STORED BY
'com.mongodb.hadoop.hive.MongoStorageHandler';
Connector Examples
Spark (Python)
import pymongo_spark
pymongo_spark.activate()
rdd = sc.MongoRDD('mongodb://host/db.coll')
New Features
• Hive predicate pushdown
• Pig projection
• Compression support for BSON
• PySpark support
• MongoSplitter improvements
PySpark
• Python shell
• Submit jobs written in Python
• Problem: How do we provide a natural Python syntax
for accessing the connector inside the JVM?
• What we want:
– Support for PyMongo’s objects
– Have a natural API for working with MongoDB inside
Spark’s Python shell
PySpark
We need to understand:
• How do the JVM and Python work together in Spark?
• What does data look like between these processes?
• How does the MongoDB Hadoop Connector fit into this?
We need to take a look inside PySpark.
What’s Inside PySpark?
• Uses py4j to connect to JVM running Spark
• Communicates objects to/from JVM using Python’s
pickle protocol
• org.apache.spark.api.python.Converter converts
Writables to Java Objects and vice-versa
• Special PythonRDD type encapsulates JVM gateway
and necessary Converters, Picklers, and Constructors
for un-pickling
What’s Inside PySpark?
JVM Gateway
[Figure: the Python side and the Java side of the gateway]
What’s Inside PySpark?
PythonRDD
Python: keeps a reference to the SparkContext and the JVM gateway
Java: simply wraps a JavaRDD and does some conversions
What’s Inside PySpark?
Pickler/Unpickler – What is a Pickle, anyway?
• Pickle – a Python object serialized into a byte stream, which can be saved to a file
• The pickle format defines a set of opcodes that operate as in a stack machine
• Pickling turns a Python object into a stream of opcodes
• Unpickling executes those opcodes, getting a Python object back out
Example (pickle version 2)
>>> pickletools.dis(pickletools.optimize(pickle.dumps(doc)))
0: ( MARK
1: d DICT (MARK at 0)
2: S STRING '_id'
9: c GLOBAL 'copy_reg _reconstructor'
34: ( MARK
35: c GLOBAL 'bson.objectid ObjectId'
59: c GLOBAL '__builtin__ object'
79: N NONE
80: t TUPLE (MARK at 34)
81: R REDUCE
82: S STRING 'VK\xc7ln2\xab`\x8fS\x14\xea'
113: b BUILD
114: s SETITEM
115: S STRING 'hello'
124: S STRING 'world'
133: s SETITEM
134: . STOP
{'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}
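The stack-machine behaviour in the listing above can be reproduced on a current Python 3 interpreter. The following is only a sketch under stated assumptions: it pickles a `datetime.date` as a stand-in for PyMongo's `ObjectId` (so it runs without PyMongo installed), and it forces protocol 0, the text protocol whose opcodes appear to match the listing. The document contents and the date are made up for illustration.

```python
import datetime
import pickle
import pickletools

# Stand-in document (assumption): datetime.date plays the role of ObjectId,
# because both are rebuilt by looking up a callable by name and calling it.
doc = {"created": datetime.date(2015, 11, 18), "hello": "world"}

# Protocol 0 is the human-readable protocol; optimize() strips unused PUT opcodes.
payload = pickletools.optimize(pickle.dumps(doc, protocol=0))

# genops() walks the byte stream and yields (opcode, argument, position) tuples.
opcode_names = [opcode.name for opcode, _arg, _pos in pickletools.genops(payload)]

# Unpickling replays the opcodes on a stack and returns the object left on top.
restored = pickle.loads(payload)

if __name__ == "__main__":
    pickletools.dis(payload)  # prints a disassembly like the one on the slide
    print(restored)
```

GLOBAL pushes a callable looked up by module and name, and REDUCE pops the callable plus its argument tuple and pushes the result. That is the same pattern the slide shows for `ObjectId`.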
What’s Inside PySpark?
Pickle, implemented by Pyrolite library
Pyrolite - Python Remote Objects "light" and Pickle for Java/.NET
https://github.com/irmen/Pyrolite
• Pyrolite library allows Spark to use Python’s Pickle protocol to
serialize/deserialize Python objects across the gateway.
• Hooks available for handling custom types in each direction
– registerCustomPickler – define how to turn a Java object
into a Python Pickle byte stream
– registerConstructor – define how to construct a Java object
for a given Python type
What’s Inside PySpark?
BSONPickler – translates Java -> PyMongo
PyMongo – MongoDB Python driver
https://github.com/mongodb/mongo-python-driver
Special handling for
- Binary
- BSONTimestamp
- Code
- DBRef
- ObjectId
- Regex
- Min/MaxKey
“PySpark” – Before Picture
>>> config = {'mongo.input.uri': 'mongodb://host/db.input',
...           'mongo.output.uri': 'mongodb://host/db.output'}
>>> rdd = sc.newAPIHadoopRDD(
...     'com.mongodb.hadoop.MongoInputFormat',
...     'org.apache.hadoop.io.Text',
...     'org.apache.hadoop.io.MapWritable',
...     None, None, config)
>>> rdd.first()
({u'timeSecond': 1421872408, u'timestamp': 1421872408, u'__class__':
u'org.bson.types.ObjectId', u'machine': 374500293, u'time': 1421872408000, u'date':
datetime.datetime(2015, 1, 21, 12, 33, 28), u'new': False, u'inc': -1652246148}, {u'Hello':
u'World'})
>>> # do some processing with RDD
>>> processed_rdd = ...
>>> processed_rdd.saveAsNewAPIHadoopFile(
...     'file:///unused',
...     'com.mongodb.hadoop.MongoOutputFormat',
...     None, None, None, None, config)
PySpark – After Picture
>>> import pymongo_spark
>>> pymongo_spark.activate()
>>> rdd = sc.MongoRDD('mongodb://host/db.input')
>>> rdd.first()
{u'_id': ObjectId('562e64ea6e32ab169586f9cc'), u'Hello': u'World'}
>>> processed_rdd = ...
>>> processed_rdd.saveToMongoDB(
...     'mongodb://host/db.output')
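One way an `activate()` call could make `sc.MongoRDD(...)` available is to attach a helper method to the context class at runtime. The sketch below illustrates that idea only and is not the pymongo_spark source: `FakeSparkContext`, `mongo_rdd`, and the dictionary it returns are hypothetical stand-ins so the example runs without Spark installed.

```python
class FakeSparkContext:
    """Hypothetical stand-in for pyspark.SparkContext (not the real class)."""

    def newAPIHadoopRDD(self, input_format, key_class, value_class,
                        key_converter=None, value_converter=None, conf=None):
        # The real method returns an RDD; this stand-in just echoes its inputs.
        return {"format": input_format, "conf": conf}


def mongo_rdd(self, connection_string, config=None):
    """Fill in the boilerplate arguments from the 'before' picture."""
    conf = dict(config or {})
    conf["mongo.input.uri"] = connection_string
    return self.newAPIHadoopRDD(
        "com.mongodb.hadoop.MongoInputFormat",
        "org.apache.hadoop.io.Text",
        "org.apache.hadoop.io.MapWritable",
        None, None, conf)


def activate(context_class=FakeSparkContext):
    """Attach the helper so every context instance gains a MongoRDD method."""
    context_class.MongoRDD = mongo_rdd


activate()
sc = FakeSparkContext()
rdd = sc.MongoRDD("mongodb://host/db.input")
```

The caller now supplies only the connection string, while the input format and Writable class names stay hidden inside the helper.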
MongoSplitter
• splitting – cutting up data to distribute among worker nodes
• Hadoop InputSplits / Spark Partitions
• very important to get splitting right for optimum performance
• improvements in splitting for mongo-hadoop
MongoSplitter
Splitting Algorithms
• split per shard chunk
• split per shard
• split using splitVector command
[Diagram: connector, mongos, config servers, shard 0, shard 1]
MongoSplitter
Split per Shard Chunk
shards:
{ "_id" : "shard01", "host" : "shard01/llp:27018,llp:27019,llp:27020" }
{ "_id" : "shard02", "host" : "shard01/llp:27021,llp:27022,llp:27023" }
{ "_id" : "shard03", "host" : "shard01/llp:27024,llp:27025,llp:27026" }
databases:
{ "_id" : "customer", "partitioned" : true, "primary" : "shard01" }
customer.emails
shard key: { "headers.From" : 1 }
chunks:
shard01 21
shard02 21
shard03 20
{ "headers.From" : { "$minKey": 1}} -->>
{ "headers.From" : "charlie@foo.com" } on : shard01 Timestamp(42, 1)
{ "headers.From" : "charlie@foo.com" } -->>
{ "headers.From" : "mildred@foo.com" } on : shard02 Timestamp(42, 1)
{ "headers.From" : "mildred@foo.com" } -->>
{ "headers.From" : { "$maxKey": 1 }} on : shard01 Timestamp(41, 1)
MongoSplitter
Splitting Algorithms
• split per shard chunk
• split per shard
• split using splitVector command
Index: _id_1
{"splitVector": "db.collection",
 "keyPattern": {"_id": 1},
 "maxChunkSize": 42}
Split points: _id: 0, _id: 25, _id: 50, _id: 75, _id: 100
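Split points like those above can be turned into one key range per split. The sketch below shows the general idea and is not the connector's implementation: the function names are hypothetical, `None` stands for an open-ended bound, and it emits `$gte`/`$lt` range filters, which is only one possible way to express the bounds.

```python
def ranges_from_split_keys(split_keys):
    """Turn N interior split points into N + 1 half-open (low, high) ranges."""
    bounds = [None] + list(split_keys) + [None]  # None means "unbounded"
    return list(zip(bounds, bounds[1:]))


def range_query(field, low, high):
    """Build a MongoDB-style filter document for one half-open range."""
    condition = {}
    if low is not None:
        condition["$gte"] = low
    if high is not None:
        condition["$lt"] = high
    return {field: condition} if condition else {}


split_ranges = ranges_from_split_keys([25, 50, 75])
queries = [range_query("_id", low, high) for low, high in split_ranges]
```

Each worker would then read only the documents selected by its own filter.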
MongoSplitter
Problem: empty/unbalanced splits
Query
{"createdOn":
 {"$lte": ISODate("2015-10-26T23:51:05.787Z")}}
• can use index on “createdOn”
• splitVector can’t split on a subset of the index
• some splits might be empty
Solutions
• Create a new collection with subset of data
• Create index over relevant documents only
• Learn to live with empty splits
MongoSplitter
Alternatives
Filtering out empty splits:
mongo.input.split.filter_empty=true
• create cursor, check for empty
• empty splits are thrown out from the final list
• save resources from task processing empty split
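The filtering steps above can be sketched against an in-memory list standing in for a collection. Everything here is an illustrative assumption: `filter_empty_splits`, the sample documents, and the lambda playing the role of the input query are hypothetical, and a real check would open a cursor per split rather than scan a list.

```python
def filter_empty_splits(splits, documents, query_matches):
    """Keep only the (low, high) ranges that contain a matching document."""
    kept = []
    for low, high in splits:
        # Stand-in for "create cursor, check for empty": any() stops at the first hit.
        has_match = any(
            low <= doc["_id"] < high and query_matches(doc) for doc in documents
        )
        if has_match:
            kept.append((low, high))
    return kept


documents = [{"_id": i, "createdOn": i} for i in range(100)]
splits = [(0, 25), (25, 50), (50, 75), (75, 100)]

# Hypothetical input query: only documents with createdOn <= 30 match.
non_empty = filter_empty_splits(splits, documents, lambda d: d["createdOn"] <= 30)
```

The last two ranges contain no matching documents, so no task is scheduled for them.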
MongoSplitter
Problem: empty/unbalanced splits
Query
{"published": true}
• No index on "published" means splits are more likely to be unbalanced
• Query selects documents throughout the index used for the split pattern
MongoSplitter
Solution
MongoPaginatingSplitter
mongo.splitter.class=com.mongodb.hadoop.splitter.MongoPaginatingSplitter
• one-time collection scan, but splits have efficient queries
• no empty splits
• splits of equal size (except for last)
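The pagination idea can be sketched in a few lines: walk the matching documents once in key order and cut a boundary every fixed number of documents. This is a simplified model rather than the splitter's actual code: `paginate_splits`, the sample data, and the inclusive bounds are assumptions, and a real implementation would page through the collection with queries instead of sorting a list in memory.

```python
def paginate_splits(documents, query_matches, key, docs_per_split):
    """Return inclusive (first_key, last_key) bounds, one pair per page."""
    # One pass over the data: collect and order the keys of matching documents.
    keys = sorted(doc[key] for doc in documents if query_matches(doc))
    splits = []
    for start in range(0, len(keys), docs_per_split):
        page = keys[start:start + docs_per_split]
        splits.append((page[0], page[-1]))
    return splits


# Hypothetical collection: only even-numbered documents are published.
documents = [{"_id": i, "published": i % 2 == 0} for i in range(10)]
pages = paginate_splits(documents, lambda d: d["published"], "_id", 2)
```

Because boundaries are drawn from matching documents only, every split is non-empty and all but the last hold the same number of documents.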
MongoSplitter
• choose the right splitting algorithm
• more efficient splitting with input query
Future Work – Data Locality
• Processing happens where the data lives
• Hadoop
– namenode (NN) knows locations of blocks
– InputFormat can specify split locations
– jobtracker collaborates with NN to schedule tasks to
take advantage of data locality
• Spark
– RDD.getPreferredLocations
Future Work – Data Locality
https://jira.mongodb.org/browse/HADOOP-202
Idea:
• Data node/executor on same machine as shard
• Connector assigns work based on local chunks
Future Work – Data Locality
• Set up Spark executors or Hadoop data nodes on machines with shards running
• Mark each InputSplit or Partition with the shard host that contains it
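A rough sketch of the marking step, assuming each split already knows which shard host holds its data. The names `assign_preferred_hosts`, `chunk-0`, and `hostA` are hypothetical, and this does not describe how the connector or any scheduler implements locality.

```python
def assign_preferred_hosts(split_to_shard_host, executor_hosts):
    """Map each split to its shard host if an executor runs there, else None."""
    available = set(executor_hosts)
    return {
        split: (host if host in available else None)
        for split, host in split_to_shard_host.items()
    }


# Hypothetical layout: three chunks on three hosts, executors on only two of them.
splits = {"chunk-0": "hostA", "chunk-1": "hostB", "chunk-2": "hostC"}
placement = assign_preferred_hosts(splits, ["hostA", "hostB"])
```

A split with no co-located executor gets no preference, so the scheduler is free to run it anywhere.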
Wrapping Up
• Investigating Python in Spark
• Understand splitting algorithms
• Data locality with MongoDB
Thank You!
Questions?
Github:
https://github.com/mongodb/mongo-hadoop
Issue Tracker:
https://jira.mongodb.org/browse/HADOOP

MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB
 




MongoDB Days Silicon Valley: MongoDB and the Hadoop Connector

  • 1. MongoDB Hadoop Connector Luke Lovett Maintainer, mongo-hadoop https://github.com/mongodb/mongo-hadoop
  • 2. Overview • Hadoop Overview • Why MongoDB and Hadoop • Connector Overview • Technical look into new features • What’s on the horizon? • Wrap-up
  • 3. Hadoop Overview • Distributed data processing • Fulfills analytical requirements • Jobs are infrequent, batch processes Churn Analysis Recommendation Warehouse/ETL Risk Modeling Trade Surveillance Predictive Analysis Ad Targeting Sentiment Analysis
  • 4. MongoDB + Hadoop • MongoDB backs application • Satisfy queries in real-time • MongoDB + Hadoop = application data analytics
  • 5. Connector Overview • Brings operational data into analytical lifecycle • Supporting an evolving Hadoop ecosystem – Apache Spark has made a huge entrance • MongoDB interaction seamless, natural
  • 6. Connector Examples MongoInputFormat MongoOutputFormat BSONFileInputFormat BSONFileOutputFormat Pig data = LOAD 'mongodb://myhost/db.collection' USING com.mongodb.hadoop.MongoInputFormat
  • 7. Connector Examples MongoInputFormat MongoOutputFormat BSONFileInputFormat BSONFileOutputFormat Hive CREATE EXTERNAL TABLE mongo ( title STRING, address STRUCT<from:STRING, to:STRING>) STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler';
  • 8. Connector Examples MongoInputFormat MongoOutputFormat BSONFileInputFormat BSONFileOutputFormat Spark (Python) import pymongo_spark pymongo_spark.activate() rdd = sc.MongoRDD('mongodb://host/db.coll')
  • 9. New Features • Hive predicate pushdown • Pig projection • Compression support for BSON • PySpark support • MongoSplitter improvements
  • 10. PySpark • Python shell • Submit jobs written in Python • Problem: How do we provide a natural Python syntax for accessing the connector inside the JVM? • What we want: – Support for PyMongo’s objects – Have a natural API for working with MongoDB inside Spark’s Python shell
  • 11. PySpark We need to understand: • How do the JVM and Python work together in Spark? • What does data look like between these processes? • How does the MongoDB Hadoop Connector fit into this? We need to take a look inside PySpark.
  • 12. What’s Inside PySpark? • Uses py4j to connect to JVM running Spark • Communicates objects to/from JVM using Python’s pickle protocol • org.apache.spark.api.python.Converter converts Writables to Java Objects and vice-versa • Special PythonRDD type encapsulates JVM gateway and necessary Converters, Picklers, and Constructors for un-pickling
  • 13. What’s Inside PySpark? JVM Gateway [Python and Java code listings not captured in the transcript]
  • 14. What’s Inside PySpark? PythonRDD Python: Keeps Reference to SparkContext, JVM Gateway Java: simply wrap a JavaRDD and do some conversions
  • 15. What’s Inside PySpark? Pickler/Unpickler – What is a Pickle, anyway? • Pickle – a Python object serialized into a byte stream, can be saved to a file • defines a set of opcodes that operate as in a stack machine • pickling turns a Python object into a stream of opcodes • unpickling performs the operators, getting a Python object out
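The opcode view of a pickle described above can be reproduced with the standard library alone. The following is a minimal sketch, assuming Python 3 and a plain dictionary (no PyMongo types); the variable names are illustrative only.

```python
import pickle
import pickletools

doc = {"hello": "world"}

# Protocol 0 is the text-based protocol whose opcodes are printable characters.
raw = pickletools.optimize(pickle.dumps(doc, protocol=0))

# genops yields (opcode, argument, position) for each instruction in the stream.
opcode_names = [opcode.name for opcode, _arg, _pos in pickletools.genops(raw)]
print(opcode_names)

# Unpickling executes the opcodes on a stack machine and rebuilds the object.
restored = pickle.loads(raw)
print(restored)
```

`pickletools.dis(raw)` prints the same stream in the annotated form used on the slides.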
  • 16. Example (pickle version 2)
    >>> pickletools.dis(pickletools.optimize(pickle.dumps(doc)))
        0: (    MARK
        1: d        DICT       (MARK at 0)
        2: S    STRING     '_id'
        9: c    GLOBAL     'copy_reg _reconstructor'
       34: (    MARK
       35: c        GLOBAL     'bson.objectid ObjectId'
       59: c        GLOBAL     '__builtin__ object'
       79: N        NONE
       80: t        TUPLE      (MARK at 34)
       81: R    REDUCE
       82: S    STRING     'VK\xc7ln2\xab`\x8fS\x14\xea'
      113: b    BUILD
      114: s    SETITEM
      115: S    STRING     'hello'
      124: S    STRING     'world'
      133: s    SETITEM
      134: .    STOP
    {'_id': ObjectId('564bc76c6e32ab608f5314ea'), 'hello': 'world'}
  • 21. What’s Inside PySpark? Pickle, implemented by Pyrolite library Pyrolite - Python Remote Objects "light" and Pickle for Java/.NET https://github.com/irmen/Pyrolite • Pyrolite library allows Spark to use Python’s Pickle protocol to serialize/deserialize Python objects across the gateway. • Hooks available for handling custom types in each direction – registerCustomPickler – define how to turn a Java object into a Python Pickle byte stream – registerConstructor – define how to construct a Java object for a given Python type
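Pyrolite's hooks are Java APIs, so they are not reproduced here. As an analogy only, the sketch below shows the same two ideas (a custom rule for writing a type, and a custom rule for constructing an object when reading a type) using Python's standard `pickle` module. `decimal.Decimal` stands in for a driver type, and the helper names `reduce_decimal`, `dumps_with_custom_rule`, and `FloatForDecimalUnpickler` are hypothetical.

```python
import decimal
import io
import pickle


def reduce_decimal(value):
    # Custom pickling rule: describe a Decimal as "call str() with this text".
    return str, (str(value),)


def dumps_with_custom_rule(obj):
    buffer = io.BytesIO()
    pickler = pickle.Pickler(buffer)
    # A per-pickler dispatch table maps a type to its reduction function.
    pickler.dispatch_table = {decimal.Decimal: reduce_decimal}
    pickler.dump(obj)
    return buffer.getvalue()


class FloatForDecimalUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Custom construction rule: build a float wherever the stream names Decimal.
        if (module, name) == ("decimal", "Decimal"):
            return float
        return super().find_class(module, name)


price = decimal.Decimal("1.5")

# Direction 1: the writer decides how the value is encoded.
as_text = pickle.loads(dumps_with_custom_rule({"price": price}))

# Direction 2: the reader decides what object a pickled type becomes.
stream = io.BytesIO(pickle.dumps({"price": price}))
as_float = FloatForDecimalUnpickler(stream).load()

print(as_text, as_float)
```

In the connector the two directions cross a language boundary (Java objects on one side, PyMongo objects on the other); here both sides are Python, which keeps the sketch runnable without a JVM.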
  • 22. What’s Inside PySpark? BSONPickler – translates Java -> PyMongo PyMongo – MongoDB Python driver https://github.com/mongodb/mongo-python-driver Special handling for - Binary - BSONTimestamp - Code - DBRef - ObjectId - Regex - Min/MaxKey
  • 23. “PySpark” – Before Picture
    >>> config = {'mongo.input.uri': 'mongodb://host/db.input',
    ...           'mongo.output.uri': 'mongodb://host/db.output'}
    >>> rdd = sc.newAPIHadoopRDD(
    ...     'com.mongodb.hadoop.MongoInputFormat',
    ...     'org.apache.hadoop.io.Text',
    ...     'org.apache.hadoop.io.MapWritable',
    ...     None, None, config)
    >>> rdd.first()
    ({u'timeSecond': 1421872408, u'timestamp': 1421872408, u'__class__': u'org.bson.types.ObjectId', u'machine': 374500293, u'time': 1421872408000, u'date': datetime.datetime(2015, 1, 21, 12, 33, 28), u'new': False, u'inc': -1652246148}, {u'Hello': u'World'})
    >>> # do some processing with RDD
    >>> processed_rdd = ...
    >>> processed_rdd.saveAsNewAPIHadoopFile(
    ...     'file:///unused',
    ...     'com.mongodb.hadoop.MongoOutputFormat',
    ...     None, None, None, None, config)
  • 24. PySpark – After Picture
    >>> import pymongo_spark
    >>> pymongo_spark.activate()
    >>> rdd = sc.MongoRDD('mongodb://host/db.input')
    >>> rdd.first()
    {u'_id': ObjectId('562e64ea6e32ab169586f9cc'), u'Hello': u'World'}
    >>> processed_rdd = ...
    >>> processed_rdd.saveToMongoDB(
    ...     'mongodb://host/db.output')
  • 25. MongoSplitter • splitting – cutting up data to distribute among worker nodes • Hadoop InputSplits / Spark Partitions • very important to get splitting right for optimum performance • improvements in splitting for mongo-hadoop
  • 26. MongoSplitter Splitting Algorithms • split per shard chunk • split per shard • split using splitVector command [Diagram: connector, mongos, config servers, shard 0, shard 1]
  • 27. MongoSplitter Split per Shard Chunk
    shards:
      { "_id" : "shard01", "host" : "shard01/llp:27018,llp:27019,llp:27020" }
      { "_id" : "shard02", "host" : "shard01/llp:27021,llp:27022,llp:27023" }
      { "_id" : "shard03", "host" : "shard01/llp:27024,llp:27025,llp:27026" }
    databases:
      { "_id" : "customer", "partitioned" : true, "primary" : "shard01" }
        customer.emails
          shard key: { "headers.From" : 1 }
          chunks:
            shard01 21
            shard02 21
            shard03 20
          { "headers.From" : { "$minKey": 1 } } -->> { "headers.From" : "charlie@foo.com" } on : shard01 Timestamp(42, 1)
          { "headers.From" : "charlie@foo.com" } -->> { "headers.From" : "mildred@foo.com" } on : shard02 Timestamp(42, 1)
          { "headers.From" : "mildred@foo.com" } -->> { "headers.From" : { "$maxKey": 1 } } on : shard01 Timestamp(41, 1)
  • 28. MongoSplitter Splitting Algorithms • split per shard chunk • split per shard • split using splitVector command [Diagram: connector, mongos, config server, shard 0, shard 1]
  • 29. MongoSplitter Splitting Algorithms • split per shard chunk • split per shard • split using splitVector command _id_1 {"splitVector": "db.collection", "keyPattern": {"_id": 1}, "maxChunkSize": 42} _id: 0 _id: 25 _id: 50 _id: 75 _id: 100
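Once split points on a key are known, each split can be expressed as a range query on that key. The sketch below is illustrative only: `build_split_queries` is a hypothetical helper, not part of mongo-hadoop, and it assumes the split points are already sorted and that the first and last splits are left open-ended.

```python
def build_split_queries(split_keys, key="_id"):
    """Turn sorted split points into one range query per split."""
    # None marks an open end, so the first and last splits are unbounded.
    bounds = [None] + list(split_keys) + [None]
    queries = []
    for lower, upper in zip(bounds, bounds[1:]):
        key_range = {}
        if lower is not None:
            key_range["$gte"] = lower
        if upper is not None:
            key_range["$lt"] = upper
        queries.append({key: key_range})
    return queries


split_queries = build_split_queries([25, 50, 75])
for query in split_queries:
    print(query)
```

Using a half-open range (`$gte` lower, `$lt` upper) means every document falls into exactly one split.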
  • 30. MongoSplitter Problem: empty/unbalanced splits Query {"createdOn": {"$lte": ISODate("2015-10-26T23:51:05.787Z")}} • can use index on "createdOn" • splitVector can’t split on a subset of the index • some splits might be empty
  • 31. MongoSplitter Problem: empty/unbalanced splits Query {"createdOn": {"$lte": ISODate("2015-10-26T23:51:05.787Z")}} Solutions • Create a new collection with subset of data • Create index over relevant documents only • Learn to live with empty splits
  • 32. MongoSplitter Alternatives Filtering out empty splits: mongo.input.split.filter_empty=true • create cursor, check for empty • empty splits are thrown out from the final list • save resources from task processing empty split
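The effect of filtering empty splits can be simulated in memory. This is a sketch under stated assumptions: documents are a plain Python list rather than a MongoDB collection, splits are half-open `_id` ranges, and `split_is_empty` and `filter_empty_splits` are hypothetical names, not connector functions.

```python
from datetime import datetime

# Twenty in-memory documents; later _id values have later creation dates.
docs = [{"_id": i, "createdOn": datetime(2015, 10, 1 + i)} for i in range(20)]

# Splits computed on _id alone, without knowledge of the input query.
splits = [(0, 5), (5, 10), (10, 15), (15, 20)]


def input_query(doc):
    # Stand-in for {"createdOn": {"$lte": ...}}.
    return doc["createdOn"] <= datetime(2015, 10, 10)


def split_is_empty(documents, split, query):
    lower, upper = split
    # any() stops at the first match, much like checking a cursor for one document.
    return not any(lower <= d["_id"] < upper and query(d) for d in documents)


def filter_empty_splits(documents, candidate_splits, query):
    return [s for s in candidate_splits if not split_is_empty(documents, s, query)]


kept = filter_empty_splits(docs, splits, input_query)
print(kept)
```

Only the splits that contain at least one matching document survive, so no task is scheduled for a range the query would leave empty.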
  • 33. MongoSplitter Problem: empty/unbalanced splits Query {"published": true} • No index on "published" means splits more likely unbalanced • Query selects documents throughout index for split pattern
  • 34. MongoSplitter Solution: MongoPaginatingSplitter mongo.splitter.class=com.mongodb.hadoop.splitter.MongoPaginatingSplitter • one-time collection scan, but splits have efficient queries • no empty splits • splits of equal size (except for last)
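The idea of paginated splitting can be sketched in a few lines: walk the matching documents once in key order and start a new split every N documents. This is an illustration of the idea only, not the connector's implementation; `paginate_split_bounds` is a hypothetical helper and the keys are assumed to be already filtered by the input query and sorted.

```python
def paginate_split_bounds(sorted_keys, docs_per_split):
    """Pick every docs_per_split-th key as the lower bound of a split."""
    lowers = sorted_keys[::docs_per_split]
    # Each split ends where the next begins; the last split is open-ended.
    uppers = lowers[1:] + [None]
    return list(zip(lowers, uppers))


# Keys of the documents that match the input query, in index order.
matching_keys = [3, 8, 9, 14, 20, 21, 30]

bounds = paginate_split_bounds(matching_keys, docs_per_split=3)
print(bounds)

# Count how many matching documents land in each split.
sizes = [
    sum(1 for k in matching_keys if k >= lo and (hi is None or k < hi))
    for lo, hi in bounds
]
print(sizes)
```

Because the boundaries are taken from matching documents, no split is empty and every split except possibly the last holds the same number of documents.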
  • 35. MongoSplitter • choose the right splitting algorithm • more efficient splitting with input query
  • 36. Future Work – Data Locality • Processing happens where the data lives • Hadoop – namenode (NN) knows locations of blocks – InputFormat can specify split locations – jobtracker collaborates with NN to schedule tasks to take advantage of data locality • Spark – RDD.getPreferredLocations
  • 37. Future Work – Data Locality https://jira.mongodb.org/browse/HADOOP-202 Idea: • Data node/executor on same machine as shard • Connector assigns work based on local chunks
  • 38. Future Work – Data Locality • Set up Spark executors or Hadoop data nodes on machines with shards running • Mark each InputSplit or Partition with the shard host that contains it
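The locality idea can be sketched as a simple assignment step: if a split is marked with a shard host that also runs an executor, process it there; otherwise fall back to any executor. This is a hypothetical illustration of the proposal, not existing connector behaviour. `assign_splits` and the host names are made up, and at least one executor host is assumed.

```python
from itertools import cycle


def assign_splits(split_hosts, executor_hosts):
    """Map each split to an executor host, preferring the host holding its data."""
    # Round-robin over executors for splits whose shard host runs no executor.
    fallback = cycle(sorted(executor_hosts))
    assignment = {}
    for split_id, shard_host in split_hosts.items():
        if shard_host in executor_hosts:
            assignment[split_id] = shard_host
        else:
            assignment[split_id] = next(fallback)
    return assignment


split_hosts = {"split-0": "hostA", "split-1": "hostB", "split-2": "hostC"}
executor_hosts = {"hostA", "hostB"}

assignment = assign_splits(split_hosts, executor_hosts)
print(assignment)
```

In Hadoop and Spark the scheduler makes the final placement decision; the connector's part would be limited to reporting the preferred host for each split.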
  • 39. Wrapping Up • Investigating Python in Spark • Understand splitting algorithms • Data locality with MongoDB
  • 41. #MDBDays mongodb.com Get your technical questions answered In the foyer, 10:00 - 5:00 By appointment only – register in person
  • 42. Tell me how I did today on Guidebook and enter for a chance to win one of these. How to do it: Download the Guidebook App, search for MongoDB Silicon Valley, submit session feedback