MongoDB and Hadoop

Talking about
MongoDB Intro & Fundamentals
Why MongoDB & Hadoop
Getting Started
Using MongoDB & Hadoop
Future of Big Data

Steve @sp

A
15+ years building
the internet

Father, husband,
skateboarder

Chief Solutions Architect @
responsible for drivers,
integrations, web & docs

Company behind MongoDB
Offices in NYC, Palo Alto, London & Dublin
100+ employees
Support, consulting, training
Management: Google/DoubleClick, Oracle, Apple, NetApp, MarkLogic

Well Funded: Sequoia, Union Square, Flybridge

MongoDB
Application
Document Oriented
{ author : "steve",
  date : new Date(),
  text : "About MongoDB...",
  tags : ["tech", "database"] }
High Performance
Fully Consistent
Horizontally Scalable

MongoDB philosophy
Keep functionality when we can (key/value
stores are great, but we need more)
Non-relational (no joins) makes scaling
horizontally practical
Document data models are good
Database technology should run anywhere
virtualized, cloud, metal, etc

Under the hood
Written in C++
Runs nearly everywhere
Data serialized to BSON
Extensive use of memory-mapped files
i.e. read-through, write-through memory caching.
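
As a rough illustration of what "memory-mapped files" means, here is a minimal sketch using Python's standard mmap module. It only shows the general idea of reading and writing a file through a memory mapping; the file and its contents are made up for the example, and nothing here reflects MongoDB's actual storage format.

```python
import mmap
import os
import tempfile

# Create a small scratch file to stand in for a data file
# (hypothetical content, not MongoDB's on-disk format).
fd, path = tempfile.mkstemp()
os.write(fd, b"hello mongo")
os.close(fd)

with open(path, "r+b") as f:
    # Map the whole file into memory; slices of `mm` now read and
    # write the file's bytes via the operating system's page cache.
    with mmap.mmap(f.fileno(), 0) as mm:
        first_read = bytes(mm[:5])   # read through the mapping
        mm[:5] = b"HELLO"            # write through the mapping (same length)
        mm.flush()                   # ask the OS to persist the change

# Re-open the file normally to confirm the write reached it.
with open(path, "rb") as f:
    on_disk = f.read()

os.remove(path)
```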

Database Landscape
Scalability & Performance

Memcached
MongoDB

RDBMS

Depth of Functionality

“MongoDB has the best features of key/value stores,
document databases and relational databases in one.”
John Nunemaker

Relational made normalized
data look like this
Category
• Name
• Url

User
• Name
• Email Address

Article
• Name
• Slug
• Publish date
• Text

Tag
• Name
• Url

Comment
• Comment
• Date
• Author

Document databases make
normalized data look like this
Article
• Name
• Slug
• Publish date
• Text
• Author

Comment[]
• Comment
• Date
• Author

Tag[]
• Value

Category[]
• Value

User
• Name
• Email Address
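
To make the document model concrete, here is a small sketch of an article as a single nested structure, written as a plain Python dict. The field names follow the slide; the sample values and the helper function comment_authors are invented for illustration.

```python
from datetime import datetime, timezone

# One article document with its comments, tags and categories embedded,
# instead of spread across separate tables. All values are sample data.
article = {
    "name": "About MongoDB",
    "slug": "about-mongodb",
    "publish_date": datetime(2012, 2, 22, tzinfo=timezone.utc),
    "text": "About MongoDB...",
    "author": "steve",  # refers to a user kept in a separate collection
    "comments": [
        {
            "comment": "Nice overview",
            "date": datetime(2012, 2, 23, tzinfo=timezone.utc),
            "author": "alice",
        },
    ],
    "tags": ["tech", "database"],
    "categories": ["databases"],
}


def comment_authors(doc):
    """Return the authors of all embedded comments, no join required."""
    return [c["author"] for c in doc["comments"]]
```

Reading the comments or tags of an article is then a lookup inside one document rather than a join across several tables.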

CMS / Blog
Needs:
• Business needed modern data store for rapid development and
scale

Solution:
• Use PHP & MongoDB

Results:
• Real time statistics
• All data, images, etc. stored together:
easy access, easy deployment, easy high availability
• No need for complex migrations
• Enabled very rapid development and growth

Photo Meta-Data
Problem:
• Business needed more flexibility than Oracle could deliver

Solution:
• Use MongoDB instead of Oracle

Results:
• Developed application in one sprint cycle
• 500% cost reduction compared to Oracle
• 900% performance improvement compared to Oracle

Customer Analytics
Problem:
• Deal with massive data volume across all customer sites

Solution:
• Use MongoDB to replace Google Analytics / Omniture options

Results:
• Less than one week to build prototype and prove business case
• Rapid deployment of new features

Archiving
Why MongoDB:
• Existing application built on MySQL
• Lots of friction with RDBMS based archive storage
• Needed more scalable archive storage backend
Solution:
• Keep MySQL for active data (100mil)
• MongoDB for archive (2+ billion)
Results:
• No more alter table statements taking over 2 months to run
• Sharding enabled horizontal scale
• Very happily looking at other places to use MongoDB

Online Dictionary
Problem:
• MySQL could not scale to handle their 5B+ documents

Solution:
• Switched from MySQL to MongoDB

Results:
• Massive simplification of code base
• Eliminated need for external caching system
• 20x performance improvement over MySQL

E-commerce
Problem:
• Multi-vertical E-commerce impossible to model (efficiently) in
RDBMS

Solution:
• Switched from MySQL to MongoDB

Results:
• Massive simplification of code base
• Rapidly build, halving time to market (and cost)
• Eliminated need for external caching system
• 50x+ performance improvement over MySQL

Tons more
MongoDB casts a wide net

people keep coming up with
new and brilliant ways to use it

In Good Company

and 1000s more

Applications have
complex needs
Use the best tool for the job
Often more than one tool is needed
MongoDB ideal operational database
MongoDB ideal for BIG data
Not a data processing engine
For heavy processing needs use tool designed
for that job ... Hadoop

MongoDB Map Reduce
MongoDB map reduce quite capable... but with limits
- JavaScript not best language for processing map reduce
- JavaScript limited in external data processing libraries
- Adds load to data store
- Sharded environments do parallel processing

MongoDB
Aggregation
Most uses of MongoDB Map Reduce were for
aggregation
Aggregation Framework optimized for aggregate
queries
Fixes some of the limits of MongoDB MR
- Can do realtime aggregation similar to SQL GroupBy
- parallel processing on sharded clusters

MongoDB Map Reduce
MongoDB Data
Map()
- map iterates on documents
- Document is this
- 1 at time per shard
- emit(k,v)
Group(k)
Sort(k)
Reduce(k,values)
- Input matches output
- Can run multiple times
- k,v
Finalize(k,v)
- k,v
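
The stages on this slide can be walked through with a toy, in-memory sketch. This is not how MongoDB implements map reduce; the function names (map_fn, reduce_fn, finalize_fn, map_reduce) and the sample documents are assumptions chosen to mirror the boxes in the diagram.

```python
from collections import defaultdict


def map_fn(doc):
    # Plays the role of map(): emit one (k, v) pair per tag.
    for tag in doc["tags"]:
        yield tag, {"count": 1}


def reduce_fn(key, values):
    # The output has the same shape as each input value, so the
    # result of one reduce call can be fed into another.
    return {"count": sum(v["count"] for v in values)}


def finalize_fn(key, value):
    # Runs on the reduced value to shape the final result.
    return {"_id": key, "count": value["count"]}


def map_reduce(docs):
    groups = defaultdict(list)
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)              # Group(k)
    results = []
    for k in sorted(groups):                 # Sort(k)
        reduced = reduce_fn(k, groups[k])    # Reduce(k, values)
        results.append(finalize_fn(k, reduced))
    return results


docs = [{"tags": ["tech", "database"]}, {"tags": ["tech"]}]
result = map_reduce(docs)
```

Because reduce_fn returns the same shape it consumes, calling it again on partial results gives the same totals, which is what "Input matches output" and "Can run multiple times" require.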

Hadoop Map Reduce
InputFormat
Map(k1, v1, ctx)
- Many map operations
- 1 at time per input split
ctx.write(k2, v2)
- same as Mongo's emit
Combiner(k2, values2)
- Runs on same thread as map
- similar to Mongo's reducer
- k2, v3
Partitioner(k2)
- similar to Mongo's group
Sort(keys2)
Reducer threads
Reduce(k3, values4)
- Runs once per key
- similar to Mongo's Finalize
Output Format
- kf, vf
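
The extra Hadoop stages, the combiner and the partitioner, can be sketched the same way in plain Python. This is a single-process toy, not Hadoop's API: the function names, the word-count job and the choice of crc32 as the partitioning hash are all assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32


def mapper(key, value):
    # Map(k1, v1): emit (k2, v2) pairs, here one per word.
    for word in value.split():
        yield word, 1


def combiner(key, values):
    # Pre-aggregates map output for one split before it is shuffled.
    yield key, sum(values)


def partition(key, num_reducers):
    # Decides which reducer receives a key (deterministic hash).
    return crc32(key.encode("utf8")) % num_reducers


def reducer(key, values):
    # Called once per key with all values for that key.
    return key, sum(values)


def run_job(splits, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                      # one map task per input split
        local = defaultdict(list)
        for k1, v1 in split:
            for k2, v2 in mapper(k1, v1):
                local[k2].append(v2)
        for k2, values2 in local.items():     # combiner runs on map output
            for k, v3 in combiner(k2, values2):
                partitions[partition(k, num_reducers)][k].append(v3)
    output = {}
    for part in partitions:                   # one reducer per partition
        for k in sorted(part):                # keys are sorted before reduce
            kf, vf = reducer(k, part[k])
            output[kf] = vf
    return output


splits = [
    [(0, "mongo hadoop mongo")],
    [(0, "hadoop mongo")],
]
counts = run_job(splits)
```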

MongoDB & Hadoop
MongoDB
- single server or sharded cluster
(InputFormat)
- Creates a list of Input Splits
- same as Mongo's shard chunks (64mb)
RecordReader
- each split
Map(k1, v1, ctx)
- Many map operations
- 1 at time per input split
ctx.write(k2, v2)
Combiner(k2, values2)
- Runs on same thread as map
- k2, v3
Partitioner(k2)
Sort(k2)
Reducer threads
Reduce(k2, values3)
- Runs once per key
Output Format
- kf, vf
MongoDB

DEMO
Install MongoDB
Install Hadoop & MongoDB Plugin
Import tweets from twitter
Write mapper in Python using Hadoop
streaming
Write reducer in Python using Hadoop
streaming
PROFIT

Installing MongoDB

brew install mongodb
sudo easy_install pip
sudo pip install pymongo

Installing Hadoop

brew install hadoop

Installing Mongo-hadoop
https://gist.github.com/1887726

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh

Grokking Twitter

curl https://stream.twitter.com/1/statuses/sample.json \
  -u<login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours

Map Timezones in Python
#!/usr/bin/env python
# Python 2 syntax, as used in the demo.
import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    # Emit one {'_id': <time zone>, 'count': 1} pair per tweet.
    for doc in documents:
        yield {'_id': doc['user']['time_zone'],
               'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Writing Reducer in Python

import sys

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # Sum the counts emitted by the mapper for one time zone.
    print >> sys.stderr, "Processing Timezone %s" % key
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
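
Before submitting a job, the mapper and reducer logic above can be tried on a few hand-written documents without Hadoop or pymongo_hadoop. The sketch below is an assumption-laden stand-in: run_locally is a hypothetical helper that replaces the shuffle and sort with a local sort and groupby, and the sample tweets are invented.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    # Same logic as the streaming mapper: one count per tweet.
    for doc in documents:
        yield {'_id': doc['user']['time_zone'], 'count': 1}


def reducer(key, values):
    # Same logic as the streaming reducer: sum the counts for a key.
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}


def run_locally(documents):
    # Sort by key so groupby sees each key as one contiguous run.
    # Assumes every key is a string; mixed None/str keys would not sort.
    mapped = sorted(mapper(documents), key=itemgetter('_id'))
    return [reducer(key, list(group))
            for key, group in groupby(mapped, key=itemgetter('_id'))]


sample = [
    {'user': {'time_zone': 'London'}},
    {'user': {'time_zone': 'Quito'}},
    {'user': {'time_zone': 'London'}},
]
results = run_locally(sample)
```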

All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_map.py \
  -reducer examples/twitter/twit_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_map.py \
  -file examples/twitter/twit_reduce.py

Popular time zones
db.twit_reduction.find().sort( {'count' : -1 })

{ "_id" : ObjectId("4f45701903648ee13a565f9f"), "count" : 47912 }
{ "_id" : "Central Time (US & Canada)", "count" : 16374 }
{ "_id" : "Quito", "count" : 13708 }
{ "_id" : "Greenland", "count" : 12332 }
{ "_id" : "Santiago", "count" : 10153 }
{ "_id" : "Eastern Time (US & Canada)", "count" : 8823 }
{ "_id" : "Pacific Time (US & Canada)", "count" : 8530 }
{ "_id" : "Brasilia", "count" : 6621 }
{ "_id" : "London", "count" : 5617 }
{ "_id" : "Mountain Time (US & Canada)", "count" : 4479 }
{ "_id" : "Amsterdam", "count" : 4199 }
{ "_id" : "Hawaii", "count" : 3381 }
{ "_id" : "Tokyo", "count" : 2713 }
{ "_id" : "Alaska", "count" : 2543 }
{ "_id" : "Madrid", "count" : 2118 }
{ "_id" : "Paris", "count" : 1538 }
{ "_id" : "Buenos Aires", "count" : 1247 }
{ "_id" : "Mexico City", "count" : 1104 }
{ "_id" : "Caracas", "count" : 1089 }

Map Hashtags in Python

import sys

from pymongo_hadoop import BSONMapper

def mapper(documents):
    # Emit one {'_id': <hashtag>, 'count': 1} pair per hashtag in each tweet.
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."

Reduce hashtags in Python

import sys

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # Sum the counts emitted by the mapper for one hashtag.
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)

All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_hashtags \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py

Popular Hash Tags
db.twit_hashtags.find().sort( {'count' : -1 })

{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "Bahrain", "count" : 129 }
{ "_id" : "bahrain", "count" : 125 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "BabyKillerOcalan", "count" : 106 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }
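
The list above counts "Bahrain" and "bahrain" (and "teamfollowback" and "TeamFollowBack") as separate hashtags. If case-insensitive totals are wanted, one option is to lowercase the key in the mapper. The sketch below is a variant written for illustration, not the mapper used in the demo; it also skips documents that have no entities field, and it tallies locally with a Counter instead of running under Hadoop.

```python
from collections import Counter


def mapper(documents):
    # Variant mapper: fold hashtags to lower case so that
    # "Bahrain" and "bahrain" share one key.
    for doc in documents:
        # .get() with defaults skips documents without hashtags.
        for hashtag in doc.get('entities', {}).get('hashtags', []):
            yield {'_id': hashtag['text'].lower(), 'count': 1}


# Invented sample documents for a local check.
sample = [
    {'entities': {'hashtags': [{'text': 'Bahrain'}, {'text': 'np'}]}},
    {'entities': {'hashtags': [{'text': 'bahrain'}]}},
    {'text': 'a document without an entities field'},
]

totals = Counter()
for pair in mapper(sample):
    totals[pair['_id']] += pair['count']
```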

Aggregation in Mongo 2.1
db.live.aggregate(
    { $unwind : "$entities.hashtags" },
    { $match : { "entities.hashtags.text" : { $exists : true } } },
    { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
    { $sort : { count : -1 } },
    { $limit : 10 }
)

Popular Hash Tags
db.twit_hashtags.aggregate(a)
{
"result" : [
{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
{ "_id" : "teamfollowback", "count" : 200 },
{ "_id" : "RT", "count" : 150 },
{ "_id" : "Arsenal", "count" : 148 },
{ "_id" : "milars", "count" : 145 },
{ "_id" : "sanremo","count" : 145 },
{ "_id" : "LoseMyNumberIf", "count" : 139 },
{ "_id" : "RelationshipsShould", "count" : 137 },
{ "_id" : "Bahrain", "count" : 129 },
{ "_id" : "bahrain", "count" : 125 }
],"ok" : 1
}
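
To see what each stage of the aggregation pipeline contributes, here is a plain-Python sketch that applies the same steps to an in-memory list of documents. It does not call MongoDB; top_hashtags is a hypothetical helper and the sample documents are invented.

```python
from collections import Counter


def top_hashtags(docs, limit=10):
    counts = Counter()
    for doc in docs:
        # $unwind: visit each element of entities.hashtags separately.
        for hashtag in doc.get("entities", {}).get("hashtags", []):
            # $match with $exists: keep only elements that have a text field.
            if "text" in hashtag:
                # $group with $sum: 1 per occurrence of each text value.
                counts[hashtag["text"]] += 1
    # $sort by count descending, then $limit.
    return [{"_id": text, "count": n} for text, n in counts.most_common(limit)]


docs = [
    {"entities": {"hashtags": [{"text": "mongodb"}, {"text": "hadoop"}]}},
    {"entities": {"hashtags": [{"text": "mongodb"}]}},
    {"entities": {"hashtags": []}},
]
result = top_hashtags(docs, limit=1)
```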

Production usage
Orbitz
Badgeville
foursquare
CityGrid
and more

What is BIG?
BIG today is
normal tomorrow

Google 2000
Google Inc, today announced it
has released the largest search
engine on the Internet.

Google’s new index, comprising
more than 1 billion URLs

Google 2008
Our indexing system for processing
links indicates that
we now count 1 trillion unique URLs

(and the number of individual web
pages out there is growing by
several billion pages per day).

BIG 2012 & Beyond
MongoDB enables us to scale
with the redefinition of BIG.

New processing tools like
Hadoop & Storm are enabling
us to process the new BIG.

MongoDB is
committed to
working with best
data tools including
Storm, Spark, &
more

http://spf13.com
http://github.com/s
@spf13

Question
download at mongodb.org
We’re hiring!! Contact us at jobs@10gen.com
