MongoDB, Hadoop & humongous data

Talking about:
What is Humongous Data
Humongous Data & You
MongoDB & Data processing
Future of Humongous Data
@spf13

AKA Steve Francia

15+ years building the internet

Father, husband, skateboarder

Chief Solutions Architect @
responsible for drivers, integrations, web & docs
What is humongous data?
2000

Google Inc. today announced it has released the largest search engine on the Internet.

Google’s new index, comprising more than 1 billion URLs
2008

Our indexing system for processing links indicates that we now count 1 trillion unique URLs

(and the number of individual web pages out there is growing by several billion pages per day).
An unprecedented amount of data is being created and is accessible.
[Chart: Data Growth, millions of URLs indexed, 2000-2008 — roughly 1, 4, 10, 24, 55, 120, 250, 500, and 1,000]
Truly Exponential Growth

Is hard for people to grasp.

A BBC reporter recently said: "Your current PC is more powerful than the computer they had on board the first flight to the moon."
Moore’s Law

Applies to more than just CPUs.

Boiled down, it says that things double at regular intervals.

It’s exponential growth... and it applies to big data.
How BIG is it?

How BIG is it?
[Diagram: 2008's data volume]

How BIG is it?
[Diagram: 2008's volume compared with 2001-2007 — each year dwarfs the years before it]
Why all this talk about BIG Data now?

In the past few years, open source software has emerged that enables ‘us’ to handle BIG Data.
The Big Data Story

Is actually two stories.

Doers & Tellers talking about different things
http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september

Tellers

Doers

Doers talk a lot more about actual solutions.
They know it’s a two-sided story:

  Storage

  Processing
Takeaways

MongoDB and Hadoop:
MongoDB for storage & operations.
Hadoop for processing & analytics.
MongoDB & Data Processing

Applications have complex needs.

MongoDB is an ideal operational database, and ideal for BIG data.

It is not a data processing engine, but it provides processing functionality.
Many options for Processing Data:

• Process in MongoDB using Map Reduce

• Process in MongoDB using the Aggregation Framework

• Process outside MongoDB (using Hadoop)
MongoDB Map Reduce

[Diagram: MongoDB data → Map() → emit(k,v) → Group(k) → Sort(k) → Reduce(k,values) → Finalize(k,v) → k,v written back to MongoDB]

map iterates on documents, one at a time per shard; the current document is $this.
reduce's input matches its output, so it can run multiple times per key.
MongoDB Map Reduce
MongoDB map reduce is quite capable... but it has limits:
- JavaScript is not the best language for map reduce processing
- JavaScript has limited external data processing libraries
- It adds load to the data store
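A minimal sketch of driving the same map/reduce from Python with PyMongo, counting hashtags in the test.live collection used in the demo below. It assumes a local mongod and a PyMongo 3.x driver (the map_reduce helper was removed in 4.0); the output collection name twit_hashtags_mr is made up for illustration. The map and reduce bodies are still JavaScript, which is exactly the limitation listed above.

from pymongo import MongoClient
from bson.code import Code

db = MongoClient().test   # assumes a local mongod with the test.live collection

# map/reduce bodies are JavaScript, shipped to the server as Code objects
map_js = Code("""
function () {
    if (!this.entities || !this.entities.hashtags) return;   // skip docs without entities
    this.entities.hashtags.forEach(function (h) {
        emit(h.text, 1);
    });
}
""")

reduce_js = Code("""
function (key, values) {
    return Array.sum(values);
}
""")

# PyMongo 3.x: map_reduce returns the output collection
out = db.live.map_reduce(map_js, reduce_js, out="twit_hashtags_mr")
for doc in out.find().sort("value", -1).limit(10):
    print(doc)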
MongoDB Aggregation

Most uses of MongoDB Map Reduce were for aggregation.

The Aggregation Framework is optimized for aggregate queries.

Real-time aggregation, similar to SQL GROUP BY.
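To make the GROUP BY parallel concrete, here is a minimal sketch of the same hashtag count as an aggregation pipeline driven from PyMongo (the full mongo-shell version appears in the demo below). It assumes the test.live collection from the import step and a PyMongo 3.x driver, where aggregate() returns a cursor.

from pymongo import MongoClient

db = MongoClient().test   # assumes the test.live collection from the import step

# Roughly: SELECT hashtag, COUNT(*) ... GROUP BY hashtag ORDER BY count DESC LIMIT 10
pipeline = [
    {"$unwind": "$entities.hashtags"},
    {"$group": {"_id": "$entities.hashtags.text", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for doc in db.live.aggregate(pipeline):
    print(doc)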
MongoDB & Hadoop

[Diagram: data flow through the mongo-hadoop connector]

MongoDB (a single server or a sharded cluster) feeds an InputFormat that creates a list of input splits, the same size as Mongo's shard chunks (64MB). A RecordReader reads each split, and many Map(k1,v1,ctx) operations run, one at a time per input split, each calling ctx.write(k2,v2). A Combiner(k2,values2) runs on the same thread as the map and emits k2,v3. A Partitioner(k2) routes the keys, Sort(keys2) orders them, and Reducer threads run Reduce(k2,values3) once per key. An OutputFormat writes the final kf,vf pairs back to MongoDB.
DEMO TIME

DEMO
Install Hadoop MongoDB Plugin
Import tweets from Twitter
Write mapper in Python using Hadoop streaming
Write reducer in Python using Hadoop streaming
Call myself a data scientist
Installing Mongo-hadoop
https://gist.github.com/1887726

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh
Grokking Twitter

curl https://stream.twitter.com/1/statuses/sample.json \
  -u<login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
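Before wiring up Hadoop, a quick sanity check from Python (a sketch, assuming a local mongod and the test.live collection created by the mongoimport above) confirms the import worked and that documents carry the entities.hashtags field the mapper reads:

from pymongo import MongoClient

db = MongoClient().test   # same test database mongoimport wrote to

# How many tweets did we capture? (count_documents needs PyMongo 3.7+; older drivers use count())
print(db.live.count_documents({}))

# Confirm the field the mapper will read is present on a sample document
print(db.live.find_one({"entities.hashtags": {"$exists": True}})["entities"]["hashtags"])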
DEMO 1
Map Hashtags in Python
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

# Emit one {'_id': hashtag, 'count': 1} document per hashtag in each tweet
def mapper(documents):
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Reduce hashtags in Python
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

# Sum the counts emitted by the mapper for a single hashtag key
def reducer(key, values):
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)
All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py
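Once the job finishes, the output can be inspected from Python as well; a minimal sketch, assuming the test.twit_reduction collection named in the -outputURI flag above:

from pymongo import MongoClient

# Inspect the streaming job's output; the collection name comes from -outputURI
out = MongoClient().test.twit_reduction
for doc in out.find().sort("count", -1).limit(5):
    print(doc)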
Popular Hash Tags
db.twit_hashtags.find().sort({'count' : -1})

{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "Bahrain", "count" : 129 }
{ "_id" : "bahrain", "count" : 125 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "BabyKillerOcalan", "count" : 106 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }
DEMO 2
Aggregation in Mongo 2.1

db.live.aggregate(
    { $unwind : "$entities.hashtags" },
    { $match :
        { "entities.hashtags.text" :
            { $exists : true } } },
    { $group :
        { _id : "$entities.hashtags.text",
          count : { $sum : 1 } } },
    { $sort : { count : -1 } },
    { $limit : 10 }
)
Popular Hash Tags

db.twit_hashtags.aggregate(...)
{
    "result" : [
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
        { "_id" : "teamfollowback", "count" : 200 },
        { "_id" : "RT", "count" : 150 },
        { "_id" : "Arsenal", "count" : 148 },
        { "_id" : "milars", "count" : 145 },
        { "_id" : "sanremo", "count" : 145 },
        { "_id" : "LoseMyNumberIf", "count" : 139 },
        { "_id" : "RelationshipsShould", "count" : 137 },
        { "_id" : "Bahrain", "count" : 129 },
        { "_id" : "bahrain", "count" : 125 }
    ],
    "ok" : 1
}
The Future of humongous data

What is BIG?

BIG today is normal tomorrow.
[Chart: Data Growth, millions of URLs indexed, 2000-2011 — the earlier series continued: 1, 4, 10, 24, 55, 120, 250, 500, 1,000, 2,150, 4,400, 9,000]
2012

Generating over 250 million tweets per day
MongoDB enables us to scale with the redefinition of BIG.

New processing tools like Hadoop & Storm are enabling us to process the new BIG.
Hadoop is our first step.

MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more.
http://spf13.com
http://github.com/spf13
@spf13

Questions?

download at github.com/mongodb/mongo-hadoop