MongoDB & Hadoop
Talking about
MongoDB Intro & Fundamentals
Why MongoDB & Hadoop
Getting Started
Using MongoDB & Hadoop
Future of Big Data
Steve                  @sp

                     A
15+ years building the internet
Father, husband, skateboarder
Chief Solutions Architect @
responsible for drivers,
integrations, web & docs
Company behind MongoDB
Offices in NYC, Palo Alto, London & Dublin
100+ employees
Support, consulting, training
Mgt: Google/DoubleClick, Oracle, Apple, NetApp, Mark Logic

Well Funded: Sequoia, Union Square, Flybridge
Introduction to MongoDB
MongoDB
Application
Document Oriented
  { author : "steve",
    date : new Date(),
    text : "About MongoDB...",
    tags : ["tech", "database"] }
High Performance
Fully Consistent
Horizontally Scalable
MongoDB philosophy
 Keep functionality when we can (key/value
 stores are great, but we need more)
 Non-relational (no joins) makes scaling
 horizontally practical
 Document data models are good
 Database technology should run anywhere
 virtualized, cloud, metal, etc
Under the hood
Written in C++
Runs nearly everywhere
Data serialized to BSON
Extensive use of memory-mapped files
i.e. read-through write-through
memory caching.
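As a rough illustration of what a memory-mapped file is, the sketch below writes bytes through an in-memory view and then reads them back with an ordinary file read. It uses only Python's standard `mmap` module; the helper name `mmap_roundtrip` is made up for this example, and nothing here is meant to reflect MongoDB's actual storage code.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload: bytes) -> bytes:
    """Write a non-empty payload through a memory map, then read the file back."""
    fd, path = tempfile.mkstemp()
    try:
        # A file needs a non-zero size before it can be mapped.
        os.write(fd, b"\x00" * len(payload))
        with mmap.mmap(fd, len(payload)) as view:
            view[:] = payload   # the write goes to the mapped memory
            view.flush()        # ask the OS to persist the dirty pages
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))  # a plain file read sees the same bytes
    finally:
        os.close(fd)
        os.remove(path)


result = mmap_roundtrip(b"hello mongodb")
print(result)
```

The operating system decides when mapped pages are loaded and written back, which is what lets a database lean on it for caching.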
Database Landscape
[Chart: Scalability & Performance (vertical axis) against Depth of Functionality (horizontal axis), plotting Memcached, MongoDB and RDBMS]
“MongoDB has the best features of key/value stores, document databases and relational databases in one.”
John Nunemaker
Relational made normalized data look like this
Category
• Name
• Url
User
• Name
• Email Address
Article
• Name
• Slug
• Publish date
• Text
Tag
• Name
• Url
Comment
• Comment
• Date
• Author
Document databases make normalized data look like this
User
• Name
• Email Address
Article
• Name
• Slug
• Publish date
• Text
• Author
  Comment[]
  • Comment
  • Date
  • Author
  Tag[]
  • Value
  Category[]
  • Value
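The document model above can be sketched as a plain Python dictionary, with comments, tags and categories embedded in the article. All field values and the helper `comment_authors` are hypothetical and only for illustration.

```python
import json

# Hypothetical article document; every value here is made up for illustration.
article = {
    "name": "About MongoDB",
    "slug": "about-mongodb",
    "publish_date": "2012-02-22",
    "text": "About MongoDB...",
    "author": "steve",
    "comments": [
        {"comment": "Nice intro", "date": "2012-02-23", "author": "alice"},
    ],
    "tags": ["tech", "database"],
    "categories": ["databases"],
}


def comment_authors(doc):
    """Return the authors of the embedded comments; no join is needed."""
    return [c["author"] for c in doc.get("comments", [])]


print(json.dumps(article, indent=2))
print(comment_authors(article))
```

Everything needed to render the article travels in one document, which is the trade the slide is pointing at.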
MongoDB
Use Cases
CMS / Blog
Needs:
• Business needed modern data store for rapid development and
  scale

Solution:
• Use PHP & MongoDB

Results:
• Real time statistics
• All data, images, etc stored together
  easy access, easy deployment, easy high availability
• No need for complex migrations
• Enabled very rapid development and growth
Photo Meta-Data
Problem:
• Business needed more flexibility than Oracle could deliver

Solution:
• Use MongoDB instead of Oracle

Results:
• Developed application in one sprint cycle
• 500% cost reduction compared to Oracle
• 900% performance improvement compared to Oracle
Customer Analytics
Problem:
• Deal with massive data volume across all customer sites

Solution:
• Use MongoDB to replace Google Analytics / Omniture options

Results:
• Less than one week to build prototype and prove business case
• Rapid deployment of new features
Archiving
Why MongoDB:
• Existing application built on MySQL
• Lots of friction with RDBMS based archive storage
• Needed more scalable archive storage backend
Solution:
• Keep MySQL for active data (100mil)
• MongoDB for archive (2+ billion)
Results:
• No more alter table statements taking over 2 months to run
• Sharding enabled horizontal scale
• Very happily looking at other places to use MongoDB
Online Dictionary
Problem:
• MySQL could not scale to handle their 5B+ documents

Solution:
• Switched from MySQL to MongoDB

Results:
• Massive simplification of code base
• Eliminated need for external caching system
• 20x performance improvement over MySQL
E-commerce
Problem:
• Multi-vertical E-commerce impossible to model (efficiently) in
  RDBMS

Solution:
• Switched from MySQL to MongoDB

Results:
•   Massive simplification of code base
•   Rapidly build, halving time to market (and cost)
•   Eliminated need for external caching system
•   50x+ performance improvement over MySQL
Tons more
MongoDB casts a wide net
People keep coming up with new and brilliant ways to use it
In Good Company




   and 1000s more
Why
MongoDB
& Hadoop
Applications have complex needs
Use the best tool for the job
Often more than one tool is needed
MongoDB ideal operational database
MongoDB ideal for BIG data
Not a data processing engine
For heavy processing needs use tool designed
for that job ... Hadoop
MongoDB Map Reduce
MongoDB map reduce is quite capable... but with limits
- JavaScript is not the best language for processing map reduce
- JavaScript is limited in external data processing libraries
- Adds load to the data store
- Sharded environments do parallel processing
MongoDB Aggregation
Most uses of MongoDB Map Reduce were for aggregation
Aggregation Framework is optimized for aggregate queries
Fixes some of the limits of MongoDB MR
- Can do realtime aggregation similar to SQL GROUP BY
- Parallel processing on sharded clusters
MongoDB Map Reduce
MongoDB Data
Map() -> emit(k,v)
  map iterates on documents
  Document is this
  1 at a time per shard
Group(k)
Sort(k)
Reduce(k,values) -> k,v
  Input matches output
  Can run multiple times
Finalize(k,v) -> k,v
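The flow above (map, group, sort, reduce, finalize) can be sketched in a few lines of plain Python. This is an in-memory simulation under the assumption that documents are dictionaries; `map_reduce`, `map_tags` and `sum_counts` are hypothetical names, not MongoDB APIs.

```python
from collections import defaultdict


def map_reduce(documents, map_fn, reduce_fn, finalize_fn=None):
    """Run map -> group -> sort -> reduce -> finalize over in-memory documents."""
    grouped = defaultdict(list)
    for doc in documents:                   # map iterates on documents
        for key, value in map_fn(doc):      # each pair plays the role of emit(k, v)
            grouped[key].append(value)      # Group(k)
    results = {}
    for key in sorted(grouped):             # Sort(k)
        reduced = reduce_fn(key, grouped[key])
        # Reduce output has the same shape as its input values, so it can be
        # fed back in; reducing a second time must not change the answer.
        reduced = reduce_fn(key, [reduced])
        results[key] = finalize_fn(key, reduced) if finalize_fn else reduced
    return results


def map_tags(doc):
    """Emit (tag, 1) for every tag on the document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def sum_counts(key, values):
    """Add up partial counts for one key."""
    return sum(values)


docs = [
    {"tags": ["tech", "database"]},
    {"tags": ["database"]},
    {"tags": []},
]
counts = map_reduce(docs, map_tags, sum_counts)
print(counts)
```

The deliberate second call to the reducer is there to show why "input matches output" matters: a reducer that cannot consume its own output would break when run multiple times.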
Hadoop Map Reduce
InputFormat
  Many map operations, 1 at a time per input split
Map(k1, v1, ctx)
  ctx.write(k2, v2): same as Mongo's emit
Combiner(k2, values2) -> k2, v3
  Runs on same thread as map
  Similar to Mongo's reducer
Partitioner(k2)
  Similar to Mongo's group
Sort(keys2)
Reduce(k3, values4) -> kf, vf
  Reducer threads
  Runs once per key
Output Format
  Similar to Mongo's Finalize
MongoDB & Hadoop
MongoDB (single server or sharded cluster)
(InputFormat) creates a list of Input Splits
  Same as Mongo's shard chunks (64mb)
RecordReader reads each split
  Many map operations, 1 at a time per input split
Map(k1, v1, ctx)
  ctx.write(k2, v2)
Combiner(k2, values2) -> k2, v3
  Runs on same thread as map
Partitioner(k2)
Sort(keys2)
Reduce(k2, values3) -> kf, vf
  Reducer threads
  Runs once per key
Output Format
MongoDB
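To make the stages above concrete, here is a single-process sketch of input splits, a combiner that runs right after each split's map phase, a partitioner and per-key reducers. It is a toy under stated assumptions (string keys, tiny fixed-size splits, a CRC-based partitioner); `make_splits`, `run_job` and the sample tweets are hypothetical, and none of this is the mongo-hadoop or Hadoop API.

```python
import zlib
from collections import defaultdict
from itertools import islice


def make_splits(records, split_size):
    """Cut the input into fixed-size splits (a stand-in for InputFormat)."""
    it = iter(records)
    while True:
        split = list(islice(it, split_size))
        if not split:
            return
        yield split


def run_job(records, mapper, combiner, reducer, split_size=2, num_reducers=2):
    """Simulate map -> combine -> partition -> sort -> reduce in one process."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in make_splits(records, split_size):
        local = defaultdict(list)
        for record in split:                      # one map call per record
            for k2, v2 in mapper(record):         # like ctx.write(k2, v2)
                local[k2].append(v2)
        for k2, values2 in local.items():
            v3 = combiner(k2, values2)            # pre-aggregate within the split
            # Partitioner: the same key always lands on the same reducer.
            index = zlib.crc32(k2.encode("utf8")) % num_reducers
            partitions[index][k2].append(v3)
    output = {}
    for partition in partitions:
        for k2 in sorted(partition):              # sort keys within a partition
            output[k2] = reducer(k2, partition[k2])   # runs once per key
    return output


def tz_mapper(tweet):
    yield tweet["user"]["time_zone"], 1


def total(key, values):
    return sum(values)


tweets = [{"user": {"time_zone": tz}}
          for tz in ["London", "Quito", "London", "Tokyo", "London"]]
totals = run_job(tweets, tz_mapper, total, total)
print(totals)
```

Because the combiner and reducer are the same summing function, pre-aggregating per split changes how much data moves between stages but not the final counts.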
DEMO
Install MongoDB
Install Hadoop & MongoDB Plugin
Import tweets from Twitter
Write mapper in Python using Hadoop streaming
Write reducer in Python using Hadoop streaming
PROFIT
Installing MongoDB



brew install mongodb
sudo easy_install pip
sudo pip install pymongo
Installing Hadoop




brew install hadoop
Installing Mongo-hadoop
https://gist.github.com/1887726

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh
Grokking Twitter

curl https://stream.twitter.com/1/statuses/sample.json \
  -u<login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
Map Timezones in Python
#!/usr/bin/env python
import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    # Emit one {'_id': <time zone>, 'count': 1} pair per tweet.
    for doc in documents:
        yield {'_id': doc['user']['time_zone'], 'count': 1}

BSONMapper(mapper)
# sys.stderr.write works under both Python 2 and Python 3.
sys.stderr.write("Done Mapping.\n")
Writing Reducer in Python
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # Sum the per-tweet counts emitted by the mapper for one time zone.
    sys.stderr.write("Processing Timezone %s\n" % key)
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key, 'count': _count}

BSONReducer(reducer)
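Before submitting the job to Hadoop, it can help to check the mapper and reducer logic on a handful of documents. The sketch below copies the two functions from the slides and drives them with `itertools.groupby` in place of the streaming framework; `run_locally` and the sample tweets are hypothetical, and the harness assumes every time zone is a string so the pairs can be sorted.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    """Same logic as the streaming mapper: one count per tweet's time zone."""
    for doc in documents:
        yield {"_id": doc["user"]["time_zone"], "count": 1}


def reducer(key, values):
    """Same logic as the streaming reducer: sum the counts for one key."""
    _count = 0
    for v in values:
        _count += v["count"]
    return {"_id": key, "count": _count}


def run_locally(documents):
    """Sort mapped pairs by key, then call the reducer once per key."""
    mapped = sorted(mapper(documents), key=itemgetter("_id"))
    return [reducer(key, group)
            for key, group in groupby(mapped, key=itemgetter("_id"))]


sample_tweets = [
    {"user": {"time_zone": "Quito"}},
    {"user": {"time_zone": "London"}},
    {"user": {"time_zone": "Quito"}},
]
local_result = run_locally(sample_tweets)
print(local_result)
```

Sorting before grouping mirrors what the framework does between the map and reduce phases, so the reducer sees all values for a key together.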
All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_map.py \
  -reducer examples/twitter/twit_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_map.py \
  -file examples/twitter/twit_reduce.py
Popular time zones
db.twit_reduction.find().sort( {'count' : -1 })

{   "_id"   :   ObjectId("4f45701903648ee13a565f9f"), "count" : 47912 }
{   "_id"   :   "Central Time (US & Canada)", "count" : 16374 }
{   "_id"   :   "Quito", "count" : 13708 }
{   "_id"   :   "Greenland", "count" : 12332 }
{   "_id"   :   "Santiago", "count" : 10153 }
{   "_id"   :   "Eastern Time (US & Canada)", "count" : 8823 }
{   "_id"   :   "Pacific Time (US & Canada)", "count" : 8530 }
{   "_id"   :   "Brasilia", "count" : 6621 }
{   "_id"   :   "London", "count" : 5617 }
{   "_id"   :   "Mountain Time (US & Canada)", "count" : 4479 }
{   "_id"   :   "Amsterdam", "count" : 4199 }
{   "_id"   :   "Hawaii", "count" : 3381 }
{   "_id"   :   "Tokyo", "count" : 2713 }
{   "_id"   :   "Alaska", "count" : 2543 }
{   "_id"   :   "Madrid", "count" : 2118 }
{   "_id"   :   "Paris", "count" : 1538 }
{   "_id"   :   "Buenos Aires", "count" : 1247 }
{   "_id"   :   "Mexico City", "count" : 1104 }
{   "_id"   :   "Caracas", "count" : 1089 }
DEMO 2
Map Hashtags in Python
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    # Emit one {'_id': <hashtag text>, 'count': 1} pair per hashtag in each tweet.
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
# sys.stderr.write works under both Python 2 and Python 3.
sys.stderr.write("Done Mapping.\n")
Reduce hashtags in Python
#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    # Sum the per-hashtag counts emitted by the mapper.
    sys.stderr.write("Hashtag %s\n" % key.encode('utf8'))
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)
All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_hashtags \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py
Popular Hash Tags
db.twit_hashtags.find().sort( {'count' : -1 })

{   "_id"   :   "YouKnowYoureInLoveIf", "count" : 287 }
{   "_id"   :   "teamfollowback", "count" : 200 }
{   "_id"   :   "RT", "count" : 150 }
{   "_id"   :   "Arsenal", "count" : 148 }
{   "_id"   :   "milars", "count" : 145 }
{   "_id"   :   "sanremo", "count" : 145 }
{   "_id"   :   "LoseMyNumberIf", "count" : 139 }
{   "_id"   :   "RelationshipsShould", "count" : 137 }
{   "_id"   :   "Bahrain", "count" : 129 }
{   "_id"   :   "bahrain", "count" : 125 }
{   "_id"   :   "oomf", "count" : 117 }
{   "_id"   :   "BabyKillerOcalan", "count" : 106 }
{   "_id"   :   "TeamFollowBack", "count" : 105 }
{   "_id"   :   "WhyDoPeopleThink", "count" : 102 }
{   "_id"   :   "np", "count" : 100 }
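The output above counts "Bahrain" and "bahrain", and "teamfollowback" and "TeamFollowBack", as separate keys, because the mapper emits hashtag text as-is. If case-insensitive totals are wanted, one option is to merge the reduced rows afterwards. The sketch below does that for four rows taken from the results above; `merge_case_variants` is a hypothetical helper, and lowercasing in the mapper would be an equally valid choice.

```python
from collections import defaultdict

# Four rows copied from the results above.
rows = [
    {"_id": "teamfollowback", "count": 200},
    {"_id": "Bahrain", "count": 129},
    {"_id": "bahrain", "count": 125},
    {"_id": "TeamFollowBack", "count": 105},
]


def merge_case_variants(rows):
    """Add up counts for hashtags that differ only by letter case."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["_id"].lower()] += row["count"]
    merged = [{"_id": tag, "count": count} for tag, count in totals.items()]
    return sorted(merged, key=lambda row: row["count"], reverse=True)


merged = merge_case_variants(rows)
print(merged)
```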
DEMO 3
Aggregation in Mongo 2.1
db.live.aggregate(
    { $unwind : "$entities.hashtags" },
    { $match :
        { "entities.hashtags.text" :
            { $exists : true } } },
    { $group :
        { _id : "$entities.hashtags.text",
          count : { $sum : 1 } } },
    { $sort : { count : -1 } },
    { $limit : 10 }
)
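For readers new to the pipeline stages, the sketch below mirrors each one in plain Python over in-memory dictionaries. It is only an approximation meant to show what every stage contributes; `top_hashtags` and the sample documents are hypothetical, and it does not call MongoDB.

```python
from collections import Counter


def top_hashtags(documents, limit=10):
    """Emulate $unwind -> $match -> $group -> $sort -> $limit in memory."""
    counts = Counter()
    for doc in documents:
        # $unwind: one item per array element; missing or empty arrays yield nothing.
        for hashtag in doc.get("entities", {}).get("hashtags", []):
            if "text" in hashtag:                 # $match with $exists : true
                counts[hashtag["text"]] += 1      # $group with $sum : 1
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)  # $sort
    return [{"_id": text, "count": n} for text, n in ranked[:limit]]     # $limit


sample = [
    {"entities": {"hashtags": [{"text": "np"}, {"text": "RT"}]}},
    {"entities": {"hashtags": [{"text": "RT"}]}},
    {"entities": {"hashtags": []}},
    {},
]
top = top_hashtags(sample)
print(top)
```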
Popular Hash Tags
db.twit_hashtags.aggregate(a)
{
    "result" : [
       { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
       { "_id" : "teamfollowback", "count" : 200 },
       { "_id" : "RT", "count" : 150 },
       { "_id" : "Arsenal", "count" : 148 },
       { "_id" : "milars", "count" : 145 },
       { "_id" : "sanremo", "count" : 145 },
       { "_id" : "LoseMyNumberIf", "count" : 139 },
       { "_id" : "RelationshipsShould", "count" : 137 },
       { "_id" : "Bahrain", "count" : 129 },
       { "_id" : "bahrain", "count" : 125 }
    ],
    "ok" : 1
}
Using MongoDB & Hadoop
Production usage
Orbitz
Badgeville
foursquare
CityGrid
             and more
The Future of BIG data
What is BIG?
  BIG today is
normal tomorrow
Google 2000
Google Inc, today announced it
has released the largest search
engine on the Internet.

Google’s new index, comprising
more than 1 billion URLs
Google 2008
Our indexing system for processing
links indicates that
we now count 1 trillion unique URLs

(and the number of individual web
pages out there is growing by
several billion pages per day).
BIG 2012 & Beyond
MongoDB enables us to scale
with the redefinition of BIG.

New processing tools like
Hadoop & Storm are enabling
us to process the new BIG.
Hadoop is our first step
MongoDB is committed to working with the best data tools, including Storm, Spark, & more
http://spf13.com
                           http://github.com/s
                           @spf13




Question
    download at mongodb.org
We’re hiring!! Contact us at jobs@10gen.com
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 

MongoDB and Hadoop

  • 1. MongoDB & Hadoop
  • 2. Talking about: MongoDB Intro & Fundamentals; Why MongoDB & Hadoop; Getting Started; Using MongoDB & Hadoop; Future of Big Data
  • 3. Steve @sp A 15+ years building the internet Father, husband, skateboarder Chief Solutions Architect @ responsible for drivers, integrations, web & docs
  • 4. Company behind MongoDB Offices in NYC, Palo Alto, London & Dublin 100+ employees Support, consulting, training Mgt: Google/DoubleClick, Oracle, Apple, NetApp, Mark Logic Well Funded: Sequoia, Union Square, Flybridge
  • 5. Introduction to MongoDB
  • 6. MongoDB. Application. Document Oriented: { author : "steve", date : new Date(), text : "About MongoDB...", tags : ["tech", "database"] }. High Performance. Fully Consistent. Horizontally Scalable.
  • 7. MongoDB philosophy: Keep functionality when we can (key/value stores are great, but we need more). Non-relational (no joins) makes scaling horizontally practical. Document data models are good. Database technology should run anywhere: virtualized, cloud, metal, etc.
  • 8. Under the hood: Written in C++. Runs nearly everywhere. Data serialized to BSON. Extensive use of memory-mapped files, i.e. read-through write-through memory caching.
  • 9. Database Landscape (chart: Scalability & Performance against Depth of Functionality, positioning Memcached, MongoDB, and RDBMS)
  • 10. “MongoDB has the best features of key/value stores, document databases and relational databases in one.” John Nunemaker
  • 11. Relational made normalized data look like this: Category (Name, Url); Article (Name, Slug, Publish date, Text); User (Name, Email Address); Tag (Name, Url); Comment (Comment, Date, Author)
  • 12. Document databases make normalized data look like this: Article (Name, Slug, Publish date, Text, Author, Comment[] (Comment, Date, Author), Tag[] (Value), Category[] (Value)); User (Name, Email Address)
  • 14. CMS / Blog Needs: • Business needed modern data store for rapid development and scale Solution: • Use PHP & MongoDB Results: • Real time statistics • All data, images, etc stored together easy access, easy deployment, easy high availability • No need for complex migrations • Enabled very rapid development and growth
  • 15. Photo Meta-Data Problem: • Business needed more flexibility than Oracle could deliver Solution: • Use MongoDB instead of Oracle Results: • Developed application in one sprint cycle • 500% cost reduction compared to Oracle • 900% performance improvement compared to Oracle
  • 16. Customer Analytics Problem: • Deal with massive data volume across all customer sites Solution: • Use MongoDB to replace Google Analytics / Omniture options Results: • Less than one week to build prototype and prove business case • Rapid deployment of new features
  • 17. Archiving Why MongoDB: • Existing application built on MySQL • Lots of friction with RDBMS based archive storage • Needed more scalable archive storage backend Solution: • Keep MySQL for active data (100mil) • MongoDB for archive (2+ billion) Results: • No more alter table statements taking over 2 months to run • Sharding enabled horizontal scale • Very happily looking at other places to use MongoDB
  • 18. Online Dictionary Problem: • MySQL could not scale to handle their 5B+ documents Solution: • Switched from MySQL to MongoDB Results: • Massive simplification of code base • Eliminated need for external caching system • 20x performance improvement over MySQL
  • 19. E-commerce Problem: • Multi-vertical E-commerce impossible to model (efficiently) in RDBMS Solution: • Switched from MySQL to MongoDB Results: • Massive simplification of code base • Rapidly build, halving time to market (and cost) • Eliminated need for external caching system • 50x+ performance improvement over MySQL
  • 20. Tons more: MongoDB casts a wide net; people keep coming up with new and brilliant ways to use it
  • 21. In Good Company and 1000s more
  • 23. Applications have complex needs. Use the best tool for the job; often more than one tool is needed. MongoDB is an ideal operational database and ideal for BIG data, but it is not a data processing engine. For heavy processing needs, use a tool designed for that job ... Hadoop
  • 24. MongoDB Map Reduce: MongoDB map reduce is quite capable... but with limits: JavaScript is not the best language for processing map reduce; JavaScript is limited in external data processing libraries; it adds load to the data store; sharded environments do parallel processing
  • 25. MongoDB Aggregation: Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries and fixes some of the limits of MongoDB MR: it can do realtime aggregation similar to SQL GROUP BY, with parallel processing on sharded clusters
  • 26. MongoDB Map Reduce (flow diagram): MongoDB Data -> Map(): map iterates on documents, document is $this, 1 at time per shard, calls emit(k,v) -> Group(k) -> Sort(k) -> Reduce(k,values): input matches output, can run multiple times, yields k,v -> Finalize(k,v) -> k,v
  • 27. Hadoop Map Reduce (flow diagram): InputFormat -> Map(k1, v1, ctx): many map operations, 1 at time per input split; ctx.write(k2,v2) is the same as Mongo's emit -> Combiner(k2,values2): runs on same thread as map, similar to Mongo's reducer, yields k2, v3 -> Partitioner(k2): similar to Mongo's group -> Sort(keys2) -> Reducer threads: Reduce(k3,values4), runs once per key, similar to Mongo's Finalize -> Output Format: kf,vf
  • 28. MongoDB & Hadoop (flow diagram): MongoDB (single server or sharded cluster) -> InputFormat creates a list of Input Splits, same as Mongo's shard chunks (64mb) -> RecordReader -> for each split: Map(k1, v1, ctx), many map operations, 1 at time per input split, ctx.write(k2,v2) -> Combiner(k2,values2), runs on same thread as map, yields k2, v3 -> Partitioner(k2) -> Sort(k2) -> Reducer threads: Reduce(k2,values3), runs once per key -> Output Format: kf,vf -> MongoDB
  • 29. DEMO
  • 30. DEMO: Install MongoDB; Install Hadoop & MongoDB Plugin; Import tweets from Twitter; Write mapper in Python using Hadoop streaming; Write reducer in Python using Hadoop streaming; PROFIT
  • 31. Installing MongoDB
        brew install mongodb
        sudo easy_install pip
        sudo pip install pymongo
  • 33. Installing Mongo-hadoop (https://gist.github.com/1887726)
        hadoop_version='0.23'
        hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
        git clone git://github.com/mongodb/mongo-hadoop.git
        cd mongo-hadoop
        sed -i '' "s/default/$hadoop_version/g" build.sbt
        cd streaming
        ./build.sh
  • 34. Grokking Twitter
        curl https://stream.twitter.com/1/statuses/sample.json -u<login>:<password> | mongoimport -d test -c live
        ... let it run for about 2 hours
  • 35. Map Timezones in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONMapper

        def mapper(documents):
            for doc in documents:
                yield {'_id': doc['user']['time_zone'], 'count': 1}

        BSONMapper(mapper)
        print >> sys.stderr, "Done Mapping."
  • 36. Writing Reducer in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONReducer

        def reducer(key, values):
            print >> sys.stderr, "Processing Timezone %s" % key
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key, 'count': _count}

        BSONReducer(reducer)
  • 37. All together
        hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
          -mapper examples/twitter/twit_map.py \
          -reducer examples/twitter/twit_reduce.py \
          -inputURI mongodb://127.0.0.1/test.live \
          -outputURI mongodb://127.0.0.1/test.twit_reduction \
          -file examples/twitter/twit_map.py \
          -file examples/twitter/twit_reduce.py
  • 38. Popular time zones
        db.twit_reduction.find().sort( {'count' : -1 })
        { "_id" : ObjectId("4f45701903648ee13a565f9f"), "count" : 47912 }
        { "_id" : "Central Time (US & Canada)", "count" : 16374 }
        { "_id" : "Quito", "count" : 13708 }
        { "_id" : "Greenland", "count" : 12332 }
        { "_id" : "Santiago", "count" : 10153 }
        { "_id" : "Eastern Time (US & Canada)", "count" : 8823 }
        { "_id" : "Pacific Time (US & Canada)", "count" : 8530 }
        { "_id" : "Brasilia", "count" : 6621 }
        { "_id" : "London", "count" : 5617 }
        { "_id" : "Mountain Time (US & Canada)", "count" : 4479 }
        { "_id" : "Amsterdam", "count" : 4199 }
        { "_id" : "Hawaii", "count" : 3381 }
        { "_id" : "Tokyo", "count" : 2713 }
        { "_id" : "Alaska", "count" : 2543 }
        { "_id" : "Madrid", "count" : 2118 }
        { "_id" : "Paris", "count" : 1538 }
        { "_id" : "Buenos Aires", "count" : 1247 }
        { "_id" : "Mexico City", "count" : 1104 }
        { "_id" : "Caracas", "count" : 1089 }
  • 40. Map Hashtags in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONMapper

        def mapper(documents):
            for doc in documents:
                for hashtag in doc['entities']['hashtags']:
                    yield {'_id': hashtag['text'], 'count': 1}

        BSONMapper(mapper)
        print >> sys.stderr, "Done Mapping."
  • 41. Reduce hashtags in Python
        #!/usr/bin/env python
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONReducer

        def reducer(key, values):
            print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key.encode('utf8'), 'count': _count}

        BSONReducer(reducer)
  • 42. All together
        hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
          -mapper examples/twitter/twit_hashtag_map.py \
          -reducer examples/twitter/twit_hashtag_reduce.py \
          -inputURI mongodb://127.0.0.1/test.live \
          -outputURI mongodb://127.0.0.1/test.twit_reduction \
          -file examples/twitter/twit_hashtag_map.py \
          -file examples/twitter/twit_hashtag_reduce.py
  • 43. Popular Hash Tags
        db.twit_hashtags.find().sort( {'count' : -1 })
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
        { "_id" : "teamfollowback", "count" : 200 }
        { "_id" : "RT", "count" : 150 }
        { "_id" : "Arsenal", "count" : 148 }
        { "_id" : "milars", "count" : 145 }
        { "_id" : "sanremo", "count" : 145 }
        { "_id" : "LoseMyNumberIf", "count" : 139 }
        { "_id" : "RelationshipsShould", "count" : 137 }
        { "_id" : "Bahrain", "count" : 129 }
        { "_id" : "bahrain", "count" : 125 }
        { "_id" : "oomf", "count" : 117 }
        { "_id" : "BabyKillerOcalan", "count" : 106 }
        { "_id" : "TeamFollowBack", "count" : 105 }
        { "_id" : "WhyDoPeopleThink", "count" : 102 }
        { "_id" : "np", "count" : 100 }
  • 45. Aggregation in Mongo 2.1
        db.live.aggregate(
            { $unwind : "$entities.hashtags" },
            { $match : { "entities.hashtags.text" : { $exists : true } } },
            { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
            { $sort : { count : -1 } },
            { $limit : 10 }
        )
  • 46. Popular Hash Tags
        db.twit_hashtags.aggregate(a)
        { "result" : [
            { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
            { "_id" : "teamfollowback", "count" : 200 },
            { "_id" : "RT", "count" : 150 },
            { "_id" : "Arsenal", "count" : 148 },
            { "_id" : "milars", "count" : 145 },
            { "_id" : "sanremo", "count" : 145 },
            { "_id" : "LoseMyNumberIf", "count" : 139 },
            { "_id" : "RelationshipsShould", "count" : 137 },
            { "_id" : "Bahrain", "count" : 129 },
            { "_id" : "bahrain", "count" : 125 }
          ], "ok" : 1 }
  • 49. The Future of BIG data
  • 50. What is BIG? BIG today is normal tomorrow
  • 51. Google 2000 Google Inc, today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs
  • 52. Google 2008 Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
  • 53. BIG 2012 & Beyond MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
  • 54. Hadoop is our first step
  • 55. MongoDB is committed to working with best data tools including Storm, Spark, & more
  • 56. http://spf13.com http://github.com/s @spf13 Question download at mongodb.org We’re hiring!! Contact us at jobs@10gen.com
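
The Hadoop stages named on slide 27 (map, combiner, partitioner, sort, reduce) can be sketched in plain Python. This is a single-process illustration only, not how Hadoop itself is implemented: the record shape (a `tz` field), the helper names, and the byte-sum partitioner are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(split):
    """Map: emit one (key, 1) pair per record in an input split."""
    for record in split:
        yield record["tz"], 1


def combine(pairs):
    """Combiner: pre-sum counts on the map side, one total per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition(key, num_reducers):
    """Partitioner: send every occurrence of a key to the same reducer.

    A toy deterministic hash (sum of UTF-8 bytes), chosen for the sketch.
    """
    return sum(key.encode("utf8")) % num_reducers


def run_job(splits, num_reducers=2):
    # One bucket per (hypothetical) reducer.
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:
        for key, value in combine(map_phase(split)):
            buckets[partition(key, num_reducers)].append((key, value))

    results = {}
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))            # Sort(keys)
        for key, group in groupby(bucket, key=itemgetter(0)):
            results[key] = sum(v for _, v in group)  # Reduce, once per key
    return results


# Hypothetical input: two splits of time-zone records.
splits = [
    [{"tz": "Quito"}, {"tz": "London"}, {"tz": "Quito"}],
    [{"tz": "London"}, {"tz": "Tokyo"}],
]
counts = run_job(splits)
print(counts)
```

The combiner shrinks each split's output before it is partitioned, which is why the slide compares it to a reducer that runs on the map side.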
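
The hashtag mapper and reducer on slides 40 and 41 can be dry-run without Hadoop or `pymongo_hadoop`. The sketch below is a Python 3 approximation under stated assumptions: `run_locally` and the sample tweets are hypothetical, records without an `entities` field are skipped rather than raising `KeyError`, and the mapper lowercases each tag. Lowercasing is not in the original mapper; it is added here because slide 43 lists "Bahrain" and "bahrain" (and "teamfollowback" and "TeamFollowBack") as separate rows.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    """Emit one {'_id': tag, 'count': 1} per hashtag occurrence."""
    for doc in documents:
        # .get() tolerates records that carry no entities or hashtags.
        for hashtag in doc.get("entities", {}).get("hashtags", []):
            yield {"_id": hashtag["text"].lower(), "count": 1}


def reducer(key, values):
    """Sum the counts emitted for one key."""
    return {"_id": key, "count": sum(v["count"] for v in values)}


def run_locally(documents):
    """Stand-in for the streaming job: map, sort by key, reduce per key."""
    mapped = sorted(mapper(documents), key=itemgetter("_id"))
    reduced = [
        reducer(key, list(group))
        for key, group in groupby(mapped, key=itemgetter("_id"))
    ]
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(reduced, key=lambda d: (-d["count"], d["_id"]))


# Hypothetical sample records shaped like the imported tweets.
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Bahrain"}, {"text": "np"}]}},
    {"entities": {"hashtags": [{"text": "bahrain"}]}},
    {"entities": {"hashtags": []}},
    {"delete": {"status": {"id": 1}}},
]
top = run_locally(sample_tweets)
print(top)
```

Running the two functions this way is a quick check of the map and reduce logic before submitting the real streaming job from slide 42.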
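
The aggregation pipeline on slide 45 runs five stages in order: `$unwind`, `$match`, `$group`, `$sort`, `$limit`. A rough pure-Python equivalent, working on a list of dicts instead of a MongoDB collection, is sketched below. The function name `top_hashtags` and the sample data are assumptions, and the alphabetical tie-break is added only to make the output deterministic (the slide's `$sort` orders by count alone).

```python
from collections import Counter


def top_hashtags(tweets, limit=10):
    counts = Counter()
    for tweet in tweets:
        # $unwind: one step per element of entities.hashtags
        for tag in tweet.get("entities", {}).get("hashtags", []):
            text = tag.get("text")
            # $match: keep only entries where the text field exists
            if text is not None:
                # $group with count: { $sum: 1 }
                counts[text] += 1
    # $sort: count descending (ties alphabetical, an addition for the sketch)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # $limit
    return [{"_id": text, "count": count} for text, count in ranked[:limit]]


# Hypothetical sample documents.
tweets = [
    {"entities": {"hashtags": [{"text": "RT"}, {"text": "Arsenal"}]}},
    {"entities": {"hashtags": [{"text": "RT"}]}},
    {"entities": {"hashtags": []}},
    {},
]
result = top_hashtags(tweets)
print(result)
```

Unlike the map reduce version, nothing here is written to an output collection; the result comes straight back to the caller, which matches the "realtime aggregation" point on slide 25.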

Editor's Notes

  7. By reducing the transactional semantics the db provides, one can still solve an interesting set of problems where performance is very important, and horizontal scaling then becomes easier.