MongoDB, Hadoop and Humongous Data

    1. MongoDB, Hadoop & Humongous Data - Steve Francia (@spf13)
    2. Talking about: What is Humongous Data; Why MongoDB & Hadoop; Getting Started (Demo); Who’s using MongoDB & Hadoop; Future of Humongous Data
    3. @spf13, AKA Steve Francia: 15+ years building the internet. Father, husband, skateboarder. Chief Solutions Architect @ 10gen, responsible for drivers, integrations, web & docs.
    4. What is humongous data?
    5. 2000: Google Inc. today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs.
    6. 2008: Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
    7. An unprecedented amount of data is being created and is accessible.
    8. Data Growth (chart, millions of URLs): 2000: 1, 2001: 4, 2002: 10, 2003: 24, 2004: 55, 2005: 120, 2006: 250, 2007: 500, 2008: 1,000.
    9. What good is all this data if we can’t make sense of it?
    10. What cost Google millions of $$ 10 years ago to build...
    11. Could easily and cheaply be built by a teenager in a garage, thanks to products like MongoDB, Hadoop & AWS.
    12. MongoDB & Data Processing
    13. Applications have complex needs: MongoDB is an ideal operational database and ideal for BIG data; it is not a data processing engine, but it provides processing functionality.
    14. MongoDB Map Reduce (data flow):
        Map() - map iterates on documents; the document is $this; 1 at a time per shard; emit(k,v)
        Group(k)
        Sort(k)
        Reduce(k, values) -> k,v - input matches output, so it can run multiple times
        Finalize(k,v) -> k,v
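        For concreteness, here is a minimal sketch of a hashtag count written as a MongoDB map reduce job driven from Python. It assumes pymongo 3.x (Collection.map_reduce was removed in pymongo 4.x) and the test.live tweet collection built in the demo below; the twit_hashtags output collection name matches the demo.

            #!/usr/bin/env python
            # Sketch: hashtag counts via MongoDB's built-in map reduce.
            # Assumes pymongo 3.x and the demo's test.live tweet collection.
            from pymongo import MongoClient
            from bson.code import Code

            db = MongoClient().test

            mapper = Code("""
                function () {
                    // map iterates on documents; the document is "this"
                    if (!this.entities || !this.entities.hashtags) return;
                    this.entities.hashtags.forEach(function (tag) {
                        emit(tag.text, 1);  // emit(k, v)
                    });
                }
            """)

            reducer = Code("""
                function (key, values) {
                    // input matches output, so reduce can run multiple times
                    return Array.sum(values);
                }
            """)

            # Results land in the test.twit_hashtags collection.
            db.live.map_reduce(mapper, reducer, "twit_hashtags")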
    15. MongoDB Map Reduce: MongoDB map reduce is quite capable... but with limits: JavaScript is not the best language for processing map reduce; JavaScript is limited in external data processing libraries; it adds load to the data store; sharded environments do parallel processing.
    16. MongoDB Aggregation: Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries and fixes some of the limits of MongoDB MR: it can do realtime aggregation similar to SQL GROUP BY, with parallel processing on sharded clusters.
    17. As your data processing needs increase, you will want to use a tool designed for the job.
    18. Hadoop Map Reduce (data flow):
        InputFormat - produces the input splits
        Map(k1, v1, ctx) - many map operations, 1 at a time per input split; ctx.write(k2, v2) is the same as Mongo’s emit
        Combiner(k2, values2) - runs on the same thread as map; similar to Mongo’s reducer -> k2, v3
        Partitioner(k2) - similar to Mongo’s group
        Sort(keys2)
        Reduce(k3, values4) - reducer threads; runs once per key; similar to Mongo’s Finalize -> kf, vf
        OutputFormat
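        As a toy walkthrough of those stages in plain Python (no Hadoop involved; the sample documents and tag names are made up):

            # Toy walkthrough of the map -> combine -> partition/sort -> reduce flow.
            from collections import defaultdict

            splits = [  # two input splits, each a list of documents
                [{"tags": ["a", "b"]}, {"tags": ["b"]}],
                [{"tags": ["a", "a"]}],
            ]

            shuffled = defaultdict(list)  # where partitioned output collects
            for split in splits:
                # Map: one (k2, v2) pair per tag, like ctx.write(k2, v2).
                mapped = [(tag, 1) for doc in split for tag in doc["tags"]]
                # Combiner: pre-aggregate on the map side to shrink the shuffle.
                combined = defaultdict(int)
                for k, v in mapped:
                    combined[k] += v
                # Partition: route each (k2, v3) to the reducer for that key.
                for k, v in combined.items():
                    shuffled[k].append(v)

            # Sort, then Reduce: runs once per key.
            for key in sorted(shuffled):
                print("%s %d" % (key, sum(shuffled[key])))  # -> a 3, b 2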
    19. MongoDB & Hadoop (data flow):
        MongoDB (single server or sharded cluster) - shard chunks (64mb); same as Mongos
        InputFormat - creates a list of Input Splits; a RecordReader reads each split
        Map(k1, v1, ctx) - many map operations, 1 at a time per input split; ctx.write(k2, v2)
        Combiner(k2, values2) - runs on the same thread as map -> k2, v3
        Partitioner(k2)
        Sort(k2)
        Reduce(k2, values3) - reducer threads; runs once per key -> kf, vf
        OutputFormat - writes back to MongoDB
    20. DEMO TIME
    21. DEMO: Install Hadoop MongoDB Plugin; import tweets from twitter; write mapper in Python using Hadoop streaming; write reducer in Python using Hadoop streaming; call myself a data scientist.
    22. Installing Mongo-hadoop (https://gist.github.com/1887726):

        hadoop_version 0.23
        hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
        git clone git://github.com/mongodb/mongo-hadoop.git
        cd mongo-hadoop
        sed -i "s/default/$hadoop_version/g" build.sbt
        cd streaming
        ./build.sh
    23. Grokking Twitter:

        curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live

        ... let it run for about 2 hours
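        Before running any jobs it is worth confirming the import worked; a quick pymongo check (not from the deck; collection names as above):

            # Sanity check: tweets landed in test.live and carry entities.hashtags.
            from pymongo import MongoClient

            db = MongoClient().test
            print(db.live.count())  # count() in pymongo 2.x/3.x; count_documents({}) in 4.x
            print(db.live.find_one({"entities.hashtags.0": {"$exists": True}})["entities"]["hashtags"])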
    24. DEMO 1
    25. Map Hashtags in Python:

        #!/usr/bin/env python
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONMapper

        def mapper(documents):
            for doc in documents:
                for hashtag in doc['entities']['hashtags']:
                    yield {'_id': hashtag['text'], 'count': 1}

        BSONMapper(mapper)
        print >> sys.stderr, "Done Mapping."
    26. Reduce hashtags in Python:

        #!/usr/bin/env python
        import sys
        sys.path.append(".")
        from pymongo_hadoop import BSONReducer

        def reducer(key, values):
            print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key.encode('utf8'), 'count': _count}

        BSONReducer(reducer)
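        The two functions can be smoke-tested locally with plain dicts before involving Hadoop or BSON streaming. A sketch, assuming mapper and reducer are importable (or pasted) into one session; the sample tweets are made up:

            # Local smoke test of the mapper/reducer logic, no Hadoop required.
            sample = [
                {'entities': {'hashtags': [{'text': u'mongodb'}, {'text': u'hadoop'}]}},
                {'entities': {'hashtags': [{'text': u'mongodb'}]}},
            ]
            pairs = list(mapper(sample))
            print(reducer(u'mongodb', [p for p in pairs if p['_id'] == u'mongodb']))
            # -> {'_id': 'mongodb', 'count': 2}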
    27. All together:

        hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
            -mapper examples/twitter/twit_hashtag_map.py \
            -reducer examples/twitter/twit_hashtag_reduce.py \
            -inputURI mongodb://127.0.0.1/test.live \
            -outputURI mongodb://127.0.0.1/test.twit_reduction \
            -file examples/twitter/twit_hashtag_map.py \
            -file examples/twitter/twit_hashtag_reduce.py
    28. Popular Hash Tags:

        db.twit_hashtags.find().sort( {count : -1} )
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
        { "_id" : "teamfollowback", "count" : 200 }
        { "_id" : "RT", "count" : 150 }
        { "_id" : "Arsenal", "count" : 148 }
        { "_id" : "milars", "count" : 145 }
        { "_id" : "sanremo", "count" : 145 }
        { "_id" : "LoseMyNumberIf", "count" : 139 }
        { "_id" : "RelationshipsShould", "count" : 137 }
        { "_id" : "Bahrain", "count" : 129 }
        { "_id" : "bahrain", "count" : 125 }
        { "_id" : "oomf", "count" : 117 }
        { "_id" : "BabyKillerOcalan", "count" : 106 }
        { "_id" : "TeamFollowBack", "count" : 105 }
        { "_id" : "WhyDoPeopleThink", "count" : 102 }
        { "_id" : "np", "count" : 100 }
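        The same query from pymongo, for reference:

            # Top hashtags via pymongo instead of the mongo shell.
            from pymongo import MongoClient

            db = MongoClient().test
            for doc in db.twit_hashtags.find().sort('count', -1).limit(15):
                print(doc)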
    29. DEMO 2
    30. Aggregation in Mongo 2.1:

        db.live.aggregate(
            { $unwind : "$entities.hashtags" },
            { $match : { "entities.hashtags.text" : { $exists : true } } },
            { $group : { _id : "$entities.hashtags.text",
                         count : { $sum : 1 } } },
            { $sort : { count : -1 } },
            { $limit : 10 }
        )
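        The equivalent pipeline driven from pymongo, as a sketch (with pymongo 3.x, aggregate() takes the stages as a list and returns a cursor rather than the {result: ...} document shown on the next slide):

            # Slide 30's pipeline from pymongo.
            from pymongo import MongoClient

            db = MongoClient().test
            pipeline = [
                {"$unwind": "$entities.hashtags"},
                {"$match": {"entities.hashtags.text": {"$exists": True}}},
                {"$group": {"_id": "$entities.hashtags.text", "count": {"$sum": 1}}},
                {"$sort": {"count": -1}},
                {"$limit": 10},
            ]
            for doc in db.live.aggregate(pipeline):
                print(doc)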
    31. Popular Hash Tags:

        db.twit_hashtags.aggregate(a)
        {
            "result" : [
                { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
                { "_id" : "teamfollowback", "count" : 200 },
                { "_id" : "RT", "count" : 150 },
                { "_id" : "Arsenal", "count" : 148 },
                { "_id" : "milars", "count" : 145 },
                { "_id" : "sanremo", "count" : 145 },
                { "_id" : "LoseMyNumberIf", "count" : 139 },
                { "_id" : "RelationshipsShould", "count" : 137 },
                { "_id" : "Bahrain", "count" : 129 },
                { "_id" : "bahrain", "count" : 125 }
            ],
            "ok" : 1
        }
    32. Who is Using MongoDB & Hadoop Today
    33. Production usage: Orbitz, Badgeville, foursquare, CityGrid, and more.
    34. The Future of humongous data
    35. What is BIG? BIG today is normal tomorrow.
    36. Data Growth (chart, millions of URLs): 2000: 1, 2001: 4, 2002: 10, 2003: 24, 2004: 55, 2005: 120, 2006: 250, 2007: 500, 2008: 1,000, 2009: 2,150, 2010: 4,400, 2011: 9,000.
    37. Data Growth (same chart, repeated).
    38. 2012: Generating over 250 million tweets per day. (Speaker note: one site is generating nearly as many URLs as the entire internet 6 years ago.)
    39. MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
    40. Hadoop is our first step.
    41. MongoDB is committed to working with the best data tools, including Storm, Spark, & more.
    42. http://spf13.com | http://github.com/spf13 | @spf13 - Questions? - download at mongodb.org - We’re hiring!! Contact us at jobs@10gen.com