MongoDB, Hadoop and Humongous Data
  • Speaker note: One site is generating nearly as many URLs as the entire internet 6 years ago.
  • Transcript

    • 1. MongoDB, Hadoop & Humongous Data. Steve Francia, @spf13
    • 2. Talking about: What is Humongous Data; Why MongoDB & Hadoop; Getting Started (Demo); Who's using MongoDB & Hadoop; Future of Humongous Data
    • 3. @spf13, AKA Steve Francia. 15+ years building the internet. Father, husband, skateboarder. Chief Solutions Architect @ 10gen, responsible for drivers, integrations, web & docs
    • 4. What is humongous data?
    • 5. 2000: Google Inc. today announced it has released the largest search engine on the Internet. Google's new index, comprising more than 1 billion URLs.
    • 6. 2008: Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
    • 7. An unprecedented amount of data is being created and is accessible
    • 8. Data Growth [chart: millions of URLs per year, 2000-2008: 1, 4, 10, 24, 55, 120, 250, 500, 1,000]
    • 9. What good is all this data if we can't make sense of it?
    • 10. What cost Google millions of $$ 10 years ago to build...
    • 11. ...could easily and cheaply be built by a teenager in a garage, thanks to products like MongoDB, Hadoop & AWS
    • 12. MongoDB & Data Processing
    • 13. Applications have complex needs. MongoDB: ideal operational database. MongoDB: ideal for BIG data. Not a data processing engine, but provides processing functionality.
    • 14. MongoDB Map Reduce [diagram: MongoDB data → map() iterates on documents (the document is $this, 1 at a time per shard) → emit(k,v) → group(k) → sort(k) → reduce(k, values) (input matches output; can run multiple times) → finalize(k,v) → k,v]
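The emit/group/reduce flow in the slide above can be sketched in pure Python. This is a simulation of the semantics only (MongoDB actually runs JavaScript map/reduce functions server-side); the documents and tags are made up:

```python
from collections import defaultdict

# Hypothetical sample documents standing in for a MongoDB collection.
docs = [
    {"tags": ["mongo", "hadoop"]},
    {"tags": ["mongo"]},
]

def map_fn(doc, emit):
    # In MongoDB's JS map(), the document is `this` and emit(k, v)
    # produces intermediate key/value pairs.
    for tag in doc["tags"]:
        emit(tag, 1)

def reduce_fn(key, values):
    # reduce(k, values) must return a value of the same shape it receives,
    # because MongoDB may run it multiple times over partial results.
    return sum(values)

# group(k): collect emitted values per key, then reduce each group.
grouped = defaultdict(list)
for doc in docs:
    map_fn(doc, lambda k, v: grouped[k].append(v))

results = {k: reduce_fn(k, vs) for k, vs in sorted(grouped.items())}
print(results)  # {'hadoop': 1, 'mongo': 2}
```

Because reduce can rerun over its own output, returning `sum(values)` (same shape as each input value) keeps it re-entrant, which is the "input matches output" constraint on the slide.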
    • 15. MongoDB Map Reduce. MongoDB map reduce is quite capable... but with limits: JavaScript is not the best language for processing map reduce; JavaScript is limited in external data-processing libraries; it adds load to the data store; sharded environments do parallel processing
    • 16. MongoDB Aggregation. Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries and fixes some of the limits of MongoDB MR: realtime aggregation similar to SQL GROUP BY; parallel processing on sharded clusters
    • 17. As your data processing needs increase, you will want to use a tool designed for the job
    • 18. Hadoop Map Reduce [diagram: InputFormat → Map(k1,v1,ctx) (many map operations, 1 at a time per input split) → ctx.write(k2,v2) (similar to Mongo's emit) → Combiner(k2,values2) (runs on the same thread as map, similar to Mongo's reduce) → Partitioner(k2) → Sort(keys2) (similar to Mongo's group) → reducer threads → Reduce(k3,values4) (runs once per key, similar to Mongo's finalize) → OutputFormat → kf,vf]
    • 19. MongoDB & Hadoop [diagram: MongoDB (single server or sharded cluster) → InputFormat creates a list of input splits from MongoDB shard chunks (64 MB); per split: RecordReader → Map(k1,v1,ctx) (many map operations, 1 at a time per input split) → ctx.write(k2,v2) → Combiner(k2,values2) (runs on the same thread as map) → Partitioner(k2) → Sort(k2) → reducer threads → Reduce(k2,values3) (runs once per key) → OutputFormat → kf,vf → MongoDB]
    • 20. DEMO TIME
    • 21. DEMO: Install the Hadoop MongoDB plugin; import tweets from Twitter; write a mapper in Python using Hadoop streaming; write a reducer in Python using Hadoop streaming; call myself a data scientist
    • 22. Installing mongo-hadoop (https://gist.github.com/1887726)
      hadoop_version="0.23"
      hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
      git clone git://github.com/mongodb/mongo-hadoop.git
      cd mongo-hadoop
      sed -i "s/default/$hadoop_version/g" build.sbt
      cd streaming
      ./build.sh
    • 23. Grokking Twitter:
      curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live
      ... let it run for about 2 hours
    • 24. DEMO 1
    • 25. Map hashtags in Python:
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONMapper

      def mapper(documents):
          for doc in documents:
              for hashtag in doc['entities']['hashtags']:
                  yield {'_id': hashtag['text'], 'count': 1}

      BSONMapper(mapper)
      print >> sys.stderr, "Done Mapping."
    • 26. Reduce hashtags in Python:
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONReducer

      def reducer(key, values):
          print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
          _count = 0
          for v in values:
              _count += v['count']
          return {'_id': key.encode('utf8'), 'count': _count}

      BSONReducer(reducer)
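The mapper and reducer logic from the two slides above can be exercised standalone, without Hadoop or pymongo_hadoop. The tweet documents below are fabricated, and the grouping step stands in for Hadoop's shuffle/sort between map and reduce:

```python
def mapper(documents):
    # Same shape as the slide's mapper: one {_id, count} per hashtag.
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

def reducer(key, values):
    # Same shape as the slide's reducer: sum the counts for one key.
    return {'_id': key, 'count': sum(v['count'] for v in values)}

# Fabricated tweets mimicking Twitter's entities.hashtags structure.
sample = [
    {'entities': {'hashtags': [{'text': 'mongodb'}, {'text': 'hadoop'}]}},
    {'entities': {'hashtags': [{'text': 'mongodb'}]}},
]

# Stand-in for Hadoop's shuffle: group emitted pairs by key.
grouped = {}
for kv in mapper(sample):
    grouped.setdefault(kv['_id'], []).append(kv)

results = [reducer(k, vs) for k, vs in sorted(grouped.items())]
print(results)
# [{'_id': 'hadoop', 'count': 1}, {'_id': 'mongodb', 'count': 2}]
```

In the real job, BSONMapper/BSONReducer stream BSON over stdin/stdout for Hadoop streaming; only the two functions carry the application logic.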
    • 27. All together:
      hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
        -mapper examples/twitter/twit_hashtag_map.py \
        -reducer examples/twitter/twit_hashtag_reduce.py \
        -inputURI mongodb://127.0.0.1/test.live \
        -outputURI mongodb://127.0.0.1/test.twit_reduction \
        -file examples/twitter/twit_hashtag_map.py \
        -file examples/twitter/twit_hashtag_reduce.py
    • 28. Popular hash tags: db.twit_hashtags.find().sort({ count : -1 })
      { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
      { "_id" : "teamfollowback", "count" : 200 }
      { "_id" : "RT", "count" : 150 }
      { "_id" : "Arsenal", "count" : 148 }
      { "_id" : "milars", "count" : 145 }
      { "_id" : "sanremo", "count" : 145 }
      { "_id" : "LoseMyNumberIf", "count" : 139 }
      { "_id" : "RelationshipsShould", "count" : 137 }
      { "_id" : "Bahrain", "count" : 129 }
      { "_id" : "bahrain", "count" : 125 }
      { "_id" : "oomf", "count" : 117 }
      { "_id" : "BabyKillerOcalan", "count" : 106 }
      { "_id" : "TeamFollowBack", "count" : 105 }
      { "_id" : "WhyDoPeopleThink", "count" : 102 }
      { "_id" : "np", "count" : 100 }
    • 29. DEMO 2
    • 30. Aggregation in Mongo 2.1:
      db.live.aggregate(
        { $unwind : "$entities.hashtags" },
        { $match : { "entities.hashtags.text" : { $exists : true } } },
        { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
        { $sort : { count : -1 } },
        { $limit : 10 }
      )
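The pipeline above can be traced stage by stage in plain Python. This is a simulation of what $unwind, $match, $group, $sort, and $limit do to the documents, not a call to MongoDB; the tweets are fabricated:

```python
from collections import Counter

# Fabricated tweets; in the demo these live in the test.live collection.
tweets = [
    {'entities': {'hashtags': [{'text': 'Arsenal'}, {'text': 'RT'}]}},
    {'entities': {'hashtags': [{'text': 'Arsenal'}]}},
    {'entities': {'hashtags': []}},
]

# $unwind: one output document per array element;
# $match: keep only elements that have a text field.
unwound = [h for t in tweets
           for h in t['entities']['hashtags'] if 'text' in h]

# $group with {$sum: 1} counts per hashtag text;
# most_common(10) covers both $sort {count: -1} and $limit 10.
counts = Counter(h['text'] for h in unwound)
top = [{'_id': k, 'count': v} for k, v in counts.most_common(10)]
print(top)  # [{'_id': 'Arsenal', 'count': 2}, {'_id': 'RT', 'count': 1}]
```

Unlike map reduce, these stages run inside the server in native code, which is why the framework suits realtime aggregation.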
    • 31. Popular hash tags: db.twit_hashtags.aggregate(a)
      { "result" : [
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
        { "_id" : "teamfollowback", "count" : 200 },
        { "_id" : "RT", "count" : 150 },
        { "_id" : "Arsenal", "count" : 148 },
        { "_id" : "milars", "count" : 145 },
        { "_id" : "sanremo", "count" : 145 },
        { "_id" : "LoseMyNumberIf", "count" : 139 },
        { "_id" : "RelationshipsShould", "count" : 137 },
        { "_id" : "Bahrain", "count" : 129 },
        { "_id" : "bahrain", "count" : 125 }
      ], "ok" : 1 }
    • 32. Who is using MongoDB & Hadoop today
    • 33. Production usage: Orbitz, Badgeville, foursquare, CityGrid, and more
    • 34. The future of humongous data
    • 35. What is BIG? BIG today is normal tomorrow
    • 36. Data Growth [chart: millions of URLs per year, 2000-2011: 1, 4, 10, 24, 55, 120, 250, 500, 1,000, 2,150, 4,400, 9,000]
    • 37. Data Growth [same chart, repeated]
    • 38. 2012: generating over 250 million tweets per day
    • 39. MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
    • 40. Hadoop is our first step
    • 41. MongoDB is committed to working with the best data tools, including Storm, Spark, & more
    • 42. http://spf13.com | http://github.com/s | @spf13. Questions? Download at mongodb.org. We're hiring!! Contact us at jobs@10gen.com
