MongoDB, Hadoop and Humongous Data

Speaker note: One site is generating nearly as many URLs as the entire internet 6 years ago.

    1. MongoDB, Hadoop & Humongous Data. Steve Francia, @spf13
    2. Talking about: What is Humongous Data; Why MongoDB & Hadoop; Getting Started (demo); Who's using MongoDB & Hadoop; the Future of Humongous Data
    3. @spf13, AKA Steve Francia: 15+ years building the internet; father, husband, skateboarder; Chief Solutions Architect @ 10gen, responsible for drivers, integrations, web & docs
    4. What is humongous data?
    5. 2000: Google Inc. today announced it has released the largest search engine on the Internet. Google's new index, comprising more than 1 billion URLs.
    6. 2008: Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
    7. An unprecedented amount of data is being created and is accessible.
    8. Data Growth [chart]: millions of URLs indexed, 2000 to 2008: 1, 4, 10, 24, 55, 120, 250, 500, 1,000.
    9. What good is all this data if we can't make sense of it?
    10. What cost Google millions of $$ 10 years ago to build...
    11. ...could easily and cheaply be built by a teenager in a garage, thanks to products like MongoDB, Hadoop & AWS.
    12. MongoDB & Data Processing
    13. Applications have complex needs. MongoDB: ideal operational database; ideal for BIG data. Not a data processing engine, but provides processing functionality.
    14. MongoDB Map Reduce [diagram]: MongoDB data feeds Map(), which iterates on documents one at a time per shard (the current document is $this) and calls emit(k, v); emitted pairs are grouped and sorted by key; Reduce(k, values) can run multiple times, so its input must match its output; an optional Finalize(k, v) runs last, and the resulting (k, v) pairs land back in MongoDB.
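        The deck stops at the diagram, but the same flow can be driven from a driver. A minimal sketch in Python, assuming a local mongod and a hypothetical "pageviews" collection with documents like {"url": ..., "views": ...} (pymongo 2.x era API, matching the deck's vintage):

            # MongoDB's native map/reduce from pymongo; the map and reduce
            # bodies are JavaScript strings wrapped in bson.Code.
            import pymongo
            from bson.code import Code

            db = pymongo.Connection()["test"]

            # map() runs once per document; the current document is `this`
            mapper = Code("function() { emit(this.url, this.views); }")

            # reduce(k, values) may run multiple times, so its output must
            # have the same shape as a single emitted value
            reducer = Code("function(key, values) { return Array.sum(values); }")

            # out="..." writes results to a collection and returns it
            out = db.pageviews.map_reduce(mapper, reducer, out="views_per_url")
            for doc in out.find().sort("value", -1):
                print doc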
    15. MongoDB Map Reduce: MongoDB map reduce is quite capable... but with limits: JavaScript is not the best language for processing map reduce; JavaScript is limited in external data processing libraries; it adds load to the data store; sharded environments do parallel processing.
    16. MongoDB Aggregation: Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries and fixes some of the limits of MongoDB MR: it can do realtime aggregation similar to SQL GROUP BY, with parallel processing on sharded clusters.
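        A minimal sketch of the GROUP BY analogy from Python, reusing the hypothetical "pageviews" collection above (pymongo exposes the framework as Collection.aggregate):

            # Roughly: SELECT url, SUM(views) FROM pageviews GROUP BY url
            import pymongo

            db = pymongo.Connection()["test"]
            result = db.pageviews.aggregate([
                {"$group": {"_id": "$url", "total": {"$sum": "$views"}}},
                {"$sort": {"total": -1}},
            ])
            print result  # {"result": [...], "ok": 1} in this era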
    17. As your data processing needs increase, you will want to use a tool designed for the job.
    18. Hadoop Map Reduce [diagram]: An InputFormat feeds Map(k1, v1, ctx), many map operations, one at a time per input split; ctx.write(k2, v2) is the same as Mongo's emit; an optional Combiner(k2, values2) runs on the same thread as the map and is similar to Mongo's reducer; Partitioner(k2) is similar to Mongo's group; keys are sorted; reducer threads run Reduce(k3, values3) once per key, similar to Mongo's finalize; the OutputFormat writes the final (kf, vf) pairs.
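        For contrast with the diagram, a classic Hadoop streaming pair in plain Python (a stdin/stdout word count, not the Mongo connector; file names are illustrative). Each print in the mapper plays the ctx.write(k2, v2) role, and streaming's sort between the phases delivers equal keys to the reducer adjacently:

            # wc_map.py: emit one (word, 1) pair per word, tab-separated
            import sys

            for line in sys.stdin:
                for word in line.split():
                    print "%s\t1" % word

            # wc_reduce.py: keys arrive sorted, so equal keys are adjacent
            import sys

            current, count = None, 0
            for line in sys.stdin:
                word, n = line.rsplit("\t", 1)
                if word != current:
                    if current is not None:
                        print "%s\t%d" % (current, count)
                    current, count = word, 0
                count += int(n)
            if current is not None:
                print "%s\t%d" % (current, count)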
    19. MongoDB & Hadoop [diagram]: A single server or sharded cluster feeds an InputFormat that creates a list of input splits; MongoDB shard chunks (64mb) map onto splits the same as mongos. For each split, a RecordReader drives Map(k1, v1, ctx), many map operations, one at a time per input split; ctx.write(k2, v2) goes to a Combiner(k2, values2) on the same thread as the map, then Partitioner(k2), then Sort(k2); reducer threads run Reduce(k2, values3) once per key, and the OutputFormat writes the final (kf, vf) pairs back to MongoDB.
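        A conceptual sketch (hypothetical, not the connector's actual code) of what the slide describes: chunk-sized ranges become input splits, and each split feeds exactly one map task:

            # Carve a collection into input splits. On a sharded cluster the
            # connector can reuse the existing ~64mb chunk ranges; on a single
            # server it computes comparable ranges itself.
            def make_splits(boundaries):
                """boundaries: sorted _id values marking range edges."""
                return [{"min": lo, "max": hi}
                        for lo, hi in zip(boundaries, boundaries[1:])]

            for split in make_splits([0, 1000, 2000, 3000]):
                print split  # each split drives one Map(k1, v1, ctx) task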
    20. DEMO TIME
    21. DEMO: Install the Hadoop MongoDB plugin; import tweets from Twitter; write a mapper in Python using Hadoop streaming; write a reducer in Python using Hadoop streaming; call myself a data scientist.
    22. Installing mongo-hadoop (https://gist.github.com/1887726):

        hadoop_version="0.23"
        hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

        git clone git://github.com/mongodb/mongo-hadoop.git
        cd mongo-hadoop
        sed -i "s/default/$hadoop_version/g" build.sbt
        cd streaming
        ./build.sh
    23. Grokking Twitter:

        curl https://stream.twitter.com/1/statuses/sample.json \
            -u <login>:<password> \
            | mongoimport -d test -c live

        ... let it run for about 2 hours
    24. DEMO 1
    25. Map hashtags in Python:

        #!/usr/bin/env python
        import sys
        sys.path.append(".")

        from pymongo_hadoop import BSONMapper

        def mapper(documents):
            for doc in documents:
                for hashtag in doc['entities']['hashtags']:
                    yield {'_id': hashtag['text'], 'count': 1}

        BSONMapper(mapper)
        print >> sys.stderr, "Done Mapping."
    26. Reduce hashtags in Python:

        #!/usr/bin/env python
        import sys
        sys.path.append(".")

        from pymongo_hadoop import BSONReducer

        def reducer(key, values):
            print >> sys.stderr, "Hashtag %s" % key.encode('utf-8')
            _count = 0
            for v in values:
                _count += v['count']
            return {'_id': key.encode('utf-8'), 'count': _count}

        BSONReducer(reducer)
    27. All together:

        hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
            -mapper examples/twitter/twit_hashtag_map.py \
            -reducer examples/twitter/twit_hashtag_reduce.py \
            -inputURI mongodb://127.0.0.1/test.live \
            -outputURI mongodb://127.0.0.1/test.twit_reduction \
            -file examples/twitter/twit_hashtag_map.py \
            -file examples/twitter/twit_hashtag_reduce.py
    28. Popular hashtags:

        db.twit_hashtags.find().sort({ count : -1 })

        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
        { "_id" : "teamfollowback", "count" : 200 }
        { "_id" : "RT", "count" : 150 }
        { "_id" : "Arsenal", "count" : 148 }
        { "_id" : "milars", "count" : 145 }
        { "_id" : "sanremo", "count" : 145 }
        { "_id" : "LoseMyNumberIf", "count" : 139 }
        { "_id" : "RelationshipsShould", "count" : 137 }
        { "_id" : "Bahrain", "count" : 129 }
        { "_id" : "bahrain", "count" : 125 }
        { "_id" : "oomf", "count" : 117 }
        { "_id" : "BabyKillerOcalan", "count" : 106 }
        { "_id" : "TeamFollowBack", "count" : 105 }
        { "_id" : "WhyDoPeopleThink", "count" : 102 }
        { "_id" : "np", "count" : 100 }
    29. DEMO 2
    30. Aggregation in Mongo 2.1:

        db.live.aggregate(
            { $unwind : "$entities.hashtags" },
            { $match : { "entities.hashtags.text" : { $exists : true } } },
            { $group : { _id : "$entities.hashtags.text",
                         count : { $sum : 1 } } },
            { $sort : { count : -1 } },
            { $limit : 10 }
        )
    31. Popular hashtags:

        db.twit_hashtags.aggregate(a)
        {
            "result" : [
                { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
                { "_id" : "teamfollowback", "count" : 200 },
                { "_id" : "RT", "count" : 150 },
                { "_id" : "Arsenal", "count" : 148 },
                { "_id" : "milars", "count" : 145 },
                { "_id" : "sanremo", "count" : 145 },
                { "_id" : "LoseMyNumberIf", "count" : 139 },
                { "_id" : "RelationshipsShould", "count" : 137 },
                { "_id" : "Bahrain", "count" : 129 },
                { "_id" : "bahrain", "count" : 125 }
            ],
            "ok" : 1
        }
    32. Who is Using MongoDB & Hadoop Today
    33. Production usage: Orbitz, Badgeville, foursquare, CityGrid, and more
    34. The Future of humongous data
    35. What is BIG? BIG today is normal tomorrow
    36. Data Growth [chart]: millions of URLs indexed, 2000 to 2011: 1, 4, 10, 24, 55, 120, 250, 500, 1,000, 2,150, 4,400, 9,000.
    37. Data Growth [chart repeated from slide 36]
    38. 2012: Generating over 250 million tweets per day
    39. MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
    40. Hadoop is our first step
    41. MongoDB is committed to working with the best data tools, including Storm, Spark, & more
    42. http://spf13.com | http://github.com/s | @spf13. Questions? Download at mongodb.org. We're hiring!! Contact us at jobs@10gen.com
