MongoDB, Hadoop and humongous data - MongoSV 2012

Learn how to integrate MongoDB with Hadoop for large-scale distributed data processing. Using tools like MapReduce, Pig, and Streaming, you will learn how to do analytics and ETL on large datasets, with the ability to load and save data against MongoDB. With Hadoop MapReduce, Java and Scala programmers will find a native solution for using MapReduce to process their data with MongoDB. Programmers of all kinds will find a new way to work with ETL, using Pig to extract and analyze large datasets and persist the results to MongoDB. Python and Ruby programmers can rejoice as well: a new way to write native MongoDB MapReduce using the Hadoop Streaming interfaces.



  1. MongoDB, Hadoop & humongous data
  2. Talking about: What is Humongous Data; Humongous Data & You; MongoDB & Data Processing; the Future of Humongous Data
  3. @spf13, AKA Steve Francia. 15+ years building the internet. Father, husband, skateboarder. Chief Solutions Architect, responsible for drivers, integrations, web & docs.
  4. What is humongous data?
  5. 2000: Google Inc. today announced it has released the largest search engine on the Internet. Google’s new index comprises more than 1 billion URLs.
  6. 2008: Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
  7. An unprecedented amount of data is being created and is accessible.
  8. Data Growth, in millions of URLs: 2000: 1; 2001: 4; 2002: 10; 2003: 24; 2004: 55; 2005: 120; 2006: 250; 2007: 500; 2008: 1,000
  9. Truly exponential growth is hard for people to grasp. A BBC reporter recently: "Your current PC is more powerful than the computer they had on board the first flight to the moon."
  10. Moore’s Law applies to more than just CPUs. Boiled down, it is that things double at regular intervals. It’s exponential growth... and it applies to big data.
  11. How BIG is it?
  12. How BIG is it? 2008
  13. How BIG is it? 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008
  14. Why all this talk about BIG Data now?
  15. In the past few years, open source software emerged enabling ‘us’ to handle BIG Data.
  16. The Big Data Story
  17. Is actually two stories
  18. Doers & Tellers are talking about different things. http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
  19. Tellers
  20. Doers
  21. Doers talk a lot more about actual solutions
  22. They know it’s a two-sided story: Storage & Processing
  23. Takeaways: MongoDB and Hadoop. MongoDB for storage & operations; Hadoop for processing & analytics.
  24. MongoDB & Data Processing
  25. Applications have complex needs. MongoDB is an ideal operational database, and ideal for BIG data. It is not a data processing engine, but it provides processing functionality.
  26. Many options for processing data:
      • Process in MongoDB using Map Reduce
      • Process in MongoDB using the Aggregation Framework
      • Process outside MongoDB (using Hadoop)
  27. MongoDB Map Reduce data flow:
      MongoDB Data → Map(): map iterates on documents, 1 at a time per shard; the document is $this; emit(k,v)
      → Group(k) → Sort(k)
      → Reduce(k, values): input matches output; can run multiple times
      → Finalize(k, v) → k,v
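A sketch of driving this loop from Python with pymongo (the map_reduce helper exists in pymongo 2.x/3.x but was removed in 4.0). The test.live collection matches the demo later in the deck; the lang field and lang_counts output name are assumptions for illustration:

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["test"]

    # map runs once per document; inside the JS function the document is 'this'
    map_js = Code("function () { emit(this.lang, 1); }")   # 'lang' is an assumed field
    # reduce may run multiple times per key, so its output must match its input shape
    reduce_js = Code("function (key, values) { return Array.sum(values); }")

    out = db.live.map_reduce(map_js, reduce_js, out="lang_counts")
    for doc in out.find().sort("value", -1).limit(5):
        print(doc)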
  28. MongoDB Map Reduce: MongoDB map reduce is quite capable... but with limits:
      - JavaScript is not the best language for processing map reduce
      - JavaScript is limited in external data processing libraries
      - It adds load to the data store
  29. MongoDB Aggregation: Most uses of MongoDB Map Reduce were for aggregation. The Aggregation Framework is optimized for aggregate queries: realtime aggregation similar to SQL GROUP BY.
  30. MongoDB & Hadoop data flow:
      MongoDB (single server or sharded cluster) → InputFormat: creates a list of Input Splits from shard chunks (64mb), same as mongos
      → RecordReader, one per split
      → Map(k1, v1, ctx): many map operations, 1 at a time per input split; ctx.write(k2, v2)
      → Combiner(k2, values2): runs on the same thread as map
      → Partitioner(k2) → Sort(k2)
      → Reducer threads: Reduce(k2, values3), runs once per key
      → Output Format → kf, vf → MongoDB
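To make those stages concrete, here is a minimal, dependency-free Python sketch of the same flow (splits → map → combiner → partition → sort → reduce). It illustrates the model only, not the connector's actual implementation:

    from collections import defaultdict

    def run_job(splits, mapper, combiner, reducer, n_reducers=2):
        # map side: each input split is processed independently,
        # with a combiner pre-merging values before they leave the mapper
        partitions = [defaultdict(list) for _ in range(n_reducers)]
        for split in splits:
            local = defaultdict(list)
            for record in split:
                for k, v in mapper(record):          # ctx.write(k2, v2)
                    local[k].append(v)
            for k, values in local.items():
                k2, v2 = combiner(k, values)         # Combiner(k2, values2)
                partitions[hash(k2) % n_reducers][k2].append(v2)  # Partitioner(k2)

        # reduce side: each reducer sees its keys in sorted order, once per key
        results = {}
        for part in partitions:
            for k in sorted(part):                   # Sort(k2)
                results[k] = reducer(k, part[k])     # Reduce(k2, values3)
        return results

    # word count over two "input splits"
    splits = [["big data", "mongo hadoop"], ["big mongo"]]
    mapper = lambda line: [(w, 1) for w in line.split()]
    combiner = lambda k, vs: (k, sum(vs))
    reducer = lambda k, vs: sum(vs)
    print(run_job(splits, mapper, combiner, reducer))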
  31. DEMO TIME
  32. DEMO:
      Install the Hadoop MongoDB plugin
      Import tweets from Twitter
      Write a mapper in Python using Hadoop Streaming
      Write a reducer in Python using Hadoop Streaming
      Call myself a data scientist
  33. Installing mongo-hadoop (https://gist.github.com/1887726):
      hadoop_version="0.23"
      hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
      git clone git://github.com/mongodb/mongo-hadoop.git
      cd mongo-hadoop
      sed -i "s/default/$hadoop_version/g" build.sbt
      cd streaming
      ./build.sh
  34. Grokking Twitter:
      curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live
      ... let it run for about 2 hours
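Once mongoimport has run for a while, a quick sanity check from Python is worth doing before writing any Hadoop code. A sketch using pymongo, against the test.live collection from above:

    from pymongo import MongoClient

    live = MongoClient()["test"]["live"]
    print(live.count())   # count() in pymongo 2.x/3.x; count_documents({}) in newer drivers

    # confirm that tweets carry hashtags under entities.hashtags
    doc = live.find_one({"entities.hashtags.0": {"$exists": True}})
    print([h["text"] for h in doc["entities"]["hashtags"]])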
  35. DEMO 1
  36. Map Hashtags in Python:
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONMapper

      def mapper(documents):
          # emit one {hashtag, 1} pair per hashtag in each tweet
          for doc in documents:
              for hashtag in doc['entities']['hashtags']:
                  yield {'_id': hashtag['text'], 'count': 1}

      BSONMapper(mapper)
      print >> sys.stderr, "Done Mapping."
  37. Reduce hashtags in Python:
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONReducer

      def reducer(key, values):
          # sum the per-mapper counts for each hashtag
          print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
          _count = 0
          for v in values:
              _count += v['count']
          return {'_id': key.encode('utf8'), 'count': _count}

      BSONReducer(reducer)
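Since the mapper is an ordinary Python generator, both scripts can be smoke-tested locally before involving Hadoop at all. A sketch, with a made-up tweet document:

    # the same generator logic as the mapper above, fed one fake tweet
    def mapper(documents):
        for doc in documents:
            for hashtag in doc['entities']['hashtags']:
                yield {'_id': hashtag['text'], 'count': 1}

    sample = {'entities': {'hashtags': [{'text': 'mongodb'}, {'text': 'hadoop'}]}}
    print(list(mapper([sample])))
    # -> [{'_id': 'mongodb', 'count': 1}, {'_id': 'hadoop', 'count': 1}]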
  38. All together:
      hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
          -mapper examples/twitter/twit_hashtag_map.py \
          -reducer examples/twitter/twit_hashtag_reduce.py \
          -inputURI mongodb://127.0.0.1/test.live \
          -outputURI mongodb://127.0.0.1/test.twit_reduction \
          -file examples/twitter/twit_hashtag_map.py \
          -file examples/twitter/twit_hashtag_reduce.py
  39. Popular Hash Tags:
      db.twit_hashtags.find().sort({ count : -1 })
      { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
      { "_id" : "teamfollowback", "count" : 200 }
      { "_id" : "RT", "count" : 150 }
      { "_id" : "Arsenal", "count" : 148 }
      { "_id" : "milars", "count" : 145 }
      { "_id" : "sanremo", "count" : 145 }
      { "_id" : "LoseMyNumberIf", "count" : 139 }
      { "_id" : "RelationshipsShould", "count" : 137 }
      { "_id" : "Bahrain", "count" : 129 }
      { "_id" : "bahrain", "count" : 125 }
      { "_id" : "oomf", "count" : 117 }
      { "_id" : "BabyKillerOcalan", "count" : 106 }
      { "_id" : "TeamFollowBack", "count" : 105 }
      { "_id" : "WhyDoPeopleThink", "count" : 102 }
      { "_id" : "np", "count" : 100 }
  40. DEMO 2
  41. Aggregation in Mongo 2.1:
      db.live.aggregate(
          { $unwind : "$entities.hashtags" },
          { $match : { "entities.hashtags.text" : { $exists : true } } },
          { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
          { $sort : { count : -1 } },
          { $limit : 10 }
      )
  42. Popular Hash Tags:
      db.twit_hashtags.aggregate(a)
      { "result" : [
          { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
          { "_id" : "teamfollowback", "count" : 200 },
          { "_id" : "RT", "count" : 150 },
          { "_id" : "Arsenal", "count" : 148 },
          { "_id" : "milars", "count" : 145 },
          { "_id" : "sanremo", "count" : 145 },
          { "_id" : "LoseMyNumberIf", "count" : 139 },
          { "_id" : "RelationshipsShould", "count" : 137 },
          { "_id" : "Bahrain", "count" : 129 },
          { "_id" : "bahrain", "count" : 125 }
      ], "ok" : 1 }
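The same pipeline can be issued from Python. A sketch with pymongo (note that pymongo 2.x returned a document with a "result" list, while 3.x and later return a cursor):

    from pymongo import MongoClient

    live = MongoClient()["test"]["live"]
    pipeline = [
        {"$unwind": "$entities.hashtags"},
        {"$match": {"entities.hashtags.text": {"$exists": True}}},
        {"$group": {"_id": "$entities.hashtags.text", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 10},
    ]
    for doc in live.aggregate(pipeline):
        print(doc)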
  43. The Future of humongous data
  44. What is BIG? BIG today is normal tomorrow.
  45. Data Growth, in millions of URLs: 2000: 1; 2001: 4; 2002: 10; 2003: 24; 2004: 55; 2005: 120; 2006: 250; 2007: 500; 2008: 1,000; 2009: 2,150; 2010: 4,400; 2011: 9,000
  46. (same Data Growth chart, repeated)
  47. 2012: Generating over 250 million tweets per day
  48. MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
  49. Hadoop is our first step
  50. MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more.
  51. http://spf13.com · http://github.com/spf13 · @spf13. Questions? Download at github.com/mongodb/mongo-hadoop
