MongoDB Hadoop and Humongous Data

  1. MongoDB Hadoop & humongous data
  2. Talking about:
     • What is humongous data
     • Humongous data & you
     • MongoDB & data processing
     • The future of humongous data
  3. What is humongous data?
  4. 2000: "Google Inc. today announced it has released the largest search engine on the Internet. Google's new index, comprising more than 1 billion URLs..."
  5. 2008: "Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day)."
  6. An unprecedented amount of data is being created and is accessible.
  7. Data Growth (chart of millions of URLs): 2000: 1; 2001: 4; 2002: 10; 2003: 24; 2004: 55; 2005: 120; 2006: 250; 2007: 500; 2008: 1,000
  8. Truly exponential growth is hard for people to grasp. A BBC reporter recently: "Your current PC is more powerful than the computer they had on board the first flight to the moon."
  9. Moore's Law applies to more than just CPUs. Boiled down, it says that things double at regular intervals. That is exponential growth... and it applies to big data.
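
     (A back-of-the-envelope note using the Data Growth chart above, not from the deck itself: growth that doubles every T units of time has the form N(t) = N0 * 2^(t/T). The chart climbs from 1 million to roughly 1,000 million URLs between 2000 and 2008, about 2^10 in eight years, which works out to a doubling period of roughly ten months.)
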
  10. How BIG is it?
  11. How BIG is it? (2008)
  12. How BIG is it? (nested year-over-year circles, 2001 through 2008)
  13. Why all this talk about BIG data now?
  14. In the past few years, open source software has emerged that enables 'us' to handle BIG data.
  15. The Big Data Story
  16. Is actually two stories
  17. Doers & Tellers are talking about different things (http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september)
  18. Tellers
  19. Doers
  20. Doers talk a lot more about actual solutions.
  21. They know it's a two-sided story: Storage & Processing.
  22. Takeaways: MongoDB and Hadoop. MongoDB for storage & operations; Hadoop for processing & analytics.
  23. MongoDB & Data Processing
  24. Applications have complex needs. MongoDB is an ideal operational database, and ideal for BIG data. It is not a data processing engine, but it provides processing functionality.
  25. Many options for processing data:
     • Process in MongoDB using Map Reduce
     • Process in MongoDB using the Aggregation Framework
     • Process outside MongoDB (using Hadoop)
  26. MongoDB Map Reduce (flow): map() iterates over MongoDB data one document at a time per shard (the current document is $this) and calls emit(k,v); the emitted pairs are grouped and sorted by key (group(k), sort(k)); reduce(k, values) produces k,v pairs whose shape must match its input, since reduce can run multiple times; an optional finalize(k,v) step yields the final k,v output. A Python sketch follows.
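
     A minimal sketch of driving the flow above from Python, assuming pymongo 3.x (Collection.map_reduce was removed in pymongo 4) and the test.live tweet collection used later in the demo; the hashtag_counts output name is illustrative:

         # Count hashtags with MongoDB's built-in map/reduce, invoked from Python.
         from pymongo import MongoClient
         from bson.code import Code

         coll = MongoClient("mongodb://127.0.0.1").test.live

         # map runs once per document; inside the JS function, `this` is the document
         mapper = Code("""
             function () {
                 (this.entities.hashtags || []).forEach(function (tag) {
                     emit(tag.text, 1);  // emit(k, v)
                 });
             }""")

         # reduce may run multiple times, so its output shape must match its input
         reducer = Code("""
             function (key, values) { return Array.sum(values); }""")

         result = coll.map_reduce(mapper, reducer, out="hashtag_counts")
         for doc in result.find().sort("value", -1).limit(10):
             print(doc["_id"], doc["value"])
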
  27. MongoDB Map Reduce is quite capable... but it has limits: JavaScript is not the best language for writing map reduce jobs; JavaScript has few external data processing libraries; and running it adds load to the data store.
  28. MongoDB Aggregation: most uses of MongoDB Map Reduce were for aggregation, so the Aggregation Framework is optimized for aggregate queries: realtime aggregation similar to SQL GROUP BY (a sketch of the analogy follows).
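
     Since an aggregate query is essentially a GROUP BY expressed as a pipeline, here is a minimal sketch of the analogy, assuming a recent pymongo (where aggregate returns a cursor) and a hypothetical orders collection with cust_id and price fields:

         # SQL: SELECT cust_id, SUM(price) AS total FROM orders GROUP BY cust_id;
         from pymongo import MongoClient

         orders = MongoClient("mongodb://127.0.0.1").test.orders

         pipeline = [
             {"$group": {"_id": "$cust_id", "total": {"$sum": "$price"}}},
             {"$sort": {"total": -1}},  # like ORDER BY total DESC
         ]
         for row in orders.aggregate(pipeline):
             print(row["_id"], row["total"])
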
  29. MongoDB & Hadoop (flow): the connector's InputFormat creates a list of input splits from MongoDB shard chunks (64 MB each), talking to a single server or a sharded cluster the same way mongos does; a RecordReader feeds each split to many map operations, one document at a time per input split: Map(k1,v1,ctx) calls ctx.write(k2,v2); a Combiner(k2,values2) runs on the same thread as the map; Partitioner(k2) and Sort(k2) shuffle the resulting k2,v3 pairs to reducer threads; Reduce(k2,values3) runs once per key; and the OutputFormat writes the final kf,vf back to MongoDB.
  30. DEMO TIME
  31. DEMO: install the Hadoop MongoDB plugin; import tweets from Twitter; write the mapper in Python using Hadoop Streaming; write the reducer in Python using Hadoop Streaming; call myself a data scientist.
  32. Installing mongo-hadoop (https://gist.github.com/1887726):
      hadoop_version="0.23"
      hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
      git clone git://github.com/mongodb/mongo-hadoop.git
      cd mongo-hadoop
      sed -i "s/default/$hadoop_version/g" build.sbt
      cd streaming
      ./build.sh
  33. Grokking Twitter:
      curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live
      ... let it run for about 2 hours
  34. DEMO 1
  35. Map hashtags in Python:
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONMapper

      def mapper(documents):
          for doc in documents:
              for hashtag in doc['entities']['hashtags']:
                  yield {'_id': hashtag['text'], 'count': 1}

      BSONMapper(mapper)
      print >> sys.stderr, "Done Mapping."
  36. Reduce hashtags in Python:
      #!/usr/bin/env python
      import sys
      sys.path.append(".")
      from pymongo_hadoop import BSONReducer

      def reducer(key, values):
          print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
          _count = 0
          for v in values:
              _count += v['count']
          return {'_id': key.encode('utf8'), 'count': _count}

      BSONReducer(reducer)
  37. All together:
      hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
          -mapper examples/twitter/twit_hashtag_map.py \
          -reducer examples/twitter/twit_hashtag_reduce.py \
          -inputURI mongodb://127.0.0.1/test.live \
          -outputURI mongodb://127.0.0.1/test.twit_reduction \
          -file examples/twitter/twit_hashtag_map.py \
          -file examples/twitter/twit_hashtag_reduce.py
  38. Popular hashtags:
      db.twit_hashtags.find().sort({ count : -1 })
      { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
      { "_id" : "teamfollowback", "count" : 200 }
      { "_id" : "RT", "count" : 150 }
      { "_id" : "Arsenal", "count" : 148 }
      { "_id" : "milars", "count" : 145 }
      { "_id" : "sanremo", "count" : 145 }
      { "_id" : "LoseMyNumberIf", "count" : 139 }
      { "_id" : "RelationshipsShould", "count" : 137 }
      { "_id" : "oomf", "count" : 117 }
      { "_id" : "TeamFollowBack", "count" : 105 }
      { "_id" : "WhyDoPeopleThink", "count" : 102 }
      { "_id" : "np", "count" : 100 }
  39. DEMO 2
  40. Aggregation in Mongo 2.1:
      db.live.aggregate(
          { $unwind : "$entities.hashtags" },
          { $match : { "entities.hashtags.text" : { $exists : true } } },
          { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
          { $sort : { count : -1 } },
          { $limit : 10 }
      )
  41. Popular hashtags:
      db.twit_hashtags.aggregate(a)
      { "result" : [
          { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
          { "_id" : "teamfollowback", "count" : 200 },
          { "_id" : "RT", "count" : 150 },
          { "_id" : "Arsenal", "count" : 148 },
          { "_id" : "milars", "count" : 145 },
          { "_id" : "sanremo", "count" : 145 },
          { "_id" : "LoseMyNumberIf", "count" : 139 },
          { "_id" : "RelationshipsShould", "count" : 137 }
      ], "ok" : 1 }
  42. The Future of humongous data
  43. What is BIG? BIG today is normal tomorrow.
  44. Data Growth (chart of millions of URLs): 2000: 1; 2001: 4; 2002: 10; 2003: 24; 2004: 55; 2005: 120; 2006: 250; 2007: 500; 2008: 1,000; 2009: 2,150; 2010: 4,400; 2011: 9,000
  45. Data Growth (the same chart, repeated)
  46. 2012: generating over 250 million tweets per day.
  47. MongoDB enables us to scale with the redefinition of BIG. New processing tools like Hadoop & Storm are enabling us to process the new BIG.
  48. Hadoop is our first step.
  49. MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more.
  50. Questions? http://spf13.com · http://github.com/spf13 · @spf13 · download at github.com/mongodb/mongo-hadoop
