Webinar: Data Processing and Aggregation Options

12,472
-1

Published on

MongoDB scales easily to store mass volumes of data. However, when it comes to making sense of it all what options do you have? In this talk, we'll take a look at 3 different ways of aggregating your data with MongoDB, and determine the reasons why you might choose one way over another. No matter what your big data needs are, you will find out how MongoDB the big data store is evolving to help make sense of your data.

Published in: Technology, Business
0 Comments
7 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
12,472
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
13,260
Comments
0
Likes
7
Embeds 0
No embeds

No notes for slide

Webinar: Data Processing and Aggregation Options

  1. 1. Data Processing and Aggregation Achille Brighton Consulting Engineer, MongoDB
  2. 2. Big Data
  3. 3. Exponential Data Growth Billions of URLs indexed by Google 1200 1000 800 600 400 200 0 2000 2002 2004 2006 2008
  4. 4. For over a decade Big Data == Custom Software
  5. 5. In the past few years Open source software has emerged enabling the rest of us to handle Big Data
  6. 6. How MongoDB Meets Our Requirements •  MongoDB is an operational database •  MongoDB provides high performance for storage and retrieval at large scale •  MongoDB has a robust query interface permitting intelligent operations •  MongoDB is not a data processing engine, but provides processing functionality
  7. 7. MongoDB data processing options
  8. 8. Getting Example Data
  9. 9. The “hello world” of MapReduce is counting words in a paragraph of text. Let’s try something a little more interesting…
  10. 10. What is the most popular pub name?
  11. 11. Open Street Map Data #!/usr/bin/env python # Data Source # http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59] import re import sys from imposm.parser import OSMParser import pymongo class Handler(object): def nodes(self, nodes): if not nodes: return docs = [] for node in nodes: osm_id, doc, (lon, lat) = node if "name" not in doc: node_points[osm_id] = (lon, lat) continue doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&") doc["_id"] = osm_id doc["location"] = {"type": "Point", "coordinates": [lon, lat]} docs.append(doc) collection.insert(docs)
  12. 12. Example Pub Data { "_id" : 451152, "amenity" : "pub", "name" : "The Dignity", "addr:housenumber" : "363", "addr:street" : "Regents Park Road", "addr:city" : "London", "addr:postcode" : "N3 1DH", "toilets" : "yes", "toilets:access" : "customers", "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] } }
  13. 13. MongoDB MapReduce •  map MongoDB reduce finalize
  14. 14. MongoDB MapReduce • 
  15. 15. map Map Function MongoDB > var map = function() { emit(this.name, 1); reduce finalize
  16. 16. map Reduce Function MongoDB > var reduce = function (key, values) { var sum = 0; values.forEach( function (val) {sum += val;} ); return sum; } reduce finalize
  17. 17. Results > db.pub_names.find().sort({value: -1}).limit(10) { "_id" : "The Red Lion", "value" : 407 } { "_id" : "The Royal Oak", "value" : 328 } { "_id" : "The Crown", "value" : 242 } { "_id" : "The White Hart", "value" : 214 } { "_id" : "The White Horse", "value" : 200 } { "_id" : "The New Inn", "value" : 187 } { "_id" : "The Plough", "value" : 185 } { "_id" : "The Rose & Crown", "value" : 164 } { "_id" : "The Wheatsheaf", "value" : 147 } { "_id" : "The Swan", "value" : 140 }
  18. 18. Pub Names in the Center of London > db.pubs.mapReduce(map, reduce, { out: "pub_names", query: { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } }} }) { "result" : "pub_names", "timeMillis" : 116, "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 }, "ok" : 1, }
  19. 19. Results > db.pub_names.find().sort({value: -1}).limit(10) { { { { { { { { { { "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" : : : : : : : : : : "All Bar One", "value" : 11 } "The Slug & Lettuce", "value" : 7 } "The Coach & Horses", "value" : 6 } "The Green Man", "value" : 5 } "The Kings Arms", "value" : 5 } "The Red Lion", "value" : 5 } "Corney & Barrow", "value" : 4 } "O'Neills", "value" : 4 } "Pitcher & Piano", "value" : 4 } "The Crown", "value" : 4 }
  20. 20. Double Checking
  21. 21. MongoDB MapReduce •  Real-time •  Output directly to document or collection •  Runs inside MongoDB on local data − Adds load to your DB − In Javascript – debugging can be a challenge − Translating in and out of C++
  22. 22. Aggregation Framework
  23. 23. •  Declared in JSON, executes in C++ Aggregation Framework Data Processing in MongoDB
  24. 24. •  Declared in JSON, executes in C++ •  Flexible, functional, and simple Aggregation Framework Data Processing in MongoDB
  25. 25. •  Declared in JSON, executes in C++ •  Flexible, functional, and simple •  Plays nice with sharding Aggregation Framework Data Processing in MongoDB
  26. 26. Pipeline Piping command line operations ps ax | grep mongod | head 1 Data Processing in MongoDB
  27. 27. Pipeline Piping aggregation operations $match | $group | $sort Stream of documents Result document Data Processing in MongoDB
  28. 28. Pipeline Operators •  $match •  $sort •  $project •  $limit •  $group •  $skip •  $unwind •  $geoNear Data Processing in MongoDB
  29. 29. $match •  Filter documents •  Uses existing query syntax •  If using $geoNear it has to be first in pipeline •  $where is not supported
  30. 30. Matching Field Values { "_id" : 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } } { "$match": { "name": "The Red Lion" }} { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ]} { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
  31. 31. $project •  Reshape documents •  Include, exclude or rename fields •  Inject computed fields •  Create sub-document fields
  32. 32. Including and Excluding Fields { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", { “$project”: { “_id”: 0, “amenity”: 1, “name”: 1, }} "coordinates" : [ -1.5494749, 50.7837119 ] } } { “amenity” : “pub”, “name” : “The Red Lion” }
  33. 33. Reformatting Documents { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", { “$project”: { “_id”: 0, “name”: 1, “meta”: { “type”: “$amenity”} }} "coordinates" : [ -1.5494749, 50.7837119 ] } } { “name” : “The Red Lion” “meta” : { “type” : “pub” }}
  34. 34. $group •  Group documents by an ID •  Field reference, object, constant •  Other output fields are computed $max, $min, $avg, $sum $addToSet, $push $first, $last •  Processes all data in memory
  35. 35. Summating fields } { $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" } }} { { { title: "The Great Gatsby", pages: 218, language: "English" title: "War and Peace", pages: 1440, language: "Russian” } } { _id: "Russian", numTitles: 1, sumPages: 1440 { title: "Atlas Shrugged", pages: 1088, language: "English" } } _id: "English", numTitles: 2, sumPages: 1306
  36. 36. Add To Set { title: "The Great Gatsby", pages: 218, language: "English" { $group: { _id: "$language", titles: { $addToSet: "$title" } }} } { { title: "War and Peace", pages: 1440, language: "Russian" } { } { title: "Atlas Shrugged", pages: 1088, language: "English" } } _id: "Russian", titles: [ "War and Peace" ] _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ]
  37. 37. Expanding Arrays { $unwind: "$subjects" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] { } { } } { } title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s"
  38. 38. Back to the pub! •  http://www.offwestend.com/index.php/theatres/pastshows/71
  39. 39. Popular Pub Names >var popular_pub_names = [ { $match : location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959]}}} }, { $group : { _id: “$name” value: {$sum: 1} } }, { $sort : {value: -1} }, { $limit : 10 }
  40. 40. Results > db.pubs.aggregate(popular_pub_names) { "result" : [ { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Green Man", "value" : 5 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "The Red Lion", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 } ], "ok" : 1 }
  41. 41. Aggregation Framework Benefits •  Real-time •  Simple yet powerful interface •  Declared in JSON, executes in C++ •  Runs inside MongoDB on local data − Adds load to your DB − Limited Operators − Data output is limited
  42. 42. Analyzing MongoDB Data in External Systems
  43. 43. MongoDB with Hadoop •  MongoDB
  44. 44. Hadoop MongoDB Connector •  MongoDB or BSON files as input/output •  Source data can be filtered with queries •  Hadoop Streaming support –  For jobs written in Python, Ruby, Node.js •  Supports Hadoop tools such as Pig and Hive
  45. 45. Map Pub Names in Python #!/usr/bin/env python from pymongo_hadoop import BSONMapper def mapper(documents): bounds = get_bounds() # ~2 mile polygon for doc in documents: geo = get_geo(doc["location"]) # Convert the geo type if not geo: continue if bounds.intersects(geo): yield {'_id': doc['name'], 'count': 1} BSONMapper(mapper) print >> sys.stderr, "Done Mapping."
  46. 46. Reduce Pub Names in Python #!/usr/bin/env python from pymongo_hadoop import BSONReducer def reducer(key, values): _count = 0 for v in values: _count += v['count'] return {'_id': key, 'value': _count} BSONReducer(reducer)
  47. 47. Execute MapReduce hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar -mapper examples/pub/map.py -reducer examples/pub/reduce.py -mongo mongodb://127.0.0.1/demo.pubs -outputURI mongodb://127.0.0.1/demo.pub_names
  48. 48. Popular Pub Names Nearby > db.pub_names.find().sort({value: -1}).limit(10) { { { { { { { { { { "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" "_id" : : : : : : : : : : "All Bar One", "value" : 11 } "The Slug & Lettuce", "value" : 7 } "The Coach & Horses", "value" : 6 } "The Kings Arms", "value" : 5 } "Corney & Barrow", "value" : 4 } "O'Neills", "value" : 4 } "Pitcher & Piano", "value" : 4 } "The Crown", "value" : 4 } "The George", "value" : 4 } "The Green Man", "value" : 4 }
  49. 49. MongoDB with Hadoop •  MongoDB warehouse
  50. 50. MongoDB with Hadoop •  ETL MongoDB
  51. 51. Limitations •  Batch processing •  Requires synchronization between data store and processor •  Adds complexity to infrastructure
  52. 52. Advantages •  Processing decoupled from data store •  Parallel processing •  Leverage existing infrastructure •  Java has rich set of data processing libraries –  And other languages if using Hadoop Streaming
  53. 53. Storm
  54. 54. Storm
  55. 55. Storm MongoDB connector •  Spout for MongoDB oplog or capped collections –  Filtering capabilities –  Threaded and non-blocking •  Output to new or existing documents –  Insert/update bolt
  56. 56. Aggregating MongoDB’s Data Processing Options
  57. 57. Data Processing with MongoDB •  Process in MongoDB using Map/Reduce •  Process in MongoDB using Aggregation Framework •  Also: Storing pre-aggregated data –  An exercise in schema design •  Process outside MongoDB using Hadoop and other external tools
  58. 58. External Tools
  59. 59. Questions?
  60. 60. References •  Map Reduce docs –  http://docs.mongodb.org/manual/core/map-reduce/ •  Aggregation Framework –  Examples http://docs.mongodb.org/manual/applications/aggregation –  SQL Comparison http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/ •  Multi Threaded Map Reduce: http://edgystuff.tumblr.com/post/54709368492/how-to-speedup-mongodb-map-reduce-by-20x
  61. 61. Thanks! Achille Brighton Consulting Engineer, MongoDB
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×