Data Processing and Aggregation with MongoDB

  1. 1. Data Processing and Aggregation Massimo Brignoli, Senior Solutions Architect, MongoDB Inc. massimo@mongodb.com @massimobrignoli
  2. 2. Who am I? •  Solutions Architect/Evangelist at MongoDB Inc. •  24 years of experience in databases and software development •  Former employee of MySQL and MariaDB •  Before that: web, web, web
  3. 3. Big Data
  4. 4. Innovation
  5. 5. Understanding Big Data – It’s Not Very “Big” From the Big Data Executive Summary (50+ top executives from Government and F500 firms): 64% ingest diverse, new data in real-time; 15% have more than 100TB of data; 20% have less than 100TB (average of all: <20TB)
  6. 6. “I have not failed. I've just found 10,000 ways that won't work.” ― Thomas A. Edison
  7. 7. Many great innovations since 1970…
  8. 8. But would you use any of these technologies to launch a new business today?
  9. 9. Including the relational data model!
  10. 10. Which computers was the relational model designed for?
  11. 11. These were the computers!
  12. 12. And what about storage?
  13. 13. And how was software developed? [timeline figure, 1950–2000: development processes evolving from the first attempts at “order” in development through waterfall/V-model, incremental/spiral, to agile methodologies; languages evolving from assembly to high-level, structured, object-oriented and dynamic languages]
  14. 14. RDBMS Makes Development Hard [diagram: Relational Database, Object Relational Mapping, Application Code, XML Config, DB Schema]
  15. 15. And Even Harder to Evolve It… [diagram: new tables and new columns (Name, Pet, Phone, Email) appearing 3 months later…]
  16. 16. RDBMS: From Complexity to Simplicity… MongoDB { _id : ObjectId("4c4ba5e5e8aabf3"), employee_name: "Dunham, Justin", department : "Marketing", title : "Product Manager, Web", report_up: "Neray, Graham", pay_band: "C", benefits : [ { type : "Health", plan : "PPO Plus" }, { type : "Dental", plan : "Standard" } ] }
  17. 17. What is a Record?
  18. 18. Key → Value •  One-dimensional storage •  The single value is a blob •  Queries are by key only •  No schema •  Values cannot be updated, only overwritten [diagram: Key → Blob]
  19. 19. Relational •  Two-dimensional storage (tuples) •  Each field holds a single value •  Queries on any field •  Highly structured schema (tables) •  In-place updates •  Normalization requires many tables and indexes, with poor data locality [diagram: Primary Key]
  20. 20. Document •  N-dimensional storage •  Each field can hold 0, 1, many or embedded values •  Queries on all fields and at all levels •  Dynamic schema •  In-place updates •  Embedding data improves data locality, requires fewer indexes and gives better performance [diagram: _id]
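As a concrete illustration of the document-model bullets above, here is a minimal mongo shell sketch. It assumes the employee document from slide 16 is stored in a hypothetical db.employees collection; the collection and field names are illustrative.

       > db.employees.insert({
           employee_name: "Dunham, Justin",
           department: "Marketing",
           benefits: [ { type: "Health", plan: "PPO Plus" },
                       { type: "Dental", plan: "Standard" } ]
         })
       // Query reaches inside the embedded benefits array – no join needed
       > db.employees.find({ "benefits.type": "Health" })
       // In-place update: push a new embedded value onto the array
       > db.employees.update({ employee_name: "Dunham, Justin" },
                             { $push: { benefits: { type: "Vision", plan: "Basic" } } })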
  21. 21. For over a decade Big Data == Custom Software
  22. 22. In the past few years Open source software has emerged enabling the rest of us to handle Big Data
  23. 23. How MongoDB Meets Our Requirements •  MongoDB is an operational database •  MongoDB provides high performance for storage and retrieval at large scale •  MongoDB has a robust query interface permitting intelligent operations •  MongoDB is not a data processing engine, but provides processing functionality
  24. 24. MongoDB data processing options (image: http://www.flickr.com/photos/torek/4444673930/)
  25. 25. Getting Example Data
  26. 26. The “hello world” of MapReduce is counting words in a paragraph of text. Let’s try something a little more interesting…
  27. 27. What is the most popular pub name?
  28. 28. Open Street Map Data
       #!/usr/bin/env python
       # Data Source
       # http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
       import re
       import sys
       from imposm.parser import OSMParser
       import pymongo

       # "collection" (a pymongo collection) and "node_points" (a dict of
       # un-named nodes) are assumed to be set up elsewhere in the full script.
       class Handler(object):
           def nodes(self, nodes):
               if not nodes:
                   return
               docs = []
               for node in nodes:
                   osm_id, doc, (lon, lat) = node
                   if "name" not in doc:
                       node_points[osm_id] = (lon, lat)
                       continue
                   doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
                   doc["_id"] = osm_id
                   doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
                   docs.append(doc)
               collection.insert(docs)
  29. 29. { "_id" : 451152, "amenity" : "pub", "name" : "The Dignity", "addr:housenumber" : "363", "addr:street" : "Regents Park Road", "addr:city" : "London", "addr:postcode" : "N3 1DH", "toilets" : "yes", "toilets:access" : "customers", "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] } } Example Pub Data
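The geo queries used later in the deck ($centerSphere here, $geoNear in the aggregation framework) run against this GeoJSON location field and, in practice, rely on a geospatial index. A minimal shell sketch of that setup, reusing the deck’s pubs collection; the find() at the end is just a sanity check:

       // 2dsphere is the index type for GeoJSON points (MongoDB 2.4+)
       > db.pubs.ensureIndex({ location: "2dsphere" })
       // $centerSphere radii are in radians: 2 / 3959 ≈ 2 miles / Earth's radius in miles
       > db.pubs.find({
           location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } }
         }).count()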
  30. 30. MongoDB MapReduce [diagram: MongoDB → map → reduce → finalize]
  31. 31. Map Function > var map = function() { emit(this.name, 1); }
  32. 32. Reduce Function > var reduce = function (key, values) { var sum = 0; values.forEach( function (val) {sum += val;} ); return sum; }
  33. 33. Results > db.pubs.mapReduce(map, reduce, { out: "pub_names", query: { } } ) > db.pub_names.find().sort({value: -1}).limit(10) { "_id" : "The Red Lion", "value" : 407 } { "_id" : "The Royal Oak", "value" : 328 } { "_id" : "The Crown", "value" : 242 } { "_id" : "The White Hart", "value" : 214 } { "_id" : "The White Horse", "value" : 200 } { "_id" : "The New Inn", "value" : 187 } { "_id" : "The Plough", "value" : 185 } { "_id" : "The Rose & Crown", "value" : 164 } { "_id" : "The Wheatsheaf", "value" : 147 } { "_id" : "The Swan", "value" : 140 }
  34. 34. > db.pubs.mapReduce(map, reduce, { out: "pub_names", query: { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } }} }) { "result" : "pub_names", "timeMillis" : 116, "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 }, "ok" : 1, } Pub Names in the Center of London
  35. 35. > db.pub_names.find().sort({value: -1}).limit(10) { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Green Man", "value" : 5 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "The Red Lion", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 } Results
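The slide 30 diagram also mentions an optional finalize step, which the deck does not show. A hedged sketch of what it could look like for this example: finalize runs once per key after reduce and can post-process the reduced value (the wrapping shown here is purely illustrative).

       > var finalize = function (key, reducedValue) {
           return { count: reducedValue };
         }
       > db.pubs.mapReduce(map, reduce, { out: "pub_names", finalize: finalize })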
  36. 36. MongoDB MapReduce •  Real-time •  Output directly to document or collection •  Runs inside MongoDB on local data − Adds load to your DB − In JavaScript – debugging can be a challenge − Translating in and out of C++
  37. 37. Aggregation Framework
  38. 38. Aggregation Framework [diagram: MongoDB → op1 → op2 → … → opN]
  39. 39. Aggregation Framework in 60 Seconds
  40. 40. Aggregation Framework Operators •  $project •  $match •  $limit •  $skip •  $sort •  $unwind •  $group
  41. 41. $match •  Filter documents •  Uses existing query syntax •  If using $geoNear it has to be first in pipeline •  $where is not supported
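Since the deck does not show $geoNear itself, here is a hedged sketch of the note above: as an aggregation stage it must be the first in the pipeline, and with a 2dsphere index it annotates each document with its distance from the query point. The distanceField name ("dist") and the limit are illustrative choices, not from the deck.

       > db.pubs.aggregate([
           { $geoNear: { near: { type: "Point", coordinates: [-0.12, 51.516] },
                         distanceField: "dist",
                         spherical: true,
                         limit: 5 } },
           { $project: { _id: 0, name: 1, dist: 1 } }
         ])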
  42. 42. Matching Field Values
       Input:
       { "_id" : 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } }
       { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
       Stage:
       { "$match": { "name": "The Red Lion" }}
       Output:
       { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
  43. 43. $project •  Reshape documents •  Include, exclude or rename fields •  Inject computed fields •  Create sub-document fields
  44. 44. Including and Excluding Fields
       Input:
       { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
       Stage:
       { "$project": { "_id": 0, "amenity": 1, "name": 1 }}
       Output:
       { "amenity" : "pub", "name" : "The Red Lion" }
  45. 45. Reformatting Documents
       Input:
       { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
       Stage:
       { "$project": { "_id": 0, "name": 1, "meta": { "type": "$amenity" } }}
       Output:
       { "name" : "The Red Lion", "meta" : { "type" : "pub" } }
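Slide 43 also lists computed fields, which neither $project example shows. A minimal sketch using an expression operator available in this era of the aggregation framework ($toUpper); the output field name is illustrative:

       > db.pubs.aggregate([
           { $match: { name: "The Red Lion" } },
           { $limit: 1 },
           { $project: { _id: 0, name: 1, upper_name: { $toUpper: "$name" } } }
         ])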
  46. 46. $group •  Group documents by an ID •  Field reference, object, constant •  Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last •  Processes all data in memory
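$unwind appears in the operator list on slide 40 but is never demonstrated; it pairs naturally with $group, turning each element of an array into its own document. A hedged sketch against the hypothetical db.employees collection holding documents shaped like slide 16:

       > db.employees.aggregate([
           { $unwind: "$benefits" },
           { $group: { _id: "$benefits.type",
                       plans: { $addToSet: "$benefits.plan" },
                       employees: { $sum: 1 } } }
         ])

This groups the unwound benefit sub-documents by type, collecting the distinct plans and counting how many employees carry each benefit.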
  47. 47. Back to the pub! •  http://www.offwestend.com/index.php/theatres/pastshows/71
  48. 48. Popular Pub Names
       > var popular_pub_names = [
           { $match : { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } } } },
           { $group : { _id: "$name", value: { $sum: 1 } } },
           { $sort : { value: -1 } },
           { $limit : 10 }
         ]
  49. 49. > db.pubs.aggregate(popular_pub_names) { "result" : [ { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Green Man", "value" : 5 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "The Red Lion", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 } ], "ok" : 1 } Results
  50. 50. Aggregation Framework Benefits •  Real-time •  Simple yet powerful interface •  Declared in JSON, executes in C++ •  Runs inside MongoDB on local data − Adds load to your DB − Limited operators − Data output is limited
  51. 51. Analyzing MongoDB Data in External Systems
  52. 52. MongoDB with Hadoop [architecture diagram: MongoDB]
  53. 53. MongoDB with Hadoop [architecture diagram: MongoDB and a data warehouse]
  54. 54. MongoDB with Hadoop [architecture diagram: MongoDB and ETL]
  55. 55. Map Pub Names in Python
       #!/usr/bin/env python
       import sys
       from pymongo_hadoop import BSONMapper

       def mapper(documents):
           # get_bounds() and get_geo() are helpers assumed to be defined
           # elsewhere in the full example
           bounds = get_bounds()  # ~2 mile polygon
           for doc in documents:
               geo = get_geo(doc["location"])  # Convert the geo type
               if not geo:
                   continue
               if bounds.intersects(geo):
                   yield {'_id': doc['name'], 'count': 1}

       BSONMapper(mapper)
       print >> sys.stderr, "Done Mapping."
  56. 56. Reduce Pub Names in Python
       #!/usr/bin/env python
       from pymongo_hadoop import BSONReducer

       def reducer(key, values):
           _count = 0
           for v in values:
               _count += v['count']
           return {'_id': key, 'value': _count}

       BSONReducer(reducer)
  57. 57. Execute MapReduce hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar -mapper examples/pub/map.py -reducer examples/pub/reduce.py -mongo mongodb://127.0.0.1/demo.pubs -outputURI mongodb://127.0.0.1/demo.pub_names
  58. 58. > db.pub_names.find().sort({value: -1}).limit(10) { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 } { "_id" : "The George", "value" : 4 } { "_id" : "The Green Man", "value" : 4 } Popular Pub Names Nearby
  59. 59. MongoDB and Hadoop •  Processing runs away from the data store •  Can leverage existing data processing infrastructure •  Can horizontally scale your data processing -  Offline batch processing -  Requires synchronisation between store & processor -  Infrastructure is much more complex
  60. 60. The Future of Big Data and MongoDB
  61. 61. What is Big Data? Big Data today will be normal tomorrow
  62. 62. Exponential Data Growth [chart: billions of URLs indexed by Google, 2000–2012]
  63. 63. MongoDB enables you to scale big
  64. 64. MongoDB is evolving so you can process the big
  65. 65. Data Processing with MongoDB •  Process in MongoDB using Map/Reduce •  Process in MongoDB using Aggregation Framework •  Process outside MongoDB using Hadoop and other external tools
  66. 66. MongoDB Integration •  Hadoop https://github.com/mongodb/mongo-hadoop •  Storm https://github.com/christkv/mongo-storm •  Disco https://github.com/mongodb/mongo-disco •  Spark Coming soon!
  67. 67. Questions?
  68. 68. Thanks! massimo@mongodb.com Massimo Brignoli @massimobrignoli