Past, Present and Future of Data Processing in Apache Hadoop

MongoDB scales easily to store mass volumes of data. However, when it comes to making sense of it all, what options do you have? In this talk, we’ll take a look at three different ways of aggregating your data with MongoDB, and determine the reasons why you might choose one way over another. No matter what your big data needs are, you will find out how MongoDB, the big data store, is evolving to help make sense of your data.

  • IBM designed IMS with Rockwell and Caterpillar starting in 1966 for the Apollo program. IMS's challenge was to inventory the very large bill of materials (BOM) for the Saturn V moon rocket and Apollo space vehicle.
  • This is helpful because as much as 95% of enterprise information is unstructured, and doesn’t fit neatly into tidy rows and columns. NoSQL and Hadoop allow for dynamic schema.
  • The industry is talking about Hadoop and MongoDB for Big Data. So should you.
  • This is where MongoDB fits into the existing enterprise IT stack. MongoDB is an operational data store used for online data, in the same way that Oracle is an operational data store. It supports applications that ingest, store, manage and even analyze data in real-time. (Compared to Hadoop and data warehouses, which are used for offline, batch analytical workloads.)
  • Another common use case we see is warehousing of data. Again, the connector allows you to utilize existing libraries via Hadoop.
  • The third most common use case is an ETL (extract, transform, load) function, then putting the aggregated data into MongoDB for further analysis.

Past, Present and Future of Data Processing in Apache Hadoop Presentation Transcript

  • 1. Codemotion Milano 2013: Data Processing and Aggregation. Massimo Brignoli, Solutions Architect, MongoDB Inc. Except where otherwise noted, this work is licensed under http://creativecommons.org/licenses/by-nc-sa/3.0/
  • 2. Who Am I? • Solutions Architect/Evangelist at MongoDB Inc. • 20 years of experience in databases • Former MySQL employee • Previous life: web, web, web
  • 3. Big Data
  • 4. What is Big Data? • Big Data is like teenage sex: • everyone talks about it • nobody really knows how to do it • everyone thinks everyone else is doing it • so everyone claims they are doing it…
  • 5. Understanding Big Data – It’s Not Very “Big” • 64%: Ingest diverse, new data in real-time • 15%: More than 100TB of data • 20%: Less than 100TB (average of all? <20TB) • Source: Big Data Executive Summary – 50+ top executives from Government and F500 firms
  • 6. For over a decade, Big Data == Custom Software
  • 7. Lots of Great Innovations Since 1970
  • 8. Including the Relational Database
  • 9. RDBMS Makes Development Hard (diagram labels: Code, XML Config, DB Schema, Application, Object Relational Mapping, Relational Database)
  • 10. And Even Harder To Iterate (diagram labels: New Table, New Column, New Table, Name, Pet, Phone, New Column, 3 months later…, Email)
  • 11. From Complexity to Simplicity (labels: MongoDB, RDBMS)
    {
      _id : ObjectId("4c4ba5e5e8aabf3"),
      employee_name: "Dunham, Justin",
      department : "Marketing",
      title : "Product Manager, Web",
      report_up: "Neray, Graham",
      pay_band: "C",
      benefits : [
        { type : "Health", plan : "PPO Plus" },
        { type : "Dental", plan : "Standard" }
      ]
    }
  • 12. In the past few years, open source software has emerged, enabling the rest of us to handle Big Data
  • 13. Use Popular, Well-Known Technologies (Source: Silicon Angle, 2012)
  • 14. Enterprise Big Data Stack (diagram labels) • Applications: CRM, ERP, Collaboration, Mobile, BI • Data Management: Online Data (RDBMS, RDBMS), Offline Data (Hadoop, EDW) • Infrastructure: OS & Virtualization, Compute, Storage, Network • Security & Auditing • Management & Monitoring
  • 15. Consideration – Online vs. Offline • Online: real-time, low-latency, high availability • Offline: long-running, high-latency, availability is lower priority
  • 16. How MongoDB Meets Our Requirements • MongoDB is an operational database • MongoDB provides high performance for storage and retrieval at large scale • MongoDB has a robust query interface permitting intelligent operations • MongoDB is not a data processing engine, but provides processing functionality
  • 17. MongoDB data processing options (image: http://www.flickr.com/photos/torek/4444673930/)
  • 18. Getting Example Data
  • 19. The “hello world” of MapReduce is counting words in a paragraph of text. Let’s try something a little more interesting…
  • 20. What is the most popular pub name?
  • 21. Open Street Map Data
    #!/usr/bin/env python
    # Data Source
    # http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59]
    import re
    import sys
    from imposm.parser import OSMParser
    import pymongo

    # node_points and collection are used below but are not defined on the slide.
    class Handler(object):
        def nodes(self, nodes):
            if not nodes:
                return
            docs = []
            for node in nodes:
                osm_id, doc, (lon, lat) = node
                if "name" not in doc:
                    # Unnamed nodes are only remembered by position.
                    node_points[osm_id] = (lon, lat)
                    continue
                # Note: str.lstrip("The ") strips any leading 'T', 'h', 'e' or
                # space characters, not the literal prefix "The ".
                doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&")
                doc["_id"] = osm_id
                doc["location"] = {"type": "Point", "coordinates": [lon, lat]}
                docs.append(doc)
            collection.insert(docs)
  • 22. Example Pub Data
    {
      "_id" : 451152,
      "amenity" : "pub",
      "name" : "The Dignity",
      "addr:housenumber" : "363",
      "addr:street" : "Regents Park Road",
      "addr:city" : "London",
      "addr:postcode" : "N3 1DH",
      "toilets" : "yes",
      "toilets:access" : "customers",
      "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] }
    }
  • 23. MapReduce
  • 24. MongoDB MapReduce (diagram labels: MongoDB, map, reduce, finalize)
  • 25. Map Function
    > var map = function() {
        emit(this.name, 1);
      }
  • 26. Reduce Function
    > var reduce = function (key, values) {
        var sum = 0;
        values.forEach( function (val) {sum += val;} );
        return sum;
      }
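
The map and reduce functions on slides 25 and 26 can be mimicked in plain Python to see how the pieces fit together. This is only a sketch of the idea, not MongoDB's implementation: `map_reduce`, `map_pub`, `reduce_counts` and the sample documents are hypothetical names and data introduced here. It also illustrates one property worth knowing: MongoDB may call reduce more than once for the same key, so reduce has to accept its own output as input.

```python
from collections import defaultdict


def map_pub(doc):
    """Mirror of the slide's map(): emit (name, 1) for one document."""
    yield doc["name"], 1


def reduce_counts(key, values):
    """Mirror of the slide's reduce(): sum the values emitted for a key."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Toy driver: group emitted values by key, then reduce each group."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            emitted[key].append(value)
    # A key with a single emitted value is passed through without reduce,
    # which is also how MongoDB treats single-value keys.
    return {
        key: values[0] if len(values) == 1 else reducer(key, values)
        for key, values in emitted.items()
    }


# Hypothetical sample documents, not the OpenStreetMap data set.
pubs = [
    {"name": "The Red Lion"},
    {"name": "The Crown"},
    {"name": "The Red Lion"},
    {"name": "The Red Lion"},
]

counts = map_reduce(pubs, map_pub, reduce_counts)

# Re-reduce check: reducing a partial result together with a fresh value
# must give the same answer as reducing everything at once.
partial = reduce_counts("The Red Lion", [1, 1])
total = reduce_counts("The Red Lion", [partial, 1])
```

Because summing is associative, the partial and full reductions agree, which is what makes this reduce function safe to re-run.
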
  • 27. Results
    > db.pub_names.find().sort({value: -1}).limit(10)
    { "_id" : "The Red Lion", "value" : 407 }
    { "_id" : "The Royal Oak", "value" : 328 }
    { "_id" : "The Crown", "value" : 242 }
    { "_id" : "The White Hart", "value" : 214 }
    { "_id" : "The White Horse", "value" : 200 }
    { "_id" : "The New Inn", "value" : 187 }
    { "_id" : "The Plough", "value" : 185 }
    { "_id" : "The Rose & Crown", "value" : 164 }
    { "_id" : "The Wheatsheaf", "value" : 147 }
    { "_id" : "The Swan", "value" : 140 }
  • 29. Pub Names in the Center of London
    > db.pubs.mapReduce(map, reduce, {
        out: "pub_names",
        query: {
          location: {
            $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] }
          }
        }
      })
    {
      "result" : "pub_names",
      "timeMillis" : 116,
      "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 },
      "ok" : 1
    }
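
The `2 / 3959` in the `$centerSphere` query is a radius in radians: 2 miles divided by an Earth radius of roughly 3959 miles. The sketch below works through that arithmetic and applies a haversine test to two points. The helper names and the `nearby` point are assumptions made for illustration; the haversine formula is a stand-in for the idea, not a claim about how MongoDB evaluates the query internally.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # the divisor used in the slide's query


def miles_to_radians(miles):
    """Convert a surface distance in miles to a central angle in radians."""
    return miles / EARTH_RADIUS_MILES


def great_circle_radians(lon1, lat1, lon2, lat2):
    """Haversine central angle between two (lon, lat) points, in radians."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(a))


def within_center_sphere(point, center, radius_radians):
    """True if point lies inside the spherical cap around center."""
    return great_circle_radians(*point, *center) <= radius_radians


center = (-0.12, 51.516)            # [lon, lat] from the slide
radius = miles_to_radians(2)        # same value as 2 / 3959

nearby = (-0.13, 51.51)             # hypothetical point well under a mile away
dignity = (-0.1945732, 51.6008172)  # "The Dignity" from slide 22

nearby_inside = within_center_sphere(nearby, center, radius)
dignity_inside = within_center_sphere(dignity, center, radius)
```

Note the coordinate order: GeoJSON-style points are `[longitude, latitude]`, which is easy to get backwards.
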
  • 30. Results
    > db.pub_names.find().sort({value: -1}).limit(10)
    { "_id" : "All Bar One", "value" : 11 }
    { "_id" : "The Slug & Lettuce", "value" : 7 }
    { "_id" : "The Coach & Horses", "value" : 6 }
    { "_id" : "The Green Man", "value" : 5 }
    { "_id" : "The Kings Arms", "value" : 5 }
    { "_id" : "The Red Lion", "value" : 5 }
    { "_id" : "Corney & Barrow", "value" : 4 }
    { "_id" : "O'Neills", "value" : 4 }
    { "_id" : "Pitcher & Piano", "value" : 4 }
    { "_id" : "The Crown", "value" : 4 }
  • 31. MongoDB MapReduce • Real-time • Output directly to document or collection • Runs inside MongoDB on local data − Adds load to your DB − In JavaScript – debugging can be a challenge − Translating in and out of C++
  • 32. Aggregation Framework
  • 33. Aggregation Framework (diagram labels: MongoDB, op1, op2, opN)
  • 34. Aggregation Framework in 60 Seconds
  • 35. Aggregation Framework Operators • $project • $match • $limit • $skip • $sort • $unwind • $group
  • 36. $match • Filter documents • Uses existing query syntax • If using $geoNear it has to be first in pipeline • $where is not supported
  • 37. Matching Field Values
    Input:
    { "_id" : 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } }
    { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
    Stage:
    { "$match": { "name": "The Red Lion" }}
    Output:
    { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
  • 38. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
  • 39. Including and Excluding Fields
    Input:
    { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
    Stage:
    { "$project": { "_id": 0, "amenity": 1, "name": 1 }}
    Output:
    { "amenity" : "pub", "name" : "The Red Lion" }
  • 40. Reformatting Documents
    Input:
    { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] } }
    Stage:
    { "$project": { "_id": 0, "name": 1, "meta": { "type": "$amenity" } }}
    Output:
    { "name" : "The Red Lion", "meta" : { "type" : "pub" } }
  • 41. Dealing with Arrays
    Input:
    { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "facilities" : [ "toilets", "food" ] }
    Stages:
    { "$project": { "_id": 0, "name": 1, "facility": "$facilities" }}
    { "$unwind": "$facility" }
    Output:
    { "name" : "The Red Lion", "facility" : "toilets" }
    { "name" : "The Red Lion", "facility" : "food" }
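
The intent of the array slide is that unwinding produces one output document per array element. A rough Python equivalent is sketched below; `unwind` is a hypothetical helper written for this note, and it simplifies the real `$unwind` stage (it silently skips documents whose field is missing or empty and does not model error cases).

```python
def unwind(docs, field):
    """Yield one copy of each document per element of doc[field]."""
    for doc in docs:
        values = doc.get(field)
        if not values:
            continue  # missing or empty array: no output for this document
        for value in values:
            out = dict(doc)     # shallow copy so the input is left untouched
            out[field] = value  # replace the array with a single element
            yield out


pubs = [
    {"_id": 271466, "amenity": "pub", "name": "The Red Lion",
     "facilities": ["toilets", "food"]},
    {"_id": 271421, "amenity": "pub", "name": "Sir Walter Tyrrell"},
]

# Keep only the name and rename the array to "facility", then unwind it.
projected = [
    {"name": d["name"], "facility": d.get("facilities")} for d in pubs
]
unwound = list(unwind(projected, "facility"))
```

The second pub has no `facilities` array, so it contributes nothing to the unwound output.
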
  • 42. $group • Group documents by an ID • Field reference, object, constant • Other output fields are computed: $max, $min, $avg, $sum, $addToSet, $push, $first, $last • Processes all data in memory
  • 43. Back to the pub! • http://www.offwestend.com/index.php/theatres/pastshows/71
  • 44. Popular Pub Names
    > var popular_pub_names = [
        { $match : { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } } } },
        { $group : { _id: "$name", value: {$sum: 1} } },
        { $sort : {value: -1} },
        { $limit : 10 }
      ]
  • 45. Results
    > db.pubs.aggregate(popular_pub_names)
    {
      "result" : [
        { "_id" : "All Bar One", "value" : 11 },
        { "_id" : "The Slug & Lettuce", "value" : 7 },
        { "_id" : "The Coach & Horses", "value" : 6 },
        { "_id" : "The Green Man", "value" : 5 },
        { "_id" : "The Kings Arms", "value" : 5 },
        { "_id" : "The Red Lion", "value" : 5 },
        { "_id" : "Corney & Barrow", "value" : 4 },
        { "_id" : "O'Neills", "value" : 4 },
        { "_id" : "Pitcher & Piano", "value" : 4 },
        { "_id" : "The Crown", "value" : 4 }
      ],
      "ok" : 1
    }
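
The pipeline idea (op1, op2, ... opN, each stage consuming the previous stage's output) can be sketched in plain Python as a chain of functions. Everything here is illustrative: the stage helpers, the `central` flag that stands in for the geo `$match`, and the sample documents are assumptions, and the code says nothing about how MongoDB executes a pipeline internally.

```python
from collections import Counter
from functools import reduce as fold


def match(predicate):
    """Stage: keep only documents for which predicate(doc) is true."""
    return lambda docs: (d for d in docs if predicate(d))


def group_count(field):
    """Stage: like {$group: {_id: "$field", value: {$sum: 1}}}."""
    def stage(docs):
        counts = Counter(d[field] for d in docs)
        return ({"_id": key, "value": n} for key, n in counts.items())
    return stage


def sort_desc(field):
    """Stage: like {$sort: {field: -1}}."""
    return lambda docs: sorted(docs, key=lambda d: d[field], reverse=True)


def limit(n):
    """Stage: like {$limit: n}."""
    return lambda docs: list(docs)[:n]


def aggregate(docs, pipeline):
    """Feed docs through each stage in order."""
    return list(fold(lambda acc, stage: stage(acc), pipeline, docs))


# Hypothetical data; "central" stands in for the $centerSphere filter.
pubs = [
    {"name": "All Bar One", "central": True},
    {"name": "The Crown", "central": True},
    {"name": "All Bar One", "central": True},
    {"name": "The Red Lion", "central": False},
    {"name": "The Crown", "central": True},
    {"name": "All Bar One", "central": True},
]

popular_pub_names = [
    match(lambda d: d["central"]),
    group_count("name"),
    sort_desc("value"),
    limit(10),
]

result = aggregate(pubs, popular_pub_names)
```

Putting the filter first mirrors the slide's pipeline: fewer documents reach the grouping stage, which is the stage that has to hold its working set in memory.
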
  • 46. Aggregation Framework Benefits • Real-time • Simple yet powerful interface • Declared in JSON, executes in C++ • Runs inside MongoDB on local data − Adds load to your DB − Limited Operators − Data output is limited
  • 47. Analyzing MongoDB Data in External Systems
  • 48. MongoDB with Hadoop (diagram label: MongoDB)
  • 49. MongoDB with Hadoop (diagram labels: MongoDB, warehouse)
  • 50. MongoDB with Hadoop (diagram labels: ETL, MongoDB)
  • 51. Map Pub Names in Python
    #!/usr/bin/env python
    import sys
    from pymongo_hadoop import BSONMapper

    # get_bounds() and get_geo() are helpers that are not shown on the slide.
    def mapper(documents):
        bounds = get_bounds()  # ~2 mile polygon
        for doc in documents:
            geo = get_geo(doc["location"])  # Convert the geo type
            if not geo:
                continue
            if bounds.intersects(geo):
                yield {'_id': doc['name'], 'count': 1}

    BSONMapper(mapper)
    sys.stderr.write("Done Mapping.\n")
  • 52. Reduce Pub Names in Python
    #!/usr/bin/env python
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        # Sum the per-document counts emitted by the mapper for this pub name.
        _count = 0
        for v in values:
            _count += v['count']
        return {'_id': key, 'value': _count}

    BSONReducer(reducer)
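
Between the mapper and the reducer, Hadoop sorts and groups the mapper output by key. A small local harness can imitate that step so the two functions can be tried without a cluster. This is a sketch under stated assumptions: it does not use `pymongo_hadoop`, it drops the geo filter from slide 51, and `run_locally` and the sample documents are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(documents):
    """Simplified mapper: one count per pub name (geo filter omitted)."""
    for doc in documents:
        yield {"_id": doc["name"], "count": 1}


def reducer(key, values):
    """Same shape as the slide's reducer: sum the counts for one key."""
    return {"_id": key, "value": sum(v["count"] for v in values)}


def run_locally(documents):
    """Imitate the shuffle: sort mapper output by key, then reduce per key."""
    mapped = sorted(mapper(documents), key=itemgetter("_id"))
    return [
        reducer(key, list(group))
        for key, group in groupby(mapped, key=itemgetter("_id"))
    ]


# Hypothetical sample documents.
docs = [
    {"name": "The Crown"},
    {"name": "All Bar One"},
    {"name": "The Crown"},
]

pub_counts = run_locally(docs)
```

`groupby` only merges adjacent items, which is why the sort comes first; Hadoop's shuffle phase plays the same role at scale.
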
  • 53. Execute MapReduce
    hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \
      -mapper examples/pub/map.py \
      -reducer examples/pub/reduce.py \
      -mongo mongodb://127.0.0.1/demo.pubs \
      -outputURI mongodb://127.0.0.1/demo.pub_names
  • 54. Popular Pub Names Nearby
    > db.pub_names.find().sort({value: -1}).limit(10)
    { "_id" : "All Bar One", "value" : 11 }
    { "_id" : "The Slug & Lettuce", "value" : 7 }
    { "_id" : "The Coach & Horses", "value" : 6 }
    { "_id" : "The Kings Arms", "value" : 5 }
    { "_id" : "Corney & Barrow", "value" : 4 }
    { "_id" : "O'Neills", "value" : 4 }
    { "_id" : "Pitcher & Piano", "value" : 4 }
    { "_id" : "The Crown", "value" : 4 }
    { "_id" : "The George", "value" : 4 }
    { "_id" : "The Green Man", "value" : 4 }
  • 55. MongoDB and Hadoop • Away from data store • Can leverage existing data processing infrastructure • Can horizontally scale your data processing − Offline batch processing − Requires synchronisation between store & processor − Infrastructure is much more complex
  • 56. The Future of Big Data and MongoDB
  • 57. What is Big Data? Big Data today will be normal tomorrow
  • 58. Exponential Data Growth: Billions of URLs indexed by Google (chart: y-axis 0 to 10000, x-axis 2000 to 2012)
  • 59. MongoDB enables you to scale big
  • 60. MongoDB is evolving so you can process the big
  • 61. Data Processing with MongoDB • Process in MongoDB using Map/Reduce • Process in MongoDB using Aggregation Framework • Process outside MongoDB using Hadoop and other external tools
  • 62. Questions?
  • 63. Codemotion Milano. Thanks! Massimo Brignoli, Solutions Architect, MongoDB Inc.