How to leverage MongoDB for Big Data Analysis and Operations with MongoDB's Aggregation Framework and Map Reduce

Presented on 28th June 2013 at the Jaspersoft Big Data event in Dublin

http://www.jaspersoft.com/event/big-data-analysis-made-easy

Speaker notes
  • I'm going to talk about how to leverage MongoDB for big data; I hope you end up learning about MongoDB.
  • Start by saying: "I want to start by asking a question: what is it?"
  • Google, Amazon, and Facebook built custom tools to handle massive amounts of data. MongoDB led an open-source movement to provide a viable alternative to proprietary solutions for handling big data. It's just data! Don't panic.
  • We will be demonstrating how to do each of these today and discuss why and when you would use each.
  • Charts, trends, insights.
  • Traackr: social media. Intuit: small business, personal finance, and tax software.
  • Not only data size but also the rate at which the data comes in, for example Twitter. What is the tolerable delay? How complex is the processing of the data?
  • I often think of map/reduce as the Marmite of MongoDB: people either love it or hate it. For that very reason we've produced the Aggregation Framework in 2.2, and it's only getting better in 2.4!
  • $project, $match, $unwind, $group; $limit, $skip, $sort. No JavaScript code. $out and more operators coming soon.
  • The original aggregation utility in MongoDB. Simplified view: from MongoDB's C++ runtime to the JS runtime. 1) You create a map function. 2) Map emits results; MongoDB then groups and sorts them. 3) The grouped values are passed to reduce. 4) Finalize is optional. Then back to the C++ runtime.
  • Summarise by hour and save that in a collection.
  • Map and reduce need to return the same shape of object, because reduce can be run again over its own output.
  • V8 in 2.4 and multithreaded.
  • Jobs have higher latency.
  • The MongoDB Hadoop adapter allows you to stream data into and out of Hadoop, so you can scale data processing across many machines for batch processing.
  • Another common use case we see is data warehousing; again, the connector allows you to utilise existing libraries via Hadoop.
  • The third most common use case is an ETL (extract, transform, load) function, then putting the aggregated data into MongoDB for further analysis.
  • Google, Amazon, and Facebook built custom tools to handle massive amounts of data. MongoDB led an open-source movement to provide a viable alternative to proprietary solutions for handling big data.
  • Horizontally scale out, providing sharding tools out of the box.
  • Horizontally scale out, providing sharding tools out of the box.
  • Our next challenge is helping you make sense of your data
  • Map/Reduce allows complex programmable aggregations. The Aggregation Framework gives easy and simple access to aggregation. Hadoop is the start of our integration with external tools. Storm is a distributed and fault-tolerant realtime computation system, used by Twitter, Groupon, etc., for more flexible, incremental processing. Disco is an open-source implementation of the Map-Reduce framework for distributed computing, developed by Nokia Research Center.
  • Meetup; education.10gen.com

    1. 1. Technical Support Engineer, 10gen Gianfranco Palumbo #bigdatajaspersoft How to leverage MongoDB for Big Data Analysis and Operations @MongoDBDublin
    2. 2. Join us this evening at Dublin MUG meetup.com/DublinMUG/
    3. 3. Big Data
    4. 4. http://www.worldwidewebsize.com/ Exponential Data Growth
    5. 5. MongoDB solves our needs • Ideal operational database • Provides high performance for storage and retrieval at large scale • Has a robust query interface permitting intelligent operations • Is not a data processing engine, but provides processing functionality
    6. 6. Data Processing in MongoDB • Process in MongoDB using Map/Reduce • Process in MongoDB using Aggregation Framework • Process outside MongoDB using Hadoop and other external tools
    7. 7. The goal Real Time Analytics Engine fed by multiple data sources
    8. 8. Sample Customers
    9. 9. Solution goals • High write volume: lots of data sources, lots of data from each source • Dynamic queries: users can drill down into data • Fast queries: lots of clients, high request rate • Minimize delay between collection & query: how long before an event appears in a report?
    10. 10. System architecture
    11. 11. Systems Architecture • Data sources spread writes over multiple shards • Asynchronous writes • Upserts avoid unnecessary reads • Writes buffered in RAM and flushed to disk in bulk
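    A minimal mongo shell sketch of this write path, assuming a hypothetical pre-aggregated hits collection keyed by path and hour: the upsert with $inc avoids a read before the write, and issuing it with an unacknowledged (w: 0) write concern from the driver gives the fire-and-forget behaviour described above.
        // Hypothetical pre-aggregated counter: one document per (path, hour).
        // The upsert creates the document on the first hit and increments it afterwards,
        // so no prior read is needed.
        db.hits.update(
          { path: '/index.html', hour: ISODate('2013-06-10T20:00:00Z') },
          { $inc: { count: 1 } },
          { upsert: true }
        )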
    12. 12. Simple log storage Design Pattern
    13. 13. Sample data Original Event Data 127.0.0.1 - frank [10/Jun/2013:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_7_4; en-US)" As JSON doc = { _id: ObjectId('4f442120eb03305789000000'), host: "127.0.0.1", time: ISODate("2013-06-10T20:55:36Z"), path: "/apache_pb.gif", referer: "http://www.example.com/start.html", user_agent: "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_7_4; en-US)" } Insert to MongoDB db.logs.insert( doc )
    14. 14. Dynamic Queries Find all logs for a URL db.logs.find( { 'path' : '/index.html' } ) Find all logs for a time range db.logs.find( { 'time' : { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } ) Find all logs for a host over a range of dates db.logs.find( { 'host' : '127.0.0.1', 'time' : { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } )
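    For the fast-queries goal, these access patterns would normally be backed by indexes; a sketch using the 2.4-era ensureIndex helper, assuming the logs collection above (the compound keys are illustrative, not from the slides):
        // Hypothetical indexes supporting the queries above
        db.logs.ensureIndex( { path: 1, time: 1 } )   // URL lookups, optionally bounded by time
        db.logs.ensureIndex( { host: 1, time: 1 } )   // per-host queries over a date range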
    15. 15. Aggregation Framework
    16. 16. MongoDB Aggregation Framework
    17. 17. Aggregation Framework Requests per day by URL db.logs.aggregate( [ { '$match': { 'time': { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } }, { '$project': { 'path': 1, 'date': { 'y': { '$year': '$time' }, 'm': { '$month': '$time' }, 'd': { '$dayOfMonth': '$time' } } } }, { '$group': { '_id': { 'p': '$path', 'y': '$date.y', 'm': '$date.m', 'd': '$date.d' }, 'hits': { '$sum': 1 } } }, ])
    18. 18. Aggregation Framework { 'ok': 1, 'result': [ { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 1 }, 'hits': 124 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 2 }, 'hits': 245 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 3 }, 'hits': 322 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 4 }, 'hits': 175 }, { '_id': { 'p': '/index.html', 'y': 2013, 'm': 1, 'd': 5 }, 'hits': 94 } ] }
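    The speaker notes also mention $sort, $skip and $limit (with $out to follow in a later release); a sketch appending two stages to the same pipeline so that only the ten busiest path/day combinations are returned:
        // Same pipeline as slide 17, ordered by hit count and capped at ten results
        db.logs.aggregate( [
          { '$match': { 'time': { '$gte': new Date(2013, 0), '$lt': new Date(2013, 1) } } },
          { '$project': { 'path': 1, 'date': { 'y': { '$year': '$time' }, 'm': { '$month': '$time' }, 'd': { '$dayOfMonth': '$time' } } } },
          { '$group': { '_id': { 'p': '$path', 'y': '$date.y', 'm': '$date.m', 'd': '$date.d' }, 'hits': { '$sum': 1 } } },
          { '$sort': { 'hits': -1 } },   // busiest first
          { '$limit': 10 }               // cap the result size
        ] )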
    19. 19. Aggregation Framework Benefits • Real-time • Simple yet powerful interface • Declared in JSON, executes in C++ • Runs inside MongoDB on local data • Adds load to your DB • Limited in how much data it can return
    20. 20. Roll-ups with map- reduce Design Pattern
    21. 21. MongoDB Map/Reduce
    22. 22. Map Reduce – Map Phase Generate hourly rollups from log data var map = function() { var key = { p: this.path, d: new Date( this.time.getFullYear(), this.time.getMonth(), this.time.getDate(), this.time.getHours(), 0, 0, 0) }; emit( key, { hits: 1 } ); }
    23. 23. Map Reduce – Reduce Phase Generate hourly rollups from log data var reduce = function(key, values) { var r = { hits: 0 }; values.forEach(function(v) { r.hits += v.hits; }); return r; }
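    The notes call for summarising by hour and saving the result in a collection; a sketch of running the two functions above, assuming a hypothetical output collection named hourly_hits:
        // Run the rollup over January's logs and store the output in a collection.
        // reduce returns the same { hits: ... } shape that map emits, because it
        // can be re-run over its own partial results.
        db.logs.mapReduce( map, reduce, {
          query: { time: { $gte: new Date(2013, 0), $lt: new Date(2013, 1) } },
          out: 'hourly_hits'   // hypothetical collection name
        } )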
    24. 24. MongoDB Map/Reduce • Real-time • Output directly to document or collection • Runs inside MongoDB on local data • V8 engine • Adds load to your DB • In JavaScript
    25. 25. Integrations
    26. 26. REPORTING Charting
    27. 27. APACHE HADOOP Log Aggregation with MongoDB as sink More complex aggregations or integration with tools like Mahout
    28. 28. MongoDB with Hadoop
    29. 29. MongoDB with Hadoop
    30. 30. MongoDB with Hadoop
    31. 31. MongoDB and Hadoop • Away from data store • Can leverage existing data processing infrastructure • Can horizontally scale your data processing • Offline batch processing • Requires synchronization between store & processor • Infrastructure is much more complex
    32. 32. The Future of Big Data and MongoDB
    33. 33. What is Big? Big today is normal tomorrow
    34. 34. http://www.worldwidewebsize.com/ Big is only getting bigger
    35. 35. IBM - http://www-01.ibm.com/software/data/bigdata/ 90% of the data in the world today has been created in the last two years
    36. 36. MongoDB enables you to scale to the redefinition of BIG
    37. 37. MongoDB is evolving to enable you to process the new BIG
    38. 38. Gianfranco Palumbo – slides tweeted from @MongoDBDublin MongoDB is committed to working with the best data processing tools • Map Reduce • Aggregation Framework • Hadoop adapter – docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/ • Storm – github.com/christkv/mongo-storm • Disco – github.com/mongodb/mongo-disco • Spark (coming soon)
    39. 39. Technical Support Engineer, 10gen Gianfranco Palumbo #bigdatajaspersoft Thank you @MongoDBDublin
