Aggregation Framework

Transcript

  • 1. #mongodbdays Aggregation Framework Emily Stolfo Ruby Engineer/Evangelist, 10gen @EmStolfo Tuesday, January 29, 13
  • 2. Agenda • State of Aggregation • Pipeline • Usage and Limitations • Optimization • Sharding • (Expressions) • Looking Ahead
  • 3. State of Aggregation
  • 4. State of Aggregation • We're storing our data in MongoDB • We need to do ad-hoc reporting, grouping, common aggregations, etc. • What are we using for this?
  • 5. Data Warehousing
  • 6. Data Warehousing • SQL for reporting and analytics • Infrastructure complications – Additional maintenance – Data duplication – ETL processes – Real time?
  • 7. MapReduce
  • 8. MapReduce • Extremely versatile, powerful • Intended for complex data analysis • Overkill for simple aggregation tasks, such as – Averages – Summation – Grouping
  • 9. MapReduce in MongoDB • Implemented with JavaScript – Single-threaded – Difficult to debug • Concurrency – Appearance of parallelism – Write locks
  • 10. Aggregation Framework
  • 11. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple – Operation pipeline – Computational expressions • Works well with sharding
  • 12. Enabling Developers • Doing more within MongoDB, faster • Refactoring MapReduce and groupings – Replace pages of JavaScript – Longer aggregation pipelines • Quick aggregations from the shell
  • 13. Pipeline
  • 14. Pipeline • Process a stream of documents – Original input is a collection – Final output is a result document • Series of operators – Filter or transform data – Input/output chain, analogous to a Unix shell pipeline:
      ps ax | grep mongod | head -n 1
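The input/output chain described on this slide can be sketched in plain JavaScript. This is only an illustration of the idea, not how MongoDB executes a pipeline: `runPipeline`, `onlyEnglish`, and `titlesOnly` are hypothetical names, and each stage is modelled as a function from an array of documents to a new array.

```javascript
// Hypothetical helper: feed the documents through each stage in order,
// so the output of one stage becomes the input of the next.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

// Two illustrative stages (plain functions, not MongoDB operators).
const onlyEnglish = (docs) => docs.filter((d) => d.language === "English");
const titlesOnly = (docs) => docs.map((d) => ({ title: d.title }));

const pipelineInput = [
  { title: "The Great Gatsby", language: "English" },
  { title: "War and Peace", language: "Russian" },
  { title: "Atlas Shrugged", language: "English" },
];

const pipelineResult = runPipeline(pipelineInput, [onlyEnglish, titlesOnly]);
console.log(pipelineResult);
```

As with the shell analogy, stage order matters: filtering first means the later stage handles fewer documents.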
  • 15. Pipeline Operators • $match • $sort • $project • $limit • $group • $skip • $unwind
  • 16. Example book data { _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
  • 17. $match • Filter documents • Uses existing query syntax • (No geospatial operations or $where)
  • 18. Matching Field Values
      Stage:  { $match: { language: "Russian" }}
      Input:  { title: "The Great Gatsby", pages: 218, language: "English" }
              { title: "War and Peace", pages: 1440, language: "Russian" }
              { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Output: { title: "War and Peace", pages: 1440, language: "Russian" }
  • 19. Matching with Query Operators
      Stage:  { $match: { pages: { $gt: 1000 } }}
      Input:  { title: "The Great Gatsby", pages: 218, language: "English" }
              { title: "War and Peace", pages: 1440, language: "Russian" }
              { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Output: { title: "War and Peace", pages: 1440, language: "Russian" }
              { title: "Atlas Shrugged", pages: 1088, language: "English" }
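The two $match slides can be imitated with a small in-memory filter. This is a minimal sketch that assumes only equality and the four range operators are needed; `matches` and `matchStage` are hypothetical helpers, not part of MongoDB or any driver.

```javascript
// Return true when doc satisfies every field condition in query.
// Supports plain equality and $gt/$gte/$lt/$lte only (an assumption of this sketch).
function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    const value = doc[field];
    if (condition !== null && typeof condition === "object") {
      return Object.entries(condition).every(([op, operand]) => {
        switch (op) {
          case "$gt": return value > operand;
          case "$gte": return value >= operand;
          case "$lt": return value < operand;
          case "$lte": return value <= operand;
          default: throw new Error("operator not covered by this sketch: " + op);
        }
      });
    }
    return value === condition;
  });
}

// A $match-like stage: keep only the documents that satisfy the query.
const matchStage = (docs, query) => docs.filter((doc) => matches(doc, query));

const matchBooks = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

console.log(matchStage(matchBooks, { language: "Russian" }));
console.log(matchStage(matchBooks, { pages: { $gt: 1000 } }));
```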
  • 20. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
  • 21. Including and Excluding Fields
      Stage:  { $project: { _id: 0, title: 1, language: 1 }}
      Input:  { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
      Output: { title: "Great Gatsby", language: "English" }
  • 22. Renaming and Computing Fields
      Stage:  { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" }}
      Input:  { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
      Output: { _id: 375, avgChapterLength: 24.2222, lang: "English" }
  • 23. Creating Sub-Document Fields
      Stage:  { $project: { title: 1, stats: { pages: "$pages", language: "$language" } }}
      Input:  { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
      Output: { _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }
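The $project slides combine three ideas: keeping fields, computing values from field references, and nesting values in a sub-document. The sketch below reproduces those outputs in plain JavaScript under simplifying assumptions: only `1` (include), `"$field"` references, `$divide`, and nested objects are handled, and `evaluate` and `projectStage` are hypothetical names.

```javascript
// Evaluate a tiny subset of expressions against one document.
function evaluate(expr, doc) {
  if (typeof expr === "string" && expr.startsWith("$")) {
    return doc[expr.slice(1)]; // "$pages" -> doc.pages
  }
  if (expr !== null && typeof expr === "object" && "$divide" in expr) {
    const [dividend, divisor] = expr.$divide.map((e) => evaluate(e, doc));
    return dividend / divisor;
  }
  if (expr !== null && typeof expr === "object") {
    // Any other object is treated as a sub-document whose values are expressions.
    const sub = {};
    for (const [key, value] of Object.entries(expr)) sub[key] = evaluate(value, doc);
    return sub;
  }
  return expr; // literal value
}

// A $project-like stage for a single document. _id is kept unless set to 0.
function projectStage(doc, spec) {
  const out = {};
  if (spec._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [field, rule] of Object.entries(spec)) {
    if (field === "_id") continue;
    out[field] = rule === 1 ? doc[field] : evaluate(rule, doc);
  }
  return out;
}

const projectBook = { _id: 375, title: "Great Gatsby", pages: 218, chapters: 9, language: "English" };

console.log(projectStage(projectBook, { _id: 0, title: 1, language: 1 }));
console.log(projectStage(projectBook, {
  avgChapterLength: { $divide: ["$pages", "$chapters"] },
  lang: "$language",
}));
console.log(projectStage(projectBook, {
  title: 1,
  stats: { pages: "$pages", language: "$language" },
}));
```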
  • 24. $group • Group documents by an ID – Field reference, object, constant • Other output fields are computed – $max, $min, $avg, $sum – $addToSet, $push – $first, $last • Processes all data in memory
  • 25. Calculating an Average
      Stage:  { $group: { _id: "$language", avgPages: { $avg: "$pages" } }}
      Input:  { title: "The Great Gatsby", pages: 218, language: "English" }
              { title: "War and Peace", pages: 1440, language: "Russian" }
              { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Output: { _id: "Russian", avgPages: 1440 }
              { _id: "English", avgPages: 653 }
  • 26. Summing Fields and Counting
      Stage:  { $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" } }}
      Input:  { title: "The Great Gatsby", pages: 218, language: "English" }
              { title: "War and Peace", pages: 1440, language: "Russian" }
              { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Output: { _id: "Russian", numTitles: 1, sumPages: 1440 }
              { _id: "English", numTitles: 2, sumPages: 1306 }
  • 27. Collecting Distinct Values
      Stage:  { $group: { _id: "$language", titles: { $addToSet: "$title" } }}
      Input:  { title: "The Great Gatsby", pages: 218, language: "English" }
              { title: "War and Peace", pages: 1440, language: "Russian" }
              { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Output: { _id: "Russian", titles: [ "War and Peace" ] }
              { _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }
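The three $group examples can be reproduced with a short in-memory sketch. It assumes the group key is a single field reference and covers only `$sum`, `$avg`, and `$addToSet`; `resolve` and `groupStage` are hypothetical helpers. The order of groups and of set members here follows insertion order, which need not match the order MongoDB returns.

```javascript
// Resolve "$field" against a document; anything else is a constant.
const resolve = (arg, doc) =>
  typeof arg === "string" && arg.startsWith("$") ? doc[arg.slice(1)] : arg;

const sum = (values) => values.reduce((a, b) => a + b, 0);

// A $group-like stage: bucket documents by key, then compute each accumulator.
function groupStage(docs, spec) {
  const { _id: keyExpr, ...accumulators } = spec;
  const buckets = new Map();
  for (const doc of docs) {
    const key = resolve(keyExpr, doc);
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(doc);
  }
  return [...buckets].map(([key, members]) => {
    const out = { _id: key };
    for (const [name, accumulator] of Object.entries(accumulators)) {
      const [[op, arg]] = Object.entries(accumulator);
      const values = members.map((member) => resolve(arg, member));
      if (op === "$sum") out[name] = sum(values);
      else if (op === "$avg") out[name] = sum(values) / values.length;
      else if (op === "$addToSet") out[name] = [...new Set(values)];
      else throw new Error("accumulator not covered by this sketch: " + op);
    }
    return out;
  });
}

const groupBooks = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

const grouped = groupStage(groupBooks, {
  _id: "$language",
  numTitles: { $sum: 1 },
  sumPages: { $sum: "$pages" },
  avgPages: { $avg: "$pages" },
  titles: { $addToSet: "$title" },
});
console.log(JSON.stringify(grouped, null, 2));
```

Because every bucket is held in memory until the input is exhausted, the sketch also shows why the slide notes that $group processes all data in memory.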
  • 28. $unwind • Applied to an array field • Yield new documents for each array element – Array replaced by element value – Missing/empty fields → no output – Non-array fields → error • Pipe to $group to aggregate array values
  • 29. Yielding Multiple Documents from One
      Stage:  { $unwind: "$subjects" }
      Input:  { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] }
      Output: { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
              { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
              { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
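The $unwind rules listed on the slide (one output document per element, nothing for missing or empty arrays, an error for non-arrays) can be sketched as follows. `unwindStage` is a hypothetical helper, and treating `null` the same as a missing field is an assumption of this sketch.

```javascript
// A $unwind-like stage: emit one copy of the document per array element,
// with the array field replaced by that element.
function unwindStage(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined || value === null) continue; // missing field: no output
    if (!Array.isArray(value)) {
      throw new Error("cannot unwind non-array field: " + field);
    }
    for (const element of value) {
      out.push({ ...doc, [field]: element }); // an empty array yields nothing
    }
  }
  return out;
}

const unwindInput = [
  {
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: ["Long Island", "New York", "1920s"],
  },
];

const unwound = unwindStage(unwindInput, "subjects");
console.log(unwound);
```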
  • 30. $sort, $limit, $skip • Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior
  • 31. Sort All the Documents in the Pipeline
      Stage:  { $sort: { title: 1 }}
      Input:  { title: "The Great Gatsby" }
              { title: "Brave New World" }
              { title: "Grapes of Wrath" }
              { title: "Animal Farm" }
              { title: "Lord of the Flies" }
              { title: "Fathers and Sons" }
              { title: "Invisible Man" }
              { title: "Fahrenheit 451" }
      Output: { title: "Animal Farm" }
              { title: "Brave New World" }
              { title: "Fahrenheit 451" }
              { title: "Fathers and Sons" }
              { title: "Grapes of Wrath" }
              { title: "Invisible Man" }
              { title: "Lord of the Flies" }
              { title: "The Great Gatsby" }
  • 32. Limit Documents Through the Pipeline
      Stage:  { $limit: 5 }
      Input:  { title: "The Great Gatsby" }
              { title: "Brave New World" }
              { title: "Grapes of Wrath" }
              { title: "Animal Farm" }
              { title: "Lord of the Flies" }
              { title: "Fathers and Sons" }
              { title: "Invisible Man" }
              { title: "Fahrenheit 451" }
      Output: { title: "The Great Gatsby" }
              { title: "Brave New World" }
              { title: "Grapes of Wrath" }
              { title: "Animal Farm" }
              { title: "Lord of the Flies" }
  • 33. Skip Over Documents in the Pipeline
      Stage:  { $skip: 5 }
      Input:  { title: "The Great Gatsby" }
              { title: "Brave New World" }
              { title: "Grapes of Wrath" }
              { title: "Animal Farm" }
              { title: "Lord of the Flies" }
              { title: "Fathers and Sons" }
              { title: "Invisible Man" }
              { title: "Fahrenheit 451" }
      Output: { title: "Fathers and Sons" }
              { title: "Invisible Man" }
              { title: "Fahrenheit 451" }
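The sort, limit, and skip slides use the same eight titles, so one small sketch can cover all three. The stage functions below are hypothetical plain-JavaScript stand-ins, and the sort handles a single ascending field only.

```javascript
// $sort-like stage: copy the input, then order by one field, ascending.
const sortStage = (docs, field) =>
  [...docs].sort((a, b) => (a[field] < b[field] ? -1 : a[field] > b[field] ? 1 : 0));

// $limit-like stage: pass through the first n documents only.
const limitStage = (docs, n) => docs.slice(0, n);

// $skip-like stage: drop the first n documents and pass the rest.
const skipStage = (docs, n) => docs.slice(n);

const shelf = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man", "Fahrenheit 451",
].map((title) => ({ title }));

console.log(sortStage(shelf, "title").map((d) => d.title));
console.log(limitStage(shelf, 5).map((d) => d.title));
console.log(skipStage(shelf, 5).map((d) => d.title));
```

The sort needs the complete input before it can emit anything, which is the behavior the slide describes as waiting for the earlier pipeline operator to return.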
  • 34. Usage and Limitations
  • 35. Usage • collection.aggregate() method – Mongo shell – Most drivers • aggregate database command
  • 36. Collection
      Call:   db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ])
      Result: { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }
  • 37. Database Command
      Call:   db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ] })
      Result: { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }
  • 38. Limitations • Result limited by BSON document size – Final command result – Intermediate shard results • Pipeline operator memory limits • Some BSON types unsupported – Binary, Code, deprecated types
  • 39. Sharding
  • 40. Sharding • Split the pipeline at first $group or $sort – Shards execute pipeline up to that point – mongos merges results and continues • Early $match may exclude shards • CPU and memory implications for mongos
  • 41. Sharding
      [ { $match: { /* filter by shard key */ }},
        { $project: { /* select fields */ }},
        { $group: { /* group by some field */ }},
        { $sort: { /* sort by some field */ }},
        { $project: { /* reshape result */ }} ]
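The split rule on the sharding slide can be illustrated with a small function that cuts a pipeline at its first $group or $sort. This is a sketch of the rule as stated on the slide, not of mongos internals: `splitPipeline`, `shardPart`, and `mergePart` are hypothetical names, and any partial work the shards might do for the split stage itself is not modelled.

```javascript
// Split a pipeline at the first $group or $sort stage.
// Stages before the split run on each shard; the rest run after merging.
function splitPipeline(pipeline) {
  const index = pipeline.findIndex((stage) => "$group" in stage || "$sort" in stage);
  if (index === -1) {
    return { shardPart: pipeline, mergePart: [] };
  }
  return { shardPart: pipeline.slice(0, index), mergePart: pipeline.slice(index) };
}

// Same shape as the pipeline on the slide; the field names are placeholders.
const shardedPipeline = [
  { $match: { language: "English" } },
  { $project: { language: 1, pages: 1 } },
  { $group: { _id: "$language", sumPages: { $sum: "$pages" } } },
  { $sort: { sumPages: -1 } },
  { $project: { sumPages: 1 } },
];

const { shardPart, mergePart } = splitPipeline(shardedPipeline);
console.log("on shards:", shardPart.map((s) => Object.keys(s)[0]));
console.log("after merge:", mergePart.map((s) => Object.keys(s)[0]));
```

Everything in `mergePart` runs in one place, which is where the CPU and memory implications for mongos come from.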
  • 42. Aggregation in a sharded cluster
  • 43. Expressions
  • 44. Expressions • Return computed values • Used with $project and $group • Reference fields using $ (e.g. "$x") • Expressions may be nested
  • 45. Boolean Operators • Input array of one or more values – $and, $or – Short-circuit logic • Invert values with $not • Evaluation of non-boolean types – null, undefined, zero ▶ false – Non-zero, strings, dates, objects ▶ true
      { $and: [true, false] } ▶ false
      { $or: ["foo", 0] } ▶ true
      { $not: null } ▶ true
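The truthiness rules on this slide differ from JavaScript's own (an empty string is falsy in JavaScript, while the slide lists strings as true), so a sketch needs an explicit conversion. `toBool`, `and`, `or`, and `not` are hypothetical helpers that follow only the rules listed on the slide.

```javascript
// Convert a value to boolean using the slide's rules:
// null, undefined, zero (and false itself) are false; everything else is true.
function toBool(value) {
  return !(value === null || value === undefined || value === 0 || value === false);
}

// every() and some() stop at the first deciding element, giving short-circuit behavior.
const and = (values) => values.every(toBool);
const or = (values) => values.some(toBool);
const not = (value) => !toBool(value);

console.log(and([true, false]));
console.log(or(["foo", 0]));
console.log(not(null));
```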
  • 46. Comparison Operators • Compare numbers, strings, and dates • Input array with two operands – $cmp, $eq, $ne – $gt, $gte, $lt, $lte
      { $cmp: [3, 4] } ▶ -1
      { $eq: ["foo", "bar"] } ▶ false
      { $ne: ["foo", "bar"] } ▶ true
      { $gt: [9, 7] } ▶ true
  • 47. Arithmetic Operators • Input array of one or more numbers – $add, $multiply • Input array of two operands – $subtract, $divide, $mod
      { $add: [1, 2, 3] } ▶ 6
      { $multiply: [2, 2, 2] } ▶ 8
      { $subtract: [10, 7] } ▶ 3
      { $divide: [10, 2] } ▶ 5
      { $mod: [8, 3] } ▶ 2
  • 48. String Operators • $strcasecmp case-insensitive comparison – $cmp is case-sensitive • $toLower and $toUpper case change • $substr for sub-string extraction • Not encoding aware (assumes ASCII alphabet)
      { $strcasecmp: ["foo", "bar"] } ▶ 1
      { $substr: ["foo", 1, 2] } ▶ "oo"
      { $toUpper: "foo" } ▶ "FOO"
      { $toLower: "BAR" } ▶ "bar"
  • 49. Date Operators • Extract values from date objects – $dayOfYear, $dayOfMonth, $dayOfWeek – $year, $month, $week – $hour, $minute, $second
      { $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
      { $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
      { $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
      { $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
      { $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
      { $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
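The date parts for this example date can be checked with JavaScript's built-in `Date`, using the UTC getters. The sketch assumes the operators work in UTC, that day-of-week is numbered 1 (Sunday) to 7 (Saturday), and that weeks start on Sunday with the days before the year's first Sunday counted as week 0. 2012 is a leap year, so 24 October is day 298 of the year.

```javascript
const sampleDate = new Date("2012-10-24T00:00:00.000Z");
const MS_PER_DAY = 24 * 60 * 60 * 1000;

const year = sampleDate.getUTCFullYear();
const month = sampleDate.getUTCMonth() + 1;       // JavaScript months are 0-based
const dayOfMonth = sampleDate.getUTCDate();
const dayOfWeek = sampleDate.getUTCDay() + 1;     // shift 0..6 (Sunday = 0) to 1..7

// Whole days elapsed since 1 January of the same year, plus one.
const dayOfYear =
  Math.floor((sampleDate.getTime() - Date.UTC(year, 0, 1)) / MS_PER_DAY) + 1;

// Sunday-based week number; days before the first Sunday fall in week 0.
const week = Math.floor((dayOfYear - 1 + 7 - sampleDate.getUTCDay()) / 7);

console.log({ year, month, dayOfMonth, dayOfWeek, dayOfYear, week });
```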
  • 50. Conditional Operators • $cond ternary operator • $ifNull
      { $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"
      { $ifNull: ["foo", "bar"] } ▶ "foo"
      { $ifNull: [null, "bar"] } ▶ "bar"
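Both conditional operators map onto one-line JavaScript functions. `cond` and `ifNull` below are hypothetical helpers; treating `undefined` the same as `null` in `ifNull` is an assumption of this sketch.

```javascript
// $cond-like helper: pick one of two values depending on a test.
const cond = (test, thenValue, elseValue) => (test ? thenValue : elseValue);

// $ifNull-like helper: fall back to a replacement when the value is null.
const ifNull = (value, replacement) =>
  value === null || value === undefined ? replacement : value;

console.log(cond(1 === 2, "same", "different"));
console.log(ifNull("foo", "bar"));
console.log(ifNull(null, "bar"));
```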
  • 51. Looking Ahead
  • 52. Framework Use Cases • Basic aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing time series data
  • 53. Extending the Framework • Adding new pipeline operators, expressions • $out and $tee for output control – https://jira.mongodb.org/browse/SERVER-3253
  • 54. Future Enhancements • Automatically move $match earlier if possible • Pipeline explain facility • Memory usage improvements – Grouping input sorted by _id – Sorting with limited output
  • 55. #mongodbdays Thank You Emily Stolfo Ruby Engineer/Evangelist, 10gen @EmStolfo