Aggregation Framework

Transcript

  • 1. #mongodbdays • Aggregation Framework • Emily Stolfo, Ruby Engineer/Evangelist, 10gen • @EmStolfo • Tuesday, January 29, 2013
  • 2. Agenda • State of Aggregation • Pipeline • Usage and Limitations • Optimization • Sharding • (Expressions) • Looking Ahead
  • 3. State of Aggregation
  • 4. State of Aggregation • We're storing our data in MongoDB • We need to do ad-hoc reporting, grouping, common aggregations, etc. • What are we using for this?
  • 5. Data Warehousing
  • 6. Data Warehousing • SQL for reporting and analytics • Infrastructure complications – Additional maintenance – Data duplication – ETL processes – Real time?
  • 7. MapReduce
  • 8. MapReduce • Extremely versatile, powerful • Intended for complex data analysis • Overkill for simple aggregation tasks, such as – Averages – Summation – Grouping
  • 9. MapReduce in MongoDB • Implemented with JavaScript – Single-threaded – Difficult to debug • Concurrency – Appearance of parallelism – Write locks
  • 10. Aggregation Framework
  • 11. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple – Operation pipeline – Computational expressions • Works well with sharding
  • 12. Enabling Developers • Doing more within MongoDB, faster • Refactoring MapReduce and groupings – Replace pages of JavaScript – Longer aggregation pipelines • Quick aggregations from the shell
  • 13. Pipeline
  • 14. Pipeline • Process a stream of documents – Original input is a collection – Final output is a result document • Series of operators – Filter or transform data – Input/output chain, like a Unix pipeline: ps ax | grep mongod | head -n 1
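The input/output chain on slide 14 can be sketched in plain JavaScript as a list of stage functions, each consuming the array produced by the previous one. This is only an in-memory illustration of the idea, not how MongoDB executes a pipeline. The names `runPipeline`, `listProcesses`, `grepMongod`, and `headOne`, and the process list itself, are made up for this sketch.

```javascript
// Hypothetical helper: feed the output of each stage into the next one.
function runPipeline(input, stages) {
  return stages.reduce((docs, stage) => stage(docs), input);
}

// Stand-ins for the three commands in `ps ax | grep mongod | head -n 1`.
// The process list is made-up sample data.
const listProcesses = () => ["1 init", "42 mongod", "43 mongod", "99 bash"];
const grepMongod = (lines) => lines.filter((line) => line.includes("mongod"));
const headOne = (lines) => lines.slice(0, 1);

const firstMongod = runPipeline(null, [listProcesses, grepMongod, headOne]);
console.log(firstMongod); // the first line that mentions mongod
```

An aggregation pipeline has the same shape: the collection plays the role of the first command, and each operator filters or transforms the stream it receives.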
  • 15. Pipeline Operators • $match • $sort • $project • $limit • $group • $skip • $unwind
  • 16. Example book data { _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" }
  • 17. $match • Filter documents • Uses existing query syntax • (No geospatial operations or $where)
  • 18. Matching Field Values • Input: { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } • Stage: { $match: { language: "Russian" }} • Output: { title: "War and Peace", pages: 1440, language: "Russian" }
  • 19. Matching with Query Operators • Input: { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } • Stage: { $match: { pages: { $gt: 1000 } }} • Output: { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" }
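The two $match slides can be reproduced in plain JavaScript to check the expected outputs. The `match` helper below is a hypothetical stand-in that understands only equality and `$gt`; it emulates the result of the stage in memory and says nothing about how the server evaluates queries.

```javascript
// Sample documents from slides 18 and 19.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: supports { field: value } and { field: { $gt: n } } only.
function match(docs, query) {
  return docs.filter((doc) =>
    Object.entries(query).every(([field, cond]) => {
      if (cond !== null && typeof cond === "object" && "$gt" in cond) {
        return doc[field] > cond.$gt;
      }
      return doc[field] === cond;
    })
  );
}

const russian = match(books, { language: "Russian" });
const longBooks = match(books, { pages: { $gt: 1000 } });

console.log(russian.map((b) => b.title));   // War and Peace
console.log(longBooks.map((b) => b.title)); // War and Peace, Atlas Shrugged
```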
  • 20. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields
  • 21. Including and Excluding Fields • Input: { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } • Stage: { $project: { _id: 0, title: 1, language: 1 }} • Output: { title: "Great Gatsby", language: "English" }
  • 22. Renaming and Computing Fields • Input: { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } • Stage: { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" }} • Output: { _id: 375, avgChapterLength: 24.2222, lang: "English" }
  • 23. Creating Sub-Document Fields • Input: { _id: 375, title: "Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } • Stage: { $project: { title: 1, stats: { pages: "$pages", language: "$language" } }} • Output: { _id: 375, title: "Great Gatsby", stats: { pages: 218, language: "English" } }
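As a rough check on the $project slides, the reshaping on slides 22 and 23 can be written out by hand in plain JavaScript. The functions `projectComputed` and `projectStats` are hypothetical names for this sketch; each one hard-codes a single projection rather than interpreting a $project specification. As on the slides, `_id` is carried through because it was not excluded.

```javascript
// The example book from slide 16.
const gatsby = {
  _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  subjects: ["Long Island", "New York", "1920s"],
  language: "English",
};

// Mirrors { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] },
//                       lang: "$language" } }
function projectComputed(doc) {
  return {
    _id: doc._id,
    avgChapterLength: doc.pages / doc.chapters, // computed field
    lang: doc.language,                         // renamed field
  };
}

// Mirrors { $project: { title: 1, stats: { pages: "$pages", language: "$language" } } }
function projectStats(doc) {
  return {
    _id: doc._id,
    title: doc.title,
    stats: { pages: doc.pages, language: doc.language }, // new sub-document
  };
}

const computed = projectComputed(gatsby);
const withStats = projectStats(gatsby);
console.log(computed.avgChapterLength.toFixed(4)); // 24.2222
console.log(withStats.stats);
```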
  • 24. $group • Group documents by an ID – Field reference, object, constant • Other output fields are computed – $max, $min, $avg, $sum – $addToSet, $push – $first, $last • Processes all data in memory
  • 25. Calculating an Average • Input: { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } • Stage: { $group: { _id: "$language", avgPages: { $avg: "$pages" } }} • Output: { _id: "Russian", avgPages: 1440 } { _id: "English", avgPages: 653 }
  • 26. Summing Fields and Counting • Input: { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } • Stage: { $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" } }} • Output: { _id: "Russian", numTitles: 1, sumPages: 1440 } { _id: "English", numTitles: 2, sumPages: 1306 }
  • 27. Collecting Distinct Values • Input: { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } • Stage: { $group: { _id: "$language", titles: { $addToSet: "$title" } }} • Output: { _id: "Russian", titles: [ "War and Peace" ] } { _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }
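The three $group slides compute a count, a sum, an average, and a set of distinct titles per language. A minimal in-memory sketch of the same arithmetic follows. `groupByLanguage` is a hypothetical helper that hard-codes the grouping key and accumulators; the order of groups and of set members here simply follows insertion order and may differ from what the server returns.

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: groups on "language" and keeps running accumulators,
// roughly what $sum: 1, $sum: "$pages", $avg and $addToSet produce.
function groupByLanguage(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.language)) {
      groups.set(doc.language, { numTitles: 0, sumPages: 0, titles: new Set() });
    }
    const g = groups.get(doc.language);
    g.numTitles += 1;        // { $sum: 1 }
    g.sumPages += doc.pages; // { $sum: "$pages" }
    g.titles.add(doc.title); // { $addToSet: "$title" }
  }
  return [...groups.entries()].map(([language, g]) => ({
    _id: language,
    numTitles: g.numTitles,
    sumPages: g.sumPages,
    avgPages: g.sumPages / g.numTitles, // { $avg: "$pages" }
    titles: [...g.titles],
  }));
}

const byLanguage = groupByLanguage(books);
console.log(byLanguage);
```

For the English group this gives 2 titles, 1306 pages in total, and an average of 653, matching the slides.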
  • 28. $unwind • Applied to an array field • Yield new documents for each array element – Array replaced by element value – Missing/empty fields → no output – Non-array fields → error • Pipe to $group to aggregate array values
  • 29. Yielding Multiple Documents from One • Input: { title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] } • Stage: { $unwind: "$subjects" } • Output: { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
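The $unwind rules on slide 28 (one output document per array element, nothing for a missing or empty field, an error for a non-array field) can be sketched as a small in-memory function. `unwind` is a hypothetical helper that follows the behavior as stated on the slide; it is not the server's code.

```javascript
// Hypothetical helper implementing the rules listed on slide 28.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    if (value === undefined) continue; // missing field: no output
    if (!Array.isArray(value)) {
      throw new TypeError(`$unwind: field "${field}" is not an array`);
    }
    // An empty array yields no documents; otherwise one copy per element,
    // with the array replaced by the element value.
    for (const element of value) {
      out.push({ ...doc, [field]: element });
    }
  }
  return out;
}

const gatsby = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

const unwound = unwind([gatsby], "subjects");
console.log(unwound.map((doc) => doc.subjects)); // Long Island, New York, 1920s
```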
  • 30. $sort, $limit, $skip • Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior
  • 31. Sort All the Documents in the Pipeline • Input: { title: "The Great Gatsby" } { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "Animal Farm" } { title: "Lord of the Flies" } { title: "Fathers and Sons" } { title: "Invisible Man" } { title: "Fahrenheit 451" } • Stage: { $sort: { title: 1 }} • Output: { title: "Animal Farm" } { title: "Brave New World" } { title: "Fahrenheit 451" } { title: "Fathers and Sons" } { title: "Grapes of Wrath" } { title: "Invisible Man" } { title: "Lord of the Flies" } { title: "The Great Gatsby" }
  • 32. Limit Documents Through the Pipeline • Input: { title: "The Great Gatsby" } { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "Animal Farm" } { title: "Lord of the Flies" } { title: "Fathers and Sons" } { title: "Invisible Man" } { title: "Fahrenheit 451" } • Stage: { $limit: 5 } • Output: { title: "The Great Gatsby" } { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "Animal Farm" } { title: "Lord of the Flies" }
  • 33. Skip Over Documents in the Pipeline • Input: { title: "The Great Gatsby" } { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "Animal Farm" } { title: "Lord of the Flies" } { title: "Fathers and Sons" } { title: "Invisible Man" } { title: "Fahrenheit 451" } • Stage: { $skip: 5 } • Output: { title: "Fathers and Sons" } { title: "Invisible Man" } { title: "Fahrenheit 451" }
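The outputs on slides 31 to 33 can be checked with ordinary array operations. The helpers `sortByTitle`, `limit`, and `skip` are hypothetical names for this sketch, and the sort uses a plain code-unit string comparison, which is an assumption that happens to agree with the slide for these ASCII titles.

```javascript
// The eight input documents shown on slides 31 to 33.
const titles = [
  "The Great Gatsby", "Brave New World", "Grapes of Wrath", "Animal Farm",
  "Lord of the Flies", "Fathers and Sons", "Invisible Man", "Fahrenheit 451",
].map((title) => ({ title }));

// { $sort: { title: 1 } }: needs every input document before it can emit any.
const sortByTitle = (docs) =>
  [...docs].sort((a, b) => (a.title < b.title ? -1 : a.title > b.title ? 1 : 0));

// { $limit: n } and { $skip: n } behave like their cursor counterparts.
const limit = (docs, n) => docs.slice(0, n);
const skip = (docs, n) => docs.slice(n);

const sorted = sortByTitle(titles);
const firstFive = limit(titles, 5);
const afterFive = skip(titles, 5);

console.log(sorted.map((d) => d.title));
console.log(firstFive.map((d) => d.title));
console.log(afterFive.map((d) => d.title)); // Fathers and Sons, Invisible Man, Fahrenheit 451
```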
  • 34. Usage and Limitations
  • 35. Usage • collection.aggregate() method – Mongo shell – Most drivers • aggregate database command
  • 36. Collection db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]) • Result: { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }
  • 37. Database Command db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ] }) • Result: { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 }
  • 38. Limitations • Result limited by BSON document size – Final command result – Intermediate shard results • Pipeline operator memory limits • Some BSON types unsupported – Binary, Code, deprecated types
  • 39. Sharding
  • 40. Sharding • Split the pipeline at first $group or $sort – Shards execute pipeline up to that point – mongos merges results and continues • Early $match may excuse shards • CPU and memory implications for mongos
  • 41. Sharding [ { $match: { /* filter by shard key */ }}, { $project: { /* select fields */ }}, { $group: { /* group by some field */ }}, { $sort: { /* sort by some field */ }}, { $project: { /* reshape result */ }} ]
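The splitting rule on slide 40 can be illustrated by cutting the slide 41 pipeline at its first $group or $sort stage. This is a simplified reading of the slide, assuming the shards run everything before the split point and mongos runs the rest. `splitForShards` is a hypothetical helper, and the stage bodies are left empty because the slide only gives placeholders.

```javascript
// The pipeline from slide 41, with the placeholder stage bodies left empty.
const pipeline = [
  { $match: {} },   // filter by shard key
  { $project: {} }, // select fields
  { $group: {} },   // group by some field
  { $sort: {} },    // sort by some field
  { $project: {} }, // reshape result
];

// Hypothetical helper: split at the first $group or $sort stage.
function splitForShards(stages) {
  const isSplitPoint = (stage) => "$group" in stage || "$sort" in stage;
  const index = stages.findIndex(isSplitPoint);
  const cut = index === -1 ? stages.length : index;
  return { shardPart: stages.slice(0, cut), mongosPart: stages.slice(cut) };
}

const { shardPart, mongosPart } = splitForShards(pipeline);
const names = (stages) => stages.map((stage) => Object.keys(stage)[0]);

console.log(names(shardPart));  // $match, $project
console.log(names(mongosPart)); // $group, $sort, $project
```

Under this reading, everything from the $group onward runs on mongos, which is where the CPU and memory implications mentioned on slide 40 come from.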
  • 42. Aggregation in a sharded cluster
  • 43. Expressions
  • 44. Expressions • Return computed values • Used with $project and $group • Reference fields using $ (e.g. "$x") • Expressions may be nested
  • 45. Boolean Operators • Input array of one or more values – $and, $or – Short-circuit logic • Invert values with $not • Evaluation of non-boolean types – null, undefined, zero ▶ false – Non-zero, strings, dates, objects ▶ true { $and: [true, false] } ▶ false { $or: ["foo", 0] } ▶ true { $not: null } ▶ true
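The truthiness table on slide 45 differs slightly from JavaScript's own rules, so a literal transcription may help. The helpers `toBool`, `and`, `or`, and `not` are hypothetical and follow only what the slide states: null, undefined, and zero are false, everything else is true. Taking that rule literally, an empty string counts as true here, whereas JavaScript treats it as false.

```javascript
// Hypothetical helper: truthiness exactly as listed on slide 45.
function toBool(value) {
  if (typeof value === "boolean") return value;
  return !(value === null || value === undefined || value === 0);
}

// every() and some() stop at the first deciding element (short-circuit).
const and = (values) => values.every(toBool);
const or = (values) => values.some(toBool);
const not = (value) => !toBool(value);

console.log(and([true, false])); // false
console.log(or(["foo", 0]));     // true
console.log(not(null));          // true
console.log(toBool(""), Boolean("")); // true false
```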
  • 46. Comparison Operators • Compare numbers, strings, and dates • Input array with two operands – $cmp, $eq, $ne – $gt, $gte, $lt, $lte { $cmp: [3, 4] } ▶ -1 { $eq: ["foo", "bar"] } ▶ false { $ne: ["foo", "bar"] } ▶ true { $gt: [9, 7] } ▶ true
  • 47. Arithmetic Operators • Input array of one or more numbers – $add, $multiply • Input array of two operands – $subtract, $divide, $mod { $add: [1, 2, 3] } ▶ 6 { $multiply: [2, 2, 2] } ▶ 8 { $subtract: [10, 7] } ▶ 3 { $divide: [10, 2] } ▶ 5 { $mod: [8, 3] } ▶ 2
  • 48. String Operators • $strcasecmp case-insensitive comparison – $cmp is case-sensitive • $toLower and $toUpper case change • $substr for sub-string extraction • Not encoding aware (assumes ASCII alphabet) { $strcasecmp: ["foo", "bar"] } ▶ 1 { $substr: ["foo", 1, 2] } ▶ "oo" { $toUpper: "foo" } ▶ "FOO" { $toLower: "BAR" } ▶ "bar"
  • 49. Date Operators • Extract values from date objects – $dayOfYear, $dayOfMonth, $dayOfWeek – $year, $month, $week – $hour, $minute, $second { $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012 { $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10 { $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24 { $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4 { $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298 { $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
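Several of the date results on slide 49 can be cross-checked with JavaScript's built-in UTC accessors. This sketch covers only $year, $month, $dayOfMonth, $dayOfWeek, and $hour, and it assumes the conventions the slide's results imply: months count from 1, and days of the week count from 1 for Sunday, so that a Wednesday comes out as 4.

```javascript
// The date used on slide 49.
const date = new Date("2012-10-24T00:00:00.000Z");

const year = date.getUTCFullYear();     // $year
const month = date.getUTCMonth() + 1;   // $month (JavaScript months start at 0)
const dayOfMonth = date.getUTCDate();   // $dayOfMonth
const dayOfWeek = date.getUTCDay() + 1; // $dayOfWeek (1 = Sunday, assumed from the slide)
const hour = date.getUTCHours();        // $hour

console.log({ year, month, dayOfMonth, dayOfWeek, hour });
```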
  • 50. Conditional Operators • $cond ternary operator • $ifNull { $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different" { $ifNull: ["foo", "bar"] } ▶ "foo" { $ifNull: [null, "bar"] } ▶ "bar"
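Slide 44 notes that expressions may be nested and may reference fields with a leading $. A toy evaluator makes that concrete. `evaluate` is a hypothetical function covering only a handful of the operators from slides 46 to 50. It evaluates all operands eagerly and uses JavaScript truthiness for $cond, both simplifications that the server does not necessarily share.

```javascript
// Hypothetical evaluator for a few of the operators on slides 46 to 50.
function evaluate(expr, doc) {
  // Strings starting with "$" are field references, e.g. "$pages".
  if (typeof expr === "string" && expr.startsWith("$")) return doc[expr.slice(1)];
  // Anything that is not a plain operator object is a literal.
  if (expr === null || typeof expr !== "object" || Array.isArray(expr)) return expr;

  const [op] = Object.keys(expr);
  // Operands may be a single value or an array; evaluate each one recursively.
  const args = [].concat(expr[op]).map((arg) => evaluate(arg, doc));

  switch (op) {
    case "$add": return args.reduce((a, b) => a + b, 0);
    case "$multiply": return args.reduce((a, b) => a * b, 1);
    case "$subtract": return args[0] - args[1];
    case "$divide": return args[0] / args[1];
    case "$mod": return args[0] % args[1];
    case "$eq": return args[0] === args[1];
    case "$gt": return args[0] > args[1];
    case "$cond": return args[0] ? args[1] : args[2];
    case "$ifNull": return args[0] === null || args[0] === undefined ? args[1] : args[0];
    case "$toUpper": return String(args[0]).toUpperCase();
    default: throw new Error("unsupported operator: " + op);
  }
}

const book = { title: "The Great Gatsby", pages: 218, chapters: 9 };

console.log(evaluate({ $cond: [{ $eq: [1, 2] }, "same", "different"] }, book)); // different
console.log(evaluate({ $ifNull: ["$subtitle", "none"] }, book));                // none
console.log(evaluate({ $add: [1, { $multiply: [2, 2, 2] }] }, book));           // 9
console.log(evaluate({ $gt: [{ $divide: ["$pages", "$chapters"] }, 20] }, book)); // true
```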
  • 51. Looking Ahead
  • 52. Framework Use Cases • Basic aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing time series data
  • 53. Extending the Framework • Adding new pipeline operators, expressions • $out and $tee for output control – https://jira.mongodb.org/browse/SERVER-3253
  • 54. Future Enhancements • Automatically move $match earlier if possible • Pipeline explain facility • Memory usage improvements – Grouping input sorted by _id – Sorting with limited output
  • 55. #mongodbdays • Thank You • Emily Stolfo, Ruby Engineer/Evangelist, 10gen • @EmStolfo