Aggregation Framework


  1. #mongodbdays: Aggregation Framework
     Emily Stolfo, Ruby Engineer/Evangelist, 10gen (@EmStolfo)
     Tuesday, January 29, 2013
  2. Agenda
     • State of Aggregation
     • Pipeline
     • Usage and Limitations
     • Optimization
     • Sharding
     • (Expressions)
     • Looking Ahead
  3. State of Aggregation
  4. State of Aggregation
     • We're storing our data in MongoDB
     • We need to do ad-hoc reporting, grouping, common aggregations, etc.
     • What are we using for this?
  5. Data Warehousing
  6. Data Warehousing
     • SQL for reporting and analytics
     • Infrastructure complications
       – Additional maintenance
       – Data duplication
       – ETL processes
       – Real time?
  7. MapReduce
  8. MapReduce
     • Extremely versatile, powerful
     • Intended for complex data analysis
     • Overkill for simple aggregation tasks, such as
       – Averages
       – Summation
       – Grouping
  9. MapReduce in MongoDB
     • Implemented with JavaScript
       – Single-threaded
       – Difficult to debug
     • Concurrency
       – Appearance of parallelism
       – Write locks
  10. Aggregation Framework
  11. Aggregation Framework
      • Declared in JSON, executes in C++
      • Flexible, functional, and simple
        – Operation pipeline
        – Computational expressions
      • Works well with sharding
  12. Enabling Developers
      • Doing more within MongoDB, faster
      • Refactoring MapReduce and groupings
        – Replace pages of JavaScript
        – Longer aggregation pipelines
      • Quick aggregations from the shell
  13. Pipeline
  14. Pipeline
      • Process a stream of documents
        – Original input is a collection
        – Final output is a result document
      • Series of operators
        – Filter or transform data
        – Input/output chain
      Shell analogy: ps ax | grep mongod | head -n 1
  15. Pipeline Operators
      • $match
      • $project
      • $group
      • $unwind
      • $sort
      • $limit
      • $skip
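
The input/output chain described on the Pipeline slide can be modeled in plain JavaScript as a list of functions, each taking an array of documents and returning a new one. This is only a sketch of the idea, not how MongoDB executes a pipeline; `runPipeline` and the three stage functions are hypothetical names introduced for illustration.

```javascript
// Sample documents modeled on the book data used in the slides.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: feed the output of each stage into the next one.
function runPipeline(docs, stages) {
  return stages.reduce((stream, stage) => stage(stream), docs);
}

// Each stage maps an array of documents to a new array of documents.
const onlyEnglish = (docs) => docs.filter((d) => d.language === "English");
const byPagesDescending = (docs) => [...docs].sort((a, b) => b.pages - a.pages);
const firstOne = (docs) => docs.slice(0, 1);

// Roughly what a $match, $sort, $limit chain would do.
const result = runPipeline(books, [onlyEnglish, byPagesDescending, firstOne]);
console.log(result); // the longest English-language book
```

As with the shell command, the order of the stages matters: filtering first keeps the later stages from handling documents that would be discarded anyway.
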
  16. Example book data
        {
          _id: 375,
          title: "The Great Gatsby",
          ISBN: "9781857150193",
          available: true,
          pages: 218,
          chapters: 9,
          subjects: [ "Long Island", "New York", "1920s" ],
          language: "English"
        }
  17. $match
      • Filter documents
      • Uses existing query syntax
      • (No geospatial operations or $where)
  18. Matching Field Values
      Input:
        { title: "The Great Gatsby", pages: 218, language: "English" }
        { title: "War and Peace", pages: 1440, language: "Russian" }
        { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Pipeline stage:
        { $match: { language: "Russian" }}
      Output:
        { title: "War and Peace", pages: 1440, language: "Russian" }
  19. Matching with Query Operators
      Input:
        { title: "The Great Gatsby", pages: 218, language: "English" }
        { title: "War and Peace", pages: 1440, language: "Russian" }
        { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Pipeline stage:
        { $match: { pages: { $gt: 1000 } }}
      Output:
        { title: "War and Peace", pages: 1440, language: "Russian" }
        { title: "Atlas Shrugged", pages: 1088, language: "English" }
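
The two $match slides can be mimicked with an ordinary array filter. The sketch below only imitates the semantics for equality and `$gt`; the `match` helper is hypothetical and covers none of the other query operators.

```javascript
// Sample documents from the slides.
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: supports plain equality and { $gt: n } conditions only.
function match(docs, query) {
  return docs.filter((doc) =>
    Object.entries(query).every(([field, cond]) => {
      if (cond !== null && typeof cond === "object" && "$gt" in cond) {
        return doc[field] > cond.$gt;
      }
      return doc[field] === cond;
    })
  );
}

// Same filters as the two slides above.
const russian = match(books, { language: "Russian" });
const longBooks = match(books, { pages: { $gt: 1000 } });

console.log(russian.map((b) => b.title));
console.log(longBooks.map((b) => b.title));
```
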
  20. $project
      • Reshape documents
      • Include, exclude or rename fields
      • Inject computed fields
      • Create sub-document fields
  21. Including and Excluding Fields
      Input:
        {
          _id: 375,
          title: "Great Gatsby",
          ISBN: "9781857150193",
          available: true,
          pages: 218,
          subjects: [ "Long Island", "New York", "1920s" ],
          language: "English"
        }
      Pipeline stage:
        { $project: { _id: 0, title: 1, language: 1 }}
      Output:
        { title: "Great Gatsby", language: "English" }
  22. Renaming and Computing Fields
      Input:
        {
          _id: 375,
          title: "Great Gatsby",
          ISBN: "9781857150193",
          available: true,
          pages: 218,
          chapters: 9,
          subjects: [ "Long Island", "New York", "1920s" ],
          language: "English"
        }
      Pipeline stage:
        { $project: {
          avgChapterLength: { $divide: ["$pages", "$chapters"] },
          lang: "$language"
        }}
      Output:
        { _id: 375, avgChapterLength: 24.2222, lang: "English" }
  23. Creating Sub-Document Fields
      Input:
        {
          _id: 375,
          title: "Great Gatsby",
          ISBN: "9781857150193",
          available: true,
          pages: 218,
          subjects: [ "Long Island", "New York", "1920s" ],
          language: "English"
        }
      Pipeline stage:
        { $project: {
          title: 1,
          stats: { pages: "$pages", language: "$language" }
        }}
      Output:
        {
          _id: 375,
          title: "Great Gatsby",
          stats: { pages: 218, language: "English" }
        }
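
To make the reshaping concrete, the sketch below writes two of the $project stages from the slides by hand as JavaScript functions. The function names `renameAndCompute` and `nestStats` are hypothetical; the code illustrates the resulting document shapes and is not a general $project implementation.

```javascript
const book = {
  _id: 375,
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  available: true,
  pages: 218,
  chapters: 9,
  language: "English",
};

// Hand-written stand-in for:
//   { $project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" } }
function renameAndCompute(doc) {
  return {
    _id: doc._id, // kept because the stage does not set _id: 0
    avgChapterLength: doc.pages / doc.chapters, // computed field
    lang: doc.language, // renamed field
  };
}

// Hand-written stand-in for:
//   { $project: { title: 1, stats: { pages: "$pages", language: "$language" } } }
function nestStats(doc) {
  return {
    _id: doc._id,
    title: doc.title,
    stats: { pages: doc.pages, language: doc.language },
  };
}

const renamed = renameAndCompute(book);
const nested = nestStats(book);
console.log(renamed.avgChapterLength.toFixed(4), renamed.lang);
console.log(nested);
```

Fields that the stage does not mention, such as ISBN and available, simply do not appear in the output.
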
  24. $group
      • Group documents by an ID
        – Field reference, object, constant
      • Other output fields are computed
        – $max, $min, $avg, $sum
        – $addToSet, $push
        – $first, $last
      • Processes all data in memory
  25. Calculating an Average
      Input:
        { title: "The Great Gatsby", pages: 218, language: "English" }
        { title: "War and Peace", pages: 1440, language: "Russian" }
        { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Pipeline stage:
        { $group: { _id: "$language", avgPages: { $avg: "$pages" } }}
      Output:
        { _id: "Russian", avgPages: 1440 }
        { _id: "English", avgPages: 653 }
  26. Summing Fields and Counting
      Input:
        { title: "The Great Gatsby", pages: 218, language: "English" }
        { title: "War and Peace", pages: 1440, language: "Russian" }
        { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Pipeline stage:
        { $group: {
          _id: "$language",
          numTitles: { $sum: 1 },
          sumPages: { $sum: "$pages" }
        }}
      Output:
        { _id: "Russian", numTitles: 1, sumPages: 1440 }
        { _id: "English", numTitles: 2, sumPages: 1306 }
  27. Collecting Distinct Values
      Input:
        { title: "The Great Gatsby", pages: 218, language: "English" }
        { title: "War and Peace", pages: 1440, language: "Russian" }
        { title: "Atlas Shrugged", pages: 1088, language: "English" }
      Pipeline stage:
        { $group: { _id: "$language", titles: { $addToSet: "$title" } }}
      Output:
        { _id: "Russian", titles: [ "War and Peace" ] }
        { _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }
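
The three $group slides compute an average, a sum, a count, and a distinct set per language. A rough JavaScript equivalent, assuming everything fits in memory (as the $group slide notes the framework itself requires), could look like the following. `groupByLanguage` is a hypothetical helper, and sorting the titles is an assumption made only to get a stable print order.

```javascript
const books = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "War and Peace", pages: 1440, language: "Russian" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];

// Hypothetical helper: one accumulator object per distinct language.
function groupByLanguage(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.language)) {
      groups.set(doc.language, { numTitles: 0, sumPages: 0, titles: new Set() });
    }
    const g = groups.get(doc.language);
    g.numTitles += 1; // like { $sum: 1 }
    g.sumPages += doc.pages; // like { $sum: "$pages" }
    g.titles.add(doc.title); // like { $addToSet: "$title" }
  }
  return [...groups].map(([language, g]) => ({
    _id: language,
    numTitles: g.numTitles,
    sumPages: g.sumPages,
    avgPages: g.sumPages / g.numTitles, // like { $avg: "$pages" }
    titles: [...g.titles].sort(), // sorted here only for stable output
  }));
}

const groups = groupByLanguage(books);
console.log(groups);
```
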
  28. $unwind
      • Applied to an array field
      • Yield new documents for each array element
        – Array replaced by element value
        – Missing/empty fields → no output
        – Non-array fields → error
      • Pipe to $group to aggregate array values
  29. Yielding Multiple Documents from One
      Input:
        {
          title: "The Great Gatsby",
          ISBN: "9781857150193",
          subjects: [ "Long Island", "New York", "1920s" ]
        }
      Pipeline stage:
        { $unwind: "$subjects" }
      Output:
        { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" }
        { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" }
        { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }
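
The $unwind rules listed on the slide map naturally onto `Array.prototype.flatMap`. The sketch below follows those rules as stated (missing or empty field gives no output, non-array field is an error). The `unwind` helper is hypothetical, and treating `null` like a missing field is an assumption.

```javascript
// Hypothetical helper that follows the rules on the $unwind slide.
function unwind(docs, field) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    if (value === undefined || value === null) return []; // missing field: no output
    if (!Array.isArray(value)) {
      throw new TypeError(`${field} is not an array`); // non-array field: error
    }
    // One copy of the document per element; an empty array yields nothing.
    return value.map((element) => ({ ...doc, [field]: element }));
  });
}

const books = [
  {
    title: "The Great Gatsby",
    ISBN: "9781857150193",
    subjects: ["Long Island", "New York", "1920s"],
  },
  { title: "Untitled" }, // hypothetical document without a subjects field
];

const unwound = unwind(books, "subjects");
console.log(unwound.map((d) => d.subjects));
```

Once the array is flattened this way, each element is an ordinary field value and can be grouped or counted like any other.
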
  30. $sort, $limit, $skip
      • Sort documents by one or more fields
        – Same order syntax as cursors
        – Waits for earlier pipeline operator to return
        – In-memory unless early and indexed
      • Limit and skip follow cursor behavior
  31. Sort All the Documents in the Pipeline
      Input:
        { title: "The Great Gatsby" }
        { title: "Brave New World" }
        { title: "Grapes of Wrath" }
        { title: "Animal Farm" }
        { title: "Lord of the Flies" }
        { title: "Fathers and Sons" }
        { title: "Invisible Man" }
        { title: "Fahrenheit 451" }
      Pipeline stage:
        { $sort: { title: 1 }}
      Output:
        { title: "Animal Farm" }
        { title: "Brave New World" }
        { title: "Fahrenheit 451" }
        { title: "Fathers and Sons" }
        { title: "Grapes of Wrath" }
        { title: "Invisible Man" }
        { title: "Lord of the Flies" }
        { title: "The Great Gatsby" }
  32. Limit Documents Through the Pipeline
      Input:
        { title: "The Great Gatsby" }
        { title: "Brave New World" }
        { title: "Grapes of Wrath" }
        { title: "Animal Farm" }
        { title: "Lord of the Flies" }
        { title: "Fathers and Sons" }
        { title: "Invisible Man" }
        { title: "Fahrenheit 451" }
      Pipeline stage:
        { $limit: 5 }
      Output:
        { title: "The Great Gatsby" }
        { title: "Brave New World" }
        { title: "Grapes of Wrath" }
        { title: "Animal Farm" }
        { title: "Lord of the Flies" }
  33. Skip Over Documents in the Pipeline
      Input:
        { title: "The Great Gatsby" }
        { title: "Brave New World" }
        { title: "Grapes of Wrath" }
        { title: "Animal Farm" }
        { title: "Lord of the Flies" }
        { title: "Fathers and Sons" }
        { title: "Invisible Man" }
        { title: "Fahrenheit 451" }
      Pipeline stage:
        { $skip: 5 }
      Output:
        { title: "Fathers and Sons" }
        { title: "Invisible Man" }
        { title: "Fahrenheit 451" }
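
The three slides above correspond closely to sorting and slicing an array. The following sketch reproduces their inputs and outputs in plain JavaScript; it illustrates the behavior only, and the variable names are chosen for this example.

```javascript
const docs = [
  "The Great Gatsby",
  "Brave New World",
  "Grapes of Wrath",
  "Animal Farm",
  "Lord of the Flies",
  "Fathers and Sons",
  "Invisible Man",
  "Fahrenheit 451",
].map((title) => ({ title }));

// { $sort: { title: 1 } }: ascending by title; needs the whole input before emitting.
const sorted = [...docs].sort((a, b) =>
  a.title < b.title ? -1 : a.title > b.title ? 1 : 0
);

// { $limit: 5 }: pass through the first five documents only.
const limited = docs.slice(0, 5);

// { $skip: 5 }: drop the first five documents and pass the rest.
const skipped = docs.slice(5);

console.log(sorted.map((d) => d.title));
console.log(limited.map((d) => d.title));
console.log(skipped.map((d) => d.title));
```
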
  34. Usage and Limitations
  35. Usage
      • collection.aggregate() method
        – Mongo shell
        – Most drivers
      • aggregate database command
  36. Collection
        db.books.aggregate([
          { $project: { language: 1 }},
          { $group: { _id: "$language", numTitles: { $sum: 1 }}}
        ])
      Result:
        {
          result: [
            { _id: "Russian", numTitles: 1 },
            { _id: "English", numTitles: 2 }
          ],
          ok: 1
        }
  37. Database Command
        db.runCommand({
          aggregate: "books",
          pipeline: [
            { $project: { language: 1 }},
            { $group: { _id: "$language", numTitles: { $sum: 1 }}}
          ]
        })
      Result:
        {
          result: [
            { _id: "Russian", numTitles: 1 },
            { _id: "English", numTitles: 2 }
          ],
          ok: 1
        }
  38. Limitations
      • Result limited by BSON document size
        – Final command result
        – Intermediate shard results
      • Pipeline operator memory limits
      • Some BSON types unsupported
        – Binary, Code, deprecated types
  39. Sharding
  40. Sharding
      • Split the pipeline at first $group or $sort
        – Shards execute pipeline up to that point
        – mongos merges results and continues
      • Early $match may exclude shards
      • CPU and memory implications for mongos
  41. Sharding
        [
          { $match: { /* filter by shard key */ }},
          { $project: { /* select fields */ }},
          { $group: { /* group by some field */ }},
          { $sort: { /* sort by some field */ }},
          { $project: { /* reshape result */ }}
        ]
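
The split described on the Sharding slide can be pictured as two functions: one that every shard runs on its own data, and one that runs once on the merged stream. The sketch below is a simplified model of that description, not the server's actual execution plan; the shard contents and the names `shardPart` and `mergerPart` are hypothetical.

```javascript
// Hypothetical shard contents; in a real cluster the shard key decides placement.
const shardA = [
  { title: "The Great Gatsby", pages: 218, language: "English" },
  { title: "Atlas Shrugged", pages: 1088, language: "English" },
];
const shardB = [{ title: "War and Peace", pages: 1440, language: "Russian" }];

// Stages before the first $group ($match and $project here): run on every shard.
const shardPart = (docs) =>
  docs.filter((d) => d.pages > 200).map((d) => ({ language: d.language }));

// Stages from the first $group onward ($group and $sort here): run once on the merged stream.
function mergerPart(docs) {
  const counts = new Map();
  for (const d of docs) {
    counts.set(d.language, (counts.get(d.language) || 0) + 1);
  }
  return [...counts]
    .map(([language, numTitles]) => ({ _id: language, numTitles }))
    .sort((a, b) => (a._id < b._id ? -1 : a._id > b._id ? 1 : 0));
}

// The merging node concatenates the shard outputs and continues the pipeline.
const merged = [shardA, shardB].flatMap(shardPart);
const result = mergerPart(merged);
console.log(result);
```

In this model all grouping and sorting work lands on the merging side, which is the reason the slide points out CPU and memory implications for mongos.
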
  42. Aggregation in a sharded cluster
  43. Expressions
  44. Expressions
      • Return computed values
      • Used with $project and $group
      • Reference fields using $ (e.g. "$x")
      • Expressions may be nested
  45. Boolean Operators
      • Input array of one or more values
        – $and, $or
        – Short-circuit logic
      • Invert values with $not
      • Evaluation of non-boolean types
        – null, undefined, zero ▶ false
        – Non-zero, strings, dates, objects ▶ true
        { $and: [true, false] } ▶ false
        { $or: ["foo", 0] } ▶ true
        { $not: null } ▶ true
  46. Comparison Operators
      • Compare numbers, strings, and dates
      • Input array with two operands
        – $cmp, $eq, $ne
        – $gt, $gte, $lt, $lte
        { $cmp: [3, 4] } ▶ -1
        { $eq: ["foo", "bar"] } ▶ false
        { $ne: ["foo", "bar"] } ▶ true
        { $gt: [9, 7] } ▶ true
  47. Arithmetic Operators
      • Input array of one or more numbers
        – $add, $multiply
      • Input array of two operands
        – $subtract, $divide, $mod
        { $add: [1, 2, 3] } ▶ 6
        { $multiply: [2, 2, 2] } ▶ 8
        { $subtract: [10, 7] } ▶ 3
        { $divide: [10, 2] } ▶ 5
        { $mod: [8, 3] } ▶ 2
  48. String Operators
      • $strcasecmp case-insensitive comparison
        – $cmp is case-sensitive
      • $toLower and $toUpper case change
      • $substr for sub-string extraction
      • Not encoding aware (assumes ASCII alphabet)
        { $strcasecmp: ["foo", "bar"] } ▶ 1
        { $substr: ["foo", 1, 2] } ▶ "oo"
        { $toUpper: "foo" } ▶ "FOO"
        { $toLower: "BAR" } ▶ "bar"
  49. Date Operators
      • Extract values from date objects
        – $dayOfYear, $dayOfMonth, $dayOfWeek
        – $year, $month, $week
        – $hour, $minute, $second
        { $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
        { $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
        { $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
        { $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
        { $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 298
        { $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43
  50. Conditional Operators
      • $cond ternary operator
      • $ifNull
        { $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different"
        { $ifNull: ["foo", "bar"] } ▶ "foo"
        { $ifNull: [null, "bar"] } ▶ "bar"
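
Because expressions may be nested, they can be evaluated recursively. The following is a toy evaluator for a handful of the operators shown on the preceding slides, written to illustrate the nesting and the "$field" reference convention. `evaluate` is a hypothetical function, it evaluates all operands eagerly (so it does not reproduce short-circuit behavior), and it is not MongoDB's implementation.

```javascript
// Toy evaluator: handles field references, literals, and a few array-argument operators.
function evaluate(expr, doc) {
  if (typeof expr === "string" && expr.startsWith("$")) {
    return doc[expr.slice(1)]; // "$pages" reads doc.pages
  }
  if (Array.isArray(expr)) {
    return expr.map((e) => evaluate(e, doc)); // evaluate each operand
  }
  if (expr !== null && typeof expr === "object") {
    const [op] = Object.keys(expr);
    const args = evaluate(expr[op], doc);
    switch (op) {
      case "$add":
        return args.reduce((a, b) => a + b, 0);
      case "$multiply":
        return args.reduce((a, b) => a * b, 1);
      case "$divide":
        return args[0] / args[1];
      case "$eq":
        return args[0] === args[1];
      case "$ifNull":
        return args[0] ?? args[1]; // fall back when the first operand is null or missing
      case "$cond":
        return args[0] ? args[1] : args[2];
      default:
        throw new Error(`unsupported operator in this sketch: ${op}`);
    }
  }
  return expr; // numbers, booleans, null, and plain strings are literals
}

const doc = { pages: 200, chapters: 10 };
console.log(evaluate({ $cond: [{ $eq: [1, 2] }, "same", "different"] }, doc));
console.log(evaluate({ $multiply: [{ $divide: ["$pages", "$chapters"] }, 2] }, doc));
console.log(evaluate({ $ifNull: ["$subtitle", "bar"] }, doc));
```
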
  51. Looking Ahead
  52. Framework Use Cases
      • Basic aggregation queries
      • Ad-hoc reporting
      • Real-time analytics
      • Visualizing time series data
  53. Extending the Framework
      • Adding new pipeline operators, expressions
      • $out and $tee for output control
        – https://jira.mongodb.org/browse/SERVER-3253
  54. Future Enhancements
      • Automatically move $match earlier if possible
      • Pipeline explain facility
      • Memory usage improvements
        – Grouping input sorted by _id
        – Sorting with limited output
  55. #mongodbdays: Thank You
      Emily Stolfo, Ruby Engineer/Evangelist, 10gen (@EmStolfo)
