4. State of Aggregation
• We're storing our data in MongoDB
• We need to do ad-hoc reporting, grouping,
common aggregations, etc.
• What are we using for this?
Tuesday, January 29, 13
6. Data Warehousing
• SQL for reporting and analytics
• Infrastructure complications
– Additional maintenance
– Data duplication
– ETL processes
– Real time?
Tuesday, January 29, 13
8. MapReduce
• Extremely versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregation tasks, such as
– Averages
– Summation
– Grouping
Tuesday, January 29, 13
9. MapReduce in MongoDB
• Implemented with JavaScript
– Single-threaded
– Difficult to debug
• Concurrency
– Appearance of parallelism
– Write locks
Tuesday, January 29, 13
11. Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
– Operation pipeline
– Computational expressions
• Works well with sharding
Tuesday, January 29, 13
12. Enabling Developers
• Doing more within MongoDB, faster
• Refactoring MapReduce and groupings
– Replace pages of JavaScript
– Longer aggregation pipelines
• Quick aggregations from the shell
Tuesday, January 29, 13
14. Pipeline
• Process a stream of documents
– Original input is a collection
– Final output is a result document
• Series of operators
– Filter or transform data
– Input/output chain
ps ax | grep mongod | head -n 1
Tuesday, January 29, 13
16. Example book data
{
_id: 375,
title: "The Great Gatsby",
ISBN: "9781857150193",
available: true,
pages: 218,
chapters: 9,
subjects: [
"Long Island",
"New York",
"1920s"
],
language: "English"
}
Tuesday, January 29, 13
17. $match
• Filter documents
• Uses existing query syntax
• (No geospatial operations or $where)
Tuesday, January 29, 13
18. Matching Field Values
{
{ $match: {
title: "The Great Gatsby",
language: "Russian"
pages: 218,
}}
language: "English"
}
{
title: "War and Peace",
{
pages: 1440,
title: "War and Peace",
language: "Russian"
pages: 1440,
}
language: "Russian"
}
{
title: "Atlas Shrugged",
pages: 1088,
language: "English"
}
Tuesday, January 29, 13
24. $group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
Tuesday, January 29, 13
25. Calculating an Average
{ { $group: {
title: "The Great Gatsby", _id: "$language",
pages: 218, avgPages: { $avg:
language: "English" "$pages" }
} }}
{
title: "War and Peace",
pages: 1440, {
language: "Russian" _id: "Russian",
} avgPages: 1440
}
{
title: "Atlas Shrugged", {
pages: 1088, _id: "English",
language: "English" avgPages: 653
} }
Tuesday, January 29, 13
27. Collecting Distinct Values
{ { $group: {
title: "The Great Gatsby", _id: "$language",
pages: 218, titles: { $addToSet: "$title" }
language: "English" }}
}
{ {
title: "War and Peace", _id: "Russian",
titles: [ "War and Peace" ]
pages: 1440, }
language: "Russian"
}
{
_id: "English",
{ titles: [
title: "Atlas Shrugged", "Atlas Shrugged",
pages: 1088, "The Great Gatsby"
language: "English" ]
}
}
Tuesday, January 29, 13
28. $unwind
• Applied to an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values
Tuesday, January 29, 13
29. Yielding Multiple Documents from One
{ { $unwind: "$subjects" }
title: "The Great Gatsby",
ISBN: "9781857150193",
{
subjects: [
title: "The Great Gatsby",
"Long Island", ISBN: "9781857150193",
"New York", subjects: "Long Island"
"1920s" }
]
} {
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "New York"
}
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "1920s"
}
Tuesday, January 29, 13
30. $sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
Tuesday, January 29, 13
31. Sort All the Documents in the Pipeline
{ title: "The Great Gatsby" } { $sort: { title: 1 }}
{ title: "Brave New World" }
{ title: "Grapes of Wrath" } { title: "Animal Farm" }
{ title: "Animal Farm" } { title: "Brave New World" }
{ title: "Lord of the Flies" } { title: "Fahrenheit 451" }
{ title: "Fathers and Sons" } { title: "Fathers and Sons" }
{ title: "Invisible Man" } { title: "Grapes of Wrath" }
{ title: "Fahrenheit 451" } { title: "Invisible Man" }
{ title: "Lord of the Flies" }
{ title: "The Great Gatsby" }
Tuesday, January 29, 13
32. Limit Documents Through the Pipeline
{ title: "The Great Gatsby" } { $limit: 5 }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" } { title: "The Great Gatsby" }
{ title: "Animal Farm" } { title: "Brave New World" }
{ title: "Lord of the Flies" } { title: "Grapes of Wrath" }
{ title: "Fathers and Sons" } { title: "Animal Farm" }
{ title: "Invisible Man" } { title: "Lord of the Flies" }
{ title: "Fahrenheit 451" }
Tuesday, January 29, 13
33. Skip Over Documents in the Pipeline
{ title: "The Great Gatsby" } { $skip: 5 }
{ title: "Brave New World" }
{ title: "Grapes of Wrath" }
{ title: "Animal Farm" } { title: "Fathers and Sons" }
{ title: "Lord of the Flies" } { title: "Invisible Man" }
{ title: "Fathers and Sons" } { title: "Fahrenheit 451" }
{ title: "Invisible Man" }
{ title: "Fahrenheit 451" }
Tuesday, January 29, 13
40. Sharding
• Split the pipeline at first $group or $sort
– Shards execute pipeline up to that point
– mongos merges results and continues
• Early $match may excuse shards
• CPU and memory implications for mongos
Tuesday, January 29, 13
41. Sharding
[
{ $match: { /* filter by shard key */ }},
{ $project: { /* select fields */ }},
{ $group: { /* group by some field */ }},
{ $sort: { /* sort by some field */ }},
{ $project: { /* reshape result */ }}
]
Tuesday, January 29, 13
44. Expressions
• Return computed values
• Used with $project and $group
• Reference fields using $ (e.g. "$x")
• Expressions may be nested
Tuesday, January 29, 13
45. Boolean Operators
• Input array of one or more values
– $and, $or
– Short-circuit logic
• Invert values with $not
• Evaluation of non-boolean types
– null, undefined, zero ▶ false
– Non-zero, strings, dates, objects ▶ true
{ $and: [true, false] } ▶ false
{ $or: ["foo", 0] } ▶ true
{ $not: null } ▶ true
Tuesday, January 29, 13
52. Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing time series data
Tuesday, January 29, 13
53. Extending the Framework
• Adding new pipeline operators, expressions
• $out and $tee for output control
– https://jira.mongodb.org/browse/SERVER-3253
Tuesday, January 29, 13
54. Future Enhancements
• Automatically move $match earlier if possible
• Pipeline explain facility
• Memory usage improvements
– Grouping input sorted by _id
– Sorting with limited output
Tuesday, January 29, 13
55. #mongodbdays
Thank You
Emily Stolfo
Ruby Engineer/Evangelist, 10gen
@EmStolfo
Tuesday, January 29, 13