SlideShare a Scribd company logo
1 of 55
Download to read offline
#mongodbdays




       Aggregation Framework
       Emily Stolfo
       Ruby Engineer/Evangelist, 10gen
       @EmStolfo




Tuesday, January 29, 13
Agenda
       • State of Aggregation
       • Pipeline
       • Usage and Limitations
       • Optimization
       • Sharding
       • (Expressions)
       • Looking Ahead



Tuesday, January 29, 13
State of Aggregation




Tuesday, January 29, 13
State of Aggregation
       • We're storing our data in MongoDB
       • We need to do ad-hoc reporting, grouping,
          common aggregations, etc.
       • What are we using for this?




Tuesday, January 29, 13
Data Warehousing

Tuesday, January 29, 13
Data Warehousing
       • SQL for reporting and analytics
       • Infrastructure complications
             – Additional maintenance
             – Data duplication
             – ETL processes
             – Real time?




Tuesday, January 29, 13
MapReduce

Tuesday, January 29, 13
MapReduce
       • Extremely versatile, powerful
       • Intended for complex data analysis
       • Overkill for simple aggregation tasks, such as
             – Averages
             – Summation
             – Grouping




Tuesday, January 29, 13
MapReduce in MongoDB
       • Implemented with JavaScript
             – Single-threaded
             – Difficult to debug

       • Concurrency
             – Appearance of parallelism
             – Write locks




Tuesday, January 29, 13
Aggregation Framework

Tuesday, January 29, 13
Aggregation Framework
       • Declared in JSON, executes in C++
       • Flexible, functional, and simple
             – Operation pipeline
             – Computational expressions

       • Works well with sharding




Tuesday, January 29, 13
Enabling Developers
       • Doing more within MongoDB, faster
       • Refactoring MapReduce and groupings
             – Replace pages of JavaScript
             – Longer aggregation pipelines

       • Quick aggregations from the shell




Tuesday, January 29, 13
Pipeline




Tuesday, January 29, 13
Pipeline
       • Process a stream of documents
             – Original input is a collection
             – Final output is a result document

       • Series of operators
             – Filter or transform data
             – Input/output chain


                          ps ax | grep mongod | head -n 1




Tuesday, January 29, 13
Pipeline Operators

                • $match     • $sort
                • $project   • $limit
                • $group     • $skip
                • $unwind




Tuesday, January 29, 13
Example book data
       {
           _id: 375,
           title: "The Great Gatsby",
           ISBN: "9781857150193",
           available: true,
           pages: 218,
           chapters: 9,
           subjects: [
              "Long Island",
              "New York",
              "1920s"
           ],
           language: "English"
       }




Tuesday, January 29, 13
$match
       • Filter documents
       • Uses existing query syntax
       • (No geospatial operations or $where)




Tuesday, January 29, 13
Matching Field Values
       {
                                        { $match: {
           title: "The Great Gatsby",
                                          language: "Russian"
           pages: 218,
                                        }}
           language: "English"
       }

       {
           title: "War and Peace",
                                        {
           pages: 1440,
                                            title: "War and Peace",
           language: "Russian"
                                            pages: 1440,
       }
                                            language: "Russian"
                                        }
       {
           title: "Atlas Shrugged",
           pages: 1088,
           language: "English"
       }




Tuesday, January 29, 13
Matching with Query Operators
       {                                { $match: {
           title: "The Great Gatsby",     pages: { $gt: 1000 }
           pages: 218,                  }}
           language: "English"
       }

       {                                {
           title: "War and Peace",          title: "War and Peace",
           pages: 1440,                     pages: 1440,
           language: "Russian"              language: "Russian"
       }                                }

       {                                {
           title: "Atlas Shrugged",         title: "Atlas Shrugged",
           pages: 1088,                     pages: 1088,
           language: "English"              language: "English"
       }                                }



Tuesday, January 29, 13
$project
       • Reshape documents
       • Include, exclude or rename fields
       • Inject computed fields
       • Create sub-document fields




Tuesday, January 29, 13
Including and Excluding Fields
      {                            { $project: {
          _id: 375,                  _id: 0,
          title: "Great Gatsby",     title: 1,
          ISBN: "9781857150193",     language: 1
          available: true,         }}
          pages: 218,
          subjects: [
             "Long Island",
             "New York",
             "1920s"               {
          ],                           title: " Great Gatsby",
          language: "English"          language: "English"
      }                            }




Tuesday, January 29, 13
Renaming and Computing Fields
       {                            { $project: {
           _id: 375,                  avgChapterLength: {
           title: "Great Gatsby",        $divide: ["$pages",
           ISBN: "9781857150193",              "$chapters"]
           available: true,           },
           pages: 218,                lang: "$language"
           chapters: 9,             }}
           subjects: [
              "Long Island",
              "New York",
              "1920s"               {
           ],                           _id: 375,
           language: "English"          avgChapterLength: 24.2222 ,
       }                                lang: "English"
                                    }




Tuesday, January 29, 13
Creating Sub-Document Fields
                                   { $project: {
      {
                                     title: 1,
          _id: 375,
                                     stats: {
          title: "Great Gatsby",
                                       pages: "$pages",
          ISBN: "9781857150193",
                                       language: "$language",
          available: true,
                                     }
          pages: 218,
                                   }}
          subjects: [
             "Long Island",
             "New York",
             "1920s"
                                   {
          ],
                                       _id: 375,
          language: "English"
                                       title: " Great Gatsby",
      }
                                       stats: {
                                         pages: 218,
                                         language: "English"
                                       }




Tuesday, January 29, 13
$group
       • Group documents by an ID
             – Field reference, object, constant

       • Other output fields are computed
             – $max, $min, $avg, $sum
             – $addToSet, $push
             – $first, $last

       • Processes all data in memory




Tuesday, January 29, 13
Calculating an Average
       {                                { $group: {
           title: "The Great Gatsby",     _id: "$language",
           pages: 218,                    avgPages: { $avg:
           language: "English"                    "$pages" }
       }                                }}

       {
           title: "War and Peace",
           pages: 1440,                 {
           language: "Russian"              _id: "Russian",
       }                                    avgPages: 1440
                                        }
       {
           title: "Atlas Shrugged",     {
           pages: 1088,                     _id: "English",
           language: "English"              avgPages: 653
       }                                }



Tuesday, January 29, 13
Summating Fields and Counting
       {                                { $group: {
           title: "The Great Gatsby",     _id: "$language",
           pages: 218,                    numTitles: { $sum: 1 },
           language: "English"            sumPages: { $sum: "$pages" }
                                        }}
       }

       {
           title: "War and Peace",      {
           pages: 1440,                     _id: "Russian",
           language: "Russian”              numTitles: 1,
       }                                    sumPages: 1440
                                        }
       {
                                        {
           title: "Atlas Shrugged",
                                            _id: "English",
           pages: 1088,                     numTitles: 2,
           language: "English"              sumPages: 1306
       }                                }




Tuesday, January 29, 13
Collecting Distinct Values
       {                                { $group: {
           title: "The Great Gatsby",     _id: "$language",
           pages: 218,                    titles: { $addToSet: "$title" }
           language: "English"          }}
       }

       {                                {
           title: "War and Peace",          _id: "Russian",
                                            titles: [ "War and Peace" ]
           pages: 1440,                 }
           language: "Russian"
       }
                                        {
                                            _id: "English",
       {                                    titles: [
           title: "Atlas Shrugged",           "Atlas Shrugged",
           pages: 1088,                       "The Great Gatsby"
           language: "English"              ]
                                        }
       }




Tuesday, January 29, 13
$unwind
       • Applied to an array field
       • Yield new documents for each array element
             – Array replaced by element value
             – Missing/empty fields → no output
             – Non-array fields → error

       • Pipe to $group to aggregate array values




Tuesday, January 29, 13
Yielding Multiple Documents from One
       {                                { $unwind: "$subjects" }
           title: "The Great Gatsby",
           ISBN: "9781857150193",
                                        {
           subjects: [
                                            title: "The Great Gatsby",
             "Long Island",                 ISBN: "9781857150193",
             "New York",                    subjects: "Long Island"
             "1920s"                    }
           ]
       }                                {
                                            title: "The Great Gatsby",
                                            ISBN: "9781857150193",
                                            subjects: "New York"
                                        }

                                        {
                                            title: "The Great Gatsby",
                                            ISBN: "9781857150193",
                                            subjects: "1920s"
                                        }



Tuesday, January 29, 13
$sort, $limit, $skip
       • Sort documents by one or more fields
             – Same order syntax as cursors
             – Waits for earlier pipeline operator to return
             – In-memory unless early and indexed

       • Limit and skip follow cursor behavior




Tuesday, January 29, 13
Sort All the Documents in the Pipeline

       { title: "The Great Gatsby" }    { $sort: { title: 1 }}
       { title: "Brave New World" }
       { title: "Grapes of Wrath" }     { title: "Animal Farm" }
       { title: "Animal Farm" }         { title: "Brave New World" }
       { title: "Lord of the Flies" }   { title: "Fahrenheit 451" }
       { title: "Fathers and Sons" }    { title: "Fathers and Sons" }
       { title: "Invisible Man" }       { title: "Grapes of Wrath" }
       { title: "Fahrenheit 451" }      { title: "Invisible Man" }
                                        { title: "Lord of the Flies" }
                                        { title: "The Great Gatsby" }




Tuesday, January 29, 13
Limit Documents Through the Pipeline

       { title: "The Great Gatsby" }    { $limit: 5 }
       { title: "Brave New World" }
       { title: "Grapes of Wrath" }     { title: "The Great Gatsby" }
       { title: "Animal Farm" }         { title: "Brave New World" }
       { title: "Lord of the Flies" }   { title: "Grapes of Wrath" }
       { title: "Fathers and Sons" }    { title: "Animal Farm" }
       { title: "Invisible Man" }       { title: "Lord of the Flies" }
       { title: "Fahrenheit 451" }




Tuesday, January 29, 13
Skip Over Documents in the Pipeline

       { title: "The Great Gatsby" }    { $skip: 5 }
       { title: "Brave New World" }
       { title: "Grapes of Wrath" }
       { title: "Animal Farm" }         { title: "Fathers and Sons" }
       { title: "Lord of the Flies" }   { title: "Invisible Man" }
       { title: "Fathers and Sons" }    { title: "Fahrenheit 451" }
       { title: "Invisible Man" }
       { title: "Fahrenheit 451" }




Tuesday, January 29, 13
Usage and Limitations




Tuesday, January 29, 13
Usage
       • collection.aggregate() method
             – Mongo shell
             – Most drivers

       • aggregate database command




Tuesday, January 29, 13
Collection
         db.books.aggregate([
           { $project: { language: 1 }},
           { $group: { _id: "$language", numTitles: { $sum: 1 }}}
         ])




         {
             result: [
                { _id: "Russian", numTitles: 1 },
                { _id: "English", numTitles: 2 }
             ],
             ok: 1
         }




Tuesday, January 29, 13
Database Command
         db.runCommand({
           aggregate: "books",
           pipeline: [
             { $project: { language: 1 }},
             { $group: { _id: "$language", numTitles: { $sum: 1 }}}
           ]
         })



         {
             result: [
                { _id: "Russian", numTitles: 1 },
                { _id: "English", numTitles: 2 }
             ],
             ok: 1
         }



Tuesday, January 29, 13
Limitations
       • Result limited by BSON document size
             – Final command result
             – Intermediate shard results

       • Pipeline operator memory limits
       • Some BSON types unsupported
             – Binary, Code, deprecated types




Tuesday, January 29, 13
Sharding




Tuesday, January 29, 13
Sharding
       • Split the pipeline at first $group or $sort
             – Shards execute pipeline up to that point
             – mongos merges results and continues

       • Early $match may excuse shards
       • CPU and memory implications for mongos




Tuesday, January 29, 13
Sharding
       [
           {   $match: { /* filter by shard key */ }},
           {   $project: { /* select fields  */ }},
           {   $group: { /* group by some field */ }},
           {   $sort: { /* sort by some field */ }},
           {   $project: { /* reshape result   */ }}
       ]




Tuesday, January 29, 13
Aggregation in a sharded cluster

Tuesday, January 29, 13
Expressions




Tuesday, January 29, 13
Expressions
       • Return computed values
       • Used with $project and $group
       • Reference fields using $ (e.g. "$x")
       • Expressions may be nested




Tuesday, January 29, 13
Boolean Operators
       • Input array of one or more values
             – $and, $or
             – Short-circuit logic

       • Invert values with $not
       • Evaluation of non-boolean types
             – null, undefined, zero ▶ false
             – Non-zero, strings, dates, objects ▶ true

                           { $and: [true, false] } ▶ false
                           { $or: ["foo", 0] } ▶ true
                           { $not: null       } ▶ true




Tuesday, January 29, 13
Comparison Operators
       • Compare numbers, strings, and dates
       • Input array with two operands
             – $cmp, $eq, $ne
             – $gt, $gte, $lt, $lte


                          {   $cmp: [3, 4]       } ▶ -1
                          {   $eq: ["foo", "bar"] } ▶ false
                          {   $ne: ["foo", "bar"] } ▶ true
                          {   $gt: [9, 7]      } ▶ true




Tuesday, January 29, 13
Arithmetic Operators
       • Input array of one or more numbers
             – $add, $multiply

       • Input array of two operands
             – $subtract, $divide, $mod


                          {   $add:    [1, 2, 3] } ▶ 6
                          {   $multiply: [2, 2, 2] } ▶ 8
                          {   $subtract: [10, 7] } ▶ 3
                          {   $divide: [10, 2] } ▶ 5
                          {   $mod:     [8, 3] } ▶ 2




Tuesday, January 29, 13
String Operators
       • $strcasecmp case-insensitive comparison
             – $cmp is case-sensitive

       • $toLower and $toUpper case change
       • $substr for sub-string extraction
       • Not encoding aware (assumes ASCII alphabet)

                          {   $strcasecmp:   
 ["foo", "bar"] } 
   ▶   
   1
                          {   $substr:  
      ["foo", 1, 2] } 
    ▶   
   "oo"
                          {   $toUpper: 
      "foo"        }
      ▶   
   "FOO"
                          {   $toLower: 
      "BAR"         }
     ▶   
   "bar"




Tuesday, January 29, 13
Date Operators
       • Extract values from date objects
             – $dayOfYear, $dayOfMonth, $dayOfWeek
             – $year, $month, $week
             – $hour, $minute, $second


               {   $year:   ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012
               {   $month:    ISODate("2012-10-24T00:00:00.000Z") } ▶ 10
               {   $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24
               {   $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4
               {   $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 299
               {   $week:    ISODate("2012-10-24T00:00:00.000Z") } ▶ 43




Tuesday, January 29, 13
Conditional Operators
       • $cond ternary operator
       • $ifNull



                          { $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different”

                          { $ifNull: ["foo", "bar"] } ▶ "foo"
                          { $ifNull: [null, "bar"] } ▶ "bar"




Tuesday, January 29, 13
Looking Ahead




Tuesday, January 29, 13
Framework Use Cases
       • Basic aggregation queries
       • Ad-hoc reporting
       • Real-time analytics
       • Visualizing time series data




Tuesday, January 29, 13
Extending the Framework
       • Adding new pipeline operators, expressions
       • $out and $tee for output control
             – https://jira.mongodb.org/browse/SERVER-3253




Tuesday, January 29, 13
Future Enhancements
       • Automatically move $match earlier if possible
       • Pipeline explain facility
       • Memory usage improvements
             – Grouping input sorted by _id
             – Sorting with limited output




Tuesday, January 29, 13
#mongodbdays




       Thank You
       Emily Stolfo
       Ruby Engineer/Evangelist, 10gen
       @EmStolfo




Tuesday, January 29, 13

More Related Content

Similar to Aggregation Framework

MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...MongoDB
 
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)""Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"MongoDB
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB
 
MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...
MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...
MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...MongoDB
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsMongoDB
 

Similar to Aggregation Framework (6)

MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
MongoDB .local Chicago 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi...
 
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)""Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
 
MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...
MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...
MongoDB World 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pipeline Em...
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
CouchDB
CouchDBCouchDB
CouchDB
 

More from MongoDB

MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
 
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB
 
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB
 
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB
 
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB
 
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB
 
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 MongoDB SoCal 2020: MongoDB Atlas Jump Start MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB SoCal 2020: MongoDB Atlas Jump StartMongoDB
 
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB
 
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB
 
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB
 
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB
 
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB
 
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB
 
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB
 
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDBMongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDBMongoDB
 

More from MongoDB (20)

MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
 
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
 
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
 
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
 
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
 
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 MongoDB SoCal 2020: MongoDB Atlas Jump Start MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
 
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
 
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
 
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
 
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
 
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
 
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
 
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDBMongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
 

Aggregation Framework

  • 1. #mongodbdays Aggregation Framework Emily Stolfo Ruby Engineer/Evangelist, 10gen @EmStolfo Tuesday, January 29, 13
  • 2. Agenda • State of Aggregation • Pipeline • Usage and Limitations • Optimization • Sharding • (Expressions) • Looking Ahead Tuesday, January 29, 13
  • 4. State of Aggregation • We're storing our data in MongoDB • We need to do ad-hoc reporting, grouping, common aggregations, etc. • What are we using for this? Tuesday, January 29, 13
  • 6. Data Warehousing • SQL for reporting and analytics • Infrastructure complications – Additional maintenance – Data duplication – ETL processes – Real time? Tuesday, January 29, 13
  • 8. MapReduce • Extremely versatile, powerful • Intended for complex data analysis • Overkill for simple aggregation tasks, such as – Averages – Summation – Grouping Tuesday, January 29, 13
  • 9. MapReduce in MongoDB • Implemented with JavaScript – Single-threaded – Difficult to debug • Concurrency – Appearance of parallelism – Write locks Tuesday, January 29, 13
  • 11. Aggregation Framework • Declared in JSON, executes in C++ • Flexible, functional, and simple – Operation pipeline – Computational expressions • Works well with sharding Tuesday, January 29, 13
  • 12. Enabling Developers • Doing more within MongoDB, faster • Refactoring MapReduce and groupings – Replace pages of JavaScript – Longer aggregation pipelines • Quick aggregations from the shell Tuesday, January 29, 13
  • 14. Pipeline • Process a stream of documents – Original input is a collection – Final output is a result document • Series of operators – Filter or transform data – Input/output chain ps ax | grep mongod | head -n 1 Tuesday, January 29, 13
  • 15. Pipeline Operators • $match • $sort • $project • $limit • $group • $skip • $unwind Tuesday, January 29, 13
  • 16. Example book data { _id: 375, title: "The Great Gatsby", ISBN: "9781857150193", available: true, pages: 218, chapters: 9, subjects: [ "Long Island", "New York", "1920s" ], language: "English" } Tuesday, January 29, 13
  • 17. $match • Filter documents • Uses existing query syntax • (No geospatial operations or $where) Tuesday, January 29, 13
  • 18. Matching Field Values { { $match: { title: "The Great Gatsby", language: "Russian" pages: 218, }} language: "English" } { title: "War and Peace", { pages: 1440, title: "War and Peace", language: "Russian" pages: 1440, } language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" } Tuesday, January 29, 13
  • 19. Matching with Query Operators { { $match: { title: "The Great Gatsby", pages: { $gt: 1000 } pages: 218, }} language: "English" } { { title: "War and Peace", title: "War and Peace", pages: 1440, pages: 1440, language: "Russian" language: "Russian" } } { { title: "Atlas Shrugged", title: "Atlas Shrugged", pages: 1088, pages: 1088, language: "English" language: "English" } } Tuesday, January 29, 13
  • 20. $project • Reshape documents • Include, exclude or rename fields • Inject computed fields • Create sub-document fields Tuesday, January 29, 13
  • 21. Including and Excluding Fields { { $project: { _id: 375, _id: 0, title: "Great Gatsby", title: 1, ISBN: "9781857150193", language: 1 available: true, }} pages: 218, subjects: [ "Long Island", "New York", "1920s" { ], title: " Great Gatsby", language: "English" language: "English" } } Tuesday, January 29, 13
  • 22. Renaming and Computing Fields { { $project: { _id: 375, avgChapterLength: { title: "Great Gatsby", $divide: ["$pages", ISBN: "9781857150193", "$chapters"] available: true, }, pages: 218, lang: "$language" chapters: 9, }} subjects: [ "Long Island", "New York", "1920s" { ], _id: 375, language: "English" avgChapterLength: 24.2222 , } lang: "English" } Tuesday, January 29, 13
  • 23. Creating Sub-Document Fields { $project: { { title: 1, _id: 375, stats: { title: "Great Gatsby", pages: "$pages", ISBN: "9781857150193", language: "$language", available: true, } pages: 218, }} subjects: [ "Long Island", "New York", "1920s" { ], _id: 375, language: "English" title: " Great Gatsby", } stats: { pages: 218, language: "English" } Tuesday, January 29, 13
  • 24. $group • Group documents by an ID – Field reference, object, constant • Other output fields are computed – $max, $min, $avg, $sum – $addToSet, $push – $first, $last • Processes all data in memory Tuesday, January 29, 13
  • 25. Calculating an Average { { $group: { title: "The Great Gatsby", _id: "$language", pages: 218, avgPages: { $avg: language: "English" "$pages" } } }} { title: "War and Peace", pages: 1440, { language: "Russian" _id: "Russian", } avgPages: 1440 } { title: "Atlas Shrugged", { pages: 1088, _id: "English", language: "English" avgPages: 653 } } Tuesday, January 29, 13
  • 26. Summating Fields and Counting { { $group: { title: "The Great Gatsby", _id: "$language", pages: 218, numTitles: { $sum: 1 }, language: "English" sumPages: { $sum: "$pages" } }} } { title: "War and Peace", { pages: 1440, _id: "Russian", language: "Russian” numTitles: 1, } sumPages: 1440 } { { title: "Atlas Shrugged", _id: "English", pages: 1088, numTitles: 2, language: "English" sumPages: 1306 } } Tuesday, January 29, 13
  • 27. Collecting Distinct Values { { $group: { title: "The Great Gatsby", _id: "$language", pages: 218, titles: { $addToSet: "$title" } language: "English" }} } { { title: "War and Peace", _id: "Russian", titles: [ "War and Peace" ] pages: 1440, } language: "Russian" } { _id: "English", { titles: [ title: "Atlas Shrugged", "Atlas Shrugged", pages: 1088, "The Great Gatsby" language: "English" ] } } Tuesday, January 29, 13
  • 28. $unwind • Applied to an array field • Yield new documents for each array element – Array replaced by element value – Missing/empty fields → no output – Non-array fields → error • Pipe to $group to aggregate array values Tuesday, January 29, 13
  • 29. Yielding Multiple Documents from One { { $unwind: "$subjects" } title: "The Great Gatsby", ISBN: "9781857150193", { subjects: [ title: "The Great Gatsby", "Long Island", ISBN: "9781857150193", "New York", subjects: "Long Island" "1920s" } ] } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" } Tuesday, January 29, 13
  • 30. $sort, $limit, $skip • Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior Tuesday, January 29, 13
  • 31. Sort All the Documents in the Pipeline { title: "The Great Gatsby" } { $sort: { title: 1 }} { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "Animal Farm" } { title: "Animal Farm" } { title: "Brave New World" } { title: "Lord of the Flies" } { title: "Fahrenheit 451" } { title: "Fathers and Sons" } { title: "Fathers and Sons" } { title: "Invisible Man" } { title: "Grapes of Wrath" } { title: "Fahrenheit 451" } { title: "Invisible Man" } { title: "Lord of the Flies" } { title: "The Great Gatsby" } Tuesday, January 29, 13
  • 32. Limit Documents Through the Pipeline { title: "The Great Gatsby" } { $limit: 5 } { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "The Great Gatsby" } { title: "Animal Farm" } { title: "Brave New World" } { title: "Lord of the Flies" } { title: "Grapes of Wrath" } { title: "Fathers and Sons" } { title: "Animal Farm" } { title: "Invisible Man" } { title: "Lord of the Flies" } { title: "Fahrenheit 451" } Tuesday, January 29, 13
  • 33. Skip Over Documents in the Pipeline { title: "The Great Gatsby" } { $skip: 5 } { title: "Brave New World" } { title: "Grapes of Wrath" } { title: "Animal Farm" } { title: "Fathers and Sons" } { title: "Lord of the Flies" } { title: "Invisible Man" } { title: "Fathers and Sons" } { title: "Fahrenheit 451" } { title: "Invisible Man" } { title: "Fahrenheit 451" } Tuesday, January 29, 13
  • 35. Usage • collection.aggregate() method – Mongo shell – Most drivers • aggregate database command Tuesday, January 29, 13
  • 36. Collection db.books.aggregate([ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ]) { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 } Tuesday, January 29, 13
  • 37. Database Command db.runCommand({ aggregate: "books", pipeline: [ { $project: { language: 1 }}, { $group: { _id: "$language", numTitles: { $sum: 1 }}} ] }) { result: [ { _id: "Russian", numTitles: 1 }, { _id: "English", numTitles: 2 } ], ok: 1 } Tuesday, January 29, 13
  • 38. Limitations • Result limited by BSON document size – Final command result – Intermediate shard results • Pipeline operator memory limits • Some BSON types unsupported – Binary, Code, deprecated types Tuesday, January 29, 13
  • 40. Sharding • Split the pipeline at first $group or $sort – Shards execute pipeline up to that point – mongos merges results and continues • Early $match may excuse shards • CPU and memory implications for mongos Tuesday, January 29, 13
  • 41. Sharding [ { $match: { /* filter by shard key */ }}, { $project: { /* select fields */ }}, { $group: { /* group by some field */ }}, { $sort: { /* sort by some field */ }}, { $project: { /* reshape result */ }} ] Tuesday, January 29, 13
  • 42. Aggregation in a sharded cluster Tuesday, January 29, 13
  • 44. Expressions • Return computed values • Used with $project and $group • Reference fields using $ (e.g. "$x") • Expressions may be nested Tuesday, January 29, 13
  • 45. Boolean Operators • Input array of one or more values – $and, $or – Short-circuit logic • Invert values with $not • Evaluation of non-boolean types – null, undefined, zero ▶ false – Non-zero, strings, dates, objects ▶ true { $and: [true, false] } ▶ false { $or: ["foo", 0] } ▶ true { $not: null } ▶ true Tuesday, January 29, 13
  • 46. Comparison Operators • Compare numbers, strings, and dates • Input array with two operands – $cmp, $eq, $ne – $gt, $gte, $lt, $lte { $cmp: [3, 4] } ▶ -1 { $eq: ["foo", "bar"] } ▶ false { $ne: ["foo", "bar"] } ▶ true { $gt: [9, 7] } ▶ true Tuesday, January 29, 13
  • 47. Arithmetic Operators • Input array of one or more numbers – $add, $multiply • Input array of two operands – $subtract, $divide, $mod { $add: [1, 2, 3] } ▶ 6 { $multiply: [2, 2, 2] } ▶ 8 { $subtract: [10, 7] } ▶ 3 { $divide: [10, 2] } ▶ 5 { $mod: [8, 3] } ▶ 2 Tuesday, January 29, 13
  • 48. String Operators • $strcasecmp case-insensitive comparison – $cmp is case-sensitive • $toLower and $toUpper case change • $substr for sub-string extraction • Not encoding aware (assumes ASCII alphabet) { $strcasecmp: ["foo", "bar"] } ▶ 1 { $substr: ["foo", 1, 2] } ▶ "oo" { $toUpper: "foo" } ▶ "FOO" { $toLower: "BAR" } ▶ "bar" Tuesday, January 29, 13
  • 49. Date Operators • Extract values from date objects – $dayOfYear, $dayOfMonth, $dayOfWeek – $year, $month, $week – $hour, $minute, $second { $year: ISODate("2012-10-24T00:00:00.000Z") } ▶ 2012 { $month: ISODate("2012-10-24T00:00:00.000Z") } ▶ 10 { $dayOfMonth: ISODate("2012-10-24T00:00:00.000Z") } ▶ 24 { $dayOfWeek: ISODate("2012-10-24T00:00:00.000Z") } ▶ 4 { $dayOfYear: ISODate("2012-10-24T00:00:00.000Z") } ▶ 299 { $week: ISODate("2012-10-24T00:00:00.000Z") } ▶ 43 Tuesday, January 29, 13
  • 50. Conditional Operators • $cond ternary operator • $ifNull { $cond: [{ $eq: [1, 2] }, "same", "different"] } ▶ "different” { $ifNull: ["foo", "bar"] } ▶ "foo" { $ifNull: [null, "bar"] } ▶ "bar" Tuesday, January 29, 13
  • 52. Framework Use Cases • Basic aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing time series data Tuesday, January 29, 13
  • 53. Extending the Framework • Adding new pipeline operators, expressions • $out and $tee for output control – https://jira.mongodb.org/browse/SERVER-3253 Tuesday, January 29, 13
  • 54. Future Enhancements • Automatically move $match earlier if possible • Pipeline explain facility • Memory usage improvements – Grouping input sorted by _id – Sorting with limited output Tuesday, January 29, 13
  • 55. #mongodbdays Thank You Emily Stolfo Ruby Engineer/Evangelist, 10gen @EmStolfo Tuesday, January 29, 13