MongoDB Aggregation Framework


These are slides from our Big Data Warehouse Meetup in April. We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.

Mike O’Brien from 10gen introduced the syntax and usage patterns for the new aggregation system in MongoDB and gave some demonstrations of aggregation using it. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.

For more information, visit our website at http://casertaconcepts.com/ or email us at info@casertaconcepts.com.




    MongoDB Aggregation Framework: Presentation Transcript

    • Aggregation Framework
    • Quick Overview of MongoDB
      • Document-oriented
      • Schemaless
      • JSON-style documents
      • Rich queries
      • Scales horizontally
      MongoDB query:
        db.users.find({last_name: "Smith", age: {$gt: 10}});
      Equivalent SQL:
        SELECT * FROM users WHERE last_name = 'Smith' AND age > 10;
    • Computing Aggregations in Databases
      • SQL-based RDBMS: JOIN, GROUP BY, AVG(), COUNT(), SUM(), FIRST(), LAST(), etc.
      • MongoDB 2.0: MapReduce
      • MongoDB 2.2+: MapReduce, Aggregation Framework
    • MapReduce
        var map = function() { ... emit(key, val); }
        var reduce = function(key, vals) { ... return resultVal; }
      Data flow: Data → Map() → emit(k,v) → Sort(k) → Group(k) → Reduce(k,values) → k,v → Finalize(k,v) → k,v
      • map: MongoDB iterates over the documents, one at a time per shard; the current document is available as this
      • reduce: input matches output; can run multiple times
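The map, group, reduce flow on this slide can be sketched in plain JavaScript. This is only an in-memory emulation, not MongoDB's implementation: the sample documents, the `mapReduce` helper, and passing `emit` as an argument (in MongoDB it is a global inside `map`) are all assumptions made for illustration.

```javascript
// Hypothetical sample documents.
const docs = [
  { author: "bob", pageViews: 5 },
  { author: "jane", pageViews: 51 },
  { author: "bob", pageViews: 14 },
];

// map: called once per document, with the document bound to `this`.
function map(emit) {
  emit(this.author, this.pageViews);
}

// reduce: may run several times for one key, so its return value
// must have the same shape as the values it receives.
function reduce(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Hypothetical driver: collect emitted values per key, then reduce each key.
function mapReduce(input, mapFn, reduceFn) {
  const groups = new Map();
  for (const doc of input) {
    mapFn.call(doc, (k, v) => {
      if (!groups.has(k)) groups.set(k, []);
      groups.get(k).push(v);
    });
  }
  const out = {};
  for (const k of [...groups.keys()].sort()) {
    out[k] = reduceFn(k, groups.get(k));
  }
  return out;
}

const viewsPerAuthor = mapReduce(docs, map, reduce);
console.log(viewsPerAuthor);
```

Even for a simple per-author sum, two hand-written functions are needed, which is the overhead the aggregation framework is meant to remove.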
    • What’s wrong with just using MapReduce?
      • Map/Reduce is very powerful, but often overkill
      • Lots of users relying on it for simple aggregation tasks
      • Easy to screw up JavaScript
      • Debugging an M/R job sucks
      • Writing more JS for simple tasks should not be necessary
      (ಠ︿ಠ)
    • Aggregation Framework
      • Declarative (no need to write JS)
      • Implemented directly in C++
      • Expression evaluation
      • Return computed values
      • Framework: we can extend it with new ops
    • Input data (collection) → Filter → Project → Unwind → Group → Sort → Limit → Result (document)
    • An aggregation command looks like:
        db.article.aggregate(
          { $project : { author : 1, tags : 1 } },
          { $unwind : "$tags" },
          { $group : { _id : "$tags", authors : { $addToSet : "$author" } } }
        );
      • New helper method: .aggregate()
      • Its arguments form the operator pipeline
      • Equivalent command form:
        db.runCommand({ aggregate : "article", pipeline : [ {$op1}, {$op2}, ... ] })
    • Output document looks like this:
        {
          "result" : [
            { "_id" : "art", "authors" : [ "bill", "bob" ] },
            { "_id" : "sports", "authors" : [ "jane", "bob" ] },
            { "_id" : "food", "authors" : [ "jane", "bob" ] },
            { "_id" : "science", "authors" : [ "jane", "bill", "bob" ] }
          ],
          "ok" : 1
        }
      • result: array of pipeline output
      • ok: 1 for success, 0 otherwise
    • Pipeline
      • Input to the start of the pipeline is a collection
      • Series of operators, each of which filters or transforms its input
      • Passes output data to the next operator in the pipeline
      • Output of the pipeline is the result document
      Kind of like UNIX:
        ps -ax | tee processes.txt | more
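The pipeline idea can be illustrated without a database: each stage is a function from an array of documents to a new array, and the output of one stage feeds the next. The helper names `match`, `limit`, and `runPipeline` below are hypothetical stand-ins, not MongoDB APIs, and the documents are made-up samples.

```javascript
// A stage is a function: array of documents in, array of documents out.
const match = (predicate) => (docs) => docs.filter(predicate);
const limit = (n) => (docs) => docs.slice(0, n);

// Run the stages left to right, feeding each stage's output to the next,
// much like a chain of UNIX pipes.
const runPipeline = (docs, ...stages) =>
  stages.reduce((current, stage) => stage(current), docs);

// Hypothetical input collection.
const articles = [
  { author: "bob", pageViews: 5 },
  { author: "joe", pageViews: 52 },
  { author: "jane", pageViews: 51 },
  { author: "bob", pageViews: 53 },
];

const top = runPipeline(
  articles,
  match((doc) => doc.pageViews > 50),
  limit(2)
);
console.log(top);
```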
    • Let’s do:
      1. Tour of the pipeline operators: $match, $unwind, $group, $project, $skip, $limit, $sort
      2. A couple of examples based on common SQL aggregation tasks
    • $match
      Filters documents from the pipeline with a query predicate.
      Input:
        {author: "bob", pageViews: 5, title: "Lorem Ipsum..."}
        {author: "bill", pageViews: 3, title: "dolor sit amet..."}
        {author: "joe", pageViews: 52, title: "consectetur adipi..."}
        {author: "jane", pageViews: 51, title: "sed diam..."}
        {author: "bob", pageViews: 14, title: "magna aliquam..."}
        {author: "bob", pageViews: 53, title: "claritas est..."}
      Filtered with {$match: {author: "bob"}}:
        {author: "bob", pageViews: 5, title: "Lorem Ipsum..."}
        {author: "bob", pageViews: 14, title: "magna aliquam..."}
        {author: "bob", pageViews: 53, title: "claritas est..."}
      Filtered with {$match: {pageViews: {$gt: 50}}}:
        {author: "joe", pageViews: 52, title: "consectetur adipiscing..."}
        {author: "jane", pageViews: 51, title: "sed diam..."}
        {author: "bob", pageViews: 53, title: "claritas est..."}
    • $unwind
      Produces a new document for each value in an input array.
      Input:
        {"_id" : ObjectId("4f...146"), "author" : "bob", "tags" : ["fun", "good", "awesome"]}
      Explode the "tags" array with:
        { $unwind : "$tags" }
      Produces output:
        { _id : ObjectId("4f...146"), author : "bob", tags : "fun" },
        { _id : ObjectId("4f...146"), author : "bob", tags : "good" },
        { _id : ObjectId("4f...146"), author : "bob", tags : "awesome" }
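What $unwind does to each document can be sketched in plain JavaScript. The `unwind` helper is hypothetical, and dropping documents that lack the array is an assumption of this sketch rather than a statement about any particular MongoDB version.

```javascript
// Emit one copy of each document per array element, with the array field
// replaced by that single element. Documents without the array yield
// nothing (an assumption of this sketch).
function unwind(docs, field) {
  return docs.flatMap((doc) =>
    (doc[field] || []).map((value) => ({ ...doc, [field]: value }))
  );
}

// Hypothetical input, modelled on the slide (numeric _id for brevity).
const posts = [{ _id: 1, author: "bob", tags: ["fun", "good", "awesome"] }];

const unwound = unwind(posts, "tags");
console.log(unwound);
```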
    • $group
      Bucket a subset of docs together, calculate an aggregated output doc from the bucket.
      Output calculation operators: $sum, $max, $min, $avg, $first, $last, $addToSet, $push
        db.article.aggregate(
          { $group : { _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } } }
        );
    • $group (cont’d)
        db.article.aggregate(
          { $group : { _id : "$author", viewsPerAuthor : { $sum : "$pageViews" } } }
        );
      • _id: selects a field to use as the bucket key for grouping. Also allowed here: dot notation (nested fields), a constant, a multi-key expression inside {...}
      • viewsPerAuthor: output field name
      • $sum: operation used to calculate the output value ($sum, $max, $avg, etc.)
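The bucketing behind $group with $sum can be sketched as a plain JavaScript fold over an array. The `groupSum` helper is hypothetical, and the documents reuse the author and pageViews values from the $match slide; the output order here simply follows first appearance of each key, which is a property of this sketch only.

```javascript
// Bucket documents by keyField and sum sumField within each bucket,
// roughly what { $group: { _id: "$author", total: { $sum: "$pageViews" } } } computes.
function groupSum(docs, keyField, sumField) {
  const totals = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    totals.set(key, (totals.get(key) || 0) + doc[sumField]);
  }
  return [...totals].map(([_id, viewsPerAuthor]) => ({ _id, viewsPerAuthor }));
}

const articles = [
  { author: "bob", pageViews: 5 },
  { author: "bill", pageViews: 3 },
  { author: "joe", pageViews: 52 },
  { author: "jane", pageViews: 51 },
  { author: "bob", pageViews: 14 },
  { author: "bob", pageViews: 53 },
];

const grouped = groupSum(articles, "author", "pageViews");
console.log(grouped);
```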
    • An example with $match and $group
      English: Find the sum of all prices of the orders placed by customer #4
      SQL:
        SELECT SUM(price) FROM orders WHERE customer_id = 4;
      MongoDB:
        db.orders.aggregate(
          { $match : { customer_id : 4 } },
          { $group : { _id : null, total : { $sum : "$price" } } }
        )
    • An example with $unwind and $group
      English: For all tags used in blog posts, produce a list of authors that have posted under each tag
      SQL:
        SELECT tag, author FROM post_tags LEFT JOIN posts ON post_tags.post_id = posts.id GROUP BY tag, author;
      MongoDB:
        db.posts.aggregate(
          { $unwind : "$tags" },
          { $group : { _id : "$tags", authors : { $addToSet : "$author" } } }
        );
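The combined effect of $unwind followed by $group with $addToSet can be traced by hand in plain JavaScript. The sample posts are hypothetical, and a `Set` stands in for $addToSet; this is an emulation of the logic, not of MongoDB's internals.

```javascript
// Hypothetical blog posts.
const posts = [
  { author: "bob", tags: ["art", "sports"] },
  { author: "jane", tags: ["sports"] },
  { author: "bob", tags: ["art"] },
];

const authorsByTag = new Map();
for (const post of posts) {
  for (const tag of post.tags) {
    // Inner loop plays the role of $unwind: one step per (post, tag) pair.
    if (!authorsByTag.has(tag)) authorsByTag.set(tag, new Set());
    // Set.add plays the role of $addToSet: duplicates are ignored.
    authorsByTag.get(tag).add(post.author);
  }
}

const tagAuthors = [...authorsByTag].map(([tag, authors]) => ({
  _id: tag,
  authors: [...authors],
}));
console.log(tagAuthors);
```

Note that "bob" appears only once under "art" even though he posted twice with that tag.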
    • More operators: controlling pipeline input
      • $skip, $limit, $sort
      • Similar to .skip(), .limit(), .sort() in a regular Mongo query
    • $sort
      • Orders input documents
      • Specified the same way as index keys: { $sort : { name : 1, age : -1 } }
      • Must be used in order to take advantage of $first/$last with $group
    • $limit and $skip
      • $limit: limits the number of input documents
        { $limit : 5 }
      • $skip: skips over documents
        { $skip : 5 }
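The three ordering stages can be emulated on an in-memory array. The `sortBy` helper below is hypothetical; it takes a specification shaped like the $sort document (1 for ascending, -1 for descending), and `slice` stands in for $skip and $limit. The sample people are made up.

```javascript
// Sort by several fields; spec mirrors { $sort: { name: 1, age: -1 } }.
function sortBy(docs, spec) {
  const keys = Object.entries(spec);
  return [...docs].sort((a, b) => {
    for (const [field, direction] of keys) {
      if (a[field] < b[field]) return -direction;
      if (a[field] > b[field]) return direction;
    }
    return 0;
  });
}

const people = [
  { name: "ann", age: 30 },
  { name: "bob", age: 25 },
  { name: "ann", age: 41 },
  { name: "cat", age: 19 },
];

const page = sortBy(people, { name: 1, age: -1 })
  .slice(1)     // like { $skip: 1 }
  .slice(0, 2); // like { $limit: 2 }
console.log(page);
```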
    • $project: reshape a document
      Use for:
      • Add, remove, pull up, push down, rename fields
      • Building computed fields
    • $project (cont’d): include or exclude fields
      • Only pass on fields "title" and "author":
        { $project : { title : 1, author : 1 } }
      • Exclude the "comments" field, keep everything else:
        { $project : { comments : 0 } }
    • $project (cont’d): moving + renaming fields
        { $project : {
            page_views : "$pageViews",
            catName : "$category.name",
            info : { published : "$ctime", update : "$mtime" }
        } }
      • page_views: renames pageViews to page_views
      • catName: takes the nested field "category.name" and moves it into a top-level field called "catName"
      • info: populates a new sub-document in the output
    • $project (cont’d): building a computed field
        db.article.aggregate(
          { $project : { name : 1, age_fixed : { $add : ["$age", 2] } } }
        );
      • age_fixed: output (computed field)
      • $add: expression
      • ["$age", 2]: operands
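The reshaping done by this $project stage corresponds to a simple per-document mapping. The sketch below is plain JavaScript on hypothetical sample data; it carries `_id` through because $project includes `_id` unless it is explicitly excluded.

```javascript
// Hypothetical input documents.
const people = [
  { _id: 1, name: "ann", age: 30, email: "ann@example.com" },
  { _id: 2, name: "bob", age: 25, email: "bob@example.com" },
];

// Equivalent in spirit to:
//   { $project: { name: 1, age_fixed: { $add: ["$age", 2] } } }
const projected = people.map((doc) => ({
  _id: doc._id,            // kept by default
  name: doc.name,          // name: 1 passes the field through
  age_fixed: doc.age + 2,  // computed field from $add
}));
console.log(projected);
```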
    • $project (cont’d): lots of available expressions
      • Numeric: $add $subtract $mod $divide $multiply
      • Logical: $eq $lte/$lt $gte/$gt $and $not $or
      • Dates: $dayOfMonth $dayOfYear $dayOfWeek $second $minute $hour $week $month $isoDate
      • Strings: $substr $add $toLower $toUpper $strcasecmp
    • Example: $sort → $limit → $project → $group
      English: Of the most recent 1000 blog posts, how many were posted within each calendar year?
      SQL:
        SELECT YEAR(pub_time) AS pub_year, COUNT(*)
        FROM (SELECT pub_time FROM posts ORDER BY pub_time DESC LIMIT 1000) AS recent
        GROUP BY pub_year;
      MongoDB:
        db.test.aggregate(
          { $sort : { pub_time : -1 } },
          { $limit : 1000 },
          { $project : { pub_year : { $year : ["$pub_time"] } } },
          { $group : { _id : "$pub_year", num_year : { $sum : 1 } } }
        )
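The four stages of this example can be followed step by step on a small in-memory array. This is a plain JavaScript sketch with hypothetical dates and a limit of 3 instead of 1000; `getUTCFullYear` stands in for $year.

```javascript
// Hypothetical posts with publication times.
const posts = [
  { pub_time: new Date("2012-03-01T00:00:00Z") },
  { pub_time: new Date("2011-07-15T00:00:00Z") },
  { pub_time: new Date("2012-01-20T00:00:00Z") },
  { pub_time: new Date("2010-05-05T00:00:00Z") },
];

const N = 3; // stands in for { $limit: 1000 }

// $sort (newest first) followed by $limit.
const recent = [...posts]
  .sort((a, b) => b.pub_time - a.pub_time)
  .slice(0, N);

// $project (extract the year) followed by $group with { $sum: 1 }.
const counts = new Map();
for (const { pub_time } of recent) {
  const year = pub_time.getUTCFullYear();
  counts.set(year, (counts.get(year) || 0) + 1);
}

const perYear = [...counts].map(([_id, num_year]) => ({ _id, num_year }));
console.log(perYear);
```

Sorting and limiting before grouping keeps the grouped set small, which matches the usage advice in the slides.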
    • Some Usage Notes
      • In BSON, order matters, so computed fields always show up after regular fields
      • We use $ in front of field names to distinguish fields from string literals in expressions: "$name" vs. "name"
    • Some Usage Notes
      • Use $match, $sort and $limit first in the pipeline if possible
      • Cumulative operators such as $group: be aware of memory usage
      • Use $project to discard unneeded fields
      • Remember the 16MB output limit
    • Aggregation vs. MapReduce
      • Framework is geared towards counting/accumulating
      • If you need something more exotic, use MapReduce
      • No 16MB constraint on output size with MapReduce
      • JS in M/R is not limited to any fixed set of expressions
    • thanks! ✌(-‿-)✌ questions?
      • $$$ BTW: we are hiring! http://10gen.com/jobs $$$
      • hit me up: @mpobrien, github.com/mpobrien