Speaker: Asya Kamsky
Think you need to move your data "elsewhere" to do powerful analysis? Think again. The most efficient way to analyze your data is where it already lives. The MongoDB Aggregation Pipeline has been getting more and more powerful, and using new stages, expressions and tricks we can do extensive analysis of our data inside MongoDB Server.
69. #MDBTOUR
POWERFUL AGGREGATIONS
Understand stages
• Best order for performance
• Avoid unnecessary "blocking"
• Keep "streaming"
• Maximize use of indexes
• Early stages get the index!
• Liberally check explain() output
Understand expressions
• Schema manipulation
• Array transformation
Use functions
• Readable, debug-able, reusable
70. #MDBTOUR
THE FUTURE OF AGGREGATION
Better performance & optimizations
More stages & expressions
More options for output
Compass helper for aggregate
72. #MDBlocal
THANK YOU!
https://github.com/asya999/mdbw17
Editor's Notes
Store data the way you need to retrieve it. After the initial design, you'll need to retrieve the data again, but this time not in the way you wrote it.
If the data is important, you will need to analyze it –
analyze your successes, analyze your failures.
It's hard to anticipate all the kinds of analytics you'll want to do.
So what are the options to analyze the data you've stored in MongoDB?
High level: you have three options.
The first one is best when you need the aggregated results super fast – but it requires that you know all of them in advance, and you need a crystal ball for that. Anyway, that topic is "schema design".
Next, you can take the data and move it to another system for analysis.
Of course when I say another system I usually mean Hadoop or Spark – some massively parallel cluster that can do massive fancy computations super duper fast. But to see a potential downside of this approach, let me use an analogy (if you've heard my talks before, you know I love analogies).
I've been doing some home improvements and I'm a DIY kind of gal – now for every task there are lots of ways it can be done, and all of them might be "right", they just have different trade-offs. Let's say I have to cut a piece of wood – there are lots of tools that can do it, at many different price points...
Realistically I had two options. I could go to the local hardware store, where they have a big wood-cutting machine that could cut my piece of wood in seconds; it'd probably only cost me a couple of bucks. Or I could stay home and cut it myself with a saw I got for $2 at a garage sale. It takes longer – maybe 5 minutes instead of 5 seconds – but overall latency is still better because I saved myself a trip to the store.
Similarly, when your data is in MongoDB, to get it analyzed in the massive other cluster you already have...
So even though that cluster can analyze your data super fast, the extra latency of moving the data over might make this the wrong choice –
*IF* you can do the same analysis right in MongoDB. And that's our option three: do the aggregation in MongoDB.
Aggregation in MongoDB is not just for analysis of the data that you stored in the DB:
aggregation also lets you access system data.
We at MongoDB are more and more choosing to return data about the system to you
as the output of an aggregation stage in the aggregation pipeline.
What is an aggregation stage, and what is the aggregation pipeline?
It's a language for transforming data/results, including a number of stages, expressions and accumulators.
This being an advanced talk, I'm going to make this part very short. If you get lost, check out the docs and more basic tutorials...
Why do we call it a "pipe" or pipeline? As in, we let you pipe your data through some kind of "analysis".
Instead of *nix commands, it's stages, and what's flowing through them are documents.
What does the pipeline start with? Documents.
Where do they come from? A collection, a view, or a special source.
Each stage has documents enter it and documents exit from it.
The stages themselves are specified as documents.
Documents flow:
how many enter, and do they get changed/transformed?
There are 22 stages.
The way to think of them is in terms of how they act on the documents coming into them
to turn them into the documents coming out of them.
$group: decreases the count.
Some take system info as input.
Some transform (a little or a lot).
Some decrease the count, in absolute numbers or based on a condition.
Some (usually) increase it.
Follow the first document through the pipeline.
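To make "documents flow" concrete, here is a minimal sketch in plain JavaScript: the `pipeline` array uses real shell syntax, but the `orders` data, its field names, and the hand-rolled execution of each stage are made up purely to illustrate documents entering and exiting stages.

```javascript
// A pipeline is just an array of stage documents. The data below is
// hypothetical; each stage is simulated on an in-memory array to show
// how many documents enter a stage and how many exit.
const orders = [
  { item: "abc", qty: 10, status: "A" },
  { item: "xyz", qty: 5,  status: "D" },
  { item: "ijk", qty: 20, status: "A" },
];

// What you'd actually send: db.orders.aggregate(pipeline)
const pipeline = [
  { $match: { status: "A" } },        // 3 documents enter, 2 exit
  { $project: { item: 1, qty: 1 } },  // same count, reshaped documents
];

// Tiny stand-in for the server: $match filters, $project reshapes.
const afterMatch = orders.filter(d => d.status === "A");
const result = afterMatch.map(d => ({ item: d.item, qty: d.qty }));
```

Following the first document through: `{ item: "abc" }` passes `$match` (its status is "A"), then `$project` strips it down to just `item` and `qty`.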
BLOCKING STAGES.
The thing to remember is: "you send stages to the system to tell it what you want to accomplish;
the stages it runs may be different, because it's allowed to shuffle things around in order to optimize the performance."
Five stages in, four "things" out – the last one:
$sort+$limit coalesce into a single operation.
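Why the $sort+$limit coalescence matters: the server never has to materialize the full sorted set, only the current top N. A minimal sketch of that idea, with made-up scores (the `s` field and the data are purely illustrative):

```javascript
// Pipeline as sent: [ { $sort: { s: -1 } }, { $limit: 2 } ]
const scores = [{ s: 7 }, { s: 2 }, { s: 9 }, { s: 4 }, { s: 1 }];

// Naive execution: sort everything, then cut.
const naive = [...scores].sort((a, b) => b.s - a.s).slice(0, 2);

// Coalesced idea: keep only a running top-2 as documents stream in,
// so memory use is bounded by the limit, not the input size.
const top2 = scores.reduce((top, d) => {
  top.push(d);
  top.sort((a, b) => b.s - a.s);
  return top.slice(0, 2);   // never holds more than 2 documents
}, []);
```

Both approaches produce the same two documents; only the amount of work and memory differs.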
$cursor stage – the pipeline starts at $unwind; $cursor just tells you what the *source* of the aggregation is. It's a collection – or rather, the cursor you get when you do a *find* on the collection, and that find has whatever query we can push down to it... the aggregation starts at the blue arrow! What if there is no $match?
If there's no $match, the pushed-down query is empty.
What if we remove the $match?
NEXT: $project – a word about $project.
When the aggregation asks the query subsystem for the documents, not only does it try to push down the query (and the sort),
it also figures out which fields are necessary to accomplish the entire pipeline and asks only for those fields.
So usually you do NOT need to add a $project just to exclude some fields.
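The two optimizations above can be sketched together. The pipeline below uses real shell syntax with a hypothetical collection and field names; the loop underneath just counts work to show why filtering before transforming is cheaper:

```javascript
// The optimizer pushes $match toward the source and fetches only the
// fields the whole pipeline needs (here: status, item, qty) – so no
// manual $project is required. Collection and fields are made up.
const pipeline = [
  { $match: { status: "A" } },   // pushed down into the find()
  { $group: { _id: "$item", total: { $sum: "$qty" } } },
];

// Stand-in data: count how many documents the $group stage touches
// once $match has run first.
const docs = [
  { item: "a", qty: 1, status: "A" },
  { item: "b", qty: 2, status: "D" },
  { item: "a", qty: 3, status: "A" },
];
let touched = 0;
const totals = {};
for (const d of docs.filter(d => d.status === "A")) {
  touched++;                                    // only matching docs
  totals[d.item] = (totals[d.item] || 0) + d.qty;
}
```

With `$match` first, the grouping work runs on 2 documents instead of 3; on a real collection the difference is what the index (and explain() output) shows you.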
So that was stages – what about what you can do inside each stage? I'll go over some of the more powerful expressions (schema, arrays).
All examples are on GitHub.
This one is from a Stack Overflow question.
3.6 will have $mergeObjects, but that's okay – we can simulate it with $objectToArray and array manipulations.
Don't be afraid to split a stage into multiple $addFields stages for clarity, readability and correctness!
you can always merge them later if you don't want anyone else to be able to understand what you are doing
Now, here in the middle we need to merge things together to get the result array, so
we'll be using "$concatArrays" for this purpose.
What does this look like as one stage?
It turns out it's quite readable!
The reason is that I use "$let" to define a variable "elem"
and then use its components – but this is the same thing I did in five stages before.
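The $objectToArray / $concatArrays / $arrayToObject trick can be mirrored step by step in plain JavaScript. The document shape below is made up for illustration; each helper imitates what the corresponding expression does (for $arrayToObject, the last value for a repeated key wins, which is what makes the merge work):

```javascript
// Simulating $mergeObjects (pre-3.6) with $objectToArray +
// $concatArrays + $arrayToObject. Hypothetical document:
const doc = { defaults: { a: 1, b: 2 }, overrides: { b: 3, c: 4 } };

// $objectToArray: { a: 1 } -> [ { k: "a", v: 1 } ]
const toKV = obj => Object.entries(obj).map(([k, v]) => ({ k, v }));

// $concatArrays: entries from the second object come after the first
const kvs = [...toKV(doc.defaults), ...toKV(doc.overrides)];

// $arrayToObject: rebuild the object; a later duplicate key overwrites
// the earlier one, so overrides win
const merged = kvs.reduce((o, { k, v }) => ({ ...o, [k]: v }), {});
// merged is { a: 1, b: 3, c: 4 }
```

The same shape works as a single `$addFields` stage in the shell, or as several smaller ones if you want each step visible.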
So we already saw that I had to do some array manipulations for schema transformations. Arrays are a big strength of MongoDB, so we want to know how to handle them in aggregations in ways that let you avoid unnecessary "$unwind"ing and re-grouping of large arrays.
Of the 100+ expressions, over a dozen are for dealing with arrays.
We used these, but the most important and powerful ones are:
Just like stages output a certain number of documents relative to how many they get as input,
so do these array expressions.
$map: {input: "$array", ...} outputs an array of the same size – it gives you each element and outputs a single "thing" for each one (i.e. array in, array out). $filter outputs a subset of the exact same array it was passed – for each element, it outputs it if the condition is true.
$reduce: array in, single value out – which can be an array, a document (array of documents), or a scalar value.
No matter what I do, I'll get back out four elements, but I can do it a number of different ways.
$filter – condition; you could get back [] or the entire input array.
$reduce lets you specify what your result looks like at the beginning, before you've iterated over any elements of the "input" array.
That $reduce expression is what {$sum: []} does.
Notice that the second $reduce expression is equivalent to $reverseArray. What that means is that you can write any array-processing expression yourself using these building blocks – the additional ones we give you are just syntactic sugar.
Obviously this is just an example, since the $reverseArray expression exists, but if it didn't, you could express it with $reduce.
And it's not just array expressions – many others too: calculations, etc.
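The array-in/array-out behavior of the three core expressions maps directly onto plain JavaScript, which makes it a handy mental model (the input data is made up; the comments name the aggregation expression each line imitates):

```javascript
// $map: array in, array of the same size out
// $filter: array in, subset of the same array out
// $reduce: array in, single value out (which may itself be an array)
const input = [1, 2, 3, 4];

const mapped   = input.map(x => x * 10);              // like $map
const filtered = input.filter(x => x % 2 === 0);      // like $filter
const summed   = input.reduce((acc, x) => acc + x, 0); // $sum via $reduce

// $reverseArray written as a $reduce: prepend each element to the
// accumulator – i.e. { $concatArrays: [ ["$$this"], "$$value" ] },
// starting from the initial value []
const reversed = input.reduce((acc, x) => [x, ...acc], []);
```

As on the slide: `mapped` always has exactly as many elements as `input`, `filtered` may be empty or the whole array, and `reversed` shows a "sugar" expression rebuilt from $reduce.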
We have all the login records, with the time and the IP address they logged in from.
We'd like to check: for a particular time period, did any user log in from more than one IP within some interval?
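One way the login question could be sketched: group the window's logins per user with $addToSet and keep users with more than one distinct IP. The field names, sample data, and window bounds below are all hypothetical; the in-memory loop mirrors what the commented pipeline would do on the server:

```javascript
// Hypothetical login records: user, source IP, timestamp
const logins = [
  { user: "ann", ip: "1.1.1.1", t: 10 },
  { user: "ann", ip: "2.2.2.2", t: 20 },
  { user: "bob", ip: "3.3.3.3", t: 15 },
  { user: "bob", ip: "3.3.3.3", t: 40 },
];

// Pipeline shape you might send (start/end are placeholders):
// [ { $match: { t: { $gte: start, $lt: end } } },
//   { $group: { _id: "$user", ips: { $addToSet: "$ip" } } },
//   { $match: { "ips.1": { $exists: true } } } ]   // >1 distinct IP

// The same logic on the sample, for the window [0, 30):
const windowed = logins.filter(l => l.t >= 0 && l.t < 30);
const byUser = {};
for (const { user, ip } of windowed) {
  if (!byUser[user]) byUser[user] = new Set();  // like $addToSet
  byUser[user].add(ip);
}
const suspicious = Object.keys(byUser).filter(u => byUser[u].size > 1);
// suspicious is ["ann"]
```

Sliding the interval (or bucketing by time inside $group) would extend this to "within some interval" rather than one fixed window.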
In terms of performance: you send stages to the system to tell it what you want to accomplish; the stages it actually runs may be different, because it's allowed to shuffle things around in order to optimize the performance.