Speaker: Ruben Terceño, Senior Solutions Architect, MongoDB
Level: 200 (Intermediate)
Track: Jumpstart
The MongoDB aggregation framework allows you to perform real-time analytics on your live operational data set. It's an important tool to understand when considering analytics options for your application. In this session we will give you an overview of basic aggregation functionality. You should walk away with an understanding of when to use the aggregation framework for your needs and how to leverage different functions for different purposes.
This is a Jumpstart session, held before the keynotes, designed to give you an overview of MongoDB aggregation basics so you can dive into more advanced sessions later in the day.
What You Will Learn:
- Discover the Aggregation Framework
- Understand the sweet spot for MongoDB Analytics
- Have fun crunching numbers!
6. #MDBW17
THAT MANAGER
• The CRM system is the main source of revenue information.
• Up-to-date information is critical for our business owners.
• Grouped data is much more valuable when making decisions.
• Graphs are a powerful means of presenting grouped information.
10. #MDBW17
STEP 1: ANALYTICS ON THE OPERATIONAL DB
• Running Analytics on your operational database.
‒ Analytical workload affects operational users
o Lots of table scans and heavy counts and groups.
13. #MDBW17
STEP 2: ETL AND OLAP
• ETLing your data into an analytical dedicated database.
‒ Longer time to react to business requests.
o Every change affects four systems.
‒ Lack of accuracy on real time reports.
o Data synchronization was happening overnight, so today’s report is on yesterday’s data.
15. #MDBW17
STEP 3: DEDICATED NICHE PRODUCTS
• Real-Time data replication (CDC), embedded BI capabilities,
dedicated hardware.
‒ New skills required in the team.
o Hardware, CDC, Middleware, Java, UI.
‒ The solution reliability was low.
o Too many moving parts.
o Monitoring and debugging were complex.
‒ Cost was very high.
o More expensive than the CRM itself!
16. #MDBW17
SO… WHAT DO WE NEED?
• Analytical capabilities.
• Simple Architecture.
• Workload isolation.
• Real time data.
• High Availability.
• Cost aligned with provided value.
18. #MDBW17
MONGODB AGGREGATION FRAMEWORK
• A Series of Document Transformations.
‒ Executed in stages.
o Original input is a collection.
o Output of one stage sent as input of next.
o Output as a cursor or a collection.
• Rich Library of Functions.
‒ Filter, manipulate, group, join and summarize data.
• Optimized for performance.
‒ Index support, for example for $match and $sort stages at the start of the pipeline.
‒ Stages run in sequential order; the optimizer reorders or merges stages when possible.
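To make the stage-by-stage idea concrete, here is a minimal plain-JavaScript emulation. It is not MongoDB code: each stage is just a function from an array of documents to a new array, and the output of one stage becomes the input of the next. The sample documents, field names and the `runPipeline` helper are made up for illustration.

```javascript
// Made-up sample documents standing in for a collection.
const sampleShips = [
  { name: "Ship A", country: "United States", tons: 120 },
  { name: "Ship B", country: "Spain", tons: 300 },
  { name: "Ship C", country: "United States", tons: 450 },
];

// Each "stage" takes an array of documents and returns a new array.
const stages = [
  (docs) => docs.filter((d) => d.country === "United States"), // like $match
  (docs) => docs.map((d) => ({ name: d.name, tons: d.tons })), // like $project
  (docs) => [...docs].sort((a, b) => b.tons - a.tons),         // like $sort, descending
];

// Hypothetical helper: feed the output of each stage into the next one.
function runPipeline(input, stageList) {
  return stageList.reduce((docs, stage) => stage(docs), input);
}

const result = runPipeline(sampleShips, stages);
console.log(result);
```

In the real framework the stages are JSON documents such as `{$match: ...}` and the server runs them, but the data flow is the same.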
27. #MDBW17
PROBLEM DESCRIPTION
• A database holds the biggest ships out there and, in a separate collection, their containers (shipping containers, not Docker).
• Cargo information is stored at container level, but we need it at ship level, where information such as the destination lives.
• We want to know the cargo of each ship so we can answer questions like: which ships are currently in the North Atlantic, bound for the US, carrying more than 100000 metric tons (TM) of iron?
28. #MDBW17
BUILDING THE AGGREGATION STEP BY STEP
• We’ll create one variable for every stage of the aggregation pipeline, so we can easily build and test it.
var myMatch = {some JSON};
var myGroup = {other JSON};
var mySort = {more JSON};
db.ships.aggregate([myMatch, myGroup, mySort])
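One way to use this variable-per-stage approach is to run the pipeline one prefix at a time and inspect the output after each added stage. The sketch below uses placeholder stage bodies (the real ones are built on the following slides) and only calls the database when a shell-provided `db` object exists, so it also runs as plain JavaScript.

```javascript
// Placeholder stages, assumed for illustration only.
const myMatch = { $match: {} };
const myGroup = { $group: { _id: null, count: { $sum: 1 } } };
const mySort = { $sort: { count: -1 } };

const pipeline = [myMatch, myGroup, mySort];

// Every prefix of a pipeline is itself a valid pipeline:
// [myMatch], [myMatch, myGroup], [myMatch, myGroup, mySort].
const prefixes = pipeline.map((_, i) => pipeline.slice(0, i + 1));

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  prefixes.forEach((p) => printjson(db.ships.aggregate(p).toArray()));
}
```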
29. #MDBW17
ALL SHIPS WITHIN THE NORTH ATLANTIC
• Our first stage is a $match. It allows us to filter the vessels. Let’s find all ships in the North Atlantic heading to US ports.
var match = {$match :
{location: {
$geoWithin: { $geometry : atlantic}},
"route.destination.Country": "United States"}}
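The slide does not show how `atlantic` is defined. `$geometry` expects a GeoJSON Polygon or MultiPolygon, so a minimal sketch could look like the following. The coordinates are a hypothetical, very rough box and not the geometry used in the talk, and the sketch assumes `location` holds a GeoJSON point.

```javascript
// Hypothetical rough box over part of the North Atlantic.
// GeoJSON positions are [longitude, latitude], and the ring
// must end on the same position it starts with.
const atlantic = {
  type: "Polygon",
  coordinates: [[
    [-70, 20], [-10, 20], [-10, 60], [-70, 60], [-70, 20],
  ]],
};

const match = {
  $match: {
    location: { $geoWithin: { $geometry: atlantic } },
    "route.destination.Country": "United States",
  },
};

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.ships.aggregate([match]).toArray());
}
```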
30. #MDBW17
FINDING THE CONTAINERS OF EACH SHIP
• The containers are in a different collection. To find the containers of each ship, let’s join the two collections. The $lookup stage allows us to do this.
var lookup = {$lookup :
{from: "containers",
as: "cargo",
localField: "Name",
foreignField: "shipName"}}
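To see what this join produces, here is a plain-JavaScript emulation on made-up sample data (not MongoDB code). `$lookup` behaves like a left outer join: every ship is kept, and a ship with no matching containers gets an empty `cargo` array.

```javascript
// Made-up sample data standing in for the two collections.
const ships = [{ Name: "Ship A" }, { Name: "Ship B" }];
const containers = [
  { shipName: "Ship A", cargo: "Iron", Tons: 20 },
  { shipName: "Ship A", cargo: "Coal", Tons: 15 },
  { shipName: "Ship C", cargo: "Iron", Tons: 30 },
];

// Emulates {from: "containers", as: "cargo",
//           localField: "Name", foreignField: "shipName"}.
const joined = ships.map((ship) => ({
  ...ship,
  cargo: containers.filter((c) => c.shipName === ship.Name),
}));

console.log(JSON.stringify(joined, null, 2));
```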
31. #MDBW17
MANIPULATING THE ARRAY
• That huge array is not going to be usable, so let’s transform it into something easier to handle. The $unwind stage emits one document per array element.
var unwind = {$unwind: "$cargo"}
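A plain-JavaScript emulation on made-up sample data shows the effect of unwinding (this is not MongoDB code). Each array element becomes its own document, and by default a document whose array is empty produces no output.

```javascript
// Made-up output of the previous join step.
const shipsWithCargo = [
  {
    Name: "Ship A",
    cargo: [
      { cargo: "Iron", Tons: 20 },
      { cargo: "Coal", Tons: 15 },
    ],
  },
  { Name: "Ship B", cargo: [] },
];

// Emulates {$unwind: "$cargo"}: one output document per array element,
// with the array field replaced by that single element.
const unwound = shipsWithCargo.flatMap((ship) =>
  ship.cargo.map((item) => ({ ...ship, cargo: item }))
);

console.log(unwound);
```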
32. #MDBW17
GROUPING BY SHIP AND CARGO TYPE
• This stage groups the individual documents by ship and cargo type, then counts the containers and adds up the tons for each ship and cargo type.
var group = {$group :
{_id: {ship: "$Name",
cargo : "$cargo.cargo",
route: "$route",
location: "$location"},
sum: {$sum: "$cargo.Tons"},
count : {$sum: 1}}}
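The grouping logic can be sketched in plain JavaScript on made-up sample data (not MongoDB code). For brevity the emulated key holds only `ship` and `cargo`; the stage on the slide also carries `route` and `location` in `_id`.

```javascript
// Made-up unwound documents: one per container.
const unwoundDocs = [
  { Name: "Ship A", cargo: { cargo: "Iron", Tons: 20 } },
  { Name: "Ship A", cargo: { cargo: "Iron", Tons: 5 } },
  { Name: "Ship A", cargo: { cargo: "Coal", Tons: 15 } },
];

// Emulates $group with {$sum: "$cargo.Tons"} and {$sum: 1}.
const groups = new Map();
for (const doc of unwoundDocs) {
  const id = { ship: doc.Name, cargo: doc.cargo.cargo };
  const key = JSON.stringify(id); // compound key as a string
  const group = groups.get(key) || { _id: id, sum: 0, count: 0 };
  group.sum += doc.cargo.Tons; // total tons per ship and cargo type
  group.count += 1;            // number of containers in the group
  groups.set(key, group);
}

const grouped = [...groups.values()];
console.log(grouped);
```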
33. #MDBW17
MANIPULATING THE FIELD NAMES
• It’s possible to change the shape of our documents at any point thanks to the $project stage. Let’s put the cargo info in a subdocument.
var project = {$project: {
_id : {ship: "$_id.ship", route: "$_id.route",
location: "$_id.location"},
cargo : { type : "$_id.cargo",
tons: "$sum",
count: "$count"}}}
34. #MDBW17
GROUPING BY SHIP
• And now let’s group again only by ship. The different cargos of each
ship will be pushed into a newly created array of documents.
var group2 = {$group : {
_id: "$_id",
cargo: {$push: "$cargo"}}}
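The effect of `$push` in this second grouping can be emulated in plain JavaScript on made-up sample data (not MongoDB code): documents sharing the same `_id` collapse into one, and their `cargo` subdocuments are collected into an array.

```javascript
// Made-up output of the previous $project stage
// (route and location omitted from _id for brevity).
const projected = [
  { _id: { ship: "Ship A" }, cargo: { type: "Iron", tons: 25, count: 2 } },
  { _id: { ship: "Ship A" }, cargo: { type: "Coal", tons: 15, count: 1 } },
  { _id: { ship: "Ship B" }, cargo: { type: "Iron", tons: 40, count: 3 } },
];

// Emulates {$group: {_id: "$_id", cargo: {$push: "$cargo"}}}.
const byShip = new Map();
for (const doc of projected) {
  const key = JSON.stringify(doc._id);
  if (!byShip.has(key)) byShip.set(key, { _id: doc._id, cargo: [] });
  byShip.get(key).cargo.push(doc.cargo);
}

const regrouped = [...byShip.values()];
console.log(JSON.stringify(regrouped, null, 2));
```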
35. #MDBW17
FINAL POLISHING
• Finally, let’s reorder our fields again with another $project stage.
var project2 = {$project: {_id: 0,
ship: "$_id.ship",
route: "$_id.route",
location: "$_id.location",
cargo: 1}}
36. #MDBW17
SAVING THE RESULTS
• We can store the results in a new collection using the $out stage.
var out = {$out: "result"}
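Once the results are stored, the question from the problem description can be asked with an ordinary query on the `result` collection. The sketch below is one possible way to phrase it and is not taken from the slides. It assumes the document shape produced by the final $project stage, where `cargo` is an array of `{type, tons, count}` subdocuments, and it assumes the cargo type is stored as the string "Iron".

```javascript
// Ships with a single cargo entry of type "Iron" over 100000 tons.
// $elemMatch makes both conditions apply to the same array element.
const ironQuery = {
  cargo: { $elemMatch: { type: "Iron", tons: { $gt: 100000 } } },
};

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.result.find(ironQuery).forEach(printjson);
}
```

The location and destination filters do not need repeating here, because the first $match stage already applied them before the results were written.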
37. #MDBW17
SHOW ME THE VOLUME!!
• Will it perform with a much larger volume? Let’s try with 5000 ships
and 21 million containers.
• Thanks to our step by step approach, we only need to build a new
lookup step.
var lookup2 = {"$lookup" : {
"from" : "containers2",
"as" : "cargo",
"localField" : "Name",
"foreignField" : "shipName"}}
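Swapping one stage is a small operation when the pipeline is an array of variables. The sketch below uses stand-in stages around the lookup. The index on the join key is an assumption: the slides do not say how the larger run was tuned, but an index on the `foreignField` is a common way to keep an equality `$lookup` fast.

```javascript
const lookup = { $lookup: {
  from: "containers", as: "cargo",
  localField: "Name", foreignField: "shipName" } };
const lookup2 = { $lookup: {
  from: "containers2", as: "cargo",
  localField: "Name", foreignField: "shipName" } };

// Stand-ins for the other stages built on the earlier slides.
const match = { $match: {} };
const unwind = { $unwind: "$cargo" };
const pipeline = [match, lookup, unwind];

// Replace only the lookup stage; everything else is reused as-is.
const pipeline2 = pipeline.map((stage) => (stage === lookup ? lookup2 : stage));

// Assumption: index the join key of the larger collection.
const joinIndex = { shipName: 1 };

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.containers2.createIndex(joinIndex);
  print(db.ships.aggregate(pipeline2).itcount());
}
```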
38. #MDBW17
COMMON PIPELINE OPERATORS
• $match
‒ Filter documents
• $project
‒ Reshape documents
• $group
‒ Summarize documents
• $lookup
‒ Join two collections together
• $unwind
‒ Expand an array
• $out
‒ Create new collections
• $sort
‒ Order documents
• $limit/$skip
‒ Paginate documents
• $facet
‒ Run several sub-pipelines on the same input
• $sample
‒ Select random documents
• $bucket
‒ Group documents by value range
• $redact
‒ Restrict documents
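As a small illustration of combining operators from this list, pagination is usually a $sort followed by $skip and $limit. The `pageStages` helper and the choice of `Name` as sort key are hypothetical.

```javascript
// Hypothetical helper: stages that return one page of results.
// Pages are numbered from 1; a sort is needed for stable paging.
function pageStages(pageNumber, pageSize, sortSpec) {
  return [
    { $sort: sortSpec },
    { $skip: (pageNumber - 1) * pageSize },
    { $limit: pageSize },
  ];
}

const secondPage = pageStages(2, 10, { Name: 1 });

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  printjson(db.ships.aggregate(secondPage).toArray());
}
```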
39. #MDBW17
SO… WHAT DO WE NEED?
• Analytical capabilities.
• Simple Architecture.
• Workload isolation.
• Real time data.
• High Availability.
• Cost aligned with provided value.
45. #MDBW17
SO… WHAT DO WE HAVE?
• Analytical capabilities. Native, rich and performant.
• Simple Architecture. No extra products, no data transfer.
• Workload isolation. Secondary reads.
• Real time data. Replication lag typically under 1 sec.
• High Availability. Native MongoDB replication and failover.
• Cost aligned with provided value. No extra servers or licenses.
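The "secondary reads" point can be sketched in the shell as follows, assuming a replica set deployment and a read-only pipeline (a pipeline ending in $out writes data, so it is left out here). The report itself, counting ships per destination country, is a made-up example that reuses the `route.destination.Country` field from the earlier slides.

```javascript
// Prefer secondaries so analytics do not compete with operational traffic.
const readMode = "secondaryPreferred";

// Made-up read-only report: number of ships per destination country.
const reportPipeline = [
  { $group: { _id: "$route.destination.Country", ships: { $sum: 1 } } },
  { $sort: { ships: -1 } },
];

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.getMongo().setReadPref(readMode);
  printjson(db.ships.aggregate(reportPipeline).toArray());
}
```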