Joins and Other MongoDB 3.2 Aggregation Enhancements

MongoDB 3.2 – $lookup and Other Aggregation Enhancements
Andrew Morgan
@clusterdb
clusterdb.com
andrew.morgan@mongodb.com
17th November 2015

DISCLAIMER: MongoDB's product
plans are for informational purposes
only. MongoDB's plans may change
and you should not rely on them for
delivery of a specific feature at a
specific time.

Agenda
Document vs. Relational Model
Analytics on MongoDB data
60,000 feet – what is the aggregation pipeline
Aggregation pipeline operators
$lookup (Left Outer Equi Joins) in MongoDB 3.2
Other aggregation enhancements
Worked examples

Document vs. Relational Model
RDBMS MongoDB
{
_id: ObjectId("4c4ba5e5e8aabf3"),
employee_name: {First: "Billy",
Last: "Fish"},
department: "Engineering",
title: "Aquarium design",
pay_band: "C",
benefits: [
{ type: "Health",
plan: "PPO Plus" },
{ type: "Dental",
plan: "Standard" }
]
}

Existing Alternatives to Joins
{ "_id": 10000,
"items": [
{
"productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23
},
{
"productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
}
],
…
}
• Option 1: Include all data for an order in
the same document
– Fast reads
• One find delivers all the required data
– Captures full description at the time of the
event
– Consumes extra space
• Details of each product stored in many order
documents
– Complex to maintain
• A change to any product attribute must be
propagated to all affected orders
orders

Existing Alternatives to Joins
{
"_id": 10000,
"items": [
12345,
54321
],
...
}
• Option 2: Order document
references product documents
– Slower reads
• Multiple trips to the database
– Space efficient
• Product details stored once
– Lose point-in-time snapshot of full
record
– Extra application logic
• Must iterate over product IDs in
the order document and find the
product documents
• RDBMS would automate through
a JOIN
orders
{
"_id": 12345,
"productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
}
{
"_id": 54321,
"productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
}
products
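
The "extra application logic" of Option 2 can be sketched in plain JavaScript. This is only an illustration: the arrays below are in-memory stand-ins for the orders and products collections, and `resolveOrder` is a hypothetical helper, not part of any MongoDB driver. Against a real database, each resolution step would cost at least one further query.

```javascript
// Hypothetical in-memory stand-ins for the orders and products collections.
const products = [
  { _id: 12345, productName: "laptop", unitPrice: 1000, weight: 1.2 },
  { _id: 54321, productName: "mouse", unitPrice: 20, weight: 0.2 }
];
const order = { _id: 10000, items: [12345, 54321] };

// Index products by _id so each reference resolves with one map lookup.
const productsById = new Map(products.map(p => [p._id, p]));

// Replace each product ID in the order with the product document it references.
// IDs with no matching product resolve to null rather than being dropped.
function resolveOrder(order, productsById) {
  return {
    ...order,
    items: order.items.map(id => productsById.get(id) || null)
  };
}

const fullOrder = resolveOrder(order, productsById);
console.log(fullOrder.items.map(item => item.productName)); // [ 'laptop', 'mouse' ]
```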

The Winner?
• In general, Option 1 wins
– Performance and containment of everything in same place beats space
efficiency of normalization
– There are exceptions
• e.g. Comments in a blog post -> unbounded size
• However, analytics benefit from combining data from multiple collections
– Keep listening...

Aggregation Pipeline
[Animated diagram: a set of documents flows left to right through the pipeline stages $match, $project, $lookup and $group, with each stage working on the output of the previous one.]

Aggregation Pipeline Stages
• $match
Filter documents
• $geoNear
Geospatial query
• $project
Reshape documents
• $lookup
New – Left-outer equi joins
• $unwind
Expand documents
• $group
Summarize documents
• $sample
New – Randomly selects a subset
of documents
• $sort
Order documents
• $skip
Jump over a number of documents
• $limit
Limit number of documents
• $redact
Restrict documents
• $out
Sends results to a new collection

$lookup
• Left-outer join
– Includes all documents from the
left collection
– For each document in the left
collection, find the matching
documents from the right
collection and embed them
Left Collection Right Collection

$lookup
db.leftCollection.aggregate(
[{
$lookup:
{
from: "rightCollection",
localField: "leftVal",
foreignField: "rightVal",
as: "embeddedData"
}
}])
leftCollection rightCollection
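
To make the left-outer behaviour concrete, here is a plain-JavaScript sketch that mimics the effect of the stage on in-memory arrays. It is not how MongoDB implements $lookup; `lookup` is a hypothetical helper, the sample documents are invented, and only top-level fields (no dotted paths) are handled.

```javascript
// Hypothetical helper that mimics the effect of a $lookup stage on plain arrays.
// Simplification: only top-level field names are supported.
function lookup(left, right, { localField, foreignField, as }) {
  return left.map(leftDoc => ({
    ...leftDoc,
    // Equality match only: embed every right document whose foreignField
    // equals this left document's localField.
    [as]: right.filter(rightDoc => rightDoc[foreignField] === leftDoc[localField])
  }));
}

const leftCollection = [
  { _id: 1, leftVal: "a" },
  { _id: 2, leftVal: "z" } // has no partner in rightCollection
];
const rightCollection = [
  { _id: 10, rightVal: "a" },
  { _id: 11, rightVal: "a" },
  { _id: 12, rightVal: "b" }
];

const joined = lookup(leftCollection, rightCollection, {
  localField: "leftVal",
  foreignField: "rightVal",
  as: "embeddedData"
});
```

Every left document survives the join: the first gains an array of two embedded documents, while the second keeps an empty array because nothing matched.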

New Aggregation Operators
• Array operations
– $slice, $arrayElemAt,
$concatArrays, $isArray,
$filter, $min, $max, $avg
and $sum
• Standard Deviations
– $stdDevSamp (sample) and
$stdDevPop (complete)
• Square Root
– $sqrt
• Absolute (make +ve) value
– $abs
• Rounding numbers
– $trunc, $ceil, $floor
• Logarithms
– $log, $log10, $ln
• Raise to power
– $pow
• Natural Exponent
– $exp
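
The difference between the two standard deviation operators comes down to the divisor. The sketch below uses plain JavaScript and invented sample values to show the usual textbook formulas; it illustrates the arithmetic and makes no claim about the server's implementation.

```javascript
// Mean of a non-empty array of numbers.
const mean = xs => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Sum of squared deviations from the mean.
const sumSquares = xs => {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0);
};

// Population standard deviation (the values are the complete data set): divide by n.
const stdDevPop = xs => Math.sqrt(sumSquares(xs) / xs.length);

// Sample standard deviation (the values are a sample of a larger set): divide by n - 1.
const stdDevSamp = xs => Math.sqrt(sumSquares(xs) / (xs.length - 1));

const values = [2, 4, 4, 4, 5, 5, 7, 9]; // invented sample values
console.log(stdDevPop(values)); // 2
```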

Worked Example – Data Set
db.postcodes.findOne()
{
"_id": ObjectId("5600521e50fa77da54dfc0d2"),
"postcode": "SL6 0AA",
"location": {
"type": "Point",
"coordinates": [
51.525605,
-0.700974
]
}
}
db.homeSales.findOne()
{
"_id": ObjectId("56005dd980c3678b19792b7f"),
"amount": 9000,
"date": ISODate("1996-09-19T00:00:00Z"),
"address": {
"nameOrNumber": 25,
"street": "NORFOLK PARK COTTAGES",
"town": "MAIDENHEAD",
"county": "WINDSOR AND MAIDENHEAD",
"postcode": "SL6 7DR"
}
}

Reduce Data Set First
db.homeSales.aggregate([
{$match: {
amount: {$gte:3000000}}
}
])
…
{
"_id": ObjectId("56005dda80c3678b19799e52"),
"amount": 3000000,
"date": ISODate("2012-04-19T00:00:00Z"),
"address": {
"nameOrNumber": "TEMPLE FERRY PLACE",
"street": "MILL LANE",
"postcode": "SL6 5ND"
}
},…

Join (left-outer-equi) Results With Second Collection
db.homeSales.aggregate([
{$match: {
amount: {$gte:3000000}}
},
{$lookup: {
from: "postcodes",
localField:
"address.postcode",
foreignField: "postcode",
as: "postcode_docs"}
}
])
...
},
"postcode_docs": [
{
"_id": ObjectId("560053e280c3678b1978b293"),
"postcode": "SL6 5ND",
"location": {
"type": "Point",
"coordinates": [
51.549516,
-0.80702
]
}}]}, ...

Refactor Each Resulting Document
...},
{$project: {
_id: 0,
saleDate: "$date",
price: "$amount",
address: 1,
location:
{$arrayElemAt:
["$postcode_docs.location",
0]}}
])
{ "address": {
"nameOrNumber": "TEMPLE FERRY PLACE",
"street": "MILL LANE",
},
"saleDate": ISODate("2012-04-19T00:00:00Z"),
"price": 3000000,
"location": {
"type": "Point",
"coordinates": [
51.549516,
-0.80702
]}},...
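
The location expression does two things at once, which the following plain-JavaScript sketch separates. It only mirrors the behaviour on an in-memory object built from the values shown on these slides; the variable names are illustrative.

```javascript
// One document as it might look after the $lookup stage (values from the slides).
const doc = {
  amount: 3000000,
  postcode_docs: [
    {
      postcode: "SL6 5ND",
      location: { type: "Point", coordinates: [51.549516, -0.80702] }
    }
  ]
};

// "$postcode_docs.location" applied to an array of documents yields an
// array holding each element's location field.
const locations = doc.postcode_docs.map(d => d.location);

// {$arrayElemAt: [<array>, 0]} then picks the first element of that array.
// In this sketch an empty array gives undefined.
const location = locations[0];
```

Because $lookup always produces an array, even for a single match, $arrayElemAt is what turns the embedded result back into a plain sub-document.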

Sort on Sale Price & Write to Collection
...},
{$sort:
{price: -1}},
{$out: "hotSpots"}
])
…{"address": {
"nameOrNumber": "2 - 3",
"street": "THE SWITCHBACK",
"postcode": "SL6 7RJ"
},
"saleDate": ISODate("1999-03-15T00:00:00Z"),
"price": 5425000,
"location": {
"type": "Point",
"coordinates": [
51.536848,
-0.735835
]}},...
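
Putting the fragments from the last few slides together, the whole pipeline might look like the sketch below. It assumes that the elided "..." sections hide no further stages; the variable name `pipeline` is illustrative.

```javascript
// The stages from the preceding slides, assembled in order.
const pipeline = [
  // 1. Reduce the data set first: keep only the most expensive sales.
  {$match: {amount: {$gte: 3000000}}},
  // 2. Left-outer equi join against the postcodes collection.
  {$lookup: {
    from: "postcodes",
    localField: "address.postcode",
    foreignField: "postcode",
    as: "postcode_docs"
  }},
  // 3. Reshape each document, lifting the location out of the joined array.
  {$project: {
    _id: 0,
    saleDate: "$date",
    price: "$amount",
    address: 1,
    location: {$arrayElemAt: ["$postcode_docs.location", 0]}
  }},
  // 4. Most expensive sale first.
  {$sort: {price: -1}},
  // 5. Write the results to a collection; $out has to be the final stage.
  {$out: "hotSpots"}
];

// In the mongo shell this would be run as:
//   db.homeSales.aggregate(pipeline)
```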

Aggregated Statistics
{$group:
{ _id:
{$year: "$date"},
higestPrice:
{$max: "$amount"},
lowestPrice:
{$min: "$amount"},
averagePrice:
{$avg: "$amount"},
amountStdDev:
{$stdDevPop: "$amount"}
}}
])
...
{
"_id": 1995,
"higestPrice": 1000000,
"lowestPrice": 12000,
"averagePrice": 114059.35206869633,
"amountStdDev": 81540.50490801703
},
{
"_id": 1996,
"averagePrice": 118862,
"amountStdDev": 79871.07569783277
}, ...

Clean Up Output
...,
{$project:
{
_id: 0,
year: "$_id",
higestPrice: 1,
lowestPrice: 1,
averagePrice:
{$trunc: "$averagePrice"},
priceStdDev:
{$trunc: "$amountStdDev"}
}
}
])
...
{
"year": 1995,
"priceStdDev": 81540
},
{
"year": 2004,
"priceStdDev": 199643
},...

Postal Code & Location for Each Year’s Highest Priced Sale
{$sort: {amount: -1}},
{$group: {
_id: {$year: "$date"},
priciestPostCode:
{$first:
"$address.postcode"}
}
},
{$lookup: {
from: "postcodes",
localField:
"priciestPostCode",
foreignField: "postcode",
as: "locationData"
}
},
{$sort: {_id: -1}},

Postal Code & Location for Each Year’s Highest Priced Sale
{$project: {
_id: 0,
Year: "$_id",
PostCode:
"$priciestPostCode",
Location: {$arrayElemAt: [
"$locationData.location",
0]}
}
}
])
...
{
"Year": 2014,
"PostCode": "SL6 1UP",
"Location": {
"type": "Point",
"coordinates": [
51.51407,
-0.704414
]
}
},
...
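
The reason for sorting before grouping is that $first takes its value from the first document each group receives. A plain-JavaScript sketch of the same idea follows; the sample sales and postcodes are invented for illustration, and the code only mimics the two stages on an in-memory array.

```javascript
// Hypothetical sample sales; amounts and postcodes are invented for illustration.
const sales = [
  { amount: 250000, date: new Date("2013-05-01T00:00:00Z"), address: { postcode: "XX1 1AA" } },
  { amount: 900000, date: new Date("2013-09-12T00:00:00Z"), address: { postcode: "XX2 2BB" } },
  { amount: 400000, date: new Date("2014-02-20T00:00:00Z"), address: { postcode: "XX3 3CC" } },
  { amount: 150000, date: new Date("2014-07-04T00:00:00Z"), address: { postcode: "XX4 4DD" } }
];

// Equivalent of {$sort: {amount: -1}}: most expensive sale first.
const sortedSales = [...sales].sort((a, b) => b.amount - a.amount);

// Equivalent of $group with $first: keep the first postcode seen for each year.
const priciestPostCode = new Map();
for (const sale of sortedSales) {
  const year = sale.date.getUTCFullYear();
  if (!priciestPostCode.has(year)) {
    priciestPostCode.set(year, sale.address.postcode);
  }
}
```

Without the initial sort, the first document in each group would be arbitrary, so the postcode kept for each year would not necessarily belong to the highest priced sale.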

Aggregation Options
db.cData.aggregate([
<pipeline stages>
],
{
'allowDiskUse': true,
'cursor':
{
'batchSize': 5
}
}
)
• explain
– Information on execution plan
• allowDiskUse
– Enable use of disk to store
intermediate results
• cursor.batchSize
– Specify the number of documents in the
initial batch of results

Aggregation With a Sharded Database
• Workload split between shards
– Client works through mongos as with
any query
– Shards execute pipeline up to a point
– A single shard merges cursors and
continues processing
– Use explain to analyze pipeline split
– Early $match on shard key may
exclude shards
– Potential CPU and memory
implications for primary shard host
– $lookup & $out performed within
Primary shard for the database

Tableau + MongoDB Connector for BI

Restrictions
• $lookup only supports equality for the match
• $lookup can only be used in the aggregation pipeline (e.g. not for find)
• The pipeline is linear; no forks. Can remove data at each stage and can only add new
raw data through $lookup
• Right collection for $lookup cannot be sharded
• Indexes are only used at the beginning of the pipeline (and on right collections in subsequent
$lookups), before any data transformations
• $out can only be used in the final stage of the pipeline
• $geoNear can only be the first stage in the pipeline
• The BI Connector for MongoDB is part of MongoDB Enterprise Advanced
– Not in community
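
Two of the positional restrictions above are easy to check before a pipeline is sent to the server. The sketch below is a hypothetical client-side helper (`checkPipeline` is not a MongoDB API) that covers only the $geoNear and $out rules from this slide.

```javascript
// Hypothetical helper: reports violations of the two positional rules listed above.
function checkPipeline(pipeline) {
  const problems = [];
  pipeline.forEach((stage, index) => {
    // Each stage is an object with a single key naming the stage.
    const name = Object.keys(stage)[0];
    if (name === "$geoNear" && index !== 0) {
      problems.push("$geoNear can only be the first stage");
    }
    if (name === "$out" && index !== pipeline.length - 1) {
      problems.push("$out can only be the final stage");
    }
  });
  return problems;
}

// A valid ordering yields no problems; a misplaced $out is reported.
const okPipeline = [{$match: {amount: {$gte: 3000000}}}, {$out: "hotSpots"}];
const badPipeline = [{$out: "hotSpots"}, {$sort: {price: -1}}];
console.log(checkPipeline(okPipeline).length);  // 0
console.log(checkPipeline(badPipeline).length); // 1
```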

Next Steps
• Documentation
– https://docs.mongodb.org/manual/release-notes/3.2/#aggregation-framework-enhancements
• Not yet ready for production but download and try!
– https://www.mongodb.org/downloads#development
• Detailed blog
– https://www.mongodb.com/blog/post/joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2-part-1-of-3-introduction
• Webinars
– Tomorrow: What's New in MongoDB 3.2 https://www.mongodb.com/webinar/whats-new-in-mongodb-3-2
– Replay: 3.2 $lookup & aggregation https://www.mongodb.com/presentations/webinar-joins-and-other-aggregation-enhancements-coming-in-mongodb-3-2
• Feedback
– MongoDB 3.2 Bug Hunt
• https://www.mongodb.com/blog/post/announcing-the-mongodb-3-2-bug-hunt
– https://jira.mongodb.org/

MongoDB Days 2015
October 6, 2015: France
October 20, 2015: Germany
November 5, 2015: UK
December 2, 2015: Silicon Valley
