Find out which is faster, SQL or NoSQL, for traditional reporting tasks. Discover how you can optimise MongoDB aggregation pipelines and how to push complex computation down to the database.
MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing (MongoDB)
Query performance can either be a constant headache or the unsung hero of an application. MongoDB provides extremely powerful querying capabilities when used properly. I will share the most common mistakes observed and some tips and tricks for avoiding them.
This document discusses tuning MongoDB performance. It covers tuning queries using the database profiler and explain commands to analyze slow queries. It also covers tuning system configurations like Linux settings, disk I/O, and memory to optimize MongoDB performance. Topics include setting ulimits, IO scheduler, filesystem options, and more. References to MongoDB and Linux tuning documentation are also provided.
This document discusses MongoDB performance tuning. It emphasizes that performance tuning is an obsession that requires planning schema design, statement tuning, and instance tuning in that order. It provides examples of using the MongoDB profiler and explain functions to analyze statements and identify tuning opportunities like non-covered indexes, unnecessary document scans, and low data locality. Instance tuning focuses on optimizing writes through fast update operations and secondary index usage, and optimizing reads by ensuring statements are tuned and data is sharded appropriately. Overall performance depends on properly tuning both reads and writes.
Slide deck presented at http://devternity.com/ on MongoDB internals. We review the usage patterns of MongoDB, the different storage engines and persistence models, as well as the definition of documents and general data structures.
MongoDB .local Toronto 2019: Aggregation Pipeline Power++: How MongoDB 4.2 Pi... (MongoDB)
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
Inside MongoDB: the Internals of an Open-Source Database (Mike Dirolf)
The document discusses MongoDB, including how it stores and indexes data, handles queries and replication, and supports sharding and geospatial indexing. Key points covered include how MongoDB stores data in BSON format across data files that grow in size, uses memory-mapped files for data access, supports indexing with B-trees, and replicates operations through an oplog.
MongoDB World 2019: Tips and Tricks++ for Querying and Indexing MongoDB (MongoDB)
Query performance can either be a constant headache or the unsung hero of an application. MongoDB provides extremely powerful querying capabilities when used properly. As a senior member of the support team, I will share the most common mistakes observed and some tips and tricks for avoiding them.
Presented by Tom Schreiber, Senior Consulting Engineer, MongoDB
MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application? In this talk we’ll cover how indexing works, the various indexing options, and cover use cases where each might be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale. We'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
Robert Haas
Why does my query need a plan? Sequential scan vs. index scan. Join strategies. Join reordering. Joins you can't reorder. Join removal. Aggregates and DISTINCT. Using EXPLAIN. Row count and cost estimation. Things the query planner doesn't understand. Other ways the planner can fail. Parameters you can tune. Things that are nearly always slow. Redesigning your schema. Upcoming features and future work.
Indexing in MongoDB works similarly to indexing in relational databases. An index is a data structure that can make certain queries more efficient by maintaining a sorted order of documents. Indexes are created using the ensureIndex() method and take up additional space and slow down writes. The explain() method is used to determine whether a query is using an index.
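The core idea in that summary, that a sorted structure turns a full scan into a binary search, can be shown with a toy sketch. This is a minimal illustration in plain Python, not MongoDB's actual B-tree implementation; the `score` field and document shape are invented for the example.

```python
import bisect
import random

# Toy illustration (not MongoDB internals): an "index" is a sorted list of
# (key, position) pairs, so a lookup is a binary search instead of a scan.
random.seed(0)
docs = [{"_id": i, "score": random.randrange(1000)} for i in range(10_000)]

# Build the "index" on the score field: sorted keys plus parallel positions.
index = sorted((d["score"], pos) for pos, d in enumerate(docs))
keys = [k for k, _ in index]

def find_by_score(target):
    """Return positions of all docs with the given score via binary search."""
    lo = bisect.bisect_left(keys, target)
    hi = bisect.bisect_right(keys, target)
    return [index[i][1] for i in range(lo, hi)]

# A collection scan touches every document; the index touches O(log n) keys.
scan_result = [pos for pos, d in enumerate(docs) if d["score"] == 42]
assert sorted(find_by_score(42)) == scan_result
```

This also shows the trade-off the summary mentions: the index costs extra space (the sorted list) and every insert must update it, which is why indexes slow down writes.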
The document provides an overview of Hive architecture and workflow. It discusses how Hive converts HiveQL queries to MapReduce jobs through its compiler. The compiler includes components like the parser, semantic analyzer, logical and physical plan generators, and logical and physical optimizers. It analyzes sample HiveQL queries and shows the transformations done at each compiler stage to generate logical and physical execution plans consisting of operators and tasks.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
These are slides from our Big Data Warehouse Meetup in April. We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.
Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
For more information, visit our website at http://casertaconcepts.com/ or email us at info@casertaconcepts.com.
MongoDB World 2019: The Sights (and Smells) of a Bad Query (MongoDB)
“Why is MongoDB so slow?” you may ask yourself on occasion. You’ve created indexes, you’ve learned how to use the aggregation pipeline. What the heck? Could it be your queries? This talk will outline what tools are at your disposal (both in MongoDB Atlas and in MongoDB server) to identify inefficient queries.
This presentation covers the differences between Elasticsearch and relational databases. It also includes a glossary of Elasticsearch terms and its basic operations.
This presentation will demonstrate how you can use the aggregation pipeline with MongoDB similar to how you would use GROUP BY in SQL and the new stage operators coming 3.4. MongoDB’s Aggregation Framework has many operators that give you the ability to get more value out of your data, discover usage patterns within your data, or use the Aggregation Framework to power your application. Considerations regarding version, indexing, operators, and saving the output will be reviewed.
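The GROUP BY analogy above can be made concrete with a small sketch. The following is a plain-Python stand-in for what a `$group` stage computes; the collection, field names, and numbers are invented for illustration, not taken from the talk.

```python
from collections import defaultdict

# Hypothetical orders; field names are illustrative, not from the talk.
orders = [
    {"city": "Toronto", "total": 120.0},
    {"city": "Toronto", "total": 80.0},
    {"city": "Montreal", "total": 50.0},
]

# The MongoDB pipeline this mimics, shown as a plain data structure:
# [{"$group": {"_id": "$city",
#              "count": {"$sum": 1},
#              "avgTotal": {"$avg": "$total"}}}]
# SQL equivalent: SELECT city, COUNT(*), AVG(total) FROM orders GROUP BY city;

groups = defaultdict(list)
for o in orders:
    groups[o["city"]].append(o["total"])

result = {
    city: {"count": len(totals), "avgTotal": sum(totals) / len(totals)}
    for city, totals in groups.items()
}

assert result["Toronto"] == {"count": 2, "avgTotal": 100.0}
assert result["Montreal"] == {"count": 1, "avgTotal": 50.0}
```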
PostgreSQL comes built-in with a variety of indexes, some of which are further extensible to build powerful new indexing schemes. But what are all these index types? What are some of the special features of these indexes? What are the size & performance tradeoffs? How do I know which ones are appropriate for my application?
Fortunately, this talk aims to answer all of these questions as we explore the whole family of PostgreSQL indexes: B-tree, expression, GiST (of all flavors), GIN and how they are used in theory and practice.
[pgday.Seoul 2022] PostgreSQL with Google Cloud (PgDay.Seoul)
Google Cloud offers several fully managed database services for PostgreSQL workloads, including Cloud SQL and AlloyDB.
Cloud SQL provides a fully managed relational database service for PostgreSQL, MySQL, and SQL Server. It offers 99.999% availability, unlimited scaling, and automatic failure recovery.
AlloyDB is a new database engine compatible with PostgreSQL that provides up to 4x faster transactions and 100x faster analytics queries than standard PostgreSQL. It features independent scaling of storage and computing resources.
Google Cloud aims to be the best home for PostgreSQL workloads by providing compatibility with open source PostgreSQL and enterprise-grade features, performance, reliability, and support across its database services.
Parquet performance tuning: the missing guide (Ryan Blue)
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
MongoDB sharded cluster. How to design your topology? (Mydbops)
These slides were presented at Mydbops Database Meetup 4 on Aug 3, 2019 by Vinodh Krishnaswamy (Percona). This talk focuses on when to adopt a sharded topology in MongoDB, along with its benefits and impact.
MongoDB Schema Design (Event: An Evening with MongoDB Houston 3/11/15) (MongoDB)
The document discusses different data modeling approaches for structuring data in MongoDB, including embedding data versus referencing data in collections. It provides examples of modeling one-to-one, one-to-many, and many-to-many relationships between entities using embedding and referencing. The document recommends different approaches depending on the use case and prioritizes flexibility, performance, and optimal data representation.
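The embedding-versus-referencing choice described above is easiest to see as document shapes. The following is an illustrative sketch in Python; the blog post/comments example and all field names are assumptions, not taken from the slides.

```python
# Two ways to model a one-to-many post/comments relationship.

# 1. Embedding: comments live inside the post document. One read fetches
#    everything, but a very active post can grow without bound.
post_embedded = {
    "_id": "post1",
    "title": "Schema design",
    "comments": [
        {"author": "ada", "text": "Nice talk"},
        {"author": "alan", "text": "+1"},
    ],
}

# 2. Referencing: comments are separate documents pointing back at the
#    post. Reading them takes a second query (or a $lookup), but each
#    document stays small and comments can be paginated independently.
post_ref = {"_id": "post1", "title": "Schema design"}
comments = [
    {"_id": "c1", "post_id": "post1", "author": "ada", "text": "Nice talk"},
    {"_id": "c2", "post_id": "post1", "author": "alan", "text": "+1"},
]

# The "join" for the referenced form, done client-side:
joined = [c for c in comments if c["post_id"] == post_ref["_id"]]
assert [c["author"] for c in joined] == ["ada", "alan"]
```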
Tuning Apache Spark for Large-Scale Workloads, Gaoxiang Liu and Sital Kedia (Databricks)
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
The document provides an introduction to the ELK stack, which is a collection of three open source products: Elasticsearch, Logstash, and Kibana. It describes each component, including that Elasticsearch is a search and analytics engine, Logstash is used to collect, parse, and store logs, and Kibana is used to visualize data with charts and graphs. It also provides examples of how each component works together in processing and analyzing log data.
MongoDB World 2019: RDBMS Versus MongoDB Aggregation Performance (MongoDB)
Join me as we compare the performance of MySQL and MongoDB aggregating and analyzing data against a large, real-world data set. From this talk, you will learn when MongoDB is faster than MySQL, why that's the case, and that doctors appear to do some very questionable things.
Deploying any software can be a challenge if you don't understand how resources are used or how to plan for the capacity of your systems. Whether you need to deploy or grow a single MongoDB instance, a replica set, or tens of sharded clusters, you probably share the same challenges in trying to size that deployment.
This webinar will cover what resources MongoDB uses, and how to plan for their use in your deployment. Topics covered will include understanding how to model and plan capacity needs for new and growing deployments. The goal of this webinar will be to provide you with the tools needed to be successful in managing your MongoDB capacity planning tasks.
Time Series Databases for IoT (On-premises and Azure) (Ivo Andreev)
This document discusses choosing the right time series database for IoT data. It compares InfluxDB to SQL Server and other databases.
Some key points made:
- InfluxDB outperforms SQL Server for writes by 40x and queries by 59x for time series data due to its optimized design.
- InfluxDB uses 19x-26x less disk storage than SQL Server for the same data.
- InfluxDB also outperforms MongoDB, Elasticsearch, OpenTSDB, and Cassandra for time series workloads.
- Azure Stream Insights is a managed service but has limited capabilities and can be pricey for high volumes of data.
- InfluxDB is open source and has no dependencies.
Speaker: Akira Kurogane, Senior Technical Services Engineer, MongoDB
Level: 300 (Advanced)
Track: Performance
One week your active dataset consumes 90% of available RAM. The next week it's 110%. Is that a 10% or a 99% performance degradation? Let's discover what it looks like when different hardware capacity limits are hit: memory vs. disk bottlenecks, the rare CPU bottleneck, and network bottlenecks; what happens when you drop a crucial index during peak load; and what happens when you run multiple WiredTiger nodes on the same server without limiting their cache size.
What You Will Learn:
- Performance analysis
- Post-mortem log analysis
- Capacity planning
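The multiple-nodes-on-one-server warning above follows directly from how the default cache is sized. MongoDB documents the default WiredTiger cache as the larger of 50% of (RAM minus 1 GB) or 256 MB; the sketch below applies that formula, with the 16 GB host being an invented example.

```python
def default_wt_cache_gb(ram_gb: float) -> float:
    """MongoDB's documented default WiredTiger cache size:
    the larger of 50% of (RAM - 1 GB) or 256 MB (~0.25 GB here)."""
    return max(0.5 * (ram_gb - 1.0), 0.25)

# On a 16 GB host, a single mongod defaults to a 7.5 GB cache.
assert default_wt_cache_gb(16) == 7.5

# Three mongods on that same host, each using the default, would together
# claim 22.5 GB of cache on a 16 GB box and push it into swap or an
# OOM kill. Hence the advice to cap each node's cache explicitly via
# storage.wiredTiger.engineConfig.cacheSizeGB.
assert 3 * default_wt_cache_gb(16) == 22.5
```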
The Care + Feeding of a MongoDB Cluster (Chris Henry)
This document summarizes best practices for scaling MongoDB deployments. It discusses Behance's use of MongoDB for their activity feed, including moving from 40 nodes with 250M documents on ext3 to 60 nodes with 400M documents on ext4. It covers topics like sharding, replica sets, indexing, maintenance, and hardware considerations for large MongoDB clusters.
This presentation was given at the LDS Tech SORT Conference 2011 in Salt Lake City. The slides are quite comprehensive covering many topics on MongoDB. Rather than a traditional presentation, this was presented as more of a Q & A session. Topics covered include. Introduction to MongoDB, Use Cases, Schema design, High availability (replication) and Horizontal Scaling (sharding).
In-memory Caching in HDFS: Lower Latency, Same Great Taste (DataWorks Summit)
This document discusses in-memory caching in HDFS to improve query latency. The implementation caches important datasets in the DataNode memory and allows clients to directly access cached blocks via zero-copy reads without checksum verification. Evaluation shows the zero-copy reads approach provides significant performance gains over short-circuit and TCP reads for both microbenchmarks and Impala queries, with speedups of up to 7x when the working set fits in memory. MapReduce jobs see more modest gains as they are often not I/O bound.
The document discusses sizing a MongoDB cluster for a large coffee chain called PlanetDollar. It describes collecting mobile app performance data, including 2 years of historical event data with 3000-5000 events per second. The key steps to size the MongoDB cluster are: 1) calculate the collection and index sizes based on the amount of data, 2) estimate the working set size based on frequently accessed data, 3) use a simplified model to estimate IOPS requirements and adjust based on factors like working sets, and 4) calculate the number of shards needed based on storage, memory and IOPS requirements.
Jay Runkel presented a methodology for sizing MongoDB clusters to meet the requirements of an application. The key steps are: 1) Analyze data size and index size, 2) Estimate the working set based on frequently accessed data, 3) Use a simplified model to estimate IOPS and adjust for real-world factors, 4) Calculate the number of shards needed based on storage, memory and IOPS requirements. He demonstrated this process for an application that collects mobile events, requiring a cluster that can store over 200 billion documents with 50,000 IOPS.
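The four-step methodology above reduces to arithmetic once per-shard limits are assumed. The sketch below is a back-of-the-envelope version of it; the document count echoes the talk's 200 billion figure, but every other number (document size, index overhead, working-set fraction, per-shard limits) is an invented assumption.

```python
import math

# Back-of-the-envelope shard count following the sizing steps above.
# All constants except doc_count are illustrative assumptions.
doc_count = 200_000_000_000     # documents to store (from the talk)
avg_doc_bytes = 250             # assumed average document size
index_overhead = 0.30           # assumed index size as a fraction of data
working_set_fraction = 0.05     # assumed hot fraction that must fit in RAM
required_iops = 50_000          # peak operations per second (from the talk)

per_shard_storage_tb = 4.0      # assumed usable storage per shard
per_shard_ram_gb = 128.0        # assumed RAM per shard
per_shard_iops = 10_000         # assumed sustained IOPS per shard

# Step 1: data + index size. Step 2: working set.
data_tb = doc_count * avg_doc_bytes * (1 + index_overhead) / 1e12
working_set_gb = data_tb * working_set_fraction * 1000

# Steps 3-4: shards needed per resource; the tightest constraint wins.
shards_for_storage = math.ceil(data_tb / per_shard_storage_tb)
shards_for_ram = math.ceil(working_set_gb / per_shard_ram_gb)
shards_for_iops = math.ceil(required_iops / per_shard_iops)
shards = max(shards_for_storage, shards_for_ram, shards_for_iops)
```

Under these assumptions RAM for the working set, not storage or IOPS, is the binding constraint, which is a common outcome of this kind of sizing exercise.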
MongoDB is a document-oriented NoSQL database that uses flexible schemas and provides high performance, high availability, and easy scalability. It uses either MMAP or WiredTiger storage engines and supports features like sharding, aggregation pipelines, geospatial indexing, and GridFS for large files. While MongoDB has better performance than Cassandra or Couchbase according to benchmarks, it has limitations such as a single-threaded aggregation and lack of joins across collections.
This document discusses how to achieve scale with MongoDB. It covers optimization tips like schema design, indexing, and monitoring. Vertical scaling involves upgrading hardware like RAM and SSDs. Horizontal scaling involves adding shards to distribute load. The document also discusses how MongoDB scales for large customers through examples of deployments handling high throughput and large datasets.
Sizing MongoDB on AWS with WiredTiger, Patrick and Vigyan (Vigyan Jain)
This document provides guidance on sizing MongoDB deployments on AWS for optimal performance. It discusses key considerations for capacity planning like testing workloads, measuring performance, and adjusting over time. Different AWS services like compute-optimized instances and storage options like EBS are reviewed. Best practices for WiredTiger like sizing cache, effects of compression and encryption, and monitoring tools are covered. The document emphasizes starting simply and scaling based on business needs and workload profiling.
This document provides an introduction and agenda for a presentation on MongoDB 2.4 and Spring Data. The presentation will include a quick introduction to NoSQL and MongoDB, an overview of Spring Data's MongoDB support including configuration, templates, repositories and queries, and details on metadata mapping, aggregation functions, GridFS file storage and indexes in MongoDB.
Scaling with sync_replication using Galera and EC2 (Marco Tusa)
Challenging architecture design and proof of concept on a real case study using a synchronous replication solution.
A customer asked me to investigate and design a MySQL architecture to support their application serving shops around the globe.
The architecture must scale out and scale in based on sales seasons.
Slides for a talk.
Talk abstract:
In the dark of the night, if you listen carefully enough, you can hear databases cry. But why? As developers, we rarely consider what happens under the hood of widely used abstractions such as databases. As a consequence, we rarely think about the performance of databases. This is especially true to less widespread, but often very useful NoSQL databases.
In this talk we will take a close look at NoSQL database performance, peek under the hood of the most frequently used features to see how they affect performance and discuss performance issues and bottlenecks inherent to all databases.
Has your app taken off? Are you thinking about scaling? MongoDB makes it easy to horizontally scale out with built-in automatic sharding, but did you know that sharding isn't the only way to achieve scale with MongoDB?
In this webinar, we'll review three different ways to achieve scale with MongoDB. We'll cover how you can optimize your application design and configure your storage to achieve scale, as well as the basics of horizontal scaling. You'll walk away with a thorough understanding of options to scale your MongoDB application.
MongoDB stores data in files on disk that are broken into variable-sized extents containing documents. These extents, as well as separate index structures, are memory mapped by the operating system for efficient read/write. A write-ahead journal is used to provide durability and prevent data corruption after crashes by logging operations before writing to the data files. The journal increases write performance by 5-30% but can be optimized using a separate drive. Data fragmentation over time can be addressed using the compact command or adjusting the schema.
These are the slides I presented at the Nosql Night in Boston on Nov 4, 2014. The slides were adapted from a presentation given by Steve Francia in 2011. Original slide deck can be found here:
http://spf13.com/presentation/mongodb-sort-conference-2011
MongoDb is a document oriented database and very flexible one as it gives horizontal scalability.
In this presentation basic study about mongodb with installation steps and basic commands are described.
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
This presentation discusses migrating data from other data stores to MongoDB Atlas. It begins by explaining why MongoDB and Atlas are good choices for data management. Several preparation steps are covered, including sizing the target Atlas cluster, increasing the source oplog, and testing connectivity. Live migration, mongomirror, and dump/restore options are presented for migrating between replicasets or sharded clusters. Post-migration steps like monitoring and backups are also discussed. Finally, migrating from other data stores like AWS DocumentDB, Azure CosmosDB, DynamoDB, and relational databases are briefly covered.
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB
MongoDB Kubernetes operator and MongoDB Open Service Broker are ready for production operations. Learn about how MongoDB can be used with the most popular container orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications. A demo will show you how easy it is to enable MongoDB clusters as an External Service using the Open Service Broker API for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB
Are you new to schema design for MongoDB, or are you looking for a more complete or agile process than what you are following currently? In this talk, we will guide you through the phases of a flexible methodology that you can apply to projects ranging from small to large with very demanding requirements.
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB
Humana, like many companies, is tackling the challenge of creating real-time insights from data that is diverse and rapidly changing. This is our journey of how we used MongoDB to combined traditional batch approaches with streaming technologies to provide continues alerting capabilities from real-time data streams.
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB
Time series data is increasingly at the heart of modern applications - think IoT, stock trading, clickstreams, social media, and more. With the move from batch to real time systems, the efficient capture and analysis of time series data can enable organizations to better detect and respond to events ahead of their competitors or to improve operational efficiency to reduce cost and risk. Working with time series data is often different from regular application data, and there are best practices you should observe.
This talk covers:
Common components of an IoT solution
The challenges involved with managing time-series data in IoT applications
Different schema designs, and how these affect memory and disk utilization – two critical factors in application performance.
How to query, analyze and present IoT time-series data using MongoDB Compass and MongoDB Charts
At the end of the session, you will have a better understanding of key best practices in managing IoT time-series data with MongoDB.
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB
Our clients have unique use cases and data patterns that mandate the choice of a particular strategy. To implement these strategies, it is mandatory that we unlearn a lot of relational concepts while designing and rapidly developing efficient applications on NoSQL. In this session, we will talk about some of our client use cases, the strategies we have adopted, and the features of MongoDB that assisted in implementing these strategies.
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB
Encryption is not a new concept to MongoDB. Encryption may occur in-transit (with TLS) and at-rest (with the encrypted storage engine). But MongoDB 4.2 introduces support for Client Side Encryption, ensuring the most sensitive data is encrypted before ever leaving the client application. Even full access to your MongoDB servers is not enough to decrypt this data. And better yet, Client Side Encryption can be enabled at the "flick of a switch".
This session covers using Client Side Encryption in your applications. This includes the necessary setup, how to encrypt data without sacrificing queryability, and what trade-offs to expect.
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB
MongoDB Kubernetes operator is ready for prime-time. Learn about how MongoDB can be used with most popular orchestration platform, Kubernetes, and bring self-service, persistent storage to your containerized applications.
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB
These days, everyone is expected to be a data analyst. But with so much data available, how can you make sense of it and be sure you're making the best decisions? One great approach is to use data visualizations. In this session, we take a complex dataset and show how the breadth of capabilities in MongoDB Charts can help you turn bits and bytes into insights.
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB
When you need to model data, is your first instinct to start breaking it down into rows and columns? Mine used to be too. When you want to develop apps in a modern, agile way, NoSQL databases can be the best option. Come to this talk to learn how to take advantage of all that NoSQL databases have to offer and discover the benefits of changing your mindset from the legacy, tabular way of modeling data. We’ll compare and contrast the terms and concepts in SQL databases and MongoDB, explain the benefits of using MongoDB compared to SQL databases, and walk through data modeling basics so you feel confident as you begin using MongoDB.
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB
Join this talk and test session with a MongoDB Developer Advocate where you'll go over the setup, configuration, and deployment of an Atlas environment. Create a service that you can take back in a production-ready state and prepare to unleash your inner genius.
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB
The document discusses guidelines for ordering fields in compound indexes to optimize query performance. It recommends the E-S-R approach: placing equality fields first, followed by sort fields, and range fields last. This allows indexes to leverage equality matches, provide non-blocking sorts, and minimize scanning. Examples show how indexes ordered by these guidelines can support queries more efficiently by narrowing the search bounds.
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB
Aggregation pipeline has been able to power your analysis of data since version 2.2. In 4.2 we added more power and now you can use it for more powerful queries, updates, and outputting your data to existing collections. Come hear how you can do everything with the pipeline, including single-view, ETL, data roll-ups and materialized views.
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB
The document describes a methodology for data modeling with MongoDB. It begins by recognizing the differences between document and tabular databases, then outlines a three step methodology: 1) describe the workload by listing queries, 2) identify and model relationships between entities, and 3) apply relevant patterns when modeling for MongoDB. The document uses examples around modeling a coffee shop franchise to illustrate modeling approaches and techniques.
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB
MongoDB Atlas Data Lake is a new service offered by MongoDB Atlas. Many organizations store long term, archival data in cost-effective storage like S3, GCP, and Azure Blobs. However, many of them do not have robust systems or tools to effectively utilize large amounts of data to inform decision making. MongoDB Atlas Data Lake is a service allowing organizations to analyze their long-term data to discover a wealth of information about their business.
This session will take a deep dive into the features that are currently available in MongoDB Atlas Data Lake and how they are implemented. In addition, we'll discuss future plans and opportunities and offer ample Q&A time with the engineers on the project.
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB
Virtual assistants are becoming the new norm when it comes to daily life, with Amazon’s Alexa being the leader in the space. As a developer, not only do you need to make web and mobile compliant applications, but you need to be able to support virtual assistants like Alexa. However, the process isn’t quite the same between the platforms.
How do you handle requests? Where do you store your data and work with it to create meaningful responses with little delay? How much of your code needs to change between platforms?
In this session we’ll see how to design and develop applications known as Skills for Amazon Alexa powered devices using the Go programming language and MongoDB.
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB
aux Core Data, appréciée par des centaines de milliers de développeurs. Apprenez ce qui rend Realm spécial et comment il peut être utilisé pour créer de meilleures applications plus rapidement.
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB
Il n’a jamais été aussi facile de commander en ligne et de se faire livrer en moins de 48h très souvent gratuitement. Cette simplicité d’usage cache un marché complexe de plus de 8000 milliards de $.
La data est bien connu du monde de la Supply Chain (itinéraires, informations sur les marchandises, douanes,…), mais la valeur de ces données opérationnelles reste peu exploitée. En alliant expertise métier et Data Science, Upply redéfinit les fondamentaux de la Supply Chain en proposant à chacun des acteurs de surmonter la volatilité et l’inefficacité du marché.
Microservice Teams - How the cloud changes the way we workSven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and its challenges), and established platform and enablement teams.
Most important New features of Oracle 23c for DBAs and Developers. You can get more idea from my youtube channel video from https://youtu.be/XvL5WtaC20A
Baha Majid WCA4Z IBM Z Customer Council Boston June 2024.pdfBaha Majid
IBM watsonx Code Assistant for Z, our latest Generative AI-assisted mainframe application modernization solution. Mainframe (IBM Z) application modernization is a topic that every mainframe client is addressing to various degrees today, driven largely from digital transformation. With generative AI comes the opportunity to reimagine the mainframe application modernization experience. Infusing generative AI will enable speed and trust, help de-risk, and lower total costs associated with heavy-lifting application modernization initiatives. This document provides an overview of the IBM watsonx Code Assistant for Z which uses the power of generative AI to make it easier for developers to selectively modernize COBOL business services while maintaining mainframe qualities of service.
Unlock the Secrets to Effortless Video Creation with Invideo: Your Ultimate G...The Third Creative Media
"Navigating Invideo: A Comprehensive Guide" is an essential resource for anyone looking to master Invideo, an AI-powered video creation tool. This guide provides step-by-step instructions, helpful tips, and comparisons with other AI video creators. Whether you're a beginner or an experienced video editor, you'll find valuable insights to enhance your video projects and bring your creative ideas to life.
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...XfilesPro
Wondering how X-Sign gained popularity in a quick time span? This eSign functionality of XfilesPro DocuPrime has many advancements to offer for Salesforce users. Explore them now!
Mobile App Development Company In Noida | Drona InfotechDrona Infotech
Drona Infotech is a premier mobile app development company in Noida, providing cutting-edge solutions for businesses.
Visit Us For : https://www.dronainfotech.com/mobile-application-development/
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Project Management: The Role of Project Dashboards.pdfKarya Keeper
Project management is a crucial aspect of any organization, ensuring that projects are completed efficiently and effectively. One of the key tools used in project management is the project dashboard, which provides a comprehensive view of project progress and performance. In this article, we will explore the role of project dashboards in project management, highlighting their key features and benefits.
Consistent toolbox talks are critical for maintaining workplace safety, as they provide regular opportunities to address specific hazards and reinforce safe practices.
These brief, focused sessions ensure that safety is a continual conversation rather than a one-time event, which helps keep safety protocols fresh in employees' minds. Studies have shown that shorter, more frequent training sessions are more effective for retention and behavior change compared to longer, infrequent sessions.
Engaging workers regularly, toolbox talks promote a culture of safety, empower employees to voice concerns, and ultimately reduce the likelihood of accidents and injuries on site.
The traditional method of conducting safety talks with paper documents and lengthy meetings is not only time-consuming but also less effective. Manual tracking of attendance and compliance is prone to errors and inconsistencies, leading to gaps in safety communication and potential non-compliance with OSHA regulations. Switching to a digital solution like Safelyio offers significant advantages.
Safelyio automates the delivery and documentation of safety talks, ensuring consistency and accessibility. The microlearning approach breaks down complex safety protocols into manageable, bite-sized pieces, making it easier for employees to absorb and retain information.
This method minimizes disruptions to work schedules, eliminates the hassle of paperwork, and ensures that all safety communications are tracked and recorded accurately. Ultimately, using a digital platform like Safelyio enhances engagement, compliance, and overall safety performance on site. https://safelyio.com/
Malibou Pitch Deck For Its €3M Seed Roundsjcobrien
French start-up Malibou raised a €3 million Seed Round to develop its payroll and human resources
management platform for VSEs and SMEs. The financing round was led by investors Breega, Y Combinator, and FCVC.
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...Paul Brebner
Closing talk for the Performance Engineering track at Community Over Code EU (Bratislava, Slovakia, June 5 2024) https://eu.communityovercode.org/sessions/2024/why-apache-kafka-clusters-are-like-galaxies-and-other-cosmic-kafka-quandaries-explored/ Instaclustr (now part of NetApp) manages 100s of Apache Kafka clusters of many different sizes, for a variety of use cases and customers. For the last 7 years I’ve been focused outwardly on exploring Kafka application development challenges, but recently I decided to look inward and see what I could discover about the performance, scalability and resource characteristics of the Kafka clusters themselves. Using a suite of Performance Engineering techniques, I will reveal some surprising discoveries about cosmic Kafka mysteries in our data centres, related to: cluster sizes and distribution (using Zipf’s Law), horizontal vs. vertical scalability, and predicting Kafka performance using metrics, modelling and regression techniques. These insights are relevant to Kafka developers and operators.
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
How Can Hiring A Mobile App Development Company Help Your Business Grow?ToXSL Technologies
ToXSL Technologies is an award-winning Mobile App Development Company in Dubai that helps businesses reshape their digital possibilities with custom app services. As a top app development company in Dubai, we offer highly engaging iOS & Android app solutions. https://rb.gy/necdnt
E-commerce Development Services- Hornet DynamicsHornet Dynamics
For any business hoping to succeed in the digital age, having a strong online presence is crucial. We offer Ecommerce Development Services that are customized according to your business requirements and client preferences, enabling you to create a dynamic, safe, and user-friendly online store.
WWDC 2024 Keynote Review: For CocoaCoders AustinPatrick Weigel
Overview of WWDC 2024 Keynote Address.
Covers: Apple Intelligence, iOS18, macOS Sequoia, iPadOS, watchOS, visionOS, and Apple TV+.
Understandable dialogue on Apple TV+
On-device app controlling AI.
Access to ChatGPT with a guest appearance by Chief Data Thief Sam Altman!
App Locking! iPhone Mirroring! And a Calculator!!
2. The Aggregation Framework
• What is it?
• When should I use it?
• What can it do and not do?
• When should I use it instead of an RDBMS?
3. What is the Aggregation Framework
• It’s a data transformation pipeline.
• It's ultimately a Turing-complete functional language.
• It's SELECT, AS, JOIN, GROUP BY, and HAVING.
• It’s a fun challenge to use.
4. What can it do? and not do?
• It can read and examine documents and apply logic to them and
create new ones.
• Technically – it can do almost anything.
• Mine Bitcoins.
• Learn (in the AI sense).
• Emulate / Transpile SQL statements.
• Generate graphics.
• Run simulations.
• It can’t currently edit existing data in place.
5. When should I definitely use it?
• When the data’s in MongoDB and you don’t want to copy it.
• When you want to report on live data.
• When your application operations require more than find()
6. When should I use it versus my RDBMS?
• That’s a very good question.
8. Let us take a scenario
• You have a set of data
• You want to Report on it and Analyze it
• This data isn’t live – so we don’t need to worry about that.
• There may be a lot of it.
10. Data Details
• Every medical practice in England
• 10+ years available month by month
• Quantity and cost of each item prescribed and number of scripts.
• 100+ million rows a year
13. The Hardware
• Centos 7 – on Amazon EC2
• 32GB RAM
• 4 CPU Cores
• Databases on 2000 IOPS 400GB Disk
• Temp files on 1200 IOPS 400GB Disk
14. In the Blue corner
• MySQL Version 8.0
• Out of the box defaults
• Cache (innodb_buffer_pool) set to 80% of RAM
• 3 Tables (13GB)
• Indexing as required
15. In the Green Corner
• MongoDB 4.0.3
• Cache set to default (50% of RAM minus 1 GB)
• 1 Collection
• 15 GB of BSON
• 5GB on disk due to Snappy.
17. How much did the UK Spend in 2017?
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
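The nested $sum can be confusing: the inner $sum collapses the prescriptions array inside each document, while the outer $sum is the $group accumulator that adds those per-document totals across the whole (single) group. A minimal plain-JavaScript sketch of the same two-level sum, with a document shape assumed from the slides:

```javascript
// Toy documents shaped as the slides imply: each holds an array of
// prescription items with a cost.
const docs = [
  { prescriptions: [{ cost: 2.5 }, { cost: 4.0 }] },
  { prescriptions: [{ cost: 5.0 }] },
];

// Inner $sum: collapse each document's array to a single number.
const perDoc = docs.map(d =>
  d.prescriptions.reduce((acc, p) => acc + p.cost, 0));

// Outer $sum: the $group accumulator, one big group (_id: "all").
const total = perDoc.reduce((acc, t) => acc + t, 0);
// total === 11.5
```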
20. The other Result
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL: 37 seconds, 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 54 seconds, 25% CPU, 0 IOPS, 0 MB/s
21. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Depth Matters
• More flexibility
1 Bob 3.5 18-7-1972 NULL
2 Sally 8.9 15-3-1984 "Magic"
22. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
_id:int 1 name: str(3) "bob" size:double 3.5 when: date 18-7-1972
23. Row Format
RDBMS
• Fixed length values
• Known column offsets
• Fast to find COLUMN.
• Fast to next ROW.
• Expensive to change.
MongoDB
• Dynamic documents
• Traverse from start
• Known sizes
• Hierarchy Matters
• Much more flexibility
_id:int 1 name: str(3) "bob" sizes: array(256) [
double 3.5,
double 10,
double 1.2,
double 99]
when: date 18-7-1972
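The contrast on these slides can be sketched in a few lines: a fixed-width RDBMS row lets the engine jump straight to a column by offset, while a BSON document is walked element by element from the front, using each element's stored size to hop to the next. The field names and byte lengths below are illustrative only, not real BSON encoding:

```javascript
// Illustrative element list: [field name, encoded byte length] pairs,
// mimicking the slide's document (note the 256-byte nested array).
const elements = [
  ["_id", 4], ["name", 7], ["sizes", 256], ["when", 8],
];

// Finding a field means traversing from the start; the "known sizes"
// let us skip whole elements (even a nested array) without decoding them.
function offsetOf(field) {
  let off = 0;
  for (const [name, len] of elements) {
    if (name === field) return off;
    off += len;
  }
  return -1; // an absent field costs a full traversal
}
// offsetOf("when") hops past 4 + 7 + 256 = 267 bytes
```

This is why the slides say hierarchy and depth matter: fields late in a deep document cost more to reach than early ones.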
24. What about an Index?
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL (covering index applied): 21 seconds (vs. 37), 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 54 seconds, 25% CPU, 0 IOPS, 0 MB/s
25. But can’t MongoDB index cover too?
• Yes, but not when it’s a Multikey (array) index
• We index each unique value only once
• So the index cannot recreate the array
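A toy illustration of why a multikey index cannot cover a query: the index keys each distinct value once, in key order, so the original array (its order and any duplicates) is unrecoverable from the index alone. The values are made up:

```javascript
const arr = [3.5, 10, 1.2, 10];          // the array stored in the document

// A multikey index stores one key per distinct value, in sorted order.
const indexKeys = [...new Set(arr)].sort((a, b) => a - b);
// indexKeys = [1.2, 3.5, 10]: the duplicate 10 and the original order are
// gone, so the server must fetch the document itself to return the array.
```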
26. Can we fix that?
• What if we flatten the data?
• Lots of redundancy
• Collection is now 200% larger
• Normalisation?
db.prescriptions.aggregate([
{$unwind: "$prescriptions"},
{$project: {_id: 0}},
{$out: "tabular"}
])
db.tabular.createIndex({"prescriptions.cost": 1})
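What $unwind does to each document can be modelled in plain JavaScript: one output document per array element, with every other field duplicated into each one, which is exactly where the redundancy and ~200% growth mentioned here come from. Field names follow the slides; the values are invented:

```javascript
const doc = {
  practice: "A81001", period: 201701,
  prescriptions: [{ cost: 2.5 }, { cost: 4.0 }],
};

// Minimal model of {$unwind: "$prescriptions"}: replace the array with
// each element in turn, copying the rest of the document every time.
function unwind(d, field) {
  return d[field].map(elem => ({ ...d, [field]: elem }));
}

const rows = unwind(doc, "prescriptions");
// rows.length === 2, and "practice"/"period" are repeated in both rows.
```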
27. Flat, wide data.
• 110 M Rows
• 51 GB as BSON
• 15 GB Compressed
• Not really tabular.
• 860 MB Index
• Prefix Compression
• Super space efficient
28. Query Performance when flattened.
MongoDB, no covering index: 509 seconds (vs 54), 6% CPU, 1700 IOPS, 30 MB/s
MongoDB, with covering index: 509 seconds, 6% CPU, 1700 IOPS, 30 MB/s
29. Query Performance when flattened.
• That doesn’t look right.
MongoDB, no index: 509 seconds (vs 54), 6% CPU, 1700 IOPS, 30 MB/s
MongoDB, with index: 509 seconds, 6% CPU, 1700 IOPS, 30 MB/s
"queryPlanner" : {
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
}}
30. Flat, wide data.
• Need to persuade aggregation to use the index
• Add a query (cost > 0) or sort by cost at the start
• Still slower than the document model?
• Document model is efficient.
• This data is actually MOST of the database 110M Entries
• Imagine if our index was a small percentage of the data.
• Index compression has a cost when reading.
No index: 509 seconds (vs 54), 6% CPU, 1700 IOPS, 30 MB/s
With index: 177 seconds (vs 54), 25% CPU, 0 IOPS, 0 MB/s
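The workaround this slide describes can be sketched as a pipeline shape (not a tested server run): a $match (or $sort) on the indexed field at the head of the pipeline makes an index scan eligible, instead of the COLLSCAN seen in the previous explain output. The collection and field names follow the flattened "tabular" collection built earlier:

```javascript
// A leading $match on the indexed field lets the planner choose the
// {"prescriptions.cost": 1} index rather than a collection scan.
const pipeline = [
  { $match: { "prescriptions.cost": { $gt: 0 } } }, // makes index eligible
  { $group: { _id: "all", t: { $sum: "$prescriptions.cost" } } },
];
// In the shell: db.tabular.aggregate(pipeline)
```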
31. Table Layout
RDBMS
• Lots of fixed size rows in a file
• Nice predictable layout
MongoDB
• Variable Length rows in a file
• Less predictable layout
32. Table Layout – The Truth
• RDBMS and MongoDB both store records in Trees
• Records are in some ways, just like indexes.
33. Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Index on identity
• Can only walk the tree
• Slower to collection scan
• Less lock contention
34. Table Layout – The Truth
RDBMS
• Rows held in Balanced Tree
• This IS the Primary Key
• Linked leaves
MongoDB
• Docs in Balanced Tree
• Organised by Identity (int64)
• No links between leaves
• Slower to scan everything
• Much less lock contention
35. In-Document rollup.
• We have multiple data items in each document.
• Add summaries of cost in each document?
• Nearly free to maintain when updating anyway: $max, $min, $sum, $count
• RDBMS equivalent has big cost.
• You need to know in advance, or add as needed
• Like an RDBMS index
• What if we index the in-document rollup?
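The idea can be sketched as a write that maintains the rollup in the same operation that adds an item. The field name totalCost and the shell call in the comment are hypothetical, not from the talk:

```javascript
// Shell shape (hypothetical field names):
//   db.prescriptions.updateOne({ _id: id },
//     { $push: { prescriptions: item }, $inc: { totalCost: item.cost } })
// Plain-JS model of the same idea:
function addItem(doc, item) {
  doc.prescriptions.push(item);
  doc.totalCost = (doc.totalCost || 0) + item.cost; // rollup stays current
  return doc;
}

const doc = { prescriptions: [], totalCost: 0 };
addItem(doc, { cost: 2.5 });
addItem(doc, { cost: 4.0 });
// Reporting can now $sum (or index) "$totalCost" directly, without
// touching the prescriptions array at all.
```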
36. MongoDB with in document roll-up.
No index on IDI: 18 seconds (vs. 54, or 21 in the RDBMS), 25% CPU, 0 IOPS, 0 MB/s
Index on IDI: 18 seconds, 25% CPU, 0 IOPS, 0 MB/s
37. MongoDB with in document roll-up.
No index on IDI: 18 seconds (vs. 54, or 21 in the RDBMS), 25% CPU, 0 IOPS, 0 MB/s
Index on IDI: 0.01 seconds, 25% CPU, 0 IOPS, 0 MB/s
38. So far…
• When Data fits in RAM
• RDBMS Table scan faster than Mongo Collection scan
• RDBMS Index scan faster than RDBMS Table scan
• Large MongoDB Index scan isn't the solution
• In document rollups beat RDBMS Index scan
• Index scan of in-document rollups is really quick.
40. What if it wasn’t all about the CPU?
• Data Lakes and “Big Data”
• Limited by reading data from disk
• Limited by Parallelism
• New Experiment Time.
• Reduce RAM to much less than Data Size*
• Work with disk bound data.
• Still one CPU.
41. Table/Collection scan from Disk
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL: 157 seconds, 3.5% CPU, 1253 IOPS, 103 MB/s
MongoDB: 61 seconds, 25% CPU, 650 IOPS, 103 MB/s
42. Why is MongoDB faster
MySQL
• Data Size = 15 GB
MongoDB
• Data Size = 5GB*
• Minimal decompression overhead
*Not 'Big' Data
43. Index scan from Disk
select sum(cost)
From
prescriptions;
{ $group :
{ _id: "all" ,
t : { $sum :
{ $sum:
"$prescriptions.cost"
}
}
}
}
MySQL (index added): 41 seconds, 25% CPU, 1020 IOPS, 103 MB/s
MongoDB: 61 seconds, 25% CPU, 650 IOPS, 103 MB/s
45. More Complex Queries
From RAM
• May still use disk for temp tables, storage etc.
• All tables and indexes fit in RAM
From DISK
• Data does NOT fit in RAM
• Some indexes MAY be in RAM
• No indexes used for MongoDB
46. With Group BY (RAM)
select sum(cost)
from prescriptions
group by period;
{ $group : {
_id: "$period",
t : {$sum :
{ $sum:
"$prescriptions.cost"}
}}}
MySQL: 63 seconds, 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 60 seconds, 25% CPU, 0 IOPS, 0 MB/s
With index: 30 seconds, 25% CPU, 0 IOPS, 0 MB/s
47. With Group BY (Disk)
select sum(cost)
from prescriptions
group by period;
{ $group : {
_id: "$period",
t : {$sum :
{ $sum:
"$prescriptions.cost"}
}}}
With index: 39 seconds, 18% CPU, 1010 IOPS, 103 MB/s
MySQL: 160 seconds, 10% CPU, 1020 IOPS, 103 MB/s
MongoDB: 63 seconds, 24% CPU, 650 IOPS, 82 MB/s
48. Top 10 practices by spend (RAM)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
{ $group: { _id: "$practice",
spend: { $sum:
{ $sum:
"$prescriptions.cost"}}}},
{ $sort: { spend: -1}},
{ $limit: 10}
]
With index: 51 seconds, 18% CPU, 1010 IOPS, 103 MB/s
MySQL: 75 seconds, 10% CPU, 1020 IOPS, 103 MB/s
MongoDB: 63 seconds, 24% CPU, 650 IOPS, 82 MB/s
49. Top 10 practices by spend (Disk)
SELECT
practice, SUM(cost) AS totalspend
FROM
prescriptions
GROUP BY practice
ORDER BY totalspend DESC
LIMIT 10;
[
{ $group: { _id: "$practice",
spend: { $sum:
{ $sum:
"$prescriptions.cost"}}}},
{ $sort: { spend: -1}},
{ $limit: 10}
]
MySQL: 160 seconds, 10% CPU, 1150 IOPS, 104 MB/s
MongoDB: 64 seconds, 25% CPU, 650 IOPS, 82 MB/s
With index: 55 seconds, 21% CPU, 724 IOPS, 77 MB/s
50. £ per patient – JOIN and Group (RAM)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id": "$practice",
"perpatient": {"$sum":
{"$divide":
[{"$sum": "$prescriptions.cost"},
"$numpatients"]
}
}}},
{ "$sort": {"perpatient": -1}},
{ "$limit": 10}
MySQL: 105 seconds, 25% CPU, 0 IOPS, 0 MB/s
MongoDB: 62 seconds, 25% CPU, 0 IOPS, 0 MB/s
51. £ per patient – JOIN and Group (Disk)
SELECT
practice,
SUM(cost / numpatients) AS
totalspend, AVG(numpatients)
FROM
nhs.prescriptions pr,
nhs.patientcounts pc
WHERE
pr.practice = pc.code
GROUP BY practice
ORDER BY totalspend DESC LIMIT 10;
{ "$group" : { "_id": "$practice",
"perpatient": {"$sum":
{"$divide":
[{"$sum": "$prescriptions.cost"},
"$numpatients"]
}
}}},
{ "$sort": {"perpatient": -1}},
{ "$limit": 10}
MySQL: 160 seconds, 17% CPU, 1200 IOPS, 103 MB/s
MongoDB: 62 seconds, 24% CPU, 650 IOPS, 82 MB/s
52. £/patient/county – nested SELECT (RAM)
select county,sum(totalcost) as spend,sum(patients) as
patients,sum(totalcost)/sum(patients) as costperpatient
from
(select county,sum(cost) as totalcost, avg(numpatients)
as patients
from prescriptions pr,patientcounts pc,practices pa
where pr.practice=pc.code
and pr.practice=pa.code and pa.period=pr.period
group by county,practice) as byprac
group by county
having patients > 100000
order by costperpatient desc limit 20;
db.prescriptions.aggregate([
{"$group" : {"_id" : { "county": "$address.county",
"practice": "$practice"}, "spend" : { "$sum" : {"$sum" :
"$prescriptions.cost"}}, "numpatients" : { "$avg" :
"$numpatients"}}},
{ "$group": { "_id" : "$_id.county", "spend" :{ "$sum" :
"$spend" }, "patients" : {"$sum": "$numpatients"}}},
{"$addFields" : { "costperpatient" : { "$divide" :
["$spend","$patients"] }}},
{"$match" : { "patients" : { "$gt" : 100000}}},
{"$sort" : { "costperpatient" : -1}},
{"$limit": 20} ])
MySQL: 160 seconds, 24% CPU, 0 IOPS, 0 MB/s
MongoDB: 66 seconds, 24% CPU, 0 IOPS, 0 MB/s
53. Result
County Spend (£M) Patients Per Patient (£)
LINCOLNSHIRE 122 699309 175
WIRRAL 25 149554 172
CO DURHAM 102 596638 171
CLEVELAND 75 462593 163
ISLE OF WIGHT 25 144555 162
54. Result
County Spend (£M) Patients Per Patient (£)
LINCOLNSHIRE 122 699309 175
WIRRAL 25 149554 172
CO DURHAM 102 596638 171
CLEVELAND 75 462593 163
ISLE OF WIGHT 25 144555 162
County Spend (£M) Patients Per Patient (£)
BERKSHIRE 102 944538 108
MIDDLESEX 150 1469189 102
BRISTOL 14 145660 97
LONDON 522 5672564 92
LEEDS 9 122785 74
55. £/patient/county – nested SELECT (Disk)
select county,sum(totalcost) as spend,sum(patients) as
patients,sum(totalcost)/sum(patients) as costperpatient
from
(select county,sum(cost) as totalcost, avg(numpatients)
as patients
from prescriptions pr,patientcounts pc,practices pa
where pr.practice=pc.code
and pr.practice=pa.code and pa.period=pr.period
group by county,practice) as byprac
group by county
having patients > 100000
order by costperpatient desc limit 20;
db.prescriptions.aggregate([
  {"$group": {"_id": { "county": "$address.county",
      "practice": "$practice"},
    "spend": {"$sum": {"$sum": "$prescriptions.cost"}},
    "numpatients": {"$avg": "$numpatients"}}},
  {"$group": {"_id": "$_id.county",
    "spend": {"$sum": "$spend"},
    "patients": {"$sum": "$numpatients"}}},
  {"$addFields": {"costperpatient": {"$divide": ["$spend", "$patients"]}}},
  {"$match": {"patients": {"$gt": 100000}}},
  {"$sort": {"costperpatient": -1}},
  {"$limit": 20}])
SQL:     220 seconds  (21% CPU, 700 IOPS, 68 MB/s)
MongoDB:  67 seconds  (23% CPU, 640 IOPS, 82 MB/s)
56. Most common drugs – REGROUP (RAM)
select bnfcode,max(name), sum(nitems) as items
from nhs.prescriptions
group
by bnfcode
order by items desc
limit 10;
{ "$unwind": "$prescriptions"},
{ "$group": {
    "_id": "$prescriptions.bnfcode",
    "name": {"$max": "$prescriptions.name"},
    "items": {"$sum": "$prescriptions.nitems"}}},
{ "$sort": {"items": -1}},
{ "$limit": 10}]
SQL:           300 seconds  (23% CPU, 0 IOPS, 0 MB/s)
SQL (indexed): 126 seconds  (25% CPU, 0 IOPS, 0 MB/s)
MongoDB:       262 seconds  (25% CPU, 0 IOPS, 0 MB/s)
57. Grouping Techniques
SQL
• Can take advantage of index ordering by the group key: all items with
  the same key come together, so it can process one group at a time.
  1,1,1,1,1,2,2,2,2,3,3
• Uses a temp table and a sort when it can't.
MongoDB
• Does not take advantage of ordering (yet) – maintains a data structure
  with all values.
• Assumes you will want to group again further down the pipeline, so it
  is optimised for that.
• Builds a tree (using disk) for the values.
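The two strategies above can be sketched in plain JavaScript (an illustrative sketch only; neither engine is implemented this way, and the function names are mine):

```javascript
// Streaming group-by (SQL with an index on the group key): input
// arrives sorted, so a group is complete the moment the key changes.
// Only one running total is in memory at any time.
function streamingGroupSum(sortedPairs) {
  const out = [];
  let currentKey = null;
  let total = 0;
  for (const [key, value] of sortedPairs) {
    if (key !== currentKey) {
      if (currentKey !== null) out.push([currentKey, total]);
      currentKey = key;
      total = 0;
    }
    total += value;
  }
  if (currentKey !== null) out.push([currentKey, total]);
  return out;
}

// Hash-style group-by (MongoDB's $group): input order is irrelevant,
// but a running total for every distinct key is held at once - the
// structure that spills to disk when it outgrows memory.
function hashGroupSum(pairs) {
  const totals = new Map();
  for (const [key, value] of pairs) {
    totals.set(key, (totals.get(key) ?? 0) + value);
  }
  return [...totals.entries()];
}
```

The streaming version depends on the 1,1,1,2,2,3 ordering shown above; the hash version pays for its order-independence in memory.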
59. Most common drugs – REGROUP (Disk)
select bnfcode,max(name), sum(nitems) as items
from nhs.prescriptions
group
by bnfcode
order by items desc
limit 10;
{ "$unwind": "$prescriptions"},
{ "$group": {
    "_id": "$prescriptions.bnfcode",
    "name": {"$max": "$prescriptions.name"},
    "items": {"$sum": "$prescriptions.nitems"}}},
{ "$sort": {"items": -1}},
{ "$limit": 10}]
SQL:           1427 seconds  (4% CPU, 1800 IOPS, 100 MB/s)
SQL (indexed):  192 seconds  (13% CPU, 520 IOPS, 56 MB/s)
MongoDB:        262 seconds  (24% CPU, 180 IOPS, 23 MB/s)
60. Most Expensive Drugs - Result
Drug                               Category        Spend
Rivaroxaban_Tab 20mg               Anti-coagulant  £100,007,025
Apixaban_Tab 5mg                   Anti-coagulant  £79,302,385
Fostair_Inh 100mcg/6mcg (120D) C   Asthma          £75,541,726
Tiotropium_Pdr For Inh Cap 18mcg   COPD            £66,348,167
Sitagliptin_Tab 100mg              Diabetes        £60,919,725
Symbicort_Turbohaler 200mcg/6mcg   Asthma          £44,314,887
Apixaban_Tab 2.5mg                 Anti-coagulant  £41,290,937
Ins Lantus SoloStar_100u/ml 3ml    Diabetes        £41,182,602
Ezetimibe_Tab 10mg                 Cholesterol     £40,756,234
Linagliptin_Tab 5mg                Diabetes        £38,503,893
61. Anomaly Detection – JOIN Derived (RAM)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
    "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
    "name": {"$max": "$prescriptions.name"},
    "practicename": {"$max": "$address.name"},
    "nitems": {"$sum": "$prescriptions.nitems"},
    "numpatients": {"$max": "$numpatients"},
    "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
62. Anomaly Detection – JOIN Derived (RAM)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
    "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
    "name": {"$max": "$prescriptions.name"},
    "practicename": {"$max": "$address.name"},
    "nitems": {"$sum": "$prescriptions.nitems"},
    "numpatients": {"$max": "$numpatients"},
    "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
SQL:     2250 seconds  (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1489 seconds  (25% CPU, 0 IOPS, 0 MB/s)
63. Results
Drug                   Practice               Ratio   Note
Methadone              FULCRUM                297     Rehab
Trazodone              CARE HOMES MEDICAL     242     Elderly care
Buprenorphine          FULCRUM                233
Thickenup PDR          CARE HOMES MEDICAL     174
Vitrex Nitrile Gloves  REETH MEDICAL          173     Preference?
Ema Film Gloves        NEW SPrintwells1       168
Pro D3 Cap             PALFREY HEALTH CENTRE  123     Vitamin D?
Fultium D3 Cap         MOHANTY                123     Vitamin D
65. Anomaly Detection – JOIN Derived (Disk)
SELECT
prescriptions.bnfcode,MAX(prescriptions.name),
prescriptions.practice,MAX(practices.name),
AVG(nitems),AVG(patientcounts.numpatients),
AVG(aveperperson),AVG((nitems / patientcounts.numpatients) /
aveperperson) AS ratio
FROM
prescriptions
LEFT JOIN
(SELECT
bnfcode, AVG(nitems / numpatients) AS aveperperson
FROM
prescriptions, patientcounts
WHERE
prescriptions.practice = patientcounts.code
GROUP BY bnfcode) AS avgs ON avgs.bnfcode = prescriptions.bnfcode
LEFT JOIN
patientcounts ON prescriptions.practice = patientcounts.code
LEFT JOIN
practices ON practices.code = prescriptions.practice
WHERE
patientcounts.numpatients > 500
AND aveperperson > 0
AND prescriptions.practice NOT IN ('Y01924')
GROUP BY prescriptions.practice,prescriptions.bnfcode
ORDER BY ratio DESC
LIMIT 10;
db.prescriptions.aggregate([
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": "$prescriptions.bnfcode",
    "aveperperson": {"$avg": {"$divide": ["$prescriptions.nitems", "$numpatients"]}}}},
  {"$match": {"aveperperson": {"$ne": null}}},
  {"$out": "typical"}])

db.prescriptions.aggregate([
  {"$match": {"numpatients": {"$gt": 500}, "practice": {"$nin": ["Y01924"]}}},
  {"$unwind": "$prescriptions"},
  {"$group": {"_id": {"bnfcode": "$prescriptions.bnfcode", "practice": "$practice"},
    "name": {"$max": "$prescriptions.name"},
    "practicename": {"$max": "$address.name"},
    "nitems": {"$sum": "$prescriptions.nitems"},
    "numpatients": {"$max": "$numpatients"},
    "nmonths": {"$sum": 1}}},
  {"$addFields": {"perpatient": {"$divide": [{"$divide": ["$nitems", "$nmonths"]}, "$numpatients"]}}},
  {"$lookup": {"from": "typical", "localField": "_id.bnfcode", "foreignField": "_id", "as": "typical"}},
  {"$unwind": "$typical"},
  {"$addFields": {"ratio": {"$divide": ["$perpatient", "$typical.aveperperson"]}}},
  {"$sort": {"ratio": -1}},
  {"$limit": 10}])
SQL:     7848 seconds  (24% CPU, 1020 IOPS, 90 MB/s)
MongoDB: 1655 seconds  (25% CPU, 170 IOPS, 24 MB/s)
66. Conclusions
• MongoDB is faster from disk when there are no indexes
• MongoDB is generally faster for more complex queries
• MongoDB fits the data-lake model nicely.
73. BI Connector
• Total Spend By Period.
• Sum one column Group by Primary Key
(times in seconds)
       SQL Unindexed   SQL Indexed   MongoDB Aggregation   BI Connector
Disk   160             39            63                    -
RAM    63              30            60                    187
74. BI Connector
• Total Spend By PRACTICE.
• Sum one column Group by PART OF KEY
(times in seconds)
       SQL Unindexed   SQL Indexed   MongoDB Aggregation   BI Connector
Disk   160             55            62                    -
RAM    75              51            62                    230
75. BI Connector
• Total Spend By PATIENT.
• Sum one column Group BY JOINED FIELD
(times in seconds)
       SQL Indexed   MongoDB Aggregation   BI Connector
Disk   160           62                    -
RAM    105           62                    230
76. BI Connector
• AVG Spend By SPEND PER PATIENT PER COUNTY.
• Sum one column Group BY COMPUTED FIELD
(times in seconds)
       SQL Indexed   MongoDB Aggregation   BI Connector
Disk   220           67                    -
RAM    160           66                    220
77. BI Connector
• MOST prescribed drugs.
• Sum one column Group BY Not PK
(times in seconds)
       SQL Unindexed   SQL Indexed   MongoDB Aggregation   BI Connector
Disk   1427            192           262                   -
RAM    300             126           262                   280
78. BI Connector
• Anomaly Detection.
• Sum one column Group from subquery Joined table
(times in seconds)
       SQL Indexed   MongoDB Aggregation   BI Connector
Disk   7848          1655                  -
RAM    2250          1489                  D.N.F!
79. BI Connector – Did Not Finish – why?
• Anomaly Detection.
• Did Run – but was taking a long time
• Using expressive $lookup for the join
$lookup: {
  from: "prescriptions",
  let: { drug: "aspirin" },
  pipeline: [
    group total by drugname, practice,
    divide by practice size,
    group by drugname,
    match $$drug
  ],
  as: <output array field>
}
80. BI Connector – why Did Not Finish
• Anomaly Detection.
• Did run – but was taking a long time
• Using expressive $lookup for the join
• No Index on in-memory table
• Hand written version
• used temp collection
• Made use _id was lookup field
$lookup: {
  from: "prescriptions",
  let: { drug: "aspirin" },
  pipeline: [
    group total by drugname, practice,
    divide by practice size,
    group by drugname,
    match $$drug
  ],
  as: <output array field>
}
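To make the pseudocode above concrete, here is one way such an expressive $lookup could be written, as a plain JavaScript object (the field names follow the schema used throughout this talk; the actual stages the BI Connector emits are not shown in the deck, so treat this as an illustration of the shape, not the connector's real output):

```javascript
// Sketch of an expressive $lookup for one drug. Without an index the
// joined collection can use, this inner pipeline effectively re-runs
// per outer document - the behaviour that made the BI Connector DNF.
const lookupStage = {
  $lookup: {
    from: "prescriptions",
    let: { drug: "$prescriptions.name" },
    pipeline: [
      { $unwind: "$prescriptions" },
      // keep only rows for the outer document's drug
      { $match: { $expr: { $eq: ["$prescriptions.name", "$$drug"] } } },
      // per-practice total, divided by practice size
      { $group: {
          _id: "$practice",
          perPatient: { $avg: { $divide: ["$prescriptions.nitems", "$numpatients"] } } } },
      // collapse to one "typical" value for the drug
      { $group: { _id: null, aveperperson: { $avg: "$perPatient" } } }
    ],
    as: "typical"
  }
};
```

The hand-written pipeline avoided this cost by materialising the inner query once with $out and joining on _id, which every collection indexes by default.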
81. Conclusions
• BI Connector is a little slower than the RDBMS for simple queries
• The more complicated the query, the better it does relatively
• It's not as quick as a hand-crafted aggregation
• But you can put that in views
• But it's very convenient
• You can use your existing BI tooling
• And you could always use Charts instead
84. Pros
• Simpler to write much more complicated processing
• Lots of libraries of pre-written code
• Able to perform a lot of in-memory computation
• MongoDB can send them data very, very quickly
85. Cons
• Costs of transferring from or inside clouds (Atlas, AWS)
• Network speed limitations
• Additional hardware
• Security considerations
AWS same region       1 cent / GB
AWS between regions   9 cents / GB
AWS out to            11 cents / GB
86. So do I use Spa^HR^doop! Or not?
• Yes – those tools are great for many things
• But always push computation DOWN to MongoDB if you can
• There is a balance
• Amount of effort to write as a Pipeline
• Reduced network costs in time and money
87. Simple Example
• Pearson's rho
• Degree of correlation between two numeric lists
• Let's compare latitude (north vs south)
• And quantity of drug per person
• Hypothesis: "For some drugs, more is prescribed as you travel up or down the UK"
88. Step 1 - Geocoding
• We need to augment our records with Lat/Long
• Download a handy set of postcode centroids
• mongoimport --type csv --headerline -d nhs postcodes.csv
• Use $lookup and $out to attach to each record and make new collection.
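A sketch of that $lookup/$out step, assuming the imported centroids land in a `postcodes` collection with `postcode`, `lat` and `lon` fields, and that each practice document carries `address.postcode` (both are schema assumptions; the deck does not show these field names):

```javascript
// Geocoding sketch: attach a lat/lon centroid to each record and
// materialise the result as a new collection. Field names assumed.
const geocodePipeline = [
  { $lookup: {
      from: "postcodes",
      localField: "address.postcode",
      foreignField: "postcode",
      as: "geo" } },
  { $unwind: "$geo" },                      // one centroid per record
  { $addFields: { lat: "$geo.lat", lon: "$geo.lon" } },
  { $project: { geo: 0 } },                 // drop the join scaffolding
  { $out: "prescriptions_geo" }             // write the new collection
];
// db.prescriptions.aggregate(geocodePipeline)
```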
90. Step 2 – Group by drug
• For each drug compute average quantity/10,000 patients per
surgery
• Group to one record per drug with an array of objects
{
  drug: "Aspirin",
  prescribed: [
    { where: [-3.5, 55.2],
      per10k: 75.4 }
    ...
  ]
}
91. Group by BNF code
unwind ={ $unwind:"$prescriptions"}
regroup = { $group : {
_id : "$prescriptions.name",
p : { $push : { where : ["$lon","$lat"] ,
per10k : { $multiply: [10000,
{$divide : [ "$prescriptions.nitems",
"$numpatients"]}]}}}}}
92. Step 3 – Pearson's rho
• Compute rho on the array, comparing per10k to latitude.
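For reference, here is what that computation looks like if done client-side in a few lines of JavaScript, applied to the grouped documents from the previous step (function names are mine; the talk's point is that this can equally be expressed as pipeline stages):

```javascript
// Pearson's rho between two equal-length numeric arrays.
function pearsonRho(xs, ys) {
  const n = xs.length;
  const mean = a => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// For one grouped drug document of the shape shown earlier:
// { drug, prescribed: [{ where: [lon, lat], per10k }, ...] }
function latitudeCorrelation(doc) {
  const lats = doc.prescribed.map(p => p.where[1]);
  const per10k = doc.prescribed.map(p => p.per10k);
  return pearsonRho(lats, per10k);
}
```

A rho near +1 or -1 for a drug supports the hypothesis that prescribing volume tracks latitude; near 0 means no linear relationship.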
100. Conclusions
• The Aggregation Framework is fast
• There is no truth to "RDBMS is just better"
• It's a good choice for non-trivial, ad-hoc queries
• It's a good choice for large data sets
• Consider sharding and microsharding
• In a Cloud world – push work to the database
• Even with R/SAS/Spark, etc.
Editor's Notes
Intro – Shard N – Tech heavy – understandable by all
Not my usual talk – reviews: 40% too easy, 40% too hard – this is for the 20% in the middle
Clarify – what if the Live data is in MongoDB and you want to report
In place? Copy to RDBMS?
Imagine – all things being equal – you want to build a data warehouse, or a data lake, or you don't really know yet and just want to report on the data you have.
How does MongoDB compare to a more traditional approach?
Although you do have to differentiate between OLAP and warehousing etc.