Back to Basics, Webinar 5: Introduction to the Aggregation Framework

MongoDB Europe 2016
Old Billingsgate, London
15th November
mongodb.com/europe
Register with the code massimobrignoli20 for a 20% discount

Massimo Brignoli
Principal Solution Architect, EMEA
@massimobrignoli
Back to Basics 2016 : Webinar 5
Introduction to the Aggregation Framework

Recap
• Webinar 1 – Introduction to NoSQL
– The different types of NoSQL database
– MongoDB as a document database
• Webinar 2 – Our first application
– Creating databases and collections
– CRUD, indexes and explain
• Webinar 3 – Schema Design
– Dynamic schema
– Approaches to embedding
• Webinar 4 – Full-text and geospatial indexes

The Aggregation Framework
• Analytics engine for MongoDB
• Think of the two types of database: OLTP and OLAP
• OLTP: Online Transaction Processing
– Flight booking
– ATMs
– Taxi booking
• OLAP: Online Analytical Processing
– Which ticket earns us the most?
– When do we need to refill the ATM?
– How many taxis do we need to serve the area east of Milan?

OLAP - There Be (Hadoop) Dragons Here
• OLAP queries are often table scans
• Query output is often structured for later analysis and comparison
• Many customers are looking at Spark and Hadoop, but:
– The complexity is astronomical
– The focus is on data analysis algorithms (you have to write a program)
– They require some knowledge of parallel algorithms
• The aggregation framework is a much "gentler" tool
• The goal is to do what you want to do in less time

Analytics on MongoDB Data
• Extract data from MongoDB and perform complex analytics with Hadoop
– Batch rather than real-time
– Extra nodes to manage
• Direct access to MongoDB from Spark
• MongoDB BI Connector
– Direct SQL access from BI tools
• MongoDB aggregation pipeline
– Real-time
– Live, operational data set
– Narrower feature set
[Diagram labels: Hadoop Connector, MapReduce & HDFS, SQL Connector]

What is an Aggregation Pipeline?
• A series of document transformations
– Executed in stages
– The initial input is a collection
– The output can be a cursor or a collection
• Rich library of functions
– Filter, compute, group and summarize data
– The output of one stage is the input of the next stage
– Operations are executed in sequential order
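
The stage-by-stage flow described above can be sketched in plain JavaScript. This is only an in-memory illustration of the idea, not how MongoDB executes a pipeline; `runPipeline`, the stage functions and the sample documents are assumptions made for the example.

```javascript
// Each stage is modeled as a function from an array of documents to a new array.
// runPipeline is a hypothetical helper, not a MongoDB API.
function runPipeline(docs, stages) {
  // The output of one stage becomes the input of the next, in order.
  return stages.reduce((current, stage) => stage(current), docs);
}

// Illustrative sample documents.
const sampleStates = [
  { name: "New York", region: "North East", areaM: 218 },
  { name: "Oregon", region: "West", areaM: 245 },
  { name: "California", region: "West", areaM: 300 },
];

const pipelineResult = runPipeline(sampleStates, [
  (docs) => docs.filter((d) => d.region === "West"),             // plays the role of $match
  (docs) => docs.map((d) => ({ name: d.name, areaM: d.areaM })), // plays the role of $project
  (docs) => [...docs].sort((a, b) => b.areaM - a.areaM),         // plays the role of $sort (descending)
]);

console.log(pipelineResult);
```

Reordering the stage functions changes the result, which is the same reason stage order matters in a real pipeline.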

Pipeline Operators
• $match – Filter documents
• $geoNear – Geospatial query
• $project – Reshape documents
• $lookup – New: left-outer equi-joins
• $unwind – Expand documents
• $group – Summarize documents
• $sample – New: randomly selects a subset of documents
• $sort – Order documents
• $skip – Jump over a number of documents
• $limit – Limit number of documents
• $redact – Restrict documents
• $out – Sends results to a new collection

Aggregation Pipeline
[Diagram sequence: a collection of documents passes through the stages $match, $project, $lookup and $group in turn; each slide adds one stage to the pipeline.]

Example: US Census Data
• Census data from 1990, 2000, 2010
• Question:
– Which US division has the fastest-growing population density?
– Division = a group of US states
– Population density = # of people / area of division
– Data is provided at the state level
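
As a worked example of the question above, the growth rate of population density between two census years can be sketched as follows. `densityGrowth` is a hypothetical helper, and the area value passed to it is an assumed figure used only for illustration, since the slides do not give one; the two population totals are California's 1990 and 2010 values from the Document Model slide.

```javascript
// Growth rate of population density between two census years.
// area is a hypothetical input; the slides do not state an area value.
function densityGrowth(popStart, popEnd, area) {
  const densityStart = popStart / area; // people per unit of area
  const densityEnd = popEnd / area;
  return densityEnd / densityStart - 1;
}

// California totals for 1990 and 2010; 155779 is an assumed area figure.
const californiaGrowth = densityGrowth(29760021, 37253956, 155779);
console.log((californiaGrowth * 100).toFixed(1) + "%");
```

Because the area of a division does not change between censuses, it cancels out: the density growth rate equals the population growth rate.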

Document Model
{ "_id" : ObjectId("54e23c7b28099359f5661525"),
"name" : "California",
"region" : "West",
"data" : [
{ "totalPop" : 33871648,
"totalHouse" : 12214549,
"occHouse" : 11502870,
"year" : 2000},
{ "totalPop" : 37253956,
"occHouse" : 12577498,
"year" : 2010},
{ "totalPop" : 29760021,
"occHouse" : 29008161,
"year" : 1990}
],
…
}

Total US Area
db.cData.aggregate([
{"$group" : {
"_id" : null,
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"}
}}
])

$group
• Groups documents by value
– Field reference, object, constant
– Computes other output fields
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by default

Area by Region
db.cData.aggregate([
{"$group" : {
"_id" : "$region",
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"},
"numStates" : {$sum : 1},
"states" : {$push : "$name"}
}
}
])

Compute the Average State Area per Region
{state: "New York",
areaM: 218,
region: "North East"
}
{state: "New Jersey",
areaM: 90,
region: "North East"
}
{state: "California",
areaM: 300,
region: "West"
}
{ $group: {
_id: "$region",
avgAreaM: {$avg: "$areaM" }
}}
{ _id: "North East",
avgAreaM: 154}
{_id: "West",
avgAreaM: 300}
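
What the $group stage with $avg does to these three documents can be mimicked in plain JavaScript. This is a minimal in-memory sketch, not MongoDB's implementation; `groupAvg` is a hypothetical helper name.

```javascript
// In-memory sketch of {$group: {_id: "$region", avgAreaM: {$avg: "$areaM"}}}.
// groupAvg is a hypothetical helper, not part of MongoDB.
function groupAvg(docs, keyField, valueField, outField) {
  const buckets = new Map(); // group key -> running sum and count
  for (const doc of docs) {
    const key = doc[keyField];
    const bucket = buckets.get(key) || { sum: 0, count: 0 };
    bucket.sum += doc[valueField];
    bucket.count += 1;
    buckets.set(key, bucket);
  }
  // One output document per distinct key.
  return [...buckets].map(([key, b]) => ({ _id: key, [outField]: b.sum / b.count }));
}

const statesToGroup = [
  { state: "New York", areaM: 218, region: "North East" },
  { state: "New Jersey", areaM: 90, region: "North East" },
  { state: "California", areaM: 300, region: "West" },
];

const avgByRegion = groupAvg(statesToGroup, "region", "areaM", "avgAreaM");
console.log(avgByRegion);
```

The sketch keeps every bucket in memory, which mirrors the note that $group processes all data in memory by default.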

Total US Population by Year
db.cData.aggregate([
{$unwind : "$data"},
{$group : {
"_id" : "$data.year",
"totalPop" : {$sum :"$data.totalPop"}}},
{$sort : {"totalPop" : 1}}
])

$unwind
• Operates on an array field
– Creates documents from the array elements
• The array is replaced by the value of each element
• If the array field is missing → no output
• If the field is not an array → error (before MongoDB 3.2; from 3.2 on, the value is treated as a single-element array)
– Pipe to $group to aggregate

$unwind
{ state: "New York",
census: [1990, 2000, 2010]}
{ state: "New Jersey",
census: [1990, 2000]}
{ state: "California",
census: [1980, 1990, 2000, 2010]}
{ state: "Delaware",
census: [1990, 2000]}
{ $unwind: "$census" }
{ state: "New York", census: 1990}
{ state: "New York", census: 2000}
{ state: "New York", census: 2010}
{ state: "New Jersey", census: 1990}
{ state: "New Jersey", census: 2000}
…
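
The expansion performed by $unwind can be sketched in plain JavaScript. This is an illustration only: `unwind` is a hypothetical helper, it assumes the field holds an array whenever it is present, and it does not model MongoDB's handling of other value types.

```javascript
// In-memory sketch of {$unwind: "$census"}; unwind is a hypothetical helper.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    // A missing field yields no output; present values are assumed to be arrays.
    const items = doc[field] || [];
    for (const item of items) {
      out.push({ ...doc, [field]: item }); // one copy of the document per array element
    }
  }
  return out;
}

const censusDocs = [
  { state: "New York", census: [1990, 2000, 2010] },
  { state: "New Jersey", census: [1990, 2000] },
  { state: "Nowhere" }, // illustrative document without the array field
];

const unwound = unwind(censusDocs, "census");
console.log(unwound);
```

Each output document carries a single census year, which is the shape a following $group stage needs in order to aggregate per year.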

$match
• Filters documents
– Uses the same syntax as .find()

$match
{state: "New York",
areaM: 218,
region: "North East"
}
{state: "Oregon",
areaM: 245,
region: "West"
}
{state: "California",
areaM: 300,
region: "West"
}
{ $match:
{ "region" : "West" }
}
{state: "Oregon", areaM: 245,
region: "West"}
{state: "California", areaM: 300,
region: "West"}
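
The filtering step can be mimicked with a small equality-only matcher. This is a deliberately reduced sketch: the real $match accepts the full find() query language, while the hypothetical `match` helper below only compares fields for strict equality.

```javascript
// Equality-only sketch of $match; match is a hypothetical helper.
function match(docs, query) {
  return docs.filter((doc) =>
    // Every field named in the query must equal the document's value.
    Object.entries(query).every(([field, expected]) => doc[field] === expected)
  );
}

const statesToFilter = [
  { state: "New York", areaM: 218, region: "North East" },
  { state: "Oregon", areaM: 245, region: "West" },
  { state: "California", areaM: 300, region: "West" },
];

const western = match(statesToFilter, { region: "West" });
console.log(western.map((d) => d.state));
```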

Population Delta by State, 1990 to 2010
db.cData.aggregate([
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : { "_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : {"_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" : ["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010" : 1}
}])
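
The logic of this pipeline for a single state can be traced in plain JavaScript. This is a sketch under stated assumptions: `populationDelta` is a hypothetical helper, and the input reuses California's population totals from the Document Model slide with the other fields omitted.

```javascript
// Sketch of the pipeline's logic for one state document; populationDelta is a hypothetical helper.
function populationDelta(stateDoc) {
  // $unwind + $sort: order the census entries by year, ascending.
  const byYear = [...stateDoc.data].sort((a, b) => a.year - b.year);
  // $group with $first / $last: earliest and latest population after sorting.
  const first = byYear[0].totalPop;
  const last = byYear[byYear.length - 1].totalPop;
  // $project with $subtract: rename and compute the delta.
  return { name: stateDoc.name, pop1990: first, pop2010: last, delta: last - first };
}

const california = {
  name: "California",
  data: [
    { totalPop: 33871648, year: 2000 },
    { totalPop: 37253956, year: 2010 },
    { totalPop: 29760021, year: 1990 },
  ],
};

const californiaDelta = populationDelta(california);
console.log(californiaDelta);
```

The $first and $last accumulators only pick the 1990 and 2010 values because the documents were sorted by year beforehand; in the stored array the 1990 entry comes last.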

$sort, $limit, $skip
• Sorts documents by one or more fields
– Same syntax as cursors
– Waits for the preceding pipeline stage to finish
– In-memory unless it is at the start of the pipeline and indexed
• $limit and $skip behave the same as on cursors
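
How the three stages combine can be sketched on an in-memory array. `page` is a hypothetical helper and the sample documents are illustrative; the sketch sorts the whole input before slicing, which echoes the point that a sort has to see all incoming documents first.

```javascript
// Sketch of $sort, $skip and $limit applied in that order; page is a hypothetical helper.
function page(docs, sortField, skip, limit) {
  return [...docs]
    .sort((a, b) => a[sortField] - b[sortField]) // ascending, like {$sort: {field: 1}}
    .slice(skip, skip + limit);                  // like {$skip: skip} followed by {$limit: limit}
}

const statesToPage = [
  { state: "New York", areaM: 218 },
  { state: "Oregon", areaM: 245 },
  { state: "California", areaM: 300 },
  { state: "New Jersey", areaM: 90 },
];

const secondAndThird = page(statesToPage, "areaM", 1, 2);
console.log(secondAndThird.map((d) => d.state));
```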

Including and Excluding Fields
{
"_id" : "Virginia",
"pop1990" : 453588,
"pop2010" : 3725789
}
{
"_id" : "South Dakota",
"pop1990" : 453588,
"pop2010" : 3725789
}
{ $project:
{ "_id" : 0,
"pop1990" : 1,
"pop2010" : 1}
}
{"pop1990" : 453588,
"pop2010" : 3725789}
{"pop1990" : 453588,
"pop2010" : 3725789}

Renaming and Computing Fields
{
"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024
}
{
"_id" : "South Dakota",
"pop1990" : 696004,
"pop2010" : 814180
}
{ $project:
{ "_id" : 0,
"name" : "$_id",
"delta" :
{"$subtract" :
["$pop2010",
"$pop1990"]}}
}
{"name" : "Virginia",
"delta" : 1813666}
{"name" : "South Dakota",
"delta" : 118176}

$geoNear
• Sorts/filters documents by location
– Requires a geospatial index
– The output includes the physical distance
– Must be the first aggregation stage

$geoNear
{"_id" : "Virginia",
"pop1990" : 6187358,
"pop2010" : 8001024,
"center" :
{"type" : "Point",
"coordinates" :
[78.6, 37.5]}}
{"_id" : "Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
"center" :
{"type" : "Point",
"coordinates" :
[86.6, 37.8]}}
{$geoNear : {
"near" : {"type" : "Point",
"coordinates" :
[90, 35]},
distanceField : "dist", // required option; the field name "dist" is illustrative
maxDistance : 500000,
spherical : true }}
{"_id" : "Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
"center" :
{"type" : "Point",
"coordinates" :
[86.6, 37.8]},
"dist" : …} // computed distance in meters from the "near" point
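
Why only Tennessee survives the 500000-meter limit can be checked with a haversine calculation. This is an illustrative stand-in for a spherical distance query, not MongoDB's code; `haversine`, `geoNear`, the `dist` field name and the Earth radius constant are assumptions of the sketch, while the coordinates are the ones from the slide.

```javascript
// Haversine great-circle distance in meters between two [longitude, latitude] points.
const EARTH_RADIUS_M = 6371000; // mean Earth radius, an approximation

function haversine([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// geoNear is a hypothetical helper: attach the distance, filter by maxDistance, sort nearest first.
function geoNear(docs, near, maxDistance) {
  return docs
    .map((doc) => ({ ...doc, dist: haversine(near, doc.center.coordinates) }))
    .filter((doc) => doc.dist <= maxDistance)
    .sort((a, b) => a.dist - b.dist);
}

const statesWithCenters = [
  { _id: "Virginia", center: { type: "Point", coordinates: [78.6, 37.5] } },
  { _id: "Tennessee", center: { type: "Point", coordinates: [86.6, 37.8] } },
];

const nearby = geoNear(statesWithCenters, [90, 35], 500000);
console.log(nearby.map((d) => d._id));
```

The sketch scans every document, whereas the real stage relies on the geospatial index to avoid that.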

Aggregate Options
db.cData.aggregate([<pipeline stages>],
{'explain' : false,
'allowDiskUse' : true,
'cursor' : {'batchSize' : 5}})
• explain – similar to find().explain()
• allowDiskUse – enables the use of disk
• cursor – specifies the size of the initial result batch

Sharding
• Workload is split across the shards
– Shards execute the pipeline up to a certain point
– The primary shard merges the cursors and continues processing
– Use explain to analyze how the pipeline is split
– An initial $match can exclude shards that are not needed
*Prior to v2.6 second stage pipeline processing was done by mongos
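
The idea of splitting a pipeline into a per-shard part and a merge part can be sketched with a two-phase average. This illustrates the general technique only and is not MongoDB's internal implementation; `shardPartial`, `mergePartials` and the two sample "shards" are assumptions made for the example.

```javascript
// Per-shard phase: keep a running sum and count per region, not the average itself.
function shardPartial(docs) {
  const partial = {};
  for (const { region, areaM } of docs) {
    partial[region] = partial[region] || { sum: 0, count: 0 };
    partial[region].sum += areaM;
    partial[region].count += 1;
  }
  return partial;
}

// Merge phase: combine the partial results from all shards, then finish the average.
function mergePartials(partials) {
  const merged = {};
  for (const partial of partials) {
    for (const [region, { sum, count }] of Object.entries(partial)) {
      merged[region] = merged[region] || { sum: 0, count: 0 };
      merged[region].sum += sum;
      merged[region].count += count;
    }
  }
  return Object.entries(merged).map(([region, { sum, count }]) => ({
    _id: region,
    avgAreaM: sum / count,
  }));
}

// Illustrative data spread over two hypothetical shards.
const shardA = [{ region: "North East", areaM: 218 }, { region: "West", areaM: 300 }];
const shardB = [{ region: "North East", areaM: 90 }, { region: "West", areaM: 245 }];

const mergedAverages = mergePartials([shardA, shardB].map(shardPartial));
console.log(mergedAverages);
```

Carrying sums and counts instead of per-shard averages keeps the merged result correct even when shards hold different numbers of documents per group.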

Existing Alternatives to Joins
• Option 1: Include all of an order's data in the same document
– Fast reads
• One find returns all the required data
– Consumes extra space
• The details of each product are stored in many orders
– Complex to maintain
• A change to a product attribute must be propagated to every order that contains it
orders
{ "_id": 10000,
"items": [
{ "productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23},
{ "productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
"remainingStock": 276}],
…
}

Existing Alternatives to Joins
• Option 2: The order document references the product documents
– Slower reads
• Multiple trips to the database
– Efficient use of space
• Product details are stored only once
– Loses the point-in-time snapshot of the whole record
– Logic in the application
• It must iterate over the product IDs to find all the product documents
• An RDBMS automates this with a JOIN
orders
{
"_id": 10000,
"items": [
12345,
54321
],
...
}
products
{
"_id": 12345,
"productName": "laptop",
"unitPrice": 1000,
"weight": 1.2,
"remainingStock": 23
}
{
"_id": 54321,
"productName": "mouse",
"unitPrice": 20,
"weight": 0.2,
"remainingStock": 276
}
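
The application-side logic that Option 2 requires can be sketched as follows. The `productsById` map stands in for queries against the products collection, and `resolveItems` is a hypothetical helper; both are assumptions of the example, and only the fields shown on the slide are used.

```javascript
// Stand-in for the products collection, keyed by _id.
const productsById = new Map([
  [12345, { _id: 12345, productName: "laptop", unitPrice: 1000, weight: 1.2, remainingStock: 23 }],
  [54321, { _id: 54321, productName: "mouse", unitPrice: 20, weight: 0.2, remainingStock: 276 }],
]);

const referencedOrder = { _id: 10000, items: [12345, 54321] };

// Application-side "join": resolve each referenced product ID into its document.
// In a real application each lookupProduct call would be a database round trip.
function resolveItems(order, lookupProduct) {
  return { ...order, items: order.items.map((id) => lookupProduct(id)) };
}

const fullOrder = resolveItems(referencedOrder, (id) => productsById.get(id));
const orderTotal = fullOrder.items.reduce((sum, product) => sum + product.unitPrice, 0);
console.log(fullOrder.items.map((p) => p.productName), orderTotal);
```

The prices come from the current product documents, so the order no longer records what was charged at purchase time, which is the lost point-in-time snapshot mentioned above.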

The winner?
• In general, Option 1 wins
– Performance and keeping everything in one place beat space efficiency and normalization
– There are exceptions
• e.g. comments on a blog post -> unbounded size
• However, analytics can benefit from combining data from different collections

$lookup
• Left-outer join
– Includes all documents from the left collection
– For each document in the left collection, finds the matching documents in the right collection and embeds them
[Diagram: Left Collection, Right Collection]

$lookup
db.leftCollection.aggregate([{
$lookup:
{
from: "rightCollection",
localField: "leftVal",
foreignField: "rightVal",
as: "embeddedData"
}
}])
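
The shape of the result a left-outer equi-join produces can be sketched in memory. `lookup` is a hypothetical helper that borrows the option names from the stage above, and the two small arrays are illustrative stand-ins for the collections; this is not MongoDB's implementation.

```javascript
// In-memory sketch of a left-outer equi-join; lookup is a hypothetical helper.
function lookup(left, right, { localField, foreignField, as }) {
  return left.map((leftDoc) => ({
    ...leftDoc,
    // Matching right-hand documents are embedded as an array.
    // A left document with no match is kept, with an empty array.
    [as]: right.filter((rightDoc) => rightDoc[foreignField] === leftDoc[localField]),
  }));
}

const leftDocs = [
  { _id: 1, leftVal: "a" },
  { _id: 2, leftVal: "z" },
];
const rightDocs = [
  { _id: 10, rightVal: "a" },
  { _id: 11, rightVal: "a" },
];

const joined = lookup(leftDocs, rightDocs, {
  localField: "leftVal",
  foreignField: "rightVal",
  as: "embeddedData",
});
console.log(JSON.stringify(joined));
```

Every left document appears exactly once in the output, however many right documents it matches.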

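BI Connector
[Diagram: the MongoDB BI Connector uses mapping metadata to present application data, e.g. the document {name: "Andrew", address: {street: …}}, as tables for analytics & visualization tools (Document → Table → Analytics & visualization)]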

Summary
• A pipeline of operations
• Select, project, group, sort
• $out must be the last operator
• There are various types of accumulators (see the $group documentation)
• A very powerful system for analyzing and transforming data
• Shard-aware, to get the most out of large clusters
