MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB

Jane Fine
Director of Product Marketing, Analytics
jane.fine@mongodb.com
@janeuyvova

Bryan Reinero
Product Manager
bryan@mongodb.com
@bryanreinero

Data Analytics with MongoDB
Custom Code +
Charting Libraries
ETL +
3rd Party BI
Tools
MongoDB BI Connector +
3rd Party BI Tools
MongoDB
Charts
MongoDB
Compass

But First, Setup
MongoDB Atlas Cluster
MongoDB Connector for BI
Charts Server
MongoDB Compass
https://github.com/breinero/workshop

MongoDB Connector
for Business Intelligence

MongoDB BI Connector
Visualize and explore MongoDB
data in SQL-based BI tools:
Automatically discovers the schema
Translates complex SQL statements
issued by the BI tool into MongoDB
aggregation queries
Converts the results into a tabular format
for rendering inside the BI tool

Schema Translation
{
_id:
ObjectId("5b9a6d75f68257c5a994558d"),
id: 2318,
listing_url:
"https://www.airbnb.com/rooms/2318",
name: "Casa Madrona",
summary: "Gorgeous 1917 Dutch
Colonial…”
calendar: [
{ listing_id: 2318,
date: "2019-08-15",
available: "t",
price: 296
},
{ listing_id: 2318,
date: "2019-08-14",
available: "t",
price: 296
}
,...
mysql> show tables;
+----------------+
| Tables_in_test |
+----------------+
| sea |
| sea_calendar |
| sea_ratings |
+----------------+
3 rows in set (0.06 sec)

Demo
● use $ mongo to examine data in MongoDB
● use $ mysql to examine tabular representation of collections
● explain()
● mysql -h dataanalyticsworkshop-biconnector-sshrq.mongodb.net -P
27015 -u demo -p --default-auth=mongosql_auth

Exercise 1: Analysing AirBnB Data
with BI Connector
Make a DSN connection to MongoDB
Import Data Into Excel
Identify the most expensive neighborhood group in Austin
Determine average nightly price for an Airbed, per neighborhood

MongoDB BI Connector - When to Use
Want to speak SQL to MongoDB
Multi data sources (not just MongoDB)
Business analysts
Reporting only
Powerful but you lose some benefits of schema flexibility

Options for Visualizing MongoDB Data
Custom Code + Charting
Libraries
ETL + 3rd Party BI ToolsBI Connector + 3rd Party
BI Tools

Wouldn’t it be nice if...
You could visualize your MongoDB Data…
without needing to write your own code
without needing to move your data into a different repository
without needing to purchase and configure third-party tools
without losing the richness of the Document Model

MongoDB Charts beta
The fastest way to build
visualizations over your
MongoDB data
Built for the MongoDB Document
Model
Visualize live data from on-prem
or Atlas DB

Charts Basic Concepts
A data source is a reference to a MongoDB collection or view that
contains data you want to visualize.
A chart is a visualization of data from a single data source.
A dashboard is a collection of charts which you manage as a unit
(name, layout, sharing)

What to Expect
Charting Capabilities
Common chart types
Aggregation functions
Filtering
Sample Mode
Binning
Sorting
Document Model Support
Type handling
Polymorphic collections
Nested documents
Array reductions

Sharing and Permissions
All users log onto Charts with their own account
• Users can be managed by anyone with the UserAdmin role
Any user can add a data source
• They must have a valid connection URI to connect to the MongoDB instance
• A data source owner can choose to share with nobody, specific people or
everybody
Any user can add a dashboard
• By default, the dashboard is only visible to the creator
• A dashboard owner can choose to share with nobody, specific people or
everybody

Exercise 2: Analysing AirBnB Data
with Charts
Add a Data Source
Create a Dashboard
Add some charts
Multi-Series Stacked Bar Chart
Coloured Bar Chart
Review Scores Histogram
Area Chart with Binning

The fastest way to build visualizations over your MongoDB
data
Ad hoc analyses
Benefit from the Document Model
Collaboration
Self-service
Intuitive enough for domain experts, non-devs to use!

MongoDB Aggregation Framework
and Compass

MongoDB
{ customer_id : 1,
first_name : "Mark",
last_name : "Smith",
city : "San Francisco",
phones: [ {
number : "1-212-777-1212",
type : "work"
},
{
number : "1-212-777-1213",
type : "cell"
}]
……...
Expressive
Queries
Find anyone with phone # “1-212…”
Check if the person with number “555…” is on the “do not call” list
Geospatial
Find the best offer for the customer at geo coordinates of 42nd St.
and 6th Ave
Text Search Find all tweets that mention the firm within the last 2 days
Aggregation
Count and sort number of customers by city, compute min, max, and
average spend
Native Binary
JSON Support
Add an additional phone number to Mark Smith’s record without
rewriting the document
Update just 2 phone numbers out of 10
Sort on the modified date
JOIN
($lookup)
Query for all San Francisco residences, lookup their transactions,
and sum the amount by person
Graph Queries
($graphLookup)
Query for all people within 3 degrees of separation from Mark
Rich query functionality

Advanced data processing
pipeline for transformations
and analytics
Multiple stages
Similar to a unix pipe
Rich Expressions
Collection
db.orders.aggregate( [
$match stage {$match: { status: "A" } },
$group stage { $group: { _id: "$cust_id",total: { $sum: "$amount" } }
}
] )
{
cust_id: "A123",
amount: 500,
status: "A",
}
{
cust_id: "A123",
amount: 250,
status: "A",
}
{
cust_id: "B212",
amount: 200,
status: "A",
}
{
cust_id: "A123",
amount: 300,
status: "D",
}
Orders
{
cust_id: "A123",
amount: 500,
status: "A",
}
{
cust_id: "A123",
amount: 250,
status: "A",
}
{
cust_id: "B212",
amount: 200,
status: "A",
}
{
id: "A123",
total: 750
}
{
id: "B212",
total: 200
}
$match $group
Aggregations

$match
$group
$facet
$geoNear
$graphLookup
Pipeline Stages
Mathematical
$add, $abs,
$substract,
$multiply, $divide,
$log, $log10,
$stdDevPop,
$stdDevSam, $avg,
$sqrt, $pow, $sum,
$zip, $convert, etc.
Array
$push, $reduce,
$reverseArray,
$addToSet,
$arrayElemAt,
$slice, etc.
Operators
$lookup
$project
$sort
$unwind
Conditionals
$and, $or, $eq, $lt,
$lte, $gt, $gte,
$cmp, $cond,
$switch, $in, etc
Date
$dateFromParts,
$dateToParts,
$dateFromString,
$dateToString,
$dayOfMonth,
$isoWeek, $minute,
$month, $year, etc.
String
$toUpper, $toLower,
$substr,
$strcasecmp,
$concat, $split, etc.
Laterals
$exp, $let, $literal,
$map, $type, etc
Aggregation Features

Developer / Data Analyst Tool
Data management and manipulation
document view
table view
Visual schema analyzer
with query builder
export to language
Aggregation pipeline builder
A good place to start

Exercise 3: Building Powerful
Aggregations
Explore data by doing some basic finds
Learn how to do aggregations you usually do in SQL
$match
$group
Learn how to do aggregations you cannot do in SQL
$objectToArray
$graphlookup
$geonear

Find or Count Specific Documents
4. Find how many listings have a real bed and are greater than
$100 but less than $200 per night by filling in the blanks
db.___.find( {
price: { ___: ___, $lt: ___ } } ,
{ ___ : "Real Bed" }).count()
db.austinListingsAndReviews.find( {
"price": { $gt: 100, $lt: 200 } } ,
{"bed_type" : "Real Bed"}).count()
Answer: 2892

Minimum price per night by number
of beds
Take a look at this example aggregation which computes minimum
price per night by number of beds
db.austinListingsAndReviews.aggregate( [
{
$group: {
_id: "$beds",
avg_by_beds: { $min: "$price" }
}
},
{ $sort: { avg_by_beds: 1 } }
] )

Average price per night by room type
Construct a very similar query to find average price per night
by room type by filling in the blanks.
{
$group: {
_id: "___",
avg_by_roomtype: { $avg: "___" }
}
}}
] )

ANSWER: Average price per night by
room type
Construct a very similar query to find average price per night
by room type by filling in the blanks.
{
$group: {
_id: "$room_type",
avg_by_roomtype: { $avg: "$price" }
}
}}
] )

Average price per night by suburb
Let’s practice some more.
This time write the query from scratch to find the average
price per night by suburb. Examine the document structure
first.
Hint: suburb is a field within a subdocument address; use dot
notation to reference it.

ANSWER: Average price per night by
suburb
{
$group: {
_id: "$address.suburb",
avg_by_suburb: { $avg: "$price" }
}
}
] )

Count and average price per night
Compute the count of properties and average price per night by
suburbs and number of bedrooms by filling in the blanks.
db.austinListingsAndReviews.aggregate([
{"$group" :
{
_id:{suburb:"$___", bedrooms:"$___"},
___:{$sum:1},
avg:{$___:"$price"}}
}
])

ANSWER: Count and average price
per night
{"$group" :
{
_id:{suburb:"$address.suburb", bedrooms:"$bedrooms"},
count:{$sum:1},
avg:{$avg:"$price"}}
}
])

Add $match Stage
Use the query from previous step and add a $match stage to only
show those properties that have a pool
{ $match : {amenities: "Pool"} },
{"$group" :
{
_id:{suburb:"$address.suburb",bedrooms:"$bedrooms"},
count:{$sum:1},
avg:{$avg:"$price"}}
}
])

Learn how to do aggregations you
cannot do in SQL
We want to determine average review scores for all properties
in Austin based on some amenities.
Remember that review_scores is a sub document. Luckily we have
$objectToArray to transform our object into an array which we
can then $unwind and $group.

Average review scores for all
properties in Austin with amenities
{ $match : {
amenities: { $in: ["Cable TV", "Garden or backyard",
"Coffee maker"] } } },
{ $project: { scores: { $objectToArray: "$review_scores" } } },
{ $unwind: "$scores" },
{ $group: { _id: "$scores.k", average_score: { $avg:
"$scores.v" } } }
])

Compare two neighbourhoods
Compare two neighbourhoods - South Lamar and West Campus - and
decide based on average review where you would rather stay.
Assume you also want a place that has pool, wifi, and
air-conditioning, under $200
Hint: You will need to look up how to use $and operator for this

ANSWER: South Lamar
{
$match: {
$and: [
{amenities: { $in: ["Pool", "Air conditioning", "Wifi"] } },
{price: { $lt: 200 } },
{"address.suburb": "South Lamar"}
]
}
},
{ $group: { _id: "$scores.k", average_score: { $avg: "$scores.v" } } }
])

ANSWER: West Campus
{
$match: {
$and: [
{amenities: { $in: ["Pool", "Air conditioning", "Wifi"] } },
{price: { $lt: 200 } },
{"address.suburb": "West Campus"}
]
}
},
{ $group: { _id: "$scores.k", average_score: { $avg: "$scores.v" } } }
])

$geoNear
Use $geoNear to find airbnbs near location of your choice (or
near Alamo Drafthouse - 30.2988474,-97.7064386) and add the
same search criteria we used in in the prior exercise to only
show airbnbs that have pool, wifi, and air-conditioning, under
$200

ANSWER: $geoNear
{
$geoNear: {
near: { type: "Point", coordinates: [ -97.7064386 , 30.2988474 ] },
distanceField: "dist.calculated",
maxDistance: 500,
query: { $and: [ {amenities: { $in: ["Pool", "Air conditioning", "Wifi"] } },
{price: { $lt: 200 } }]},
includeLocs: "dist.location",
num: 5,
spherical: true
}
}]).pretty()

$graphLookup
Let’s say we want to build a recommendation engine based on
users that reviewed the same properties.
We can use $graphLookup to run a recursive query that says:
“for a given reviewer, e.g. 7538 let’s find all listings
reviewed by users who also reviewed the same listings as 7538”

$graphLookup - example
{ $match: { "reviews.reviewer_id": "7538" } } ,{ $unwind: "$reviews" },{
$graphLookup: {
from: "austinListingsAndReviews",
startWith: "$reviews.reviewer_id",
connectFromField: "reviews.reviewer_id",
connectToField: "reviews.reviewer_id",
as: "reviewed_listings",
maxDepth: 1,
depthField: "depth",
restrictSearchWithMatch: { "reviews.reviewer_id": { $ne: "7538" } }
}},
{ $project: {
_id: 0,
listing: "$_id",
url: "$listing_url" }}]).pretty()

Now try $graphlookup yourself
Some host in Austin have multiple listings, for example,
host_id:100208662. Construct a $graphLookup query that finds
all the listings by that host.

ANSWER: $graphlookup
{ $match: { "host.host_id": "100208662" } },
{ $graphLookup:
{
from: "austinListingsAndReviews",
startWith: "$host.host_id",
connectFromField: "host.host_id",
connectToField: "host.host_id",
as: "multi-listing-hosts"
} }]).pretty()

Exploratory data analysis
Data preparation & basic manipulation
Data ingestion via JSON or CSV import
Day-to-day development/operations
Adding and understanding indexes
Adding validation rules
Authoring & troubleshooting aggregation pipelines
Viewing real-time server stats
10,000 → 1ft view of data

MongoDB Connector
for Apache Spark

Business Intelligence, Analytics, Machine Learning
Process data in MongoDB with the massive parallelism
of Spark, it's machine learning libraries, and streaming
API
● Process data “in place”, avoiding the
latency otherwise required by an
incremental ETL task.
● Reduced Operational Complexity and Faster
Time-To-Analytics
● Aggregation pre-filtering in combination with
secondary indexing means that an analytics
query only draws that data required
● Multiple Language APIs

JSON
JSON
JSON
JSON
JSON
JSON
JSON
JSON
JSON
JSON
JSON
Process data in MongoDB with the massive
parallelism of Spark, it's machine learning libraries,
and streaming API
● Process data “in place”, avoiding the latency
otherwise required by an incremental ETL task
● Aggregation pre-filtering in combination
with secondary indexing means that an
analytics query only draws that data
required
● Reads from secondaries isolate analytics
workload from business critical operations
● Shard aware for data locality

WRITE
READ
Primary
2ndary
2ndary
and streaming API
● Shard aware for data locality

and streaming API
● Multiple language APIs

Partners and Certifications
Coming soon,....

Using MongoDB Spark Connector
./spark-2.3.0/bin/spark-shell
--conf
"spark.mongodb.input.uri=mongodb+srv://demo:demo@dataanalyticsworks
hop-sshrq.mongodb.net/recommendation.ratings"
--conf
"spark.mongodb.output.uri=mongodb+srv://demo:demo@dataanalyticswork
shop-sshrq.mongodb.net/recommendation.perUser"
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.0

Exercise 4: MongoDB Spark
Connector
https://community.cloud.databricks.com/?o=2374217986360057#notebook/42327714871690/command/42327714871707
Databricks

Whats Next?
● Github: https://github.com/breinero/workshop
● M121: The MongoDB Aggregation Framework -
https://university.mongodb.com/courses/M121/about
● Videos
○ https://www.mongodb.com/presentations/world18-data-analytics-with-mongodb
○ Charts -
https://www.mongodb.com/presentations/world18-bringing-data-to-life-with-m
ongodb-charts
○ Spark -
https://www.mongodb.com/presentations/analytics-and-machine-learning-with-
spark-and-mongodb-bryan-reinero

MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB

Similar to MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB (20)

More from MongoDB

More from MongoDB (20)

Recently uploaded

Recently uploaded (20)

MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB