This document discusses using MongoDB and R together for advanced analytics. It begins by explaining why R and MongoDB are popular tools individually and the benefits of using them together. It then covers MongoDB's aggregation framework and how to connect R to MongoDB. Two use cases are presented: genome-wide association analysis using genomic data from HapMap, and vehicle situational awareness using open data from the city of Chicago. The document concludes by discussing schema design considerations and scaling out analytics to Spark.
2. #MDBW17
LEARNING OBJECTIVES
01 Aggregation Framework
How to design your MongoDB schema and utilize the aggregation framework for data preparation and enrichment.
02 R Connectors
How to connect R to your MongoDB environment. Understand the connectors available and be able to choose the right deployment topology for production.
03 Analytical Patterns
How to recognize the analytical patterns required and apply MongoDB and R together to deliver key insights in your organization.
3. #MDBW17
DATA VS INSIGHT
big data is not valuable
insight is valuable
time-to-insight is critical
source of competitive advantage
4. Why R?
Language + Platform
• The most popular data science environment
• Wide variety of statistical and graphical techniques
• Open source / highly extensible
Community
• 2M+ users
• Taught in most universities
• Thriving user groups worldwide
Ecosystem
• 10,000+ contributed packages
• Rich application & platform integration
• Finance, Genetics, Social Sciences, Geospatial & more
5. Why MongoDB?
Community + Ecosystem
• The most popular NoSQL database
• 20M+ downloads, 3,500+ customers
• Open source
Speed
• Analytics against live operational systems
• Flexible schema to capture ALL data
• More / new / changing data types
Innovation
• Advanced capabilities beyond key-value
• Many industries and use cases
• Extreme developer productivity
6. #MDBW17
AGGREGATION FRAMEWORK
Aggregation: $match $group $sort $limit
Left Outer Join: $lookup
Geospatial: $geoNear
Text Search & Collation: $meta
Faceted Navigation: $facet $bucket
Graph Processing: $graphLookup
Map Reduce
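To make the stages concrete, here is a minimal sketch of composing $match, $group, $sort, and $limit from R via mongolite; the collection, database, and field names are hypothetical, and a local mongod is assumed.

```r
library(mongolite)

# connect to a hypothetical collection of service requests
requests <- mongo("requests", url = "mongodb://localhost:27017/demo")

# $match -> $group -> $sort -> $limit:
# count requests per type since 2017 and keep the top five
top_types <- requests$aggregate('[
  {"$match": {"year": {"$gte": 2017}}},
  {"$group": {"_id": "$type", "count": {"$sum": 1}}},
  {"$sort":  {"count": -1}},
  {"$limit": 5}
]')
```

mongolite runs the pipeline server-side and returns the result as an R data frame, so only the five grouped rows cross the wire.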
7. #MDBW17
MONGODB AND R TOGETHER: USE CASES
Multi-genre analytics
churn analysis
fraud detection
drug discovery via mining genomes at scale
sentiment analysis
geospatial analysis
customer segmentation
predictive failure & maintenance
12. #MDBW17
GENOME-WIDE ASSOCIATION STUDIES
Examination of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWASs typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases.
A Manhattan plot is a popular graphical method for visualizing results from high-dimensional data analysis such as a genome-wide association study, in which p-values, z-scores, or test statistics are plotted on a scatter plot against their genomic position. Manhattan plots are used for visualizing potential regions of interest in the genome that are associated with a phenotype.
13. #MDBW17
DATA SET: HAPMAP
HapMap3, release 2: genome-wide SNP genotyping using 1,115 DNA samples from 11 human populations and a total of 1.6M SNPs.
The p-values, z-scores, and effect sizes used are taken from relevant prostate cancer and breast cancer studies via the GWAS Catalog.
Annotation information (nearest gene and distance to nearest gene) was obtained from the UCSC genome annotation database using Bioconductor.
14. #MDBW17
GENOME-WIDE ASSOCIATION STUDY: DATA EXPLORATION
library(mongolite)
library(manhattanly)

# load data from MongoDB into R
hapmap <- mongo("hapmap", url = "mongodb://localhost:27017/gene_annotation")
# count the number of records
hapmap$count('{}')
# read all the data back into an R data frame
hapmap_data <- hapmap$find('{}')
# create a Manhattan plot for exploration
manhattanly(hapmap_data, snp = "SNP", gene = "GENE")
15. #MDBW17
GENOME-WIDE ASSOCIATION STUDY: ANNOTATION
# use Bioconductor libraries to annotate the genome:
# find the nearest gene and the distance to it
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
seqlengths(genome)
chr5_seq <- genome$chr5
# query to zoom into the chromosomal region of interest
hapmap_range <- hapmap$find('{"CHR": {"$gt": 3, "$lt": 8}}')
# export the region of interest as newline-delimited JSON
hapmap$export(file("hapmap_range.json"))
17. #MDBW17
RICH DOCUMENTS FOR GENOMIC DATA
• Fields can contain an array of sub-documents
• Typed field values
• String fields
• DNA sequence
• Region map
• Clinical significance
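As an illustration of these callouts, a single variant might be stored as a rich document along the following lines; all field names and values here are invented for the sketch, not taken from the actual HapMap collection.

```json
{
  "SNP": "rs0000000",
  "CHR": 5,
  "BP": 1234567,
  "P": 3.1e-06,
  "sequence": "ACGTACGTGTCA",
  "region_map": { "start": 1234500, "end": 1234600 },
  "clinical_significance": "benign",
  "annotations": [
    { "gene": "GENE1", "distance_bp": 0, "source": "UCSC" },
    { "gene": "GENE2", "distance_bp": 4200, "source": "UCSC" }
  ]
}
```

One document carries typed numeric fields (chromosome, position, p-value), string fields (the DNA sequence, clinical significance), an embedded region map, and an array of annotation sub-documents, so a variant and all of its annotations travel together.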
20. #MDBW17
DATA SET: CHICAGO OPEN DATA
311 Service Requests
• Abandoned Buildings
• Potholes
• Tree Trimming
• Sanitation Code Complaints
• Abandoned Vehicles
• Garbage Carts
• Tree Debris
• Street Lights Out
Transportation
• Street Closures
• Red Light Camera Violations
• Speed Camera Violations
• Transportation Department Permits
• Public Right-of-Way Use Permits
Events
• Special Events Permits
Public Building Commission
• Public Parks
21. #MDBW17
VEHICLE SITUATIONAL AWARENESS: DATA EXPLORATION
# import 311 service request data into MongoDB
mongoimport -d chicago -c street_closures

# read the data into R and create a geospatial index on the location field
library(mongolite)
mdb <- mongo("street_closures", url = "mongodb://localhost:27017/chicago")
mdb$index(add = '{"location": "2dsphere"}')
22. #MDBW17
VEHICLE SITUATIONAL AWARENESS PLOT
# calculate all geopoints for city alerts near the Hyatt Regency
# (radius is in radians: 0.0001567865 * 6378.1 km ≈ 1 km)
points <- get_points('{"location": {"$geoWithin": {"$centerSphere": [[-87.622772, 41.887694], 0.0001567865]}}}')
# plot the Hyatt Regency marker and all points on the map
hyatt_map +
  geom_point(aes(x = lons, y = lats),
             color = "red",
             alpha = 0.1,
             size = 2,
             data = points) +
  hyatt_marker
23. #MDBW17
VEHICLE SITUATIONAL AWARENESS DENSITY MAP
# use stat_density2d from ggplot2 to estimate contours from discrete samples
hyatt_map +
  geom_density2d(data = points,
                 aes(x = lons, y = lats),
                 size = 0.3) +
  stat_density2d(data = points,
                 aes(x = lons, y = lats,
                     fill = ..level.., alpha = ..level..),
                 size = 0.01,
                 bins = 16,
                 geom = "polygon") +
  scale_fill_gradient(low = "green", high = "red") +
  scale_alpha(range = c(0, 0.3), guide = FALSE)
29. #MDBW17
USING SPARK TO PARALLELIZE & SCALE R
SparkR
• distributed data frame implementation
• supports selection, filtering, and aggregation on large datasets
• supports distributed machine learning using MLlib
SparkDataFrame
• a distributed, optimized collection of data organized into named columns
• can be constructed from a wide array of sources, including MongoDB
SparkSession
• the entry point into SparkR, which connects your R program to a Spark cluster. You can use a SparkSession object to write data to MongoDB, read data from MongoDB, and perform SQL operations.

# import data into SparkR via the MongoDB Spark Connector
df <- read.df("", source = "com.mongodb.spark.sql.DefaultSource",
              database = "chicago", collection = "three_eleven")
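Putting the pieces together, a SparkR session wired to MongoDB might be configured roughly as follows; the connector package coordinates and the connection URIs are assumptions for a local setup, so check the MongoDB Spark Connector documentation for your Spark and Scala versions.

```r
library(SparkR)

# start a SparkSession that knows where to read from / write to in MongoDB
sparkR.session(sparkConfig = list(
  "spark.mongodb.input.uri"  = "mongodb://localhost:27017/chicago.three_eleven",
  "spark.mongodb.output.uri" = "mongodb://localhost:27017/chicago.three_eleven",
  "spark.jars.packages" = "org.mongodb.spark:mongo-spark-connector_2.11:2.0.0"
))

# read the collection into a SparkDataFrame and inspect its inferred schema
df <- read.df("", source = "com.mongodb.spark.sql.DefaultSource")
printSchema(df)
```

From here, SparkR verbs such as filter(), groupBy(), and agg() run as distributed jobs across the cluster rather than in a single R process.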
31. #MDBW17
BRING IT ALL TOGETHER
open standards
high adoption
iterative workflow -> fail fast
don’t throw ANY data away
scale out
WHAT WILL YOU SOLVE?
32. #MDBW17
REFERENCES & THANKS
MongoDB 3.4 - https://docs.mongodb.com/manual/
Aggregation Framework - https://docs.mongodb.com/manual/aggregation/
MongoDB Compass - https://docs.mongodb.com/compass/current/
MongoDB Spark Connector 2.0 - https://docs.mongodb.com/spark-connector/current/r-api/
Mongolite - https://github.com/jeroen/mongolite
Demo code - https://github.com/ajdavis/three-eleven-mongolite-demo/blob/master/mongolite-demo.R
RStudio - https://www.rstudio.com/
plot.ly - https://plot.ly/
manhattanly - http://sahirbhatnagar.com/manhattanly/
ggplot2 - https://cran.r-project.org/web/packages/ggplot2/
HapMap3 - ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2009-01_phaseIII/
Bioconductor - http://bioconductor.org/
City of Chicago - https://data.cityofchicago.org/
Special thanks to Jeroen Ooms and A. Jesse Jiryu Davis