The Developer Data Scientist
Creating New Analytics-Driven Applications Using Azure Databricks® and Apache Spark™
About Richard
Richard Garris
● Principal Solutions Architect
● 14+ years in data management and advanced analytics
● Advises customers on their data science and advanced analytics projects
● Degrees from The Ohio State University and Carnegie Mellon University
Agenda
- Introduction to Data Science
- Data Science Lifecycle
- Data Ingestion
- Data Understanding & Exploration
- Modeling
- Integrating Machine Learning in Your Application
- End-to-End Example Use Cases
Introduction to Data Science
AI is Changing the World
What is the secret to AI?
AlphaGo, self-driving cars, Alexa
AI is Changing the World
What do these companies have in common?
Alphabet, Tesla, Amazon
Hardest Part of AI isn't AI, it's Big Data
The ML code is only a small box surrounded by everything else: configuration, data collection, data verification, feature extraction, machine resource management, analysis tools, process management tools, serving infrastructure, and monitoring.
"Hidden Technical Debt in Machine Learning Systems," Google, NIPS 2015
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.
Business Value of Data Science
Present the Right Offer at the Right Time
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
Data Science Lifecycle
Agile Modeling Process
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
Data Scientists or Data Janitors?
"3 out of 5 data scientists spend 80% of their time collecting, cleaning and organizing data"
Data Understanding
Schema - understand the field names / data types
Metadata Management - understand descriptions and business meaning
Data Quality - data validation / profiling / checks
Exploration / Visualization - scatter plots, charts, correlations
Summary Statistics - average, min, max, range, median, standard deviation
Data Understanding - Visualization
Data Understanding
Descriptive Statistics: Max, Min, Mean, Standard Deviation, Median, Skewness, Kurtosis
Relationship Statistics: Pearson's correlation coefficient, Spearman's rank correlation
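These checks map directly onto the Spark DataFrame API. A minimal, hedged sketch in PySpark (the column names are hypothetical, and pyspark.ml.stat.Correlation requires Spark 2.2+):

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Descriptive statistics: count, mean, stddev, min, max per column
df.describe("age", "hours_per_week").show()

# Skewness, kurtosis and (approximate) median
df.select(F.skewness("age"), F.kurtosis("age"), F.expr("percentile_approx(age, 0.5)")).show()

# Relationship statistics: Pearson and Spearman correlation matrices
assembler = VectorAssembler(inputCols=["age", "hours_per_week"], outputCol="features")
vec_df = assembler.transform(df).select("features")
print(Correlation.corr(vec_df, "features", "pearson").head()[0])
print(Correlation.corr(vec_df, "features", "spearman").head()[0])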
Data Understanding
Structured
• Key Value
• Tabular (Relational, Nested)
• Graph
• Geocoded / Location
• Time Series
Unstructured
• Text (logs, tweets, articles)
• Sound / Waveform
• Sensor
• Genomic / Scientific Data
Structured Data (relational, tabular, nested)
Text (logs, tweets, articles, social)
Graph Data (connections, social)
Start with the Raw Data
Geocoded / Location Data
Time Series Data
Stock Charts
Streaming / Events
Sound and Waveform Data
Sensor Data
Images and Video
Genomic & Scientific Data
What is a Data Science Platform?
Gartner defines a Data Science Platform as "an end-to-end platform for developing and deploying models" using sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques.
What is a Model?
A simplified and idealized representation of the real world.
What does Modeling Mean?
A Class is a Model / Model of a Building / Data Model
class Employee {
  FirstName : String
  LastName : String
  DOB : java.util.Date
  Grades : Seq[Grade]
}
Types of Models
Machine Learning Models
Statistical Models
Financial Models
Graph Models
Simulation Models
Predictive Models
Biological Models
Two Broad Categories of Models
●Supervised learning: prediction
Classification (binary or multiclass): predict a category (label)
Regression: predict a number (target)
●Unsupervised learning: discovery
Clustering: find groupings based on patterns
Density estimation: match data with a distribution pattern
Dimensionality reduction: reduce # of columns
Similarity search: find similar data
Frequent Items (or association rules): find relationships between variables
Model Category Use Cases
●Anomaly detection
Density estimation: “Is this observation uncommon?”
Similarity search: “How far is it from other observations?”
Clustering: “Are there groups of strange observations?”
●Lead scoring / recommendation
Classification: “Will this user become a buyer?”
Regression: “How much will he/she spend?”
Similarity search: “What products did similar users buy?”
A Model is a Mathematical Function
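For example (an illustration, not from the slide), a binary logistic regression model is nothing more than a function that maps a feature vector to a probability:

\hat{y} = f(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}

where the weight vector \mathbf{w} and the intercept b are the parameters the training algorithm learns from data.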
Unsupervised Methods in MLlib
Clustering
●Gaussian mixture models
●K-Means
●Streaming K-Means
●Latent Dirichlet Allocation
●Power Iteration Clustering
Frequent itemsets
●FP-growth
●Prefix span
Recommendation
●Alternating Least Squares
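As a hedged example of one method from this list, K-Means clustering via the Spark ML DataFrame API; features_df and its vector column "features" are assumptions (e.g. built with VectorAssembler):

from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=5, seed=42, featuresCol="features")
kmeans_model = kmeans.fit(features_df)
clustered = kmeans_model.transform(features_df)  # adds a "prediction" column with the cluster id
print(kmeans_model.clusterCenters())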
Supervised Methods in MLlib
Classification
Logistic regression w/ elastic net
Naive Bayes
Streaming logistic regression
Linear SVMs
Decision trees
Random forests
Gradient-boosted trees
Multilayer perceptron
One-vs-rest
DeepImagePredictor
Regression
Least squares w/ elastic net
Isotonic regression
Decision trees
Random forests
Gradient-boosted trees
Streaming linear methods
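And a hedged sketch of "least squares w/ elastic net" from the regression column, assuming a training DataFrame train_df with a vector "features" column and a numeric "label" column:

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="label",
                      regParam=0.1,          # overall regularization strength
                      elasticNetParam=0.5)   # 0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)
lr_model = lr.fit(train_df)
print(lr_model.coefficients, lr_model.intercept)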
But What is a Model Really?
A model is a complex pipeline of components
Data Sources
Joins
Featurization Logic
Algorithm(s)
Transformers
Estimators
Tuning Parameters
ML Pipelines
A very simple pipeline: Load data → Extract features → Train model → Evaluate
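Those four boxes translate almost one-to-one into a Spark ML Pipeline. A minimal sketch in Python; the file path and column names are hypothetical, and the "label" column is assumed to be numeric already:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Load data
df = spark.read.csv("/data/adult.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Extract features and train model, expressed as pipeline stages
indexer = StringIndexer(inputCol="workclass", outputCol="workclassIdx")
assembler = VectorAssembler(inputCols=["workclassIdx", "age"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[indexer, assembler, lr]).fit(train)

# Evaluate
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print("AUC =", auc)

The fitted PipelineModel carries every stage with it, which is exactly what gets persisted and reloaded in the serialization slides later on.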
ML Pipelines
A real pipeline: Datasource 1, 2 and 3 → Extract features (two branches) → Feature transforms 1, 2 and 3 → Train model 1 and Train model 2 → Ensemble → Evaluate
Integrating Machine Learning in Your Application
Productionizing Models Today
Data Science: develop prototype model using Python/R
Data Engineering: re-implement model for production (Java)
Problems with Productionizing Models
Data Science: develop prototype model using Python/R → Data Engineering: re-implement model for production (Java)
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models
MLlib 2.X Model Serialization
Data Science: develop prototype model using Python/R, then persist the model or Pipeline:
model.save("s3n://...")
Data Engineering: load the Pipeline (Scala/Java) and deploy in production:
Model.load("s3n://…")
MLlib 2.X Model Serialization Snippet
Scala
val lrModel = lrPipeline.fit(dataset)
// Save the Model
lrModel.write.save("/models/lr")
Python
lrModel = lrPipeline.fit(dataset)
# Save the Model
lrModel.write().save("/models/lr")
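The load side of the same flow, sketched in Python (the identical saved pipeline can be loaded from Scala/Java with PipelineModel.load); new_data is assumed to have the same input schema the pipeline was trained on:

from pyspark.ml import PipelineModel

loaded = PipelineModel.load("/models/lr")
scored = loaded.transform(new_data)
scored.select("prediction", "probability").show(5)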
Model Serialization Output
Code
// List Contents of the Model Dir
dbutils.fs.ls("/models/lr")
Output
Remember this is a pipeline model and these are the stages!
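The listing itself isn't reproduced here, but based on the stage paths shown on the following slides, the saved pipeline directory looks roughly like the sketch below (each stage gets a numbered subdirectory with JSON metadata and Parquet data):

# Walk the persisted pipeline from a Databricks notebook (dbutils is provided there).
# Approximate layout:
#   /models/lr/metadata/                       <- pipeline-level metadata (JSON)
#   /models/lr/stages/00_strIdx_.../metadata/  <- per-stage params (JSON)
#   /models/lr/stages/00_strIdx_.../data/      <- per-stage model data (Parquet)
for f in dbutils.fs.ls("/models/lr/stages"):
    print(f.path)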
Transformer Stage (StringIndexer)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
Output
Metadata and params:
{
  "class":"org.apache.spark.ml.feature.StringIndexerModel",
  "timestamp":1488120411719,
  "sparkVersion":"2.1.0",
  "uid":"strIdx_bb9728f85745",
  "paramMap":{
    "outputCol":"workclassIdx",
    "inputCol":"workclass",
    "handleInvalid":"error"
  }
}
Data (Hashmap)
Estimator Stage (LogisticRegression)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
Output
Model params, intercept + coefficients:
{
  "class":"org.apache.spark.ml.classification.LogisticRegressionModel",
  "timestamp":1488120446324,
  "sparkVersion":"2.1.0",
  "uid":"logreg_325fa760f925",
  "paramMap":{
    "predictionCol":"prediction",
    "standardization":true,
    "probabilityCol":"probability",
    "maxIter":100,
    "elasticNetParam":0.0,
    "family":"auto",
    "regParam":0.0,
    "threshold":0.5,
    "fitIntercept":true,
    "labelCol":"label"
  }
}
Estimator Stage (DecisionTree)
Code
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
Output
Decision Tree Splits
Visualize Stage (DecisionTree)
Visualization of the Tree in Databricks
Databricks + ML Pipelines: Ideal Modeling Tool
Data Science is highly iterative and agile:
● Lots of data sources
● Lots of dirty data
● Lots and lots of data
ML Pipelines and notebooks are an ideal way to experiment with new methods, data and features in order to minimize error.
Databricks Runtime
Elastic, Fully Managed, Highly Tuned Engine
FULLY MANAGED CLOUD SERVICE
• Auto-configured multi-user elastic clusters
• Reliable sharing with fault isolation and workload preemption
PERFORMANCE OPTIMIZATIONS
• Increases performance by 5x (TPC benchmark)
• Connector optimizations for cloud sources (Kafka, S3 and Kinesis)
COST OPTIMIZED / LINEAR SCALING
• 2x nodes - time cut in half
• 2x data, 2x nodes - time constant
• Cost of 10 nodes for 10 hours equals 100 nodes for 1 hour
DATABRICKS UNIFIED RUNTIME: Databricks I/O, Databricks Serverless
Databricks Collaborative Workspace
Frictionless Collaboration Enabling Faster Innovation
Secure collaboration for fast feedback loops with single-click access to clusters
DATA ENGINEER - Production Jobs: fast, reliable and secure jobs
• Executes jobs 30-50% faster
• Notebooks to production jobs with one click
• Debug faster with logs and the Spark history UI
DATA SCIENTIST - Interactive Notebooks: analyze data with notebooks
• Multi-language: SQL, R, Scala, Python
• Advanced analytics (Graph, ML & DL)
• Built-in visualization, including D3 & ggplot
BUSINESS SME - Dashboards: build dashboards
• Publish insights
• Real-time updates
• Interactive reports
Databricks' Approach to Accelerate Innovation
INCREASE PERFORMANCE - by more than 5x and reduce TCO by more than 70%
INCREASE PRODUCTIVITY - of data science teams by 4-5x
STREAMLINE ANALYTIC WORKFLOWS - reducing deployment time to minutes
REDUCE RISK - and enable innovation with out-of-the-box enterprise security and compliance
UNIFY ANALYTICS WITH APACHE SPARK - eliminating disparate tools
[Platform diagram: data scientists/analysts, business SMEs and data engineers work across an ingest → explore → model → dashboard → data product workflow (ML libraries, streaming, statistics packages, ETL, SQL) over data warehouses, cloud storage, Hadoop storage and IoT/streaming data, without the setup, break-fix and optimization burden of big data clusters. The stack: Unified Engine (SQL, Streaming, MLlib, Graph) with Open APIs, Databricks Runtime (Databricks I/O, Databricks Serverless), Databricks Optimizations and Managed Cloud Service, Databricks Enterprise Security, and the Databricks Collaborative Workspace spanning Databricks Interactive and Databricks Production.]
End-to-end Examples
Demonstrations
• Predicting Power Output for an Energy Company
• Scoring Inbound Leads
• Predicting Ratings Given Reviews from Amazon
Demonstration
Databricks Community Edition or Free Trial
<Link to Azure>
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks