The Developer Data Scientist
Creating New Analytics-Driven Applications Using Azure Databricks® and Apache Spark™
About Richard
Richard Garris
● Principal Solutions Architect
● 14+ years in data management and advanced analytics
● Advises customers on their data science and advanced analytics projects
● Degrees from The Ohio State University and Carnegie Mellon University
Agenda
- Introduction to Data Science
- Data Science Lifecycle
- Data Ingestion
- Data Understanding & Exploration
- Modeling
- Integrating Machine Learning in Your Application
- End-to-End Example Use Cases
Introduction to Data Science
AI is Changing the World
What is the secret to AI?
AlphaGo, self-driving cars, Alexa
AI is Changing the World
What do these companies have in common?
Alphabet, Tesla, Amazon
Hardest Part of AI isn't AI, it's Big Data
The ML code is only a small box surrounded by everything else: configuration, data collection, data verification, feature extraction, machine resource management, analysis tools, process management tools, serving infrastructure, and monitoring.
"Hidden Technical Debt in Machine Learning Systems," Google, NIPS 2015
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex.
Business Value of Data Science
Present the Right Offer at the Right Time
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses
Data Science Lifecycle
Agile Modeling Process
Set Business Goals → Understand Your Data → Create Hypothesis → Devise Experiment → Prepare Data → Train-Tune-Test Model → Deploy Model → Measure / Evaluate Results
Data Scientists or Data Janitors?
"3 out of 5 data scientists spend 80% of their time collecting, cleaning and organizing data"
Data Understanding
Schema - understand the field names / data types
Metadata Management - understand descriptions and business meaning
Data Quality - data validation / profiling / checks
Exploration / Visualization - scatter plots, charts, correlations
Summary Statistics - average, min, max, range, median, standard deviation
Data Understanding - Visualization
Data Understanding
Descriptive Statistics: Max, Min, Mean, Standard Deviation, Median, Skewness, Kurtosis
Relationship Statistics: Pearson's correlation coefficient, Spearman's rank correlation
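These checks map directly onto the Spark DataFrame API. A minimal, hedged sketch in PySpark (the column names are hypothetical, and pyspark.ml.stat.Correlation requires Spark 2.2+):

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

# Descriptive statistics: count, mean, stddev, min, max per column
df.describe("age", "hours_per_week").show()

# Skewness, kurtosis and (approximate) median
df.select(F.skewness("age"), F.kurtosis("age"), F.expr("percentile_approx(age, 0.5)")).show()

# Relationship statistics: Pearson and Spearman correlation matrices
assembler = VectorAssembler(inputCols=["age", "hours_per_week"], outputCol="features")
vec_df = assembler.transform(df).select("features")
print(Correlation.corr(vec_df, "features", "pearson").head()[0])
print(Correlation.corr(vec_df, "features", "spearman").head()[0])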
Data Understanding
Structured
• Key Value
• Tabular (Relational, Nested)
• Graph
• Geocoded / Location
• Time Series
Unstructured
• Text (logs, tweets, articles)
• Sound / Waveform
• Sensor
• Genomic / Scientific Data
Structured Data (relational, tabular, nested)
Text (logs, tweets, articles, social)
Graph Data (connections, social)
Start with the Raw Data
Geocoded / Location Data
Time Series Data
Stock Charts
Streaming / Events
Sound and Waveform Data
Sensor Data
Images and Video
Genomic & Scientific Data
What is a Data Science Platform?
Gartner defines a Data Science Platform as "an end-to-end platform for developing and deploying models" using sophisticated statistical models, machine learning, neural networks, text analytics, and other advanced data mining techniques.
What is a Model?
A simplified and idealized representation of the real world.
What does Modeling Mean?
A Class is a Model / Model of a Building / Data Model
class Employee {
  FirstName : String
  LastName : String
  DOB : java.util.Date
  Grades : Seq[Grade]
}
Types of Models
Machine Learning Models
Statistical Models
Financial Models
Graph Models
Simulation Models
Predictive Models
Biological Models
Two Broad Categories of Models
●Supervised learning: prediction
Classification (binary or multiclass): predict a category (label)
Regression: predict a number (target)
●Unsupervised learning: discovery
Clustering: find groupings based on patterns
Density estimation: match data with a distribution pattern
Dimensionality reduction: reduce # of columns
Similarity search: find similar data
Frequent Items (or association rules): find relationships between variables
Model Category Use Cases
●Anomaly detection
Density estimation: “Is this observation uncommon?”
Similarity search: “How far is it from other observations?”
Clustering: “Are there groups of strange observations?”
●Lead scoring / recommendation
Classification: “Will this user become a buyer?”
Regression: “How much will he/she spend?”
Similarity search: “What products did similar users buy?”
A Model is a Mathematical Function
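For example (an illustration, not from the slide), a binary logistic regression model is nothing more than a function that maps a feature vector to a probability:

\hat{y} = f(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}

where the weight vector \mathbf{w} and the intercept b are the parameters the training algorithm learns from data.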
Unsupervised Methods in MLlib
Clustering
●Gaussian mixture models
●K-Means
●Streaming K-Means
●Latent Dirichlet Allocation
●Power Iteration Clustering
Frequent itemsets
●FP-growth
●Prefix span
Recommendation
●Alternating Least Squares
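As a hedged example of one method from this list, K-Means clustering via the Spark ML DataFrame API; features_df and its vector column "features" are assumptions (e.g. built with VectorAssembler):

from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=5, seed=42, featuresCol="features")
kmeans_model = kmeans.fit(features_df)
clustered = kmeans_model.transform(features_df)  # adds a "prediction" column with the cluster id
print(kmeans_model.clusterCenters())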
Supervised Methods in MLlib
Classification
Logistic regression w/ elastic net
Naive Bayes
Streaming logistic regression
Linear SVMs
Decision trees
Random forests
Gradient-boosted trees
Multilayer perceptron
One-vs-rest
DeepImagePredictor
Regression
Least squares w/ elastic net
Isotonic regression
Decision trees
Random forests
Gradient-boosted trees
Streaming linear methods
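And a hedged sketch of "least squares w/ elastic net" from the regression column, assuming a training DataFrame train_df with a vector "features" column and a numeric "label" column:

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="label",
                      regParam=0.1,          # overall regularization strength
                      elasticNetParam=0.5)   # 0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)
lr_model = lr.fit(train_df)
print(lr_model.coefficients, lr_model.intercept)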
But What is a Model Really?
A model is a complex pipeline of components
Data Sources
Joins
Featurization Logic
Algorithm(s)
Transformers
Estimators
Tuning Parameters
ML Pipelines
A very simple pipeline: Load data → Extract features → Train model → Evaluate
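Those four boxes translate almost one-to-one into a Spark ML Pipeline. A minimal sketch in Python; the file path and column names are hypothetical, and the "label" column is assumed to be numeric already:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Load data
df = spark.read.csv("/data/adult.csv", header=True, inferSchema=True)
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Extract features and train model, expressed as pipeline stages
indexer = StringIndexer(inputCol="workclass", outputCol="workclassIdx")
assembler = VectorAssembler(inputCols=["workclassIdx", "age"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[indexer, assembler, lr]).fit(train)

# Evaluate
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print("AUC =", auc)

The fitted PipelineModel carries every stage with it, which is exactly what gets persisted and reloaded in the serialization slides later on.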
ML Pipelines
A real pipeline: Datasource 1, 2 and 3 → Extract features (two branches) → Feature transforms 1, 2 and 3 → Train model 1 and Train model 2 → Ensemble → Evaluate
Integrating Machine Learning in Your Application
Productionizing Models Today
Data Science: develop prototype model using Python/R
Data Engineering: re-implement model for production (Java)
Problems with Productionizing Models
Data Science: develop prototype model using Python/R → Data Engineering: re-implement model for production (Java)
- Extra work
- Different code paths
- Data science does not translate to production
- Slow to update models
MLlib 2.X Model Serialization
Data Science: develop prototype model using Python/R, then persist the model or Pipeline:
model.save("s3n://...")
Data Engineering: load the Pipeline (Scala/Java) and deploy in production:
Model.load("s3n://…")
MLlib 2.X Model Serialization Snippet
Scala
val lrModel = lrPipeline.fit(dataset)
// Save the Model
lrModel.write.save("/models/lr")
Python
lrModel = lrPipeline.fit(dataset)
# Save the Model
lrModel.write().save("/models/lr")
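The load side of the same flow, sketched in Python (the identical saved pipeline can be loaded from Scala/Java with PipelineModel.load); new_data is assumed to have the same input schema the pipeline was trained on:

from pyspark.ml import PipelineModel

loaded = PipelineModel.load("/models/lr")
scored = loaded.transform(new_data)
scored.select("prediction", "probability").show(5)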
Model Serialization Output
Code
// List Contents of the Model Dir
dbutils.fs.ls("/models/lr")
Output
Remember this is a pipeline model and these are the stages!
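The listing itself isn't reproduced here, but based on the stage paths shown on the following slides, the saved pipeline directory looks roughly like the sketch below (each stage gets a numbered subdirectory with JSON metadata and Parquet data):

# Walk the persisted pipeline from a Databricks notebook (dbutils is provided there).
# Approximate layout:
#   /models/lr/metadata/                       <- pipeline-level metadata (JSON)
#   /models/lr/stages/00_strIdx_.../metadata/  <- per-stage params (JSON)
#   /models/lr/stages/00_strIdx_.../data/      <- per-stage model data (Parquet)
for f in dbutils.fs.ls("/models/lr/stages"):
    print(f.path)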
Transformer Stage (StringIndexer)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/00_strIdx_bb9728f85745/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/00_strIdx_bb9728f85745/data/"))
Output
Metadata and params:
{
  "class":"org.apache.spark.ml.feature.StringIndexerModel",
  "timestamp":1488120411719,
  "sparkVersion":"2.1.0",
  "uid":"strIdx_bb9728f85745",
  "paramMap":{
    "outputCol":"workclassIdx",
    "inputCol":"workclass",
    "handleInvalid":"error"
  }
}
Data (Hashmap)
Estimator Stage (LogisticRegression)
Code
// Cat the contents of the Metadata dir
dbutils.fs.head("/models/lr/stages/18_logreg_325fa760f925/metadata/part-00000")
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/lr/stages/18_logreg_325fa760f925/data/"))
Output
Model params, intercept + coefficients:
{
  "class":"org.apache.spark.ml.classification.LogisticRegressionModel",
  "timestamp":1488120446324,
  "sparkVersion":"2.1.0",
  "uid":"logreg_325fa760f925",
  "paramMap":{
    "predictionCol":"prediction",
    "standardization":true,
    "probabilityCol":"probability",
    "maxIter":100,
    "elasticNetParam":0.0,
    "family":"auto",
    "regParam":0.0,
    "threshold":0.5,
    "fitIntercept":true,
    "labelCol":"label"
  }
}
Estimator Stage (DecisionTree)
Code
// Display the Parquet File in the Data dir
display(sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/"))
// Re-save as JSON
sqlContext.read.parquet("/models/dt/stages/18_dtc_3d614bcb3ff825/data/").write.json("/models/json/dt")
Output
Decision Tree Splits
Visualize Stage (DecisionTree)
Visualization of the Tree in Databricks
Databricks + ML Pipelines: Ideal Modeling Tool
Data Science is highly iterative and agile:
● Lots of data sources
● Lots of dirty data
● Lots and lots of data
ML Pipelines and notebooks are an ideal way to experiment with new methods, data and features in order to minimize error.
Databricks Runtime
Elastic, Fully Managed, Highly Tuned Engine
FULLY MANAGED CLOUD SERVICE
• Auto-configured multi-user elastic clusters
• Reliable sharing with fault isolation and workload preemption
PERFORMANCE OPTIMIZATIONS
• Increases performance by 5x (TPC benchmark)
• Connector optimizations for cloud sources (Kafka, S3 and Kinesis)
COST OPTIMIZED / LINEAR SCALING
• 2x nodes - time cut in half
• 2x data, 2x nodes - time constant
• Cost of 10 nodes for 10 hours equals 100 nodes for 1 hour
DATABRICKS UNIFIED RUNTIME: Databricks I/O, Databricks Serverless
Databricks Collaborative Workspace
Frictionless Collaboration Enabling Faster Innovation
Secure collaboration for fast feedback loops with single-click access to clusters
DATA ENGINEER - Production Jobs: fast, reliable and secure jobs
• Executes jobs 30-50% faster
• Notebooks to production jobs with one click
• Debug faster with logs and the Spark history UI
DATA SCIENTIST - Interactive Notebooks: analyze data with notebooks
• Multi-language: SQL, R, Scala, Python
• Advanced analytics (Graph, ML & DL)
• Built-in visualization, including D3 & ggplot
BUSINESS SME - Dashboards: build dashboards
• Publish insights
• Real-time updates
• Interactive reports
Databricks' Approach to Accelerate Innovation
INCREASE PERFORMANCE - by more than 5x and reduce TCO by more than 70%
INCREASE PRODUCTIVITY - of data science teams by 4-5x
STREAMLINE ANALYTIC WORKFLOWS - reducing deployment time to minutes
REDUCE RISK - and enable innovation with out-of-the-box enterprise security and compliance
UNIFY ANALYTICS WITH APACHE SPARK - eliminating disparate tools
[Platform diagram: data scientists/analysts, business SMEs and data engineers work across an ingest → explore → model → dashboard → data product workflow (ML libraries, streaming, statistics packages, ETL, SQL) over data warehouses, cloud storage, Hadoop storage and IoT/streaming data, without the setup, break-fix and optimization burden of big data clusters. The stack: Unified Engine (SQL, Streaming, MLlib, Graph) with Open APIs, Databricks Runtime (Databricks I/O, Databricks Serverless), Databricks Optimizations and Managed Cloud Service, Databricks Enterprise Security, and the Databricks Collaborative Workspace spanning Databricks Interactive and Databricks Production.]
End-to-end Examples
Demonstrations
• Predicting Power Output for an Energy Company
• Scoring Inbound Leads
• Predicting Ratings Given Reviews from Amazon
Demonstration
Databricks Community Edition or Free Trial
<Link to Azure>
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks