Databricks Machine Learning
“Software is eating the World” -Marc Andreessen
“AI is eating software”
“Data is eating AI” -Matei Zaharia
The Hard Part about AI is Data: Software vs. AI (Software + Data)

Goal      Software: functional correctness
          AI: optimization of a metric, e.g. minimize loss
Quality   Software: depends on code
          AI: depends on data, code, model architecture, hyperparameters, random seeds, ...
Outcome   Software: works deterministically
          AI: changes due to data drift
People    Software: software engineers
          AI: software engineers, data scientists, research scientists, data engineers, ML engineers
Tooling   Software: usually standardized within a dev team; established and hardened over decades
          AI: often heterogeneous even within teams; few established standards, in constant change due to open-source innovation

Takeaways:
▪ AI depends on code AND data: the hard part about AI is data.
▪ AI requires many different roles to get involved: collaboration between software and data engineering practitioners.
▪ AI requires integrating many different components: the AI tooling landscape is a mess, a thriving ecosystem of innovation (for the VC, the researcher, the tech lead, the enterprise architect) and a procurement and DevOps nightmare.
Attributes of a Solution

▪ Data Native: AI depends on code AND data.
▪ Collaborative: AI requires many different roles to get involved.
▪ Full ML Lifecycle: AI requires integrating many different components.
Announcing: Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle

Data Science Workspace
Data Prep | Data Versioning | Model Training | Model Tuning | Runtime and Environments | Monitoring | Batch Scoring | Online Serving
MLOps / Governance
Built on an open data lakehouse foundation
Delta Lake for Machine Learning

Your Existing Data Lake: Azure Data Lake Storage, Amazon S3, Google Cloud Storage
Structured | Semi-structured | Unstructured | Streaming
Ingestion Tables -> Refined Tables -> Aggregated Tables -> ML Runtime
IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs

▪ Optimized performance
▪ Consistent quality due to ACID transactions
▪ Tracking of data versions due to Time Travel
▪ Full lineage / governance
▪ Integration with the ML Runtime
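The Time Travel bullet is a Delta Lake capability that can be used directly for reproducible training data. A minimal sketch in PySpark, where the table path and version are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latest state of a (hypothetical) feature table in the lake
features = spark.read.format("delta").load("/mnt/lake/aggregated/taxi_features")

# Reproduce an earlier training run by pinning the Delta table version ...
features_v12 = (
    spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("/mnt/lake/aggregated/taxi_features")
)

# ... or by timestamp
features_jan = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-01-31")
    .load("/mnt/lake/aggregated/taxi_features")
)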
Data Science Workspace
Collaborative: unified platform for data teams (data engineers, data scientists, ML engineers, data analysts)

▪ Multi-language: Scala, SQL, Python, and R
▪ Cloud-native collaboration features: commenting, co-presence, co-editing
▪ Experiment tracking with MLflow integration
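A minimal sketch of the MLflow integration mentioned above (the model, parameter, and metric are illustrative): a run in a workspace notebook logs its parameters, metrics, and model so it appears in the tracked experiment.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    mlflow.log_param("n_estimators", 100)
    model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")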
Full ML Lifecycle: From Data to Model Deployment (and back)

Data Prep | Data Versioning | Model Training | Model Tuning | Runtime and Environments | Monitoring | Batch Scoring | Online Serving

▪ Data prep designed for ML: text, images, video / audio, tabular
▪ Out-of-the-box environment for all ML frameworks
▪ Deploy anywhere at any scale
Full ML Lifecycle: MLOps for Data Teams (MLOps / Governance)

MLOps = DataOps + DevOps + ModelOps

▪ DataOps: data versioning with Time Travel
▪ DevOps: code versioning with Git integration (Repos)
▪ ModelOps: model lifecycle management with Model Registry
Full ML Lifecycle: How you know you did it right (MLOps / Governance)

Experiment Tracking: parameters, metrics, artifacts, and models, linked to
▪ data versioning
▪ code versioning (notebooks and Git)
▪ runtime and environment (clusters, runtime and libraries)
all captured in the Workspace.

Model Registry: versions (v1, v2, v3) promoted through Staging -> Production -> Archived, the hand-off point between data scientists and deployment engineers.

Model Serving: serves registered models from the Model Registry.
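A minimal sketch of that hand-off with the MLflow client (the model name and run ID are hypothetical): register a tracked model and promote it through stages.

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged in a tracked run (run ID placeholder is hypothetical)
model_uri = "runs:/<run_id>/model"
result = mlflow.register_model(model_uri, "taxi_fare_regressor")

# Promote the new version through Staging and on to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="taxi_fare_regressor",
    version=result.version,
    stage="Staging",
)
client.transition_model_version_stage(
    name="taxi_fare_regressor",
    version=result.version,
    stage="Production",
)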
Announcing: Feature Store
The first Feature Store co-designed with a Data and MLOps Platform
▪ Batch (high throughput)
▪ Real time (low latency)
Announcing: Databricks AutoML
A glass-box approach to AutoML that empowers data teams without taking away control
Feature Store Deep Dive
First things first: What is a feature?
On the example of a recommendation system

Raw data:
▪ Users table: zip code, payment methods, etc.
▪ Items table: description, category, etc.
▪ Purchases: user ID, item ID, date, quantity, price

Outcome: P(purchase|user) for each item, e.g. 0.58, 0.13, 0.12, 0.01

Raw data -> Features -> ML Model -> Prediction

Types of features:
▪ Transformations, e.g. category encoding
▪ Context features, e.g. weekday
▪ Feature augmentation, e.g. weather
▪ Pre-computed features, e.g. purchases in the last 7, 14, 21 days
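A minimal sketch of the last type, a pre-computed feature (the purchases table and column names are hypothetical): a rolling 7-day purchase count per user in PySpark.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical purchases table: user_id, item_id, ts (timestamp), quantity, price
purchases = spark.table("recsys.purchases")

# Purchases per user over the 7 days preceding each purchase event
w7 = (
    Window.partitionBy("user_id")
    .orderBy(F.col("ts").cast("long"))
    .rangeBetween(-7 * 86400, 0)
)

features = purchases.withColumn("purchases_last_7d", F.count("*").over(w7))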
A day (or 6 months) in the life of an ML model

Raw Data -> Featurization (joins, aggregates, transforms, etc.) -> csv -> Training
Raw Data -> Featurization (joins, aggregates, transforms, etc.) -> Serving -> Client

The training-time and serving-time featurization need to be equivalent, and this ad-hoc pattern creates two problems:
▪ No reuse of features
▪ Online / offline skew
Solving the Feature Store Problem

Raw Data -> Featurization (joins, aggregates, transforms, etc.) -> Feature Store -> Training and Serving -> Client

The Feature Store addresses both problems above (no reuse of features, online / offline skew) and has two parts:

Feature Registry
▪ Discoverability and reusability
▪ Versioning
▪ Upstream and downstream lineage
Co-designed with Delta Lake:
▪ Open format
▪ Built-in data versioning and governance
▪ Native access through PySpark, SQL, etc.

Feature Provider
▪ Batch (high throughput) and online (low latency) access to features
▪ Feature lookup packaged with models
▪ Simplified deployment process
Co-designed with MLflow:
▪ Open model format that supports all ML frameworks
▪ Feature version and lookup logic hermetically logged with the model
Feature Registry: Creating a Feature Table

# register feature table
@feature_store.feature_table
def pickup_features_fn(df):
    # feature transformations
    return pickupzip_features

fs.create_feature_table(
    name="taxi_demo_features.pickup",
    keys=["zip", "ts"],
    features_df=pickup_features_fn(df),
    partition_columns="yyyy_mm",
    description="Taxi fare prediction. Pickup features",
)

Upstream lineage: feature discovery based on data sources.
Downstream lineage: all consumers of a specific feature (models, endpoints, jobs, notebooks).
Feature Provider: Batch Access to Features
# create training set from feature store
training_set = fs.create_training_set(
taxi_data,
feature_lookups = pickup_feature_lookups + dropoff_feature_lookups,
label = "fare_amount",
exclude_columns = ["rounded_pickup_datetime", "rounded_dropoff_datetime"]
)
Feature Store
Feature Registry
Feature
Provider
Batch (high throughput)
Online (low latency)
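A minimal sketch of what typically follows, assuming the Databricks Feature Store client API (training_set.load_df and fs.log_model are assumptions not shown on the slide, and the model name is hypothetical): train on the materialized training set and log the model so its feature lookups travel with it.

import mlflow.sklearn
from sklearn.ensemble import GradientBoostingRegressor

# Materialize the training DataFrame with the looked-up features joined in (assumed API)
training_df = training_set.load_df().toPandas()
X = training_df.drop(columns=["fare_amount"])
y = training_df["fare_amount"]

model = GradientBoostingRegressor().fit(X, y)

# Log the model so the feature lookup logic is packaged with it (assumed API)
fs.log_model(
    model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="taxi_fare_model",
)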
Feature Provider: Online Access to Features

# publish feature table to online store
fs.publish_table("taxi_demo_features.pickup", online_store_spec)

# code to get online features and call the model
# not necessary :)
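The same packaging pays off for batch scoring: because the lookups were logged with the model, scoring needs only the raw keys. A minimal sketch, assuming the client exposes a batch scoring helper (fs.score_batch is an assumption, and the model URI is hypothetical):

# Feature lookups happen automatically based on what was logged with the model;
# the input DataFrame only needs the lookup keys and any raw columns.
predictions = fs.score_batch(
    "models:/taxi_fare_model/Production",
    new_taxi_data,
)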
“The Databricks Feature Store is the missing piece to our unified ML platform. It creates a marketplace for features, enabling us to quickly develop and deploy new models from existing features.”
-- Jorg Klein, ABN Amro
AutoML Deep Dive
Problem Statement: AutoML is an opaque box

Persona                   Goal
Citizen Data Scientist    No-code / full automation
Engineer                  Low-code / augmentation
ML Expert / Researcher    Code / flexibility and performance
Databricks AutoML
Configure -> Augment -> Train and Evaluate -> Deploy

Solution: “Glass Box” AutoML
Notebook source:
databricks.automl.classify(df, target_col='label', timeout_minutes=60)

“Databricks’ AutoML greatly improved our time to market for our category personalisation model with ready-to-use code for quick iteration and we were able to outperform our previous model by 2-3% on the same dataset.”
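A minimal sketch of how the glass-box call above is typically used (the summary attributes shown, such as best_trial and notebook_url, are assumptions about the databricks.automl return value, and the column name is illustrative):

import databricks.automl

# Kick off an AutoML classification experiment on a Spark or pandas DataFrame
summary = databricks.automl.classify(df, target_col="label", timeout_minutes=60)

# Inspect the best trial; each trial is backed by generated, editable notebook code
best = summary.best_trial
print(best.metrics)        # validation metrics for the best run (assumed attribute)
print(best.notebook_url)   # open and modify the generated training notebook (assumed attribute)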
Demo
Predict crypto mining attacks in real time using Databricks Machine Learning

Security team: “Quarterly budget??”
Data science team: “No.
▪ 2 days for viability
▪ 2 weeks for proof-of-concept”
Wrap-Up

Databricks Machine Learning: a data-native and collaborative solution for the full ML lifecycle
▪ Data Science Workspace
▪ AutoML and Feature Store (batch high-throughput and real-time low-latency access)
▪ Data Prep, Data Versioning, Model Training, Model Tuning, Runtime and Environments, Monitoring, Batch Scoring, Online Serving
▪ MLOps / Governance
▪ Built on an open data lakehouse foundation
▪ Persona-based Navigation: purpose-built surfaces for data teams
▪ ML Dashboard: all ML-related assets and resources in one place
Customer Success with Databricks Machine Learning

▪ “... improved accuracy of vehicle pricing, automated model updates and their frequency ...”
▪ “... increased revenue by personalizing user experience ...”
▪ “... improved developer productivity by enabling parallel training of models for different countries, types of articles, and time periods ...”
databricks.com/ml