Databricks Machine Learning
“Software is eating the World” -Marc Andreessen
“AI is eating software”
“Data is eating AI” -Matei Zaharia
The Hard Part about AI is Data: Software vs. AI (Software + Data)

Goal      Software: functional correctness
          AI: optimization of a metric, e.g. minimize loss
Quality   Software: depends on code
          AI: depends on data, code, model architecture, hyperparameters, random seeds, ...
Outcome   Software: works deterministically
          AI: changes due to data drift
People    Software: software engineers
          AI: software engineers, data scientists, research scientists, data engineers, ML engineers
Tooling   Software: usually standardized within a dev team; established and hardened over decades
          AI: often heterogeneous even within teams; few established standards, in constant change due to open-source innovation

Takeaways:
▪ AI depends on code AND data: the hard part about AI is data.
▪ AI requires many different roles to get involved: collaboration between software and data engineering practitioners.
▪ AI requires integrating many different components: the AI tooling landscape is a mess, a thriving ecosystem of innovation (for the VC, the researcher, the tech lead, the enterprise architect) and a procurement and DevOps nightmare.
Attributes of a Solution

▪ Data Native: AI depends on code AND data.
▪ Collaborative: AI requires many different roles to get involved.
▪ Full ML Lifecycle: AI requires integrating many different components.
Announcing: Databricks Machine Learning
A data-native and collaborative solution for the full ML lifecycle

Data Science Workspace
Data Prep | Data Versioning | Model Training | Model Tuning | Runtime and Environments | Monitoring | Batch Scoring | Online Serving
MLOps / Governance
Built on an open data lakehouse foundation
Delta Lake for Machine Learning

Your Existing Data Lake: Azure Data Lake Storage, Amazon S3, Google Cloud Storage
Structured | Semi-structured | Unstructured | Streaming
Ingestion Tables -> Refined Tables -> Aggregated Tables -> ML Runtime
IAM Passthrough | Cluster Policies | Table ACLs | Automated Jobs

▪ Optimized performance
▪ Consistent quality due to ACID transactions
▪ Tracking of data versions due to Time Travel
▪ Full lineage / governance
▪ Integration with the ML Runtime
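The Time Travel bullet is a Delta Lake capability that can be used directly for reproducible training data. A minimal sketch in PySpark, where the table path and version are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Latest state of a (hypothetical) feature table in the lake
features = spark.read.format("delta").load("/mnt/lake/aggregated/taxi_features")

# Reproduce an earlier training run by pinning the Delta table version ...
features_v12 = (
    spark.read.format("delta")
    .option("versionAsOf", 12)
    .load("/mnt/lake/aggregated/taxi_features")
)

# ... or by timestamp
features_jan = (
    spark.read.format("delta")
    .option("timestampAsOf", "2021-01-31")
    .load("/mnt/lake/aggregated/taxi_features")
)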
Data Science Workspace
Collaborative: unified platform for data teams (data engineers, data scientists, ML engineers, data analysts)

▪ Multi-language: Scala, SQL, Python, and R
▪ Cloud-native collaboration features: commenting, co-presence, co-editing
▪ Experiment tracking with MLflow integration
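A minimal sketch of the MLflow integration mentioned above (the model, parameter, and metric are illustrative): a run in a workspace notebook logs its parameters, metrics, and model so it appears in the tracked experiment.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    mlflow.log_param("n_estimators", 100)
    model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    mlflow.log_metric("mse", mse)
    mlflow.sklearn.log_model(model, "model")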
Full ML Lifecycle: From Data to Model Deployment (and back)

Data Prep | Data Versioning | Model Training | Model Tuning | Runtime and Environments | Monitoring | Batch Scoring | Online Serving

▪ Data prep designed for ML: text, images, video / audio, tabular
▪ Out-of-the-box environment for all ML frameworks
▪ Deploy anywhere at any scale
Full ML Lifecycle: MLOps for Data Teams (MLOps / Governance)

MLOps = DataOps + DevOps + ModelOps

▪ DataOps: data versioning with Time Travel
▪ DevOps: code versioning with Git integration (Repos)
▪ ModelOps: model lifecycle management with Model Registry
Full ML Lifecycle: How you know you did it right (MLOps / Governance)

Experiment Tracking: parameters, metrics, artifacts, and models, linked to
▪ data versioning
▪ code versioning (notebooks and Git)
▪ runtime and environment (clusters, runtime and libraries)
all captured in the Workspace.

Model Registry: versions (v1, v2, v3) promoted through Staging -> Production -> Archived, the hand-off point between data scientists and deployment engineers.

Model Serving: serves registered models from the Model Registry.
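A minimal sketch of that hand-off with the MLflow client (the model name and run ID are hypothetical): register a tracked model and promote it through stages.

import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged in a tracked run (run ID placeholder is hypothetical)
model_uri = "runs:/<run_id>/model"
result = mlflow.register_model(model_uri, "taxi_fare_regressor")

# Promote the new version through Staging and on to Production
client = MlflowClient()
client.transition_model_version_stage(
    name="taxi_fare_regressor",
    version=result.version,
    stage="Staging",
)
client.transition_model_version_stage(
    name="taxi_fare_regressor",
    version=result.version,
    stage="Production",
)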
Announcing: Feature Store
The first Feature Store co-designed with a Data and MLOps Platform
▪ Batch (high throughput)
▪ Real time (low latency)
Announcing: Databricks AutoML
A glass-box approach to AutoML that empowers data teams without taking away control
Feature Store Deep Dive
First things first: What is a feature?
On the example of a recommendation system

Raw data:
▪ Users table: zip code, payment methods, etc.
▪ Items table: description, category, etc.
▪ Purchases: user ID, item ID, date, quantity, price

Outcome: P(purchase|user) for each item, e.g. 0.58, 0.13, 0.12, 0.01

Raw data -> Features -> ML Model -> Prediction

Types of features:
▪ Transformations, e.g. category encoding
▪ Context features, e.g. weekday
▪ Feature augmentation, e.g. weather
▪ Pre-computed features, e.g. purchases in the last 7, 14, 21 days
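A minimal sketch of the last type, a pre-computed feature (the purchases table and column names are hypothetical): a rolling 7-day purchase count per user in PySpark.

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical purchases table: user_id, item_id, ts (timestamp), quantity, price
purchases = spark.table("recsys.purchases")

# Purchases per user over the 7 days preceding each purchase event
w7 = (
    Window.partitionBy("user_id")
    .orderBy(F.col("ts").cast("long"))
    .rangeBetween(-7 * 86400, 0)
)

features = purchases.withColumn("purchases_last_7d", F.count("*").over(w7))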
A day (or 6 months) in the life of an ML model

Raw Data -> Featurization (joins, aggregates, transforms, etc.) -> csv -> Training
Raw Data -> Featurization (joins, aggregates, transforms, etc.) -> Serving -> Client

The training-time and serving-time featurization need to be equivalent, and this ad-hoc pattern creates two problems:
▪ No reuse of features
▪ Online / offline skew
Solving the Feature Store Problem

Raw Data -> Featurization (joins, aggregates, transforms, etc.) -> Feature Store -> Training and Serving -> Client

The Feature Store addresses both problems above (no reuse of features, online / offline skew) and has two parts:

Feature Registry
▪ Discoverability and reusability
▪ Versioning
▪ Upstream and downstream lineage
Co-designed with Delta Lake:
▪ Open format
▪ Built-in data versioning and governance
▪ Native access through PySpark, SQL, etc.

Feature Provider
▪ Batch (high throughput) and online (low latency) access to features
▪ Feature lookup packaged with models
▪ Simplified deployment process
Co-designed with MLflow:
▪ Open model format that supports all ML frameworks
▪ Feature version and lookup logic hermetically logged with the model
Feature Registry: Creating a Feature Table

# register feature table
@feature_store.feature_table
def pickup_features_fn(df):
    # feature transformations
    return pickupzip_features

fs.create_feature_table(
    name="taxi_demo_features.pickup",
    keys=["zip", "ts"],
    features_df=pickup_features_fn(df),
    partition_columns="yyyy_mm",
    description="Taxi fare prediction. Pickup features",
)

Upstream lineage: feature discovery based on data sources.
Downstream lineage: all consumers of a specific feature (models, endpoints, jobs, notebooks).
Feature Provider: Batch Access to Features
# create training set from feature store
training_set = fs.create_training_set(
taxi_data,
feature_lookups = pickup_feature_lookups + dropoff_feature_lookups,
label = "fare_amount",
exclude_columns = ["rounded_pickup_datetime", "rounded_dropoff_datetime"]
)
Feature Store
Feature Registry
Feature
Provider
Batch (high throughput)
Online (low latency)
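A minimal sketch of what typically follows, assuming the Databricks Feature Store client API (training_set.load_df and fs.log_model are assumptions not shown on the slide, and the model name is hypothetical): train on the materialized training set and log the model so its feature lookups travel with it.

import mlflow.sklearn
from sklearn.ensemble import GradientBoostingRegressor

# Materialize the training DataFrame with the looked-up features joined in (assumed API)
training_df = training_set.load_df().toPandas()
X = training_df.drop(columns=["fare_amount"])
y = training_df["fare_amount"]

model = GradientBoostingRegressor().fit(X, y)

# Log the model so the feature lookup logic is packaged with it (assumed API)
fs.log_model(
    model,
    artifact_path="model",
    flavor=mlflow.sklearn,
    training_set=training_set,
    registered_model_name="taxi_fare_model",
)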
Feature Provider: Online Access to Features

# publish feature table to online store
fs.publish_table("taxi_demo_features.pickup", online_store_spec)

# code to get online features and call the model
# not necessary :)
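The same packaging pays off for batch scoring: because the lookups were logged with the model, scoring needs only the raw keys. A minimal sketch, assuming the client exposes a batch scoring helper (fs.score_batch is an assumption, and the model URI is hypothetical):

# Feature lookups happen automatically based on what was logged with the model;
# the input DataFrame only needs the lookup keys and any raw columns.
predictions = fs.score_batch(
    "models:/taxi_fare_model/Production",
    new_taxi_data,
)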
“The Databricks Feature Store is the missing piece to our unified ML platform. It creates a marketplace for features, enabling us to quickly develop and deploy new models from existing features.”
-- Jorg Klein, ABN Amro
AutoML Deep Dive
Problem Statement: AutoML is an opaque box

Persona                   Goal
Citizen Data Scientist    No-code / full automation
Engineer                  Low-code / augmentation
ML Expert / Researcher    Code / flexibility and performance
Databricks AutoML
Configure -> Augment -> Train and Evaluate -> Deploy

Solution: “Glass Box” AutoML
Notebook source:
databricks.automl.classify(df, target_col='label', timeout_minutes=60)

“Databricks’ AutoML greatly improved our time to market for our category personalisation model with ready-to-use code for quick iteration and we were able to outperform our previous model by 2-3% on the same dataset.”
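A minimal sketch of how the glass-box call above is typically used (the summary attributes shown, such as best_trial and notebook_url, are assumptions about the databricks.automl return value, and the column name is illustrative):

import databricks.automl

# Kick off an AutoML classification experiment on a Spark or pandas DataFrame
summary = databricks.automl.classify(df, target_col="label", timeout_minutes=60)

# Inspect the best trial; each trial is backed by generated, editable notebook code
best = summary.best_trial
print(best.metrics)        # validation metrics for the best run (assumed attribute)
print(best.notebook_url)   # open and modify the generated training notebook (assumed attribute)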
Demo
Predict crypto mining attacks in real time using Databricks Machine Learning

Security team: “Quarterly budget??”
Data science team: “No.
▪ 2 days for viability
▪ 2 weeks for proof-of-concept”
Wrap-Up

Databricks Machine Learning: a data-native and collaborative solution for the full ML lifecycle
▪ Data Science Workspace
▪ AutoML and Feature Store (batch high-throughput and real-time low-latency access)
▪ Data Prep, Data Versioning, Model Training, Model Tuning, Runtime and Environments, Monitoring, Batch Scoring, Online Serving
▪ MLOps / Governance
▪ Built on an open data lakehouse foundation
▪ Persona-based Navigation: purpose-built surfaces for data teams
▪ ML Dashboard: all ML-related assets and resources in one place
Customer Success with Databricks Machine Learning

▪ “... improved accuracy of vehicle pricing, automated model updates and their frequency ...”
▪ “... increased revenue by personalizing user experience ...”
▪ “... improved developer productivity by enabling parallel training of models for different countries, types of articles, and time periods ...”
databricks.com/ml