_Python Ireland Meetup - Serverless ML - Dowling.pdf

Python Ireland Meetup, Sep 14th 2022
Jim Dowling, CEO @ Hopsworks and Assoc Prof @ KTH
Serverless ML in Python
Predict surf height at Lahinch Beach

Beyond Notebooks: Don’t just train models, build “Prediction Services”
❌ Static Datasets
❌ Data is downloaded from a single URL
❌ Features for ML are engineered, correct, and unbiased
❌ Use a model evaluation metric (accuracy) to communicate the value of your model
💚 Data never stops coming
💚 Data comes from different heterogeneous data sources
💚 Write code to extract and validate the features from input data
💚 Communicate the value of your model with a UI or app/service
💚 Build and deploy a reliable service around your model with MLOps

Serverless ML “Prediction Service”
Once or Twice/day
Features Pipelines & Batch Prediction Pipelines
HOPSWORKS.AI
Features
Twice/day Predictions
Github Pages UI
Publish to UI
train model
https://github.com/jimdowling/cjsurf
Models

Serverless Python Functions
● render.com
● pythonanywhere.com
● replit.com
● deta.sh
● linode.com
● hetzner.com
● digitalocean.com
● AWS lambda functions
● Google Cloud Functions
Orchestration Platforms
● Astronomer (Airflow)
● Dagster
● Prefect
● Azure Data Factory
● Amazon Managed Workflows for Apache Airflow (MWAA)
● Google Cloud Composer
● Databricks Workflows
Alternatives to Github Actions for Serverless Python

(Good) Bombs going off at Mullaghmore, Ireland

What height will the surf be at Lahinch this weekend?
When I lived in Dublin, I always wanted to know what I would do the next weekend…
surf's up?
No Yes

We built a system called CJSurf to predict surf at Lahinch
Open Ocean Swell Predictions Lahinch Beach Surf Height Predictions

Swells/Waves have (1) height, (2) period, (3) direction
Height
Period is the time between waves
Direction
Wave height at the point is 4 times higher than wave height at the beach

https://polar.ncep.noaa.gov/waves/WEB/gfswave.latest_run/plots/gfswave.62108.bull
Swell Predictions by NOAA Buoys with height, period, direction

Accurate Surf Height Observations by Lahinch Surf Shop
Reports are published at 10am every day by
https://www.lahinchsurfshop.com/

Can I rewrite CJSurf from 2004 with free managed services?
Can we rewrite a LAMP architecture to a free serverless Python architecture in 2022?
Java Data Collector
& K-NN Predictions.
CronJob.
Php Web App
MySQL
lahinchsurfshop.com noaa.gov (62081, 62105)
Production Machine Learning in 2004!
Lookup Precomputed Predictions
Write Features &
Predictions

Serverless Analytical ML Application in Python (2022)
surf-report-features.ipynb
swell-features.ipynb
batch-predict-surf.ipynb
Github
Pages
Hopsworks
Feature Store
Lahinch, NOAA
Hopsworks
Model Registry
download
model
latest_lahinch.png
insert
DataFrames
https://github.com/jimdowling/cjsurf
train-model.ipynb
add model
SERVERLESS COMPUTE SERVERLESS STATE SERVERLESS UI

Feature Engineering with Pandas/Spark/SQL/Flink
Feature Store
HOPSWORKS
DataFrames DataFrames/Files
Aggregations
Dimensionality
Reductions
Validations
Normalization
One-hot encoding
SQL

Feature Groups Feature Views
Search, Versioning, Metrics
Lineage, Source Code
</>
Feature Store: write to Feature Groups, read from Feature Views
Write DataFrames
Real-Time
Features
Batch Data
Read Feature Vectors
Online API
HOPSWORKS FEATURE STORE
Read Files/DataFrames
Offline API

Feature Engineering: at what time does the swell hit Lahinch (“hits_at”)?
Prediction
Time=0
“hits_at”
Lahinch Time=?
The swell speed is approximated by multiplying the swell period
(in seconds) by 1.5, which gives the speed in knots. But we also
need to consider swell direction.
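The arrival-time feature described above could be computed roughly as follows. This is a minimal sketch, not the CJSurf code: it assumes the period is in seconds and that 1.5 × period gives the speed in knots, it uses a hypothetical buoy-to-beach distance in nautical miles, and it ignores swell direction.

```python
from datetime import datetime, timedelta


def swell_speed_knots(period_s: float) -> float:
    """Rule of thumb from the slide: speed is 1.5 x the swell period.

    Assumption: period in seconds, result in knots.
    """
    return 1.5 * period_s


def hits_at(prediction_time: datetime, period_s: float, distance_nm: float) -> datetime:
    """Estimate when a swell predicted at a buoy reaches the beach.

    distance_nm is a hypothetical buoy-to-beach distance in nautical miles.
    Swell direction is ignored in this sketch.
    """
    hours = distance_nm / swell_speed_knots(period_s)
    return prediction_time + timedelta(hours=hours)


# A 10 s swell travels at about 15 knots, so it covers 150 nm in 10 hours.
arrival = hits_at(datetime(2004, 1, 1, 0, 0), period_s=10.0, distance_nm=150.0)
print(arrival)  # 2004-01-01 10:00:00
```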

Swell Direction and the Swell Window at Lahinch
SWELL WINDOW
for Lahinch
Lahinch
Swell directions that work for
Lahinch ~(15-120 degrees)
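A swell-window check can be expressed as a simple range filter. The sketch below uses the approximate 15-120 degree window quoted on the slide and assumes the window does not wrap through 0/360 degrees; the function name is illustrative.

```python
SWELL_WINDOW_DEG = (15.0, 120.0)  # approximate window quoted on the slide


def in_swell_window(direction_deg: float, window=SWELL_WINDOW_DEG) -> bool:
    """Return True if the swell direction falls inside the window (inclusive)."""
    low, high = window
    return low <= direction_deg % 360 <= high


directions = [88, 92, 100, 200]
print([d for d in directions if in_swell_window(d)])  # [88, 92, 100]
```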

Writing Pandas DataFrames to Hopsworks as Feature Groups
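The code on this slide did not survive extraction. As a rough sketch of what writing a DataFrame to a feature group can look like with the Hopsworks Python API, assuming `fs` is a feature store handle (for example from `hopsworks.login().get_feature_store()`); the feature group name, primary key, event time, and description are illustrative assumptions:

```python
def write_swells(fs, swells_df):
    """Insert a DataFrame of swell features into a Hopsworks feature group.

    fs is a Hopsworks feature store handle. The name, keys, and description
    below are assumptions for illustration, not taken from the talk.
    """
    fg = fs.get_or_create_feature_group(
        name="noa_swells",
        version=1,
        primary_key=["buoy_id"],
        event_time="hits_at",
        description="NOAA swell predictions with estimated arrival time",
    )
    fg.insert(swells_df)  # writes the DataFrame rows to the feature group
    return fg
```

Keeping the write behind a small function makes it easy to call from a scheduled feature pipeline and to test with a stub in place of the real feature store.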

Data Validation for Feature Groups with Great Expectations

Data Validation with Great Expectations and Hopsworks

Feature
Group
Hopsworks Alerting for Data Validation with Great Expectations
✔
❌
DataFrame Great
Expectations
Hopsworks
Alert
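The validate-then-alert flow in the diagram can be illustrated without either library. The sketch below is a plain-Python stand-in for a range expectation (the kind of rule Great Expectations expresses as `expect_column_values_to_be_between`); it does not use the Great Expectations or Hopsworks APIs, and the corrupt row and the bounds are made up for illustration.

```python
def validate_range(rows, column, min_value, max_value):
    """Return the rows whose `column` value lies outside [min_value, max_value]."""
    return [r for r in rows if not (min_value <= r[column] <= max_value)]


swells = [
    {"buoy_id": 62105, "height": 1.25, "direction": 88, "period": 9.8},
    {"buoy_id": 62105, "height": -3.0, "direction": 92, "period": 10.2},  # made-up bad reading
]

failures = validate_range(swells, "height", 0.0, 30.0)
if failures:
    # In the architecture on the slide, this is where an alert would fire
    # and the DataFrame would be rejected instead of written.
    print(f"validation failed for {len(failures)} row(s)")
else:
    print("validation passed; safe to insert")
```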

Creating
Training Data From
Feature Groups

Join Features to create Point-in-time Correct Training Data
lahinch_surf_reports (updated every 24 hrs)
beach_id  obs_time          height  min  max
1         2004-01-01 10:00  1       1    1
1         2004-01-02 11:00  1.5     1    2
1         2004-01-03 12:00  3       2    4
noa_swells (updated every 6 hrs)
buoy_id  hits_at           height  direction  period
62105    2004-01-01 00:00  1.25    88         9.8
62105    2004-01-02 06:00  1.30    92         10.2
62105    2004-01-03 12:00  2.45    100        11.4
Point-in-time Correct JOIN (no future data leakage)
Training Data
obs_time => hits_at  height (swell)  direction  period  height (label)
2004-01-01 10:00     1.25            88         9.8     1.5
2004-01-02 11:00     1.30            92         10.2    2
2004-01-03 12:00     2.45            100        11.4    3
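The point-in-time join can be sketched in plain Python: for each surf report, take the most recent swell whose `hits_at` is not later than the report's `obs_time`. This is only an illustration of the idea, using the feature-group rows shown on the slide with labels taken from the reports' height column; it is not how Hopsworks implements the join. With DataFrames, `pandas.merge_asof` performs the same kind of backward-looking match.

```python
from bisect import bisect_right
from datetime import datetime


def ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M")


# (obs_time, observed surf height) from lahinch_surf_reports
reports = [
    (ts("2004-01-01 10:00"), 1.0),
    (ts("2004-01-02 11:00"), 1.5),
    (ts("2004-01-03 12:00"), 3.0),
]
# (hits_at, height, direction, period) from noa_swells, sorted by hits_at
swells = [
    (ts("2004-01-01 00:00"), 1.25, 88, 9.8),
    (ts("2004-01-02 06:00"), 1.30, 92, 10.2),
    (ts("2004-01-03 12:00"), 2.45, 100, 11.4),
]


def point_in_time_join(reports, swells):
    """For each report, attach the latest swell with hits_at <= obs_time."""
    hit_times = [s[0] for s in swells]
    rows = []
    for obs_time, label in reports:
        i = bisect_right(hit_times, obs_time)  # count of swells at or before obs_time
        if i == 0:
            continue  # no swell known yet: skip rather than leak a future value
        _, height, direction, period = swells[i - 1]
        rows.append((obs_time, height, direction, period, label))
    return rows


for row in point_in_time_join(reports, swells):
    print(row)
```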

Python DSL for Point-in-Time JOINs, transpiled into SQL
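# lahinch and swells are feature groups; fs is the feature store handle
# Select the label from the surf reports and join the swell features
query = (
    lahinch.select(['wave_height'])
    .join(swells.select(['height', 'period', 'direction']))
)
fv = fs.create_feature_view(
    name='lahinch_surf',
    description="Lahinch surf height features",
    version=1,
    labels=["wave_height"],
    query=query,
)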

Avoid Training/Serving
Skew with Online Models

Maximize Feature Reuse: Transformations after Feature Groups
Feature Store
HOPSWORKS
DataFrames DataFrames/Files
Aggregations
Dimensionality
Reductions
Validations
Normalization
One-hot encoding
SQL

Normalizing numerical features often improves model performance
Normalization of swell height, period, distance
RMSE 5.11
RMSE 7.0

Scikit-Learn Transformation Functions
Training Pipeline
import joblib
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
joblib.dump(scaler, 'scaler.pkl')
Online Inference Pipeline
import joblib
scaler = joblib.load("/path/to/scaler.pkl")
X_test_scaled = scaler.transform(X_test)  # reuse the fitted scaler; do not refit
Ensure
Consistency
with Versioning
& Code Review
& Testing
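The consistency problem on this slide can be shown with a dependency-free example. The class below is a deliberately tiny stand-in for scikit-learn's `MinMaxScaler` (single feature, JSON instead of joblib), written only to show that refitting at inference time produces different values than reusing the parameters fitted in training; the numbers are made up.

```python
import json


class TinyMinMaxScaler:
    """Minimal stand-in for a min-max scaler on a single feature."""

    def fit(self, values):
        self.min_, self.max_ = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.max_ - self.min_) or 1.0  # avoid dividing by zero
        return [(v - self.min_) / span for v in values]


# Training pipeline: fit once and persist the fitted parameters.
train_heights = [1.0, 2.0, 3.0, 5.0]
scaler = TinyMinMaxScaler().fit(train_heights)
state = json.dumps({"min": scaler.min_, "max": scaler.max_})

# Inference pipeline: restore the fitted parameters and only transform.
params = json.loads(state)
restored = TinyMinMaxScaler()
restored.min_, restored.max_ = params["min"], params["max"]

request = [2.0, 3.0]
consistent = restored.transform(request)
skewed = TinyMinMaxScaler().fit(request).transform(request)  # refit: wrong

print(consistent)  # [0.25, 0.5]
print(skewed)      # [0.0, 1.0]
```

The same input maps to different scaled values depending on which data the scaler was fitted on, which is the training/serving skew the slide warns about.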

Online Transformation Functions
Training Pipeline
standard_scaler = fs.get_transformation_function(name="standard_scaler")
transformation_functions = {
    "height": standard_scaler,
    "period": standard_scaler,
    "direction": standard_scaler,
}
fv = fs.create_feature_view(name='lahinch_surf',
    …
    transformation_functions=transformation_functions)
X_train, X_test, y_train, y_test = fv.train_test_split(test_size=0.1)
Online Inference Pipeline
keys = {"beach_id": 1}
feature_vector = fv.get_feature_vector(keys)
Transformation
functions (UDFs)
consistent over
training & serving

Lesson Learned:
Refactor Monolithic ML
Pipelines into
Production ML Services

Beyond Notebooks and Monolithic ML Pipelines
Feature
Engineering
Train
Model
Evaluate
Model
Raw Data
● A monolithic ML pipeline is a single program that transforms raw data into features, then trains and scores the model
● No easy path to production, so often just thrown over the wall to ops :)

Refactor Monolithic Pipelines into Feature, Training, Inference Pipelines
Feature
Pipeline
Historical Data
Hopsworks
Data Source
Batch
Inference
Pipeline
Training
Pipeline
model
features inference data
training data
predictions
model
Run on a
schedule
Run
on-demand
● A feature pipeline that creates features from new live data or backfills features from historical data
● A training pipeline that can be run when a new model is needed
● An inference pipeline (either batch or online) that takes features from the feature store and, if the model is online, combines them with online features
backfill
new
data
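The split into feature, training, and batch inference pipelines can be sketched with three small functions that only communicate through shared state. Here a list and a dict stand in for the feature store and model registry, and an unnormalized k-NN regressor stands in for the model; all names and the choice of features are assumptions for illustration.

```python
# In-memory stand-ins for the feature store and model registry (assumptions).
feature_store = []   # rows of (swell_height, period, surf_height)
model_registry = {}


def feature_pipeline(raw_rows):
    """Run on a schedule (or as a backfill): validate and store features."""
    for height, period, surf in raw_rows:
        if height >= 0 and period > 0:
            feature_store.append((height, period, surf))


def training_pipeline(k=1):
    """Run on demand: 'train' a k-NN regressor by snapshotting the features."""
    model_registry["surf_knn"] = {"k": k, "rows": list(feature_store)}


def batch_inference_pipeline(new_features):
    """Run on a schedule: predict surf height for new swell features."""
    model = model_registry["surf_knn"]
    predictions = []
    for height, period in new_features:
        nearest = sorted(
            model["rows"],
            key=lambda r: (r[0] - height) ** 2 + (r[1] - period) ** 2,
        )[: model["k"]]
        predictions.append(sum(r[2] for r in nearest) / len(nearest))
    return predictions


feature_pipeline([(1.25, 9.8, 1.0), (1.30, 10.2, 1.5), (2.45, 11.4, 3.0)])
training_pipeline(k=1)
print(batch_inference_pipeline([(2.4, 11.0)]))  # [3.0]
```

Because each function reads and writes only the shared stores, each can be scheduled, rerun, or replaced independently of the others.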

Online Inference Pipelines are part of Model Serving Infra
● Some features are pre-computed and retrieved from the feature store (typically those that require history and context information)
● Some features are computed on-demand (at run-time) with application-supplied data (and possibly also history/context)
Feature
Pipeline
Historical Data
Hopsworks
Batch Source
Model
Serving
features
precomputed
features Application
or Service
request
on-demand
features
Stream Source
Training
Pipeline
model
training data
Run on a
schedule
Run
on-demand
Operational
Service
prediction
backfill

Case Study: Iris Flowers as a Batch Prediction Service
iris-feature-pipeline.ipynb
iris.csv
Hopsworks
Synthetic Data
iris-batch-inference-pipeline.ipynb
iris-train-knn-model.ipynb
register
model
features DataFrame
training data
iris_model
GH Actions
Once/day
Colab - run
on-demand
backfill
new
data
Github
Pages UI
https://github.com/featurestoreorg/serverless-ml-course/tree/main/src/01-module

SERVERLESS ML
www.serverless-ml.org
September 2022

Serverless ML Flywheels with Hopsworks
PyData London Exclusive: limited registrations now available at:
https://app.hopsworks.ai
Our Promise to you:
Time Unlimited Free Tier

www.hopsworks.ai
Open &
modular
Efficiency
At Scale
Compliance
Governance
Twitter: @jim_dowling
