SlideShare a Scribd company logo
Feature store:
Solving anti-patterns
in ML-systems
About
Synerise
Synerise is a European data company that collects,
interprets and leverages online and offline data with
the use of AI to power 1:1 Customer Engagement.
Our technology helps to power brands in all major
B2C verticals including retail, consumer banking,
telecommunications, public and automotive.
AI: a powerful
engine of growth
Customer
Engagement
Empower
Employee
Innovation
Cost
Optimization
Product
Transformation
Challanges
to address
Old
Combine available datasets for each
customer
Perform regression, scoring, ranking,
segmentation, anomaly detection, …
Do all of that in real-time
Support non-stationary, evolving data
distributions
Support evolving feature spaces
1.
2.
3.
4.
5.
Support incremental improvement when new
data sources become available6.
7.
8.
9.
Achieve performance on-par with
or better than dedicated
single use-case
models
Low latency, high throughput!
Data safety - all data can be obfuscated via
hashing, quantization etc.
Observation
Reality of ML
system
Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd
Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
"…a mature system might end up being
(at most) 5% machine learning code
and (at least) 95% glue code”
Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips,
Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
ML Systems
Anti-Patterns
Old
Glue Code1.
ML Systems
Anti-Patterns
Glue Code
Pipeline jungles
Dead experimental code paths
1.
2.
3.
ML Systems
Anti-Patterns
Glue Code
Pipeline jungles
Dead experimental code paths
Reproducibility debt & inconsistency
between training and serving
Multi-model systems
1.
2.
3.
4.
5.
ML Systems
Anti-Patterns
Old
Glue Code
Pipeline jungles
Dead experimental code paths
Reproducibility debt & inconsistency
between training and serving
Multi-model systems
1.
2.
3.
4.
5.
Data-processing doesn’t scale6.
7. Real-time Feature requires engineers
ML Systems
Anti-Patterns
Old
Glue Code
Pipeline jungles
Dead experimental code paths
Reproducibility debt & inconsistency
between training and serving
Multi-model systems
1.
2.
3.
4.
5.
Data-processing doesn’t scale6.
7.
9.
10.
Real-time Feature requires engineers
Lack of Feature discovery
Lack of standardization
Lack of data testing8.
11. Multi-language issue
„Data is the hardest part of ML and the most important piece to get right.
Modelers spend most of their time selecting and transforming Feature at training time
and then building the pipelines to deliver those Feature to production models.”
Source: Scaling Machine Learning at Uber with Michelangelo, Jeremy Hermann and Mike Del Balso
Machine Learning & Data science are in the same place
where software engineering was 20 years ago...
Remedy
First-class
entity
Machine learning and data science is about data, but often data is not a first-class entity
in such systems.
So:
1. Let's make the data a first-class entity as code is for software engineering
2. Let's make Feature a first-class entity as functions/modules are for software engineering
3. Let's think about models as compiled software libraries
First-class
entity
Let people be creative, do the awesome job, free them from the usual and boring,
but necessary:
o data access & ingestion
o data processing & cleaning
o feature engineering & management
o data modeling & building processing pipelines
First-class
entity
Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd
Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
Feature
store
Feature store is:
o a place to store unified, versioned, tested and documented Feature
o an interface between data engineering and model development
o an interface for feature discovery and analysis
Raw/Structered
Data
Feature store Models
Future
Engineering
Training & Serving
Feature
store
Model
1
Model
2
Model
3
Data
set 1
Data
set 2
Data
set 3
Feature
engineering 2
Feature
engineering 3
Feature
engineering 1
Feature
store
Model
1
Model
2
Model
3
Data
set 1
Data
set 2
Data
set 3
Feature Store
Feature
store gives:
Old
Feature versioning
Feature trust – can be tested
Feature consistency
Feature discovery and reuse
Feature documentation and analytics
1.
2.
3.
4.
5.
Standardized access to Feature between
training and serving
– also reproducibility of results
6.
8.
9.
Feature can be access controlled
Production model results can be Feature for
other models
Automatic backfilling of Feature –
avoid expensive re computations7.
Feature
store
Avg.CostofaNewML
Project
Num. Curated Feature
in Feature Store
Source: The Feature Store in Hopsworks, Jim Dowling
Feature
store architecture
Source Create Ingest Store Access
Event
Stream
Batch
Data
Stream
Transform
Batch
Transform
Ingest
Feature
Storage
ModelAPI
Discovery
API
Model
Serving
Model
Training
Feature
Metadata
Feature
store architecture
Source Create Ingest Store Access
Event
Stream
Batch
Data
Stream
Transform
Batch
Transform
Ingest
Feature
Storage
ModelAPI
Discovery
API
Model
Serving
Model
Training
Feature
Metadata
Feature
store - storage:
Old
Clickhouse:
o Scalable big data column-oriented
database
o Easy to use
o Handle large and sparse feature
spaces
o ASOF join - joining sequences with
a non-exact match
1. SSDB2.
o Persistent high performace key-
value database
o Implements Redis protocol
o Designed to store collection data
o Replication(master-slave), load
balance
Feature
store architecture
Source Create Ingest Store Access
Event
Stream
Batch
Data
Stream
Transform
Batch
Transform
Ingest
Feature
Storage
ModelAPI
Discovery
API
Model
Serving
Model
Training
Feature
Metadata
SSDB
Feature
store
Thanks to the Feature store, we are able to:
o cut down new model development time
o cut down model training time
o easily test new ideas
In one word:
focus on interesting and creative parts of machine learning based systems.
Next steps
and future work
o unify streaming part
o implement feature analytics and monitoring
o improve feature documentation
Andrzej Michałowski
Head of AI Research and Development
andrzej.michalowski@synerise.com
Thank you
Questions?

More Related Content

What's hot

CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
confluent
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
James Serra
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Seunghyun Lee
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
Databricks
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
Amazon Web Services
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
Databricks
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
James Serra
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
Daniel Marcous
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Altinity Ltd
 
Data mesh
Data meshData mesh
Data mesh
ManojKumarR41
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Márton Kodok
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
ConfluentInc1
 

What's hot (20)

CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBayReal-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
Real-time, Exactly-once Data Ingestion from Kafka to ClickHouse at eBay
 
Data mesh
Data meshData mesh
Data mesh
 
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google CloudVertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
Vertex AI - Unified ML Platform for the entire AI workflow on Google Cloud
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 

Similar to Feature store: Solving anti-patterns in ML-systems

Real time data processing and model inferncing platform with Kafka streams (N...
Real time data processing and model inferncing platform with Kafka streams (N...Real time data processing and model inferncing platform with Kafka streams (N...
Real time data processing and model inferncing platform with Kafka streams (N...
KafkaZone
 
Practical machine learning
Practical machine learningPractical machine learning
Practical machine learning
Faizan Javed
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
Data Science Milan
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)
Tao Xie
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Databricks
 
Artificial Intelligence As a Service
Artificial Intelligence As a ServiceArtificial Intelligence As a Service
Artificial Intelligence As a Service
John Liu
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
Johann Schleier-Smith
 
databricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineeringdatabricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineering
Mohamed MEJDOUBI
 
Introducción al Machine Learning Automático
Introducción al Machine Learning AutomáticoIntroducción al Machine Learning Automático
Introducción al Machine Learning Automático
Sri Ambati
 
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
MLconf
 
Making Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedMaking Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons Learned
Laurenz Wuttke
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1
Bill Liu
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
Matthew Sinclair
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI Pipelines
Nick Pentreath
 
Technovision
TechnovisionTechnovision
Technovision
SayantanGhosh58
 
Managing the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowManaging the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflow
Databricks
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
Jan Wiegelmann
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
Miroslaw Staron
 
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Matt Stubbs
 

Similar to Feature store: Solving anti-patterns in ML-systems (20)

Real time data processing and model inferncing platform with Kafka streams (N...
Real time data processing and model inferncing platform with Kafka streams (N...Real time data processing and model inferncing platform with Kafka streams (N...
Real time data processing and model inferncing platform with Kafka streams (N...
 
Practical machine learning
Practical machine learningPractical machine learning
Practical machine learning
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
Software Analytics: Towards Software Mining that Matters (2014)
Software Analytics:Towards Software Mining that Matters (2014)Software Analytics:Towards Software Mining that Matters (2014)
Software Analytics: Towards Software Mining that Matters (2014)
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
 
Artificial Intelligence As a Service
Artificial Intelligence As a ServiceArtificial Intelligence As a Service
Artificial Intelligence As a Service
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
databricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineeringdatabricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineering
 
Introducción al Machine Learning Automático
Introducción al Machine Learning AutomáticoIntroducción al Machine Learning Automático
Introducción al Machine Learning Automático
 
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
Shiva Amiri, Chief Product Officer, RTDS Inc. at MLconf SEA - 5/01/15
 
Making Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons LearnedMaking Data Science Scalable - 5 Lessons Learned
Making Data Science Scalable - 5 Lessons Learned
 
C19013010 the tutorial to build shared ai services session 1
C19013010  the tutorial to build shared ai services session 1C19013010  the tutorial to build shared ai services session 1
C19013010 the tutorial to build shared ai services session 1
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
Open, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI PipelinesOpen, Secure & Transparent AI Pipelines
Open, Secure & Transparent AI Pipelines
 
Technovision
TechnovisionTechnovision
Technovision
 
Managing the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflowManaging the Machine Learning Lifecycle with MLflow
Managing the Machine Learning Lifecycle with MLflow
 
AI meets Big Data
AI meets Big DataAI meets Big Data
AI meets Big Data
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
 
AI for Software Engineering
AI for Software EngineeringAI for Software Engineering
AI for Software Engineering
 
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
Big Data LDN 2018: HOW AUTOMATION CAN ACCELERATE THE DELIVERY OF MACHINE LEAR...
 

Recently uploaded

Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
Donna Lenk
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Globus
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
wottaspaceseo
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
Cyanic lab
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
Ortus Solutions, Corp
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
Google
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Globus
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
Juraj Vysvader
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
Globus
 

Recently uploaded (20)

Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"Navigating the Metaverse: A Journey into Virtual Evolution"
Navigating the Metaverse: A Journey into Virtual Evolution"
 
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
SOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar Research Team: Latest Activities of IntelBroker
SOCRadar Research Team: Latest Activities of IntelBroker
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdf
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Into the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdfInto the Box 2024 - Keynote Day 2 Slides.pdf
Into the Box 2024 - Keynote Day 2 Slides.pdf
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteAI Pilot Review: The World’s First Virtual Assistant Marketing Suite
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Enhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdfEnhancing Research Orchestration Capabilities at ORNL.pdf
Enhancing Research Orchestration Capabilities at ORNL.pdf
 

Feature store: Solving anti-patterns in ML-systems

  • 2. About Synerise Synerise is a European data company that collects, interprets and leverages online and offline data with the use of AI to power 1:1 Customer Engagement. Our technology helps to power brands in all major B2C verticals including retail, consumer banking, telecommunications, public and automotive.
  • 3. AI: a powerful engine of growth Customer Engagement Empower Employee Innovation Cost Optimization Product Transformation
  • 4. Challanges to address Old Combine available datasets for each customer Perform regression, scoring, ranking, segmentation, anomaly detection, … Do all of that in real-time Support non-stationary, evolving data distributions Support evolving feature spaces 1. 2. 3. 4. 5. Support incremental improvement when new data sources become available6. 7. 8. 9. Achieve performance on-par with or better than dedicated single use-case models Low latency, high throughput! Data safety - all data can be obfuscated via hashing, quantization etc.
  • 6. Reality of ML system Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  • 7. "…a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code” Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  • 9. ML Systems Anti-Patterns Glue Code Pipeline jungles Dead experimental code paths 1. 2. 3.
  • 10. ML Systems Anti-Patterns Glue Code Pipeline jungles Dead experimental code paths Reproducibility debt & inconsistency between training and serving Multi-model systems 1. 2. 3. 4. 5.
  • 11. ML Systems Anti-Patterns Old Glue Code Pipeline jungles Dead experimental code paths Reproducibility debt & inconsistency between training and serving Multi-model systems 1. 2. 3. 4. 5. Data-processing doesn’t scale6. 7. Real-time Feature requires engineers
  • 12. ML Systems Anti-Patterns Old Glue Code Pipeline jungles Dead experimental code paths Reproducibility debt & inconsistency between training and serving Multi-model systems 1. 2. 3. 4. 5. Data-processing doesn’t scale6. 7. 9. 10. Real-time Feature requires engineers Lack of Feature discovery Lack of standardization Lack of data testing8. 11. Multi-language issue
  • 13. „Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming Feature at training time and then building the pipelines to deliver those Feature to production models.” Source: Scaling Machine Learning at Uber with Michelangelo, Jeremy Hermann and Mike Del Balso
  • 14. Machine Learning & Data science are in the same place where software engineering was 20 years ago...
  • 16. First-class entity Machine learning and data science is about data, but often data is not a first-class entity in such systems. So: 1. Let's make the data a first-class entity as code is for software engineering 2. Let's make Feature a first-class entity as functions/modules are for software engineering 3. Let's think about models as compiled software libraries
  • 17. First-class entity Let people be creative, do the awesome job, free them from the usual and boring, but necessary: o data access & ingestion o data processing & cleaning o feature engineering & management o data modeling & building processing pipelines
  • 18. First-class entity Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  • 19. Feature store Feature store is: o a place to store unified, versioned, tested and documented Feature o an interface between data engineering and model development o an interface for feature discovery and analysis Raw/Structered Data Feature store Models Future Engineering Training & Serving
  • 20. Feature store Model 1 Model 2 Model 3 Data set 1 Data set 2 Data set 3 Feature engineering 2 Feature engineering 3 Feature engineering 1
  • 22. Feature store gives: Old Feature versioning Feature trust – can be tested Feature consistency Feature discovery and reuse Feature documentation and analytics 1. 2. 3. 4. 5. Standardized access to Feature between training and serving – also reproducibility of results 6. 8. 9. Feature can be access controlled Production model results can be Feature for other models Automatic backfilling of Feature – avoid expensive re computations7.
  • 23. Feature store Avg.CostofaNewML Project Num. Curated Feature in Feature Store Source: The Feature Store in Hopsworks, Jim Dowling
  • 24. Feature store architecture Source Create Ingest Store Access Event Stream Batch Data Stream Transform Batch Transform Ingest Feature Storage ModelAPI Discovery API Model Serving Model Training Feature Metadata
  • 25. Feature store architecture Source Create Ingest Store Access Event Stream Batch Data Stream Transform Batch Transform Ingest Feature Storage ModelAPI Discovery API Model Serving Model Training Feature Metadata
  • 26. Feature store - storage: Old Clickhouse: o Scalable big data column-oriented database o Easy to use o Handle large and sparse feature spaces o ASOF join - joining sequences with a non-exact match 1. SSDB2. o Persistent high performace key- value database o Implements Redis protocol o Designed to store collection data o Replication(master-slave), load balance
  • 27. Feature store architecture Source Create Ingest Store Access Event Stream Batch Data Stream Transform Batch Transform Ingest Feature Storage ModelAPI Discovery API Model Serving Model Training Feature Metadata SSDB
  • 28. Feature store Thanks to the Feature store, we are able to: o cut down new model development time o cut down model training time o easily test new ideas In one word: focus on interesting and creative parts of machine learning based systems.
  • 29. Next steps and future work o unify streaming part o implement feature analytics and monitoring o improve feature documentation
  • 30. Andrzej Michałowski Head of AI Research and Development andrzej.michalowski@synerise.com Thank you Questions?