Feature store:
Solving anti-patterns
in ML-systems
About
Synerise
Synerise is a European data company that collects,
interprets and leverages online and offline data with
the use of AI to power 1:1 Customer Engagement.
Our technology helps to power brands in all major
B2C verticals including retail, consumer banking,
telecommunications, public and automotive.
AI: a powerful
engine of growth
Customer
Engagement
Empower
Employee
Innovation
Cost
Optimization
Product
Transformation
Challenges
to address
1. Combine available datasets for each customer
2. Perform regression, scoring, ranking, segmentation, anomaly detection, …
3. Do all of that in real-time
4. Support non-stationary, evolving data distributions
5. Support evolving feature spaces
6. Support incremental improvement when new data sources become available
7. Achieve performance on par with or better than dedicated single use-case models
8. Low latency, high throughput!
9. Data safety - all data can be obfuscated via hashing, quantization, etc.
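To make the data-safety point concrete, here is a minimal sketch of obfuscation via hashing and quantization. It is an illustration only, not Synerise's implementation: the function names, the salt, the bucket count and the bin edges are all assumptions.

```python
import hashlib
from bisect import bisect_right
from typing import Sequence


def hash_feature(value: str, salt: str, buckets: int = 2**20) -> int:
    """Map a raw categorical value to an anonymous bucket id (salted SHA-256)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def quantize(value: float, edges: Sequence[float]) -> int:
    """Replace an exact numeric value with the index of the bin it falls into."""
    return bisect_right(edges, value)


# Hypothetical customer record; field names are made up for this example.
record = {"email": "jane@example.com", "basket_value": 137.40}
edges = [10.0, 50.0, 100.0, 500.0]  # assumed bin edges

obfuscated = {
    "email_bucket": hash_feature(record["email"], salt="demo-salt"),
    "basket_bin": quantize(record["basket_value"], edges),
}
```

Downstream models only see the bucket id and the bin index, never the raw e-mail address or the exact amount.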
Observation
Reality of ML
system
Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd
Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
"…a mature system might end up being
(at most) 5% machine learning code
and (at least) 95% glue code"
Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips,
Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
ML Systems
Anti-Patterns
1. Glue Code
2. Pipeline jungles
3. Dead experimental code paths
4. Reproducibility debt & inconsistency between training and serving
5. Multi-model systems
6. Data processing doesn't scale
7. Real-time features require engineers
8. Lack of data testing
9. Lack of feature discovery
10. Lack of standardization
11. Multi-language issue
"Data is the hardest part of ML and the most important piece to get right.
Modelers spend most of their time selecting and transforming features at training time
and then building the pipelines to deliver those features to production models."
Source: Scaling Machine Learning at Uber with Michelangelo, Jeremy Hermann and Mike Del Balso
Machine Learning & Data science are in the same place
where software engineering was 20 years ago...
Remedy
First-class
entity
Machine learning and data science are about data, but data is often not a first-class entity
in such systems.
So:
1. Let's make data a first-class entity, as code is in software engineering
2. Let's make features a first-class entity, as functions/modules are in software engineering
3. Let's think about models as compiled software libraries
First-class
entity
Let people be creative and do great work; free them from the routine and boring,
but necessary, tasks:
o data access & ingestion
o data processing & cleaning
o feature engineering & management
o data modeling & building processing pipelines
First-class
entity
Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd
Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
Feature
store
Feature store is:
o a place to store unified, versioned, tested and documented features
o an interface between data engineering and model development
o an interface for feature discovery and analysis
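The definition above can be sketched as a tiny in-memory registry. This is a hedged illustration, not the design from the talk: `FeatureDef`, `FeatureRegistry`, the tag-based search and the example feature are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FeatureDef:
    """A documented, versioned feature definition (hypothetical structure)."""
    name: str
    version: int
    description: str
    compute: Callable[[dict], float]
    tags: Tuple[str, ...] = ()


class FeatureRegistry:
    """Stores feature definitions and lets modelers look them up or discover them."""

    def __init__(self) -> None:
        self._features: Dict[Tuple[str, int], FeatureDef] = {}

    def register(self, feature: FeatureDef) -> None:
        key = (feature.name, feature.version)
        if key in self._features:
            raise ValueError(f"{feature.name} v{feature.version} already registered")
        self._features[key] = feature

    def get(self, name: str, version: int) -> FeatureDef:
        return self._features[(name, version)]

    def latest(self, name: str) -> FeatureDef:
        versions = [v for (n, v) in self._features if n == name]
        if not versions:
            raise KeyError(name)
        return self._features[(name, max(versions))]

    def search(self, tag: str) -> List[str]:
        """Feature discovery: list every registered feature carrying the tag."""
        return sorted(
            f"{f.name}:v{f.version}" for f in self._features.values() if tag in f.tags
        )


registry = FeatureRegistry()
registry.register(FeatureDef(
    "avg_basket_value", 1, "Mean order value",
    lambda c: sum(c["orders"]) / len(c["orders"]), ("basket",),
))
registry.register(FeatureDef(
    "avg_basket_value", 2, "Mean order value, 0.0 for customers without orders",
    lambda c: sum(c["orders"]) / len(c["orders"]) if c["orders"] else 0.0, ("basket",),
))
```

Old versions stay addressable, so a model trained on version 1 can keep requesting exactly that definition while new models pick up version 2.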
Raw/Structured Data -> Feature Engineering -> Feature store -> Training & Serving -> Models
Feature
store
[Diagram 1: Data set 1-3, Feature engineering 1-3, Model 1-3 (separate feature engineering per model)]
[Diagram 2: Data set 1-3, Feature Store, Model 1-3 (one shared feature store)]
Feature
store gives:
1. Feature versioning
2. Feature trust - features can be tested
3. Feature consistency
4. Feature discovery and reuse
5. Feature documentation and analytics
6. Standardized access to features between training and serving - also reproducibility of results
7. Automatic backfilling of features - avoids expensive recomputations
8. Features can be access controlled
9. Production model results can be features for other models
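Two of the points above, standardized access between training and serving and backfilling without recomputation, can be sketched as follows. This is a toy under stated assumptions: `TinyFeatureStore`, its methods and the example features are hypothetical, and a real store would also need invalidation when raw data changes.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (entity_id, feature_name)


class TinyFeatureStore:
    """Computes each feature value once; training and serving read the same values."""

    def __init__(self, definitions: Dict[str, Callable[[dict], float]]) -> None:
        self._definitions = definitions
        self._values: Dict[Key, float] = {}
        self.computations = 0  # counts how often a feature function actually ran

    def backfill(self, raw: Dict[str, dict]) -> None:
        """Materialize only the values that are not stored yet."""
        for entity_id, row in raw.items():
            for name, fn in self._definitions.items():
                if (entity_id, name) not in self._values:
                    self._values[(entity_id, name)] = fn(row)
                    self.computations += 1

    def get_vector(self, entity_id: str, names: Iterable[str]) -> List[float]:
        """Single access path used by both training and serving."""
        return [self._values[(entity_id, n)] for n in names]


raw = {"c1": {"orders": [10.0, 30.0]}, "c2": {"orders": [5.0]}}
store = TinyFeatureStore({
    "order_count": lambda r: float(len(r["orders"])),
    "total_spend": lambda r: sum(r["orders"]),
})
store.backfill(raw)
store.backfill(raw)  # second call finds everything stored and recomputes nothing

features = ["order_count", "total_spend"]
training_rows = [store.get_vector(cid, features) for cid in sorted(raw)]
serving_row = store.get_vector("c1", features)
```

Because both paths call `get_vector`, the vector a model sees at serving time is the same one it was trained on.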
Feature
store
[Chart: avg. cost of a new ML project vs. number of curated features in the feature store]
Source: The Feature Store in Hopsworks, Jim Dowling
Feature
store architecture
o Source: Event Stream, Batch Data
o Create: Stream Transform, Batch Transform
o Ingest: Ingest
o Store: Feature Storage, Feature Metadata
o Access: Model API, Discovery API, Model Serving, Model Training
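One way to read the Create and Ingest stages is that stream and batch transforms should produce the same feature values. The sketch below illustrates that idea with a running mean; it is an assumption about how such stages could share logic, not a description of the system in the talk, and every name in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class RunningMean:
    """Incrementally maintained mean, cheap to update per event."""
    count: int = 0
    total: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.total += value

    @property
    def value(self) -> float:
        return self.total / self.count if self.count else 0.0


def stream_transform(state: Dict[str, RunningMean], customer_id: str, amount: float) -> None:
    """Stream path: fold one incoming event into the stored feature state."""
    state.setdefault(customer_id, RunningMean()).update(amount)


def batch_transform(events: Iterable[Tuple[str, float]]) -> Dict[str, RunningMean]:
    """Batch path: replay history through the same update logic as the stream path."""
    state: Dict[str, RunningMean] = {}
    for customer_id, amount in events:
        stream_transform(state, customer_id, amount)
    return state


history = [("c1", 10.0), ("c2", 4.0), ("c1", 30.0)]  # made-up batch data
storage = batch_transform(history)       # initial load from batch data
stream_transform(storage, "c1", 20.0)    # later event arriving on the stream
```

Reusing one update function for both paths is a simple guard against the training/serving inconsistency listed among the anti-patterns.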
Feature
store - storage:
1. ClickHouse:
o Scalable, column-oriented big data database
o Easy to use
o Handles large and sparse feature spaces
o ASOF join - joining sequences with a non-exact match
2. SSDB:
o Persistent, high-performance key-value database
o Implements the Redis protocol
o Designed to store collection data
o Replication (master-slave), load balancing
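The ASOF join idea, matching each row to the closest earlier row instead of requiring an exact key match, can be illustrated in plain Python. This is a sketch of the concept only; it does not reproduce ClickHouse's exact semantics, and the function name and sample data are made up.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int]                # (customer_id, event_time)
FeatureRow = Tuple[str, int, float]    # (customer_id, valid_from, value)


def asof_join(
    events: List[Event], feature_rows: List[FeatureRow]
) -> List[Tuple[str, int, Optional[float]]]:
    """Attach to each event the latest feature value with valid_from <= event_time."""
    by_key: Dict[str, Tuple[List[int], List[float]]] = {}
    for key, ts, value in sorted(feature_rows):  # sorted by key, then time
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for key, ts in events:
        times, values = by_key.get(key, ([], []))
        i = bisect_right(times, ts)  # number of feature rows at or before ts
        joined.append((key, ts, values[i - 1] if i else None))
    return joined


feature_rows = [("c1", 100, 0.2), ("c1", 200, 0.5), ("c2", 150, 0.9)]
events = [("c1", 199), ("c1", 200), ("c1", 50), ("c2", 300)]
result = asof_join(events, feature_rows)
```

For training data this kind of join matters because it picks the feature value that was valid at the time of each event, rather than a later one.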
Feature
store architecture
[Architecture diagram repeated, labeled SSDB]
Feature
store
Thanks to the feature store, we are able to:
o cut down new model development time
o cut down model training time
o easily test new ideas
In short:
focus on the interesting and creative parts of machine-learning-based systems.
Next steps
and future work
o unify streaming part
o implement feature analytics and monitoring
o improve feature documentation
Andrzej Michałowski
Head of AI Research and Development
andrzej.michalowski@synerise.com
Thank you
Questions?

Feature store: Solving anti-patterns in ML-systems

  • 2. About Synerise Synerise is a European data company that collects, interprets and leverages online and offline data with the use of AI to power 1:1 Customer Engagement. Our technology helps to power brands in all major B2C verticals including retail, consumer banking, telecommunications, public and automotive.
  • 3. AI: a powerful engine of growth Customer Engagement Empower Employee Innovation Cost Optimization Product Transformation
  • 4. Challenges to address: 1. Combine available datasets for each customer 2. Perform regression, scoring, ranking, segmentation, anomaly detection, … 3. Do all of that in real time 4. Support non-stationary, evolving data distributions 5. Support evolving feature spaces 6. Support incremental improvement when new data sources become available 7. Achieve performance on par with or better than dedicated single-use-case models 8. Low latency, high throughput! 9. Data safety: all data can be obfuscated via hashing, quantization etc.
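The data-safety point on slide 4 names hashing and quantization without saying how they are applied. The following is a minimal sketch of the two ideas, not the deck's implementation: the function names, the record fields (`email`, `spend`), the secret key and the bucket width of 50 are all assumptions made for illustration.

```python
import hashlib
import hmac


def hash_identifier(value: str, secret: bytes) -> str:
    """Replace a raw identifier with a keyed SHA-256 digest (hex string)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def quantize(value: float, bucket_width: float) -> int:
    """Map a numeric value to the index of its fixed-width bucket."""
    if bucket_width <= 0:
        raise ValueError("bucket_width must be positive")
    return int(value // bucket_width)


def obfuscate_record(record: dict, secret: bytes) -> dict:
    """Hypothetical record layout: hash the identifier, bucket the amount."""
    return {
        "customer": hash_identifier(record["email"], secret),
        "spend_bucket": quantize(record["spend"], 50.0),
    }


# Hypothetical usage: the same input always maps to the same obfuscated row,
# so joins across datasets still work on the hashed key.
row = obfuscate_record({"email": "jan@example.com", "spend": 120.0}, b"demo-secret")
```

A keyed hash is used here instead of a plain digest so that identifiers cannot be recovered by hashing a list of known e-mail addresses without the key; that is a design choice of this sketch.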
  • 6. Reality of ML system Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  • 7. "…a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code" Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  • 9. ML Systems Anti-Patterns: 1. Glue Code 2. Pipeline jungles 3. Dead experimental code paths
  • 10. ML Systems Anti-Patterns: 1. Glue Code 2. Pipeline jungles 3. Dead experimental code paths 4. Reproducibility debt & inconsistency between training and serving 5. Multi-model systems
  • 11. ML Systems Anti-Patterns: 1. Glue Code 2. Pipeline jungles 3. Dead experimental code paths 4. Reproducibility debt & inconsistency between training and serving 5. Multi-model systems 6. Data processing doesn’t scale 7. Real-time features require engineers
  • 12. ML Systems Anti-Patterns: 1. Glue Code 2. Pipeline jungles 3. Dead experimental code paths 4. Reproducibility debt & inconsistency between training and serving 5. Multi-model systems 6. Data processing doesn’t scale 7. Real-time features require engineers 8. Lack of data testing 9. Lack of feature discovery 10. Lack of standardization 11. Multi-language issue
  • 13. "Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models." Source: Scaling Machine Learning at Uber with Michelangelo, Jeremy Hermann and Mike Del Balso
  • 14. Machine Learning & Data science are in the same place where software engineering was 20 years ago...
  • 16. First-class entity Machine learning and data science are about data, but often data is not a first-class entity in such systems. So: 1. Let's make data a first-class entity, as code is for software engineering 2. Let's make features a first-class entity, as functions/modules are for software engineering 3. Let's think about models as compiled software libraries
  • 17. First-class entity Let people be creative and do awesome work; free them from the routine, boring, but necessary tasks: o data access & ingestion o data processing & cleaning o feature engineering & management o data modeling & building processing pipelines
  • 18. First-class entity Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  • 19. Feature store A feature store is: o a place to store unified, versioned, tested and documented features o an interface between data engineering and model development o an interface for feature discovery and analysis [Diagram: Raw/Structured Data → Feature Engineering → Feature store → Training & Serving → Models]
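The three roles listed on slide 19 (a store of versioned, tested, documented features; an interface between data engineering and modelling; a discovery interface) can be illustrated with a small in-memory registry. This is only a sketch under stated assumptions: `FeatureDefinition`, `FeatureRegistry` and the `avg_basket` feature are hypothetical names invented here, and a real feature store would persist definitions and values rather than hold them in a dict.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    version: int
    description: str                  # documentation, also used for discovery
    compute: Callable[[dict], float]  # transformation from a raw record
    check: Callable[[float], bool]    # data test applied to every value


class FeatureRegistry:
    def __init__(self) -> None:
        self._features: Dict[Tuple[str, int], FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        key = (feature.name, feature.version)
        if key in self._features:
            raise ValueError(f"{feature.name} v{feature.version} already registered")
        self._features[key] = feature

    def get(self, name: str, version: Optional[int] = None) -> FeatureDefinition:
        """Return the requested version, or the latest one if none is given."""
        if version is None:
            versions = [v for (n, v) in self._features if n == name]
            if not versions:
                raise KeyError(name)
            version = max(versions)
        return self._features[(name, version)]

    def search(self, keyword: str) -> List[str]:
        """Discovery: match the keyword against names and descriptions."""
        kw = keyword.lower()
        return sorted(
            f"{f.name}:v{f.version}"
            for f in self._features.values()
            if kw in f.name.lower() or kw in f.description.lower()
        )

    def compute(self, name: str, raw: dict, version: Optional[int] = None) -> float:
        feature = self.get(name, version)
        value = feature.compute(raw)
        if not feature.check(value):
            raise ValueError(f"{name} failed its data test: {value!r}")
        return value


# Hypothetical feature with two versions; v2 fixes division by zero.
registry = FeatureRegistry()
registry.register(FeatureDefinition(
    name="avg_basket", version=1,
    description="Average basket value per order",
    compute=lambda raw: raw["total_spend"] / raw["orders"],
    check=lambda v: v >= 0,
))
registry.register(FeatureDefinition(
    name="avg_basket", version=2,
    description="Average basket value per order; 0.0 when there are no orders",
    compute=lambda raw: raw["total_spend"] / raw["orders"] if raw["orders"] else 0.0,
    check=lambda v: v >= 0,
))
```

Keeping old versions registered means a model trained against `avg_basket` v1 can keep requesting v1 explicitly while new models pick up v2 by default.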
  • 20. Feature store [Diagram labels: Data set 1, Data set 2, Data set 3; Feature engineering 1, Feature engineering 2, Feature engineering 3; Model 1, Model 2, Model 3]
  • 22. Feature store gives: 1. Feature versioning 2. Feature trust – features can be tested 3. Feature consistency 4. Feature discovery and reuse 5. Feature documentation and analytics 6. Standardized access to features between training and serving – also reproducibility of results 7. Automatic backfilling of features – avoids expensive recomputation 8. Features can be access controlled 9. Production model results can be features for other models
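Two of the benefits on slide 22, standardized access between training and serving and backfilling that avoids recomputation, can be shown together in a few lines. The sketch below assumes features are materialized per `(entity_id, date)` key; the class `MaterializedFeature` and its methods are hypothetical and stand in for whatever storage layer is actually used.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (entity_id, date), an assumed key layout


class MaterializedFeature:
    def __init__(self, compute: Callable[[str, str], float]) -> None:
        self._compute = compute
        self._values: Dict[Key, float] = {}
        self.compute_calls = 0  # counts how often the expensive function ran

    def backfill(self, keys: Iterable[Key]) -> int:
        """Compute and store values only for keys not yet materialized."""
        added = 0
        for key in keys:
            if key not in self._values:
                self._values[key] = self._compute(*key)
                self.compute_calls += 1
                added += 1
        return added

    def training_rows(self, keys: Iterable[Key]) -> List[float]:
        """Batch read for training."""
        return [self._values[key] for key in keys]

    def serve(self, entity_id: str, date: str) -> float:
        """Single-key read for serving; same stored values as training."""
        return self._values[(entity_id, date)]


# Hypothetical feature: length of the entity id plus the day of the month.
feature = MaterializedFeature(lambda entity, date: len(entity) + int(date[-2:]))
first = feature.backfill([("u1", "2020-01-01"), ("u2", "2020-01-02")])
second = feature.backfill([("u2", "2020-01-02"), ("u3", "2020-01-03")])
```

Because training and serving read from the same materialized values, a model sees identical numbers in both settings, and the second backfill only computes the one key that was missing.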
  • 23. Feature store [Chart: Avg. Cost of a New ML Project vs. Num. Curated Features in Feature Store] Source: The Feature Store in Hopsworks, Jim Dowling
  • 24. Feature store architecture [Diagram. Stages: Source, Create, Ingest, Store, Access. Components: Event Stream, Batch Data, Stream Transform, Batch Transform, Ingest, Feature Storage, Feature Metadata, Model API, Discovery API, Model Serving, Model Training]
  • 26. Feature store - storage: 1. ClickHouse: o Scalable, column-oriented big data database o Easy to use o Handles large and sparse feature spaces o ASOF join - joining sequences with a non-exact match 2. SSDB: o Persistent, high-performance key-value database o Implements the Redis protocol o Designed to store collection data o Replication (master-slave) and load balancing
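The "ASOF join" bullet on slide 26 refers to joining two time-ordered sequences when timestamps do not match exactly: each row on one side is paired with the closest earlier-or-equal row on the other. The sketch below shows that idea in plain Python using binary search; it illustrates the concept only and is not how ClickHouse implements it. The function name `asof_join`, the integer timestamps, and the choice to return `None` for events with no earlier feature value are assumptions of this example.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple

Event = Tuple[int, str]          # (timestamp, label)
FeatureRow = Tuple[int, float]   # (timestamp, feature value)


def asof_join(
    events: Sequence[Event],
    features: Sequence[FeatureRow],
) -> List[Tuple[int, str, Optional[float]]]:
    """Attach to each event the latest feature value with timestamp <= event time."""
    ordered = sorted(features)                 # sort feature rows by timestamp
    timestamps = [ts for ts, _ in ordered]
    joined = []
    for event_ts, label in events:
        # Index of the last feature row whose timestamp is <= event_ts.
        idx = bisect_right(timestamps, event_ts) - 1
        value = ordered[idx][1] if idx >= 0 else None
        joined.append((event_ts, label, value))
    return joined


feature_rows = [(10, 1.0), (20, 2.0), (30, 3.0)]
event_rows = [(5, "a"), (20, "b"), (25, "c"), (40, "d")]
result = asof_join(event_rows, feature_rows)
# result == [(5, "a", None), (20, "b", 2.0), (25, "c", 2.0), (40, "d", 3.0)]
```

For training data this kind of join matters because it pairs each labelled event with the feature value that was known at that moment, rather than a value recorded later.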
  • 27. Feature store architecture [Diagram. Stages: Source, Create, Ingest, Store, Access. Components: Event Stream, Batch Data, Stream Transform, Batch Transform, Ingest, Feature Storage, Feature Metadata, SSDB, Model API, Discovery API, Model Serving, Model Training]
  • 28. Feature store Thanks to the feature store, we are able to: o cut down new model development time o cut down model training time o easily test new ideas In short: focus on the interesting and creative parts of machine-learning-based systems.
  • 29. Next steps and future work o unify streaming part o implement feature analytics and monitoring o improve feature documentation
  • 30. Andrzej Michałowski, Head of AI Research and Development, andrzej.michalowski@synerise.com. Thank you. Questions?