Feature store: Solving anti-patterns in ML-systems

Description

Modern machine learning systems can be very complex and are prone to many pitfalls. It is easy to unintentionally introduce technical debt into such a complex structure. One approach to solving some of these anti-patterns is a feature store. A feature store is the missing piece that fills the gap between raw data and machine learning models. Not only does it help you handle technical debt, but, even more importantly, it shortens the time needed to develop a new model.

Transcript

  1. Feature store: Solving anti-patterns in ML-systems
  2. About Synerise. Synerise is a European data company that collects, interprets and leverages online and offline data with the use of AI to power 1:1 Customer Engagement. Our technology helps to power brands in all major B2C verticals, including retail, consumer banking, telecommunications, public and automotive.
  3. AI: a powerful engine of growth. Customer Engagement; Empower Employee; Innovation; Cost Optimization; Product Transformation.
  4. Challenges to address: (1) Combine available datasets for each customer. (2) Perform regression, scoring, ranking, segmentation, anomaly detection, … (3) Do all of that in real time. (4) Support non-stationary, evolving data distributions. (5) Support evolving feature spaces. (6) Support incremental improvement when new data sources become available. (7) Achieve performance on par with or better than dedicated single-use-case models. (8) Low latency, high throughput! (9) Data safety: all data can be obfuscated via hashing, quantization, etc.
  5. Observation
  6. Reality of ML system. Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  7. "…a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code" Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  8. ML Systems Anti-Patterns: (1) Glue code.
  9. ML Systems Anti-Patterns: (1) Glue code. (2) Pipeline jungles. (3) Dead experimental code paths.
  10. ML Systems Anti-Patterns: (1) Glue code. (2) Pipeline jungles. (3) Dead experimental code paths. (4) Reproducibility debt & inconsistency between training and serving. (5) Multi-model systems.
  11. ML Systems Anti-Patterns: (1) Glue code. (2) Pipeline jungles. (3) Dead experimental code paths. (4) Reproducibility debt & inconsistency between training and serving. (5) Multi-model systems. (6) Data processing doesn't scale. (7) Real-time features require engineers.
  12. ML Systems Anti-Patterns: (1) Glue code. (2) Pipeline jungles. (3) Dead experimental code paths. (4) Reproducibility debt & inconsistency between training and serving. (5) Multi-model systems. (6) Data processing doesn't scale. (7) Real-time features require engineers. (8) Lack of data testing. (9) Lack of feature discovery. (10) Lack of standardization. (11) Multi-language issue.
  13. "Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models." Source: Scaling Machine Learning at Uber with Michelangelo, Jeremy Hermann and Mike Del Balso
  14. Machine learning and data science are where software engineering was 20 years ago...
  15. Remedy
  16. First-class entity. Machine learning and data science are about data, but data is often not a first-class entity in such systems. So: (1) Let's make data a first-class entity, as code is for software engineering. (2) Let's make features a first-class entity, as functions/modules are for software engineering. (3) Let's think about models as compiled software libraries.
  17. First-class entity. Let people be creative and do awesome work; free them from the usual, boring, but necessary tasks: data access & ingestion; data processing & cleaning; feature engineering & management; data modeling & building processing pipelines.
  18. First-class entity. Source: Hidden Technical Debt in Machine Learning Systems, D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison
  19. Feature store. A feature store is: a place to store unified, versioned, tested and documented features; an interface between data engineering and model development; an interface for feature discovery and analysis. [Diagram: Raw/structured data, Feature engineering, Feature store, Training & serving, Models]
  20. Feature store. [Diagram: Data sets 1, 2, 3; Feature engineering 1, 2, 3; Models 1, 2, 3]
  21. Feature store. [Diagram: Data sets 1, 2, 3; Feature Store; Models 1, 2, 3]
  22. Feature store gives: (1) Feature versioning. (2) Feature trust (features can be tested). (3) Feature consistency. (4) Feature discovery and reuse. (5) Feature documentation and analytics. (6) Standardized access to features between training and serving, and with it reproducibility of results. (7) Automatic backfilling of features (avoids expensive recomputations). (8) Features can be access controlled. (9) Production model results can be features for other models.
  23. Feature store. [Chart: average cost of a new ML project vs. number of curated features in the feature store] Source: The Feature Store in Hopsworks, Jim Dowling
  24. Feature store architecture. [Diagram] Stages: Source, Create, Ingest, Store, Access. Components: event stream, batch data, stream transform, batch transform, ingest, feature storage, feature metadata, model API, discovery API, model serving, model training.
  25. Feature store architecture. [Diagram] Stages: Source, Create, Ingest, Store, Access. Components: event stream, batch data, stream transform, batch transform, ingest, feature storage, feature metadata, model API, discovery API, model serving, model training.
  26. Feature store - storage: (1) ClickHouse: scalable, column-oriented big data database; easy to use; handles large and sparse feature spaces; ASOF join (joining sequences with a non-exact match). (2) SSDB: persistent, high-performance key-value database; implements the Redis protocol; designed to store collection data; replication (master-slave) and load balancing.
  27. Feature store architecture. [Diagram] Stages: Source, Create, Ingest, Store, Access. Components: event stream, batch data, stream transform, batch transform, ingest, feature storage, feature metadata, model API, discovery API, model serving, model training, SSDB.
  28. Feature store. Thanks to the feature store, we are able to: cut down new model development time; cut down model training time; easily test new ideas. In short: focus on the interesting and creative parts of machine-learning-based systems.
  29. Next steps and future work: unify the streaming part; implement feature analytics and monitoring; improve feature documentation.
  30. Andrzej Michałowski, Head of AI Research and Development, andrzej.michalowski@synerise.com. Thank you. Questions?
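
As a rough illustration of the ideas on slides 19, 22 and 26 (not taken from the talk), the Python sketch below shows a toy in-memory feature store with a registry for discovery, explicit feature versions, and an as-of read that returns the latest value written at or before a given timestamp. That read is similar in spirit to the ASOF join listed for ClickHouse, and using the same read for training and serving is one way to keep the two consistent. The class `InMemoryFeatureStore`, its methods, the feature name `purchases_7d` and the integer timestamps are all hypothetical choices for this example; the slides do not show the actual ClickHouse and SSDB based implementation or its API.

```python
from bisect import bisect_right
from collections import defaultdict


class InMemoryFeatureStore:
    """Toy feature store (hypothetical): versioned features, as-of reads."""

    def __init__(self):
        # (feature name, version) -> human-readable description
        self.registry = {}
        # (feature name, version, entity id) -> parallel lists sorted by time
        self._times = defaultdict(list)
        self._values = defaultdict(list)

    def register(self, name, version, description):
        """Document a feature version so that others can discover it."""
        self.registry[(name, version)] = description

    def discover(self):
        """List every registered feature as (name, version, description)."""
        return sorted((n, v, d) for (n, v), d in self.registry.items())

    def write(self, name, version, entity_id, timestamp, value):
        """Store a value; only registered (documented) features are accepted."""
        if (name, version) not in self.registry:
            raise KeyError(f"unregistered feature {name} v{version}")
        key = (name, version, entity_id)
        # Keep both lists ordered by timestamp, even for out-of-order writes.
        pos = bisect_right(self._times[key], timestamp)
        self._times[key].insert(pos, timestamp)
        self._values[key].insert(pos, value)

    def read_asof(self, name, version, entity_id, timestamp):
        """Return the latest value at or before `timestamp`, else None."""
        key = (name, version, entity_id)
        pos = bisect_right(self._times.get(key, []), timestamp)
        return self._values[key][pos - 1] if pos else None


store = InMemoryFeatureStore()
store.register("purchases_7d", 1, "Number of purchases in the last 7 days")
store.write("purchases_7d", 1, "customer-42", 200, 5)
store.write("purchases_7d", 1, "customer-42", 100, 3)  # late-arriving event

# Training at time 150 must not see the value written at time 200.
training_value = store.read_asof("purchases_7d", 1, "customer-42", 150)
# Serving at time 200 goes through the same code path.
serving_value = store.read_asof("purchases_7d", 1, "customer-42", 200)
print(training_value, serving_value)
```

In a real deployment, the registry would correspond to the feature metadata and discovery API from the architecture slides, and the two lists would be replaced by the storage backends; the sketch only shows the access pattern.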
