A Practical Enterprise Feature Store on Delta Lake

The feature store is a data architecture concept used to accelerate data science experimentation and harden production ML deployments. Nate Buesgens and Bryan Christian describe a practical approach to building a feature store on Delta Lake at a large financial organization. This implementation has reduced feature engineering “wrangling” time by 75% and has increased the rate of production model delivery by 15x. The approach described focuses on practicality. It is informed by innovative approaches such as Feast, but our primary goal is evolutionary extensions of existing patterns that can be applied to any Delta Lake architecture.



Key Takeaways:

– Understand the key use cases that motivate the feature store from both a data science and engineering perspective.

– Consider edge cases where there may be opportunities for simplification such as “online” predictions.

– Review a typical logical data model for a feature store and how that can be applied to your business domain.

– Consider options for physical storage of the feature store in the Delta Lake.

– Understand common access patterns including metadata-based feature discovery.

  1. 1. A Practical Feature Store on Delta Lake
      • Nathan Buesgens, ML Operations
      • Bryan Christian, Data Science
  2. 2. Agenda
      § What is a Feature Store?
        ▪ MLOps for Acceleration and Governance in the Enterprise
        ▪ Feature Store: Use Cases
        ▪ Edge Cases: 80/20
        ▪ Relation to the Data Warehouse
      § Design Reference
        ▪ Logical Data Model & Access Patterns
        ▪ Physical Representation in the Delta Lake
  3. 3. What is a Feature Store?
  4. 4. MLOps: Data Science at Scale
      • 75% reduction in feature engineering “data wrangling” time
      • 15x accelerated model delivery with MLOps automation and governance
      • End-to-end value delivery
      • Time to value & concurrency
      • Scalable infrastructure
      • I.e., avoid the “proof of concept factory”
  5. 5. Example: Feature Store. Infrastructure to support DS + MLE
      The feature store serves as the consumption layer for ML applications. It provides:
      • Acceleration: pre-“hardened” features reduce data wrangling time for the Data Scientist.
      • Governance: a common consumption pattern ensures nothing is lost in the translation to production.
      [Diagram labels: Curated Data; Feature Engineering (marked as the bottleneck); Modelling; Feature Store; Predictions]
  6. 6. Data Science Use Cases
      The Feature Store is built on the following data science requirements, which are relevant to predictive analytics use cases in Financial Services:
      • Correct and consistently applied joins across multiple Delta files without loss of processing speed
      • Aggregations, window functions, and transformations of data
      • Granularity of point in time and level of the prediction (e.g. individual, account, etc.)

      | customer_id | as_of | feature_name_last_0-30_days_prior | feature_name_last_31-60_days_prior | feature_name_next_1-30_days |
      |---|---|---|---|---|
      | 12345 | 2021-05-01 | 0.43 | 0.32 | 0.21 |
      | 23456 | 2021-05-01 | 0.99 | 0.94 | 0.98 |
      | 34567 | 2021-05-01 | 0.03 | 0.92 | 0.13 |
      | 45678 | 2021-05-01 | 0.42 | 0.59 | 0.50 |

      The Feature Store uses the “as_of” date as the point-in-time granularity for both backward- and forward-facing windows. Code-embedded metadata makes it easy to exclude forward-facing windows from the “independent” variables, which prevents feature leakage.
  7. 7. Edge Cases: Opportunities to Simplify for an 80/20 Feature Store MVP
      “Online” Features: Ultra-Low-Latency, Ultra-Timely Point Reads
      § Many ML use cases don’t have an online requirement, esp. “Human + AI”.
      § Extending the MVP:
        ▪ Some online use cases can be reframed as streaming use cases.
        ▪ Online use cases can be met with an extension to the Delta Lake design.
        ▪ See: feast.dev
      Low-Code ETL: Configuration Based, AutoML, FeatureFlow, etc.
      § Low-code & citizen science expands the user base; it doesn’t necessarily accelerate existing users.
      § 80/20 value from: Optimizing Access vs. Optimizing ETL Development
  8. 8. Comparison with Data Warehouse, i.e. Dimensional Model
      • Similarities
        ▪ “Golden” aggregates of curated data.
        ▪ Highly structured, well-defined granularities (esp. as 80/20 solution).
        ▪ Similar non-functional requirements for strong governance standards, metadata management, discovery, etc.
      • Differences
        ▪ Different Use Case: BI vs. Modelling
        ▪ Different Access Patterns, therefore:
          ▪ Different Data Model
          ▪ Different Technology Stack
        ▪ Supervised learning creates complex requirements for “point in time accurate data”.
  9. 9. Design
  10. 10. Point in Time Accurate Data: Three Ways Inconsistency Sneaks In
      • Window functions
      • Watermark
      • Feature leakage
  11. 11. Point in Time Accurate Data: Three Ways Inconsistency Sneaks In
      • Window functions
      • Watermark
      • Feature leakage
      See: Structured Streaming Programming Guide
  12. 12. Feature Store Logical Model: Data Model for Feature Store Access
      Every feature for an entity “as of” a date.
      Granularity
      • The “Entity”
        § The thing being modelled. Term borrowed from Feast.
      • “As of”
        § Discrete granularity (daily, hourly, etc.), not an “event time”.
        § 80/20 solution.
        § For “continuous” granularity see: Feast.
      Columns
      • Features: Un-vectorized (80/20).
      • Targets: Necessarily at the same granularity as features.
      • Predictions: One model’s prediction is often another’s feature.
  13. 13. Core Functionality: SDK for Feature Store
      • No need to rebuild the whole feature store when new features are added. (Certain sets of features might be rebuilt at times, but with much shorter downtime.)
      • The SDK indexes the available features and, upon request, builds the joins that combine all desired features into one cohesive dataframe, providing a production-grade feature selection tool.
      • Keyword searching is enabled for features, so you can find any feature you're looking for using "human" logic.
      • Tuning can be specific to each set of features, allowing more optimal feature creation.
      Methods
      • find(): search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
      • select(): for when you know exactly the features you want.
      • select_by(): select columns and return a dataframe by giving a date, keys, keywords or regex.
  14. 14. SDK for Feature Store: find()
      Search through all columns and metadata for the features you want to use by giving keys, keywords or regex.
      Arguments
      • regexp: A regular expression.
      • kwrds: A list of keywords to look for.
      • keys: A dictionary of str to any, pointing to tags in the metadata of features, i.e. {"model_output": True}.
      • kwrds_exclude: A list of words to exclude from the search.
      • partial: If kwrds is used, this decides whether the search should find all or any of them.
      • partial_exclude: If kwrds_exclude is used, this decides whether the search will exclude all or any of them.
      • verbose: If True, prints out results; otherwise just returns them.
      • case_sensitive: If True, an exact match is required to return results.
      Example
      Calling the feature store with “fs”, a command could be:

          fs.find(regexp="^(?=.*asdf)(?=.*qwerty).+")

      With a returned result of:

          Your search returned 20 results…
          feature_name_1: {'comment': 'Flag if asdf > 0.3 at any point within the last 3 months.'}
          feature_name_qwerty_1: {'comment': 'Average number of widgets customer purchased in the last 0-1 months.'}
          ...

      The find method searches through all features given a set of criteria and returns any matches within the name or metadata of columns. It is a great tool for exploring the data without pulling in massive datasets.
      Value to Data Scientist
      Explore what features are in the feature store via metadata, and leverage metadata to enforce governance (e.g., no PI, 3rd party data, etc. as needed).
  15. 15. SDK for Feature Store: select()
      For when you know exactly the features you want.
      Arguments
      • date: Return features for a specific date, or use "latest" to return the last updated feature date. For specific dates, pass a dictionary with an operator and a date, i.e. {">": "2021-05-01"}.
      • *features: Feature names as strings.
      Example
      Calling the feature store with “fs”, a command could be:

          dataframe_name = fs.select(
              "latest",  # Give a date {"=": "2021-05-01"} or "latest" for the newest available features
              "feature_name_last_0-30_days_prior",
              "feature_name_last_31-60_days_prior",
              "feature_name_next_1-30_days",  # List the features you want
          )
          display(dataframe_name)

      With a returned result of:

      | customer_id | as_of | feature_name_last_0-30_days_prior | feature_name_last_31-60_days_prior | feature_name_next_1-30_days |
      |---|---|---|---|---|
      | 12345 | 2021-05-01 | 0.43 | 0.32 | 0.21 |
      | 23456 | 2021-05-01 | 0.99 | 0.94 | 0.98 |

      The select method returns a dataframe of all selected features for the given date.
      Value to Data Scientist
      A consistent way of selecting the same feature set from the feature store when creating a dataframe, both in dev and when deployed in production.
  16. 16. SDK for Feature Store: select_by()
      Select columns and return a dataframe by giving a date, keys, keywords or regex.
      Arguments
      • date: Return features for a specific date, or use "latest" to return the last updated feature date. For specific dates, pass a dictionary with an operator and a date, i.e. {">": "2021-05-01"}.
      • regexp: A regular expression.
      • kwrds: A list of keywords to look for.
      • keys: A dictionary of str to any, pointing to tags in the metadata of features, i.e. {"model_output": True}.
      • kwrds_exclude: A list of words to exclude from the search.
      • partial: If kwrds is used, this decides whether the search should find all or any of them.
      • partial_exclude: If kwrds_exclude is used, this decides whether the search will exclude all or any of them.
      • case_sensitive: If True, an exact match is required to return results.
      Example
      Calling the feature store with “fs”, a command could be:

          dataframe_name = fs.select_by({"=": "2021-05-01"}, regexp="^(?=.*asdf)(?=.*qwerty).+")
          display(dataframe_name)

      With a returned result of:

      | customer_id | as_of | feature_name_1 | feature_name_qwerty_1 | … |
      |---|---|---|---|---|
      | 12345 | 2021-05-01 | 0.43 | 0.32 | … |
      | 23456 | 2021-05-01 | 0.99 | 0.94 | … |

      The select_by method searches through all features given a set of criteria and returns a dataframe including all the features that match the criteria within the name or metadata.
      Value to Data Scientist
      A consistent way of exploring the feature store and leveraging metadata for selection while simultaneously creating a dataframe with the selected features.
  17. 17. Implementation on the Data Lake
      [Architecture diagram labels]
      • The Delta Lake: Bronze, Silver, Gold (connected by ETL steps)
        ▪ BI Consumption: Dimensional Model
        ▪ ML Consumption: Feature Store
      • Optional: Consumption Optimized Databases (mirrors)
        ▪ Low Latency Memory Cache
        ▪ High Concurrency Data Warehouse
  18. 18. Implementation on the Data Lake
      [Architecture diagram labels]
      • The Delta Lake: Bronze, Silver (connected by ETL steps)
        ▪ ML Consumption: Feature Store (Historic Feature Queries)
      • Optional: Consumption Optimized Databases (mirror)
        ▪ Low Latency Memory Cache (Online Point Reads)
      SDK (Data Access Layer)
      • Consistent view of “online” and “historic” features.
      • Separation of logical and physical models.
      • Metadata focused query interface for data science exploration.
  19. 19. Physical Feature Tables: Two Choices
      • “As Of” Granularity: pre-defined time aggregations
        § Simplifies “point in time joins”.
        § Not as flexible or timely.
        § Alternative: “Dynamic Point in Time Joins”, demonstrated by Feast. More flexible, improved timeliness.
      • Multiple feature tables
        § Technically possible to use a single wide table.
        § Simplifies:
          ▪ Schema Migration
          ▪ Query Planning & Optimization
          ▪ Scheduling
  20. 20. Summary
      1. Feature stores accelerate data science & enable better governance.
      2. Most design complexity stems from machine learning requirements for point in time accurate data.
      3. 80/20 solutions are possible by carefully considering “online” requirements.
  21. 21. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
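
Slide 6 says that code-embedded metadata lets forward-facing windows be removed from the “independent” variables to prevent feature leakage. The deck does not show how that metadata is stored, so the following is only a minimal sketch under assumptions: the `FeatureMeta` record, its `forward_looking` flag, and the `split_features_and_targets` helper are hypothetical names, and only the column names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMeta:
    """Hypothetical metadata record attached to one feature column."""
    name: str
    forward_looking: bool  # True when the window extends past the as_of date


# Column names are from the slide 6 table; the flags are assumed.
CATALOG = [
    FeatureMeta("feature_name_last_0-30_days_prior", forward_looking=False),
    FeatureMeta("feature_name_last_31-60_days_prior", forward_looking=False),
    FeatureMeta("feature_name_next_1-30_days", forward_looking=True),
]


def split_features_and_targets(catalog):
    """Return (independent, candidate_targets) lists of column names."""
    independent = [m.name for m in catalog if not m.forward_looking]
    targets = [m.name for m in catalog if m.forward_looking]
    return independent, targets


independent, targets = split_features_and_targets(CATALOG)
print("independent variables:", independent)
print("candidate targets:", targets)
```

Keeping the flag next to the feature definition means the exclusion is applied the same way in development and in production, which is the governance point the slide makes.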

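The metadata search described for `find()` on slide 14 can be illustrated with a small in-memory version. This is not the authors' SDK: the registry layout and the `churn_score` entry are made up, `partial=True` is assumed to mean "any keyword matches", `case_sensitive` is assumed to control letter case, and the `kwrds_exclude`, `partial_exclude` and `verbose` arguments are left out for brevity.

```python
import re

# Hypothetical in-memory registry: feature name -> metadata dict.
REGISTRY = {
    "feature_name_1": {
        "comment": "Flag if asdf > 0.3 at any point within the last 3 months.",
        "model_output": False,
    },
    "feature_name_qwerty_1": {
        "comment": "Average number of widgets customer purchased in the last 0-1 months.",
        "model_output": False,
    },
    "churn_score": {  # invented entry, used to show tag-based lookup
        "comment": "Prediction from an upstream model.",
        "model_output": True,
    },
}


def find(registry, regexp=None, kwrds=None, keys=None, partial=False,
         case_sensitive=False):
    """Return {name: metadata} for features matching every supplied criterion."""
    flags = 0 if case_sensitive else re.IGNORECASE
    results = {}
    for name, meta in registry.items():
        # The searched text is the feature name plus its free-text comment.
        text = f"{name} {meta.get('comment', '')}"
        if regexp is not None and not re.search(regexp, text, flags):
            continue
        if kwrds:
            haystack = text if case_sensitive else text.lower()
            hits = [(k if case_sensitive else k.lower()) in haystack
                    for k in kwrds]
            # Assumed semantics: partial=True accepts any keyword,
            # partial=False requires all of them.
            if not (any(hits) if partial else all(hits)):
                continue
        # Every requested metadata tag must be present with the given value.
        if keys and any(meta.get(k) != v for k, v in keys.items()):
            continue
        results[name] = meta
    return results


print(sorted(find(REGISTRY, kwrds=["widgets"])))
print(sorted(find(REGISTRY, keys={"model_output": True})))
```

Because the search only touches names and metadata, it returns quickly and never loads feature values, which matches the slide's point about exploring without pulling in massive datasets.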
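
Slides 13, 15 and 19 together describe `select()` as taking a date filter (either "latest" or an operator dictionary such as {">": "2021-05-01"}) and building the joins across multiple physical feature tables keyed by entity and "as of" date. The sketch below shows one way that could work in plain Python. The table layout, the `TABLES` list, the set of supported operators and the 2021-04-01 row are assumptions for illustration; the 2021-05-01 values are taken from the slide 15 example.

```python
import operator

# Assumed set of operators accepted in the date argument.
OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}

# Two hypothetical physical feature tables keyed by (customer_id, as_of).
TABLE_A = {
    (12345, "2021-04-01"): {"feature_name_last_0-30_days_prior": 0.40},  # made-up row
    (12345, "2021-05-01"): {"feature_name_last_0-30_days_prior": 0.43},
    (23456, "2021-05-01"): {"feature_name_last_0-30_days_prior": 0.99},
}
TABLE_B = {
    (12345, "2021-05-01"): {"feature_name_next_1-30_days": 0.21},
    (23456, "2021-05-01"): {"feature_name_next_1-30_days": 0.98},
}
TABLES = [TABLE_A, TABLE_B]


def select(date, *features):
    """Join the requested features on (customer_id, as_of) for matching dates."""
    if date == "latest":
        # ISO-formatted date strings sort chronologically.
        all_dates = {as_of for table in TABLES for (_, as_of) in table}
        symbol, value = "=", max(all_dates)
    else:
        symbol, value = next(iter(date.items()))
    matches = OPS[symbol]

    rows = {}
    for table in TABLES:
        for (customer_id, as_of), values in table.items():
            if not matches(as_of, value):
                continue
            row = rows.setdefault(
                (customer_id, as_of),
                {"customer_id": customer_id, "as_of": as_of},
            )
            row.update({k: v for k, v in values.items() if k in features})
    # Inner join: keep only keys that carry every requested feature.
    return [rows[key] for key in sorted(rows)
            if all(f in rows[key] for f in features)]


latest = select("latest",
                "feature_name_last_0-30_days_prior",
                "feature_name_next_1-30_days")
for row in latest:
    print(row)
```

Because every table shares the same discrete (entity, as_of) key, the join is a plain equality join; this is the simplification slide 19 attributes to the “As Of” granularity compared with dynamic point-in-time joins.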
