Feature Stores: Building Machine Learning Infrastructure on Apache Pulsar (Simba Khadder)

Input features are the building blocks of machine learning models. You cannot have a great model without great features. By building on top of Apache Pulsar's infinite retention of events, we built infrastructure to serve features in production and to generate training datasets. It allowed our machine learning teams to change, test, and deploy personalization features at an extraordinary rate to tens of millions of end users.

This talk will discuss:
- What event-sourcing is and why it's so powerful for machine learning infrastructure.
- How we built the StreamSQL feature store on top of Pulsar, Flink, and Cassandra.
- How a feature store accelerates ML development.
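
As a rough illustration of the event-sourcing idea in the first bullet, the sketch below treats an append-only event log as a ledger: a stateful feature (total read time per user, echoing the `SUM(readtime)` query in the slides) is bootstrapped with a batch pass up to a cut point and then kept current by folding in later events one at a time. It uses a plain Python list in place of Pulsar, and the `ReadEvent` fields, function names, and sample values are all assumptions made for this example, not StreamSQL's implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadEvent(NamedTuple):
    """Hypothetical event shape; the field names are assumptions."""
    user: str
    readtime: float  # seconds spent reading


def bootstrap(ledger, cut):
    """Batch step: aggregate every event before offset `cut`."""
    totals = defaultdict(float)
    for event in ledger[:cut]:
        totals[event.user] += event.readtime
    return totals


def apply_stream(totals, ledger, start):
    """Streaming step: fold in events from offset `start` onward, one at a time."""
    for event in ledger[start:]:
        totals[event.user] += event.readtime
    return totals


# The ledger is append-only: new events only ever go on the end.
ledger = [
    ReadEvent("alice", 30.0),
    ReadEvent("bob", 12.5),
    ReadEvent("alice", 45.0),
    ReadEvent("bob", 7.5),
    ReadEvent("alice", 5.0),
]

cut = 3  # arbitrary cut point: everything before it is a batch problem
totals = bootstrap(ledger, cut)
totals = apply_stream(totals, ledger, cut)

# Replaying the whole ledger as one batch gives the same feature values.
full_replay = bootstrap(ledger, len(ledger))
```

Because both steps read from the same ledger, the hand-off offset is the only thing that has to agree between the batch and streaming halves.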

Published in: Data & Analytics

  1. Feature Stores: Building ML Infrastructure on Apache Pulsar
  2. Simba Khadder, Co-Founder & CEO, StreamSQL.io. Using Apache Pulsar to power our feature store for >100M MAU
  3. Agenda
     ● The ML process
     ● Moving our ML Pipelines w/ Pulsar
     ● Building a Feature Store on top of Pulsar
     ● Q&A
  4. Machine Learning :: Model(Features) = Output. Input Features: Last 5 articles read, Current article, Top Genre, Average Content Length, Diversity of reading tastes, ...; Model; Output: Recommend Next Article
  5. Feature Engineering > Model Research*
  6. Behind every great model is a set of great features. Input Features: Last 5 articles read, Current article, Top Category, Total time spent reading, Diversity of reading tastes, ...; Model; Output: Recommend Next Article
  7. Credit: Microsoft Azure Sales deck
  8. Our ML teams spent >80% of their time building and maintaining ML pipelines for feature generation and feature engineering.
  9. The Feature Engineering Cycle: Hypothesis New Feature; Generate Training Dataset with new Feature; Validate New Feature Increases Performance; Deploy Feature to Production
  10. Diagram labels: Training Data; Online Features; Serving; Train; User ID; Feature Set; Arr([FeatureSet, Actual])
  11. Input Features: Last 5 articles read, Current article, Top Category, Total time spent reading, Diversity of reading tastes, ...; Model; Output: Recommend Next Article
  12. Generating Point-in-Time Correct Training Data (diagram: a timeline of "read" events, each paired with "Features at timestamp"; labels: Events Storage, Training Data)
  13. Generating Features for Serving (in a perfect world) (diagram: a timeline of "read" events, each paired with "Features at timestamp"; labels: Event Stream, Processor, Online Features)
  14. Most Features are Stateful. Input Features: Last 5 articles read, Current article, Top Category, Total time spent reading, Diversity of reading tastes, ...; Model; Output: Recommend Next Article
  15. Stateful Features must be Bootstrapped. Input Features: Total time spent reading; Model; Output
  16. Bootstrapping Stateful Features with Historical Data in S3: SELECT user, SUM(readtime) FROM read_events GROUP BY user;
  17. Timeline diagram (labels: Time; Persisted in S3; Not in S3, but in Kafka retention period; MsgID)
  18. Finish bootstrapping & start stream processing from Kafka: SELECT user, SUM(readtime) FROM read_events GROUP BY user;
  19. Full Feature Deployment Process
  20. Combine Batch & Stream Processing with an Immutable Ledger
     ● Each new event appends to the end of the ledger
     ● Cut at an arbitrary point, and the ledger looks like a batch problem
     ● Only read from the head of the ledger and it looks like a streaming problem
  21. Pulsar-Based Architecture with Infinite Retention
  22. Pulsar’s offloading makes Event-Sourcing achievable
  23. Pulsar’s Tiered Architecture enhances Processing on Infinite Retention
  24. Features are the building blocks of ML models; however, they are developed and maintained in ad-hoc ways. They lack a dedicated system of management.
  25. ML Pipelines < Feature Stores
     ● No concrete feature definitions; feature logic is split across Flink jobs.
     ● No feature versioning and rollback.
     ● No feature sharing, re-use, and discovery.
     ● No integrations into Tensorflow, Jupyter, etc.
  26. A platform for features allows teams to work together. Features are easily defined, shared, and re-used. There exists a single source of truth for features.
  27. Models across an organization may benefit from some of these features. Input Features: Last 5 articles read, Current article, Top Category, Total time spent reading, Diversity of reading tastes, ...; Model; Output: Recommend Next Article
  28. StreamSQL.io accelerates and enhances machine learning development
     ● Facilitate model development: Discover, re-use, and share features across teams and models.
     ● Deploy with confidence: Use a single feature definition for training and serving.
     ● Limit complexity: Unified streaming and batch processing for feature generation.
     ● Increase model performance: Use 3rd party features from text embeddings to weather data.
  29. Fraud Detection Example
  30. 1. Connect and Upload Data
  31. 2. Transform and Join Data
  32. 3. Define and Serve Features
  33. 4. Generate Training Data
  34. Timeline diagram (labels: Time; a series of "label" events, each paired with "Features at timestamp")
  35. The Feature Engineering Cycle: Hypothesis New Feature; Generate Training Dataset with new Feature; Validate New Feature Increases Performance; Deploy Feature to Production
  36. StreamSQL: The Feature Store for Machine Learning (Beta)
  37. Simba Khadder, simba@streamsql.io
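
The "point-in-time correct" training data described in the slides can be sketched as follows: for each labeled moment, the feature value is computed only from events that had already happened by that timestamp, so no future information leaks into training. This is a minimal sketch for a single user and a single feature (total read time); the event tuples, the `point_in_time_rows` helper, and the sample values are assumptions for illustration and are not taken from the talk.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical data for one user, sorted by timestamp.
read_events = [(1, 30.0), (4, 45.0), (9, 5.0)]  # (timestamp, readtime)
labels = [(0, 0), (2, 0), (5, 1), (10, 1)]      # (timestamp, label)


def point_in_time_rows(events, labels):
    """Pair each label with the feature value as of the label's timestamp."""
    timestamps = [ts for ts, _ in events]
    # running[i] is the total read time after the first i + 1 events.
    running = list(accumulate(readtime for _, readtime in events))
    rows = []
    for label_ts, label in labels:
        # Number of events with timestamp <= label_ts.
        seen = bisect_right(timestamps, label_ts)
        feature = running[seen - 1] if seen else 0.0
        rows.append((label_ts, feature, label))
    return rows


training_rows = point_in_time_rows(read_events, labels)
```

Whether an event that shares a label's timestamp should count toward the feature is a design choice; this sketch includes it (`bisect_right`), and `bisect_left` would exclude it.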
