Real-time Feature Engineering with Apache Spark Streaming and Hof

Feature Stores for machine learning (ML) are a new class of data platform for the organization, governance, and sharing of features within enterprises. A typical feature store is a dual database architecture, where pre-computed features for training are stored in a scalable SQL platform (Delta Lake, Apache Hudi, Apache Hive), while features served to online applications are stored in a low-latency database or key-value store (MySQL Cluster (NDB), Cassandra, or Redis). Feature Stores, however, do not provide a solution for real-time features (such as user-entered data or machine-generated data) that cannot be pre-computed or cached. If the feature engineering code that transforms the raw data into features is embedded in applications, it may need to be duplicated outside the application in pipelines for generating training data.

  1. Real-time Feature Engineering with Apache Spark Streaming and Hof
     Fabio Buso, Software Engineer @ Logical Clocks AB
  2. Feature stores
     ▪ Repository for curated ML features ready to be used
     ▪ Ensure consistency between features used for training and features used for serving
     ▪ Centralized place to collect:
       ▪ Metadata
       ▪ Statistics
       ▪ Labels/tags
     ▪ Spark Summit 2020 talk: https://databricks.com/session_na20/building-a-feature-store-around-dataframes-and-apache-spark
  3. Real-Time Feature Engineering
     ▪ Data arrives with the clients' inference requests
     ▪ Features cannot be pre-computed and cached in the online feature store
     ▪ Data needs to be featurized before being sent to the model for prediction:
       ▪ One-hot encoding
       ▪ Normalization and scaling of numerical features
       ▪ Window aggregates
     ▪ Real-time features need to be augmented using the feature store:
       ▪ Not all features are provided by the client
       ▪ Construct the feature vector using features retrieved from the online feature store
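The transformations listed on slide 3 can be illustrated with a small, self-contained sketch. This is not code from the talk: the column names (payment_method, transaction_amount, customer_id, event_time) and the min/max bounds are hypothetical, and in practice the scaling statistics would come from the training data rather than being hard-coded.

    import pandas as pd

    def featurize(df: pd.DataFrame) -> pd.DataFrame:
        # One-hot encode a categorical column
        df = pd.get_dummies(df, columns=["payment_method"])
        # Scale a numerical column to [0, 1]; in a real system the min/max
        # would be the statistics computed on the training data
        amount_min, amount_max = 0.0, 10_000.0
        df["transaction_amount_scaled"] = (
            (df["transaction_amount"] - amount_min) / (amount_max - amount_min)
        )
        # Window aggregate: sum of the last 5 transactions per customer
        df["amount_last_5_sum"] = (
            df.sort_values("event_time")
              .groupby("customer_id")["transaction_amount"]
              .transform(lambda s: s.rolling(5, min_periods=1).sum())
        )
        return df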
  4. Real-Time Requirements
     ▪ Hide complexity from clients
     ▪ Strict response-time SLA:
       ▪ Use cases are usually user-facing
     ▪ Avoid feature engineering in the client:
       ▪ Feature engineering would need to be implemented for each client using the model
       ▪ Hard to maintain consistency between training and inference
  5. Approach 1: Preprocessing with tf.Transform
     ▪ Write feature engineering in preprocessing_fn
     ▪ Transformation is specific to a model
     ▪ Hard to reuse / keep track of transformations at scale
     ▪ No support for window aggregations
     ▪ Doesn't scale with the number of features/requests

     def preprocessing_fn(inputs):
         x = inputs['x']
         y = inputs['y']
         s = inputs['s']
         x_centered = x - tft.mean(x)
         y_normalized = tft.scale_to_0_1(y)
         s_integerized = tft.compute_and_apply_vocabulary(s)
         x_centered_times_y_normalized = x_centered * y_normalized
         return {
             'x_centered': x_centered,
             'y_normalized': y_normalized,
             'x_centered_times_y_normalized': x_centered_times_y_normalized,
             's_integerized': s_integerized
         }
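For context, a preprocessing_fn like the one on slide 5 is normally executed by tf.Transform on top of Apache Beam. The following is a hedged sketch of that wiring based on the standard tf.Transform usage pattern, not on anything shown in the talk; the toy raw_data records simply mirror the x/y/s inputs used above, and preprocessing_fn is the function from the slide.

    import tempfile
    import tensorflow as tf
    import tensorflow_transform as tft          # needed by preprocessing_fn above
    import tensorflow_transform.beam as tft_beam
    from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

    raw_data = [
        {'x': 1.0, 'y': 1.0, 's': 'hello'},
        {'x': 2.0, 'y': 2.0, 's': 'world'},
    ]
    raw_data_metadata = dataset_metadata.DatasetMetadata(
        schema_utils.schema_from_feature_spec({
            'x': tf.io.FixedLenFeature([], tf.float32),
            'y': tf.io.FixedLenFeature([], tf.float32),
            's': tf.io.FixedLenFeature([], tf.string),
        }))

    # Analyze the data (compute means, vocabularies, ...) and apply the transform
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        (transformed_data, transformed_metadata), transform_fn = (
            (raw_data, raw_data_metadata)
            | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))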
  6. Approach 2: KFServing Transformer
     ▪ Deployed as a separate service
     ▪ Duplicated feature engineering code
     ▪ No support for window aggregations
     ▪ No support for feature enrichment from the online feature store
     ▪ Not easily extended to save featurized data

     class ImageTransformer(kfserving.KFModel):
         def __init__(self, name, predictor_host):
             super().__init__(name)
             self.predictor_host = predictor_host

         def preprocess(self, inputs):
             return {'instances': [image_transform(instance) for instance in inputs['instances']]}

         def postprocess(self, inputs):
             return inputs
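To make the "deployed as a separate service" point concrete: a transformer like the one above is started as its own KFServing model server that sits in front of the predictor. A minimal sketch, assuming the kfserving Python SDK of that era; the model name and predictor host are placeholders, and image_transform must be supplied by the user.

    import kfserving

    if __name__ == "__main__":
        # Hypothetical model name and predictor host
        transformer = ImageTransformer("my-model",
                                       predictor_host="my-model-predictor-default")
        # Serve the transformer as a standalone process alongside the predictor
        kfserving.KFServer().start([transformer])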
  7. Approach 3*: Hof
     ▪ Independent from the model
     ▪ Pandas UDFs and Spark 3 to scale feature engineering
     ▪ First-class support for online feature store integration
     ▪ Pluggable to save requests and inference vectors
     * Third time's a charm / Third time lucky / Great things come in threes
  8. Hof
     ▪ gRPC/HTTP endpoint to submit feature engineering requests
     ▪ Mostly stateless
     ▪ Forwards requests to a message queue (Kafka)
     ▪ Messages are consumed/processed by Spark Streaming application(s)
     ▪ Results are sent back on another queue
     ▪ The response is forwarded back to the user
  9. Hof architecture: Message queue setup
     ▪ One input topic
     ▪ N output topics:
       ▪ N is the number of Hof instances running
     ▪ Message:
       ▪ Key: the topic to send the response back on
       ▪ Value: the data to be feature engineered
     ▪ Topic lifecycle is managed automatically:
       ▪ Hof instances talk to the Hopsworks REST APIs to create/destroy topics
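A minimal sketch of the queue wiring described on slides 8 and 9, written with Spark Structured Streaming. This is not Hof's actual implementation; the broker address, topic name, and featurize function are hypothetical. The reply topic is taken from the Kafka message key and exposed as a "topic" column, which the Kafka sink uses to route each record when no fixed output topic is configured.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("hof-sketch").getOrCreate()

    def featurize(payload: str) -> str:
        # Placeholder for the real feature engineering logic
        return payload

    featurize_udf = udf(featurize, StringType())

    requests = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
                .option("subscribe", "hof-requests")               # hypothetical input topic
                .load())

    replies = requests.select(
        col("key").cast("string").alias("topic"),                  # reply topic from the key
        featurize_udf(col("value").cast("string")).alias("value"))

    (replies.writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("checkpointLocation", "/tmp/hof-checkpoint")
            .start()
            .awaitTermination())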
  10. Hof architecture: Spark application setup
      ▪ Hof does not enforce a schema on the request:
        ▪ Avoids additional deserialization
      ▪ If requests are self-contained, multiple Spark applications can run in parallel:
        ▪ Increases availability and throughput
  12. Hof architecture: Add-ons
      ▪ Additional Spark applications can be plugged in
      ▪ Save incoming data on HopsFS/S3:
        ▪ Make it available for future feature engineering
      ▪ Save feature engineering output:
        ▪ Auditing
        ▪ Model training
      ▪ Detect skews in incoming data:
        ▪ Trigger alerts and model re-training
  13. Client request

      {
        "streaming": {
          "transformation": "fraud",
          "data": {
            "customer_id": 1,
            "transaction_amount": 145
          }
        }
      }
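A hedged sketch of how a client could submit the request above to Hof's HTTP endpoint; the URL and path are hypothetical, as the slides do not specify them.

    import requests

    payload = {
        "streaming": {
            "transformation": "fraud",
            "data": {"customer_id": 1, "transaction_amount": 145},
        }
    }
    # Endpoint URL is a placeholder; the tight timeout reflects the response-time SLA
    response = requests.post("https://hof.example.com/featurize", json=payload, timeout=2)
    feature_vector = response.json()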
  14. Application code
      (An example with pandas_udf is sketched below.)

      # Feature group definition
      import hsfs

      def stream_function(df):
          # aggregations
          return df

      fg = fs.create_streaming_feature_group("example_streaming", version=1)
      fg.save(stream_function)

      # Processing
      import hsfs

      fs = connection.get_feature_store()
      fg = fs.get_streaming_feature_group("example_streaming", version=1)
      fg.apply()
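The pandas_udf example referenced on slide 14 is not spelled out in the transcript. The following is a hedged sketch of what a stream_function built on Spark 3's grouped-map pandas UDF support (applyInPandas) could look like; the customer_id and transaction_amount columns are hypothetical.

    import pandas as pd
    from pyspark.sql.types import StructType, StructField, LongType, DoubleType

    result_schema = StructType([
        StructField("customer_id", LongType()),
        StructField("avg_amount", DoubleType()),
    ])

    def aggregate(pdf: pd.DataFrame) -> pd.DataFrame:
        # Runs as a pandas UDF on each customer's group of rows
        return pd.DataFrame({
            "customer_id": [pdf["customer_id"].iloc[0]],
            "avg_amount": [pdf["transaction_amount"].mean()],
        })

    def stream_function(df):
        # Per-customer aggregate computed with a grouped-map pandas UDF
        return df.groupBy("customer_id").applyInPandas(aggregate, schema=result_schema)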
  15. Hof architecture: Streaming + online feature store
      ▪ Not all of the inference vector has to be computed in real time
      ▪ Features can be fetched from the online feature store
      ▪ Features are referenced using the training dataset
  16. Client request

      {
        "streaming": {
          "transformation": "fraud",
          "data": { ... }
        },
        "online": {
          "training_dataset": {
            "name": "fraud model",
            "version": 1
          },
          "filter": {"customer_id": 3}
        }
      }
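The "online" part of the request above maps to a lookup in the online feature store, with features referenced through the training dataset. A hedged sketch of that lookup with the hsfs client library; method names may differ between hsfs versions, and the key value here simply mirrors the filter in the request.

    import hsfs

    connection = hsfs.connection()
    fs = connection.get_feature_store()

    # Features are referenced through the training dataset used to train the model
    td = fs.get_training_dataset("fraud model", version=1)

    # Fetch the pre-computed part of the feature vector for customer_id = 3
    online_features = td.get_serving_vector({"customer_id": 3})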
  17. DEMO
  18. github.com/logicalclocks | hopsworks.ai | @logicalclocks
  19. Feedback
      Your feedback is important to us. Don't forget to rate and review the sessions.
