Simplify Feature Engineering in Your Data Warehouse

FeatureByte

Table of contents
What is a Machine Learning Feature?
Feature Engineering
Feature Engineering Challenges
Feature Engineering with FeatureByte
System Architecture
FeatureByte Catalog
Intuitive Data modeling
Track Assets in Catalog
Compute Historical Features
Serving Features
What’s coming next?
Visit us to find out more!

Features make up the input data used to train machine learning models and compute predictions.
Each row in the data is an example, which contains:
Features: model inputs or predictors (X)
Target (for training only): model output (Y)
Features and targets are functions of an observation. An observation consists of:
Time of the observation (point-in-time)
Entities involved in the observation, e.g. Customer, Product, Employee, Transaction
Ex   X1    X2   X3    Y
1    1.1   0    0.9   1
2    0.7   1    0.1   0
3    2.9   0    0.5   0
X -> Y

Example: Use ML to recommend products to existing customers
Train model to predict future purchases based on past observations
Require examples with different targets (purchase / no purchase)

Observations with features and target
Obs  Point-in-time  Entities           Age  Purchases past 2 weeks  Color  Sales past 2 weeks  Time since last purchase  Purchase (target)
1    2020-05-23     Customer, Product  23   3                       Pink   368                 11d                       True
2    2020-05-23     Customer, Product  47   1                       Gray   150                 63d                       False
Customer features: Age, Purchases past 2 weeks
Product features: Color, Sales past 2 weeks
Customer + Product features: Time since last purchase
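
To make "features and targets are functions of an observation" concrete, here is a minimal sketch in plain Python. The purchase log, the entity IDs, and the 7-day target horizon are made up for illustration; only the idea comes from the slides. The feature looks strictly before the point-in-time, while the target looks after it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Observation:
    point_in_time: datetime
    customer: str
    product: str


# Hypothetical purchase log: (customer, product, purchase time).
PURCHASES = [
    ("c1", "p1", datetime(2020, 5, 12)),
    ("c1", "p2", datetime(2020, 5, 20)),
    ("c1", "p1", datetime(2020, 5, 25)),
    ("c2", "p2", datetime(2020, 3, 21)),
]


def purchases_past_2_weeks(obs: Observation) -> int:
    """Customer feature: purchases in the 14 days before the point-in-time."""
    start = obs.point_in_time - timedelta(days=14)
    return sum(
        1
        for customer, _, ts in PURCHASES
        if customer == obs.customer and start <= ts < obs.point_in_time
    )


def purchased_within_7_days(obs: Observation) -> bool:
    """Target: customer buys the product within 7 days after the point-in-time.

    The 7-day horizon is an assumption made for this sketch.
    """
    end = obs.point_in_time + timedelta(days=7)
    return any(
        customer == obs.customer
        and product == obs.product
        and obs.point_in_time <= ts < end
        for customer, product, ts in PURCHASES
    )


obs = Observation(datetime(2020, 5, 23), "c1", "p1")
print(purchases_past_2_weeks(obs), purchased_within_7_days(obs))
```

Keeping the two time ranges on opposite sides of the point-in-time is what prevents the target from leaking into the features.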

Feature Engineering
Process to transform raw data to create features for machine learning

Training pipeline
Populate features and target for a set of observations
Large number of points-in-time (capture seasonality, better generalization)
Large number of observations
Fit ML model
Serving pipeline
Populate features for a set of observations
Usually one point-in-time
Fewer observations, low latency required for some use cases
Make prediction using ML model

Feature formulation and materialization
Data wrangling tools not tailored for feature engineering
Time-awareness handling needs to be implemented
Easy to introduce bugs and target leakage
Can be computationally / memory expensive if not optimized
Transferring large datasets for experimentation
import pandas as pd

# read observations and transactions tables
transactions_df = pd.read_parquet("transactions.parquet")
observation_df = pd.read_parquet("observations.parquet")

# pair each unique observation with all transactions of the same account
df = observation_df.drop_duplicates(
    ["AccountID", "POINT_IN_TIME"]
).merge(
    transactions_df, on="AccountID", how="inner"
)
# keep transactions from the 7 days before the point-in-time;
# each comparison needs its own parentheses because & binds tighter than <
mask = (
    ((df.POINT_IN_TIME - df.Timestamp) < pd.Timedelta("7d")) &
    (df.POINT_IN_TIME > df.Timestamp)
)
# total amount per observation
features_df = df[mask].groupby(
    ["AccountID", "POINT_IN_TIME"]
)["Amount"].sum()

# attach the feature to the observations
observation_df = observation_df.merge(
    features_df,
    on=["AccountID", "POINT_IN_TIME"],
    how="left",
)
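
To see what the snippet above computes, here is a self-contained sketch on a tiny in-memory dataset. The rows are made up for illustration and stand in for the parquet files. It applies the same 7-day window with each comparison parenthesized, and shows that a transaction dated after the point-in-time is excluded, which is the check that guards against target leakage.

```python
import pandas as pd

# Hypothetical toy data standing in for the parquet files.
transactions_df = pd.DataFrame({
    "AccountID": ["a1", "a1", "a1", "a2"],
    "Timestamp": pd.to_datetime(
        ["2020-05-10", "2020-05-20", "2020-05-24", "2020-05-22"]
    ),
    "Amount": [10.0, 20.0, 40.0, 5.0],
})
observation_df = pd.DataFrame({
    "AccountID": ["a1", "a2", "a3"],
    "POINT_IN_TIME": pd.to_datetime(["2020-05-23"] * 3),
})

df = observation_df.merge(transactions_df, on="AccountID", how="inner")

# Keep transactions strictly before the point-in-time and less than 7 days old.
age = df["POINT_IN_TIME"] - df["Timestamp"]
mask = (age > pd.Timedelta(0)) & (age < pd.Timedelta("7d"))

total_spend_7d = (
    df[mask]
    .groupby(["AccountID", "POINT_IN_TIME"])["Amount"]
    .sum()
    .rename("total_spend_7d")
    .reset_index()
)

# Left join so that observations without matching transactions are kept.
result = observation_df.merge(
    total_spend_7d, on=["AccountID", "POINT_IN_TIME"], how="left"
)
print(result)
```

For account a1 only the 2020-05-20 transaction falls inside the window: the 2020-05-10 one is too old and the 2020-05-24 one lies in the future. Account a3 has no transactions, so the left join leaves its feature missing rather than zero, a detail that also needs a deliberate decision.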

Training vs serving consistency
Features computed at serving time may not be consistent with training due to imperfect data availability
Unrealistic training accuracy
Impact can be severe if model depends heavily on very recent data not available during serving
Serving pipeline may require separate implementation
Longer time-to-production
Inconsistency with training
# Continues from the previous snippet (pd and df are defined there)
# Is this realistic in serving?
mask = (
    ((df.POINT_IN_TIME - df.Timestamp) < pd.Timedelta("7d")) &
    (df.POINT_IN_TIME > df.Timestamp)
)

# Only consider records at least 30 min old?
shifted_PIT = df.POINT_IN_TIME - pd.Timedelta("30min")
mask = (
    ((shifted_PIT - df.Timestamp) < pd.Timedelta("7d")) &
    (shifted_PIT > df.Timestamp)
)
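
The effect of the 30-minute shift can be checked on a toy example. The sketch below is an illustration only: the function name `spend_7d`, the `delay` parameter, and the two transactions are hypothetical. It computes the same 7-day sum with and without the shift, so the gap between a training-time value and a realistic serving-time value becomes visible.

```python
import pandas as pd


def spend_7d(
    transactions: pd.DataFrame,
    point_in_time: pd.Timestamp,
    delay: pd.Timedelta = pd.Timedelta(0),
) -> float:
    """Sum Amount over the 7 days ending at point_in_time minus delay."""
    window_end = point_in_time - delay
    age = window_end - transactions["Timestamp"]
    mask = (age > pd.Timedelta(0)) & (age < pd.Timedelta("7d"))
    return float(transactions.loc[mask, "Amount"].sum())


# Hypothetical transactions; the second one arrives 10 minutes before the request.
transactions = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2020-05-20 09:00", "2020-05-23 11:50"]),
    "Amount": [20.0, 40.0],
})
pit = pd.Timestamp("2020-05-23 12:00")

print(spend_7d(transactions, pit))                          # 60.0
print(spend_7d(transactions, pit, pd.Timedelta("30min")))   # 20.0
```

If records take up to 30 minutes to land in the warehouse, the first value would never be observed at serving time, so a model trained on it sees data it cannot have in production.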

Isolation, inconsistency and redundancy
Sources and features can be used for different use cases
Consistent data semantics, data cleaning in different projects
Sharing and reuse of features
Sharing code -> code duplication and redundant computation
Propagation of bug fixes
Feature stores are an emerging trend
Single producer, multiple consumers
Addresses many problems above, introduces new challenges
Sharing limited to feature retrieval, some solutions manage materialization

Open Source Feature Platform
Centralized platform for feature engineering
Manage, share, reuse and track assets (tables, features, feature lists, etc.)
Experiment with feature engineering quickly
Create, save and retrieve features
Create feature lists using existing + new features
Compute historical features for training
Deploy feature lists for serving

Feature-centric Design
Features are self-contained assets
Data sources, cleaning operations, definition and refresh cadence
Contains everything needed to support training + serving pipelines
Immutable
Track dependencies, usage, status and changes
Get historical features and deploy for online serving easily
Track request outputs with provenance
most_freq_weekday_28d = catalog.get_feature(
    "Most Frequent weekday Over the Last 28d"
)
# Get feature info and explicit code
most_freq_weekday_28d.info()
most_freq_weekday_28d.definition

Materialization in the Data Warehouse
Access source databases in the warehouse
Store feature cache and output tables in the warehouse
Manage storage + compute using SQL
Reduce security risks by avoiding bulk data export / duplication / exposure
Supports Spark, Databricks, Snowflake
Storage and computation optimization
Cache partial aggregates for more efficient computation
Store cache and online values instead of all historical feature values
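
The idea behind caching partial aggregates can be sketched in a few lines of plain Python. This is a simplified illustration, not FeatureByte's implementation: the hourly bucket size, the function names, and the sample events are assumptions. Events are pre-aggregated once into fixed-size time buckets, and any aligned window is then answered by combining the cached buckets instead of rescanning raw rows.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TILE = timedelta(hours=1)      # assumed bucket size for this sketch
EPOCH = datetime(1970, 1, 1)


def tile_index(ts: datetime) -> int:
    """Index of the fixed-size time bucket that contains ts."""
    return (ts - EPOCH) // TILE


def build_tiles(events):
    """Pre-aggregate (timestamp, amount) events into per-bucket partial sums."""
    tiles = defaultdict(float)
    for ts, amount in events:
        tiles[tile_index(ts)] += amount
    return dict(tiles)


def windowed_sum(tiles, window_end: datetime, window: timedelta) -> float:
    """Combine cached buckets covering [window_end - window, window_end).

    Assumes window_end and window are aligned to bucket boundaries.
    """
    first = tile_index(window_end - window)
    last = tile_index(window_end)
    return sum(tiles.get(i, 0.0) for i in range(first, last))


# Hypothetical events.
events = [
    (datetime(2020, 5, 23, 9, 15), 10.0),
    (datetime(2020, 5, 23, 9, 45), 5.0),
    (datetime(2020, 5, 23, 10, 30), 20.0),
    (datetime(2020, 5, 23, 12, 5), 40.0),
]
tiles = build_tiles(events)

print(windowed_sum(tiles, datetime(2020, 5, 23, 12), timedelta(hours=3)))  # 35.0
print(windowed_sum(tiles, datetime(2020, 5, 23, 13), timedelta(hours=3)))  # 60.0
```

The two windows overlap in the 10:00 and 11:00 buckets, which are read from the cache both times. The same buckets can serve many points-in-time and several window lengths.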

Python SDK for feature creation
Built-in time-awareness for lookups and joins
Windowed aggregations based on request point-in-time
Automatically emulate serving time behavior in historical features to minimize training / serving inconsistency
Scalable compute in warehouse with optimized SQL
# get table views
credit_card = catalog.get_view("CREDITCARD")
card_transactions = catalog.get_view("CARDTRANSACTIONS")

# join tables
card_transactions = card_transactions.join(credit_card)

# define spending features
cust_spend_features = card_transactions.groupby(
    "BankCustomerID"
).aggregate_over(
    value_column="Amount",
    method=fb.AggFunc.SUM,
    windows=["7d"],
    feature_names=["total_spend_7d"]
)

# preview features
cust_spend_features.preview(observation_set=observation_set)

System Architecture
Component               Packaging         Purpose
Python SDK              Python Package    Connects to the API service to provide feature authoring and management functionality through Python classes and functions.
API Service             Docker Container  REST API service that validates and executes requests, queries data warehouses, and stores data.
Worker                  Docker Container  Executes asynchronous or scheduled tasks.
MongoDB                 Docker Container  Stores metadata for created assets.
Redis                   Docker Container  Broker and queue for workers; messenger service for publishing progress updates.
Query Graph Transpiler  Python Package    Constructs data transformation steps as a query graph, which can be transpiled to platform-specific SQL.
Source Tables           Data Warehouse    Tables used as data sources for feature engineering.
Feature Store           Data Warehouse    Database that stores data used to support feature serving.
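
The query graph idea can be illustrated with a toy example. The sketch below is not FeatureByte's transpiler: the node classes, the `to_sql` function, and the two "dialects" (which differ only in identifier quoting) are all hypothetical. It only shows the principle that transformation steps are recorded as a graph first and rendered to SQL for a specific platform later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    table: str


@dataclass(frozen=True)
class Filter:
    parent: object
    condition: str


@dataclass(frozen=True)
class GroupSum:
    parent: object
    key: str
    value: str


# Hypothetical dialect table: only the identifier quote character differs.
QUOTES = {"snowflake": '"', "spark": "`"}


def to_sql(node, dialect: str) -> str:
    """Recursively render a graph node as a SQL string for the given dialect."""
    q = QUOTES[dialect]
    if isinstance(node, Source):
        return f"SELECT * FROM {q}{node.table}{q}"
    if isinstance(node, Filter):
        inner = to_sql(node.parent, dialect)
        return f"SELECT * FROM ({inner}) AS t WHERE {node.condition}"
    if isinstance(node, GroupSum):
        inner = to_sql(node.parent, dialect)
        return (
            f"SELECT {q}{node.key}{q}, SUM({q}{node.value}{q}) AS total "
            f"FROM ({inner}) AS t GROUP BY {q}{node.key}{q}"
        )
    raise TypeError(f"unsupported node: {node!r}")


# Build the graph once, then render it for each platform.
graph = GroupSum(
    Filter(Source("CARDTRANSACTIONS"), "Amount > 0"), "AccountID", "Amount"
)
print(to_sql(graph, "spark"))
print(to_sql(graph, "snowflake"))
```

Because the graph is plain data, it can also be stored with the feature definition and inspected or optimized before any SQL is generated.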

FeatureByte Catalog
A catalog stores tables, entities, features and other ML assets that can be reused, tracked and shared.

Information about the data model is captured during table registration and entity tagging
# register SCD table from the warehouse
credit_card = data_source.get_source_table(
    "DATASETS", "CREDITCARD", "CREDITCARD"
).create_scd_table(
    "CREDITCARD",
    natural_key_column="AccountID",
    effective_timestamp_column="ValidFrom",
    end_timestamp_column="ValidTo",
)

# register event table from the warehouse
card_transactions = data_source.get_source_table(
    "DATASETS", "CREDITCARD", "CARDTRANSACTIONS"
).create_event_table(
    "CARDTRANSACTIONS",
    event_id_column="CardTransactionId",
    event_timestamp_column="Timestamp",
)

# tag entities in table columns
credit_card.BankCustomerID.as_entity("Customer")
credit_card.AccountID.as_entity("Account")
card_transactions.AccountID.as_entity("Account")

Entity Relationships
Child-parent for feature serving
Table Relationships:
Primary and foreign keys
Table types
Event, Item, Slowly Changing, Dimension
Determine time-awareness in joins, enforce guardrails
Column Semantics
Timestamp, primary key, timezone offset
Supports timezone handling, smart joins
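
Why the table type matters for joins can be shown with plain pandas. The sketch below is an illustration with made-up rows and a hypothetical `CardType` column. It joins an event table to a slowly changing table so that each event picks up the record that was valid at the event timestamp, using `pandas.merge_asof`. It is meant to show the semantics of a time-aware join, not how FeatureByte implements it.

```python
import pandas as pd

# Hypothetical slowly changing table: one row per account per validity period.
credit_card = pd.DataFrame({
    "AccountID": ["a1", "a1"],
    "ValidFrom": pd.to_datetime(["2020-01-01", "2020-05-01"]),
    "CardType": ["Silver", "Gold"],
})

# Hypothetical event table.
card_transactions = pd.DataFrame({
    "AccountID": ["a1", "a1"],
    "Timestamp": pd.to_datetime(["2020-03-15", "2020-05-20"]),
    "Amount": [10.0, 20.0],
})

# merge_asof requires both frames to be sorted by their time keys.
joined = pd.merge_asof(
    card_transactions.sort_values("Timestamp"),
    credit_card.sort_values("ValidFrom"),
    left_on="Timestamp",
    right_on="ValidFrom",
    by="AccountID",
    direction="backward",  # latest record effective at or before the event
)
print(joined[["Timestamp", "Amount", "CardType"]])
```

A plain join on `AccountID` would instead pair every transaction with both card records, or with the current one only, which silently uses information that was not valid at the time of the event.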

Track Assets in Catalog
Saved tables
Saved features

Compute Historical Features
Features can be accessed from the warehouse or downloaded as a Parquet file
my_favorite_features.compute_historical_features(
    observation_set=observation_df
)

Serving Features
Deploy a feature list for serving
Retrieve features using a REST API request
Track feature job status
# Create and enable new deployment
deployment = my_favorite_features.deploy()
deployment.enable()
curl -X POST \
  -H 'Content-Type: application/json' \
  -H 'active-catalog-id: 64708919ea4c4876a77d2b80' \
  -d '{"entity_serving_names": [{"GROCERYINVOICEGUID": "d1b5d3ae-f37b-4864-a56d-d70d81641577"}]}' \
  http://featurebyte_service/api/v1/deployment/6478b57bb68c91fb84f1e156/online_features
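
The same request can be prepared from Python with the standard library. The sketch below mirrors the curl command above; the helper name `build_online_features_request` is hypothetical, and the URL, IDs, and payload are copied from the slide. It builds the request without sending it, since the host is a placeholder.

```python
import json
from urllib.request import Request


def build_online_features_request(base_url, deployment_id, catalog_id, entity_rows):
    """Build (but do not send) the POST request shown in the curl command."""
    url = f"{base_url}/api/v1/deployment/{deployment_id}/online_features"
    body = json.dumps({"entity_serving_names": entity_rows}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "active-catalog-id": catalog_id,
        },
        method="POST",
    )


request = build_online_features_request(
    "http://featurebyte_service",
    "6478b57bb68c91fb84f1e156",
    "64708919ea4c4876a77d2b80",
    [{"GROCERYINVOICEGUID": "d1b5d3ae-f37b-4864-a56d-d70d81641577"}],
)
print(request.get_method(), request.full_url)
```

Against a running service, the request could be sent with `urllib.request.urlopen(request)`. The shape of the response is not shown in the slides, so it is left out here.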

What’s coming next?
User Defined Functions
Register functions in the data warehouse with the SDK for expanded functionality
Access functions outside of the warehouse (e.g. pre-trained models) using external functions
Target Creation
Define, save and reuse targets
Materialize along with features
Low Latency Serving
Deployments maintain jobs for feature refresh but serving is not scalable or low latency
Low-latency serving will use key-value stores and in-memory processing for on-demand computation
Automated Feature Discovery
Recommend features based on semantics and relationships

Visit us to find out more!
https://github.com/featurebyte/featurebyte
Documentation · Website
