This document discusses feature stores and their role in modern machine learning infrastructure. It begins with an introduction and agenda. It then covers challenges with modern data platforms and emerging architectural shifts toward data meshes and feature stores. The remainder discusses what a feature store is, reference architectures, and recommendations for adopting feature stores, including leveraging existing AWS services for storage, catalog, query, and more.
Managed Feature Store for Machine Learning (Logical Clocks)
All hyperscale AI companies build their machine learning platforms around a Feature Store.
A feature is a measurable property of some data sample. It could be, for example, an image pixel, a word from a piece of text, the age of a person, a coordinate emitted from a sensor, or an aggregate value like the average number of purchases within the last hour. A Feature Store is a central place to store curated features within an organization.
Feature Stores fuel AI systems: we use them to train machine learning models so that we can make predictions on feature values we have never seen before.
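The "average number of purchases within the last hour" aggregate mentioned above can be sketched in plain Python. The event layout and the dict standing in for a feature store are illustrative assumptions, not any particular product's API:

```python
from datetime import datetime, timedelta

# Raw purchase events: (user_id, timestamp) pairs.
events = [
    ("alice", datetime(2021, 6, 1, 11, 15)),
    ("alice", datetime(2021, 6, 1, 11, 45)),
    ("bob",   datetime(2021, 6, 1, 9, 0)),   # outside the one-hour window
]

def purchases_last_hour(user_id, now, events):
    """Count a user's purchases in the hour before `now` -- one engineered feature."""
    window_start = now - timedelta(hours=1)
    return sum(1 for uid, ts in events if uid == user_id and window_start <= ts <= now)

# A dict standing in for the feature store: feature name -> {entity key -> value}.
now = datetime(2021, 6, 1, 12, 0)
feature_store = {"purchases_last_hour": {
    uid: purchases_last_hour(uid, now, events) for uid in ("alice", "bob")
}}
print(feature_store["purchases_last_hour"]["alice"])  # 2
```

A real feature store would compute such aggregates on a schedule or a stream and serve them by entity key at low latency, but the shape of the data is the same.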
During this presentation you will learn:
- The concept of a Feature Store and how it can help manage feature data for enterprises and ease the path of data from backend systems and data lakes to data scientists.
- Our take on Feature Stores, including best practices and use cases.
- How to ensure consistent features in both training and serving.
- How to handle governance, access control, and versioning.
- How to create training data in the file format of your choice.
- How to eliminate inconsistency between features in training and inference.
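A common way to eliminate the training/serving inconsistency mentioned above is to route both paths through a single feature function. The function and record layout here are a hypothetical sketch of the pattern, not the Logical Clocks API:

```python
def featurize(raw):
    """Single source of truth for feature engineering, shared by both paths."""
    return {
        "age": raw["age"],
        "is_adult": 1 if raw["age"] >= 18 else 0,
    }

def build_training_row(raw, label):
    # Offline / training path.
    return {**featurize(raw), "label": label}

def serve(raw, model):
    # Online / inference path.
    return model(featurize(raw))

# Because both paths call featurize(), the encodings can never drift apart.
model = lambda feats: "grown-up" if feats["is_adult"] else "minor"
row = build_training_row({"age": 30}, label=1)
print(row)                        # {'age': 30, 'is_adult': 1, 'label': 1}
print(serve({"age": 15}, model))  # minor
```

If training code and serving code each reimplement the `is_adult` rule independently, a change to one silently skews the other; sharing the function removes that failure mode.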
Watch the webinar with a demo: https://www.logicalclocks.com/webinars
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges in implementing Data Mesh systems and focus on the role of open-source projects. Projects like Apache Spark can play a key part in implementing a standardized infrastructure platform for Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to make Data Mesh more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted at architects, decision-makers, data engineers, and system designers.
Modern machine learning systems can be very complex and fall into many pitfalls. It is very easy to unintentionally introduce technical debt into such a complex structure. One approach that addresses some of these anti-patterns is a feature store. A feature store is the missing piece that fills the gap between raw data and machine learning models. Not only will it help you handle technical debt, but, even more importantly, it speeds up the time to develop new models.
Delta Lake delivers reliability, security and performance to data lakes. Join this session to learn how customers have achieved 48x faster data processing, leading to 50% faster time to insight after implementing Delta Lake. You’ll also learn how Delta Lake provides the perfect foundation for a cost-effective, highly scalable lakehouse architecture.
Presentation on Data Mesh: this paradigm shift is a new type of ecosystem architecture, a shift toward a modern distributed architecture that treats domain-specific data as "data-as-a-product," enabling each domain to handle its own data pipelines.
The catalyst for the success of automobiles came not through the invention of the car but rather through the establishment of an innovative assembly line. History shows us that the ability to mass produce and distribute a product is the key to driving adoption of any innovation, and machine learning is no different. MLOps is the assembly line of Machine Learning and in this presentation we will discuss the core capabilities your organization should be focused on to implement a successful MLOps system.
Building End-to-End Delta Pipelines on GCP (Databricks)
Delta has been powering many production pipelines at scale in the Data and AI space since it was introduced a few years ago.
Built on open standards, Delta provides data reliability and enhances storage and query performance to support big data use cases (both batch and streaming), fast interactive queries for BI, and machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their Data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples, and notebooks, we will build a Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
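The Bronze-Silver-Gold pattern can be illustrated with plain Python stand-ins for the three layers. A real pipeline would use Spark DataFrames and Delta tables; the record layout below is an assumption for illustration only:

```python
# Bronze: raw ingested records, kept as-is (including bad rows).
bronze = [
    {"order_id": "1", "amount": "10.5", "country": "US"},
    {"order_id": "2", "amount": "oops", "country": "US"},   # malformed
    {"order_id": "3", "amount": "4.0",  "country": "DE"},
]

def to_silver(rows):
    """Silver: cleaned and typed records; malformed rows are filtered out."""
    out = []
    for r in rows:
        try:
            out.append({"order_id": r["order_id"],
                        "amount": float(r["amount"]),
                        "country": r["country"]})
        except ValueError:
            pass  # a real pipeline would quarantine this row, not drop it silently
    return out

def to_gold(rows):
    """Gold: a business-level aggregate, e.g. revenue per country."""
    revenue = {}
    for r in rows:
        revenue[r["country"]] = revenue.get(r["country"], 0.0) + r["amount"]
    return revenue

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'US': 10.5, 'DE': 4.0}
```

The point of the layering is that each table is reproducible from the one before it: raw data is never lost in Bronze, cleaning logic is concentrated in the Bronze-to-Silver step, and consumers only ever read the curated Gold aggregates.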
Wonder what this data mesh stuff is all about? What are the principles of data mesh? Can you or should you consider data mesh as the approach for your analytics platform? And most important - how can Snowflake help?
Given in Montreal on 14-Dec-2021
Unified MLOps: Feature Stores & Model Deployment (Databricks)
If you’ve brought two or more ML models into production, you know the struggle that comes from managing multiple data sets, feature engineering pipelines, and models. This talk will propose a whole new approach to MLOps that allows you to successfully scale your models, without increasing latency, by merging a database, a feature store, and machine learning.
Splice Machine is a hybrid (HTAP) database built upon HBase and Spark. The database powers a one-of-a-kind single-engine feature store, as well as the deployment of ML models as tables inside the database. A simple JDBC connection means Splice Machine can be used with any model ops environment, such as Databricks.
The HBase side allows us to serve features to deployed ML models, and generate ML predictions, in milliseconds. Our unique Spark engine allows us to generate complex training sets, as well as ML predictions on petabytes of data.
In this talk, Monte will discuss how his experience running the AI lab at NASA, and as CEO of Red Pepper, Blue Martini Software and Rocket Fuel, led him to create Splice Machine. Jack will give a quick demonstration of how it all works.
MLflow is an MLOps tool that enables data scientists to quickly productionize their machine learning projects. To achieve this, MLflow has four major components: Tracking, Projects, Models, and Registry. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps. MLflow is designed to work with any machine learning library and requires minimal changes to integrate into an existing codebase. In this session, we will cover the common pain points of machine learning developers, such as tracking experiments, reproducibility, deployment tooling, and model versioning. Get your hands dirty by doing a quick ML project with MLflow and releasing it to production to understand the MLOps lifecycle.
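The Tracking component mentioned above records parameters and metrics per run so experiments can be compared later. This tiny plain-Python stand-in (not the MLflow API) shows the kind of bookkeeping Tracking automates:

```python
runs = []  # toy stand-in for an experiment-tracking backend

def log_run(params, metrics):
    """Record one experiment run's parameters and resulting metrics."""
    runs.append({"params": params, "metrics": metrics})

# Two hypothetical training runs with different hyperparameters.
log_run({"lr": 0.1,  "epochs": 5}, {"accuracy": 0.81})
log_run({"lr": 0.01, "epochs": 5}, {"accuracy": 0.87})

# With every run logged, picking the best configuration is a query, not guesswork.
best = max(runs, key=lambda r: r["metrics"]["accuracy"])
print(best["params"])  # {'lr': 0.01, 'epochs': 5}
```

In MLflow itself the same idea is a server-backed API (runs, logged params/metrics, and a model registry on top), which is what makes the results shareable and reproducible across a team.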
Given at the MLOps Summit 2020 - I cover the origins of MLOps in 2018, how MLOps has evolved from 2018 to 2020, and what I expect for the future of MLOps.
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Data Lakehouse, Data Mesh, and Data Fabric (r1) (James Serra)
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Using MLOps to Bring ML to Production / The Promise of MLOps (Weaveworks)
In this final Weave Online User Group of 2019, David Aronchick asks: have you ever struggled with having different environments to build, train, and serve ML models, and how to orchestrate between them? While DevOps and GitOps have gained huge traction in recent years, many customers struggle to apply these practices to ML workloads. This talk will focus on the ways MLOps has helped to effectively infuse AI into production-grade applications through establishing practices around model reproducibility, validation, versioning/tracking, and safe/compliant deployment. We will also talk about the direction of MLOps as an industry, and how we can use it to move faster, with more stability, than ever before.
The recording of this session is on our YouTube Channel here: https://youtu.be/twsxcwgB0ZQ
Speaker: David Aronchick, Head of Open Source ML Strategy, Microsoft
Bio: David leads Open Source Machine Learning Strategy at Azure. This means he spends most of his time helping humans to convince machines to be smarter. He is only moderately successful at this. Previously, David led product management for Kubernetes at Google, launched GKE, and co-founded the Kubeflow project. David has also worked at Microsoft, Amazon and Chef and co-founded three startups.
Sign up for a free Machine Learning Ops Workshop: http://bit.ly/MLOps_Workshop_List
Weaveworks will cover concepts such as GitOps (operations by pull request), Progressive Delivery (canary, A/B, blue-green), and how to apply those approaches to your machine learning operations to mitigate risk.
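A canary rollout, one of the progressive-delivery strategies mentioned above, can be reduced to a deterministic traffic split. This hash-based router is a generic sketch of the idea, not Weaveworks tooling:

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Deterministically send ~canary_percent of users to the canary release."""
    # Hash the user id into a stable bucket in [0, 100).
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

# The same user always lands in the same bucket, so sessions are sticky:
# ramping canary_percent from 5 to 50 to 100 only ever moves users one way.
assignments = {uid: route(uid, 10) for uid in ("user-1", "user-2", "user-3")}
print(assignments)
```

For an ML deployment, "canary" would be the new model version; if its live metrics regress, rolling back is just setting the percentage back to zero.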
What’s New with Databricks Machine Learning (Databricks)
In this session, the Databricks product team provides a deeper dive into the machine learning announcements. Join us for a detailed demo that gives you insights into the latest innovations that simplify the ML lifecycle — from preparing data, discovering features, and training and managing models in production.
Building a Feature Store around Dataframes and Apache Spark (Databricks)
A Feature Store enables machine learning (ML) features to be registered, discovered, and used as part of ML pipelines, thus making it easier to transform and validate the training data that is fed into machine learning systems. Feature stores can also enable consistent engineering of features between training and inference, but to do so, they need a common data processing platform.
Yelp has operated a connector ecosystem to feed vital data to domain-specific teams and data stores. We share some of our learnings and experiences from operating such a system, and touch on the next phase of its evolution.
MLOps and Data Quality: Deploying Reliable ML Models in Production (Provectus)
Looking to build a robust machine learning infrastructure to streamline MLOps? Learn from Provectus experts how to ensure the success of your MLOps initiative by implementing Data QA components in your ML infrastructure.
For most organizations, the development of multiple machine learning models, their deployment and maintenance in production are relatively new tasks. Join Provectus as we explain how to build an end-to-end infrastructure for machine learning, with a focus on data quality and metadata management, to standardize and streamline machine learning life cycle management (MLOps).
Agenda
- Data Quality and why it matters
- Challenges and solutions of Data Testing
- Challenges and solutions of Model Testing
- MLOps pipelines and why they matter
- How to expand validation pipelines for Data Quality
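A minimal flavor of the Data QA components in the agenda above, assuming a simple record layout: each check returns the offending rows, so a validation pipeline can fail fast or quarantine them before training ever starts.

```python
def check_not_null(rows, column):
    """Return rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]

def check_range(rows, column, lo, hi):
    """Return rows where `column` is present but falls outside [lo, hi]."""
    return [r for r in rows if r.get(column) is not None and not (lo <= r[column] <= hi)]

rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},   # fails the null check
    {"id": 3, "age": 213},    # fails the range check
]
violations = check_not_null(rows, "age") + check_range(rows, "age", 0, 120)
print(len(violations))  # 2
```

Production data-quality frameworks add scheduling, profiling, and alerting on top, but the core contract is the same: every batch either passes its checks or is blocked from feeding the models downstream.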
Building Lakehouses on Delta Lake with SQL Analytics Primer (Databricks)
You’ve heard the marketing buzz, maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together? Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
Data Quality Patterns in the Cloud with Azure Data Factory (Mark Kromer)
This is my slide presentation from Pragmatic Works' Azure Data Week 2019: Data Quality Patterns in the Cloud with Azure Data Factory using Mapping Data Flows
How to use Azure Machine Learning service to manage the lifecycle of your models. Azure Machine Learning uses a Machine Learning Operations (MLOps) approach, which improves the quality and consistency of your machine learning solutions.
Modernizing to a Cloud Data Architecture (Databricks)
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how elastic compute models’ benefits help one customer scale their analytics and AI workloads and best practices from their experience on a successful migration of their data and workloads to the cloud.
Introduction to DataOps and AIOps (or MLOps) (Adrien Blind)
This presentation introduces the audience to the DataOps and AIOps practices. It deals with organizational and tech aspects, and provides hints to start your data journey.
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud (DataWorks Summit)
The world’s largest enterprises run their infrastructure on Oracle, DB2, and SQL, and their critical business operations on SAP applications. Organisations need this data to be available in real time to conduct necessary analytics. However, delivering this heterogeneous data at the speed it’s required can be a huge challenge because of the complex underlying data models and structures and legacy manual processes which are prone to errors and delays.
Unlock these silos of data and enable the new advanced analytics platforms by attending this session.
Find out how to:
• Overcome common challenges faced by enterprises trying to access their SAP data
• Integrate SAP data in real time with change data capture (CDC) technology
• Stream SAP data into Kafka with Attunity Replicate for SAP, as other organisations are doing
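Change data capture, in essence, turns successive table states into a stream of insert/update/delete events. This snapshot-diff sketch illustrates the idea only; tools like Attunity Replicate read the database's transaction log rather than diffing snapshots:

```python
def capture_changes(old, new):
    """Diff two {key: row} snapshots into CDC-style (op, key, row) events."""
    events = []
    for key, row in new.items():
        if key not in old:
            events.append(("insert", key, row))
        elif old[key] != row:
            events.append(("update", key, row))
    for key in old:
        if key not in new:
            events.append(("delete", key, old[key]))
    return events

# Two successive snapshots of a hypothetical SAP order table.
old = {1: {"name": "order A"}, 2: {"name": "order B"}}
new = {1: {"name": "order A*"}, 3: {"name": "order C"}}
for event in capture_changes(old, new):
    print(event)  # each event could then be published to a Kafka topic
```

Log-based CDC is preferred in practice because it sees every intermediate change and puts no query load on the source system, but the output shape, an ordered stream of row-level events, is the same.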
Speakers:
John Hol, Regional Director, Attunity
Mike Hollobon, Director Business Development, IBT
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture (DATAVERSITY)
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Build the data lake, but avoid building the data swamp! The tool ecosystem is building up around the data lake, and soon many organizations will have both a robust lake and a data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Fueling AI & Machine Learning: Legacy Data as a Competitive Advantage (Precisely)
The data fueling your AI or machine learning initiatives plays a critical role. Different data sources provide different outcomes. The most important thing a business can do to prepare for success with AI and machine learning is to understand and provide access to all of the data that you can possibly get to. In addition to newer data sources, like IoT and Social Media, what will set your results apart – and give your business a competitive advantage – is powering AI and machine learning with your historical and proprietary data: the data sitting in your mainframe, legacy, and other traditional systems.
View this on-demand webcast with Wikibon Analyst James Kobielus as we discuss:
• Using your historical customer data to train predictive AI/ML models for effective target marketing
• Leveraging social, mobile, and IoT data to give your marketing an extra level of personalization
• Making the most of your legacy and proprietary data while protecting customer privacy and ensuring regulatory compliance
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the mission will be executed and company leadership will emerge. The data professional sits squarely on the performance of the company in this information economy and has an obligation to demonstrate the possibilities and to originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and Data Architecture. William will kick off the fourth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
InfoSphere BigInsights - Analytics power for Hadoop - field experienceWilfried Hoge
How to analyze binary data as a technical business user. Use InfoSphere BigInsights to bring analytics on Hadoop closer to a user.
Presented at the OOP conference in Munich, 27.01.2015
Simplifying Real-Time Architectures for IoT with Apache KuduCloudera, Inc.
3 Things to Learn About:
*Building scalable real time architectures for managing data from IoT
*Processing data in real time with components such as Kudu & Spark
*Customer case studies highlighting real-time IoT use cases
Apache Hadoop Summit 2016: The Future of Apache Hadoop an Enterprise Architec...PwC
Hadoop Summit is an industry-leading Hadoop community event for business leaders and technology experts (such as architects, data scientists and Hadoop developers) to learn about the technologies and business drivers transforming data. PwC is helping organizations unlock their data possibilities to make data-driven decisions.
Building Modern Data Platform with Microsoft AzureDmitry Anoshin
This presentation will cover Cloud history and Microsoft Azure Data Analytics capabilities. Moreover, it has a real-world example of DW modernization. Finally, we will check the alternative solution on Azure using Snowflake and Matillion ETL.
ICP for Data- Enterprise platform for AI, ML and Data ScienceKaran Sachdeva
IBM Cloud Private for Data, the ultimate platform for all AI, ML, and Data Science workloads. An integrated analytics platform based on containers and microservices, it works with Kubernetes and Docker, even with Red Hat OpenShift, and delivers a variety of business use cases across industries: financial services, telco, retail, manufacturing, etc.
Any data source becomes an SQL query with all the power of Apache Spark. Querona is a virtual database that seamlessly connects any data source with Power BI, TARGIT, Qlik, Tableau, Microsoft Excel, and others. It lets you build your own universal data model and share it among reporting tools.
Querona does not create another copy of your data, unless you want to accelerate your reports using its built-in execution engine, created for the purpose of Big Data analytics. Just write a standard SQL query and let Querona consolidate data on the fly, use one of its execution engines, and accelerate processing no matter what kind of sources you have, or how many.
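The federated-query idea sketched above can be shown with a toy example: join rows from two different "sources" on demand instead of first copying them into one store. Plain pandas stands in for the virtual-database engine here; nothing below is Querona's actual API.

```python
# Toy illustration of on-the-fly data consolidation across two sources.
# pandas is only a stand-in for a federated query engine.
import pandas as pd

# Source 1: e.g. rows pulled from a CRM database.
crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Alice", "Bob", "Carol"]})

# Source 2: e.g. an orders feed from a separate system.
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [120.0, 80.0, 42.5]})

# Conceptually: SELECT name, SUM(amount)
#               FROM crm JOIN orders USING (customer_id) GROUP BY name
report = (crm.merge(orders, on="customer_id")
             .groupby("name", as_index=False)["amount"].sum())
print(report)
```

A virtual database does the same kind of join and aggregation at query time, pushing work down to each source's own engine where it can.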
Options for Data Prep - A Survey of the Current MarketDremio Corporation
Data comes in many shapes and sizes, and every company struggles to find ways to transform, validate, and enrich data for multiple purposes. The problem has been around as long as data itself, and the market offers an overwhelming number of options. In this presentation we look at the problem and the key options from vendors in the market today. Dremio is a new approach that eliminates the need for stand-alone data prep tools.
Looking to make your document processing operations more effective and cost-efficient with AI/ML? Learn from the experts of Provectus and Amazon Web Services (AWS) how to choose the right solution for your company! We will look into the management and engineering perspectives of AI document processing, from industry use cases and the solution map to our unique methodology for assessing available document processing solutions to Provectus IDP. Whether you are looking for a ready-made solution or you plan to build a custom solution of your own, this webinar will help you find the best option for your business.
Agenda
- Introductions
- Industry use cases
- Intelligent Document Processing (IDP) overview
- IDP Solutions map
- AWS IDP Solution
- Provectus IDP Platform
- Q&A
Intended Audience
Technology executives and decision makers, including such roles as CIO, CCO, COO, and CDO; digital transformation managers; data and ML engineers.
Presenters
Almir Davletov, IDP Subject Matter Expert, Provectus
Yaroslav Tarasyuk, Business Development, Provectus
Sonali Sahu, Sr. Solutions Architect, AWS
Interested? Learn more about Provectus Intelligent Document Processing Solution: https://provectus.com/document-processing-solution/
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Provectus
Healthcare organizations generate piles of documents and forms in different formats, making it difficult to achieve operational excellence and streamline business processes. Manual entry and OCR are no longer viable, and healthcare entities are looking for new solutions to handle documents.
In this presentation you can learn about:
- Healthcare document types and use cases
- IDP framework: building blocks for document processing solutions
- The document processing market landscape
- Methodology for solution evaluation: comparing apples to apples
Whether you are looking for a ready-made solution or plan to build a custom solution of your own, this webinar will help you find the best fit for your healthcare use cases.
Choosing the Right Document Processing Solution for Healthcare OrganizationsProvectus
Looking to automate document processing in your healthcare organization? Learn from Provectus & AWS experts how to make data capture, conversion, and analytics more efficient. Process and manage documents faster and on a larger scale with AI & Machine Learning.
In this presentation, we offer management and engineering perspectives on document processing with AI, to help you explore available options. Whether you are looking for a ready-made solution or plan to build a custom solution of your own, this webinar will help you find the best fit for your healthcare use cases.
AI Stack on AWS: Amazon SageMaker and BeyondProvectus
Looking to learn more about AWS AI stack? Join experts from Provectus & AWS to find out how to use Amazon SageMaker (with combination with other tools and services) to enable enterprise-wide AI.
Companies are looking to scale and become more productive when it comes to AI and data initiatives. They seek to launch AI projects more rapidly, which, among many other factors, requires a robust machine learning infrastructure. In this webinar, you will learn how to create a canonical SageMaker workflow, expand the SageMaker workflow to a holistic implementation, enhance and expand the implementation using best practices for feature store, data versioning, ML pipeline orchestration, and model monitoring.
Agenda
- Introductions
- Amazon SageMaker Overview
- Real-World Use Case
- Data Lake for Machine Learning
- Amazon SageMaker Experiments
- Orchestration Beyond SageMaker Experiments
- Amazon SageMaker Debugger
- Amazon SageMaker Model Monitor
- Webinar Takeaways
Intended audience
Technology executives & decision makers, manager-level tech roles, data engineers & data scientists, ML practitioners & ML engineers, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Pritpal Sahota, Technical Account Manager, Provectus
- Christopher A. Burns, Sr. AI/ML Solution Architect, AWS
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/ai-stack-on-aws-sagemaker-and-beyond-mar-2020/
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerProvectus
Looking to implement MLOps using AWS services and Kubeflow? Come and learn about machine learning from the experts of Provectus and Amazon Web Services (AWS)!
Businesses recognize that machine learning projects are important, but such projects go beyond just building and deploying models, which is most of what organizations do today. Successful ML projects entail a complete lifecycle involving ML, DevOps, and data engineering, and are built on top of ML infrastructure.
AWS and Amazon SageMaker provide a foundation for building infrastructure for machine learning while Kubeflow is a great open source project, which is not given enough credit in the AWS community. In this webinar, we show how to design and build an end-to-end ML infrastructure on AWS.
Agenda
- Introductions
- Case Study: GoCheck Kids
- Overview of AWS Infrastructure for Machine Learning
- Provectus ML Infrastructure on AWS
- Experimentation
- MLOps
- Feature Store
Intended Audience
Technology executives & decision makers, manager-level tech roles, data engineers & data scientists, ML practitioners & ML engineers, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Qingwei Li, ML Specialist Solutions Architect, AWS
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/webinar-mlops-and-reproducible-ml-on-aws-with-kubeflow-and-sagemaker-aug-2020/
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRProvectus
Considering new ways and options for reducing operational costs and scaling flexibility of your Apache Hadoop/Spark? Try migrating to Amazon EMR!
On-premises Apache Hadoop/Spark clusters are among the top sources of financial pressure for businesses. IT organizations want to reduce spend while still meeting demand, to keep their legacy data applications up and running. Come and learn from experts at Provectus & AWS how you can use Amazon EMR to start driving cost efficiencies in your organization!
Agenda
- Hadoop market and cost optimizations using Amazon EMR
- Cost related and other challenges of on-prem Hadoop clusters
- Cost optimizations by using Amazon EMR and migration best practices
Intended audience
Technology executives & decision makers, manager-level tech roles, data engineers & data scientists, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Pritpal Sahota, Technical Account Manager, Provectus
- Nirav Shah, Senior Solutions Architect, AWS
- Perry Peterson, Business Development Manager, AWS
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/cost-optimization-for-apache-hadoop-spark-workloads-with-amazon-emr-june-2020/
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...Provectus
What's a machine learning workflow? What open source tools can you use to automate ML workflow?
Reproducible ML pipelines in research and production with monitoring insights from live inference clusters could enable and accelerate the delivery of AI solutions for enterprises. There is a growing ecosystem of tools that augment researchers and machine learning engineers in their day to day operations.
Still, there are big gaps in the machine learning workflow when it comes to training dataset versioning, training performance and metadata tracking, integration testing, inferencing quality monitoring, bias detection, concept drift detection and other aspects that prevent the adoption of AI in organizations of all sizes.
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...Provectus
AWS Dev Day Kyiv 2019
Track: Analytics & Machine Learning
Session: "Building a Modern Data platform in the Cloud"
Speaker: Alex Casalboni, AWS Technical Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/HIDnAG9AxZo
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...Provectus
AWS Dev Day Kyiv 2019
Track: Modern Application Development
Session: "How to build a global serverless service"
Speaker: Alex Casalboni, AWS Technical Evangelist
Level: 400
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/Q19B-NTkMfk
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "Automating AWS Infrastructure with PowerShell"
Speaker: Martin Beeby, AWS Principal Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/rgIjjK2J4dQ
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...Provectus
AWS Dev Day Kyiv 2019
Track: Analytics & Machine Learning
Session: "Analyzing your web and application logs"
Speaker: Javier Ramirez, AWS Technical Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/IpEhEs1sXeg
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "Resiliency and Availability Design Patterns for the Cloud"
Speaker: Sebastien Stormacq, AWS Technical Evangelist
Level: 400
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/O8gonQCJawU
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "Architecting SaaS solutions on AWS"
Speaker: Oleksandr Mykhalchuk, Director of DevOps & Cloud Services at SoftServe
Level: 300
Video: https://youtu.be/3lKoe-ts8Qs
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019Provectus
AWS Dev Day Kyiv 2019
Track: Modern Application Development
Session: "Developing with .NET Core on AWS"
Speaker: Martin Beeby, AWS Principal Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/OzM8L7H1LmA
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "How to build real-time backends"
Speaker: Martin Beeby, AWS Principal Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/bsZYA6V3bDA
"Integrate your front end apps with serverless backend in the cloud", Sebasti...Provectus
AWS Dev Day Kyiv 2019
Track: Modern Application Development
Session: "Integrate your front end apps with serverless backend in the cloud"
Speaker: Sebastien Stormacq, AWS Technical Evangelist
Level: 200
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://www.youtube.com/watch?v=6z43H11qoU8&t=1s
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019Provectus
AWS Dev Day Kyiv 2019
Track: Analytics & Machine Learning
Session: "Scaling ML from 0 to millions of users"
Speaker: Julien Simon, Global AI & Machine Learning Evangelist at AWS
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://www.youtube.com/watch?v=N73u1mx9DqY
How to implement authorization in your backend with AWS IAMProvectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "How to implement authorization in your backend with AWS IAM"
Speaker: Stas Ivaschenko, AWS solutions architect at Provectus
Level: 400
Video: https://www.youtube.com/watch?v=4Jje_WJ4V7Q
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and on application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at every stage.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front end. I have also often seen developers implement front-end features by just following the standard rules of a framework, thinking that this is enough to launch the project successfully, and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud, and open source: exploring how these areas are likely to mature and develop over the short and long term, and considering how organisations can position themselves to adapt and thrive.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, machine learning over just any symbolic structure is not sufficient to really harvest the gains of NeSy. Those gains will only be realized when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Feature Store as a Data Foundation for Machine Learning
1. Feature Store as a Data Foundation for ML
Presented by:
Stepan Pushkarev, CTO @ Provectus
Gandhi Raketla, Senior Solutions Architect @ AWS
2. Agenda
1. Introductions
2. Modern Data Lakes and Modern ML Infrastructure
3. Emerging Architectural Shifts
4. Feature Store: 200 LOD overview and reference architecture on AWS
5. AWS Perspective on Feature Store
4. AI-First Consultancy & Solutions Provider
Clients ranging from fast-growing startups to large enterprises. 450 employees and growing. Established in 2010; HQ in Palo Alto, with offices across the US, Canada, and Europe.
We are obsessed with leveraging cloud, data, and AI to reimagine the way businesses operate, compete, and deliver customer value.
5. Our Clients
● Innovative Tech Vendors: seeking niche expertise to differentiate and win the market
● Midsize to Large Enterprises: seeking to accelerate innovation and achieve operational excellence
8. Common Challenges: Data Access and Discoverability
1. Data is scattered across multiple data sources and technologies
2. Tedious process of managing AWS IAM roles, Amazon S3 policies, API Gateways, and database permissions
3. Gets even more complicated in an AWS multi-account setup
4. Metadata is not discoverable
5. As a result, all the investments into Data and ML are undermined by data access issues
9. Common Challenges: Monolithic Data Teams
1. Lack of ownership and domain context: a disconnect between data producers and data consumers
2. Backlogged data teams struggling to keep pace with business demands
3. No contracts between Data and ML Engineering
4. As a result, fast end-to-end experimentation is killed by complex dependencies between teams
https://martinfowler.com/articles/data-monolith-to-mesh.html
10. Common Challenges: ML Experimentation Infrastructure
1. Inherited issues with Data Discovery and Data Access
2. Reproducibility of datasets, ML pipelines, ML environments, and offline experiments is still an issue
3. Production experimentation frameworks are still fairly immature
4. As a result, the cost of an end-to-end experiment, from data to a production ML metric, is 3-6 months
https://hbr.org/2020/03/building-a-culture-of-experimentation
11. Common Challenges: Scaling ML Adoption in Production
1. Online serving: there is no unified and consistent way to access features during model serving
2. Impossible to reuse features across multiple training pipelines and ML applications
3. Monitoring and maintenance of ML applications
4. As a result, the time and cost to scale from 1 to 100 models in production grows exponentially
What is your cost per ML model in production?
13. Emerging Architectural Shifts
● Data Lake -> Hudi/Delta Lakes: Hudi/Delta Lakes bring managed ingestion, ACID transactions, and point-in-time queries to traditional Data Lakes
● Data Lake -> Data Mesh: ownership of data domains, data pipelines, metadata, and APIs is shifting from centralized teams to product teams
● Data Lake -> Data Infrastructure as a Platform: unified, reusable platform components and frameworks across the enterprise
● Endpoint Protection -> Global Data Governance: data security and privacy measures are becoming centralized as part of the Data Platform
● Metadata Store -> Global Data Catalog: the user experience around data discovery, lineage, and versioning requires investment in a metadata-rich Data Catalog
● Feature Store: scaling ML experimentation and operations requires a separate data management layer for ML features
● ML Toolkit -> Complete ML Infrastructure: ML capabilities are democratized for ML Engineers and citizen Data Scientists
14. ACID Data Lakes (Delta/Hudi Lakes)
● Managed ingestion
● Dataset versioning for ML training
● Cheap "deletes" (a common GDPR use case)
● Audit log of any changes in datasets
● Brings ACID transactions into your data lake
● "Upserts" strategy on data ingestion
● Enables schemas to enforce data quality
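The upsert and point-in-time behaviour listed above can be illustrated with a toy, in-memory sketch. This is not the actual Delta or Hudi API; the class and method names are invented for illustration only:

```python
import copy

class ToyAcidTable:
    """In-memory illustration of upserts and point-in-time ("time travel")
    reads, mimicking what Delta/Hudi layer on top of a data lake."""

    def __init__(self):
        self._versions = []  # each commit stores a full snapshot keyed by row key

    def upsert(self, rows, key):
        """Insert new rows or overwrite existing ones by key, as one atomic commit."""
        current = copy.deepcopy(self._versions[-1]) if self._versions else {}
        for row in rows:
            current[row[key]] = row
        self._versions.append(current)
        return len(self._versions) - 1  # commit/version id

    def read(self, version=None):
        """Read the latest snapshot, or time-travel to an older version."""
        if not self._versions:
            return []
        snapshot = self._versions[-1 if version is None else version]
        return list(snapshot.values())

table = ToyAcidTable()
v0 = table.upsert([{"id": 1, "clicks": 10}], key="id")
v1 = table.upsert([{"id": 1, "clicks": 12}, {"id": 2, "clicks": 3}], key="id")
# The latest read sees the upserted value, while version v0 still sees the old one.
```

Keeping every commit addressable by version is what makes dataset versioning for ML training and audit logs possible in the real systems.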
15. Global Data Governance
● Accelerate privacy operations with data you already have. Automate business processes, data mapping, and PI discovery and classification for privacy workflows.
● Operationalize policies in a central location. Govern privacy policies to ensure they are effectively managed across the enterprise. Define and document workflows, traceability views, and business process registers.
● Scale compliance across multiple regulations. Use a platform designed and built with privacy in mind that is easily extensible to support new regulations.
AWS services: AWS Config, AWS Lake Formation
16. Global Data Catalog
A meta-metadata store:
● Does this data exist? Where is it?
● What is the source of truth for this data?
● Do I have access?
● Who is the owner?
● Who are the users of this data?
● Are there existing assets I can reuse?
● Can I trust this data?
* There are no established leaders in open source
17. The Core of MLOps and Reproducible Experimentation Pipelines
(Diagram: model code, ML pipeline code, and infrastructure as code feed automated pipeline execution with pipeline metadata; versioned datasets and a Feature Store supply the data; outputs are model artifacts, a prediction service, ML metrics, production metrics, alerts, and reports. Orchestration provides idempotent execution, with a feedback loop for production data.)
19. Feature Store Value Proposition
A data management layer for machine learning features.
1. Better ROI from feature engineering through a lower cost per model: facilitates collaboration, sharing, and reuse of features
2. Faster time to market for new models through increased productivity of ML Engineers: decouples the storage implementation from the feature serving API
20. Feature Store: Canonical Use Cases
● Personalization & Recommendation Engines
● Dynamic Pricing Optimization
● Supply Chain Optimization
● Logistics and Transportation Optimization
● Fraud Detection
● Predictive Maintenance
● Demand Forecasting
* All use cases where ML models need a stateful, ever-changing representation of the system
21. Feature Store: Concepts
● Online Feature Store: online applications look up a feature vector that is sent to an ML model for predictions
● ML-specific Metadata: enables feature discoverability and reuse
● ML-specific API and SDK: high-level operations for fetching training feature sets and for online access
● Materialized Versioned Datasets: maintains versions of the feature sets used to train ML models
(Diagram: Raw Data -> Feature Engineering -> Feature Store -> Training, Serving, Discovery)
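The four concepts above can be made concrete with a minimal in-memory sketch. All names here are hypothetical; a production feature store would back these paths with real offline and online stores:

```python
class MiniFeatureStore:
    """Toy feature store: a registry with metadata (discovery), an online
    keyed lookup (serving), and versioned materialized snapshots (training)."""

    def __init__(self):
        self.registry = {}   # feature set name -> metadata, for discovery
        self.online = {}     # (feature set, entity id) -> feature vector
        self.snapshots = {}  # (feature set, version) -> materialized rows

    def register(self, name, description, owner):
        """ML-specific metadata: make the feature set discoverable."""
        self.registry[name] = {"description": description, "owner": owner}

    def ingest(self, name, entity_id, features):
        self.online[(name, entity_id)] = features

    def get_online_features(self, name, entity_id):
        """Serving path: fetch one feature vector for a prediction request."""
        return self.online[(name, entity_id)]

    def materialize(self, name, version):
        """Training path: freeze the current features as a versioned dataset."""
        rows = [dict(entity_id=e, **f)
                for (n, e), f in self.online.items() if n == name]
        self.snapshots[(name, version)] = rows
        return rows

store = MiniFeatureStore()
store.register("user_stats", "aggregates per user", owner="ml-team")
store.ingest("user_stats", "u1", {"purchases_1h": 4.0})
vector = store.get_online_features("user_stats", "u1")
training_set = store.materialize("user_stats", version="v1")
```

The point of the sketch is the separation of paths: the same ingested features serve both low-latency online lookups and versioned training snapshots, which is what keeps training and serving consistent.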
23. Feast
Pros:
● Battle-tested with GoJek, Farfetch, Postmates, and Zulily
● Integrated with Kubeflow
● Good community
Cons (to be addressed in the roadmap):
● GCP only
● Infrastructure-heavy
● Lacks composability
● No data versioning
* Now backed by Tecton
* https://blog.feast.dev/post/a-state-of-feast
(Diagram: an Ingestion API feeds an Offline Store (BigQuery) for historical serving and training, and an Online Store (Redis) for online serving; a Feature Registry supports discovery.)
24. Hopsworks
Pros:
● Integrates with most Python libs for ingestion and training
● Supports an offline store with time travel
● AWS / GCP / Azure / on-prem ready
Cons:
● Hard to use outside of the HopsML infrastructure
● The online store might not fit all latency requirements
* Online serving is part of the Enterprise version
(Diagram: Spark and Pandas ingestion APIs feed an Offline Store (Hudi/Hive) for historical serving and training, and an Online Store (MySQL) for online serving; a Feature Registry supports discovery.)
27. Lessons Learned
1. Start by designing a consistent ACID Data Lake before investing in a Feature Store
2. The value from existing open-source products does not justify the investment in integration and the dependencies they bring
3. A Feature Store must not bring in new infrastructure and data storage solutions. It has to be a lightweight API and SDK integrated into your existing data infrastructure.
4. Data Catalog, Data Governance, and Data Quality components are horizontal to the whole Data Infrastructure, including the Feature Store
5. There are no mature open-source or cloud solutions for Global Data Catalog and Data Quality monitoring
28. Data Infrastructure with Feature Store
(Diagram: Raw Data and Event Data flow through Stream Processing into Hot Storage and through Batch Processing into Cold Storage; a Feature Store API on top of both serves Training and Serving, alongside BI Tools and an API; Workflow Automation, Data Catalog, Data Quality, and Data Security run as horizontal components.)
31. Recommended Strategy
Recommendations for going forward with a Feature Store:
1. Make sure your existing data infrastructure covers 90% of the Feature Store requirements (streaming ingestion, consistency, catalog, versioning)
2. Build a lightweight in-house Feature Store API on top of your existing storage solutions
3. Collaborate with the community and cloud vendors to maintain compatibility with standards and the state-of-the-art ecosystem
4. Be ready to migrate to a managed service or an open-source alternative as the market matures
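A lightweight in-house Feature Store API can stay a thin abstraction over storage you already run, which also keeps the later migration to a managed service cheap. A sketch using Python protocols, with in-memory stand-ins for the backends (all class and method names are illustrative):

```python
from typing import Protocol

class OnlineStore(Protocol):
    def get(self, entity_id: str) -> dict: ...
    def put(self, entity_id: str, features: dict) -> None: ...

class OfflineStore(Protocol):
    def scan(self) -> list[dict]: ...

class DictOnlineStore:
    """Stand-in for DynamoDB/ElastiCache in a real deployment."""
    def __init__(self):
        self._data = {}
    def get(self, entity_id):
        return self._data[entity_id]
    def put(self, entity_id, features):
        self._data[entity_id] = features

class ListOfflineStore:
    """Stand-in for S3 + Athena/Hudi in a real deployment."""
    def __init__(self, rows):
        self._rows = rows
    def scan(self):
        return list(self._rows)

class FeatureStoreAPI:
    """The only layer ML code talks to; backends stay swappable."""
    def __init__(self, online: OnlineStore, offline: OfflineStore):
        self.online, self.offline = online, offline
    def serve(self, entity_id):
        return self.online.get(entity_id)
    def training_rows(self):
        return self.offline.scan()

api = FeatureStoreAPI(DictOnlineStore(), ListOfflineStore([{"id": "u1", "x": 1.0}]))
api.online.put("u1", {"x": 1.0})
```

Because training and serving code depend only on the protocols, swapping `DictOnlineStore` for a DynamoDB-backed implementation later changes no ML code.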
36. Amazon DynamoDB
Fast and flexible key-value database service for any scale
● Performance at scale: consistent, single-digit-millisecond response times at any scale; build applications with virtually unlimited throughput
● Serverless architecture: no hardware provisioning, software patching, or upgrades; scales up or down automatically; continuously backs up your data
● Global replication: build global applications with fast access to local data by easily replicating tables across multiple AWS Regions
● Enterprise security: encrypts all data by default and fully integrates with AWS Identity and Access Management for robust security
37. Amazon ElastiCache
Managed Redis- or Memcached-compatible in-memory data store
● Consistent high performance: in-memory data store and cache for sub-millisecond response times
● Unlimited scale: read scaling with replicas; write and memory scaling with sharding; nondisruptive scaling
● Fully managed: AWS manages all hardware and software setup, configuration, and monitoring
38. Amazon Aurora
MySQL- and PostgreSQL-compatible relational database built for the cloud
● Performance & scalability: 5x the throughput of standard MySQL and 3x that of standard PostgreSQL; scale out up to 15 read replicas
● Availability & durability: fault-tolerant, self-healing storage; 6 copies of data across 3 AZs; continuous backup to Amazon S3
● Highly secure: network isolation, encryption at rest and in transit
● Fully managed: managed by Amazon RDS; no server provisioning, software patching, setup, configuration, or backups on your part
42. Amazon Athena
Serverless, interactive query service for analytics
● Pay per query: pay only for the queries you run; save 30-90% on per-query costs through compression
● Uses S3 storage: zero setup cost; point to S3 and start querying
● ANSI SQL: JDBC/ODBC drivers; multiple formats, compression types, and complex joins and data types
● Easy: serverless, with zero infrastructure and zero administration; integrated with QuickSight; query instantly
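Because Athena speaks ANSI SQL over S3, it can assemble training sets for the offline feature path directly. A sketch of building a point-in-time-correct query string, where each label row is joined to the latest feature row at or before its timestamp; the table and column names are purely illustrative:

```python
def point_in_time_query(label_table: str, feature_table: str,
                        entity_col: str, ts_col: str) -> str:
    """Render an ANSI-SQL query that joins each label row to the most recent
    feature row at or before the label timestamp (avoids feature leakage)."""
    return f"""
SELECT l.*, f.*
FROM {label_table} l
JOIN {feature_table} f
  ON f.{entity_col} = l.{entity_col}
 AND f.{ts_col} = (
     SELECT MAX(f2.{ts_col})
     FROM {feature_table} f2
     WHERE f2.{entity_col} = l.{entity_col}
       AND f2.{ts_col} <= l.{ts_col}
 )
""".strip()

sql = point_in_time_query("labels", "user_features", "user_id", "event_ts")
```

The correlated subquery is what enforces point-in-time correctness: no feature value observed after the label's timestamp can leak into the training row.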
43. Questions, details? We would be happy to answer!
125 University Avenue, Suite 290, Palo Alto, California 94301
provectus.com