MetaConfig driven
FeatureStore@MakeMyTrip
~/Piyush
Head Data Platform Engineering
Namasté
About MakeMyTrip
Deliverables of this presentation:
- Why common feature store?
- Productionizing ML via standardization
- Machine Learning Life Cycle
- Prediction Serving + Challenges
- FeatureStore Components
- Architecture
- Tools
- Next Steps
- References
Motivation
Developing a unified personalization platform to improve the customer experience of millions of Indian
travellers
Business Goal: Through Hyper Personalization
● Raise Engagement
● Drive Conversions + Boost Revenue
● Migrate Business Rule Engines to ML Models (across different LOBs @MakeMyTrip)
Tech Goal:
● Machine Learning models are only as good as the data they are trained on, so they need good data management.
● ML systems are trained on a set of features. A feature is an input to a model; it can be a column in a
dataset, a complex computed metric, or the output of another model.
● A Feature Store is a central, common repository for highly curated features that are described through
well-structured configuration. It enables us to scale machine learning workflows @MakeMyTrip.
Before Feature Store : state of data platform
● Siloed data sets + serving APIs created per use case / project, leading to complex
data pipelines | Machine Learning, if not implemented in the right manner, creates high tech debt
○ Personalization : Cosmos
○ Customer Segmentation : HYDRA
○ Hotel Ranking / Sequencing + Intendo
○ DP : Dynamic / Differential Pricing : Hotel & Flights
○ Anomaly Detection, Destination trends, Demand Anomalies
● Real-time features require Data Scientists to get Data Engineering support
● Lack of standardization & discovery : the same feature definition is duplicated across
different data pipelines and computed multiple times, so a change to the definition
means fixing several pipelines.
● Features used in training and serving were inconsistent
Productionizing ML via Standardization
● MetaConfigs & Feature Catalog : Documentation
● Reusability of features across projects / teams
● Standardized access of features between Training &
Serving | Data Governance + Data Quality
● More Self-serve : Reduces Data Scientist Time on DE
Tasks
● Reduce Time to get to Production for ML Projects
● Reduce Data Tech-Debt & Improved Feature Quality
[Diagram: Raw Data (Data Store 1, Data Store 2, ... Data Store N) | Structured Data (Data Sets 1 ... Data Sets N) | Feature Engineering | Feature Store : Online + Historical | MODEL : TRAINING + DEPLOY]
Machine Learning Life Cycle
ML LifeCycle Image source : UCB RISE LABs
Addition : FEATURE PIPELINES
Prediction serving
- Ask : 10-30 ms latency / < 30 ms
- Challenges : DNN : Complex models
- Hardware : GPUs / TPUs
- SageMaker provides an abstraction / middle layer between applications and complex
models through Docker containers
- Online : SageMaker Endpoints
- Batch : Scoring : Pre-materialize predictions into a low-latency store (like a Redis
cluster / BoulderDB)
- Problems :
- Requires substantial computation and space
- Example: scoring all customers
- Costly update -> rescore everything
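A minimal sketch of the batch-scoring option above, assuming a plain Python dict stands in for the low-latency store (a Redis cluster or BoulderDB in the deck) and toy scoring functions stand in for the model. All names here (`materialize_predictions`, `serve`, `toy_score_v1`, `toy_score_v2`) are hypothetical. It illustrates why an update is costly: a new model version means rescoring every customer.

```python
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for a low-latency key-value store (Redis / BoulderDB in the deck).
PredictionStore = Dict[str, float]


def materialize_predictions(
    customer_ids: Iterable[str],
    score: Callable[[str], float],
) -> PredictionStore:
    """Offline path: score every customer and return the full lookup table."""
    return {customer_id: score(customer_id) for customer_id in customer_ids}


def serve(store: PredictionStore, customer_id: str, default: float = 0.0) -> float:
    """Online path: a single key lookup, no model call."""
    return store.get(customer_id, default)


def toy_score_v1(customer_id: str) -> float:
    # Illustrative model only: the score is the length of the id.
    return float(len(customer_id))


def toy_score_v2(customer_id: str) -> float:
    # A "retrained" model: every prediction stored for v1 is now stale.
    return float(len(customer_id)) * 2.0


customers = ["u1", "u22", "u333"]
store = materialize_predictions(customers, toy_score_v1)
# Deploying v2 means rescoring all customers, not only the active ones.
store = materialize_predictions(customers, toy_score_v2)
```

The online path stays a cheap lookup, while compute and storage grow with the number of customers, which matches the problems listed above.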
FeatureStore Glossary
Feature : a measurable property of a phenomenon
under observation defined in FSConfig
FSConfig: used for storing config/ DSL + code to
compute features, feature version information,
feature analysis data and feature documentation
FSCompute: Computation engine developed over
Spark; supports most of the Spark APIs for historical
and online (streaming) compute
FeatureStore : serves as a repository of features that
can be used for training and evaluation of machine
learning models.
FeatureGroup: internal to the system, to group
common compute jobs of related features having the
same entity, input data sources and filter conditions,
thereby optimizing the compute process.
FSScheduler: Internal service that creates a feature
DAG (with dependency resolution) and triggers its
execution while handling retries and back pressure.
FS-DSA : Data Science Automation for Model Training
+ Deployment integrated with Feature Store |
Enables versioned and reproducible experiments.
FSBrokerAPI : Online Serving RESTful API endpoint for
consumer applications
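The FeatureGroup and FSScheduler entries above can be sketched as follows. This is an illustration under stated assumptions, not the FSScheduler implementation: the `FEATURES` table, its field layout and the feature names are hypothetical, and retries and back pressure are left out. Features are grouped by (entity, input sources, filter condition), and the standard-library `graphlib.TopologicalSorter` produces an order in which each feature runs after its dependencies.

```python
from collections import defaultdict
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple

# Hypothetical, simplified definitions:
# feature name -> (entity, input sources, filter expression, depends_on)
FeatureDef = Tuple[str, Tuple[str, ...], str, Tuple[str, ...]]

FEATURES: Dict[str, FeatureDef] = {
    "listing_view_cnt": ("uuid", ("txn_search",), "", ()),
    "listing_conv_cnt": ("uuid", ("txn_search",), "", ()),
    "listing_conv_rate": (
        "uuid", ("txn_search",), "", ("listing_view_cnt", "listing_conv_cnt"),
    ),
    "user_tier": ("uuid", ("user_master",), "", ()),
}


def feature_groups(features: Dict[str, FeatureDef]) -> Dict[tuple, List[str]]:
    """Group features that share entity, input sources and filter condition."""
    groups: Dict[tuple, List[str]] = defaultdict(list)
    for name, (entity, sources, filter_expr, _deps) in features.items():
        groups[(entity, sources, filter_expr)].append(name)
    return dict(groups)


def execution_order(features: Dict[str, FeatureDef]) -> List[str]:
    """Return an order in which every feature comes after its dependencies."""
    graph = {name: set(deps) for name, (_e, _s, _f, deps) in features.items()}
    # static_order() raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


groups = feature_groups(FEATURES)
order = execution_order(FEATURES)
```

Features in the same group can share one scan of their input data, which is the optimization the FeatureGroup entry describes.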
FeatureStore Components & Data Flow
[Diagram with three stages: DATA CAPTURE | COMPUTE + FSConfig | SERVING + STORAGE]
- DATA CAPTURE : User Funnel Activity Streams (Client-Side, Server-Side) | Transactional Data (Booking Master) | Master Datastore (Product Master, User Master, Device Master) | New Data Streams
- COMPUTE + FSConfig : FSConfig (Feature Catalog, Feature Definitions) | BT-Compute (BATCH Feature Compute Jobs) | RT-Compute (Feature Compute) | Job Scheduler | ML Automation on Sagemaker (TRAIN : Training + HPO, Deploy : Docker / Batch Transform)
- SERVING + STORAGE : Feature Storage (BoulderDB, REDIS) | SERVING API (Offline Models, Online Models) | Batch BULK API (DataFrame)
FSConfig : Feature Definitions & Metadata
Feature Name : <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>
Entity : <UserID>_<profileType>
Short Name : listing_conversion_rank
Versioning : v2
Process : RT/BT
FeatureGroup : (System Generated ID) 8fda73d1_2eee_4cfc_a20f_e9afb178fbc3
Entity : ["uuid", "profile_type"]
Features [Array]
Time Window (Refresh / Data - Time duration) : (ISO Time Interval) P1D
Data Source [Array] : [user_master, txn_search]
Data Store : GLUE/S3 | Database Name : blueshift | Table Name : [user_master, txn_search]
Data Sink : Serving [Array]
Data Store : GLUE Catalog/S3/Redis/BoulderDb | Database Name : rocksDB_<WAL Dir Path> | Table Name : rocksDB_<columnFamily>
Compute Logic :
DSL + Spark SQL : metric_expr, group_by_expr, filter_expr, window_function, window_function_alias
Code (Python/Scala/Java) : GIT/Gerrit URI
Model (sagemaker) / Embedding
Environment : Production | Workspace : Dev/Staging/Production | Namespace : <Project Name>
Apache LIVY + Databricks JOBs API Config
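A minimal sketch of the feature naming convention above, `<Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>`. The `FeatureName` class and its field names are hypothetical helpers for illustration, not part of FSConfig. The sample string is one of the column names that appears in the historical output schema in this deck.

```python
from dataclasses import dataclass

SEPARATOR = "::"


@dataclass(frozen=True)
class FeatureName:
    entity: str             # e.g. "uuid_profileType"
    short_name: str         # e.g. "listing_conv_rank"
    data_interval: str      # duration string as written in the deck, e.g. "P30D"
    refresh_frequency: str  # duration string as written in the deck, e.g. "P15M"
    version: str            # e.g. "v1"

    def render(self) -> str:
        """Join the five components back into the fully qualified feature name."""
        return SEPARATOR.join(
            [self.entity, self.short_name, self.data_interval,
             self.refresh_frequency, self.version]
        )

    @classmethod
    def parse(cls, raw: str) -> "FeatureName":
        """Split a fully qualified feature name into its five components."""
        parts = raw.split(SEPARATOR)
        if len(parts) != 5:
            raise ValueError(
                f"expected 5 '::'-separated parts, got {len(parts)}: {raw!r}"
            )
        return cls(*parts)


name = FeatureName.parse("uuid_profileType::listing_conv_rank::P30D::P15M::v1")
```

Carrying the version inside the name lets two versions of the same feature coexist as separate columns or keys.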
FS Store | online + historical
Output Schema (internal to the system)
● Historical Feature Data schema on S3 Parquet
|-- entity: string (nullable = false)
|-- uuid_profileType::listing_conv_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::listing_view_rank::P30D::P15M::v1: long (nullable = false)
|-- uuid_profileType::cnt_distinct_bk_bankid::P30D::P15M::v1: map (nullable = false)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
..
..
All features in that feature group
● Online Serving Data Schema on REDIS + BoulderDB
○ Serving at Feature Group level
Key -> <Entity_id>#<Feature_group_id>/<Feature_split>
Value -> Hashes
key -> Feature_name
Value -> Feature_value
TimeStamp -> Compute_Processed_Time
○ Serving at Feature Level
Key -> <Entity_id>#<Feature_name>
Value -> Hashes
key -> Feature_name
Value -> Feature_value
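The two online key layouts above can be sketched with a nested dict standing in for Redis hashes. This is an assumption-laden illustration: the `HashStore` type, the helper names, the sample ids, and storing the compute time under a field called `TimeStamp` are all hypothetical choices, not the documented storage format.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-in for Redis hashes: key -> {field: value}.
HashStore = Dict[str, Dict[str, str]]


def group_key(entity_id: str, feature_group_id: str, feature_split: str) -> str:
    """Feature-group-level key: <Entity_id>#<Feature_group_id>/<Feature_split>."""
    return f"{entity_id}#{feature_group_id}/{feature_split}"


def feature_key(entity_id: str, feature_name: str) -> str:
    """Feature-level key: <Entity_id>#<Feature_name>."""
    return f"{entity_id}#{feature_name}"


def write_group(
    store: HashStore,
    entity_id: str,
    feature_group_id: str,
    feature_split: str,
    features: Dict[str, str],
    processed_time: str,
) -> str:
    """Store all features of one group for one entity under a single hash."""
    key = group_key(entity_id, feature_group_id, feature_split)
    fields = dict(features)
    fields["TimeStamp"] = processed_time  # assumed field name for Compute_Processed_Time
    store[key] = fields
    return key


def read_feature(store: HashStore, key: str, feature_name: str) -> Optional[str]:
    """Return one feature value from a hash, or None if key or field is missing."""
    return store.get(key, {}).get(feature_name)


store: HashStore = {}
key = write_group(
    store, "user42", "fg1", "0",
    {"listing_conv_rank": "3", "listing_view_rank": "7"},
    "2019-08-17T00:00:00Z",
)
```

Group-level keys fetch every feature of a group in one read; feature-level keys trade that for finer-grained access.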
SERVING Config
- lambda (batch_feature_name
linkage for RT features)
- Support for linear QUERY DAGs
- MVEL based post-processing on any
feature per service/model if needed
Feature backfill (back_fill_required,
back_fill_duration)
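The deck does not spell out how the `batch_feature_name` linkage is used at serving time. One plausible reading, sketched here purely as an assumption, is a lambda-style lookup: serve the real-time (RT) value when it exists and fall back to the linked batch (BT) feature otherwise. The `BATCH_LINKAGE` mapping, the feature names and the `resolve` function are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical serving config: real-time feature name -> linked batch feature name.
BATCH_LINKAGE: Dict[str, str] = {
    "listing_conv_rank_rt": "listing_conv_rank_bt",
}


def resolve(
    feature_name: str,
    rt_values: Dict[str, int],
    bt_values: Dict[str, int],
    linkage: Dict[str, str] = BATCH_LINKAGE,
) -> Optional[int]:
    """Return the RT value if present, else the value of the linked batch feature."""
    if feature_name in rt_values:
        return rt_values[feature_name]
    batch_name = linkage.get(feature_name)
    if batch_name is None:
        return None  # no linkage configured for this feature
    return bt_values.get(batch_name)


fresh = resolve(
    "listing_conv_rank_rt",
    {"listing_conv_rank_rt": 3},
    {"listing_conv_rank_bt": 7},
)
fallback = resolve("listing_conv_rank_rt", {}, {"listing_conv_rank_bt": 7})
```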
FS-BrokerAPI : Online Feature Serving Framework
[Diagram with three layers: REQUEST HANDLER | Orchestration Layer | Data Access Layer]
- REQUEST HANDLER : <uri>/v1/getFeatures (POST Request) | Request Validations | Feature Definition | FeaturesbyName, FeaturesbyModel, FeaturesbyService
- Orchestration Layer : Orchestration + Broker | AKKA (Actors) | Business Logics + MVEL
- Data Access Layer : Extractors + Transport (REDIS) | Extractors + Transport (BoulderDB)
BoulderDB : Online Serving Store
- Built on top of RocksDB (an embedded data store developed by Facebook), reducing
the distance to data on the serving layer.
- Post-processing steps added to the compute layer:
- After processing data through Spark (distributed), the BT-Compute layer writes SST files from the
various executors into shared object storage (S3)
- The Spark DataFrame is split into non-overlapping ranges; each split is sorted by key and then ingested
into an SST file per partition / executor
- Cluster coordinator : Consul
- Atomic switching of DB snapshots
- Data is sharded (helps with proximity by Namespace) and replicated (RF=2)
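A minimal sketch of two ideas from the list above: splitting rows into sorted, non-overlapping key ranges, and switching to a new snapshot atomically. It assumes plain Python lists and a dict in place of Spark partitions, SST files and RocksDB, and it assumes keys are unique. `split_into_sorted_ranges` and `SnapshotStore` are hypothetical names, not BoulderDB APIs.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, str]  # (key, value)


def split_into_sorted_ranges(rows: List[Row], num_splits: int) -> List[List[Row]]:
    """Sort rows by key and cut them into contiguous, non-overlapping key ranges."""
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    ordered = sorted(rows, key=lambda row: row[0])
    size = max(1, -(-len(ordered) // num_splits))  # ceiling division, at least 1
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


class SnapshotStore:
    """Readers always see one complete snapshot; the switch is one reference swap."""

    def __init__(self) -> None:
        self._active: Dict[str, str] = {}

    def switch_to(self, splits: List[List[Row]]) -> None:
        # Build the new snapshot completely before making it visible.
        fresh = {key: value for split in splits for key, value in split}
        self._active = fresh

    def get(self, key: str) -> Optional[str]:
        return self._active.get(key)


rows = [("u3", "c"), ("u1", "a"), ("u4", "d"), ("u2", "b"), ("u5", "e")]
splits = split_into_sorted_ranges(rows, num_splits=2)
snapshot = SnapshotStore()
snapshot.switch_to(splits)
```

Because every split covers its own key range and is already sorted, each one can be written out independently, which is what makes per-executor file generation possible in the design described above.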
Tools
Next Steps
- Feature Stats Visualization / Analytics & Monitoring // Feature
Catalog
- Seamless integration with Experimentation Framework
- Per User Databases on top of feature-store for Personalization
- Notebook integration : Better data science tools for Data
Scientists with Python libraries
- Perf Tools : Query Optimization & Analysis
References
- https://www.logicalclocks.com/feature-store/
- https://eng.uber.com/scaling-michelangelo/
- Airbnb : Zipline
- HopsML + Hopsworks
- Go-JEK : FEAST
- The Design of Systems for Real-time Prediction Serving | DataEngConf SF '18
- https://medium.com/makemytrip-engineering
Piyush Kumar
E : piyush.kumar@makemytrip.com
W : www.makemytrip.com
T : https://twitter.com/piykumar
Thank you !!

More Related Content

What's hot

Hopsworks Feature Store 2.0 a new paradigm
Hopsworks Feature Store  2.0   a new paradigmHopsworks Feature Store  2.0   a new paradigm
Hopsworks Feature Store 2.0 a new paradigm
Jim Dowling
 
Integration for real-time Kafka SQL
Integration for real-time Kafka SQLIntegration for real-time Kafka SQL
Integration for real-time Kafka SQL
Amit Nijhawan
 
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKBigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
nagachika t
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
Guido Schmutz
 
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
HostedbyConfluent
 
WebAPI::DBIC - Automated RESTful API's
WebAPI::DBIC - Automated RESTful API'sWebAPI::DBIC - Automated RESTful API's
WebAPI::DBIC - Automated RESTful API's
Michael Francis
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021
Jim Dowling
 
Introduction to ksqlDB and stream processing (Vish Srinivasan - Confluent)
Introduction to ksqlDB and stream processing (Vish Srinivasan  - Confluent)Introduction to ksqlDB and stream processing (Vish Srinivasan  - Confluent)
Introduction to ksqlDB and stream processing (Vish Srinivasan - Confluent)
KafkaZone
 
Right-size Deployment Instances to Meet Enterprise Demand
Right-size Deployment Instances to Meet Enterprise Demand Right-size Deployment Instances to Meet Enterprise Demand
Right-size Deployment Instances to Meet Enterprise Demand
WSO2
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 
From Kafka to BigQuery - Strata Singapore
From Kafka to BigQuery - Strata SingaporeFrom Kafka to BigQuery - Strata Singapore
From Kafka to BigQuery - Strata Singapore
Ofir Sharony
 
Apache Gobblin at MZ
Apache Gobblin at MZApache Gobblin at MZ
Apache Gobblin at MZ
Michael Dreibelbis
 
Adopting AnswerModules ModuleSuite
Adopting AnswerModules ModuleSuiteAdopting AnswerModules ModuleSuite
Adopting AnswerModules ModuleSuite
AnswerModules
 
MemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks WebcastMemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks Webcast
SingleStore
 
All Streams Ahead! ksqlDB Workshop ANZ
All Streams Ahead! ksqlDB Workshop ANZAll Streams Ahead! ksqlDB Workshop ANZ
All Streams Ahead! ksqlDB Workshop ANZ
confluent
 
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQLIngesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Guido Schmutz
 
george.farquhar.resume2
george.farquhar.resume2george.farquhar.resume2
george.farquhar.resume2
George Farquhar
 
Building a Streaming Platform with Kafka
Building a Streaming Platform with KafkaBuilding a Streaming Platform with Kafka
Building a Streaming Platform with Kafka
confluent
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
IT Event
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Guido Schmutz
 

What's hot (20)

Hopsworks Feature Store 2.0 a new paradigm
Hopsworks Feature Store  2.0   a new paradigmHopsworks Feature Store  2.0   a new paradigm
Hopsworks Feature Store 2.0 a new paradigm
 
Integration for real-time Kafka SQL
Integration for real-time Kafka SQLIntegration for real-time Kafka SQL
Integration for real-time Kafka SQL
 
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKBigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDK
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
Mind the App: How to Monitor Your Kafka Streams Applications | Bruno Cadonna,...
 
WebAPI::DBIC - Automated RESTful API's
WebAPI::DBIC - Automated RESTful API'sWebAPI::DBIC - Automated RESTful API's
WebAPI::DBIC - Automated RESTful API's
 
Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021Hops fs huawei internal conference july 2021
Hops fs huawei internal conference july 2021
 
Introduction to ksqlDB and stream processing (Vish Srinivasan - Confluent)
Introduction to ksqlDB and stream processing (Vish Srinivasan  - Confluent)Introduction to ksqlDB and stream processing (Vish Srinivasan  - Confluent)
Introduction to ksqlDB and stream processing (Vish Srinivasan - Confluent)
 
Right-size Deployment Instances to Meet Enterprise Demand
Right-size Deployment Instances to Meet Enterprise Demand Right-size Deployment Instances to Meet Enterprise Demand
Right-size Deployment Instances to Meet Enterprise Demand
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
From Kafka to BigQuery - Strata Singapore
From Kafka to BigQuery - Strata SingaporeFrom Kafka to BigQuery - Strata Singapore
From Kafka to BigQuery - Strata Singapore
 
Apache Gobblin at MZ
Apache Gobblin at MZApache Gobblin at MZ
Apache Gobblin at MZ
 
Adopting AnswerModules ModuleSuite
Adopting AnswerModules ModuleSuiteAdopting AnswerModules ModuleSuite
Adopting AnswerModules ModuleSuite
 
MemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks WebcastMemSQL 201: Advanced Tips and Tricks Webcast
MemSQL 201: Advanced Tips and Tricks Webcast
 
All Streams Ahead! ksqlDB Workshop ANZ
All Streams Ahead! ksqlDB Workshop ANZAll Streams Ahead! ksqlDB Workshop ANZ
All Streams Ahead! ksqlDB Workshop ANZ
 
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQLIngesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
Ingesting and Processing IoT Data - using MQTT, Kafka Connect and KSQL
 
george.farquhar.resume2
george.farquhar.resume2george.farquhar.resume2
george.farquhar.resume2
 
Building a Streaming Platform with Kafka
Building a Streaming Platform with KafkaBuilding a Streaming Platform with Kafka
Building a Streaming Platform with Kafka
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 

Similar to Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serving Platform powering Machine Learning @MakeMyTrip by Piyush Kumar

Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
Jim Dowling
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature Store
Moritz Meister
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Chester Chen
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
camunda services GmbH
 
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
SQUADEX
 
BI 2008 Simple
BI 2008 SimpleBI 2008 Simple
BI 2008 Simple
llangit
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdf
Jim Dowling
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
Stepan Pushkarev
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Jim Dowling
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
Ihor Bobak
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2
Bill Liu
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
GoDataDriven
 
SaaS transformation with OCE - uEngineCloud
SaaS transformation with OCE - uEngineCloudSaaS transformation with OCE - uEngineCloud
SaaS transformation with OCE - uEngineCloud
uEngine Solutions
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
Lucidworks
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platform
Mostafa
 
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
ScyllaDB
 
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel Bayeta
Sam B
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
Aniket Mokashi
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
PAPIs.io
 
GPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep diveGPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep dive
Riccardo Perico
 

Similar to Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serving Platform powering Machine Learning @MakeMyTrip by Piyush Kumar (20)

Serverless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData SeattleServerless ML Workshop with Hopsworks at PyData Seattle
Serverless ML Workshop with Hopsworks at PyData Seattle
 
Hamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature StoreHamburg Data Science Meetup - MLOps with a Feature Store
Hamburg Data Science Meetup - MLOps with a Feature Store
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
 
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra...
 
BI 2008 Simple
BI 2008 SimpleBI 2008 Simple
BI 2008 Simple
 
PyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdfPyData Berlin 2023 - Mythical ML Pipeline.pdf
PyData Berlin 2023 - Mythical ML Pipeline.pdf
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Productionalizing ML : Real Experience
Productionalizing ML : Real ExperienceProductionalizing ML : Real Experience
Productionalizing ML : Real Experience
 
C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2C19013010 the tutorial to build shared ai services session 2
C19013010 the tutorial to build shared ai services session 2
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
SaaS transformation with OCE - uEngineCloud
SaaS transformation with OCE - uEngineCloudSaaS transformation with OCE - uEngineCloud
SaaS transformation with OCE - uEngineCloud
 
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, LucidworksngineersSQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
SQL Analytics for Search Engineers - Timothy Potter, Lucidworksngineers
 
Azure Data platform
Azure Data platformAzure Data platform
Azure Data platform
 
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
Simplifying the Creation of Machine Learning Workflow Pipelines for IoT Appli...
 
Samuel Bayeta
Samuel BayetaSamuel Bayeta
Samuel Bayeta
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
GPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep diveGPPB2020 - Milan - Power BI dataflows deep dive
GPPB2020 - Milan - Power BI dataflows deep dive
 

More from Data Con LA

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
Data Con LA
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
Data Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
Data Con LA
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
Data Con LA
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
Data Con LA
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
Data Con LA
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
Data Con LA
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
Data Con LA
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA
 


Data Con LA 2019 - MetaConfig driven FeatureStore with Feature compute & Serving Platform powering Machine Learning @MakeMyTrip by Piyush Kumar

  • 3. Deliverables of this presentation:
    - Why a common feature store?
    - Productionizing ML via standardization
    - Machine Learning Life Cycle
    - Prediction Serving + Challenges
    - FeatureStore Components
    - Architecture
    - Tools
    - Next Steps
    - References
  • 4. Motivation
    Developing a unified personalization platform to improve the customer experience of millions of Indian travellers.
    Business Goal: through hyper-personalization
    ● Raise engagement
    ● Drive conversions + boost revenue
    ● Migrate business rule engines to ML models (across different LOBs @MakeMyTrip)
    Tech Goal:
    ● Machine learning models are only as good as the data they are trained on, so they need good data management.
    ● ML systems are trained on a set of features. A feature is an input to a model; it can be a column in a dataset, a complex computed metric, or the output of another model.
    ● The Feature Store is a central, common repository of highly curated features that are described through well-structured configuration. It enables us to scale machine learning workflows @MakeMyTrip.
  • 5. Before Feature Store: state of the data platform
    ● Siloed data sets + serving APIs created per use case / project, leading to complex data pipelines. Machine learning that is not implemented in the right manner creates high tech debt.
      ○ Personalization: Cosmos
      ○ Customer Segmentation: HYDRA
      ○ Hotel Ranking / Sequencing + Intendo
      ○ DP: Dynamic / Differential Pricing: Hotel & Flights
      ○ Anomaly Detection, Destination Trends, Demand Anomalies
    ● Real-time features require Data Engineering support from Data Scientists
    ● Lack of standardization & discovery: the same feature definition is duplicated across different data pipelines or computed multiple times, so a change to a definition means fixing it in every pipeline.
    ● Features used in training and serving were inconsistent
  • 6. Productionizing ML via Standardization
    ● MetaConfigs & Feature Catalog: documentation
    ● Reusability of features across projects / teams
    ● Standardized access to features between training & serving | Data Governance + Data Quality
    ● More self-serve: reduces Data Scientist time on DE tasks
    ● Reduced time to production for ML projects
    ● Reduced data tech debt & improved feature quality
    [Diagram labels: Feature Store: Online + Historical; Raw Data: Data Store 1, Data Store 2, Data Store N; Structured Data: Data Sets 1, Data Sets N; Feature Engineering; MODEL: TRAINING + DEPLOY]
  • 7. Machine Learning Life Cycle
    [Figure: ML life cycle. Image source: UCB RISE Lab. Addition: FEATURE PIPELINES]
  • 8. Prediction serving
    - Ask: 10-30 ms / < 30 ms
    - Challenges:
      - DNN: complex models
      - Hardware: GPUs / TPUs
    - SageMaker provides an abstraction / middle layer between applications and complex models through Docker containers
      - Online: SageMaker Endpoints
      - Batch scoring: pre-materialize predictions into a low-latency store (like a Redis cluster / BoulderDB)
    - Problems:
      - Requires substantial computation and space
      - Example: scoring for all customers
      - Costly update -> rescore everything
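The batch-scoring option on this slide can be sketched in a few lines. This is a minimal illustration only: a plain Python dict stands in for the low-latency store (Redis / BoulderDB), and `toy_score`, `materialize_predictions`, `serve` and the `pred#` key prefix are hypothetical names, not part of the platform described in the talk.

```python
from typing import Callable, Dict, Iterable, Optional


def toy_score(customer_id: str) -> float:
    """Hypothetical stand-in for a model; returns a deterministic pseudo-score."""
    return (len(customer_id) % 10) / 10.0


def materialize_predictions(
    customer_ids: Iterable[str],
    score: Callable[[str], float],
    store: Dict[str, float],
) -> int:
    """Score every customer up front and write the results to the store.

    Returns the number of predictions written. The cost grows with the
    number of customers, which is the 'substantial computation and space'
    problem noted on the slide.
    """
    written = 0
    for customer_id in customer_ids:
        store[f"pred#{customer_id}"] = score(customer_id)
        written += 1
    return written


def serve(store: Dict[str, float], customer_id: str) -> Optional[float]:
    """Serving is a single key lookup; no model is invoked at request time."""
    return store.get(f"pred#{customer_id}")


if __name__ == "__main__":
    store: Dict[str, float] = {}
    materialize_predictions(["u1", "u22"], toy_score, store)
    print(serve(store, "u1"))
```

A model update means calling `materialize_predictions` again over the full customer list, which is the "rescore everything" cost.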
  • 9. FeatureStore Glossary
    Feature: a measurable property of a phenomenon under observation, defined in FSConfig.
    FSConfig: stores the config / DSL + code to compute features, feature version information, feature analysis data and feature documentation.
    FSCompute: computation engine built on Spark; supports most of the Spark APIs for historical and online (streaming) computation.
    FeatureStore: a repository of features that can be used for training and evaluation of machine learning models.
    FeatureGroup: internal to the system; groups the compute jobs of related features that share the same entity, input data sources and filter conditions, thereby optimizing the compute process.
    FSScheduler: internal service that creates a feature DAG (with dependency resolution) and triggers its execution while handling retries and back pressure.
    FS-DSA: Data Science Automation for model training + deployment, integrated with the Feature Store | Enables versioned and reproducible experiments.
    FSBrokerAPI: online serving RESTful API endpoint for consumer applications.
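The FSScheduler entry mentions a feature DAG with dependency resolution and retries. The sketch below shows the general idea using Python's standard `graphlib` module (Python 3.9+). It is not the FSScheduler implementation; the feature names, `execution_order` and `run_with_retries` are hypothetical, and back pressure is not modelled.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, TypeVar

T = TypeVar("T")


def execution_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return the features in an order where every dependency comes first.

    `dependencies` maps a feature name to the set of features it depends on.
    graphlib raises CycleError if the definitions form a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


def run_with_retries(job: Callable[[], T], max_retries: int = 2) -> T:
    """Run a job, retrying up to `max_retries` times before giving up."""
    attempts = 0
    while True:
        try:
            return job()
        except Exception:
            attempts += 1
            if attempts > max_retries:
                raise


if __name__ == "__main__":
    # Hypothetical feature dependencies, for illustration only.
    deps = {
        "conv_rank": {"conv_rate"},
        "conv_rate": {"views", "bookings"},
        "views": set(),
        "bookings": set(),
    }
    for feature in execution_order(deps):
        run_with_retries(lambda f=feature: print("computing", f))
```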
  • 10. FeatureStore Components & Data Flow
    [Diagram labels]
    DATA CAPTURE: User Funnel Activity Streams (Client-Side, Server-Side); Transactional Data (Booking Master); Master Datastore (Product Master, User Master, Device Master); New Data Streams
    COMPUTE + FSConfig: FSConfig: Feature Catalog; Feature Definitions; Job Scheduler; BT-Compute (BATCH Feature Compute Jobs); RT-Compute (Feature Compute)
    SERVING + STORAGE: SERVING API; Batch BULK API (DataFrame); Feature Storage (BoulderDB, REDIS)
    ML Automation: Sagemaker; TRAIN (Training + HPO); Deploy (Docker / Batch Transform); Offline Models; Online Models
  • 11. FSConfig: Feature Definitions & Metadata
    Feature Name: <Entity>::<Feature_shortname>::<Data Time Interval>::<Refresh Frequency>::<Version>
      Entity: <UserID>_<profileType>
      Short Name: listing_conversion_rank
      Versioning: v2 + Process: RT/BT
    FeatureGroup: (System Generated ID) 8fda73d1_2eee_4cfc_a20f_e9afb178fbc3
      Entity: ["uuid", "profile_type"]
      Features [Array]
      Time Window (Refresh / Data - Time duration): (ISO Time Interval) P1D
    Data Source [Array]: [user_master, txn_search]
      Data Store: GLUE/S3
      Database Name: blueshift
      Table Name: [user_master, txn_search]
    Data Sink: Serving [Array]
      Data Store: GLUE Catalog/S3/Redis/BoulderDb
      Database Name: rocksDB_<WAL Dir Path>
      Table Name: rocksDB_<columnFamily>
    Compute Logic
      DSL + Spark SQL: metric_expr, group_by_expr, filter_expr, window_function, window_function_alias
      Code (Python/Scala/Java): GIT/Gerrit URI
      Model (sagemaker) / Embedding
    Environment: Production
      Workspace: Dev/Staging/Production
      Namespace: <Project Name>
    Apache LIVY + Databricks JOBs API Config
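The feature naming convention above can be handled with a small parser. This is a sketch under assumptions: the `FeatureName` class and its field names are hypothetical, the five parts are treated as opaque strings (no ISO interval validation), and the example name is one of the column names from the historical schema on slide 12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureName:
    """Five '::'-separated parts, following the Feature Name template."""

    entity: str
    short_name: str
    data_interval: str
    refresh_frequency: str
    version: str

    @classmethod
    def parse(cls, name: str) -> "FeatureName":
        """Split a full feature name into its parts, rejecting malformed names."""
        parts = name.split("::")
        if len(parts) != 5 or not all(parts):
            raise ValueError(f"expected 5 non-empty '::'-separated parts: {name!r}")
        return cls(*parts)

    def __str__(self) -> str:
        """Rebuild the full feature name from its parts."""
        return "::".join(
            (
                self.entity,
                self.short_name,
                self.data_interval,
                self.refresh_frequency,
                self.version,
            )
        )


if __name__ == "__main__":
    parsed = FeatureName.parse("uuid_profileType::listing_conv_rank::P30D::P15M::v1")
    print(parsed.short_name, parsed.version)
```

Keeping the version inside the name means two versions of the same feature can sit side by side in a store without colliding.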
  • 12. FS Store | online + historical
    Output Schema (internal to the system)
    ● Historical feature data schema on S3 Parquet
      |-- entity: string (nullable = false)
      |-- uuid_profileType::listing_conv_rank::P30D::P15M::v1: long (nullable = false)
      |-- uuid_profileType::listing_view_rank::P30D::P15M::v1: long (nullable = false)
      |-- uuid_profileType::cnt_distinct_bk_bankid::P30D::P15M::v1: map (nullable = false)
      |    |-- key: string
      |    |-- value: integer (valueContainsNull = true)
      .. .. all features in that feature group
    ● Online serving data schema on REDIS + BoulderDB
      ○ Serving at feature group level
        Key -> <Entity_id>#<Feature_group_id>/<Feature_split>
        Value -> Hashes
          key -> Feature_name
          Value -> Feature_value
          TimeStamp -> Compute_Processed_Time
      ○ Serving at feature level
        Key -> <Entity_id>#<Feature_name>
        Value -> Hashes
          key -> Feature_name
          Value -> Feature_value
    SERVING Config
    - lambda (batch_feature_name linkage for RT features)
    - Support for linear QUERY DAGs
    - MVEL-based post-processing on any feature per service/model if needed
    Feature backfill (back_fill_required, back_fill_duration)
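The online key layout for feature-group serving can be sketched as follows. A dict of dicts stands in for Redis hashes so the example runs without a server. The function names and sample IDs are hypothetical, and storing the compute timestamp as a `TimeStamp` field inside the same hash is an assumption about how the slide's layout is realized.

```python
from typing import Dict, Mapping, Optional

# Stand-in for a Redis instance: key -> hash (field -> value).
HashStore = Dict[str, Dict[str, str]]


def group_key(entity_id: str, feature_group_id: str, feature_split: str) -> str:
    """Build the key <Entity_id>#<Feature_group_id>/<Feature_split>."""
    return f"{entity_id}#{feature_group_id}/{feature_split}"


def write_group(
    store: HashStore,
    entity_id: str,
    feature_group_id: str,
    feature_split: str,
    features: Mapping[str, object],
    processed_time: str,
) -> str:
    """Write all features of one group for one entity as a single hash."""
    fields = {name: str(value) for name, value in features.items()}
    fields["TimeStamp"] = processed_time  # assumed placement of the timestamp
    key = group_key(entity_id, feature_group_id, feature_split)
    store[key] = fields
    return key


def read_feature(store: HashStore, key: str, feature_name: str) -> Optional[str]:
    """Fetch one feature value from a group hash, or None if absent."""
    return store.get(key, {}).get(feature_name)


if __name__ == "__main__":
    store: HashStore = {}
    key = write_group(
        store,
        "u123_personal",  # hypothetical entity id
        "fg1",            # hypothetical feature group id
        "0",
        {"uuid_profileType::listing_conv_rank::P30D::P15M::v1": 7},
        "2019-08-17T10:00:00Z",
    )
    print(key, store[key])
```

With the redis-py client, the equivalent write would be `hset(key, mapping=fields)`. One hash per feature group lets a single read return every feature a model needs for an entity.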
  • 13. FS-BrokerAPI: Online Feature Serving Framework
    Endpoint: <uri>/v1/getFeatures (POST request)
    [Diagram labels: REQUEST HANDLER; Request Validations; Feature Definition; Orchestration Layer (Orchestration + Broker); AKKA (Actors); Business Logic + MVEL; Extractors; Transport; Data Access Layer; REDIS; BoulderDB; FeaturesbyName; FeaturesbyModel; FeaturesbyService]
  • 14. BoulderDB: Online Serving Store
    - Built on top of RocksDB (an embedded data store developed by Facebook), reducing the distance to data on the serving layer.
    - Post-processing steps added to the compute layer:
      - After processing data through Spark (distributed), the BT-Compute layer writes SST files from the various executors into shared object storage: S3
      - The Spark DataFrame is split into non-overlapping ranges: each split is sorted by KEY, then ingested into an SST file per partition / executor
    - Cluster coordinator: Consul
    - Atomic switching of DB snapshots
    - Data is sharded (helps with proximity by Namespace) and replicated (RF=2)
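The "non-overlapping ranges, each sorted by KEY" step can be illustrated in plain Python. This is a single-process sketch of the idea, not the Spark job or the SST writer used by BoulderDB; `split_into_sorted_ranges` and the sample rows are hypothetical. In the real pipeline each split would be written to its own SST file.

```python
from typing import Dict, List, Tuple


def split_into_sorted_ranges(
    rows: Dict[str, bytes], num_splits: int
) -> List[List[Tuple[str, bytes]]]:
    """Sort rows by key and cut them into contiguous, non-overlapping ranges.

    Every key in split i is smaller than every key in split i + 1, and each
    split is internally sorted, which is what an SST writer needs.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    ordered = sorted(rows.items())
    if not ordered:
        return []
    size = -(-len(ordered) // num_splits)  # ceiling division
    return [ordered[i : i + size] for i in range(0, len(ordered), size)]


if __name__ == "__main__":
    rows = {"u3": b"c", "u1": b"a", "u5": b"e", "u2": b"b", "u4": b"d"}
    for split in split_into_sorted_ranges(rows, num_splits=2):
        print(split[0][0], "..", split[-1][0], f"({len(split)} keys)")
```

Because the ranges do not overlap, each file can be produced independently by a different executor and loaded without having to merge against the others.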
  • 15. Tools
  • 16. Next Steps
    - Feature stats visualization / analytics & monitoring // Feature Catalog
    - Seamless integration with the Experimentation Framework
    - Per-user databases on top of the feature store for personalization
    - Notebook integration: better data science tools for Data Scientists, with Python libraries
    - Perf tools: query optimization & analysis
  • 17. References
    - https://www.logicalclocks.com/feature-store/
    - https://eng.uber.com/scaling-michelangelo/
    - Airbnb: Zipline
    - HopsML + Hopsworks
    - Go-JEK: FEAST
    - The Design of Systems for Real-time Prediction Serving | DataEngConf SF '18
    - https://medium.com/makemytrip-engineering
  • 18. Piyush Kumar
    E: piyush.kumar@makemytrip.com
    W: www.makemytrip.com
    T: https://twitter.com/piykumar
    Thank you!!