GraphReduce
Using graphs for feature engineering pipelines.
Wes Madrigal
Actual customer
Entity graph
But ML likes vectors not graphs...
ML needs this: [1, 6, 33.3, ‘product 5’, ‘opened notification’]
Not this:
Problem
● Prior to training machine learning models we need to generate features
● ML models need vectors of data, not fragmented tables scattered across the enterprise database
● Vectorizing many tables requires table joins, aggregate functions, etc.
● As the number of features grows, so does the likelihood of one-off boilerplate code (e.g., joins and group bys)
● Additionally, many features may share the same tables
● Without a reusable, composable interface for building and automating features, technical debt and system complexity increase.
● Note: Feature stores solve some of these problems, but they take a different route
Why does feature engineering complexity matter?
● The failure rate of AI projects is high (85%), so experiment speed matters.
○ https://www.gartner.com/en/newsroom/press-releases/2018-02-13-gartner-says-nearly-half-of-cios-are-planning-to-deploy-artificial-intelligence
● The cost of AI projects is high, so reusability, extensibility, and production readiness are of high importance.
○ https://www.phdata.io/blog/what-is-the-cost-to-deploy-and-maintain-a-machine-learning-model/
○ Bare bones without MLOps: $60K
○ With MLOps for 1 model: $95K
● The talent shortage exacerbates both problems.
○ https://www.forbes.com/sites/forbestechcouncil/2022/10/11/the-data-science-talent-gap-why-it-exists-and-what-businesses-can-do-about-it/?sh=3c63f6f23982
● Summary: If you don’t care, your boss does. If they don’t care, their boss does.
How do we vectorize the customer data graph?
● Customer
○ N = 100,000
● Orders
○ 2N = 200,000
● Order Events
○ 6N = 600,000
● Order Products
○ 10N = 1,000,000
● Notifications
○ 1000N = 100,000,000
● Notification Interactions
○ N^2 = 10,000,000,000
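The row counts above follow from N = 100,000 customers and the multipliers listed for each table. A quick check of that arithmetic (the snake_case table names are illustrative, not taken from a real schema):

```python
# Row counts implied by the slide's multipliers, with N = 100,000 customers.
N = 100_000

row_counts = {
    "customers": N,
    "orders": 2 * N,
    "order_events": 6 * N,
    "order_products": 10 * N,
    "notifications": 1000 * N,
    "notification_interactions": N ** 2,
}

# Print each table's size, then the total number of rows to flatten.
for table, rows in row_counts.items():
    print(f"{table:<28}{rows:>18,}")

total_rows = sum(row_counts.values())
print(f"{'total':<28}{total_rows:>18,}")
```

Almost all of the volume sits in the two notification tables, which is why the order in which tables are aggregated and joined matters.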
How do I update, modify, and maintain this?
select c.id as customer_id,
       nots.num_notifications,
       nots.total_interactions,
       nots.avg_interactions,
       nots.max_interactions,
       nots.min_interactions
from customers c
left join
(
    select n.customer_id,
           count(n.id) as num_notifications,
           sum(ni.num_interactions) as total_interactions,
           avg(ni.num_interactions) as avg_interactions,
           max(ni.num_interactions) as max_interactions,
           min(ni.num_interactions) as min_interactions
    from notifications n
    left join
    (
        select notification_id,
               count(id) as num_interactions
        from notification_interactions
        group by notification_id
    ) ni
    on n.id = ni.notification_id
    group by n.customer_id
) nots
on c.id = nots.customer_id
left join
(
    select o.id as order_id,
           o.customer_id,
           oe.num_order_events,
           oe.num_type_events
    from orders o
    left join
    (
        select order_id,
               count(id) as num_order_events,
               sum(case when event_type_id = 1 then 1 else 0 end) as num_type_events
        from order_events
        group by order_id
    ) oe
    on o.id = oe.order_id
    left join
    (
        select order_id,
               count(id) as num_order_products,
               sum(case when product_type_id = 5 then 1 else 0 end) as num_expensive_products,
               sum(product_price) as product_price_sum,
               max(product_price) - min(product_price) as product_price_range
        from order_products
        group by order_id
    ) op
    on o.id = op.order_id
) ods
on c.id = ods.customer_id
where c.is_high_value = 1
  and c.is_test = 0
  and c.some_other_filter = 'yes';
What about orientation in time?
WHERE some_col >= 'YYYY-MM-DD' AND some_col < 'YYYY-MM-DD'
…
…
select order_id,
       count(id) as num_order_products,
       sum(case when product_type_id = 5 then 1 else 0 end) as num_expensive_products,
       sum(product_price) as product_price_sum,
       max(product_price) - min(product_price) as product_price_range
from order_products
where ts > '2023-01-01' and ts < '2023-05-01'
group by order_id
Python (pandas): df[(df[col] >= 'YYYY-MM-DD') & (df[col] < 'YYYY-MM-DD')]
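Every table in the graph needs the same kind of date filter, so it is a natural candidate for a single reusable helper. A minimal sketch in plain Python, assuming a half-open window [start, end) as in the WHERE clause above; the function name `in_window` and the sample rows are hypothetical:

```python
from datetime import date


def in_window(rows, date_key, start, end):
    """Keep rows whose date falls in the half-open interval [start, end)."""
    return [row for row in rows if start <= row[date_key] < end]


# Hypothetical order_products rows; `ts` is the date column.
order_products = [
    {"id": 1, "order_id": 10, "ts": date(2022, 12, 31)},
    {"id": 2, "order_id": 10, "ts": date(2023, 1, 1)},
    {"id": 3, "order_id": 11, "ts": date(2023, 4, 30)},
    {"id": 4, "order_id": 11, "ts": date(2023, 5, 1)},
]

kept = in_window(order_products, "ts", date(2023, 1, 1), date(2023, 5, 1))
print([row["id"] for row in kept])
```

Parameterizing the date column means the same helper can be applied to each table, instead of repeating a hand-written filter per query.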
Solution
● Needs to work for tabular data
● Targets batch ML training, not the online case. If you need online feature engineering, consider feature stores.
● Need the following:
○ Reusable and scalable to many tables
○ Composable interface for switching tables used and feature vector computed
○ Orientation in time
○ Abstractions for repetitively implemented logic, such as joins, group bys, filters, etc.
○ Ability to support multiple feature definitions for the same table
○ Production-ready interface: no changes between experimentation and production MLOps
○ Ability to plug into multiple compute backends and extend easily to new backends
● Must flatten arbitrarily large enterprise data graphs
Solution 2
● Graphs can serve as the data structure for this problem by representing
tables as nodes and foreign keys as edges.
● By leveraging graph data structures we can plug into existing open source:
○ https://github.com/networkx
○ https://github.com/WestHealth/pyvis
● Some other companies have taken this approach with GNNs
○ https://kumo.ai
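To make the "tables as nodes, foreign keys as edges" idea concrete, here is a minimal sketch. It uses plain dictionaries rather than networkx so that it has no dependencies. The table and key names are hypothetical, modelled on the customer example. A depth-first postorder traversal gives an order in which every child table can be aggregated before it is joined into its parent:

```python
# Parent table -> list of (child table, foreign key on the child).
# Table and key names are hypothetical, modelled on the customer example.
edges = {
    "customers": [("orders", "customer_id"), ("notifications", "customer_id")],
    "orders": [("order_events", "order_id"), ("order_products", "order_id")],
    "notifications": [("notification_interactions", "notification_id")],
}


def reduce_order(root, edges):
    """Depth-first postorder: every child table appears before its parent."""
    order = []

    def visit(table):
        for child, _foreign_key in edges.get(table, []):
            visit(child)
        order.append(table)

    visit(root)
    return order


plan = reduce_order("customers", edges)
print(plan)
```

The same structure maps directly onto a directed graph in a graph library, where edge attributes can carry the foreign key and cardinality.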
Solution diagram
GraphReduce
● GraphReduce
○ Top-level class that subclasses nx.DiGraph and defines abstractions for:
○ Cut dates: the date around which to orient the data
○ Consideration period: the amount of time to consider
○ Compute layer: the compute layer to use
○ Abstractions for enforcing naming conventions and sequence
○ Edges between nodes and edge metadata (e.g., cardinality between nodes)
○ Compute graph specifications, such as whether to reduce a node or not
● GraphReduceNode
○ Custom class for each node, which allows parameterization of the following:
■ Primary key
■ Date key
■ File path
■ File format
■ Compute layer
■ Prefix
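The parameters listed above can be pictured with a small sketch. This is not the GraphReduce API: `NodeSpec`, `consideration_window`, and the sample values are hypothetical. It also makes two assumptions: that the prefix namespaces a node's output columns, and that the consideration period is the span of time ending at the cut date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class NodeSpec:
    """Hypothetical stand-in for a per-table node; not the GraphReduce API."""
    primary_key: str
    date_key: str
    file_path: str
    file_format: str
    compute_layer: str
    prefix: str

    def reduce(self, rows, group_key):
        """Count rows per group_key, naming the output with the node prefix."""
        counts = defaultdict(int)
        for row in rows:
            counts[row[group_key]] += 1
        return {key: {f"{self.prefix}_num_rows": n} for key, n in counts.items()}


def consideration_window(cut_date, consideration_days):
    """Assumed meaning: the `consideration_days` leading up to the cut date."""
    return cut_date - timedelta(days=consideration_days), cut_date


orders = NodeSpec(
    primary_key="id",
    date_key="ts",
    file_path="orders.csv",  # hypothetical path
    file_format="csv",
    compute_layer="python",  # hypothetical label
    prefix="ord",
)

rows = [
    {"id": 1, "customer_id": 7},
    {"id": 2, "customer_id": 7},
    {"id": 3, "customer_id": 9},
]
features = orders.reduce(rows, group_key="customer_id")
window = consideration_window(date(2023, 5, 1), 120)
print(features)
print(window)
```

Prefixing output columns per node is one way to avoid name collisions when several reduced tables are joined into the same parent.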
Demo
https://github.com/wesmadrigal/GraphReduce/blob/master/examples/cust_order_demo.ipynb
Case Study: FreightVerify https://freightverify.com
● Customer:
○ FreightVerify is an automotive supply chain monitoring SaaS solution with over 50 million shipments tracked and tens of billions of events received from carriers. The customer receives billions of coordinate updates per year and produces billions of ETAs (estimated times of arrival) for their customers’ supply chains.
● Problem:
○ Current models are more than 3 months stale and data sizes have outgrown technological capabilities. The task was to build a machine learning operations solution with feature engineering pipelines for all current and future ETA model architectures, extensible enough for model architectures beyond ETA.
● Solution:
○ After digesting the customer’s data layer, built a Spark-based feature engineering solution with a graph architecture, which abstracted most map/reduce operations, joins, filters, and annotations for feature engineering on more than 20 tables.
● Results:
○ Allowed for rapid build, test, deployment, and product integration of over 50 models. Time to market for new models was drastically reduced, model performance increased, and operational complexity was reduced. The customer can now have up-to-date models rebuilt daily to react quickly to changing global supply chain conditions.
Next steps
● Reducing boilerplate code required
● Supporting automated feature engineering on undefined nodes
● Dynamic upward propagation of aggregated features
● Potential integration with fugue: https://github.com/fugue-project/fugue
● Enhancements to visualization, graph serialization, and tracking
● Integration with other projects
