© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How Instacart’s Catalog Flourished While Hyper-Growing
Alex Charlton
Senior Engineer
Instacart/Catalog Engineering
ANT328-S
World’s largest grocery catalog
Evolution of Instacart’s Catalog
Functional data pipelines
Tools and techniques
Recap
Agenda
Grocery catalogs
High turnaround
Frequent sales
Many inputs
Inconsistent IDs
Legacy integrations
Instacart’s Catalog
× scale
Grocery catalogs
High turnaround
Frequent sales
Many inputs
Inconsistent IDs
Legacy integrations
Instacart’s Catalog
First catalog
Manual entry
Second catalog
Fragile data entry pipeline
Instacart’s Catalog
Current catalog
High volume
Correct
Robust
Flexible
Instacart’s Catalog
Built on a functional pipeline
High volume of stored data
Scalable compute
Instacart’s Catalog
Instacart’s Catalog
Enter Snowflake
Snowflake
Cheap storage separated from scalable compute
Everything is SQL
Fully managed
Many other nice features
Instacart’s Catalog
Functional data pipelines
Functional data pipelines
x => f(x) => y
A simple function
x1 => f(x) => y1
x2 => f(x) => y2
x3 => f(x) => y3
x4 => f(x) => y4
Inputs change over time
Functional data pipelines
? => ? => y4
A system with no history
Functional data pipelines
? => ? => y1
? => ? => y2
? => ? => y3
? => ? => y4
A system with audits
Functional data pipelines
? => ? => y1
? => ? => y2
? => ? => y3
x4 => ? => y4
A system with audits and the last input stored
Functional data pipelines
x1 => ? => y1
x2 => ? => y2
x3 => ? => y3
x4 => ? => y4
A system with audits and full input history
Functional data pipelines
Transparent
Deterministic
Reproducible history
Comprehensive
Functional data pipelines
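The properties above can be illustrated with a minimal sketch. It assumes a hypothetical pure transformation `f` and a hypothetical stored input history (`INPUT_HISTORY`); neither comes from the talk. Because every input is kept and `f` depends on nothing else, any historical output can be recomputed on demand.

```python
from typing import Callable, List, Tuple


def f(x: int) -> int:
    """Hypothetical pure transformation: no clock, no randomness, no hidden state."""
    return x * 2


# Hypothetical stored input history as (time, input) pairs.
INPUT_HISTORY: List[Tuple[int, int]] = [(1, 10), (2, 11), (3, 15), (4, 12)]


def replay(
    history: List[Tuple[int, int]], transform: Callable[[int], int]
) -> List[Tuple[int, int]]:
    """Recompute every historical output from the stored inputs."""
    return [(t, transform(x)) for t, x in history]


outputs = replay(INPUT_HISTORY, f)
```

Replaying the same history twice gives identical results, which is what makes the history reproducible rather than merely audited.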
Tools and techniques
Tracking history: Arranging data
pk value created_at
1001 a 1
1002 b 1
1003 c 2
1001 d 2
1002 e 3
1002 f 4
Data history
pk value created_at
1003 c 2
1001 d 2
1002 f 4
Data history
Tracking history: Arranging data
-- Keep only the newest row for each primary key.
select [columns] from (
  select [columns], row_number() over (
    partition by primary_key
    order by created_at desc
  ) as row_num  -- 1 = most recent row for this key
  from data_source
) where row_num = 1
Window over data history
Tracking history: Arranging data
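The window query can be mirrored in plain Python to make its effect concrete. This is only a sketch: the `Row` type and `latest_per_key` helper are illustrative names, and the rows are copied from the data history table shown earlier. When two rows for the same key share a `created_at`, this sketch keeps the first one seen, whereas the SQL version would pick one arbitrarily.

```python
from typing import Dict, Iterable, NamedTuple


class Row(NamedTuple):
    pk: int
    value: str
    created_at: int


# Sample rows copied from the data history table in the slides.
HISTORY = [
    Row(1001, "a", 1),
    Row(1002, "b", 1),
    Row(1003, "c", 2),
    Row(1001, "d", 2),
    Row(1002, "e", 3),
    Row(1002, "f", 4),
]


def latest_per_key(rows: Iterable[Row]) -> Dict[int, Row]:
    """Keep the newest row for each pk (the rows where row_number() = 1)."""
    latest: Dict[int, Row] = {}
    for row in rows:
        current = latest.get(row.pk)
        if current is None or row.created_at > current.created_at:
            latest[row.pk] = row
    return latest


current_state = latest_per_key(HISTORY)
```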
Snapshot
pk value created_at
1003 c 2
1001 d 2
1002 f 4
pk value created_at
1003 c 2
1001 d 2
1002 f 4
Data history
Tracking history: Arranging data
Tracking history: Arranging data
Snapshot 1 Snapshot 2 Snapshot 3
New data New data
pk value created_at
1003 c 2
1001 d 2
1002 f 4
Snapshot
+
pk value created_at
1004 g 5
1001 h 6
1003 i 6
Data source history
-- Latest row per key from the last snapshot plus anything newer.
select [columns] from (
  select [columns], row_number() over (
    partition by primary_key
    order by created_at desc
  ) as row_num
  from (
    select [columns] from snapshots
    where snapshot_at = [last snapshot]
    union all
    select [columns] from data_source
    where created_at > [last snapshot]
  )
) where row_num = 1
Tracking history: Arranging data
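A small Python sketch of the same idea: the next snapshot is computed from the previous snapshot plus only the rows that arrived after it. The helper names are illustrative, the rows come from the tables in the slides, and the snapshot time of 4 is an assumption (the newest `created_at` in the snapshot).

```python
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, str, int]  # (pk, value, created_at)


def latest_per_key(rows: Iterable[Row]) -> Dict[int, Row]:
    """Keep the newest row for each pk."""
    latest: Dict[int, Row] = {}
    for row in rows:
        pk, _, created_at = row
        if pk not in latest or created_at > latest[pk][2]:
            latest[pk] = row
    return latest


def next_snapshot(
    last_snapshot: Iterable[Row], source_rows: Iterable[Row], last_snapshot_at: int
) -> Dict[int, Row]:
    """Union the last snapshot with rows newer than it, then deduplicate."""
    new_rows = [row for row in source_rows if row[2] > last_snapshot_at]
    return latest_per_key(list(last_snapshot) + new_rows)


SNAPSHOT: List[Row] = [(1003, "c", 2), (1001, "d", 2), (1002, "f", 4)]
DATA_SOURCE: List[Row] = [
    (1001, "a", 1), (1002, "b", 1), (1003, "c", 2),
    (1001, "d", 2), (1002, "e", 3), (1002, "f", 4),
    (1004, "g", 5), (1001, "h", 6), (1003, "i", 6),
]

# Assumption: the last snapshot was taken at time 4.
snapshot_2 = next_snapshot(SNAPSHOT, DATA_SOURCE, last_snapshot_at=4)
```

The incremental result matches what a full scan of the history would produce, which is the sense in which a snapshot is only an optimization.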
-- Write the next snapshot, sorted by cluster_key.
insert into snapshots
select [columns] from (
  select [columns], row_number() over (
    partition by primary_key
    order by created_at desc
  ) as row_num
  from (
    select [columns] from snapshots
    where snapshot_at = [last snapshot]
    union all
    select [columns] from data_source
    where created_at > [last snapshot]
  )
) where row_num = 1
order by cluster_key
Creating an ordered snapshot
Tracking history: Arranging data
-- Read one cluster_key from the last snapshot plus newer rows.
select [columns] from (
  select [columns], row_number() over (
    partition by primary_key
    order by created_at desc
  ) as row_num
  from (
    select [columns] from (
      select [columns] from snapshots
      where snapshot_at = [last snapshot]
      union all
      select [columns] from data_source
      where created_at > [last snapshot]
    )
    where cluster_key = ?
  )
) where row_num = 1
Querying an ordered snapshot
Tracking history: Arranging data
Snapshot New data
Tracking history: Arranging data
Other options
Historical + current tables
Materialized views
Tracking history: Arranging data
Whoops, snapshots are state
Tracking history
Configuration and transformations as data
Tracking history
x, config => f(x, config) => y
Tracking history: Configuration and transformations as data
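One way to read `f(x, config)` is that configuration is versioned like any other input. The sketch below is illustrative only: `CONFIG_HISTORY`, `config_at`, and the `markup_percent` setting are hypothetical names, not from the talk. Looking up the configuration that was in effect at a given time lets an old output be recomputed exactly.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical configuration history as (effective_at, config) pairs, sorted by time.
CONFIG_HISTORY: List[Tuple[int, Dict[str, int]]] = [
    (1, {"markup_percent": 100}),
    (3, {"markup_percent": 110}),
]


def config_at(t: int) -> Dict[str, int]:
    """Return the configuration that was in effect at time t."""
    effective_times = [effective_at for effective_at, _ in CONFIG_HISTORY]
    index = bisect_right(effective_times, t) - 1
    if index < 0:
        raise LookupError(f"no configuration recorded at or before time {t}")
    return CONFIG_HISTORY[index][1]


def f(price_cents: int, config: Dict[str, int]) -> int:
    """Hypothetical transformation whose result depends on the configuration."""
    return price_cents * config["markup_percent"] // 100


y_before = f(250, config_at(2))  # uses the configuration from time 1
y_after = f(250, config_at(3))   # uses the configuration from time 3
```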
When is it important to track transformations?
Tracking history: Configuration and transformations as data
When is it important to track transformations?
When meaning has been altered
Tracking history: Configuration and transformations as data
Handling bad data
Amending versus removing bad data
Handling bad data: Amending
pk upstream_id transformation new_value created_at generated_at
1 1001 cc67d17 a 1 2
2 1002 cc67d17 b 1 2
4 1004 cc67d17 c 1 2
3 1005 28e54e5 d 3 4
4 1006 28e54e5 e 3 4
1 1007 bb2a8b7 f 5 6
4 1008 bb2a8b7 g 5 6
5 1009 bb2a8b7 h 5 6
pk upstream_id transformation new_value created_at generated_at
2 1002 cc67d17 b 1 2
3 1005 28e54e5 d 3 4
1 1007 bb2a8b7 f 5 6
4 1008 bb2a8b7 e 5 6
5 1009 bb2a8b7 h 5 6
Transformed data history
Snapshot
pk upstream_id transformation new_value created_at generated_at
…
3 1005 bb2a8b7 i 3 7
pk upstream_id transformation new_value created_at generated_at
2 1002 cc67d17 b 1 2
1 1007 bb2a8b7 f 5 6
4 1008 bb2a8b7 e 5 6
5 1009 bb2a8b7 h 5 6
Snapshot
Transformed data history
Handling bad data: Amending
pk transform_id new_value
61 1 a
62 1 b
63 1 c
61 2 d
62 3 e
63 3 f
61 3 g
62 4 h
63 4 i
Transform history
Transformed data history
id upstream_id config_id transformation
1 1001 301 cc67d17
2 1002 301 cc67d17
3 1003 302 cc67d17
4 1004 302 cc67d17
Handling bad data: Amending
id upstream_id config_id transformation
1 1001 301 cc67d17
2 1002 301 cc67d17
3 1003 302 cc67d17
4 1004 302 cc67d17
5 1003 303 cc67d17
6 1004 303 cc67d17
pk transform_id new_value
…
62 3 e
63 3 f
61 3 g
62 4 h
63 4 i
62 5 j
63 5 k
61 5 l
62 6 m
63 6 n
Transform history
Transformed data history
Handling bad data: Amending
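Amending can be sketched as appending new transformed rows rather than updating old ones. The rows below are copied from the transformed data history tables in the slides. The rule that the row with the highest `transform_id` is the current one is an assumption made for illustration, and `current_values` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Tuple

TransformedRow = Tuple[int, int, str]  # (pk, transform_id, new_value)

# Rows produced by transforms 1-4, copied from the slides.
TRANSFORMED_HISTORY: List[TransformedRow] = [
    (61, 1, "a"), (62, 1, "b"), (63, 1, "c"),
    (61, 2, "d"), (62, 3, "e"), (63, 3, "f"),
    (61, 3, "g"), (62, 4, "h"), (63, 4, "i"),
]

# Rows appended by the amending transforms 5 and 6.
AMENDMENT: List[TransformedRow] = [
    (62, 5, "j"), (63, 5, "k"), (61, 5, "l"),
    (62, 6, "m"), (63, 6, "n"),
]


def current_values(rows: Iterable[TransformedRow]) -> Dict[int, str]:
    """Assumption: the row with the highest transform_id wins for each pk."""
    latest: Dict[int, Tuple[int, str]] = {}
    for pk, transform_id, new_value in rows:
        if pk not in latest or transform_id > latest[pk][0]:
            latest[pk] = (transform_id, new_value)
    return {pk: value for pk, (_, value) in latest.items()}


before = current_values(TRANSFORMED_HISTORY)
after = current_values(TRANSFORMED_HISTORY + AMENDMENT)
```

Nothing is overwritten, so the state before the amendment stays queryable alongside the amended state.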
Historical data source
pk source_id value
1 101 a
2 101 b
3 101 c
1 102 d
3 102 e
4 102 f
2 103 g
3 103 h
Handling bad data: Removing
Deleted from source
source_id deleted_at context
102 1 Data provider sent incorrect prices in error. Liz has said they will correct the issue within three days
Historical data source
pk source_id value
1 101 a
2 101 b
3 101 c
1 102 d
3 102 e
4 102 f
2 103 g
3 103 h
Handling bad data: Removing
-- Latest row per key, skipping sources listed in deleted_from_data_source.
select [columns] from (
  select [columns], row_number() over (
    partition by primary_key
    order by created_at desc
  ) as row_num
  from data_source
  left join deleted_from_data_source
    on deleted_from_data_source.source_id = data_source.source_id
  where deleted_from_data_source.source_id is null  -- anti-join: no deletion record
) where row_num = 1
Handling bad data: Removing
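The anti-join can be sketched in Python using the rows from the historical data source table. Two assumptions are made for illustration: the sample table has no `created_at` column, so a larger `source_id` is treated as the newer delivery, and `current_state` is a hypothetical helper name. The bad rows stay in the history; they are only excluded when the current state is computed.

```python
from typing import Dict, Iterable, List, Set, Tuple

SourceRow = Tuple[int, int, str]  # (pk, source_id, value)

# Rows copied from the historical data source table in the slides.
ROWS: List[SourceRow] = [
    (1, 101, "a"), (2, 101, "b"), (3, 101, "c"),
    (1, 102, "d"), (3, 102, "e"), (4, 102, "f"),
    (2, 103, "g"), (3, 103, "h"),
]

# Deletion records keep the reason alongside the deleted source_id.
DELETED: Dict[int, str] = {102: "Data provider sent incorrect prices in error."}


def current_state(rows: Iterable[SourceRow], deleted_source_ids: Set[int]) -> Dict[int, str]:
    """Latest value per pk, ignoring rows from deleted sources.

    Assumption: a larger source_id means a newer delivery.
    """
    latest: Dict[int, Tuple[int, str]] = {}
    for pk, source_id, value in rows:
        if source_id in deleted_source_ids:
            continue  # anti-join: skip rows whose source has a deletion record
        if pk not in latest or source_id > latest[pk][0]:
            latest[pk] = (source_id, value)
    return {pk: value for pk, (_, value) in latest.items()}


with_bad_data = current_state(ROWS, set())
without_bad_data = current_state(ROWS, set(DELETED))
```

In this sample, pk 4 only ever appeared in the deleted source, so it drops out of the current state entirely.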
What does a snapshot represent?
The correct state for a point in time
or
the actual state at a given point in time
Handling bad data: Removing
Replacing bad snapshots
……
Handling bad data: Removing
Replacing bad snapshots
Remove bad data Replace history
……
Handling bad data: Removing
Amending bad snapshots
created at time 1, relative to time 1
created at time 2, relative to time 2
created at time 3, relative to time 3
Handling bad data: Removing
Amending bad snapshots
Remove bad data
…
created at time 1, relative to time 1
created at time 2, relative to time 2
created at time 4, relative to time 2
created at time 3, relative to time 3
created at time 4, relative to time 3
…
Handling bad data: Removing
Handling bad data: Bad data in snapshots
Handling bad data: Bad data in snapshots
Continuous integration and deployment for data
Data build systems
Build status
Output tables
New build ID
Input data
Code
Configurations
Validations
Create record of new build
Build process
Monitor for changes
Data build systems
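The build flow above might be sketched as follows. Everything here is an assumption made for illustration: the build ID is derived by hashing the input version, code version, and configuration, builds are recorded in an in-memory dict, and `build_id` and `run_build` are hypothetical helpers rather than the system described in the talk.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple

Validation = Tuple[str, Callable[[List[int]], bool]]


def build_id(input_version: str, code_version: str, config: Dict[str, Any]) -> str:
    """Derive a build ID from everything that can change the output."""
    payload = json.dumps(
        {"input": input_version, "code": code_version, "config": config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def run_build(
    builds: Dict[str, Dict[str, Any]],
    rows: List[int],
    input_version: str,
    code_version: str,
    config: Dict[str, Any],
    transform: Callable[[int, Dict[str, Any]], int],
    validations: List[Validation],
) -> Dict[str, Any]:
    """Build once per unique (input, code, config); record status and output."""
    key = build_id(input_version, code_version, config)
    if key in builds:
        return builds[key]  # nothing changed, so reuse the recorded build
    output = [transform(row, config) for row in rows]
    failures = [name for name, check in validations if not check(output)]
    builds[key] = {
        "build_id": key,
        "status": "failed" if failures else "passed",
        "failures": failures,
        "output": output,
    }
    return builds[key]


def markup(price_cents: int, config: Dict[str, Any]) -> int:
    """Hypothetical transformation."""
    return price_cents * config["markup_percent"] // 100


builds: Dict[str, Dict[str, Any]] = {}
checks: List[Validation] = [("no negative prices", lambda out: all(p >= 0 for p in out))]

first = run_build(builds, [100, 250], "input-v1", "cc67d17", {"markup_percent": 110}, markup, checks)
again = run_build(builds, [100, 250], "input-v1", "cc67d17", {"markup_percent": 110}, markup, checks)
changed = run_build(builds, [100, 250], "input-v1", "cc67d17", {"markup_percent": 120}, markup, checks)
```

Rerunning with unchanged inputs returns the recorded build, while any change to input, code, or configuration yields a new build ID and a new record.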
Data build systems | Snapshots
Comprehensive | Single-purpose optimization
Multiple outputs | Single output
Functional | Stateful
Data build systems
Recap
Tracking history
Historical data tables
Snapshots as an optimization
Configuration and transformations as first-class data
Recap
Handling bad data
Amending
Removing
Dealing with the state introduced by snapshots
Recap
Data build systems
Recap
Functional data pipelines
Transparent
Deterministic
Reproducible history
Comprehensive
Recap
Thank you!
Alex Charlton
alex.charlton@instacart.com

How Instacart’s Catalog Flourished While Hyper-Growing (ANT328-S) - AWS re:Invent 2018

  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How Instacart’s Catalog Flourished While Hyper-Growing. Alex Charlton, Senior Engineer, Instacart/Catalog Engineering. ANT328-S
  • 4. Agenda: Evolution of Instacart’s Catalog; Functional data pipelines; Tools and techniques; Recap
  • 5. Instacart’s Catalog. Grocery catalogs: high turnaround; frequent sales; many inputs; inconsistent IDs; legacy integrations
  • 6. Instacart’s Catalog. Grocery catalogs × scale: high turnaround; frequent sales; many inputs; inconsistent IDs; legacy integrations
  • 7. Instacart’s Catalog. First catalog: manual entry. Second catalog: fragile data entry pipeline
  • 9. Instacart’s Catalog. Built on a functional pipeline; high volume of stored data; scalable compute
  • 11. Instacart’s Catalog. Snowflake: cheap storage separated from scalable compute; everything is SQL; fully managed; many other nice features
  • 13. Functional data pipelines. A simple function: x => f(x) => y
  • 14. Functional data pipelines. Inputs change over time: x1 => f(x) => y1; x2 => f(x) => y2; x3 => f(x) => y3; x4 => f(x) => y4
  • 16. Functional data pipelines. A system with no history: ? => ? => y4
  • 17. Functional data pipelines. A system with audits: ? => ? => y1; ? => ? => y2; ? => ? => y3; ? => ? => y4
  • 18. Functional data pipelines. A system with audits and the last input stored: ? => ? => y1; ? => ? => y2; ? => ? => y3; x4 => ? => y4
  • 19. Functional data pipelines. A system with audits and full input history: x1 => ? => y1; x2 => ? => y2; x3 => ? => y3; x4 => ? => y4
  • 22. Tracking history: Arranging data. Data history:
        pk    value  created_at
        1001  a      1
        1002  b      1
        1003  c      2
        1001  d      2
        1002  e      3
        1002  f      4
  • 23. Tracking history: Arranging data. Data history:
        pk    value  created_at
        1003  c      2
        1001  d      2
        1002  f      4
  • 24. Tracking history: Arranging data. Window over data history:
        select [columns]
        from (
          select [columns],
            row_number() over (
              partition by primary_key
              order by created_at desc
            ) row
          from data_source
        )
        where row = 1
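The window query on slide 24 can be tried against the data history from slide 22. What follows is a minimal sketch, not the speaker's implementation: it runs the same query shape in SQLite (through Python's standard `sqlite3` module, which needs SQLite 3.25 or newer for window functions) instead of Snowflake, fills the `[columns]` placeholder with the three columns on the slide, and names the row-number column `rn`, which is an arbitrary choice made here.

```python
import sqlite3

# Data history from slide 22: every version of every row is kept.
HISTORY = [
    (1001, "a", 1),
    (1002, "b", 1),
    (1003, "c", 2),
    (1001, "d", 2),
    (1002, "e", 3),
    (1002, "f", 4),
]

# Same shape as the slide's query; "rn" is this sketch's name for the
# row-number column, and the final "order by pk" is only for stable output.
LATEST_SQL = """
select pk, value, created_at
from (
    select pk, value, created_at,
           row_number() over (
               partition by pk
               order by created_at desc
           ) as rn
    from data_source
)
where rn = 1
order by pk
"""


def latest_rows(history):
    """Return the newest version of each primary key in the history."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table data_source (pk integer, value text, created_at integer)"
    )
    conn.executemany("insert into data_source values (?, ?, ?)", history)
    rows = conn.execute(LATEST_SQL).fetchall()
    conn.close()
    return rows


current = latest_rows(HISTORY)
print(current)
```

The result holds the same three rows as the reduced table on slide 23: one row per primary key, each the most recent version.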
  • 25. Tracking history: Arranging data. Snapshot:
        pk    value  created_at
        1003  c      2
        1001  d      2
        1002  f      4
      Data history:
        pk    value  created_at
        1003  c      2
        1001  d      2
        1002  f      4
  • 26. Tracking history: Arranging data. [Diagram: Snapshot 1, Snapshot 2, Snapshot 3; new data arrives between snapshots]
  • 27. Tracking history: Arranging data. Snapshot:
        pk    value  created_at
        1003  c      2
        1001  d      2
        1002  f      4
      + Data source history:
        pk    value  created_at
        1004  g      5
        1001  h      6
        1003  i      6
      Query:
        select [columns]
        from (
          select [columns],
            row_number() over (
              partition by primary_key
              order by created_at desc
            ) row
          from (
            select [columns] from snapshots
            where snapshot_at = [last snapshot]
            union all
            select [columns] from data_source
            where created_at > [last snapshot]
          )
        )
        where row = 1
  • 28. Tracking history: Arranging data. Creating an ordered snapshot:
        insert into snapshots
        select [columns]
        from (
          select [columns],
            row_number() over (
              partition by primary_key
              order by created_at desc
            ) row
          from (
            select [columns] from snapshots
            where snapshot_at = [last snapshot]
            union all
            select [columns] from data_source
            where created_at > [last snapshot]
          )
        )
        where row = 1
        order by cluster_key
  • 29. Tracking history: Arranging data. Querying an ordered snapshot:
        select [columns]
        from (
          select [columns],
            row_number() over (
              partition by primary_key
              order by created_at desc
            ) row
          from (
            select [columns]
            from (
              select [columns] from snapshots
              where snapshot_at = [last snapshot]
              union all
              select [columns] from data_source
              where created_at > [last snapshot]
            )
            where cluster_key = ?
          )
        )
        where row = 1
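The snapshot-plus-new-data pattern from slides 27 and 28 can be sketched in the same way. The example below is an illustration under stated assumptions, not the production pipeline: it uses SQLite through Python's `sqlite3` module (SQLite 3.25 or newer), assumes the slide 27 snapshot was taken at time 4, names the row-number column `rn`, and leaves out `order by cluster_key` because SQLite has nothing like Snowflake's clustering. The helpers `build_db`, `current_state`, and `take_snapshot` are hypothetical names.

```python
import sqlite3

# Latest row per key, computed from the last snapshot plus only the rows
# that arrived after it (query shape from slide 27).
SNAPSHOT_SQL = """
select pk, value, created_at
from (
    select pk, value, created_at,
           row_number() over (
               partition by pk
               order by created_at desc
           ) as rn
    from (
        select pk, value, created_at from snapshots
        where snapshot_at = :last_snapshot
        union all
        select pk, value, created_at from data_source
        where created_at > :last_snapshot
    )
)
where rn = 1
order by pk
"""


def build_db():
    """Load the slide 27 example: a snapshot (assumed taken at time 4) and full history."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table data_source (pk integer, value text, created_at integer)"
    )
    conn.execute(
        "create table snapshots "
        "(pk integer, value text, created_at integer, snapshot_at integer)"
    )
    conn.executemany(
        "insert into data_source values (?, ?, ?)",
        [
            (1001, "a", 1), (1002, "b", 1), (1003, "c", 2),
            (1001, "d", 2), (1002, "e", 3), (1002, "f", 4),
            # Rows that arrived after the snapshot.
            (1004, "g", 5), (1001, "h", 6), (1003, "i", 6),
        ],
    )
    conn.executemany(
        "insert into snapshots values (?, ?, ?, 4)",
        [(1003, "c", 2), (1001, "d", 2), (1002, "f", 4)],
    )
    return conn


def current_state(conn, last_snapshot):
    """Current rows, derived from the given snapshot plus newer data."""
    return conn.execute(SNAPSHOT_SQL, {"last_snapshot": last_snapshot}).fetchall()


def take_snapshot(conn, last_snapshot, snapshot_at):
    """Store the current state as a new snapshot, as on slide 28."""
    rows = current_state(conn, last_snapshot)
    conn.executemany(
        "insert into snapshots values (?, ?, ?, ?)",
        [(pk, value, created_at, snapshot_at) for pk, value, created_at in rows],
    )
    return rows


conn = build_db()
print(current_state(conn, 4))
```

Taking a new snapshot at time 6 and querying relative to it should return the same rows, since a snapshot is only meant to save re-scanning older history.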
  • 30. Tracking history: Arranging data. Snapshot; new data
  • 31. Tracking history: Arranging data. Other options: historical + current tables; materialized views
  • 32. Tracking history. Whoops, snapshots are state
  • 33. Tracking history. Configuration and transformations as data
  • 34. Tracking history: Configuration and transformations as data. x, config => f(x, config) => y
  • 35. Tracking history: Configuration and transformations as data. When is it important to track transformations?
  • 36. Tracking history: Configuration and transformations as data. When is it important to track transformations? When meaning has been altered
  • 37. Handling bad data. Amending versus removing bad data
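The idea on slide 34, `x, config => f(x, config) => y`, can be illustrated with a small sketch. Everything named here is hypothetical (the `Config` fields, the `transform` logic, and the sample product name); only the config IDs 301 and 302 echo values from the slides. The point is just that when configuration is stored as versioned data and each output records which version produced it, any past output can be recomputed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One immutable configuration version (fields are made up for this sketch)."""
    config_id: int
    uppercase: bool
    suffix: str


def transform(x: str, config: Config) -> str:
    """Pure function: the output depends only on the input and the config."""
    y = x.strip()
    if config.uppercase:
        y = y.upper()
    return y + config.suffix


# Configuration history: versions are added, never edited in place.
CONFIG_HISTORY = {
    301: Config(301, uppercase=False, suffix=""),
    302: Config(302, uppercase=True, suffix=" (organic)"),
}

# Append-only output history that records which config produced each value.
output_history = []


def run(x: str, config_id: int) -> str:
    """Apply the transformation and record input, config version, and output."""
    y = transform(x, CONFIG_HISTORY[config_id])
    output_history.append({"x": x, "config_id": config_id, "y": y})
    return y


def replay(record: dict) -> str:
    """Recompute a past output from its stored input and config version."""
    return transform(record["x"], CONFIG_HISTORY[record["config_id"]])


run(" gala apples ", 301)
run(" gala apples ", 302)
print(output_history)
```

Because `transform` is deterministic and both of its arguments are kept, `replay` returns the recorded output for every entry in `output_history`.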
  • 38. Handling bad data: Amending. Transformed data history:
        pk  upstream_id  transformation  new_value  created_at  generated_at
        1   1001         cc67d17         a          1           2
        2   1002         cc67d17         b          1           2
        4   1004         cc67d17         c          1           2
        3   1005         28e54e5         d          3           4
        4   1006         28e54e5         e          3           4
        1   1007         bb2a8b7         f          5           6
        4   1008         bb2a8b7         g          5           6
        5   1009         bb2a8b7         h          5           6
      Snapshot:
        pk  upstream_id  transformation  new_value  created_at  generated_at
        2   1002         cc67d17         b          1           2
        3   1005         28e54e5         d          3           4
        1   1007         bb2a8b7         f          5           6
        4   1008         bb2a8b7         e          5           6
        5   1009         bb2a8b7         h          5           6
  • 39. Handling bad data: Amending. Transformed data history:
        pk  upstream_id  transformation  new_value  created_at  generated_at
        …
        3   1005         bb2a8b7         i          3           7
      Snapshot:
        pk  upstream_id  transformation  new_value  created_at  generated_at
        2   1002         cc67d17         b          1           2
        1   1007         bb2a8b7         f          5           6
        4   1008         bb2a8b7         e          5           6
        5   1009         bb2a8b7         h          5           6
  • 40. Handling bad data: Amending. Transform history:
        id  upstream_id  config_id  transformation
        1   1001         301        cc67d17
        2   1002         301        cc67d17
        3   1003         302        cc67d17
        4   1004         302        cc67d17
      Transformed data history:
        pk  transform_id  new_value
        61  1             a
        62  1             b
        63  1             c
        61  2             d
        62  3             e
        63  3             f
        61  3             g
        62  4             h
        63  4             i
  • 42. Handling bad data: Amending. Transform history:
        id  upstream_id  config_id  transformation
        1   1001         301        cc67d17
        2   1002         301        cc67d17
        3   1003         302        cc67d17
        4   1004         302        cc67d17
        5   1003         303        cc67d17
        6   1004         303        cc67d17
      Transformed data history:
        pk  transform_id  new_value
        …
        62  3             e
        63  3             f
        61  3             g
        62  4             h
        63  4             i
        62  5             j
        63  5             k
        61  5             l
        62  6             m
        63  6             n
  • 43. Handling bad data: Removing. Historical data source:
        pk  source_id  value
        1   101        a
        2   101        b
        3   101        c
        1   102        d
        3   102        e
        4   102        f
        2   103        g
        3   103        h
  • 44. Handling bad data: Removing. Deleted from source:
        source_id  deleted_at  context
        102        1           Data provider sent incorrect prices in error. Liz has said they will correct the issue within three days
      Historical data source:
        pk  source_id  value
        1   101        a
        2   101        b
        3   101        c
        1   102        d
        3   102        e
        4   102        f
        2   103        g
        3   103        h
  • 45. Handling bad data: Removing.
        select [columns]
        from (
          select [columns],
            row_number() over (
              partition by primary_key
              order by created_at desc
            ) row
          from data_source
          left join deleted_from_data_source
            on deleted_from_data_source.source_id = data_source.source_id
          where deleted_from_data_source.source_id is null
        )
        where row = 1
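The removal query on slide 45 can be exercised with the tables from slides 43 and 44. This is a sketch under assumptions: it runs in SQLite through Python's `sqlite3` module (SQLite 3.25 or newer) instead of Snowflake, and because the example table on slide 43 has no `created_at` column, it assumes that a higher `source_id` means a newer delivery and orders by that. The helper names `build_db` and `current_rows` are hypothetical.

```python
import sqlite3

# Anti-join against the deletion table, then keep the newest row per key.
# Ordering by source_id stands in for created_at (an assumption of this sketch).
REMOVING_SQL = """
select pk, source_id, value
from (
    select d.pk, d.source_id, d.value,
           row_number() over (
               partition by d.pk
               order by d.source_id desc
           ) as rn
    from data_source d
    left join deleted_from_data_source x
        on x.source_id = d.source_id
    where x.source_id is null
)
where rn = 1
order by pk
"""


def build_db():
    """Load the historical data source from slide 43 with an empty deletion table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table data_source (pk integer, source_id integer, value text)")
    conn.execute(
        "create table deleted_from_data_source "
        "(source_id integer, deleted_at integer, context text)"
    )
    conn.executemany(
        "insert into data_source values (?, ?, ?)",
        [
            (1, 101, "a"), (2, 101, "b"), (3, 101, "c"),
            (1, 102, "d"), (3, 102, "e"), (4, 102, "f"),
            (2, 103, "g"), (3, 103, "h"),
        ],
    )
    return conn


def current_rows(conn):
    """Newest row per key, ignoring any source marked as deleted."""
    return conn.execute(REMOVING_SQL).fetchall()


conn = build_db()
before = current_rows(conn)

# Mark source 102 as bad. Nothing is deleted from data_source itself.
conn.execute(
    "insert into deleted_from_data_source values (?, ?, ?)",
    (102, 1, "Data provider sent incorrect prices in error."),
)
after = current_rows(conn)
print(before)
print(after)
```

After source 102 is marked as deleted, key 1 falls back to its value from source 101 and key 4, which only ever appeared in source 102, drops out. The history table is untouched, so removing the deletion record restores the earlier result.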
  • 46. Handling bad data: Removing. What does a snapshot represent? The correct state for a point in time, or the actual state at a given point in time?
  • 48. Handling bad data: Removing. Replacing bad snapshots: remove bad data; replace history
  • 49. Handling bad data: Removing. Amending bad snapshots. [Diagram: snapshots created at time 1 relative to time 1; created at time 2 relative to time 2; created at time 3 relative to time 3]
  • 50. Handling bad data: Removing. Amending bad snapshots, remove bad data. [Diagram: snapshots created at time 1 relative to time 1; created at time 2 relative to time 2; created at time 4 relative to time 2; created at time 3 relative to time 3; created at time 4 relative to time 3]
  • 51. Handling bad data: Bad data in snapshots
  • 53. Data build systems. Continuous integration and deployment for data
  • 54. Data build systems. [Diagram labels: input data; code; configurations; validations; monitor for changes; new build ID; create record of new build; build process; output tables; build status]
  • 55. Data build systems versus snapshots: comprehensive versus single-purpose optimization; multiple outputs versus single output; functional versus stateful
  • 56. Recap
  • 57. Recap. Tracking history: historical data tables; snapshots as an optimization; configuration and transformations as first-class data
  • 58. Recap. Handling bad data: amending; removing; dealing with the state introduced by snapshots
  • 61. Thank you! Alex Charlton, alex.charlton@instacart.com
  • 62. Please complete the session survey in the mobile app.