In this session, we learn how Instacart reimagined its catalog data processing pipeline around Snowflake, the data warehouse built for the cloud. Instacart grew from a hand-entered catalog to one that processes billions of data points daily, and keeping pace with customer demand prompted an entirely new approach to the unique challenges of grocery catalog curation. Through Snowflake's architecture, which separates compute from storage, Instacart has increased its ability to scale quickly while improving the accuracy, traceability, and quality of its reporting. In turn, better information lets Instacart offer more customized grocery catalog options that delight its customers. This session is brought to you by AWS partner Snowflake Computing.
24. select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from data_source
    ) where row = 1
Window over data history
Tracking history: Arranging data
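The dedup pattern above can be run end to end against SQLite, which supports window functions since 3.25. The sketch below fills in illustrative column names and sample rows, and aliases the window column `rn` because `row` can clash with reserved words in some engines.

```python
import sqlite3

# Runnable sketch of the slide's pattern: keep only the newest row per
# primary key from an append-only history. Sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table data_source (primary_key integer, value text, created_at integer);
insert into data_source values
  (1003, 'a', 1), (1003, 'c', 2),   -- pk 1003 updated at time 2
  (1001, 'd', 2),
  (1002, 'b', 3), (1002, 'f', 4);   -- pk 1002 updated at time 4
""")

# Rank rows within each key by recency, then keep only rank 1.
rows = conn.execute("""
select primary_key, value, created_at from (
  select primary_key, value, created_at, row_number() over (
    partition by primary_key
    order by created_at desc
  ) rn
  from data_source
) where rn = 1
order by primary_key
""").fetchall()
print(rows)  # [(1001, 'd', 2), (1002, 'f', 4), (1003, 'c', 2)]
```

Nothing is ever updated in place: every change appends a new row, and "current state" is just a query over the history.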
25. Snapshot
pk    value  created_at
1003  c      2
1001  d      2
1002  f      4

pk    value  created_at
1003  c      2
1001  d      2
1002  f      4
Data history
Tracking history: Arranging data
27.
pk    value  created_at
1003  c      2
1001  d      2
1002  f      4
Snapshot
+
pk    value  created_at
1004  g      5
1001  h      6
1003  i      6
Data source history
select [columns] from (
  select [columns], row_number() over (
    partition by primary_key
    order by created_at desc
  ) row
  from (
    select [columns] from snapshots
    where snapshot_at = [last snapshot]
    union all
    select [columns] from data_source
    where created_at > [last snapshot]
  )
) where row = 1
Tracking history: Arranging data
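The incremental pattern above, which unions the last snapshot with only the source rows created since it was taken and then dedups, can be sketched the same way. The schema and sample rows below mirror the slide's tables but are assumptions for illustration.

```python
import sqlite3

# Sketch of the incremental snapshot: last snapshot UNION ALL newer
# source rows, then keep the newest row per key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table snapshots (primary_key integer, value text, created_at integer,
                        snapshot_at integer);
insert into snapshots values
  (1003, 'c', 2, 4), (1001, 'd', 2, 4), (1002, 'f', 4, 4);

create table data_source (primary_key integer, value text, created_at integer);
insert into data_source values
  (1004, 'g', 5), (1001, 'h', 6), (1003, 'i', 6);
""")

last_snapshot = 4
rows = conn.execute("""
select primary_key, value, created_at from (
  select primary_key, value, created_at, row_number() over (
    partition by primary_key
    order by created_at desc
  ) rn
  from (
    select primary_key, value, created_at from snapshots
    where snapshot_at = ?
    union all
    select primary_key, value, created_at from data_source
    where created_at > ?
  )
) where rn = 1
order by primary_key
""", (last_snapshot, last_snapshot)).fetchall()
print(rows)
# [(1001, 'h', 6), (1002, 'f', 4), (1003, 'i', 6), (1004, 'g', 5)]
```

This avoids re-scanning the full history on every snapshot: only rows newer than the last snapshot are read from the source.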
28. insert into snapshots
    select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from (
        select [columns] from snapshots
        where snapshot_at = [last snapshot]
        union all
        select [columns] from data_source
        where created_at > [last snapshot]
      )
    ) where row = 1
    order by cluster_key
Creating an ordered snapshot
Tracking history: Arranging data
29. select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from (
        select [columns] from (
          select [columns] from snapshots
          where snapshot_at = [last snapshot]
          union all
          select [columns] from data_source
          where created_at > [last snapshot]
        )
        where cluster_key = ?
      )
    ) where row = 1
Querying an ordered snapshot
Tracking history: Arranging data
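The two slides above pair an ordered write with a filtered read: writing the snapshot sorted by a cluster key lets Snowflake prune micro-partitions whose key range cannot match a later filter. The SQLite sketch below only illustrates the query shape; `cluster_key` and the sample rows are assumptions (e.g. a retailer or region id), and SQLite does no pruning.

```python
import sqlite3

# Build a snapshot physically ordered by cluster_key, then read one cluster.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table data_source (primary_key integer, cluster_key integer,
                          value text, created_at integer);
insert into data_source values
  (1001, 10, 'a', 1), (1001, 10, 'b', 3),
  (1002, 20, 'c', 2), (1003, 10, 'd', 2);
create table snapshots (primary_key integer, cluster_key integer,
                        value text, created_at integer);
""")

# Write the deduped snapshot, ordered by cluster_key.
conn.execute("""
insert into snapshots
select primary_key, cluster_key, value, created_at from (
  select primary_key, cluster_key, value, created_at, row_number() over (
    partition by primary_key order by created_at desc
  ) rn
  from data_source
) where rn = 1
order by cluster_key
""")

# Read back just one cluster; in Snowflake this scan would skip
# micro-partitions whose cluster_key range cannot match.
rows = conn.execute(
    "select primary_key, value from snapshots where cluster_key = ? "
    "order by primary_key", (10,)).fetchall()
print(rows)  # [(1001, 'b'), (1003, 'd')]
```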
38. Handling bad data: Amending
pk  upstream_id  transformation  new_value  created_at  generated_at
1   1001         cc67d17         a          1           2
2   1002         cc67d17         b          1           2
4   1004         cc67d17         c          1           2
3   1005         28e54e5         d          3           4
4   1006         28e54e5         e          3           4
1   1007         bb2a8b7         f          5           6
4   1008         bb2a8b7         g          5           6
5   1009         bb2a8b7         h          5           6

pk  upstream_id  transformation  new_value  created_at  generated_at
2   1002         cc67d17         b          1           2
3   1005         28e54e5         d          3           4
1   1007         bb2a8b7         f          5           6
4   1008         bb2a8b7         e          5           6
5   1009         bb2a8b7         h          5           6
Transformed data history
Snapshot
39.
pk  upstream_id  transformation  new_value  created_at  generated_at
…
3   1005         bb2a8b7         i          3           7

pk  upstream_id  transformation  new_value  created_at  generated_at
2   1002         cc67d17         b          1           2
1   1007         bb2a8b7         f          5           6
4   1008         bb2a8b7         e          5           6
5   1009         bb2a8b7         h          5           6
Snapshot
Transformed data history
Handling bad data: Amending
40.
pk  transform_id  new_value
61  1             a
62  1             b
63  1             c
61  2             d
62  3             e
63  3             f
61  3             g
62  4             h
63  4             i
Transform history
Transformed data history
id  upstream_id  config_id  transformation
1   1001         301        cc67d17
2   1002         301        cc67d17
3   1003         302        cc67d17
4   1004         302        cc67d17
Handling bad data: Amending
42.
id  upstream_id  config_id  transformation
1   1001         301        cc67d17
2   1002         301        cc67d17
3   1003         302        cc67d17
4   1004         302        cc67d17
5   1003         303        cc67d17
6   1004         303        cc67d17

pk  transform_id  new_value
…
62  3             e
63  3             f
61  3             g
62  4             h
63  4             i
62  5             j
63  5             k
61  5             l
62  6             m
63  6             n
Transform history
Transformed data history
Handling bad data: Amending
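One way to read the two tables above, assuming a higher transform id means a later run (so the amended runs 5 and 6 supersede runs 3 and 4), is to keep the newest transformed value per pk. The names below follow the slides, but these semantics, and omitting the join back to the transform-history table, are assumptions for illustration.

```python
import sqlite3

# Amending by re-running: bad transform output is never edited in place;
# a corrected config produces new transform runs whose rows win the dedup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table transformed (pk integer, transform_id integer, new_value text);
insert into transformed values
  (61, 1, 'a'), (62, 1, 'b'), (63, 1, 'c'),
  (61, 2, 'd'),
  (62, 3, 'e'), (63, 3, 'f'), (61, 3, 'g'),
  (62, 4, 'h'), (63, 4, 'i'),
  -- amended runs 5 and 6 supersede runs 3 and 4
  (62, 5, 'j'), (63, 5, 'k'), (61, 5, 'l'),
  (62, 6, 'm'), (63, 6, 'n');
""")

rows = conn.execute("""
select pk, new_value from (
  select pk, new_value, row_number() over (
    partition by pk order by transform_id desc
  ) rn
  from transformed
) where rn = 1
order by pk
""").fetchall()
print(rows)  # [(61, 'l'), (62, 'm'), (63, 'n')]
```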
43. Historical data source
pk  source_id  value
1   101        a
2   101        b
3   101        c
1   102        d
3   102        e
4   102        f
2   103        g
3   103        h
Handling bad data: Removing
44. Deleted from source
source_id  deleted_at  context
102        1           Data provider sent incorrect prices in error. Liz has said they will correct the issue within three days.
Historical data source
pk  source_id  value
1   101        a
2   101        b
3   101        c
1   102        d
3   102        e
4   102        f
2   103        g
3   103        h
Handling bad data: Removing
45. select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from data_source
      left join deleted_from_data_source
        on deleted_from_data_source.source_id = data_source.source_id
      where deleted_from_data_source.source_id is null
    ) where row = 1
Handling bad data: Removing
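The anti-join above (left join plus an `is null` filter) can be exercised with the sample rows from slides 43-44. Those rows carry no `created_at`, so this sketch orders by `source_id` as a stand-in for recency, which is an assumption.

```python
import sqlite3

# Removing bad data without deleting history: an append-only deletions
# table records which source batches were bad, and the anti-join drops
# them before deduplicating.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table data_source (pk integer, source_id integer, value text);
insert into data_source values
  (1, 101, 'a'), (2, 101, 'b'), (3, 101, 'c'),
  (1, 102, 'd'), (3, 102, 'e'), (4, 102, 'f'),
  (2, 103, 'g'), (3, 103, 'h');
create table deleted_from_data_source (source_id integer, deleted_at integer,
                                       context text);
insert into deleted_from_data_source values
  (102, 1, 'Data provider sent incorrect prices in error');
""")

rows = conn.execute("""
select pk, value from (
  select pk, value, row_number() over (
    partition by pk order by data_source.source_id desc
  ) rn
  from data_source
  left join deleted_from_data_source
    on deleted_from_data_source.source_id = data_source.source_id
  where deleted_from_data_source.source_id is null
) where rn = 1
order by pk
""").fetchall()
print(rows)  # [(1, 'a'), (2, 'g'), (3, 'h')]
```

Batch 102 vanishes from the result: pk 4 (only present in 102) disappears, and pk 1 falls back to its batch-101 value, while the deletions table keeps a record of what was removed and why.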
46. What does a snapshot represent?
The correct state for a point in time,
or
the actual state at a given point in time?
Handling bad data: Removing
49. Amending bad snapshots
[Diagram: a chain of snapshots, one created at each of times 1, 2, and 3, each relative to its own time.]
Handling bad data: Removing
50. Amending bad snapshots
[Diagram: after the bad data is removed, replacement snapshots are created at time 4, relative to times 2 and 3, alongside the original snapshots created at times 1 through 3.]
Handling bad data: Removing
54. Data build systems
[Diagram: the build process takes input data, code, and configurations; creates a record of the new build (new build ID, build status); runs validations; produces output tables; and monitors for changes.]
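A minimal sketch of the "create record of new build" step above: derive a deterministic build ID from the inputs, code version, and configuration, so every output table can be traced back to exactly what produced it. The fields and hashing scheme here are assumptions for illustration, not Instacart's actual system.

```python
import hashlib
import json
import time

def new_build_record(input_snapshots, code_version, config):
    """Record what goes into a build before running it (hypothetical schema)."""
    # Hash a canonical serialization of the build's inputs so the same
    # inputs + code + config always yield the same build ID.
    payload = json.dumps(
        {"inputs": sorted(input_snapshots),
         "code": code_version,
         "config": config},
        sort_keys=True)
    build_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return {"build_id": build_id,
            "status": "started",
            "started_at": time.time(),
            "inputs": input_snapshots,
            "code": code_version,
            "config": config}

record = new_build_record([4, 7], "cc67d17", {"region": "us"})
print(record["build_id"])  # deterministic for the same inputs
```

Because the ID is a pure function of the build's inputs, re-running with unchanged inputs is detectable, which is what makes the build functional rather than stateful.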
55. Data build systems    Snapshots
    Comprehensive         Single-purpose optimization
    Multiple outputs      Single output
    Functional            Stateful
Data build systems