In this session, we learn how Instacart reimagined its catalog data processing pipeline around Snowflake, the data warehouse built for the cloud. Instacart grew from a hand-entered catalog to one that processes billions of data points daily, and keeping pace with customer demand prompted an entirely new approach to the unique challenges of grocery catalog curation. Through Snowflake's architecture, which separates compute from storage, Instacart has increased its ability to scale quickly while improving the accuracy, traceability, and quality of its reporting. In turn, better information lets Instacart offer more customized grocery catalog options that delight its customers. This session is brought to you by AWS partner Snowflake Computing.
24. select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from data_source
    ) where row = 1
Window over data history
Tracking history: Arranging data
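The dedup pattern above can be run end to end against SQLite, which supports window functions since 3.25. The sketch below fills in illustrative column names and sample rows, and aliases the window column `rn` because `row` can clash with reserved words in some engines.

```python
import sqlite3

# Runnable sketch of the slide's pattern: keep only the newest row per
# primary key from an append-only history. Sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table data_source (primary_key integer, value text, created_at integer);
insert into data_source values
  (1003, 'a', 1), (1003, 'c', 2),   -- pk 1003 updated at time 2
  (1001, 'd', 2),
  (1002, 'b', 3), (1002, 'f', 4);   -- pk 1002 updated at time 4
""")

# Rank rows within each key by recency, then keep only rank 1.
rows = conn.execute("""
select primary_key, value, created_at from (
  select primary_key, value, created_at, row_number() over (
    partition by primary_key
    order by created_at desc
  ) rn
  from data_source
) where rn = 1
order by primary_key
""").fetchall()
print(rows)  # [(1001, 'd', 2), (1002, 'f', 4), (1003, 'c', 2)]
```

Nothing is ever updated in place: every change appends a new row, and "current state" is just a query over the history.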
25. Snapshot
pk    value  created_at
1003  c      2
1001  d      2
1002  f      4

pk    value  created_at
1003  c      2
1001  d      2
1002  f      4
Data history
Tracking history: Arranging data
27.
pk    value  created_at
1003  c      2
1001  d      2
1002  f      4
Snapshot
+
pk    value  created_at
1004  g      5
1001  h      6
1003  i      6
Data source history
select [columns] from (
  select [columns], row_number() over (
    partition by primary_key
    order by created_at desc
  ) row
  from (
    select [columns] from snapshots
    where snapshot_at = [last snapshot]
    union all
    select [columns] from data_source
    where created_at > [last snapshot]
  )
) where row = 1
Tracking history: Arranging data
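The incremental pattern above, which unions the last snapshot with only the source rows created since it was taken and then dedups, can be sketched the same way. The schema and sample rows below mirror the slide's tables but are assumptions for illustration.

```python
import sqlite3

# Sketch of the incremental snapshot: last snapshot UNION ALL newer
# source rows, then keep the newest row per key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table snapshots (primary_key integer, value text, created_at integer,
                        snapshot_at integer);
insert into snapshots values
  (1003, 'c', 2, 4), (1001, 'd', 2, 4), (1002, 'f', 4, 4);

create table data_source (primary_key integer, value text, created_at integer);
insert into data_source values
  (1004, 'g', 5), (1001, 'h', 6), (1003, 'i', 6);
""")

last_snapshot = 4
rows = conn.execute("""
select primary_key, value, created_at from (
  select primary_key, value, created_at, row_number() over (
    partition by primary_key
    order by created_at desc
  ) rn
  from (
    select primary_key, value, created_at from snapshots
    where snapshot_at = ?
    union all
    select primary_key, value, created_at from data_source
    where created_at > ?
  )
) where rn = 1
order by primary_key
""", (last_snapshot, last_snapshot)).fetchall()
print(rows)
# [(1001, 'h', 6), (1002, 'f', 4), (1003, 'i', 6), (1004, 'g', 5)]
```

This avoids re-scanning the full history on every snapshot: only rows newer than the last snapshot are read from the source.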
28. insert into snapshots
    select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from (
        select [columns] from snapshots
        where snapshot_at = [last snapshot]
        union all
        select [columns] from data_source
        where created_at > [last snapshot]
      )
    ) where row = 1
    order by cluster_key
Creating an ordered snapshot
Tracking history: Arranging data
29. select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from (
        select [columns] from (
          select [columns] from snapshots
          where snapshot_at = [last snapshot]
          union all
          select [columns] from data_source
          where created_at > [last snapshot]
        )
        where cluster_key = ?
      )
    ) where row = 1
Querying an ordered snapshot
Tracking history: Arranging data
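The two slides above pair an ordered write with a filtered read: writing the snapshot sorted by a cluster key lets Snowflake prune micro-partitions whose key range cannot match a later filter. The SQLite sketch below only illustrates the query shape; `cluster_key` and the sample rows are assumptions (e.g. a retailer or region id), and SQLite does no pruning.

```python
import sqlite3

# Build a snapshot physically ordered by cluster_key, then read one cluster.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table data_source (primary_key integer, cluster_key integer,
                          value text, created_at integer);
insert into data_source values
  (1001, 10, 'a', 1), (1001, 10, 'b', 3),
  (1002, 20, 'c', 2), (1003, 10, 'd', 2);
create table snapshots (primary_key integer, cluster_key integer,
                        value text, created_at integer);
""")

# Write the deduped snapshot, ordered by cluster_key.
conn.execute("""
insert into snapshots
select primary_key, cluster_key, value, created_at from (
  select primary_key, cluster_key, value, created_at, row_number() over (
    partition by primary_key order by created_at desc
  ) rn
  from data_source
) where rn = 1
order by cluster_key
""")

# Read back just one cluster; in Snowflake this scan would skip
# micro-partitions whose cluster_key range cannot match.
rows = conn.execute(
    "select primary_key, value from snapshots where cluster_key = ? "
    "order by primary_key", (10,)).fetchall()
print(rows)  # [(1001, 'b'), (1003, 'd')]
```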
38. Handling bad data: Amending
pk  upstream_id  transformation  new_value  created_at  generated_at
1   1001         cc67d17         a          1           2
2   1002         cc67d17         b          1           2
4   1004         cc67d17         c          1           2
3   1005         28e54e5         d          3           4
4   1006         28e54e5         e          3           4
1   1007         bb2a8b7         f          5           6
4   1008         bb2a8b7         g          5           6
5   1009         bb2a8b7         h          5           6

pk  upstream_id  transformation  new_value  created_at  generated_at
2   1002         cc67d17         b          1           2
3   1005         28e54e5         d          3           4
1   1007         bb2a8b7         f          5           6
4   1008         bb2a8b7         e          5           6
5   1009         bb2a8b7         h          5           6
Transformed data history
Snapshot
39.
pk  upstream_id  transformation  new_value  created_at  generated_at
…
3   1005         bb2a8b7         i          3           7

pk  upstream_id  transformation  new_value  created_at  generated_at
2   1002         cc67d17         b          1           2
1   1007         bb2a8b7         f          5           6
4   1008         bb2a8b7         e          5           6
5   1009         bb2a8b7         h          5           6
Snapshot
Transformed data history
Handling bad data: Amending
40.
pk  transform_id  new_value
61  1             a
62  1             b
63  1             c
61  2             d
62  3             e
63  3             f
61  3             g
62  4             h
63  4             i
Transform history
Transformed data history
id  upstream_id  config_id  transformation
1   1001         301        cc67d17
2   1002         301        cc67d17
3   1003         302        cc67d17
4   1004         302        cc67d17
Handling bad data: Amending
42.
id  upstream_id  config_id  transformation
1   1001         301        cc67d17
2   1002         301        cc67d17
3   1003         302        cc67d17
4   1004         302        cc67d17
5   1003         303        cc67d17
6   1004         303        cc67d17

pk  transform_id  new_value
…
62  3             e
63  3             f
61  3             g
62  4             h
63  4             i
62  5             j
63  5             k
61  5             l
62  6             m
63  6             n
Transform history
Transformed data history
Handling bad data: Amending
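One way to read the two tables above, assuming a higher transform id means a later run (so the amended runs 5 and 6 supersede runs 3 and 4), is to keep the newest transformed value per pk. The names below follow the slides, but these semantics, and omitting the join back to the transform-history table, are assumptions for illustration.

```python
import sqlite3

# Amending by re-running: bad transform output is never edited in place;
# a corrected config produces new transform runs whose rows win the dedup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table transformed (pk integer, transform_id integer, new_value text);
insert into transformed values
  (61, 1, 'a'), (62, 1, 'b'), (63, 1, 'c'),
  (61, 2, 'd'),
  (62, 3, 'e'), (63, 3, 'f'), (61, 3, 'g'),
  (62, 4, 'h'), (63, 4, 'i'),
  -- amended runs 5 and 6 supersede runs 3 and 4
  (62, 5, 'j'), (63, 5, 'k'), (61, 5, 'l'),
  (62, 6, 'm'), (63, 6, 'n');
""")

rows = conn.execute("""
select pk, new_value from (
  select pk, new_value, row_number() over (
    partition by pk order by transform_id desc
  ) rn
  from transformed
) where rn = 1
order by pk
""").fetchall()
print(rows)  # [(61, 'l'), (62, 'm'), (63, 'n')]
```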
43. Historical data source
pk  source_id  value
1   101        a
2   101        b
3   101        c
1   102        d
3   102        e
4   102        f
2   103        g
3   103        h
Handling bad data: Removing
44. Deleted from source
source_id  deleted_at  context
102        1           Data provider sent incorrect prices in error. Liz has said they will correct the issue within three days.
Historical data source
pk  source_id  value
1   101        a
2   101        b
3   101        c
1   102        d
3   102        e
4   102        f
2   103        g
3   103        h
Handling bad data: Removing
45. select [columns] from (
      select [columns], row_number() over (
        partition by primary_key
        order by created_at desc
      ) row
      from data_source
      left join deleted_from_data_source
        on deleted_from_data_source.source_id = data_source.source_id
      where deleted_from_data_source.source_id is null
    ) where row = 1
Handling bad data: Removing
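The anti-join above (left join plus an `is null` filter) can be exercised with the sample rows from slides 43-44. Those rows carry no `created_at`, so this sketch orders by `source_id` as a stand-in for recency, which is an assumption.

```python
import sqlite3

# Removing bad data without deleting history: an append-only deletions
# table records which source batches were bad, and the anti-join drops
# them before deduplicating.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table data_source (pk integer, source_id integer, value text);
insert into data_source values
  (1, 101, 'a'), (2, 101, 'b'), (3, 101, 'c'),
  (1, 102, 'd'), (3, 102, 'e'), (4, 102, 'f'),
  (2, 103, 'g'), (3, 103, 'h');
create table deleted_from_data_source (source_id integer, deleted_at integer,
                                       context text);
insert into deleted_from_data_source values
  (102, 1, 'Data provider sent incorrect prices in error');
""")

rows = conn.execute("""
select pk, value from (
  select pk, value, row_number() over (
    partition by pk order by data_source.source_id desc
  ) rn
  from data_source
  left join deleted_from_data_source
    on deleted_from_data_source.source_id = data_source.source_id
  where deleted_from_data_source.source_id is null
) where rn = 1
order by pk
""").fetchall()
print(rows)  # [(1, 'a'), (2, 'g'), (3, 'h')]
```

Batch 102 vanishes from the result: pk 4 (only present in 102) disappears, and pk 1 falls back to its batch-101 value, while the deletions table keeps a record of what was removed and why.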
46. What does a snapshot represent?
The correct state for a point in time,
or
the actual state at a given point in time?
Handling bad data: Removing
49. Amending bad snapshots
[Diagram: a chain of snapshots, one created at each of times 1, 2, and 3, each relative to its own time.]
Handling bad data: Removing
50. Amending bad snapshots
[Diagram: after the bad data is removed, replacement snapshots are created at time 4, relative to times 2 and 3, alongside the original snapshots created at times 1 through 3.]
Handling bad data: Removing
54. Data build systems
[Diagram: the build process takes input data, code, and configurations; creates a record of the new build (new build ID, build status); runs validations; produces output tables; and monitors for changes.]
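A minimal sketch of the "create record of new build" step above: derive a deterministic build ID from the inputs, code version, and configuration, so every output table can be traced back to exactly what produced it. The fields and hashing scheme here are assumptions for illustration, not Instacart's actual system.

```python
import hashlib
import json
import time

def new_build_record(input_snapshots, code_version, config):
    """Record what goes into a build before running it (hypothetical schema)."""
    # Hash a canonical serialization of the build's inputs so the same
    # inputs + code + config always yield the same build ID.
    payload = json.dumps(
        {"inputs": sorted(input_snapshots),
         "code": code_version,
         "config": config},
        sort_keys=True)
    build_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return {"build_id": build_id,
            "status": "started",
            "started_at": time.time(),
            "inputs": input_snapshots,
            "code": code_version,
            "config": config}

record = new_build_record([4, 7], "cc67d17", {"region": "us"})
print(record["build_id"])  # deterministic for the same inputs
```

Because the ID is a pure function of the build's inputs, re-running with unchanged inputs is detectable, which is what makes the build functional rather than stateful.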
55. Data build systems    Snapshots
    Comprehensive         Single-purpose optimization
    Multiple outputs      Single output
    Functional            Stateful
Data build systems