This presentation describes Wix's approach to automating data warehouse maintenance, called Data Warehouse Automation (DWHA). The solution has three components: BI Bank, Metric Collector, and DWH Automation itself. BI Bank defines data sources and semantics. Metric Collector extracts metrics from the sources efficiently. DWHA detects what changed between runs, aggregates the data, and handles different table types. The presenter demonstrates how DWHA streamlines maintenance by handling differences between runs automatically, then covers additional DWHA capabilities and plans for a UI.
2. Hi, I’m Maayan
→ Senior Big Data Engineer by day
→ A metal a cappella singer by night
→ Have been designing and developing Big Data platforms for 6 years
Solving Data Engineers Velocity | June 2023
3. Agenda
→ DE work today
→ Wix’s Solution
→ Demo of DWH Automation
→ Additional DWHA Capabilities
→ What’s next?
→ The 3 Components of the Solution
12. Wix’s Solution
BI Bank
→ Unifies the semantics
→ Defines the sources
Metric Collector
→ Unifies the way we collect the sources
→ Efficient
Data Warehouse Automation
→ Unifies DWH table definitions
→ Handles differences, so maintenance becomes easy
→ Handles the aggregations
13. Wix Engineering Locations
→ EU: Vilnius, Krakow, Berlin, Amsterdam
→ Ukraine: Kyiv, Dnipro, Lviv
→ Israel: Tel-Aviv, Be’er Sheva, Haifa
→ ROW: USA, Canada
14. BI Bank
Rules
→ The source: the “from” and “where” of the query
KPIs
→ Add a domain to a set of rules, e.g. sites, users
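A minimal sketch of how a rule and a KPI could fit together, assuming a rule holds the "from" and "where" of a query and a KPI attaches a domain to a set of rules. The class names, field names, table name, and the mapping of the "sites" domain to a `site_id` column are all hypothetical; the slides do not show BI Bank's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A source definition: the FROM and WHERE of a query (hypothetical shape)."""
    name: str
    from_table: str
    where: str


@dataclass(frozen=True)
class Kpi:
    """Attaches a domain column (e.g. sites, users) to a set of rules (hypothetical shape)."""
    name: str
    domain_column: str
    rules: tuple


def kpi_to_sql(kpi):
    """Render one SELECT per rule, keyed by the KPI's domain column."""
    return [
        f"SELECT {kpi.domain_column} FROM {rule.from_table} WHERE {rule.where}"
        for rule in kpi.rules
    ]


# Illustrative data only; the table and filter are made up.
premium_purchase = Rule("premium_purchase", "events.purchases", "plan = 'premium'")
premium_sites = Kpi("premium_sites", "site_id", (premium_purchase,))
queries = kpi_to_sql(premium_sites)
```

Keeping the source definition (rule) separate from the domain (KPI) lets the same rule be reused for several domains without redefining the source.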
15. Metric Collector
• Helps us read from each source only once
• Used in multiple data platforms inside the company
16. [Diagram: Metric Collector flow, five numbered steps. Labels in extraction order:]
→ Get a list of metrics objects
→ Extract all relevant sources
→ Load all sources in parallel
→ Generate DFs per source
→ Validate Metrics (optional)
→ Store to S3 / Return a DF
→ Regular OR Union all DFs
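The flow above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sources are hypothetical in-memory lists rather than real tables, a "DF" is a plain list of row dicts, and the metric fields (`name`, `source`, `column`) are invented for the example. Validation and the S3 write are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory sources standing in for real tables.
SOURCES = {
    "purchases": [
        {"site_id": 1, "amount": 10, "items": 2},
        {"site_id": 2, "amount": 5, "items": 1},
    ],
    "visits": [{"site_id": 1, "pages": 3}],
}

# Hypothetical metric objects: two of them share the "purchases" source.
METRICS = [
    {"name": "revenue", "source": "purchases", "column": "amount"},
    {"name": "items_sold", "source": "purchases", "column": "items"},
    {"name": "page_views", "source": "visits", "column": "pages"},
]


def load_source(name):
    """Stand-in for an expensive read of one source."""
    return SOURCES[name]


def collect(metrics, union=True):
    # Extract the distinct sources, so each one is read only once.
    sources = sorted({m["source"] for m in metrics})
    # Load all sources in parallel.
    with ThreadPoolExecutor() as pool:
        loaded = dict(zip(sources, pool.map(load_source, sources)))
    # Generate one "DF" per source, covering every metric that uses it.
    frames = {
        src: [
            {"metric": m["name"], "site_id": row["site_id"], "value": row[m["column"]]}
            for m in metrics
            if m["source"] == src
            for row in loaded[src]
        ]
        for src in sources
    }
    if not union:
        return frames
    # Union all DFs into a single result.
    return [row for src in sources for row in frames[src]]


result = collect(METRICS)
```

Deduplicating sources before loading is what keeps each source to a single read even when several metrics depend on it.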
17. DWH Automation
Standard flow
1 Understands the difference between yesterday’s run and today’s run from Iceberg metadata
2 Reads the raw data using the Metric Collector
3 Aggregates
4 Adds table-type-related logic
→ LEAD/LAG for SCD
→ History join for Dim
5 Writes data + metadata
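To illustrate the LEAD step for a slowly changing (type 2) table, here is a minimal sketch in plain Python rather than SQL or Spark. It assumes each change is recorded as a (key, start, value) tuple with ISO date strings; the output column names and the "current"/"closed" status values are assumptions, not DWHA's actual schema.

```python
def add_scd2_columns(history):
    """Derive stop and status for each version from the next version's start.

    history: iterable of (key, start, value) tuples, one per observed change.
    """
    by_key = {}
    for key, start, value in sorted(history):
        by_key.setdefault(key, []).append((start, value))

    rows = []
    for key, versions in by_key.items():
        for i, (start, value) in enumerate(versions):
            # Equivalent of LEAD(start) OVER (PARTITION BY key ORDER BY start):
            # the next version's start closes the current version.
            stop = versions[i + 1][0] if i + 1 < len(versions) else None
            rows.append({
                "key": key,
                "value": value,
                "start": start,
                "stop": stop,
                "status": "current" if stop is None else "closed",
            })
    return rows


# Illustrative data only.
history = [
    ("site-1", "2023-01-05", "premium"),
    ("site-1", "2023-01-01", "free"),
    ("site-2", "2023-01-03", "free"),
]
scd_rows = add_scd2_columns(history)
```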
18. DWH Automation
Difference deep dive
→ Aggregation column additions, renames and deletions
→ Changes in the sources, a.k.a. the BI Bank rules and KPIs
→ Changes in the table configuration:
● Table name
● Start time/days back
● Group by (PK) columns
● Owner
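A rough sketch of comparing yesterday's and today's table definitions. The slides say DWHA derives the difference from Iceberg metadata; this example uses plain dicts instead, and the config keys, change labels, and rename heuristic (a deleted column whose expression reappears under a new name) are all assumptions made for illustration.

```python
def diff_table_configs(previous, current):
    """Return a list of (change_type, detail) tuples between two table configs."""
    changes = []

    # Table-level settings.
    for field in ("table_name", "start_time", "group_by", "owner"):
        if previous[field] != current[field]:
            changes.append(("config_changed", field))

    # Aggregation columns, given as {column_name: expression}.
    prev_cols, curr_cols = previous["columns"], current["columns"]
    removed = {n: e for n, e in prev_cols.items() if n not in curr_cols}
    added = {n: e for n, e in curr_cols.items() if n not in prev_cols}

    for old_name, expr in sorted(removed.items()):
        # Assumed heuristic: same expression under a new name means a rename.
        new_name = next((n for n, e in sorted(added.items()) if e == expr), None)
        if new_name is None:
            changes.append(("column_deleted", old_name))
        else:
            changes.append(("column_renamed", f"{old_name}->{new_name}"))
            del added[new_name]

    changes.extend(("column_added", name) for name in sorted(added))
    return changes


# Illustrative configs only.
yesterday = {
    "table_name": "site_revenue", "start_time": "2023-01-01",
    "group_by": ["site_id"], "owner": "team-a",
    "columns": {"revenue": "SUM(amount)", "orders": "COUNT(*)", "refunds": "SUM(refund)"},
}
today = {
    "table_name": "site_revenue", "start_time": "2023-01-01",
    "group_by": ["site_id"], "owner": "team-b",
    "columns": {"total_revenue": "SUM(amount)", "orders": "COUNT(*)", "visits": "SUM(visits)"},
}
changes = diff_table_configs(yesterday, today)
```

Each change type can then be handled differently: a rename or owner change needs only a metadata update, while a new column or a group-by change typically needs data to be recomputed.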
19. Demo time!
Now, what used to take hours takes minutes or seconds.
24. Additional DWHA Capabilities
Table types
→ Fact
→ Slowly changing type 2
→ Dimension
→ Fact Unpivot
Scale
→ 5,000-10,000 GB processed per day
→ 370,000,000,000 rows for the largest table
→ Hundreds of tables
Handles differences in
→ Column name, addition, deletion
→ Primary key change
→ Table start time/days back
→ Source change
→ Table name, owner
Additional columns
→ Update date, incremental date
→ Slowly changing: start, stop, status
25. What’s Next?
We are now working to create a DWH Automation UI