1. How to build data accessibility for everyone with open source?
Karen Hsieh, 2022/7/31
2. Karen Hsieh
A product manager who builds company-wide data literacy and empowers the product team to create value for people and grow the company profitably.
Welcome to connect 👋
www.linkedin.com/in/karenhsieh/
● Contributed “Using Metabase for Self-service Product Analytics” to the Metabase Community.
● Moderator of #dbt-local-taipei.
4. Prerequisites
● A data-informed culture.
○ You let data act as a check on your intuition.
● People doing spreadsheet work are tired of repeating it.
○ “My computer is so slow 🤬!” (when opening a spreadsheet)
○ “😩 I spend 2 hours producing the weekly report.” (The report is generated from multiple spreadsheets.)
7. Why don’t we let everyone access raw data?
Everyone accesses raw data
● Everyone needs to understand the raw data
○ Raw data is not that clean 🥹
○ Documentation takes effort
● Everyone needs to know how to write SQL
○ Requires them to learn a new skill
Everyone accesses transformed data
● It’s clearer and easier to understand
● It’s much easier to generate reports from it, e.g. creating a pivot table in a spreadsheet
So why would we expect everyone to access raw data?
8. Goal 💪
Empower everyone to do self-serve analysis.
● Understand data
● Access data easily
● Build reports easily
Example reports for a subscription business: subscription channel analysis, monthly subscriptions, subscription coupon usage.
9. How do we do it
1. What reports do people want?
2. What raw data do we have?
○ 🤯 Mostly by asking someone who has worked here for a long time. (Time for archeology. ⛏)
3. Back and forth between 1 and 2 = how to transform the data?
○ 🤯🤯 Make sure the numbers are consistent with the figures users previously counted by hand, so they are comfortable and confident using the transformed data. (You may find errors in some of the manual data. 😰)
10. Data models (detail in this Miro board)
● Raw data: subscriptions, orders, coupons, channels, users
● Transformed data (stage): order_user, order_revenue, subscription_user
● Transformed data (mart) → Reports: Subscription channel, Monthly subscriptions, Subscription coupon usage
1. Understand needs
2. What we have
3. How to transform
11. Data models (detail in this Miro board)
One stage table (order_user) feeds multiple mart models and reports: Subscription channel, Monthly subscriptions, Subscription coupon usage, and more.
● Raw data: subscriptions, orders, coupons, channels, users
● Transformed data (stage → mart) → Reports
12. Data pipeline: from ETL to ELT
ETL
● Extract
● Transform
● Load
Because cloud storage used to be expensive, we wanted to make sure we only loaded valuable data.
ELT
● Extract
● Load
● Transform
Since cloud storage and computing are now easy and cheap, we can load everything we extract and do the transformation later.
13. R&R
Engineers
build the data pipeline
● Knowledge of data & platform
structure
● Setup the environment,
including data warehouse and
BI tool
Analysts
do data transfer & single
source of truth
● dbt, github, data warehouse
● SQL
● Understand business logic &
doc
Everyone
uses the transferred data
● Advanced - build reports
○ SQL
○ Know transferred data
● Basic - use reports
○ BI Tool
Note: Analytics Engineers provide clean data sets to end users
14. Data models (detail in this Miro board)
● Raw data: subscriptions, orders, coupons, channels, users → 1. Engineers for EL
● Transformed data (stage, mart): order_user, order_revenue, subscription_user → 2. Analysts for T
● Reports: Subscription channel, Monthly subscriptions, Subscription coupon usage → 3. Everyone for reports
15. Open Source Tools
● dbt for the data transformation, with GitHub and the data warehouse
● Metabase as the BI tool
17. Modularized SQL queries
● Use ref() or source()
● Auto-generated DAG
Source: On DAGs, Hierarchies, and IDEs
Don’t throw 🗑 your query away. 💎 It’s reusable.
See the upstream and downstream relationships.
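As a sketch of how this modularization looks in a dbt model (the model, source, and column names here are illustrative, borrowed from the data-model slides): source() points at raw tables declared in YAML, ref() points at other dbt models, and dbt builds the DAG from those calls.

```sql
-- models/stage/order_user.sql (illustrative names)
select
    o.order_id,
    o.user_id,
    u.signup_channel
from {{ source('app', 'orders') }} as o   -- raw table, a DAG source node
left join {{ ref('users') }} as u         -- another dbt model, a DAG edge
    on o.user_id = u.user_id
```

Downstream mart models can in turn select from `{{ ref('order_user') }}`, so the query is reused instead of copy-pasted.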
19. Sync dbt docs to Metabase
● persist_docs
○ Sync docs to the data warehouse.
● dbt_metabase
○ Synchronizes models from dbt to Metabase.
● Source data is not supported.
It’s easy to keep the docs up to date.
Docs are only useful if they stay up to date.
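The persist_docs setting is enabled in dbt_project.yml; a minimal sketch (the project name `your_project` is a placeholder):

```yaml
# dbt_project.yml — write dbt descriptions into the warehouse
models:
  your_project:
    +persist_docs:
      relation: true   # persist model (table/view) descriptions
      columns: true    # persist column descriptions
```

The dbt_metabase package then reads the dbt project and pushes the model and column metadata into Metabase; see its README for the exact invocation against your setup.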
20. dbt test
● Ensure data quality.
● Built-in generic tests: unique, not_null, relationships, accepted_values
Source: Tests
Everyone trusts the data. Earn the trust.
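A sketch of how these four generic tests attach to a model in a schema .yml file (the model, column, and accepted values are illustrative, not from the actual project):

```yaml
# models/schema.yml (illustrative names)
version: 2
models:
  - name: order_user
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['active', 'canceled']
      - name: user_id
        tests:
          - relationships:   # every user_id must exist in users.id
              to: ref('users')
              field: id
```

Running `dbt test` then executes each test as a query and fails loudly on bad rows.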
21. dbt seed
Some data is input manually. Seeds are CSV files in your dbt project.
dbt seed turns those CSV files into models, so manually input data becomes part of the data sources.
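For example (the file name and contents are made up for illustration), a hand-maintained lookup dropped into the seeds folder:

```yaml
# seeds/channel_mapping.csv — manually maintained lookup
# channel_code,channel_name
# fb,Facebook
# gg,Google
```

Running `dbt seed` loads it into the warehouse as a table, and downstream models can select from it with `{{ ref('channel_mapping') }}` just like any other model.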
23. Configure incremental models
An incremental run processes only the rows in your source data that have been created or updated since the last time dbt ran.
Source: Configuring incremental models
Save cost and reduce errors.
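A minimal incremental model sketch (source and column names assumed from the data-model slides):

```sql
-- models/stage/order_revenue.sql (illustrative)
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    amount,
    updated_at
from {{ source('app', 'orders') }}

{% if is_incremental() %}
-- on incremental runs, only pick up rows changed since the last run
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

On the first run dbt builds the full table; afterwards it only scans and merges the new or changed rows, which is where the cost saving comes from.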
24. Version control with GitHub
● Collaborate on SQL
● Enable CI
Source: Enabling CI
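As one possible sketch of CI for a dbt project on GitHub Actions (the adapter, Python version, and profiles location are assumptions; credentials handling depends on your warehouse):

```yaml
# .github/workflows/dbt_ci.yml — illustrative sketch
name: dbt CI
on: pull_request
jobs:
  dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: "3.9"
      - run: pip install dbt-bigquery   # assumed adapter
      - run: dbt deps
      - run: dbt build --profiles-dir ci_profiles   # assumed profiles location
```

Every pull request then builds the changed models and runs their tests before the SQL is merged.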
34. 🤩 Wow, I’d like to do this!
Engineer: I want to stop checking data errors by hand.
Data user: I don’t want to wait for someone to provide the data.
35. Build data accessibility for everyone
Raw Data → Transformed Data → 📊 Business Intelligence (BI) tool
Engineers and analysts ensure data quality and keep the data pipeline running. 🤝
Everyone owns the reports and does self-serve data analysis. 😄
36. Reinforce the data-informed culture
= Raise data literacy
Self-serve analysis is easy and quick.
There is plenty of good-quality data.
😄 People like to check the data. 📊
37. How do we do it
1. What reports do people want?
2. What raw data do we have?
3. Transform the data
4. Advocate SQL
5. Share how to use Metabase
Recurring reports are sent out automatically. 🤖
Ad hoc questions are self-served. 🎉
42. Examples - transformed data
Before:
● An operations staff member produced 20 revenue reports monthly.
● Per report, she waited 6 hours for data checking plus 1 day for importing.
After:
● 5 minutes to import 1 report.
43. Examples - transformed data
Before:
● Waited 10 minutes to open a spreadsheet with >10 tabs and >10K rows.
● Emailed the reports to the partner as attachments.
After:
● Data updates automatically to a dashboard on Data Studio.
● Share the dashboard with the partner; they can check it anytime.