2. Conclusion
http://kelli-arena.com/hadoop-data-warehouse-architecture/
0.5K ETLs, 1B Rows Daily
                 My Case   You May
# of Engineers   1         1
# of Months      2.5       0.5
Upfront Costs    0         0
I went through several rounds of trial and error.
If you skip them, you can do it in half a month.
• trial and error #1?
• trial and error #2?
• trial and error #3?
Data Warehousing
with Google BigQuery
3. Data Warehouse Architecture
[Architecture diagram, four layers from left to right: Variety of Sources, Ingestion/Processing Layer, Storage/Analytics Layer, Visualization Apps.]
[Labels shown in the diagram: production databases, database replicas, GA, ds1 to ds6, snapshots, ETL servers, ETL query, (ELT), staging1, staging2, ods1 to ods6, dw2, dim_*, fact_*, summary, dm1 to dm4, data analysts.]
4. Design Concepts
Idempotence (冪等)
① Replace, do not append
② Reset/restart, do not resume
③ Divide the work into well-chosen blocks and make each block perfect
Let it be/go (無爲)
④ Do not unify database schemas
⑤ Do not disturb service developers
⑥ Do not invent new things
Simplicity (單純)
⑦ Classify simply
⑧ Reduce the number of switches
⑨ Sort alphabetically
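The idempotence concepts (replace instead of append, restart instead of resume) can be illustrated with a minimal sketch. It uses a plain dictionary as a stand-in for the warehouse and does not call BigQuery; the function names and the sample block names are assumptions made for illustration only. In BigQuery terms, "replace" roughly corresponds to loading with a WRITE_TRUNCATE write disposition rather than WRITE_APPEND.

```python
"""Sketch: why 'replace' makes a rerun harmless while 'append' does not."""


def load_replace(warehouse, block, rows):
    """Overwrite the destination block with a fresh copy of the rows."""
    warehouse[block] = list(rows)


def load_append(warehouse, block, rows):
    """Add the rows to whatever the destination block already holds."""
    warehouse.setdefault(block, []).extend(rows)


def run_etl(load, warehouse, source):
    """Load every block from scratch, in alphabetical order (no resume state)."""
    for block in sorted(source):
        load(warehouse, block, source[block])


# Hypothetical source data: one list of rows per block.
source = {"orders_20170704": [1, 2], "orders_20170705": [3]}

# Running the whole ETL twice, e.g. after a failure, is safe with replace.
replaced = {}
run_etl(load_replace, replaced, source)
run_etl(load_replace, replaced, source)

# The same rerun with append duplicates every row.
appended = {}
run_etl(load_append, appended, source)
run_etl(load_append, appended, source)

print(replaced == source)  # True
print(appended == source)  # False
```

Because each block is rewritten as a whole, a failed run can simply be restarted without tracking how far it got.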
5. Classifying Source Tables
Criterion 1) Is the data row-rangeable? If yes, partition rows horizontally and load partition by partition.
Criterion 2) Is the data change-traceable? If yes, apply changed rows only instead of loading all rows.
Criterion 3) Is the data mutable? If yes, reload all rows every day.
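As a rough sketch, the rangeable and mutable answers can be mapped to the type codes P, W, and S used in the table on this slide. The function name is hypothetical, and the last case (neither rangeable nor mutable) is not clearly covered by the slide, so treating it as type S is an assumption.

```python
def classify(rangeable: bool, mutable: bool) -> str:
    """Map the two yes/no answers to a table type code.

    P: rangeable and mutable     -> date-partitioned table, all rows, n ETLs by range
    W: rangeable, not mutable    -> wildcard table, rows of last x days, n ETLs by date
    S: not rangeable             -> single table, all rows, one ETL
    """
    if rangeable and mutable:
        return "P"
    if rangeable:
        return "W"
    # Assumption: without a usable range there is nothing to split on, so a
    # full reload into a single table is used whether or not the data is mutable.
    return "S"


# Source tables named on the slide, with the answers given in its table.
examples = {
    "customers": classify(rangeable=True, mutable=True),
    "orders": classify(rangeable=True, mutable=False),
    "products": classify(rangeable=False, mutable=True),
}
print(examples)
```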
Type P (rangeable: yes, mutable: yes)
• Daily loading: all rows, n ETLs by range
• Source table: customers
• Destination table (on BigQuery): date-partitioned table
  customers$19691231, customers$19700101, customers$20110101, … , customers$20170101, customers$20170401, customers$20170701

Type W (rangeable: yes, mutable: no)
• Daily loading: rows of last x days, n ETLs by date
• Source table: orders
• Destination table (on BigQuery): wildcard table
  orders_19691231, orders_20170622, orders_20170623, … , orders_20170702, orders_20170703, orders_20170704, orders_20170705

Type S (rangeable: no, mutable: yes)
• Daily loading: all rows, ETL at once
• Source table: products
• Destination table (on BigQuery): single table
  products
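The destination names in the table follow two patterns: a `$YYYYMMDD` suffix for partitions of a date-partitioned table and a `_YYYYMMDD` suffix for members of a wildcard table. The sketch below only builds such names; the helper names are hypothetical, and how range boundaries are chosen for type P is not stated on the slide, so the list of days is passed in by the caller.

```python
from datetime import date, timedelta


def partition_name(table: str, day: date) -> str:
    """Name of one partition of a date-partitioned table, e.g. customers$20170701."""
    return f"{table}${day:%Y%m%d}"


def wildcard_name(table: str, day: date) -> str:
    """Name of one member of a wildcard table, e.g. orders_20170705."""
    return f"{table}_{day:%Y%m%d}"


def last_days(end: date, x: int) -> list[date]:
    """The last x days up to and including `end`, oldest first."""
    return [end - timedelta(days=offset) for offset in range(x - 1, -1, -1)]


def destinations(table: str, table_type: str, days: list[date]) -> list[str]:
    """Destination names for one daily run, by table type (P, W, or S)."""
    if table_type == "P":
        return [partition_name(table, d) for d in days]
    if table_type == "W":
        return [wildcard_name(table, d) for d in days]
    if table_type == "S":
        return [table]  # single table, reloaded as a whole
    raise ValueError(f"unknown table type: {table_type!r}")


# Type W: one ETL per day for the last x days (x = 14 is an arbitrary choice here).
orders = destinations("orders", "W", last_days(date(2017, 7, 5), 14))
print(orders[0], orders[-1])

# Type P: one ETL per range, each range keyed by a date.
customers = destinations("customers", "P", [date(2017, 4, 1), date(2017, 7, 1)])
print(customers)

# Type S: the table name itself.
print(destinations("products", "S", []))
```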
7. Conclusion’
http://kelli-arena.com/hadoop-data-warehouse-architecture/
0.5K ETLs, 1B Rows Daily
                 My Case   You May
# of Engineers   1         1
# of Months      2.5       0.5
Upfront Costs    0         0
I went through several rounds of trial and error.
If you skip them, you can do it in half a month.
• #1 Attempt to sync RDBMS to BigQuery
• #2 Attempt to make a new ETL engine
• #3 Formula hacks for data anonymization (MariaDB, MySQL, PostgreSQL)
MWA: Microwarehouses Architecture