Data Warehousing
with Google BigQuery
Stanley Choi
stanley@repute.io
Conclusion
http://kelli-arena.com/hadoop-data-warehouse-architecture/
0.5K ETLs, 1B Rows Daily
                 My Case   You May
# of Engineers   1         1
# of Months      2.5       0.5
Upfront Costs    0         0
I went through several rounds of trial and error.
If you skip them, you can do it in half a month.
• trial and error #1?
• trial and error #2?
• trial and error #3?
Data Warehouse Architecture
[Figure: architecture diagram with four layers, left to right: Variety of Sources, Ingestion/Processing Layer, Storage/Analytics Layer, Visualization Apps. Node labels in the diagram: production databases, database replicas, GA, snapshots, ETL servers, ETL query (ELT), staging1, staging2, ods1 to ods6, dw2 (dim_*, fact_*, summary), dm1 to dm4, ds1 to ds6, data analysts.]
Design Concepts
Idempotence (冪等)
① Replace, do not append
② Reset/restart, do not resume
③ Divide the work properly into several blocks and make each block perfect

Let it be/go (無爲)
④ Do not unify database schemas
⑤ Do not disturb service developers
⑥ Do not invent new things

Simplicity (單純)
⑦ Classify simply
⑧ Reduce the number of switches
⑨ Sort alphabetically
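
The idempotence concepts (replace rather than append, restart rather than resume) can be illustrated with a minimal sketch. This is not the author's ETL code: the in-memory `Warehouse` dictionary and the functions `load_append` and `load_replace` are hypothetical stand-ins for a destination store and its load jobs. In BigQuery load jobs, the two behaviors roughly correspond to the WRITE_APPEND and WRITE_TRUNCATE write dispositions.

```python
from typing import Dict, List

Row = Dict[str, object]
# Hypothetical in-memory stand-in for a warehouse: destination name -> rows.
Warehouse = Dict[str, List[Row]]


def load_append(warehouse: Warehouse, dest: str, rows: List[Row]) -> None:
    """Append-style load: re-running the same job duplicates its rows."""
    warehouse.setdefault(dest, []).extend(rows)


def load_replace(warehouse: Warehouse, dest: str, rows: List[Row]) -> None:
    """Replace-style load: the whole destination block is overwritten,
    so re-running the same job leaves the same final state."""
    warehouse[dest] = list(rows)


if __name__ == "__main__":
    block = [{"order_id": 1}, {"order_id": 2}]
    appended: Warehouse = {}
    replaced: Warehouse = {}
    # Simulate a job that is restarted twice after its first run.
    for _ in range(3):
        load_append(appended, "orders_20170705", block)
        load_replace(replaced, "orders_20170705", block)
    print(len(appended["orders_20170705"]))  # 6 rows: duplicates pile up
    print(len(replaced["orders_20170705"]))  # 2 rows: same as a single run
```

Because each block is replaced as a whole, a failed run can simply be restarted from the beginning of that block without tracking how far it got.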
Classifying Source Tables
Criterion 1) Is the data row-rangeable? If yes, partition the rows horizontally and load partition by partition.
Criterion 2) Is the data change-traceable? If yes, apply only the changed rows instead of loading all rows.
Criterion 3) Is the data mutable? If yes, reload all rows every day.
Rangeable  Mutable  Type  Daily data loading     Source table  Destination table (on BigQuery)
yes        yes      P     all rows;              customers     date-partitioned table:
                          n ETLs by range                      customers$19691231, customers$19700101,
                                                               customers$20110101, …,
                                                               customers$20170101, customers$20170401,
                                                               customers$20170701
yes        no       W     rows of last x days;   orders        wildcard table:
                          n ETLs by date                       orders_19691231, orders_20170622,
                                                               orders_20170623, …,
                                                               orders_20170702, orders_20170703,
                                                               orders_20170704, orders_20170705
no         yes      S     all rows;              products      single table:
                          ETL at once                          products
no         no
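
The P/W/S classification above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the function names `classify` and `destination_names` are hypothetical, type W is read as rangeable but immutable, and treating every non-rangeable table as type S (including the immutable case) is an assumption. The `table$YYYYMMDD` form is BigQuery's partition decorator and the `table_YYYYMMDD` form is the suffix convention used by wildcard tables.

```python
from datetime import date
from typing import List


def classify(rangeable: bool, mutable: bool) -> str:
    """Map the two criteria to a table type: P, W, or S."""
    if rangeable and mutable:
        return "P"  # date-partitioned table, all rows reloaded by range
    if rangeable:
        return "W"  # wildcard table, only the last x days reloaded
    # Assumption: any non-rangeable table is loaded as a single table.
    return "S"


def destination_names(table: str, table_type: str, days: List[date]) -> List[str]:
    """Build the destination names for one load, one name per ETL."""
    if table_type == "P":
        # Partition decorator: customers$20170701
        return [f"{table}${d:%Y%m%d}" for d in days]
    if table_type == "W":
        # Wildcard shard: orders_20170705
        return [f"{table}_{d:%Y%m%d}" for d in days]
    if table_type == "S":
        # Single table loaded at once; the dates are not used.
        return [table]
    raise ValueError(f"unknown table type: {table_type}")


if __name__ == "__main__":
    print(classify(rangeable=True, mutable=True))
    print(destination_names("customers", "P", [date(2017, 7, 1)]))
    print(destination_names("orders", "W", [date(2017, 7, 4), date(2017, 7, 5)]))
    print(destination_names("products", "S", []))
```

Each returned name is a separate replaceable block, which is what lets a single partition or shard be reloaded on its own.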
Data Loading Sequence
I went through several rounds of trial and error.
If you skip them, you can do it in half a month.
• #1 Attempt to sync RDBMS to BigQuery
• #2 Attempt to make a new ETL engine
• #3 Formula hacks for data anonymization
(MariaDB, MySQL, PostgreSQL)
Conclusion’
MWA: Microwarehouses Architecture
Q&A
https://facebook.com/groups/bigquery
stanley@repute.io
