Cloud Data Warehousing presentation by Rogier Werschkull, including tips, best practices and a BigQuery, Redshift, Snowflake and Azure SQL DWH comparison, delivered at #BIDASUMMIT
Presentation on "Cloud Data Warehousing: What, Why and How?" by Rogier Werschkull (RogerData), at the BI & Data Analytics Summit on June 13th, 2019 in Diegem (Belgium)
1. Cloud Data Warehousing
What - Why - How & Compare
Rogier Werschkull, RogerData
@rwerschkull
nl.linkedin.com/in/rogierwerschkull
2. The story begins: before…
3. ▪ The system was reaching its storage limit
▪ And extending it was very expensive
▪ All kinds of interesting performance challenges started appearing…
▪ Loading windows, projections…
▪ New use cases started to appear:
▪ “We need near real-time for LiveOps”
▪ Move to mobile gaming: potential unplannable growth
▪ The system became a bit unstable…
▪ Poorly distributed data
▪ Out of memory / failing nodes
Until…
4. Not a Vertica problem!…
5. “How to fix this and prepare for the future?”
6. Choices…
▪ Upgrade Vertica
▪ Huge initial investment.
• But what to choose when your data growth is unpredictable?
• And what about the changing use cases?
▪ Store less data
▪ Try selling that to the business…
▪ And what about the data growth being unpredictable?
▪ Switch technology…
8. The takeaway: move your DWH to the Cloud…
1. When you want a system that is simpler to set up, configure, maintain and operate
▪ Where a lot of the DBA / maintenance work you are still doing now ‘comes for free’
2. When your current DWH is maxed out and difficult or expensive to expand
▪ When you need potentially endless storage and compute scaling (more real-time being one possibility)
▪ When you require better workload separation
3. When you want your costs to scale linearly with usage
9. The takeaway: when maybe not?
1. When everything mentioned on the previous slide does not apply!
2. When you have a lot of data (10+ terabytes) and it is not in the cloud already (ingress)
3. When your data amounts are small (<1TB) or in the multiple-petabyte range (effect on costs)
4. When you have ultra-sensitive data and don’t trust the measures taken by the cloud providers to prevent this ‘from going wrong’
10. So which of these 14 to pick from then? (Forrester Wave for Cloud Data Warehouses, Q4-2018)
11. The takeaway: start with either…
OR
14. But we’ll just do this by building a BiG data lake, right?
Photo credit: Lake Public Domain, http://www.writeups.org/star-trek-brent-spiner-data/
18. “Every single company I've worked at and talked to has the same problem without a single exception so far: poor data quality... Either there's incomplete data, missing data, duplicative data.”
Ruslan Belkin, former VP of Engineering @ Twitter and Salesforce
22. I don’t believe in Cloud Data warehousing!
23. (as the answer to all your Data warehousing woes)
24. “The primary purpose of a data warehouse is to transform data from an application state into an integrated corporate state”
Bill Inmon, the father of data warehousing
25. This is what we still SHOULD want to build: a DWH that is Subject-Oriented, Integrated, Time-Variant and Non-Volatile
26. Build a Data warehouse!
31. ‘Virtual by design’?
• Focus on the transformation logic
• Not on storing / updating / deleting data structures
• Simplifies backfilling / changing the DWH (see the sketch below)
Photo credit: Public Domain
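A minimal sketch of what ‘virtual by design’ can look like in practice: the integration logic lives in a view definition instead of a persisted, updated table, so backfilling or changing the DWH means replacing one definition. The BigQuery client is used only as an illustration here; the dataset, table and column names are hypothetical, not taken from the slides.

```python
# Hypothetical 'virtual by design' layer: the logic lives in a view, not in stored data.
from google.cloud import bigquery

client = bigquery.Client()

# Replacing the view re-applies the transformation to all history at once,
# which is what makes backfills and logic changes cheap.
client.query("""
CREATE OR REPLACE VIEW analytics.v_customer_orders AS
SELECT c.customer_id,
       c.country,
       o.order_id,
       o.order_ts,
       o.amount_eur
FROM   raw.customers AS c
JOIN   raw.orders    AS o USING (customer_id)
""").result()
```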
32. Modern DWH (this is what we did in BigQuery!): Data & History, Tagging & Search, address basic DQ issues, address complex DQ issues, integrate data into meaningful and useful structures.
33. I do believe in Cloud Data warehouse technology!
34. I do believe in Cloud-based analytical databases
38. ▪ Built on the Google Dremel execution engine (Apache Drill is an open-source implementation of the same concept)
▪ Available since October 2011
▪ Cloud native: born in the cloud
▪ Key unique feature:
▪ The only full-on DWaaS: no nodes, no CPU, no RAM, nothing to configure
BigQuery:
39. ▪ Based on PostgreSQL 8.0.2, rebuilt as a cloud-based MPP database
▪ Available since October 2012
▪ Cloud DWH based on legacy technology
▪ Key unique features:
▪ The most implementations (largest installed base)
▪ Best SQL support
Redshift:
40. ▪ New kid on the block, started by ex-Oracle employees
▪ Cloud native: born in the cloud
▪ Available since October 2014
▪ Key unique features:
▪ The only cloud-agnostic DWH: AWS, Azure and Google (early 2020)
▪ No-downtime auto-scaling
▪ Metadata-based data cloning (clone your production to DTA with no extra storage; see the sketch below)
Snowflake:
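As an illustration of the metadata-based cloning mentioned above, here is a minimal sketch using the Snowflake Python connector; the connection parameters and database names are placeholders, not part of the original slides.

```python
# Sketch: clone production into a DTA database. Only metadata is copied;
# storage is shared until data is modified (zero-copy clone).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical connection parameters
    user="my_user",
    password="my_password",
    role="SYSADMIN",
)
cur = conn.cursor()
cur.execute("CREATE DATABASE dta_db CLONE prod_db")
cur.close()
conn.close()
```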
41. ▪ Based on SQL Server Parallel Data Warehouse (PDW)
▪ Cloud DWH based on legacy technology
▪ Available since 2015 (gen1) and May 2018 (gen2)
▪ Key unique features:
▪ Getting stronger quickly since the gen2 release
▪ Vast supporting ecosystem of GUIs and ETL tools
Azure SQL Data Warehouse:
42. So why then, and what is different? Comparing the top 3 benefits…
Photo credit: Public Domain
43. 3- Costs: low entry point / pay-for-use
Photo by Joel Filipe on Unsplash
44. Comparing Cost features…
Feature | Redshift | Azure SQL DWH | BigQuery | Snowflake
Fixed start / licence costs? | NO | NO | NO | NO
Separation of storage and compute | NO | YES | YES | YES
Costs easy to calculate? | Bit of work | Bit of work | YES | Bit of work
Good predictability? | YES | YES | NO (on demand), YES (flat rate) | Depends on auto-scaling
Storage costs / 1TB / month (USD) | 374* | 149 | 20/10 uncompressed | 23
CPU / usage costs | Depends | Depends | 5 per TB read, uncompressed | Depends
* For current-gen dense storage ds2.8xlarge cluster running all year
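To make the on-demand numbers above concrete, here is a rough back-of-the-envelope estimate using the $5 per TB read and $20 per TB-month active-storage figures from the table; the data and query volumes are made up for the example, not from the slides.

```python
# Illustrative BigQuery on-demand estimate; volumes are assumptions for the example.
tb_scanned_per_day = 2                       # assumed daily query volume in TB
storage_tb = 10                              # assumed active storage in TB
query_cost = tb_scanned_per_day * 30 * 5     # $5 per TB read
storage_cost = storage_tb * 20               # $20 per TB-month (active storage)
print(query_cost + storage_cost)             # ~$500 per month in this example
```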
45. ▪ Strong:
▪ All start at zero cost, with no fixed licence fee
▪ All employ a pay-for-use model
▪ Snowflake has the cheapest storage (even more so with metadata-based cloning)
▪ Weak:
▪ Redshift doesn’t separate storage and compute
▪ BigQuery’s DWaaS model charges CPU costs based on the amount of data queried: this can get out of control when you don’t set limits (on-demand pricing only; see the cost-cap sketch below)
▪ BigQuery’s limited end-user data caching can lead to rising costs, depending on the usage pattern (a solution is in development; on-demand pricing only)
Costs: solution strong / weak points
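One way to keep BigQuery on-demand costs from getting out of control is to cap the bytes billed per query. A minimal sketch with the official Python client; the query text and the 100 GB cap are only examples.

```python
# Sketch: fail any query that would bill more than ~100 GB on-demand.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=100 * 1024**3)

job = client.query(
    "SELECT order_id, amount_eur FROM analytics.orders",  # hypothetical table
    job_config=job_config,
)
rows = job.result()  # the job errors out instead of billing beyond the cap
```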
46. • A cloud DWH will not always be cheaper than on-prem!
• Costs change from CAPEX to OPEX
• Requires a different operating model
• Costs can be unpredictable, which can be seen as a problem
• And remember the TCO!
Costs: don’t forget…
47. 2- (Almost) Infinite scaling
Photo by Joel Filipe on Unsplash
48. Comparing Scale and Speed features…
Feature | Redshift | Azure SQL DWH | BigQuery | Snowflake
Type | Based on legacy | Based on legacy | Cloud native | Cloud native
Max storage size | 2PB | Gen1: 240TB; Gen2: 240TB row, unlimited columnstore | Unlimited | Unlimited
Storage resizing | COMPLEX | Doable | N.A. | N.A.
Dynamic node resizing | Doable | Doable | N.A. | EASY
Concurrency resizing | COMPLEX | Doable | Default 50, then contact Google | EASY
No-downtime auto-scaling | NO | NO | N.A. | YES
Hibernate compute | NO | YES | N.A. | YES
Data caching | Hot data SSD cache + exact query | Hot data SSD cache + exact query | Only exact query (+ in development) | Hot data SSD cache + exact query
49. ▪ Strong:
▪ BigQuery and Snowflake have unlimited storage
▪ BigQuery on-demand is always very powerful with 2,000 slots; scaling is not relevant here!
▪ Snowflake has the best cluster and concurrency scaling options (see the sketch below)
▪ Weak:
▪ Redshift is complex to resize and scale, and cannot hibernate compute
▪ BigQuery’s DWaaS nature and limited caching options almost always incur 2-3 seconds of query startup time
Scaling: strong / weak points
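A minimal sketch of the Snowflake-side concurrency scaling mentioned above: a multi-cluster warehouse that adds clusters under load and suspends itself when idle. The warehouse name and sizing are illustrative, and multi-cluster warehouses assume a Snowflake edition that supports them.

```python
# Sketch: a multi-cluster warehouse scales out for concurrency with no downtime.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password"  # placeholders
)
cur = conn.cursor()
cur.execute("""
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4        -- extra clusters start only when queries queue
  SCALING_POLICY = 'STANDARD'
  AUTO_SUSPEND = 300           -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
""")
cur.close()
conn.close()
```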
50. • This could be ‘your last DWH migration’: choose wisely!
• The power of this technology is an enabler for the modern data warehousing methodology: virtualize!
• Infinite scale also increases the risk of:
• Infinite costs
• An infinite data mess
So take your Data Management (even more!) seriously!
Scaling: don’t forget…
51. Fivetran benchmark, 10 Sept 2018
• 99 TPC-DS queries
• Run only once
• Calculated with the system being idle 82% of the time
• Factor 10 difference in size of cluster and dataset
• No usage of:
• Partitioning
• Sort keys
• Clustering
SOURCE: https://fivetran.com/blog/warehouse-benchmark
54. 1- Ease of deployment, development and maintenance
Photo credit: Public Domain
55. Comparing deployment, development and maintenance
Feature | Redshift | Azure SQL DWH | BigQuery | Snowflake
Setup process | COMPLEX | AVERAGE | N.A. | EASY
Managing data on cluster | COMPLEX | AVERAGE | N.A. | EASY
Separation of storage and compute | NO | YES | YES | YES
Time travel (auto-backup) | 8 hours + configurable | 8 hours + user-defined restore points | YES, 7 days, 2 after delete | YES, 1 to 90 days + fail-safe
Metadata-only data cloning | NO | NO | NO | YES
SQL DDL support | EXCELLENT | GOOD | LIMITED | GOOD
SQL DML support | EXCELLENT | OK | GOOD | OK*
Stored procedure support | GOOD | GOOD | NO | GOOD
UDF support | GOOD | GOOD | AVERAGE: not centrally | GOOD
Materialized view support | NO | YES (preview) | NO | YES (limited)
PK/FK support | As metadata | NO | NO | As metadata
Quality GUI / SQL interface | GOOD | GOOD, but no web UI | OK, web | GOOD, web
JSON parsing capabilities | AVERAGE | In preview | OK | GOOD
ETL dev / scheduling | GOOD: AWS Glue, coding | GOOD: Data Factory, SSIS, coding | GOOD: Cloud Data Fusion, scheduled queries, Cloud Composer | OK: coding or external ETL tool in AWS / Azure
* Analytical functions are not fully mature yet: https://medium.com/@jthandy/how-compatible-are-redshift-and-snowflakes-sql-syntaxes-c2103a43ae84c
56. ▪ All have integrated replication and backup
▪ BigQuery has no configuration / maintenance work at all
▪ Snowflake has just enough simple configurability
▪ BigQuery and Snowflake support time travel (see the sketch below)
▪ Snowflake has metadata-based database cloning
▪ Redshift has the best SQL support
▪ Azure SQL Data Warehouse has the best supporting ecosystem of GUIs and ETL tools
Using: strong points
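For the time-travel point above, a minimal sketch of what the queries look like on both platforms; the table name and the one-hour offset are illustrative only.

```python
# Sketch: query a table as it was one hour ago.

# BigQuery time travel (up to 7 days back):
bq_sql = """
SELECT *
FROM analytics.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""

# Snowflake time travel (1 to 90 days, depending on retention settings):
sf_sql = "SELECT * FROM analytics.orders AT(OFFSET => -3600)"
```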
57. ▪ Redshift and Azure SQL Data Warehouse require you to choose distribution keys and to create / update statistics (see the sketch below)
▪ Redshift requires DBA work to reclaim space when deleting data
▪ BigQuery’s SQL DDL support is limited
▪ BigQuery has no stored procedures or materialized views, and only limited UDF support
▪ No one has proper primary or foreign key support!
Using: weak points
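To illustrate the first point above, a minimal Redshift-flavoured sketch of the distribution-key and statistics work that stays with you; the table design, column choices and connection details are hypothetical, not a recommendation from the slides.

```python
# Sketch: on Redshift you still choose distribution and sort keys yourself,
# and you keep planner statistics fresh with ANALYZE.
import psycopg2

conn = psycopg2.connect(host="my-cluster.example.com", port=5439,
                        dbname="dwh", user="etl", password="my_password")
conn.autocommit = True
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS analytics.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_ts    TIMESTAMP,
    amount_eur  DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- co-locate rows that join on customer_id
SORTKEY (order_ts)      -- zone maps prune scans on date ranges
""")
cur.execute("ANALYZE analytics.orders")  # refresh statistics for the planner
cur.close()
conn.close()
```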
58. • The data-storage work that remains: thinking about your partitioning and clustering (data-sorting) strategy (see the sketch below)
• You still need to use a good data warehousing methodology!
• The basic skills and competences needed don’t change!
Using: don’t forget…
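A minimal, BigQuery-flavoured sketch of that remaining design work; the table, column names and the chosen partition/cluster keys are illustrative assumptions, not taken from the slides.

```python
# Sketch: the physical-design decision that stays with you is how to partition
# and cluster (sort) the data.
from google.cloud import bigquery

client = bigquery.Client()
client.query("""
CREATE TABLE IF NOT EXISTS analytics.events (
  event_ts    TIMESTAMP,
  customer_id STRING,
  event_type  STRING,
  payload     STRING
)
PARTITION BY DATE(event_ts)         -- date filters prune whole partitions
CLUSTER BY customer_id, event_type  -- co-locates rows for selective scans
""").result()
```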