Using Data Platforms that are Fit-For-Purpose
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” (Onalytica)
President, McKnight Consulting Group
A 2-time Inc. 5000 Company
@williammcknight
www.mcknightcg.com
(214) 514-1444
© All rights reserved. Matillion 2021
A vendor perspective from Matillion
Modern Data Storage Evolutions
Paul Lacey
Sr. Director, Product Marketing
Matillion
Speaker
History of Data Warehousing
1960: DBMS
1970: RDBMS
1980: SQL
1988: Data Warehouse
1992: Building the Data Warehouse published
1996: The Data Warehouse Toolkit published
2005: Hadoop, Big Data
2013: Cloud Data Warehouse
2015: Cloud ETL
2017: Lakehouse
History of Data Warehousing
(timeline graphic: Hadoop ‘Big Data’ in 2005, Cloud ETL in 2015, Spectrum, and vendor milestones in 2011, 2013, 2014, 2017, and 2019; remaining labels not recoverable)
Architectural Paradigms
(chart: effort-to-reward vs. innovation)
2005: Original Big Data Stack
2013: Pipeline 2.0
2015: Hybrid Storage
2017: Lakehouse
The Original Big Data Stack
Pipeline 2.0
Hybrid Storage
ML & BI Separated and Siloed

                 Data Engineering   Data Science
Wrangling        ETL                Data Prep
Storage          Data Warehouse     Data Lake
Processing       Scala              Pandas
Orchestration    Airflow            Jupyter Notebooks
Visualization    Tableau            Matplotlib
The Lakehouse Approach
Combine data science and data engineering workflows
Lakehouse: best of both worlds
Data Warehouse + Data Lake
Streaming Analytics, BI, Data Science, Machine Learning
Structured, Semi-Structured, and Unstructured Data
Lakehouse
Familiar Interfaces, One Dataset
(diagram: Matillion ELT logic and orchestration over one ACID dataset; Analysts use SQL, Data Scientists use PySpark/Spark, with Data Engineers, Integrators, and Business Users sharing the same data)
Accessible Integration Brings Unified Analytics
More Info: matillion.com
Matillion #TeamGreen
matillion.com
Thank You!
Performance
• Performance is a critical point of interest when it
comes to selecting an analytics platform.
• To measure data warehouse performance, we use
similarly priced specifications across data
warehouse competitors.
• Usually when people say they care about
performance, it is the ultimate metric of
price/performance.
• The realities of creating fair tests can overwhelm many shops, and the effort involved is usually underestimated.
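Since the ultimate metric is price/performance, a quick way to normalize benchmark results is cost per workload run. A minimal sketch, with hypothetical hourly rates and timings standing in for real measurements:

```python
# Price/performance: the cost to complete a fixed query workload once.
# Rates and timings below are hypothetical, for illustration only.

def price_performance(hourly_rate_usd, workload_seconds):
    """Cost in USD to run the benchmark workload once."""
    return hourly_rate_usd * (workload_seconds / 3600)

# Two hypothetical platforms at similar list prices:
platform_a = price_performance(hourly_rate_usd=16.00, workload_seconds=1200)
platform_b = price_performance(hourly_rate_usd=12.00, workload_seconds=1800)

print(f"Platform A: ${platform_a:.2f} per workload run")  # faster but pricier
print(f"Platform B: ${platform_b:.2f} per workload run")
```

In this made-up case the pricier-per-hour platform still wins on price/performance because it finishes the workload sooner, which is exactly why raw query times alone mislead.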
The Perils of Performance Alone
• A modern workload is less often a fixed set of queries and more often an interactive, variable number of queries
• A lack of certain key features and functions in the chosen platform increases time spent
• Some data warehouse platforms have features that appear beneficial and desirable but carry hidden downsides
Enterprise Analytic Platforms
Data is fit-for-purpose (FFP) when it is…
• In a leverageable platform
• In an appropriate platform for its profile and usage
• Held to high non-functional standards (availability, performance, scalability, stability, durability, security)
• Captured at the most granular level
• At a data quality standard (as defined by Data Governance)
Product Setup
Cost Predictability and Transparency
• The cost profile options for cloud databases are straightforward if you accept the defaults for simple workloads or proof-of-concept (POC) environments.
• Initial entry costs and inadequately scoped
environments can artificially lower
expectations of the true costs of jumping into
a cloud data warehouse environment.
• For some, you pay for compute resources as a
function of time, but you also choose the
hourly rate based on certain enterprise
features you need.
Cost Consciousness and Licensing Structure
• Be on the lookout for cost optimizations like not
paying when the system is idle, compression to
save storage costs, and moving or isolating
workloads to avoid contention.
• Look for the ability to directly operate on compact open file formats such as Parquet and ORC.
• Also, costs can spin out of control if you have to
pay a separate license for each deployment option
or each machine learning algorithm.
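To see why idle-time billing matters, consider a rough comparison of an always-on warehouse against one that auto-suspends when idle. All figures are hypothetical placeholders:

```python
# Sketch: savings from auto-suspending an idle warehouse.
# The hourly rate and usage pattern are assumptions for illustration.

HOURLY_RATE = 8.00      # USD per compute-hour (hypothetical)
HOURS_PER_MONTH = 730
ACTIVE_HOURS = 200      # hours/month the warehouse actually serves queries

always_on_cost = HOURLY_RATE * HOURS_PER_MONTH
auto_suspend_cost = HOURLY_RATE * ACTIVE_HOURS
savings = always_on_cost - auto_suspend_cost

print(f"Always on:    ${always_on_cost:,.2f}/month")
print(f"Auto-suspend: ${auto_suspend_cost:,.2f}/month (saves ${savings:,.2f})")
```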
Easy Administration
• Overall costs and time, as well as storage and compute resources, are affected by the simplicity of configuration and overall use.
• The platform should have embraced a self-sufficiency
model for its customers and be well into the process of
automating repetitive tasks.
• Easy administration starts with a setup process that asks for basic information and provides helpful guidance for selecting the storage and node configurations.
Optimizer Robustness
• The data warehouse should be designed for
complex decision support and machine learning
activity in a multi-user, mixed workload, highly
concurrent environment.
• Check on conditional parallelism and what causes variations in the parallelism deployed.
• Check on dynamic and controllable
prioritization of resources for queries.
Dedicated Compute
• The dedicated compute category represents the heart of
the analytics stack—the data warehouse itself.
• A modern cloud data warehouse must have separate
compute and storage architecture.
• The power to scale compute and storage independently of
one another has transitioned from an industry trend to an
industry standard.
Dedicated Storage
• The dedicated storage category represents
storage of the enterprise data.
• In former days, this data was tightly-coupled to
the data warehouse itself, but modern cloud
architecture allows for the data to be stored
separately (and priced separately).
Data Integration
• The data integration category represents the
movement of enterprise data from source to
the target data warehouse through
conventional ETL (extract-transform-load) and
ELT (extract-load-transform) methods.
Data Access
• Azure Synapse and Google BigQuery have a “serverless” pricing model that allows
users to run queries and only pay for the data they scan and not an hourly rate for
compute.
• Redshift has the Spectrum service to scan data in S3 without loading it into the data
warehouse; however, you pay for the data scanned, plus you need a running Redshift
cluster at an additional charge.
• For Snowflake, you pay for the compute, but not for data scanned. For all these
scenarios (except Snowflake), we assumed 500TB scanned per month for the
Medium-tier enterprise and 2,000TB scanned for Large organizations.
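The two pricing models can be compared directly once you estimate monthly scan volume and cluster uptime. A sketch with placeholder rates (not actual vendor list prices), using the 500 TB Medium-tier scan assumption from above:

```python
# Comparing a scan-priced ("serverless") model with an hourly-compute model.
# Both rates are hypothetical placeholders, not vendor list prices.

def serverless_monthly_cost(tb_scanned, usd_per_tb):
    return tb_scanned * usd_per_tb

def cluster_monthly_cost(hourly_rate, hours_running):
    return hourly_rate * hours_running

# Medium-tier scenario: 500 TB scanned per month vs. an always-on cluster.
scan_priced = serverless_monthly_cost(tb_scanned=500, usd_per_tb=5.00)
hourly_priced = cluster_monthly_cost(hourly_rate=10.00, hours_running=730)

print(f"Scan-priced:   ${scan_priced:,.2f}/month")
print(f"Hourly-priced: ${hourly_priced:,.2f}/month")
```

Which model wins flips entirely with the workload shape: heavy continuous use favors hourly compute, while sporadic scanning favors pay-per-scan.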
Data Lake
• The data lake category represents the use of a data lake that is separate from the data warehouse. This is common in many modern data-driven organizations as a way to store and analyze massive data sets of “colder” data that don’t necessarily belong in the data warehouse.
Sample Breakout (AWS)
Dedicated Compute     43%
Storage                0%
Data Integration      14%
Streaming              4%
Spark Analytics        3%
Data Exploration       6%
Data Lake             20%
BI                     5%
Machine Learning       5%
Identity Management    0%
Data Catalog           0%
Product Utilization
Concurrency Scaling
• If the database has concurrency limitations, designing around them is difficult at best and limits effective data usage.
• If the data warehouse automatically scales up to
overcome concurrency limitations, this may be costly if
the data warehouse charges by compute node.
• If the data warehouse charges per user, costs will also
increase as the data warehouse is put to more use in the
company.
• Look for a data warehouse to provide linear scaling in
overall query workload performance as concurrent users
are added.
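One way to quantify “linear scaling” is throughput per user relative to the single-user baseline. A sketch, with hypothetical benchmark figures standing in for real measurements:

```python
# Sketch: checking for near-linear throughput scaling as concurrent users
# are added. The measured throughput figures are hypothetical inputs.

def scaling_efficiency(measured):
    """measured: {concurrent_users: queries_per_minute}.
    Returns per-user throughput relative to the single-user baseline."""
    baseline = measured[1]
    return {users: (qpm / users) / baseline for users, qpm in measured.items()}

# Hypothetical benchmark results:
results = {1: 60, 8: 420, 32: 1200}
efficiency = scaling_efficiency(results)

# An efficiency well below 100% at high user counts signals sub-linear
# concurrency scaling worth investigating before committing to the platform.
for users, eff in efficiency.items():
    print(f"{users:>3} users: {eff:.0%} of linear")
```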
Resource Elasticity
• A data warehouse needs to be able to scale up and down and
take advantage of the elastic compute and storage
capabilities in the cloud, public or private, without disruption
or delay.
• The more the customer needs to be involved in resource
determination and provisioning, the less elastic, and less
modern, the solution is.
• One thing to watch for in elasticity scaling is keeping the
amount of money spent by the customer under the
customer’s control.
Machine Learning
• Today, data warehouse query languages need to be extended to include machine
learning, or firms may find the programming required will be too challenging to keep
pace.
• Data warehouses today need to weave machine learning into their data processing
workflows.
• Vendors must accommodate and extend SQL to include machine learning functions
and algorithms to expand the capabilities of those tools and users.
• If your database does not include machine learning, there are many extra things to
be concerned with.
• Other components will be needed to complete the toolbox and get the job done.
• Ideally, security for machine learning will be the same as database security.
• The data warehouse also needs to be able to operate at scale, beyond sampling.
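The pattern of extending SQL with ML functions can be illustrated in miniature with SQLite’s user-defined functions. The scoring model and its coefficients below are invented for illustration, and real warehouses expose such functions natively rather than through application code:

```python
# Sketch: a scoring function callable from SQL, the shape that in-database
# machine learning takes. The "model" here is a toy logistic function with
# made-up coefficients, registered as a SQLite user-defined function.
import math
import sqlite3

def churn_score(tenure_months, support_tickets):
    # Hypothetical coefficients; a real model would be trained, not hand-set.
    z = 1.5 - 0.1 * tenure_months + 0.4 * support_tickets
    return round(1 / (1 + math.exp(-z)), 3)

conn = sqlite3.connect(":memory:")
conn.create_function("CHURN_SCORE", 2, churn_score)
conn.execute("CREATE TABLE customers (id INTEGER, tenure_months INTEGER, "
             "support_tickets INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 36, 0), (2, 3, 5)])

# The model is now invoked from plain SQL, with database security applying:
rows = conn.execute(
    "SELECT id, CHURN_SCORE(tenure_months, support_tickets) AS score "
    "FROM customers ORDER BY score DESC").fetchall()
print(rows)
```

The point is the access pattern: analysts stay in SQL, and scoring runs where the data lives instead of being exported to a separate toolchain.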
Data Storage Format Alternatives
• Cloud object storage is relatively inexpensive, making data storage at high scale affordable.
• On-premises, specialized private cloud storage options such as Pure Storage FlashBlade tend to offer similar data type storage flexibility.
• To take full advantage of the elasticity of the cloud without driving up costs,
data warehouse compute and storage need to be scaled separately.
• To take full advantage of the many types of data available, such as Apache
ORC, Apache Parquet, JSON, Apache Avro, etc., modern data warehouses
need to be able to analyze that data without moving or altering it.
• A unified analytics warehouse that supports these various data formats
means you have the benefit of querying them directly, without greatly
expanding the hierarchical complex data types to a standard tabular data
structure for analysis.
• You should also be able to import data directly from these formats.
• The ability to join data for analysis between the various internal and external
data formats provides the highest level of analytic flexibility.
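To see what that expansion work looks like when the warehouse cannot query hierarchical formats directly, here is a hand-rolled flattening of a nested JSON record into tabular rows (the record itself is made up):

```python
# Sketch: expanding a hierarchical record into a standard tabular structure,
# the manual work a unified analytics warehouse lets you skip.
import json

record = json.loads("""
{"order_id": 1001,
 "customer": {"name": "Acme", "region": "west"},
 "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]}
""")

# One output row per nested item, with parent fields repeated on each row:
rows = [
    {"order_id": record["order_id"],
     "customer": record["customer"]["name"],
     "region": record["customer"]["region"],
     "sku": item["sku"],
     "qty": item["qty"]}
    for item in record["items"]
]
for row in rows:
    print(row)
```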
Hadoop Sequence File and Parquet File
Graph Databases
(diagram: two graphs joined through bridge vertices)
• Subject: John R Peterson Predicate: Knows Object: Frank T Smith
• Subject: Triple #1 Predicate: Confidence Percent Object: 70
• Subject: Triple #1 Predicate: Provenance Object: Mary L Jones
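The triples above show reification: the second and third triples make statements about the first one. A minimal sketch of the same structure, using plain Python tuples in place of a real triple store:

```python
# Sketch: reified triples — statements about statements. Plain tuples
# stand in for a real triple store here.

triples = {}

def add_triple(tid, subject, predicate, obj):
    triples[tid] = (subject, predicate, obj)

add_triple("t1", "John R Peterson", "Knows", "Frank T Smith")
add_triple("t2", "t1", "Confidence Percent", 70)   # subject is triple t1
add_triple("t3", "t1", "Provenance", "Mary L Jones")

# Retrieve everything asserted about triple t1 itself:
about_t1 = {p: o for (s, p, o) in triples.values() if s == "t1"}
print(about_t1)
```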
Data Warehouse vs. Data Lake
(chart: positioned along two axes, usage understanding by the builders and data cultivation)
Balance of Analytics
(diagram: the mix of analytic applications shifts between the data warehouse and the data lake across organizations)
Design Your Test
• What are you benchmarking?
– Query performance
– Load performance
– Query performance with concurrency
– Ease of use
• Competition
• Queries, Schema, Data
• Scale
• Cost
• Query Cut-Off
• Number of runs/cache
• Number of nodes
• Tuning allowed
• Vendor Involvement
• Any free third-party, SaaS, or on-demand software (e.g., Apigee or SQL Server)
• Any non-free third-party, SaaS, or on-demand software
• Instance type of nodes
• Measure Price/Performance!
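The “number of runs/cache” item deserves care: averaging a cold first run together with warmed runs hides both numbers. A small sketch of that discipline, using SQLite as a stand-in database:

```python
# Sketch: report the first (cold) run separately from warmed runs instead
# of averaging them together. SQLite stands in for the system under test.
import sqlite3
import statistics
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

def timed_run():
    start = time.perf_counter()
    conn.execute("SELECT COUNT(*), SUM(x) FROM t").fetchone()
    return time.perf_counter() - start

runs = [timed_run() for _ in range(5)]
cold, warm = runs[0], runs[1:]
print(f"cold run:    {cold * 1000:.2f} ms")
print(f"warm median: {statistics.median(warm) * 1000:.2f} ms")
```

Decide up front which number the benchmark reports; real systems show far larger cold/warm gaps than this toy case, and the choice changes the winner.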