Using Data Platforms that are Fit-For-Purpose
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” (Onalytica)
President, McKnight Consulting Group
A 2-time Inc. 5000 Company
@williammcknight
www.mcknightcg.com
(214) 514-1444
© All rights reserved. Matillion 2021
A vendor perspective from Matillion
Modern Data Storage Evolutions
Paul Lacey
Sr. Director, Product Marketing
Matillion
Speaker
History of Data Warehousing
1960: DBMS
1970: RDBMS
1980: SQL
1988: Data Warehouse
1992: Building the Data Warehouse published
1996: The Data Warehouse Toolkit published
2005: Hadoop, Big Data
2013: Cloud Data Warehouse
2015: Cloud ETL
2017: Lakehouse
History of Data Warehousing
(timeline graphic: Hadoop ‘Big Data’ in 2005, Cloud ETL in 2015, Spectrum, and vendor milestones in 2011, 2013, 2014, 2017, and 2019; remaining labels not recoverable)
Architectural Paradigms
(chart: effort-to-reward vs. innovation)
2005: Original Big Data Stack
2013: Pipeline 2.0
2015: Hybrid Storage
2017: Lakehouse
The Original Big Data Stack
Pipeline 2.0
Hybrid Storage
ML & BI Separated and Siloed

                 Data Engineering   Data Science
Wrangling        ETL                Data Prep
Storage          Data Warehouse     Data Lake
Processing       Scala              Pandas
Orchestration    Airflow            Jupyter Notebooks
Visualization    Tableau            Matplotlib
The Lakehouse Approach
Combine data science and data engineering workflows
Lakehouse: best of both worlds
Data Warehouse + Data Lake
Streaming Analytics, BI, Data Science, Machine Learning
Structured, Semi-Structured, and Unstructured Data
Lakehouse
Familiar Interfaces, One Dataset
(diagram: Matillion ELT logic and orchestration over one ACID dataset; Analysts use SQL, Data Scientists use PySpark/Spark, with Data Engineers, Integrators, and Business Users sharing the same data)
Accessible Integration Brings Unified Analytics
More Info: matillion.com
Matillion #TeamGreen
matillion.com
Thank You!
Performance
• Performance is a critical point of interest when it
comes to selecting an analytics platform.
• To measure data warehouse performance, we use
similarly priced specifications across data
warehouse competitors.
• Usually when people say they care about
performance, it is the ultimate metric of
price/performance.
• The realities of creating fair tests can overwhelm many shops, and the effort involved is usually underestimated.
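Since the ultimate metric is price/performance, a quick way to normalize benchmark results is cost per workload run. A minimal sketch, with hypothetical hourly rates and timings standing in for real measurements:

```python
# Price/performance: the cost to complete a fixed query workload once.
# Rates and timings below are hypothetical, for illustration only.

def price_performance(hourly_rate_usd, workload_seconds):
    """Cost in USD to run the benchmark workload once."""
    return hourly_rate_usd * (workload_seconds / 3600)

# Two hypothetical platforms at similar list prices:
platform_a = price_performance(hourly_rate_usd=16.00, workload_seconds=1200)
platform_b = price_performance(hourly_rate_usd=12.00, workload_seconds=1800)

print(f"Platform A: ${platform_a:.2f} per workload run")  # faster but pricier
print(f"Platform B: ${platform_b:.2f} per workload run")
```

In this made-up case the pricier-per-hour platform still wins on price/performance because it finishes the workload sooner, which is exactly why raw query times alone mislead.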
The Perils of Performance Alone
• A modern workload is less often a fixed set of queries and more often an interactive, variable number of queries
• A lack of certain key features and functions in the chosen platform increases time spent
• Some data warehouse platforms have features that appear beneficial and desirable but carry hidden downsides
Enterprise Analytic Platforms
Data is fit-for-purpose (FFP) when it is…
• In a leverageable platform
• In an appropriate platform for its profile and usage
• Held to high non-functional standards (availability, performance, scalability, stability, durability, security)
• Captured at the most granular level
• At a data quality standard (as defined by Data Governance)
Product Setup
Cost Predictability and Transparency
• The cost profile options for cloud databases are straightforward if you accept the defaults for simple workloads or proof-of-concept (POC) environments.
• Initial entry costs and inadequately scoped
environments can artificially lower
expectations of the true costs of jumping into
a cloud data warehouse environment.
• For some, you pay for compute resources as a
function of time, but you also choose the
hourly rate based on certain enterprise
features you need.
Cost Consciousness and Licensing Structure
• Be on the lookout for cost optimizations like not
paying when the system is idle, compression to
save storage costs, and moving or isolating
workloads to avoid contention.
• Look for the ability to directly operate on compact open file formats such as Parquet and ORC.
• Also, costs can spin out of control if you have to
pay a separate license for each deployment option
or each machine learning algorithm.
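To see why idle-time billing matters, consider a rough comparison of an always-on warehouse against one that auto-suspends when idle. All figures are hypothetical placeholders:

```python
# Sketch: savings from auto-suspending an idle warehouse.
# The hourly rate and usage pattern are assumptions for illustration.

HOURLY_RATE = 8.00      # USD per compute-hour (hypothetical)
HOURS_PER_MONTH = 730
ACTIVE_HOURS = 200      # hours/month the warehouse actually serves queries

always_on_cost = HOURLY_RATE * HOURS_PER_MONTH
auto_suspend_cost = HOURLY_RATE * ACTIVE_HOURS
savings = always_on_cost - auto_suspend_cost

print(f"Always on:    ${always_on_cost:,.2f}/month")
print(f"Auto-suspend: ${auto_suspend_cost:,.2f}/month (saves ${savings:,.2f})")
```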
Easy Administration
• Overall costs and time, as well as storage and compute resources, are affected by the simplicity of configuration and overall use.
• The platform should have embraced a self-sufficiency
model for its customers and be well into the process of
automating repetitive tasks.
• Easy administration starts with a setup process that asks for basic information and provides helpful guidance for selecting the storage and node configurations.
Optimizer Robustness
• The data warehouse should be designed for
complex decision support and machine learning
activity in a multi-user, mixed workload, highly
concurrent environment.
• Check on conditional parallelism and what causes variations in the parallelism deployed.
• Check on dynamic and controllable
prioritization of resources for queries.
Dedicated Compute
• The dedicated compute category represents the heart of
the analytics stack—the data warehouse itself.
• A modern cloud data warehouse must have separate
compute and storage architecture.
• The power to scale compute and storage independently of
one another has transitioned from an industry trend to an
industry standard.
Dedicated Storage
• The dedicated storage category represents
storage of the enterprise data.
• In former days, this data was tightly-coupled to
the data warehouse itself, but modern cloud
architecture allows for the data to be stored
separately (and priced separately).
Data Integration
• The data integration category represents the
movement of enterprise data from source to
the target data warehouse through
conventional ETL (extract-transform-load) and
ELT (extract-load-transform) methods.
Data Access
• Azure Synapse and Google BigQuery have a “serverless” pricing model that allows
users to run queries and only pay for the data they scan and not an hourly rate for
compute.
• Redshift has the Spectrum service to scan data in S3 without loading it into the data
warehouse; however, you pay for the data scanned, plus you need a running Redshift
cluster at an additional charge.
• For Snowflake, you pay for the compute, but not for data scanned. For all these
scenarios (except Snowflake), we assumed 500TB scanned per month for the
Medium-tier enterprise and 2,000TB scanned for Large organizations.
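The two pricing models can be compared directly once you estimate monthly scan volume and cluster uptime. A sketch with placeholder rates (not actual vendor list prices), using the 500 TB Medium-tier scan assumption from above:

```python
# Comparing a scan-priced ("serverless") model with an hourly-compute model.
# Both rates are hypothetical placeholders, not vendor list prices.

def serverless_monthly_cost(tb_scanned, usd_per_tb):
    return tb_scanned * usd_per_tb

def cluster_monthly_cost(hourly_rate, hours_running):
    return hourly_rate * hours_running

# Medium-tier scenario: 500 TB scanned per month vs. an always-on cluster.
scan_priced = serverless_monthly_cost(tb_scanned=500, usd_per_tb=5.00)
hourly_priced = cluster_monthly_cost(hourly_rate=10.00, hours_running=730)

print(f"Scan-priced:   ${scan_priced:,.2f}/month")
print(f"Hourly-priced: ${hourly_priced:,.2f}/month")
```

Which model wins flips entirely with the workload shape: heavy continuous use favors hourly compute, while sporadic scanning favors pay-per-scan.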
Data Lake
• The data lake category represents the use of a data lake that is separate from the data warehouse. This is common in many modern data-driven organizations as a way to store and analyze massive data sets of “colder” data that don’t necessarily belong in the data warehouse.
Sample Breakout (AWS)
Dedicated Compute     43%
Storage                0%
Data Integration      14%
Streaming              4%
Spark Analytics        3%
Data Exploration       6%
Data Lake             20%
BI                     5%
Machine Learning       5%
Identity Management    0%
Data Catalog           0%
Product Utilization
Concurrency Scaling
• If the database has concurrency limitations, designing around them is difficult at best and limits effective data usage.
• If the data warehouse automatically scales up to
overcome concurrency limitations, this may be costly if
the data warehouse charges by compute node.
• If the data warehouse charges per user, costs will also
increase as the data warehouse is put to more use in the
company.
• Look for a data warehouse to provide linear scaling in
overall query workload performance as concurrent users
are added.
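One way to quantify “linear scaling” is throughput per user relative to the single-user baseline. A sketch, with hypothetical benchmark figures standing in for real measurements:

```python
# Sketch: checking for near-linear throughput scaling as concurrent users
# are added. The measured throughput figures are hypothetical inputs.

def scaling_efficiency(measured):
    """measured: {concurrent_users: queries_per_minute}.
    Returns per-user throughput relative to the single-user baseline."""
    baseline = measured[1]
    return {users: (qpm / users) / baseline for users, qpm in measured.items()}

# Hypothetical benchmark results:
results = {1: 60, 8: 420, 32: 1200}
efficiency = scaling_efficiency(results)

# An efficiency well below 100% at high user counts signals sub-linear
# concurrency scaling worth investigating before committing to the platform.
for users, eff in efficiency.items():
    print(f"{users:>3} users: {eff:.0%} of linear")
```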
Resource Elasticity
• A data warehouse needs to be able to scale up and down and
take advantage of the elastic compute and storage
capabilities in the cloud, public or private, without disruption
or delay.
• The more the customer needs to be involved in resource
determination and provisioning, the less elastic, and less
modern, the solution is.
• One thing to watch for in elasticity scaling is keeping the
amount of money spent by the customer under the
customer’s control.
Machine Learning
• Today, data warehouse query languages need to be extended to include machine
learning, or firms may find the programming required will be too challenging to keep
pace.
• Data warehouses today need to weave machine learning into their data processing
workflows.
• Vendors must accommodate and extend SQL to include machine learning functions
and algorithms to expand the capabilities of those tools and users.
• If your database does not include machine learning, there are many extra things to
be concerned with.
• Other components will be needed to complete the toolbox and get the job done.
• Ideally, security for machine learning will be the same as database security.
• The data warehouse also needs to be able to operate at scale, beyond sampling.
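The pattern of extending SQL with ML functions can be illustrated in miniature with SQLite’s user-defined functions. The scoring model and its coefficients below are invented for illustration, and real warehouses expose such functions natively rather than through application code:

```python
# Sketch: a scoring function callable from SQL, the shape that in-database
# machine learning takes. The "model" here is a toy logistic function with
# made-up coefficients, registered as a SQLite user-defined function.
import math
import sqlite3

def churn_score(tenure_months, support_tickets):
    # Hypothetical coefficients; a real model would be trained, not hand-set.
    z = 1.5 - 0.1 * tenure_months + 0.4 * support_tickets
    return round(1 / (1 + math.exp(-z)), 3)

conn = sqlite3.connect(":memory:")
conn.create_function("CHURN_SCORE", 2, churn_score)
conn.execute("CREATE TABLE customers (id INTEGER, tenure_months INTEGER, "
             "support_tickets INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 36, 0), (2, 3, 5)])

# The model is now invoked from plain SQL, with database security applying:
rows = conn.execute(
    "SELECT id, CHURN_SCORE(tenure_months, support_tickets) AS score "
    "FROM customers ORDER BY score DESC").fetchall()
print(rows)
```

The point is the access pattern: analysts stay in SQL, and scoring runs where the data lives instead of being exported to a separate toolchain.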
Data Storage Format Alternatives
• Cloud object storage is relatively inexpensive, making data storage at high scale affordable.
• On-premises, specialized private cloud storage options such as Pure Storage FlashBlade tend to offer similar data type storage flexibility.
• To take full advantage of the elasticity of the cloud without driving up costs,
data warehouse compute and storage need to be scaled separately.
• To take full advantage of the many types of data available, such as Apache
ORC, Apache Parquet, JSON, Apache Avro, etc., modern data warehouses
need to be able to analyze that data without moving or altering it.
• A unified analytics warehouse that supports these various data formats
means you have the benefit of querying them directly, without greatly
expanding the hierarchical complex data types to a standard tabular data
structure for analysis.
• You should also be able to import data directly from these formats.
• The ability to join data for analysis between the various internal and external
data formats provides the highest level of analytic flexibility.
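To see what that expansion work looks like when the warehouse cannot query hierarchical formats directly, here is a hand-rolled flattening of a nested JSON record into tabular rows (the record itself is made up):

```python
# Sketch: expanding a hierarchical record into a standard tabular structure,
# the manual work a unified analytics warehouse lets you skip.
import json

record = json.loads("""
{"order_id": 1001,
 "customer": {"name": "Acme", "region": "west"},
 "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}]}
""")

# One output row per nested item, with parent fields repeated on each row:
rows = [
    {"order_id": record["order_id"],
     "customer": record["customer"]["name"],
     "region": record["customer"]["region"],
     "sku": item["sku"],
     "qty": item["qty"]}
    for item in record["items"]
]
for row in rows:
    print(row)
```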
Hadoop Sequence File and Parquet File
Graph Databases
(diagram: two graphs joined through bridge vertices)
• Subject: John R Peterson Predicate: Knows Object: Frank T Smith
• Subject: Triple #1 Predicate: Confidence Percent Object: 70
• Subject: Triple #1 Predicate: Provenance Object: Mary L Jones
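The triples above show reification: the second and third triples make statements about the first one. A minimal sketch of the same structure, using plain Python tuples in place of a real triple store:

```python
# Sketch: reified triples — statements about statements. Plain tuples
# stand in for a real triple store here.

triples = {}

def add_triple(tid, subject, predicate, obj):
    triples[tid] = (subject, predicate, obj)

add_triple("t1", "John R Peterson", "Knows", "Frank T Smith")
add_triple("t2", "t1", "Confidence Percent", 70)   # subject is triple t1
add_triple("t3", "t1", "Provenance", "Mary L Jones")

# Retrieve everything asserted about triple t1 itself:
about_t1 = {p: o for (s, p, o) in triples.values() if s == "t1"}
print(about_t1)
```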
Data Warehouse vs. Data Lake
(chart: positioned along two axes, usage understanding by the builders and data cultivation)
Balance of Analytics
(diagram: the mix of analytic applications shifts between the data warehouse and the data lake across organizations)
Design Your Test
• What are you benchmarking?
– Query performance
– Load performance
– Query performance with concurrency
– Ease of use
• Competition
• Queries, Schema, Data
• Scale
• Cost
• Query Cut-Off
• Number of runs/cache
• Number of nodes
• Tuning allowed
• Vendor Involvement
• Any free third-party, SaaS, or on-demand software (e.g., Apigee or SQL Server)
• Any non-free third-party, SaaS, or on-demand software
• Instance type of nodes
• Measure Price/Performance!
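The “number of runs/cache” item deserves care: averaging a cold first run together with warmed runs hides both numbers. A small sketch of that discipline, using SQLite as a stand-in database:

```python
# Sketch: report the first (cold) run separately from warmed runs instead
# of averaging them together. SQLite stands in for the system under test.
import sqlite3
import statistics
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

def timed_run():
    start = time.perf_counter()
    conn.execute("SELECT COUNT(*), SUM(x) FROM t").fetchone()
    return time.perf_counter() - start

runs = [timed_run() for _ in range(5)]
cold, warm = runs[0], runs[1:]
print(f"cold run:    {cold * 1000:.2f} ms")
print(f"warm median: {statistics.median(warm) * 1000:.2f} ms")
```

Decide up front which number the benchmark reports; real systems show far larger cold/warm gaps than this toy case, and the choice changes the winner.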