Comparing the Enterprise
Analytic Solutions
Presented by: William McKnight
President, McKnight Consulting Group
williammcknight
www.mcknightcg.com
(214) 514-1444
Proprietary + Confidential
Powering Data
Experiences
to Drive Growth
Joel McKelvey,
Looker Product Management
Google Cloud
https://www.forrester.com/report/InsightsDriven+Businesses+Set+The+Pace+For+Global+Growth/-/E-RES130848
“Insights-driven businesses
harness and implement digital
insights strategically and at scale
to drive growth and create
differentiating experiences,
products, and services.”
7x
Faster growth than
global GDP
30%
Growth or more using
advanced analytics in a
transformational way
2.3x
More likely to succeed
during disruption
*Source: https://emtemp.gcom.cloud/ngw/globalassets/en/information-technology/documents/trends/gartner-2019-cio-agenda-key-takeaways.pdf
Rebalance Your Technology Portfolio Toward
Digital Transformation
Gartner: Digital-fueled growth is the top investment priority for technology leaders.*
Percent of respondents increasing vs. decreasing investment:
• Cyber/information security: 40% increasing, 1% decreasing
• Cloud services or solutions (SaaS, PaaS, etc.): 33% increasing, 2% decreasing
• Core system improvements/transformation: 31% increasing, 10% decreasing
• Business Intelligence or data analytics solution: 45% increasing, 1% decreasing
Governed metrics | Best-in-class APIs | In-database | Git version-control | Security | Cloud
Integrated Insights
Sales reps enter discussions
equipped with more context
and usage data embedded
within Salesforce.
Data-driven Workflows
Reduce customer churn
with automated email
campaigns if customer
health drops
Custom Applications
Maintain optimal inventory levels
and pricing with merchandising
and supply chain management
application
Modern BI & Analytics
Self-service analytics for
install operations, sales
pipeline management,
and customer operations
SQL In, Results Back
‘API-first’ extensibility
Technology Layers
Semantic modeling layer
In-database architecture
Built on the cloud
strategy of your choice
1 in 2
customers integrate
insights/experiences beyond
Looker
2000+
Customers
5000+
Developers
Empower People with the Smarter Use of Data
© 2020 Looker. All rights reserved. Confidential.
BEACON Digital, Part III
BI Modernization March 23, 9:30am PST
Embedded Analytics March 24, 9:30am PST
looker.com/beacon
Thank you
William McKnight
President, McKnight Consulting Group
• Consulted to Pfizer, Scotiabank, Fidelity, TD Ameritrade,
Teva Pharmaceuticals, Verizon, and many other Global
1000 companies
• Frequent keynote speaker and trainer internationally
• Hundreds of articles, blogs and white papers in publication
• Focused on delivering business value and solving business
problems utilizing proven, streamlined approaches to
information management
• Former Database Engineer, Fortune 50 Information
Technology executive, and Ernst & Young Entrepreneur of
the Year finalist
• Owner/consultant: Data strategy and implementation
consulting firm
McKnight Consulting Group Client Portfolio
Preparing the Organization for the
Future
Priorities
Data Success Measurement
• User Satisfaction
• Business ROI and growth instigated
• Data Maturity (long-term User Satisfaction and Business ROI)
• Misc.
Data Profile vs. Usage Profile
Increasing Probability that Platform Selection Leads to Success
• Best Category and Top Tool Picked: 80%
• Best Category Picked: 70%
• Top 2 Category Picked: 60%
• Same Ol’ Platform: 50%
Analytic Database Data Stores
No More
• One Size Fits All
• The DW for everything
Modern Use Cases
[Diagram: the data warehouse and data lake feed machine learning. Historical transaction data is split into categorical and quantitative data; a categorical model (e.g., decision tree) and a quantitative model (e.g., regression) are each trained, scored, and evaluated, then deployed to score real-time transactions and drive actions.]
Analytics Reference Architecture
[Diagram: logs (apps, web, devices), user tracking, operational metrics, sensors, files, and transactional/context data (OLTP/ODS) feed raw data topics (JSON, Avro) and, via ETL or EL with T in Spark, processed data topics. Stream processing serves low-latency applications; batch paths (reach-through, ETL/ELT, or import) populate a distributed analytical warehouse and a governed data lake. Data governance spans the whole architecture.]
[Diagram: the data warehouse and the data lake positioned along two axes: usage understanding by the builders, and data cultivation.]
Balance of Analytics
[Diagram: three snapshots of analytic applications drawing on the data warehouse and the data lake, with the balance shifting over time from the DW alone toward a growing data lake share.]
Cloud Analytic Databases
Beyond Performance Checklist
• Cost Predictability and
Transparency
• Multi-Cluster Costs
• In-Database Machine Learning
• SQL Compatibility
• Provisioning Workloads with
Security Controls
• ML Security same as Database
Security
• Resource Elasticity
• Automated Resource Elasticity
• Granular Resource Elasticity
• Licensing Structure
• Cost Conscious Features
• Data Storage Alternatives
• Unstructured and Semi-Structured
Data Support
• Streaming Data Support
• Connectivity with standard ETL and
Data Visualization software
• Concurrency Scaling
• Seamless Upgrades
• Hot Pluggable Components
• Single Point of Entry for System
Administration
• Easy Administration
• Optimizer Robustness
• Disaster Recovery
• Workload Isolation
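A checklist like this is easier to apply when scored. A minimal sketch, assuming a simple weighted-sum approach (the criteria names and weights below are illustrative, not a prescribed methodology):

```python
def score_platform(ratings, weights=None):
    """Weighted checklist score: `ratings` maps criterion -> 0-5 rating,
    `weights` maps criterion -> importance (default weight 1)."""
    weights = weights or {}
    return sum(rating * weights.get(criterion, 1)
               for criterion, rating in ratings.items())

# Example: weight SQL compatibility double for a migration-heavy shop.
ratings = {"SQL Compatibility": 5, "Disaster Recovery": 3, "Easy Administration": 4}
print(score_platform(ratings, weights={"SQL Compatibility": 2}))  # 17
```

Scoring each candidate against the same weighted checklist keeps the "beyond performance" criteria from being crowded out by raw benchmark numbers.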
• Login to AWS Console https://console.aws.amazon.com/
• Create and Launch EC2 Instance
– Choose your Amazon Machine Image (AMI)
– Choose your Instance Type
– Add Storage
– Configure Security Group
– PEM Key Pair
– Connect/SSH to Instance
– Access keys
• Set up S3 Storage
– Create bucket
• Set up Redshift
– Identity & Access Management
– Create Role and Attach Policies
– Configure Cluster
– Launch Cluster
• Load Data
• Query Data
Enterprise Analytic Solutions Setup (i.e.,
EC2, S3, Redshift)
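The "Load Data" step typically uses Redshift's COPY command to pull files from the S3 bucket, authorized by the IAM role created above. A minimal sketch that just builds the statement (the table name, bucket path, and role ARN are placeholders; execution through your SQL client is left out):

```python
def redshift_copy(table: str, s3_path: str, iam_role_arn: str, fmt: str = "CSV") -> str:
    """Build a Redshift COPY statement to load a table from S3; the IAM
    role must already be attached to the cluster."""
    return (f"COPY {table} FROM '{s3_path}' "
            f"IAM_ROLE '{iam_role_arn}' FORMAT AS {fmt};")

# Placeholder identifiers for illustration only.
print(redshift_copy("sales", "s3://my-bucket/sales/",
                    "arn:aws:iam::123456789012:role/RedshiftLoadRole"))
```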
• Create Statistics
• Manual & Automatic Snapshots
• Distribution Keys
• Elastic Resize
• Vacuum Tables
• Cluster Parameter Group (for Workload
Management)
• Short Query Acceleration
Other Concepts (Redshift example)
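Several of these concepts reduce to routine SQL you schedule against the cluster. A hedged sketch that only generates the statements (table names are placeholders, and running them via your driver of choice is left out):

```python
def maintenance_sql(tables):
    """Per-table Redshift maintenance: VACUUM re-sorts rows and reclaims
    space; ANALYZE refreshes the planner statistics ("Create Statistics")."""
    statements = []
    for table in tables:
        statements.append(f"VACUUM {table};")
        statements.append(f"ANALYZE {table};")
    return statements

print(maintenance_sql(["sales", "events"]))
```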
• Azure SQL Data Warehouse is scaled by Data Warehouse Units (DWUs) which are
bundled combinations of CPU, memory, and I/O. According to Microsoft, DWUs
are “abstract, normalized measures of compute resources and performance.”
• Amazon Redshift uses EC2-like instances with tightly coupled compute and
storage, which is a “node” in the more conventional sense
• Snowflake “nodes” are loosely defined as a measure of virtual compute
resources. Their architecture is described as “a hybrid of traditional shared-disk
database architectures and shared-nothing database architectures.” Thus, it is
difficult to infer what a “node” actually is.
• Google BigQuery does not use the concept of a node at all, but instead refers to
“slots” as “a unit of computational capacity required to execute SQL queries”
Different Terminology
Sample Enterprise Analytic Solutions
Enterprise Analytic Solutions
• Actian Avalanche
• AWS Redshift
• Azure Synapse
• Cloudera
• Google BigQuery
• IBM Db2 Warehouse on Cloud
and Cloud Pak for Data
DISCLOSURE: PAST/CURRENT CLIENT
Enterprise Analytic Solutions
• Micro Focus Vertica
• Oracle Autonomous Data
Warehouse
• Snowflake
• Teradata
• Yellowbrick
DISCLOSURE: PAST/CURRENT CLIENT
Actian Avalanche
• MPP relational columnar database built to deliver high performance at low TCO both in the cloud and
on-prem for BI and operational analytics use cases.
• Actian Avalanche is based on its underlying technology, known as Vector. The basic architecture of
Actian Avalanche is the Actian patented X100 engine, which utilizes a concept known as "vectorized
query execution" where processing of data is done in chunks of cache-fitting vectors.
• Avalanche performs “single instruction, multiple data” processes by leveraging the same operation on
multiple data simultaneously and exploiting the parallelism capabilities of modern hardware. It
reduces overhead found in conventional "one-row-at-a-time processing" found in other platforms.
Additionally, the compressed column-oriented format uses a scan-optimized buffer manager.
• The measure of Actian Avalanche compute power is known as Avalanche Units (AU). The price is per
AU per hour and includes both compute and cluster storage.
• It’s a pure column store
• Compression is typically 5:1
• Multi-Core Parallelism
• CPU Cache is Used as Execution Memory – Process data in chip cache not RAM
• Storage Indexes are created automatically by quickly identifying candidate data blocks for solving
queries
• Fast and cost-effective
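The chunked, one-operation-over-many-values pattern behind vectorized execution can be sketched in plain Python. This only illustrates the idea; it is not Actian code, and a real engine sizes its chunks to the CPU cache rather than an arbitrary constant:

```python
def chunked_sum(values, chunk_size=1024):
    """Sum a column chunk by chunk: each slice stands in for a
    cache-fitting vector, and sum() for one operation applied to
    many data items at once, rather than one-row-at-a-time logic."""
    total = 0
    for start in range(0, len(values), chunk_size):
        total += sum(values[start:start + chunk_size])
    return total

print(chunked_sum(range(10_000)))  # 49995000
```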
Amazon Redshift
• Amazon Redshift was the first managed data warehouse service and continues to get a high
level of mindshare in this category.
• One of the interesting features of Redshift is result set caching.
• At the enterprise class, Redshift dense compute nodes (dc2.8xlarge) have 2.56TB per node of
solid state drives (SSD) local storage. Their dense storage nodes (ds2.8xlarge) have 16TB per
node, but it is on spinning hard disks (HDD) with slower I/O performance.
• Redshift has some future-proofing (like Spectrum and short query acceleration) that a modern
data engineering approach might utilize. Short query acceleration uses machine learning to
provide higher performance, faster results, and better predictability of query execution times.
• Amazon Redshift is a fit for organizations needing a data warehouse with a clear, consistent
pricing model. Amazon Web Services supports most of the databases in this report, and then
some. Redshift is not the only analytic database on AWS, although sometimes this gets
convoluted.
Azure Synapse
• Azure SQL Data Warehouse made its debut for public use in mid-2016. This is a managed
service, dedicated data warehouse offering from the DATAllegro/PDW/APS legacy. Azure SQL
Data Warehouse Gen 2, optimized for compute, is a massively parallel processing, shared-nothing
architecture on cluster nodes each running Azure SQL Database, which shares the
same codebase as Microsoft SQL Server.
• Azure SQL Data Warehouse supports 128 concurrent queries, a relatively high number.
• Microsoft also has a deep partnership with Databricks, which is becoming very popular in the
data science community. The partnership uses Azure Active Directory to log into the database.
• Overall, Azure SQL Data Warehouse continues to be an excellent choice for companies needing a
high-performance and scalable analytical database in the cloud or to augment the current, on-
premises offering with a hybrid architecture at a reasonable cost.
Cloudera Data Warehouse Service
• Cloudera Data Warehouse (CDW) boasts flexibility through support for both data center
and multiple public cloud deployments, as well as capabilities across analytical,
operational, data lake, data science, security, and governance needs.
• CDW is part of CDP, a secure and governed cloud service platform that offers a broad set
of enterprise data cloud services with the key data functionality for the modern
enterprise. CDP was designed to address multi-faceted needs by offering multi-function
data management and analytics to solve an enterprise’s most pressing data and analytic
challenges in a streamlined fashion.
• The architecture and deployment of CDP begins with the Management Console, where
several important tasks are performed. First, the preferred cloud environment (for
example, AWS or Azure) is set up. Second, data warehouse clusters and machine learning
(ML) workspaces are launched. Third, additional services, such as Data Catalog, Workload
Experience Manager, and Replication Manager are utilized, if required.
• The Cloudera Data Warehouse service provides self-service independent virtual
warehouses running on top of the data kept in a cloud object store, such as S3.
Google BigQuery
• Google BigQuery has the most distinctive approach to cloud analytic databases, with an
ecosystem of products for data ingestion and manipulation and a unique pricing apparatus.
• The back end is abstracted; BigQuery acts as a RESTful front end to all the Google Cloud storage
needed, with all data replicated geographically and Google managing where queries execute.
(The customer can choose the jurisdictions of their storage according to their safe harbor and
cross-border restrictions.)
• Pricing is by data and query, including Data Definition Language (DDL), or by flat rate pricing by
“slot,” a unit of computational capacity required to execute SQL queries. This price model may
make sense for high-data usage customers. Google also lowers the cost of unused storage.
• Google Marketing Platform data (including the former DoubleClick), Salesforce.com,
AccuWeather, Dow Jones, and 70+ other public data sets can be included in the
BigQuery dataset.
• Billing is based on the amount of data you query and store. Customers can pre-purchase flat-
rate computation “slots” or units in increments per month per 500 compute units. However,
Google recently introduced Flex Slots, which allow slot reservations as short as one minute and
billed by the hour. There is a separate charge for active storage of data.
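On-demand billing can be estimated directly from bytes scanned. A minimal sketch, with an assumed illustrative rate (the $5/TB default is an assumption for the example, not a figure from this deck; check current BigQuery pricing):

```python
def bigquery_on_demand_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimated on-demand query cost: bytes scanned converted to TB,
    multiplied by an assumed per-TB rate."""
    return (bytes_scanned / 1024**4) * usd_per_tb

print(bigquery_on_demand_cost(2 * 1024**4))  # scanning 2 TB -> 10.0
```

Comparing this figure against a flat-rate slot reservation for your monthly scan volume is how high-usage customers decide between the two models.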
Micro Focus Vertica in Eon Mode
• Vertica is owned by Micro Focus, which introduced Vertica in
Eon Mode as the way to set up a Vertica cluster in the
cloud. Vertica in Eon Mode is a fully ANSI SQL compliant relational
database management system that separates compute from storage.
• Vertica is built on Massively Parallel Processing (MPP) and columnar-
based architecture that scales and provides high-speed analytics.
• Vertica offers two deployment modes – Vertica in Enterprise Mode
and Vertica in Eon Mode. Vertica in Eon Mode uses a dedicated
Amazon S3 bucket for storage, with a varying number of compute
nodes spun up as necessary to meet the demands of the workloads.
• Vertica in Eon Mode also allows the database to be turned “off”
without cluster disruption when turned back “on.” Vertica in Eon
Mode also has workload management and its compute nodes can
access ORC and Parquet formats in other S3 clusters.
Snowflake Data Warehouse
• Snowflake Computing was founded in 2012 as the first data warehouse purpose-built for the
cloud. Snowflake has seen tremendous adoption, including international accounts and
deployment on Azure cloud.
• Snowflake’s compute scales in full cluster increments with node counts in powers of two.
Spinning up or down is instant and requires no manual intervention, resulting in leaner
operations. Snowflake scales linearly with the cluster (i.e., for a four node cluster, moving to the
next incremental size will result in a four node expansion).
• Regarding billing, you pay per second only for the compute in use.
• On Amazon AWS, Snowflake is architected to use Amazon S3 as its storage layer and has a native
advantage of being able to access an S3 bucket within the COPY command syntax. On Microsoft
Azure, it uses Azure Blob store.
• The UI is well regarded.
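The powers-of-two scaling described above can be sketched as follows. The warehouse size names are Snowflake's, but the node counts here are illustrative of the doubling pattern only, not official figures:

```python
# Each step up the size ladder doubles the cluster (powers of two).
SIZES = ["X-Small", "Small", "Medium", "Large", "X-Large"]

def nodes_for(size: str) -> int:
    """Illustrative node count for a warehouse size."""
    return 2 ** SIZES.index(size)

# Moving a 4-node cluster up one size adds 4 nodes, matching the
# "four node expansion" example above.
print(nodes_for("Large") - nodes_for("Medium"))  # 4
```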
Teradata Vantage
• Teradata is available on Amazon Web Services, Teradata
Cloud, VMware, Microsoft Azure, on-premises, and
IntelliFlex – Teradata’s latest MPP architecture with
separate storage and compute.
• With Vantage, Teradata is still the gold standard in
complex mixed workload query situations for enterprise-
level, worry-free concurrency as well as scaling
requirements and predictably excellent performance
featuring top notch non-functional requirements.
• Dynamic resource prioritization and workload
management.
Understanding Pricing 1/2
• The price-performance metric is dollars per query-hour ($/query-hour).
– This is defined as the normalized cost of running a workload.
– It is calculated by multiplying the rate offered by the cloud platform vendor by the number of compute nodes used
in the cluster, then dividing by the aggregate execution time.
• To determine pricing, each platform has different options. Buyers should be
aware of all their pricing options.
• For Azure SQL Data Warehouse, you pay for compute resources as a function
of time.
– The hourly rate for SQL Data Warehouse varies slightly by region.
– Also add the separate storage charge to store the data (compressed) at a rate of $ per TB
per hour.
• For Amazon Redshift, you also pay for compute resources (nodes) as a
function of time.
– Redshift also has reserved instance pricing, which can be substantially cheaper than on-
demand pricing, available with 1 or 3-year commitments and is cheapest when paid in full
upfront.
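A literal implementation of the $/query-hour definition above, taking the vendor rate as a per-node hourly price (the input figures are illustrative only):

```python
def price_per_query_hour(node_rate_usd_per_hour: float,
                         num_nodes: int,
                         total_exec_hours: float) -> float:
    """$/query-hour as defined above: the vendor's per-node hourly rate
    times the node count, divided by the aggregate execution time of
    the workload."""
    return (node_rate_usd_per_hour * num_nodes) / total_exec_hours

# e.g., 4 nodes at $6.80/hour completing the workload in 2 hours:
print(price_per_query_hour(6.80, 4, 2.0))  # 13.6
```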
Understanding Pricing 2/2
• For Snowflake, you pay for compute resources as a function of time—just
like SQL Data Warehouse and Redshift.
– However, you choose the hourly rate based on the enterprise features you need
(“Standard”, “Premier”, “Enterprise”/multi-cluster, “Enterprise for Sensitive Data”, and
“Virtual Private Snowflake”)
• With Google BigQuery, one option is to pay for bytes processed at $ per TB
– There’s also BigQuery flat rate
• Azure SQL Data Warehouse pricing was found at https://azure.microsoft.com/en-us/pricing/details/sql-data-
warehouse/gen2/.
• Amazon Redshift pricing was found at https://aws.amazon.com/redshift/pricing/.
• Snowflake pricing was found at https://www.snowflake.com/pricing/.
• Google BigQuery pricing was found at https://cloud.google.com/bigquery/pricing.
Design Your Benchmark
• What are you benchmarking?
– Query performance
– Load performance
– Query performance with concurrency
– Ease of use
• Competition
• Queries, Schema, Data
• Scale
• Cost
• Query Cut-Off
• Number of runs/cache
• Number of nodes
• Tuning allowed
• Vendor Involvement
• Any free third party, SaaS, or on-demand software (e.g., Apigee or SQL Server)
• Any not-free third party, SaaS, or on-demand software
• Instance type of nodes
• Measure Price/Performance!
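The run-count and query cut-off choices above translate into a very small harness. A sketch under the assumption that `run_query` is whatever callable submits SQL to the system under test (it is a placeholder, not a real API):

```python
import time

def time_queries(run_query, queries, runs=3, cutoff_s=60.0):
    """Execute each query `runs` times and record elapsed seconds;
    stop re-running a query once it exceeds the cut-off."""
    results = {}
    for q in queries:
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            run_query(q)
            elapsed = time.perf_counter() - start
            timings.append(elapsed)
            if elapsed > cutoff_s:
                break  # query cut-off reached
        results[q] = timings
    return results
```

Price/performance then falls out of the timings: multiply the workload's total elapsed hours by the cluster's hourly cost.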
Summary
• Data professionals are sitting on the future of the organization
• Data architecture is an essential organizational skill
• Artificial intelligence will drive the organization for the future
• All need a high-standard data warehouse
• Cloud analytic databases are for most organizational workloads
• Adopt a columnar orientation to data for analytic workloads
• Data lakes are becoming essential
• Use cloud storage or managed Hadoop for the data lake
• Keep an eye on developments in information management and how
they apply to your organization
Comparing the Enterprise
Analytic Solutions
Presented by: William McKnight
President, McKnight Consulting Group
williammcknight
www.mcknightcg.com
(214) 514-1444

ADV Slides: Comparing the Enterprise Analytic Solutions

  • 1.
    Comparing the Enterprise AnalyticSolutions Presented by: William McKnight President, McKnight Consulting Group williammcknight www.mcknightcg.com (214) 514-1444
  • 2.
    Proprietary + Confidential PoweringData Experiences to Drive Growth Joel McKelvey, Looker Product Management Google Cloud
  • 3.
    Proprietary + Confidential 1 https://www.forrester.com/report/InsightsDriven+Businesses+Set+The+Pace+For+Global+Growth/-/E-RES130848 “Insights-drivenbusinesses harness and implement digital insights strategically and at scale to drive growth and create differentiating experiences, products, and services.” 7x Faster growth than global GDP 30% Growth or more using advanced analytics in a transformational way 2.3x More likely to succeed during disruption
  • 4.
    Proprietary + Confidential *Source:https://emtemp.gcom.cloud/ngw/globalassets/en/information-technology/documents/trends/gartner-2019-cio-agenda-key-takeaways.pdf Rebalance Your Technology Portfolio Toward Digital Transformation Gartner: Digital-fueled growth is the top investment priority for technology leaders.* Percent of respondents increasing investment Percent of respondents decreasing investment Cyber/information security 40% 1% Cloud services or solutions (Saas, Paa5, etc.) 33% 2% Core system improvements/transformation 31% 10% How to implement product-centric delivery (by percentage of respondents) Business Intelligence or data analytics solution 45% 1% Digital Transformation
  • 5.
    Proprietary + Confidential Governedmetrics | Best-in-class APIs | In-database | Git version-control | Security | Cloud Integrated Insights Sales reps enter discussions equipped with more context and usage data embedded within Salesforce. Data-driven Workflows Reduce customer churn with automated email campaigns if customer health drops Custom Applications Maintain optimal inventory levels and pricing with merchandising and supply chain management application Modern BI & Analytics Self-service analytics for install operations, sales pipeline management, and customer operations SQL In Results Back
  • 6.
    Proprietary + Confidential ‘API-first’extensibility Technology Layers Semantic modeling layer In-database architecture Built on the cloud strategy of your choice
  • 7.
    Proprietary + Confidential 1in 2 customers integrate insights/experiences beyond Looker 2000+ Customers 5000+ Developers Empower People with the Smarter Use of Data
  • 8.
    © 2020 Looker.All rights reserved. Confidential. BEACON Digital, Part III BI Modernization March 23, 9:30am PST Embedded Analytics March 24, 9:30am PST looker.com/beacon
  • 9.
  • 10.
    Comparing the Enterprise AnalyticSolutions Presented by: William McKnight President, McKnight Consulting Group williammcknight www.mcknightcg.com (214) 514-1444
  • 11.
    William McKnight President, McKnightConsulting Group • Consulted to Pfizer, Scotiabank, Fidelity, TD Ameritrade, Teva Pharmaceuticals, Verizon, and many other Global 1000 companies • Frequent keynote speaker and trainer internationally • Hundreds of articles, blogs and white papers in publication • Focused on delivering business value and solving business problems utilizing proven, streamlined approaches to information management • Former Database Engineer, Fortune 50 Information Technology executive and Ernst&Young Entrepreneur of Year Finalist • Owner/consultant: Data strategy and implementation consulting firm 2
  • 12.
    McKnight Consulting GroupClient Portfolio
  • 13.
  • 14.
  • 15.
    Data Success Measurement UserSatisfaction Business ROI and growth instigated Data Maturity (Long-term User Sat and Bus ROI) Misc.
  • 16.
    Data Profile vs.Usage Profile
  • 17.
    Best Category andTop Tool Picked Best Category Picked Top 2 Category Picked Same Ol’ Platform 80% 70% 60% 50% Increasing Probability that Platform Selection Leads to Success
  • 18.
  • 19.
    No More 10 • OneSize Fits All • The DW for everything
  • 20.
    Modern Use Cases DataLake Data Warehouse Data Lake Machine Learning Categorical Model (e.g. Decision Tree) Categorical Data Quantitative Data Split Quantitative Model (e.g. Regression) Train Train Score Score Evaluate Historical Transaction Data Deploy Scores Real Time Transactions Actions
  • 21.
    Analytics Reference Architecture Logs (Apps,Web, Devices) User tracking Operational Metrics Offload data Raw Data Topics JSON, AVRO Processed Data Topics Sensors Transactional / Context Data OLTP/ODS ETL Or EL with T in Spark Batch Low Latency Applications Files Reach through or ETL/ELT or Import Stream Processing Stream Processing Q Q Distributed Analytical Warehouse Governed Data Lake Data Governance
  • 22.
    USAGE UNDERSTANDING BYTHE BUILDERS DATA CULTIVATION Data Warehouse Data Lake
  • 23.
    Balance of Analytics AnalyticApplications DW Data Lake Analytic Applications DW Data Lake Analytic Applications DW Data Lake DW
  • 24.
  • 25.
    Beyond Performance Checklist •Cost Predictability and Transparency • Multi-Cluster Costs • In-Database Machine Learning • SQL Compatibility • Provisioning Workloads with Security Controls • ML Security same as Database Security • Resource Elasticity • Automated Resource Elasticity • Granular Resource Elasticity • Licensing Structure • Cost Conscious Features • Data Storage Alternatives • Unstructured and Semi-Structured Data Support • Streaming Data Support • Connectivity with standard ETL and Data Visualization software • Concurrency Scaling • Seamless Upgrades • Hot Pluggable Components • Single Point of Entry for System Administration • Easy Administration • Optimizer Robustness • Disaster Recovery • Workload Isolation 16
  • 26.
    • Login toAWS Console https://console.aws.amazon.com/ • Create and Launch EC2 Instance – Choose your Amazon Machine Instance – Choose your Instance Type – Add Storage – Configure Security Group – PEM Key Pair – Connect/SSH to Instance – Access keys • Set up S3 Storage – Create bucket • Set up Redshift – Identity & Access Management – Create Role and Attach Policies – Configure Cluster – Launch Cluster • Load Data • Query Data Enterprise Analytic Solutions Setup (i.e., EC2, S3, Redshift)
  • 27.
    • Create Statistics •Manual & Automatic Snapshots • Distribution Keys • Elastic Resize • Vacuum Tables • Cluster Parameter Group (for Workload Management) • Short Query Acceleration Other Concepts (Redshift example)
  • 28.
    • Azure SQLData Warehouse is scaled by Data Warehouse Units (DWUs) which are bundled combinations of CPU, memory, and I/O. According to Microsoft, DWUs are “abstract, normalized measures of compute resources and performance.” • Amazon Redshift uses EC2-like instances with tightly-coupled compute and storage nodes which is a “node” in a more conventional sense • Snowflake “nodes” are loosely defined as a measure of virtual compute resources. Their architecture is described as “a hybrid of traditional shared-disk database architectures and shared-nothing database architectures.” Thus, it is difficult to infer what a “node” actually is. • Google BigQuery does not use the concept of a node at all, but instead refers to “slots” as “a unit of computational capacity required to execute SQL queries" Different Terminology
  • 29.
  • 30.
  • 31.
    Enterprise Analytic Solutions •Actian Avalanche • AWS Redshift • Azure Synapse • Cloudera • Google BigQuery • IBM Db2 Warehouse on Cloud and Cloud Pak for Data DISCLOSURE: PAST/CURRENT CLIENT DISCLOSURE: PAST/CURRENT CLIENT DISCLOSURE: PAST/CURRENT CLIENT DISCLOSURE: PAST/CURRENT CLIENT DISCLOSURE: PAST/CURRENT CLIENT
  • 32.
    Enterprise Analytic Solutions •Microfocus Vertica • Oracle Autonomous Data Warehouse • Snowflake • Teradata • Yellowbrick DISCLOSURE: PAST/CURRENT CLIENT DISCLOSURE: PAST/CURRENT CLIENT DISCLOSURE: PAST/CURRENT CLIENT DISCLOSURE: PAST/CURRENT CLIENT
  • 33.
    Actian Avalanche • MPPrelational columnar database built to deliver high performance at low TCO both in the cloud and on-prem for BI and operational analytics use cases. • Actian Avalanche is based on its underlying technology, known as Vector. The basic architecture of Actian Avalanche is the Actian patented X100 engine, which utilizes a concept known as "vectorized query execution" where processing of data is done in chunks of cache-fitting vectors. • Avalanche performs “single instruction, multiple data” processes by leveraging the same operation on multiple data simultaneously and exploiting the parallelism capabilities of modern hardware. It reduces overhead found in conventional "one-row-at-a-time processing" found in other platforms. Additionally, the compressed column-oriented format uses a scan-optimized buffer manager. • The measure of Actian Avalanche compute power is known as Avalanche Units (AU). The price is per AU per hour and includes both compute and cluster storage. • It’s a pure column store • Compression is typically 5:1 • Multi-Core Parallelism • CPU Cache is Used as Execution Memory – Process data in chip cache not RAM • Storage Indexes are created automatically by quickly identifying candidate data blocks for solving queries • Fat and cost-effective
  • 34.
    Amazon Redshift • AmazonRedshift was the first managed data warehouse service and continues to get a high level of mindshare in this category. • One of the interesting features of Redshift is result set caching. • At the enterprise class, Redshift dense compute nodes (dc2.8xlarge) have 2.56TB per node of solid state drives (SSD) local storage. Their dense storage nodes (ds2.8xlarge) have 16TB per node, but it is on spinning hard disks (HDD) with slower I/O performance. • Redshift has some future- proofing (like Spectrum and short query acceleration) that a modern data engineering approach might utilize. Short query acceleration uses machine learning to provide higher performance, faster results, and better predictability of query execution times. • Amazon Redshift is a fit for organizations needing a data warehouse with a clear, consistent pricing model. Amazon Web Services supports most of the databases in this report, and then some. Redshift is not the only analytic database on AWS, although sometimes this gets convoluted.
  • 35.
    Azure Synapse • AzureSQL Data Warehouse made its debut for public use in mid-2016. This is a managed service, dedicated data warehouse offering from the DATAllegro/PDW/APS legacy. Azure SQL Data Warehouse Gen 2, optimized for compute, is a massive parallel processing and shared nothing architecture on cluster nodes each running Azure SQL Database—which shares the same codebase as Microsoft SQL Server. • Azure SQL Data Warehouse supports 128 concurrent queries, a nice, high relative number. • Microsoft also has a deep partnership with Databricks, which is becoming very popular in the data science community. The partnership uses Azure Active Directory to log into the database. • Overall, Azure SQL Data Warehouse continues to be an excellent choice for companies needing a high-performance and scalable analytical database in the cloud or to augment the current, on- premises offering with a hybrid architecture at a reasonable cost.
  • 36.
    Cloudera Data WarehouseService • Cloudera Data Warehouse (CDW) boasts flexibility through support for both data center and multiple public cloud deployments, as well as capabilities across analytical, operational, data lake, data science, security, and governance needs. • CDW is part of CDP, a secure and governed cloud service platform that offers a broad set of enterprise data cloud services with the key data functionality for the modern enterprise. CDP was designed to address multi-faceted needs by offering multi-function data management and analytics to solve an enterprise’s most pressing data and analytic challenges in a streamlined fashion. • The architecture and deployment of CDP begins with the Management Console, where several important tasks are performed. First, the preferred cloud environment (for example, AWS or Azure) is set up. Second, data warehouse clusters and machine learning (ML) workspaces are launched. Third, additional services, such as Data Catalog, Workload Experience Manager, and Replication Manager are utilized, if required. • The Cloudera Data Warehouse service provides self-service independent virtual warehouses running on top of the data kept in a cloud object store, such as S3.
  • 37.
    Google BigQuery • GoogleBigQuery has the most distinctive approach to cloud analytic databases, with an ecosystem of products for data ingestion and manipulation and a unique pricing apparatus. • The back end is abstracted, BigQuery acts as a RESTful front end to all the Google Cloud storage needed, with all data replicated geographically and Google managing where queries execute. (The customer can choose the jurisdictions of their storage according to their safe harbor and cross-border restrictions.) • Pricing is by data and query, including Data Definition Language (DDL), or by flat rate pricing by “slot,” a unit of computational capacity required to execute SQL queries. This price model may make sense for high-data usage customers. Google also lowers the cost of unused storage. • Google Marketing Platform data (including the former DoubleClick), Salesforce.com, AccuWeather, Dow Jones, and 70+ other public data sets out there can be included in the BigQuery dataset • Billing is based on the amount of data you query and store. Customers can pre-purchase flat- rate computation “slots” or units in increments per month per 500 compute units. However, Google recently introduced Flex Slots, which allow slot reservations as short as one minute and billed by the hour. There is a separate charge for active storage of data.
  • 38.
    Micro Focus Vertica in Eon Mode • Vertica is owned by Micro Focus, which introduced the Vertica in Eon Mode deployment as the way to set up a Vertica cluster in the cloud. Vertica in Eon Mode is a fully ANSI SQL-compliant relational database management system that separates compute from storage. • Vertica is built on a Massively Parallel Processing (MPP), columnar-based architecture that scales and provides high-speed analytics. • Vertica offers two deployment modes – Vertica in Enterprise Mode and Vertica in Eon Mode. Vertica in Eon Mode uses a dedicated Amazon S3 bucket for storage, with a varying number of compute nodes spun up as necessary to meet the demands of the workloads. • Vertica in Eon Mode also allows the database to be turned “off” and back “on” without cluster disruption. Vertica in Eon Mode also has workload management, and its compute nodes can access ORC and Parquet formats in other S3 buckets.
  • 39.
    Snowflake Data Warehouse • Snowflake Computing was founded in 2012 as the first data warehouse purpose-built for the cloud. Snowflake has seen tremendous adoption, including international accounts and deployment on the Azure cloud. • Snowflake’s compute scales in full cluster increments with node counts in powers of two. Spinning up or down is instant and requires no manual intervention, resulting in leaner operations. Snowflake scales linearly with the cluster (i.e., for a four-node cluster, moving to the next incremental size will result in a four-node expansion). • Regarding billing, you pay per second only for the compute in use. • On Amazon AWS, Snowflake is architected to use Amazon S3 as its storage layer and has a native advantage of being able to access an S3 bucket within the COPY command syntax. On Microsoft Azure, it uses Azure Blob storage. • The UI is well regarded.
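A minimal sketch of the two billing facts above: full-cluster resizing steps in powers of two, and per-second compute billing. The rate is an assumed placeholder, not a Snowflake price:

```python
def next_cluster_size(nodes):
    # Node counts step in powers of two, so the next
    # incremental size doubles the current cluster.
    return nodes * 2

def compute_cost(nodes, seconds_running, rate_per_node_hour):
    # Per-second billing: pay only for the seconds the warehouse runs.
    # rate_per_node_hour is illustrative, not a quoted rate.
    return nodes * (seconds_running / 3600.0) * rate_per_node_hour

print(next_cluster_size(4))        # 8 (a four-node expansion)
print(compute_cost(4, 900, 2.0))   # 2.0 -> 4 nodes running for 15 minutes
```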
  • 40.
    Teradata Vantage • Teradata is available on Amazon Web Services, Teradata Cloud, VMware, Microsoft Azure, on-premises, and IntelliFlex – Teradata’s latest MPP architecture with separate storage and compute. • With Vantage, Teradata is still the gold standard for complex mixed-workload query situations, delivering enterprise-level, worry-free concurrency, predictable scaling, and consistently excellent performance with top-notch non-functional requirements. • Dynamic resource prioritization and workload management round out the platform.
  • 41.
    Understanding Pricing 1/2 • The price-performance metric is dollars per query-hour ($/query-hour). – This is defined as the normalized cost of running a workload. – It is calculated by multiplying the rate offered by the cloud platform vendor by the number of computation nodes used in the cluster, then dividing this amount by the aggregate total of the execution time. • To determine pricing, each platform has different options. Buyers should be aware of all their pricing options. • For Azure SQL Data Warehouse, you pay for compute resources as a function of time. – The hourly rate for SQL Data Warehouse varies slightly by region. – Also add the separate storage charge to store the data (compressed) at a rate of $ per TB per hour. • For Amazon Redshift, you also pay for compute resources (nodes) as a function of time. – Redshift also has reserved instance pricing, which can be substantially cheaper than on-demand pricing, is available with 1- or 3-year commitments, and is cheapest when paid in full upfront.
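The $/query-hour definition above translates directly into code. A hedged sketch with made-up example numbers, not any vendor's actual rate:

```python
def dollars_per_query_hour(hourly_rate, nodes, total_exec_hours):
    # The vendor's rate, multiplied by the number of computation
    # nodes in the cluster, divided by the aggregate execution time.
    return (hourly_rate * nodes) / total_exec_hours

# e.g., an assumed $2.00/node-hour rate on a 4-node cluster whose
# workload took 2 hours of aggregate execution time:
print(dollars_per_query_hour(2.0, 4, 2.0))  # 4.0
```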
  • 42.
    Understanding Pricing 2/2 • For Snowflake, you pay for compute resources as a function of time—just like SQL Data Warehouse and Redshift. – However, you choose the hourly rate based on certain enterprise features you need (“Standard”, “Premier”, “Enterprise”/multi-cluster, “Enterprise for Sensitive Data” and “Virtual Private Snowflake”). • With Google BigQuery, one option is to pay for bytes processed at $ per TB. – There’s also BigQuery flat-rate pricing. • Azure SQL Data Warehouse pricing was found at https://azure.microsoft.com/en-us/pricing/details/sql-data-warehouse/gen2/. • Amazon Redshift pricing was found at https://aws.amazon.com/redshift/pricing/. • Snowflake pricing was found at https://www.snowflake.com/pricing/. • Google BigQuery pricing was found at https://cloud.google.com/bigquery/pricing.
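Snowflake's edition-dependent hourly rate can be modeled as a simple lookup. The numbers below are invented placeholders that only illustrate the ordering of the tiers, not actual Snowflake prices:

```python
# Placeholder rates, ordered cheapest to most expensive tier.
EDITION_RATES = {
    "Standard": 2.00,
    "Premier": 2.25,
    "Enterprise": 3.00,
    "Enterprise for Sensitive Data": 4.00,
    "Virtual Private Snowflake": 6.00,
}

def hourly_rate(edition):
    # You choose the rate by choosing the feature tier you need.
    return EDITION_RATES[edition]

print(hourly_rate("Enterprise"))  # 3.0
```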
  • 43.
    Design Your Benchmark • What are you benchmarking? – Query performance – Load performance – Query performance with concurrency – Ease of use • Competition • Queries, Schema, Data • Scale • Cost • Query Cut-Off • Number of runs/cache • Number of nodes • Tuning allowed • Vendor Involvement • Any free third-party, SaaS, or on-demand software (e.g., Apigee or SQL Server) • Any not-free third-party, SaaS, or on-demand software • Instance type of nodes • Measure Price/Performance!
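Several of the benchmark dimensions above (number of runs, concurrency, caching) can be wired into a small harness. run_query here is a stand-in for whatever driver the platform under test provides; this is a sketch, not a complete benchmark kit:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_benchmark(run_query, queries, concurrency=1, runs=3):
    # Time the whole query set several times: separate runs expose
    # caching effects, and concurrency exercises mixed workloads.
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(run_query, queries))
        timings.append(time.perf_counter() - start)
    return timings  # one wall-clock duration per run

# Smoke test against a no-op "engine":
times = run_benchmark(lambda q: None, ["SELECT 1"] * 10, concurrency=4)
print(len(times))  # 3
```

Dividing the cluster's hourly cost by these timings yields the price/performance figure the slide asks you to measure.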
  • 44.
    Summary • Data professionals are sitting on the future of the organization • Data architecture is an essential organizational skill • Artificial intelligence will drive the organization of the future • All need a high-standard data warehouse • Cloud analytic databases are for most organizational workloads • Adopt a columnar orientation to data for analytic workloads • Data lakes are becoming essential • Use cloud storage or managed Hadoop for the data lake • Keep an eye on developments in information management and how they apply to your organization
  • 45.
    Comparing the Enterprise Analytic Solutions Presented by: William McKnight President, McKnight Consulting Group williammcknight www.mcknightcg.com (214) 514-1444