Comparison of MPP
Data Warehouse Platforms
David Portnoy
312.970.9740
http://LinkedIn.com/in/DavidPortnoy
© 2013-2014
What’s MPP in data warehousing?
MPP (massively parallel processing) data warehouse systems
are different from SMP (symmetric multiprocessing)
databases:
1. Shared-nothing architectures, with no single point of failure
and often hot-swappable components
2. Scale horizontally by adding nodes, rather than moving to a
server with more CPUs or higher storage capacity
3. Break a large query across nodes for simultaneous
processing
4. Capable of higher data ingestion rates through parallelized
data movement
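To make points 1 and 3 concrete, here is a minimal sketch in Python, not any vendor's implementation. It assumes a hypothetical 4-node cluster simulated with threads, rows distributed by a hash of a customer key, and made-up helper names (node_for, distribute, local_sum, parallel_sum). Each node aggregates only its own slice, and a coordinator merges the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_NODES = 4  # hypothetical cluster size, chosen only for illustration


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Pick the node that owns a row by hashing its distribution key."""
    return crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes: int = NUM_NODES):
    """Shared-nothing storage: every node keeps only its own slice of the rows."""
    nodes = [[] for _ in range(num_nodes)]
    for customer, amount in rows:
        nodes[node_for(customer, num_nodes)].append((customer, amount))
    return nodes


def local_sum(node_rows):
    """Work done independently on one node: partial SUM(amount) GROUP BY customer."""
    totals = defaultdict(int)
    for customer, amount in node_rows:
        totals[customer] += amount
    return dict(totals)


def parallel_sum(nodes):
    """Coordinator: run the same step on all nodes at once, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = {}
    for partial in partials:
        # Keys never overlap across nodes because rows were hashed by customer.
        merged.update(partial)
    return merged


rows = [("acme", 10), ("globex", 5), ("acme", 7), ("initech", 3), ("globex", 1)]
nodes = distribute(rows)
result = parallel_sum(nodes)
```

Because all rows for one customer land on the same node, the merge step needs no cross-node reconciliation. In this sketch, adding a node only changes NUM_NODES and redistributes the rows, which is the horizontal scaling described in point 2.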
Who are the players?
Previously, we discussed just the specialized MPP data warehouse vendors:
• Teradata
• Netezza
• Vertica
• Greenplum
…But we should keep in mind that most major database vendors also have
their own MPP products for data warehousing. Examples include:
• Microsoft PDW (Parallel Data Warehouse)
• DB2 UDB with Database Partitioning Feature (DPF)
• Oracle Big Data Appliance, which just provides a gateway between Hadoop and
Oracle’s SMP RDBMS platform
Finally, we need to consider the emergence of SQL-oriented, low-latency
Hadoop solutions. Examples include:
• Impala; Stinger; Apache Drill; Phoenix; Shark; Hadapt
• Teradata’s SQL-H (Aster Data); EMC’s HAWQ; IBM’s BigSQL
See related writeup:
http://www.slideshare.net/DavidPortnoy/hybrid-data-warehouse-hadoop-implementations
How do the architectures compare?
Looking at the specialized MPP data warehouse vendors
Hardware
  Teradata: Custom MPP, Shared Nothing
  Netezza: Custom MPP: SPU + FPGA logic
  Greenplum: Commodity hardware
  Vertica: Custom Hybrid MPP, Shared Everything
Type of processing
  Teradata: OLTP or OLAP, can handle high user load
  Netezza: OLAP, assumes few users for heavy analytics
  Greenplum: OLAP
  Vertica: OLAP optimized for large fact tables
Inception / Maturity
  Teradata: 1979, from Caltech
  Netezza: 2000, by Saxena & Hinshaw
  Greenplum: 2003, from Metapa & Didera
  Vertica: 2005, by MIT’s Stonebraker
Performance & maintenance
  Teradata: Auto-recommended optimization, columnar compression available
  Netezza: No need for performance tuning, must manually reclaim space
  Greenplum: Based on PostgreSQL, but optimized for MPP and enterprise maintenance
  Vertica: Column-oriented optimization for ingestion, storage/compression, and access
Hardware
  Teradata: Proprietary
  Netezza: Proprietary
  Greenplum: Commodity
  Vertica: Commodity
Definitions
* OLAP: Online Analytical Processing
* OLTP: Online Transaction Processing
The industry is moving towards open, commodity solutions
Traditional database servers, such as IBM DB2, Oracle Exadata and
Microsoft SQL Server, license proprietary software but run on
commodity hardware, although the SMP architecture typically
favors a few large, expensive servers.
But the biggest MPP data warehouse vendors all have proprietary
software. That’s despite the fact that Netezza and Vertica were built on the
open source PostgreSQL database. Teradata and Netezza even
implement custom hardware, which drives up the price.
Hadoop has open-sourced the software component, leading to a vibrant
ecosystem of tools and applications. And with built-in redundancy, it’s
easy to deploy on cheap commodity servers.
So the trend looks something like this
[Figure: chart positioning Traditional Database, MPP Data Warehouse and Hadoop** along two axes: Specialized Hardware vs. Commodity Hardware, and Open Source, Standardized Software vs. Proprietary Software.]
** While the up-front cost of Hadoop may be lower, the TCO (total cost of ownership)
could be much higher. This is due to the maturity of the product, the complexity of
solutions and the scarcity of talent.
So let’s look at the relative cost breakdown
Teradata
Hardware and licenses are the most expensive of all options. Staff costs can
be high, and it takes a great deal of effort to configure and administer.
IBM Netezza
Hardware and licenses used to cost much less than Teradata’s, but prices have been
converging. Staff costs are among the highest due to scarcity, but that’s tempered
by the lower effort to configure and administer a single-purpose appliance.
Greenplum
Commodity hardware. Moderately priced licenses. Few Greenplum specialists, but
can be staffed by PostgreSQL DBAs and developers.
Vertica
Commodity hardware. Moderately priced licenses, but special-purpose orientation
limits usefulness. Few specialists, but can be staffed by traditional DBAs and
developers.
Hadoop HBase
Commodity hardware and no license cost, resulting in the lowest up-front cost.
Likely to need more hardware for redundancy and load. Requires
highly technical staff, and implementation is less productive than with more mature
options.
[Figure: one bar per platform showing the relative cost split among hardware, licenses and development.]
What’s their relative adoption today?
Comparing the supply and demand for administrators and developers can
be a proxy for the strength and staying power of a platform.
Teradata has been around for many years longer than the alternatives and
still dominates the market in terms of install base (3 times the next rival) and
vibrant development community (6 times the next rival).
But in recent years Hadoop solutions have outstripped Teradata by a
significant margin. (Of course, it should be noted that Hadoop includes use
cases outside of traditional data warehousing.)
Over time, interest in market leader Teradata has been consistent, but flat
While Netezza, Vertica, and Greenplum have grown, they didn’t take significant
market share away from Teradata.
(The spike in Netezza interest is attributed to its acquisition by IBM.)
But when Hadoop is added into the mix, the picture changes drastically
Interest in Hadoop has quickly overtaken even traditional Teradata.
That might explain why Teradata has been on an acquisition spree for
Hadoop-related products and services, such as Aster Data.
The future of its next biggest rival, Netezza, is uncertain as it seeks its
niche within IBM’s product lineup.
Related Reading
Hybrid Data Warehouse-Hadoop Implementations:
http://www.slideshare.net/DavidPortnoy/hybrid-data-warehouse-hadoop-implementations
Agile Business Intelligence:
http://www.slideshare.net/DavidPortnoy/agile-bi-18491924
Blog:
http://david.portnoy.us
