Criteria Greenplum Amazon Redshift Vectorwise
Description Analytic Database platform built on PostgreSQL. Full name is Pivotal
Greenplum Database
From <https://db-engines.com/en/system/Greenplum%3BIngres>
URL :-
https://pivotal.io/pivotal-greenplum
http://greenplum.org/
http://gpdb.docs.pivotal.io/43160/common/welcome.html
Large scale data warehouse service for use
with business intelligence tools
From <https://db-engines.com/en/system/Amazon+Redshift%3BGreenplum>
Actian Vector is a relational database engine
designed for high performance analytics. Actian
Vector was designed from the ground up to exploit
performance features in today’s x86 CPUs such as
vectorization and larger chip caches enabling in-chip
analytics. Actian Vector’s record breaking speed
delivers results faster than any of its competitors.
From <https://www.actian.com/analytic-database/vector-smp-analytic-database/>
URL :
https://www.actian.com/analytic-database/vector-smp-analytic-database/
Vendor Pivotal Software Inc. It is a division of Dell EMC
From <https://db-engines.com/en/system/Greenplum%3BIngres>
Amazon Actian Corporation
From <https://db-engines.com/en/system/Greenplum%3BIngres>
DB Engine
Ranking
Score: 11.41
Rank#35
Overall#22
Relational DBMS
From <https://db-engines.com/en/system/Greenplum%3BIngres>
PostgreSQL at #4
This ranking is not meaningful here because it is an apples-to-oranges
comparison: relational (OLTP) databases should not be ranked against
columnar (OLAP) databases.
Score : 13.04
Rank : 32
Overall : 20
Score: 0.66
Rank#151
Overall#76
Relational DBMS
From <https://db-engines.com/en/system/Actian+Vector>
Ingres is ranked #52 versus PostgreSQL at #4
Informix is also ranked quite low at #25
https://db-engines.com/en/ranking
IT Central
Station
Ranking
Ranked at No. 6; the top 5 are Oracle Exadata, Teradata, HPE
Vertica, Netezza and SAP IQ, which are all commercial
solutions
Actian offers Parcels which is rated 18th. We are not
using Parcels and can't comment on that. Actian
Vectorwise is not listed here
Release
History
Initial Release : 2005
Current Release : 4.3.11.1, January 2017
Beta Release : 5.x beta is out; it is expected to improve PostgreSQL
version support.
https://gpdb.docs.pivotal.io/500Beta/relnotes/GPDB_500_README.html#topic_yxx_bq2_lx
Vector 5.0, July 2016
From <https://db-engines.com/en/system/Actian+Vector>
Licensing Both Open Source and Licensed Version Available Cloud based, SaaS solution, based on usage Licensed Version Only
Technical
Support
including bug
fixing
support
Available with Licensed Version
Technical Documentation
gpdb.docs.pivotal.io
From <https://db-engines.com/en/system/Greenplum%3BIngres>
Available
Community Open Source, Greenplum community, PostgreSQL community, lots of
information on YouTube
Yes Lacks community, closed source, have to rely on technical support
Licensing
Cost
Free in the case of the open-source version (Apache License 2.0)
For a technical support subscription, an unofficial estimate is:
100 CPU cores: $1,000 per CPU core, per year
Commercial Commercial
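As a rough worked example of the unofficial Greenplum support estimate above, the yearly cost can be computed by multiplying the core count by the per-core rate. This is only a sketch: it assumes the quoted rate of 1000 $ per CPU core per year applies as a flat rate, and the function name is hypothetical, not vendor pricing logic.

```python
def annual_support_cost(cpu_cores: int, usd_per_core_per_year: int = 1000) -> int:
    """Estimated yearly subscription cost in USD, assuming a flat per-core rate."""
    if cpu_cores < 0:
        raise ValueError("cpu_cores must be non-negative")
    return cpu_cores * usd_per_core_per_year


if __name__ == "__main__":
    # 100 cores at the unofficial estimate of 1000 $ per core per year
    print(annual_support_cost(100))
```

Under that assumption, a 100-core cluster comes to 100,000 $ per year; an actual quote may differ.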
Architecture Columnar, Shared Nothing with MPP Support
Supported using EMC appliance as well as off the shelf suitable hardware
Columnar Columnar, MPP is supported, although it depends on our license; not
sure whether we have the license for this.
Hardware
and Setup
Could be implemented using commodity hardware, DCA appliance is also
available. Only enterprise MPP that can be run on commodity hardware.
Cloud deployment on Amazon and Microsoft is available
Cloud Setup only Linux based deployment
GitHub https://github.com/greenplum-db/gpdb Not available
Storage POLYMORPHIC DATA STORAGE AND EXECUTION
The table or partition storage, execution, and compression settings can be
configured to suit the way data is accessed. Users have the choice of row or
column-oriented storage and processing for any table or partition.
From <http://greenplum.org/>
Hadoop is supported but requires separately licensed product
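To make the Greenplum polymorphic storage idea above concrete, here is a minimal sketch that builds a CREATE TABLE statement in which one partition is row-oriented and another is column-oriented. The table name, column names and partition names are hypothetical, the statement is only generated as text and never executed, and the storage options (appendonly, orientation) should be checked against the documentation for the Greenplum version in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    name: str
    start: str        # range start date, ISO format
    end: str          # range end date, ISO format
    orientation: str  # "row" or "column"


def build_ddl(table: str, partitions: List[Partition]) -> str:
    """Return DDL text with a per-partition storage orientation."""
    parts = []
    for p in partitions:
        if p.orientation not in ("row", "column"):
            raise ValueError(f"unknown orientation: {p.orientation}")
        parts.append(
            f"  PARTITION {p.name} START (date '{p.start}') END (date '{p.end}') "
            f"WITH (appendonly=true, orientation={p.orientation})"
        )
    body = ",\n".join(parts)
    return (
        f"CREATE TABLE {table} (id int, sale_date date, amount numeric)\n"
        "DISTRIBUTED BY (id)\n"
        "PARTITION BY RANGE (sale_date)\n"
        f"(\n{body}\n);"
    )


if __name__ == "__main__":
    print(build_ddl("sales", [
        Partition("history", "2010-01-01", "2017-01-01", "column"),
        Partition("recent", "2017-01-01", "2018-01-01", "row"),
    ]))
```

The design choice illustrated is that older, scan-heavy data can sit in columnar partitions while recent, frequently updated data stays row-oriented within the same table.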
Replication
Methods
Master-Slave Yes Yes
Partitioning
Methods
Sharding Sharding
Compression Up to 30%
ACID
Compliance
Yes Yes Yes
Greenplum Versus Redshift and Actian Vectorwise
Comparison
Wednesday, August 23, 2017 10:39 AM
GreenPlum Page 1
Backup and
Recovery
Supports parallel and non-parallel backup and restore
https://gpdb.docs.pivotal.io/4350/admin_guide/managing/backup.html
Using Greenplum Appliance
EMC offers a hardware product, the Data Domain system, for backup and
recovery; similar solutions should be available from other vendors as well.
https://www.emc.com/collateral/hardware/white-papers/h8038-backup-recovery-greenplum-data-domain-wp.pdf
Using Commvault
http://documentation.commvault.com/commvault/v11/article?p=products/greenplum/t_greenplum_backup.htm
http://documentation.commvault.com/fujitsu/v11/article?p=products/greenplum/t_greenplum_restore_from_backup_job.htm
Veritas
https://gpdb.docs.pivotal.io/500Beta/admin_guide/managing/backup-veritas.html
Cloud based Replication approach is used, which is costly
Scalability Greenplum’s NewSQL MPP share-nothing RDBMS database designed for
multi-petabyte environments where a share-everything DBMS, like Oracle,
would die due to IO limitations to the SAN.
From <https://www.quora.com/Why-would-anyone-migrate-from-Oracle-to-the-Greenplum-database>
Cloud based options MPP
Big Data
• Supports data stored in HDFS, Hive, HBase, Avro, ProtoBuf,
  Delimited Text and Sequence Files.
• Solr/Lucene integration for multi-lingual full-text search embedded
  in the SQL.
• Row and/or Column-oriented data storage. It is the only database
  where a table can be polymorphic with both columnar and row-based
  partitions as defined by the DBA.
• Advanced Map-Reduce CBO Query Optimizer – queries can be run
  on over 1,000+ nodes.
• It has a dynamic distributed pipeline execution model for query
  processing. While older map-reduce databases rely on materialized
  execution Greenplum doesn't have to write data to disk with every
  intermediate query step. It streams data to the next stage of a query
  plan in memory, and never has to materialize the data to disk, so it's
  much faster than what anybody has demonstrated on Hadoop.
• Deep analytics – including data mining or machine learning algorithms
  using MADlib (think of it as R for MPP). Deep Semantic Textual Analytics
  using GPText.
• Graphical Analysis - billion edge distributed in-memory graph
  database and algorithms using GraphLab.
• Integration of SQL, Solr indexes, GPText, MADlib and GraphLab in
  a single query. Wow!
From <https://www.quora.com/Why-would-anyone-migrate-from-Oracle-to-the-Greenplum-database>
No Support, Map-reduce is not supported Not sure about data science support and integration in Actian.
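The difference between materialized and pipelined execution claimed for Greenplum above can be illustrated with a small single-process analogy. This is only a sketch using Python generators with a made-up (id, amount) row schema; it does not reflect how Greenplum implements its executor, where data is streamed between segments rather than between functions.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, float]  # (id, amount); a made-up schema for illustration


def materialized(rows: Iterable[Row], min_amount: float) -> List[float]:
    """Each step finishes and stores its full output before the next begins."""
    scanned = list(rows)                                    # intermediate result 1
    filtered = [r for r in scanned if r[1] >= min_amount]   # intermediate result 2
    return [amount for _, amount in filtered]


def scan(rows: Iterable[Row]) -> Iterator[Row]:
    for row in rows:
        yield row


def keep_large(rows: Iterable[Row], min_amount: float) -> Iterator[Row]:
    for row in rows:
        if row[1] >= min_amount:
            yield row


def amounts(rows: Iterable[Row]) -> Iterator[float]:
    for _, amount in rows:
        yield amount


def pipelined(rows: Iterable[Row], min_amount: float) -> Iterator[float]:
    """Stages are chained; each row flows through every stage one at a time."""
    return amounts(keep_large(scan(rows), min_amount))


if __name__ == "__main__":
    data = [(1, 5.0), (2, 50.0), (3, 500.0)]
    print(materialized(data, 10.0))
    print(list(pipelined(data, 10.0)))
```

Both variants return the same rows; the pipelined one never holds a full intermediate result, which is the property the quoted description attributes to Greenplum's query plans.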
Data Loading
• Distributed ETL rate of 16 TB/hr without using the master node
• Integration with Talend available
• Data loading component for PDI is also available
• GPLoad is the data loading component
• Compatible with open source and commercial ETL tools
AWS tools are available, compatible with
various open source and commercial ETL
tools
VWload is available for bulk loading
Integration
with SpagoBI
SpagoBI supports PostgreSQL, but testing needs to be done with
Greenplum. SpagoBI is not officially certified with Greenplum. A POC
will be done focusing on this area.
Should be possible through JDBC, not tested
Integration with SpagoBI 5.2 has been tested. Basic features such as
dashboards, reports and ad hoc analysis are working.
Currently there are issues with creating QBE (business models)
Known
Issues
Greenplum may have problems with high concurrency & volatility, but it
would be silly to have high concurrency in the PB range. In the < 100 TB
range it becomes a question of whether you need high concurrency (Oracle)
or Data Science like analytics (Greenplum).
From <https://www.quora.com/Why-would-anyone-migrate-from-Oracle-to-the-Greenplum-database>
Note: This is an opinion and, to my knowledge, may not be valid.
VMware Case Study (108 TB with 6000 users, 300 of them concurrent)
Redshift is still very limited in terms of
SQL functionalities that it offers. You
can't have procedures, functions,
triggers, CTE etc. in Redshift. So you
have to follow an ETL approach for
your data warehouses in most of the
cases, even though ELT might suit you
better.
Another major limitation with Redshift is
the number of concurrent queries it can
run: 15. Yes, that's right. Redshift can
only run 15 concurrent queries as of
now.
For a big datawarehouse with ETL
processes, admins, users, dashboards on
top of that this number is ridiculously low. I
hope they do something about this sooner
than later.
From <https://www.quora.com/What-features-does-Amazon-Redshift-fail-to-offer-compared-to-higher-priced-alternatives-like-Teradata-How-likely-are-customers-to-switch>
1). Product was discontinued and started again
2). Had issues with large, complex queries crashing in version 3.x, issues
were resolved in version 4.x
Customers BC Hydro, China Railway, TCS Bank Russia, Well Care, VMware Case
Study (108 TB with 6000 users, 300 of them concurrent)
Performance Petabyte scale data warehouse solution
Some customers have claimed that Actian Vector performs better than Greenplum:
https://pavanskumar.wordpress.com/2015/04/23/actian-vector-migration-from-greenplum/
Gartner Rated as Visionary (2015-16) and Niche Player (2017). Gartner praises
Greenplum for built-in data science support. This would be a major
advantage in the development of advanced payment analytics features.
Rated as Leader (I wonder why?)
Rated as Visionary (2015-16); Actian did not make it into the Gartner Magic
Quadrant in 2017
In-memory
Grid
GemFire; a key application is fraud detection, which requires real-time
transaction data, as seen here:
https://content.pivotal.io/blog/big-data-meets-fast-data-to-fight-fraud-and-more
Hadoop HAWQ provides the most robust SQL interface for Hadoop and can tackle
data exploration and transformation in HDFS.
HAWQ is a parallel SQL query engine that combines the key
technological advantages of the industry-leading Pivotal Analytic
Database with the scalability and convenience of Hadoop. HAWQ
reads data from and writes data to HDFS natively. HAWQ delivers
industry-leading performance and linear scalability. It provides users
the tools to confidently and successfully interact with petabyte range
data sets. HAWQ provides users with a complete, standards
compliant SQL interface.
By using the proven parallel database technology of the Pivotal
Analytic Database, HAWQ has been shown to be consistently tens to
hundreds of times faster than all Hadoop query engines in the market
today.
Pivotal HAWQ is a Massively Parallel Processing (MPP) database using
several Postgres database instances and HDFS storage. Think of your regular
MPP databases like Teradata/Greenplum/Netezza but instead of using local
storage it uses HDFS to store datafiles. Each of the processing nodes still has
its own CPU/memory and storage.
References
From <https://dwarehouse.wordpress.com/2014/03/14/pivotal-hawq-mpp-database-on-hdfs/>
From <https://www.quora.com/What-is-HAWQ>
From <https://www.pivotalguru.com/?p=642>
They do have Hadoop support, but it is not clear whether Hadoop
connectivity and support are part of the existing product license.
Likewise, it seems Hadoop support will carry an extra cost.
VectorH seems to be a separate product:
https://www.actian.com/analytic-database/vectorh-sql-hadoop/
Management
Tools
Pivotal Command Center
And Workload Manager
Cloud based
Figure : http://www.cmswire.com/cms/analytics/a-look-at-gartners-data-management-analytics-leaders-028772.php
https://blogs.technet.microsoft.com/dataplatforminsider/2017/03/07/gartner-names-microsoft-a-leader-in-the-magic-quadrant-for-data-management-solutions-for-analytics-dmsa/
Reference : https://dwarehouse.wordpress.com/2014/03/14/pivotal-hawq-mpp-database-on-hdfs/
Key Components of Greenplum
Comparison with Other MPP's