Hadoop comparative scorecard nick kabra sr mgmt 04042014 and stack integration demo
Metrics | Weightage | Sub-Metrics | Criteria | Sub-Weightage | CDH | HW/HDP | MapR | Pivotal HD
Scalability /Fault tolerance Yes Yes Yes Yes
Multi-tenancy- Resource Pooling
1 - groups and resource pooling without YARN
2 - groups and resource pooling and YARN
3 - groups and resource pooling and YARN (significant
contributor) / groups and resource pooling and YARN
+other prop 3% 2 3 1 3
Open source Hadoop based on
products introduced
1 - 1 product introduced
2 - 2-3 products introduced
3 - >=4 products introduced 7% 2 3 0 3
Closed source products built or Closed
source products made open including
portability
1 - 1 product
2 - 2-3 products
3 - >=4 products 2% 3 2 3 3
Cloud based products introduced
1- 1-2 products introduced
2 - >2 products introduced
3 - hadoop integration products+other prop products
introduced 1% 1 1 2 3
No. of committer seats including PMC
1 - 0-25 committers,
2 - >25 and <=50 committers,
3 - >50 and 25+PMC committers 3% 3 3 1 2
Support and training provided
1 - OK
2 - good support and training
3 - Excellent support and training 3% 3 2 3 2
Revisions after release
0 - Multiple even after GA
2- Makes the product available only after suitable testing 3% 0 2 2 2
SQL Focus : Open source /Closed
source
0 - Closed source
2 - Open source 3% 0 2 2 2
0.45 0.56 0.29 0.57
Data management -
data lifecycle management, data
replication between HDFS and Hive,
governance, lineage, traceability and
data discovery, process coordination
and scheduling, leveraging existing
products like Oozie and Zookeeper
100% open source framework.
Allow other plug ins.
workflow orchestration /automation (using Oozie
underneath).
Dataset replication.
Dataset retention.
Hive /Hcat integration.
Dashboard /entity viewing.
Integration with system management tool. 2% 2 2 0 1
Data Ingestion - Tools offered etc
1 - Sqoop, Sqoop2 and Flume
2 - Additional 2% 2 2 2 1
Data storage - own, with other systems
1 - HDFS
2 - HDFS and others/prop 2% 1 1 2 2
Realtime Data or OLTP - using Storm,
Spark, or GemFire, SQLFire
1 - Not sure
2 -Spark or Storm or Prop 2% 2 2 1 2
Streaming Data like Spark Streaming,
Storm
1 - Not sure
2- Spark
3 - Spark+storm 1% 2 3 1 3
Workload Management via Oozie,
Hawq or other tools
1 - Only oozie or only HAWQ
2 - Oozie+integration 3% 1 2 1 1
Data Frameworks working together and
contribution eg: Datastax, Databricks,
MS REEF
1 - very few or through few partnerships
2 - Multiple 1% 2 2 1 1
Data Analytics like Acunu, Rev R,
0 - only tieups
2 - tieups+prop 3% 0 0 0 2
Search - Integration with Search Tool
etc
1 - Prop or external
2 - Prop+external 3% 2 1 1 1
Batch Data Processing-MapReduce and
YARN
1 - Own MR
2 - Only MR+YARN
3 - MapReduce innovation and YARN+Tez or MR
innovation+YARN 5% 3 3 1 3
Multi-cluster management using prop
tools built
1 - good
2 - better
3 - best 2% 3 1 1 2
Monitoring and Managing cluster - like
Cloudera manager, Ambari, Command
Center
1 - Closed source /proprietary
2 - Open sourced
3 - Open sourced and better monitoring product / Closed
source and better monitoring 7% 3 3 2 2
Backup and Recovery / DR: Availability
and replication
1 - Restart required
2 - Autorecovery of nodes or XDR
3 - Autorecovery and XDR 5% 2 2 3 2
CBO on SQL product (cost based
optimizer)
0 - No or not in current version
1 - Yes 2% 0 1 0 1
Security: Data security - Internal
1 - Not sure or None
2 - Good or prop
3 - Better 3% 3 3 2 1
Security: Access/Authentication
Security
External Security:
0 - Not sure and only Kerberos, LDAP, AD
1 - Tie-ups with vendors = Kerberos, LDAP, AD 3% 1 1 1 1
Security: System management
1- Good and prop
2- Better and prop 1% 2 1 1 2
Security: Data governance and audit
1 - Not sure
2 - Good and prop
3 - Better and prop 3% 2 3 2 1
0.99 1.00 0.70 0.84
NoSQL vendors like Cassandra, Redis,
1 - <3 or not sure
2 - Prop
3 - >=3 2% 3 3 1 3
Document DBs like MongoDB,
CouchBase
1 - <3 or not sure
2 - few
3 - >=3 2% 3 3 1 3
Graphical DBs like GraphX, InfiniDB,
Giraph
1 - <3 or not sure
2 - Prop
3 - >=3 1% 3 3 1 2
In-memory DBs like GridGain, HANA
1 - not sure
2 - no specific integration
3 - prop and specific integration 4% 2 2 1 3
MPP Databases like Greenplum,
Vertica, Netezza
1 - not sure
2 - integrates with others
3 - Prop 5% 2 2 1 3
Analytics Databases like Marklogic
1 - <3 or not sure
2 - Prop
3 - >=3 3% 1 1 1 2
Messaging tech. like Kafka, Trident, Kinesis, Spark Streaming.
BI tools like Cognos, Business Objects.
ETL tools like Syncsort, Talend.
Data visualization, dashboard and reporting tech like Tableau, Datameer, Ayasdi.
Analytical products/libraries like R, SAS, Weka.
Data security like Protegrity, Dataguise, Vormetric.
Configuration management like Chef, Puppet (for cluster and XDR replication) etc.
Search tools like Solr, ElasticSearch.
RDBMS and other integration like Oracle, DB2, etc.
List of connectors, drivers, API.
1 - integrates with fewer technologies
2 - prop and integrates with few other technologies where
prop option is not there
3 - integrates with most better known technologies 8% 3 3 1 3
0.60 0.60 0.25 0.71
Cost and Licensing Policy +
Relationship we have
Not included, to remove bias on price/relationship, so all
are 0. 0% 0 0 0 0
TOTAL 100% 100% 2.04 2.16 1.24 2.04
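The subtotal rows (for example 0.60 0.60 0.25 0.71) appear to be the sum of each row's sub-weightage multiplied by the vendor's score. A minimal sketch, assuming that reading, using the seven integration rows directly above; the `rows` layout and the `weighted_subtotal` helper are illustrative names, not part of the original scorecard:

```python
# Each row: (sub-weightage in percent, scores for CDH, HW/HDP, MapR, Pivotal HD).
# Values are copied from the seven integration rows of the scorecard.
VENDORS = ["CDH", "HW/HDP", "MapR", "Pivotal HD"]

rows = [
    (2, (3, 3, 1, 3)),  # NoSQL vendors
    (2, (3, 3, 1, 3)),  # Document DBs
    (1, (3, 3, 1, 2)),  # Graphical DBs
    (4, (2, 2, 1, 3)),  # In-memory DBs
    (5, (2, 2, 1, 3)),  # MPP databases
    (3, (1, 1, 1, 2)),  # Analytics databases
    (8, (3, 3, 1, 3)),  # Messaging, BI, ETL, connectors, etc.
]


def weighted_subtotal(rows):
    """Sum weight * score per vendor; weights are percentages, so divide by 100."""
    totals = [0] * len(VENDORS)
    for weight_pct, scores in rows:
        for i, score in enumerate(scores):
            totals[i] += weight_pct * score
    return {vendor: total / 100 for vendor, total in zip(VENDORS, totals)}


subtotals = weighted_subtotal(rows)
for vendor, value in subtotals.items():
    print(f"{vendor}: {value:.2f}")
```

Under this assumption the sketch prints 0.60, 0.60, 0.25 and 0.71, matching the subtotal row for this group.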
Industry Speak / Industry Norm
Our take: No one size fits all.
HADOOP framework, feature set comparison and Performance
Architectural philosophy / open source / proprietary: 25%
The industry norm is to run two implementations, e.g. Cloudera and Hortonworks, Hortonworks and Pivotal, or Cloudera and Pivotal, chosen according to requirements. This also reduces dependency on any one vendor and avoids being tied to one set of technologies.
Since we are looking at the entire stack/suite of products: Pivotal has a suite of products/technologies in its data lake: Pivotal CommandCenter, Cloud Foundry, GPDB, HAWQ, MADlib, SQLFire, GemFire, GemFire XD, Spring support, HAMSTER. Pivotal adheres to open-source Hadoop and has added CommandCenter and features around the Hadoop ecosystem. It did not have Hadoop committers before but has recently hired numerous professionals in this area.
Cloudera is becoming more and more closed source, as it has introduced EDH and Impala. Hortonworks believes in the open-source philosophy, which is great. Having spoken with Cloudera and Hortonworks executives, the open question is the vision and roadmap, i.e. the go-forward strategy: can they move beyond building a wrapper around Hadoop? Cloudera and Hortonworks do not currently have the deep pockets or capability to go beyond Hadoop.
MapR offers tremendous advantages since it bypasses MapReduce and hits the proprietary MapR engine (auto-node feature), but new features take one or two months to be incorporated since it is closed and proprietary. Also, supporting legacy versions can be a challenge with Cloudera and MapR where customization is done.
Hadoop framework, feature set comparison, performance and management: 50%
Integration with other technologies or proprietary technologies provided, connectors, and partnership / vendor strategic relationship: 25%
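The three group weightings (25%, 50%, 25%) look like the sums of the sub-weightages within each group of the scorecard. A quick check under that assumption follows. Matching row groups to the three headings by their order in the scorecard is inferred rather than stated in the source, and the shortened group names and variable names are illustrative:

```python
# Sub-weightages (percent) copied from the scorecard rows, grouped by the
# subtotal rows that separate them. Matching each group to a heading by
# order is an assumption.
sub_weights = {
    "Architectural philosophy": [3, 7, 2, 1, 3, 3, 3, 3],
    "Hadoop framework and management": [
        2, 2, 2, 2, 1, 3, 1, 3, 3, 5, 2, 7, 5, 2, 3, 3, 1, 3,
    ],
    # The trailing 0 is the excluded cost/licensing row, listed only for completeness.
    "Integration and partnerships": [2, 2, 1, 4, 5, 3, 8, 0],
}

group_weights = {name: sum(weights) for name, weights in sub_weights.items()}
total_weight = sum(group_weights.values())

for name, weight in group_weights.items():
    print(f"{name}: {weight}%")
print(f"Total: {total_weight}%")
```

With the values as transcribed, the groups sum to 25, 50 and 25, for a total of 100.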