N(ot)-o(nly)-(Ha)doop - the DAG showdown
Intel Corporation
Joydeep Ghosh & Seshu Edala
June, 2015
Copyright © 2015, Intel Corporation. All rights reserved.
Legal Message
THE INFORMATION PROVIDED IN THIS PRESENTATION IS INTENDED TO BE GENERAL IN NATURE AND IS NOT
SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON
INTEL'S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS
WILL OBTAIN SIMILAR RESULTS.
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN
THIS SUMMARY.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Key Messages
• Evolution of Hadoop to Big Data
• Introduction to DAG
• DAG runtimes
• Evaluation
    • Performance
    • Completeness
• Results
Nodoop is still not-only-Hadoop; the time for “no-MR” looks closer.

• slow
• fragmented
• skills gap
• block-oriented
• data mutability

Hadoop to Big Data
Processing Model: Complex Event, Batch, In-Memory, Machine Learning
Analytical Model: Textual, Spatial, Aggregate, Temporal, Graph
Storage Model: Unstructured, Relational, Columnar, Hierarchic, Graph
Language Model: MR, SQL, NOSQL, JSQL, NOSPARQL
Retrofitting Hadoop [unstructured batch analytics] to cater to the full big data demand

Map Reduce (MR) and Directed Acyclic Graph (DAG)
[Diagram: a dataflow through Stage 1, Stage 2, and Stage 3]
DAG:
• continuous dataflow
• relational semantics
• in-memory buffering
MR:
• sequential dataflow
• MR semantics
• on-disk storage

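The staged dataflow on this slide can be sketched as a small dependency graph. This is only an illustration: the stage names are hypothetical and not taken from any of the evaluated runtimes, and it uses Python's standard `graphlib` module (Python 3.9 or later). The join stage has two parents, which is what makes the plan a DAG rather than a linear MR chain.

```python
from graphlib import TopologicalSorter

# Hypothetical query plan: each key is a stage and each value is the set of
# stages whose output it consumes. "join" has two parents, so the plan is a
# DAG but neither a tree nor a simple chain of stages.
plan = {
    "scan_fact": set(),
    "scan_dimension": set(),
    "join": {"scan_fact", "scan_dimension"},
    "aggregate": {"join"},
}


def execution_order(dag):
    """Return one valid order in which the stages can run.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    which is the "acyclic" requirement in practice.
    """
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
print(order)
```

Any order it returns places both scans before the join and the join before the aggregate; the two scans are independent, so a runtime is free to run them concurrently.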
MR & DAG Runtimes
Versions evaluated: Hadoop 2.5.0-cdh5.3.0, Hive 0.13.1-cdh5.3.0, presto-server-0.103, Apache Drill 0.9.0, impalad version 2.1.0-cdh5, Spark 1.3.1, HPCC 5.0.14.1
* Only a few DAG products were chosen for evaluation.

Completeness Criteria
• On-disk failover
• HDFS Compatibility
• YARN Integration
• File formats
• Expressive language
• Streaming support
• Connectivity
• Web UI
• Integrated Monitoring
• Security
• Hybrid Analytics
• Seamless Dataframes

Completeness Scores
[Score table; the values are listed under Backup, Completeness Scores.]

Performance Criteria
[Bar chart: Performance Comparison. Y axis: processing time (seconds), 0 to 1800. X axis: Full Table Scan, Join Fact Dimension, Aggregate Function, Join Fact to Fact, Text Analytics, Log Analytics. Series: Hive, Impala, Spark, Drill, Presto, HPCC.]

Hive
• All queries completed successfully
• A reliable baseline
Hive processing time (seconds):
  Full Table Scan        670.99
  Join Fact Dimension    640.37
  Aggregate Function    1705.75
  Join Fact to Fact      983.73
  Text Analytics        1298.56
  Log Analytics          411.88

Spark
• All queries completed successfully
• Lack of window functions in Spark SQL makes moving average analytics challenging
• Mixed SQL & RDD programming
• Not DAG!
• ~2x to 8x
Spark processing time (seconds):
  Full Table Scan         87.28
  Join Fact Dimension    192.88
  Aggregate Function     669.09
  Join Fact to Fact      231.55
  Text Analytics         132.05
  Log Analytics          285

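The moving average that the Spark slide calls challenging without window functions is the kind of computation a SQL window such as `AVG(x) OVER (ORDER BY t ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)` expresses. A minimal sketch of the same calculation in plain Python follows; the input series is hypothetical and this is not the benchmark's log analytics query.

```python
from collections import deque


def moving_average(values, window):
    """Yield the mean of the current value and up to window-1 preceding ones."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)
    total = 0.0
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # this value is evicted by the append below
        recent.append(value)
        total += value
        yield total / len(recent)


hits_per_minute = [10, 20, 30, 40]  # hypothetical weblog counts
averages = list(moving_average(hits_per_minute, window=2))
print(averages)
```

Without a window function, the engine-side equivalent usually needs a self-join or hand-written RDD code, which matches the "Mixed SQL & RDD programming" remark on the slide.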
Impala
• In-memory DAG
• Table generating functions and array functions are not supported; text analytics example failed
• ~1x to 20x
Impala processing time (seconds):
  Full Table Scan         29.06
  Join Fact Dimension     72.45
  Aggregate Function     222.98
  Join Fact to Fact      168.86
  Text Analytics           0 (query failed)
  Log Analytics          747.64

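A table-generating function turns one input row into several output rows; Hive's `explode()` used with `LATERAL VIEW` is a common example. The sketch below shows the idea in plain Python with hypothetical tweets. It assumes the text analytics query tokenizes text in this way, which the slides do not state.

```python
from collections import Counter


def explode_words(rows):
    """Turn each (tweet_id, text) row into one (tweet_id, word) row per word."""
    for tweet_id, text in rows:
        for word in text.lower().split():
            yield tweet_id, word


tweets = [(1, "DAG beats MR"), (2, "MR is slow")]  # hypothetical input rows
exploded = list(explode_words(tweets))
word_counts = Counter(word for _, word in exploded)
print(word_counts.most_common(1))
```

An engine without such a function has no way to fan a row out inside a query, so the tokenizing step has to happen outside the engine or the query fails.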
Drill
• In-memory DAG, no resilience
• Table generating functions not supported; text analytics example failed
• Window functions are still beta/unsupported; log analytics failed
• ~5x to 50x
Drill processing time (seconds):
  Full Table Scan        126.99
  Join Fact Dimension     83.97
  Aggregate Function     250.19
  Join Fact to Fact       15.13
  Text Analytics           0 (query failed)
  Log Analytics            0 (query failed)

Presto
• In-memory DAG, no resilience
• Table generating functions not supported; text analytics example failed
• ~5x to 60x
Presto processing time (seconds):
  Full Table Scan          4.69
  Join Fact Dimension     67
  Aggregate Function     491
  Join Fact to Fact      233.66
  Text Analytics           0 (query failed)
  Log Analytics           89.66

HPCC
• All queries completed successfully
• On-disk DAG runtime; reliable, complete, performant
• Declarative ECL language; not SQL
• No native support for HDFS
• ~2x to 20x
HPCC processing time (seconds):
  Full Table Scan         39.43
  Join Fact Dimension     51.16
  Aggregate Function     305.5
  Join Fact to Fact       10.43
  Text Analytics         315.5
  Log Analytics          206.1

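The "Nx" ranges on the per-engine slides read like speedups over the Hive baseline. The sketch below recomputes such ratios from the chart values, under two assumptions that the slides do not confirm: the six values on each slide follow the query order on the chart axis, and a value of 0 marks a failed query rather than a measurement. The computed ratios need not match the rounded ranges quoted on the slides, which may have been derived differently.

```python
QUERIES = [
    "Full Table Scan", "Join Fact Dimension", "Aggregate Function",
    "Join Fact to Fact", "Text Analytics", "Log Analytics",
]

# Processing time in seconds, as read from the per-engine charts.
# Assumption: 0 means the query failed on that engine.
TIMES = {
    "Hive":   [670.99, 640.37, 1705.75, 983.73, 1298.56, 411.88],
    "Spark":  [87.28, 192.88, 669.09, 231.55, 132.05, 285],
    "Impala": [29.06, 72.45, 222.98, 168.86, 0, 747.64],
    "Drill":  [126.99, 83.97, 250.19, 15.13, 0, 0],
    "Presto": [4.69, 67, 491, 233.66, 0, 89.66],
    "HPCC":   [39.43, 51.16, 305.5, 10.43, 315.5, 206.1],
}


def speedups_over_baseline(times, baseline="Hive"):
    """Return {engine: {query: baseline_time / engine_time, or None if failed}}."""
    base = times[baseline]
    result = {}
    for engine, seconds in times.items():
        if engine == baseline:
            continue
        result[engine] = {
            query: (b / s if s > 0 else None)
            for query, b, s in zip(QUERIES, base, seconds)
        }
    return result


speedups = speedups_over_baseline(TIMES)
for engine, ratios in speedups.items():
    done = [r for r in ratios.values() if r is not None]
    print(f"{engine}: {min(done):.1f}x to {max(done):.1f}x "
          f"over {len(done)} completed queries")
```

A ratio below 1 means the engine was slower than Hive on that query; failed queries are left out of the range rather than counted as zero.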
Problem Context
• Big data use cases stretch beyond unstructured batch jobs.
• Can DAG meet the demand and performance?
Findings
• DAG runtimes are still maturing
• Spark comes closest

NODOOP = Not only Hadoop
Questions
Backup
Benchmark Environment
• Cloudera Enterprise 5.3.2
• 4-node cluster [1 master + 3 workers]
• Memory: 62.9 GiB in each node
• Cores: 16
• TPC-DS database at scale 250
• Queries used
    • Full Table Scan
    • Fact and Dimension Join
    • Aggregate functions
    • Fact to Fact Join
    • Text Analytics
    • Log Analytics
Versions
• Hadoop 2.5.0-cdh5.3.0
• Hive 0.13.1-cdh5.3.0
• presto-server-0.103
• Apache Drill 0.9.0
• impalad version 2.1.0-cdh5
• Spark 1.3.1
• HPCC 5.0.14.1
Data Volume
• TPC-DS at scale 250: 19.3 GB
    • Store Sales: 18.8 GB
    • Customer: 300.3 MB
• Text Analytics (Twitter): 436.6 MB
    • CIKM Twitter dataset
• Log Analytics (weblog): 5.0 GB
    • HPCC ECL WLAM sample

Completeness Scores
To-disk failover        2   3   0   3   0   3   4
HDFS Compatibility      4   4   4   4   4   4   2
Yarn Integration        4   0   0   3   1   4   0
File formats            4   4   4   3   2   4   1
Expressive language     3   3   3   4   3   3   3
Streaming support       0   0   0   4   0   4   0
Connectivity            4   4   4   4   4   2   3
Web UI                  2   3   4   4   4   3   3
Integrated Monitoring   2   3   4   4   4   3   4
Security                3   3   1   1   1   1   1
Hybrid Analytics        3   2   1   4   1   3   4
Seamless Dataframes     1   1   1   4   1   4   2
Total                  32  30  26  42  25  38  25
* Score: 0 (min) to 4 (max)
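
The total row can be cross-checked by summing each column of the score table. The sketch below does that with the values as transcribed here; the column labels were not recoverable from the slide, so columns are referred to by position only. As transcribed, the seventh column sums to 27 while the total row reads 25, so one of those values may have been mis-transcribed.

```python
# Scores per criterion, one value per product column (column names unknown).
SCORES = {
    "To-disk failover":      [2, 3, 0, 3, 0, 3, 4],
    "HDFS Compatibility":    [4, 4, 4, 4, 4, 4, 2],
    "Yarn Integration":      [4, 0, 0, 3, 1, 4, 0],
    "File formats":          [4, 4, 4, 3, 2, 4, 1],
    "Expressive language":   [3, 3, 3, 4, 3, 3, 3],
    "Streaming support":     [0, 0, 0, 4, 0, 4, 0],
    "Connectivity":          [4, 4, 4, 4, 4, 2, 3],
    "Web UI":                [2, 3, 4, 4, 4, 3, 3],
    "Integrated Monitoring": [2, 3, 4, 4, 4, 3, 4],
    "Security":              [3, 3, 1, 1, 1, 1, 1],
    "Hybrid Analytics":      [3, 2, 1, 4, 1, 3, 4],
    "Seamless Dataframes":   [1, 1, 1, 4, 1, 4, 2],
}
STATED_TOTALS = [32, 30, 26, 42, 25, 38, 25]  # the slide's total row


def column_totals(scores):
    """Sum the scores column by column."""
    return [sum(column) for column in zip(*scores.values())]


totals = column_totals(SCORES)
# (column position, stated total, computed total) for every disagreement
mismatches = [
    (position, stated, computed)
    for position, (stated, computed)
    in enumerate(zip(STATED_TOTALS, totals), start=1)
    if stated != computed
]
print(totals)
print(mismatches)
```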

Editor's Notes

  • #7 DAG definition: directed = the connections between the nodes (edges) have a direction, so A -> B is not the same as B -> A; acyclic = "non-circular", meaning that when moving from node to node by following the edges you never encounter the same node a second time; graph = a structure consisting of nodes that are connected to each other by edges. A tree whose edges point away from the root is a DAG, but a DAG is more general: a node may have more than one parent.
  • #9
      • In-memory to on-disk failover [DAG vs MR]
      • Storage compatibility [HDFS vs proprietary]
      • Resource management [YARN vs other]
      • File format compatibility [Parquet/Columnar, Avro/Row, JSON/Hierarchic, Textfile/Linear]
      • Expressive language [Declarative/Functional vs Imperative]
      • Streaming + batch support [Temporal/Dimensional Partitioning vs Tabular Scans]
      • Connectivity to other systems [ODBC vs WS, Virtual vs Physical]
      • Ease of use, Web UIs [IDE vs PuTTY, Dashboard vs File-based aggregation]
      • Execution, monitoring, debugging, logging [Centralized vs Decentralized; Integrated with CM vs Fragmented]
      • Security [Authentication with LDAP/AD; Authorization/ACLs with Sentry/POSIX/Kerberos; Encryption on disk vs on the wire; Key management central vs isolated]
      • Integrated graph, temporal, and statistical analytics [one framework vs multiple libraries]
      • Integrated Files, Tables, Datasets, DataFrame “views”: ability to share results
  • #15 From the Drill release notes (https://cwiki.apache.org/confluence/display/DRILL/Release+Notes): Drill now features complete support for UNION ALL and COUNT(DISTINCT). Drill 0.8 also includes new functions such as unix_timestamp and the window functions sum, count and rank. Note that these window functions should be considered beta.