N(ot)-o(nly)-(Ha)doop - the DAG showdown
2. Copyright © 2015, Intel Corporation. All rights reserved.
Legal Message
THE INFORMATION PROVIDED IN THIS PRESENTATION IS INTENDED TO BE GENERAL IN NATURE AND IS NOT
SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON
INTEL'S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS
WILL OBTAIN SIMILAR RESULTS
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN
THIS SUMMARY.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
3. Key Messages
Evolution of Hadoop to Big Data
Introduction to DAG
DAG runtimes
Evaluation
Performance
Completeness
Results
NODOOP is still not-only-Hadoop; the time for “no-MR” looks closer
4.
slow
fragmented
skills gap
block-oriented
data mutability
5. Hadoop to Big Data
Processing Model: Complex Event, Batch, In-Memory
Analytical Model: Machine Learning, Textual, Spatial, Aggregate, Temporal, Graph
Storage Model: Unstructured, Relational, Columnar, Hierarchic, Graph
Language Model: MR, SQL, NOSQL, JSQL, NOSPARQL
retrofitting Hadoop [unstructured batch analytics] to cater to the full big data demand
6. Map Reduce (MR) and Directed Acyclic Graph (DAG)
[Diagram: Stage - 1, Stage - 2, Stage - 3]
DAG: continuous dataflow, relational semantics, in-memory buffering
MR: sequential dataflow, MR semantics, on-disk storage
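The contrast between sequential, on-disk stages and a continuous, in-memory dataflow can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three stage functions, the sample records and the JSON intermediate files are hypothetical stand-ins, not code from any of the evaluated runtimes.

```python
import json
import os
import tempfile


def stage1(lines):
    # Stage 1: parse "key,value" text records.
    for line in lines:
        key, value = line.split(",")
        yield key, int(value)


def stage2(pairs):
    # Stage 2: keep only positive values.
    for key, value in pairs:
        if value > 0:
            yield key, value


def stage3(pairs):
    # Stage 3: aggregate values per key.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


def run_pipelined(lines):
    # DAG-like: records stream through all stages in memory.
    return stage3(stage2(stage1(lines)))


def run_materialized(lines, workdir):
    # MR-like: each stage finishes and writes its output to disk
    # before the next stage starts reading it.
    path1 = os.path.join(workdir, "stage1.json")
    with open(path1, "w") as f:
        json.dump(list(stage1(lines)), f)
    with open(path1) as f:
        pairs = [tuple(p) for p in json.load(f)]

    path2 = os.path.join(workdir, "stage2.json")
    with open(path2, "w") as f:
        json.dump(list(stage2(pairs)), f)
    with open(path2) as f:
        pairs = [tuple(p) for p in json.load(f)]

    return stage3(pairs)


lines = ["a,1", "b,-2", "a,3", "c,5"]
in_memory = run_pipelined(lines)
with tempfile.TemporaryDirectory() as workdir:
    on_disk = run_materialized(lines, workdir)
```

Both variants produce the same totals; the difference is only where the intermediate results live between stages.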
7. MR & DAG Runtimes
MR | DAG*
* Only a few products were chosen for evaluation.
Versions: Hadoop 2.5.0-cdh5.3.0, Hive 0.13.1-cdh5.3.0, presto-server-0.103, Apache Drill 0.9.0, impalad version 2.1.0-cdh5, Spark 1.3.1, HPCC 5.0.14.1
8. Completeness Criteria
On Disk failover
HDFS Compatibility
Yarn Integration
File formats
Expressive language
Streaming support
Connectivity
Web UI
Integrated Monitoring
Security
Hybrid Analytics
Seamless Dataframes
9. Completeness Scores
10. Performance Criteria
[Chart: Performance comparison. Processing time (seconds) for Hive, Impala, Spark, Drill, Presto and HPCC across six queries: full table scan, join fact dimension, aggregate function, join fact to fact, text analytics, log analytics.]
11. Hive
All queries completed successfully
A reliable baseline
[Chart: Hive processing time (seconds)]
Full table scan: 670.99
Join fact dimension: 640.37
Aggregate function: 1705.75
Join fact to fact: 983.73
Text analytics: 1298.56
Log analytics: 411.88
12. Spark
All queries completed successfully
Lack of window functions in Spark SQL makes moving-average analytics challenging
Mixed SQL & RDD programming
Not DAG!
~2x to 8x
[Chart: Spark processing time (seconds)]
Full table scan: 87.28
Join fact dimension: 192.88
Aggregate function: 669.09
Join fact to fact: 231.55
Text analytics: 132.05
Log analytics: 285
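A moving average is the kind of result a SQL window function returns over a sliding frame of rows (for example, an average over the current row and the two preceding ones). Without window functions, that logic has to be written by hand outside the SQL layer. A minimal sketch in plain Python, not Spark code; the function name and sample values are illustrative assumptions:

```python
from collections import deque


def moving_average(values, window):
    """Yield the mean of the most recent `window` values (fewer at the start)."""
    frame = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(frame) == window:
            # The oldest value is about to fall out of the frame.
            total -= frame[0]
        frame.append(v)
        total += v
        yield total / len(frame)


# Hypothetical per-minute hit counts from a web log.
hits_per_minute = [10, 20, 30, 40, 50]
averages = list(moving_average(hits_per_minute, window=3))
```

The running total keeps the work per row constant, which is roughly what an engine does internally when it evaluates a sliding frame.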
13. Impala
In-memory DAG
Table generating functions and array functions are not supported; text analytics example failed
~1x to 20x
[Chart: Impala processing time (seconds)]
Full table scan: 29.06
Join fact dimension: 72.45
Aggregate function: 222.98
Join fact to fact: 168.86
Text analytics: 0 (failed)
Log analytics: 747.64
14. Drill
In-memory DAG - No Resilience
Table generating functions not supported; text analytics example failed
Window functions are still beta/unsupported; log analytics failed
~5x to 50x
[Chart: Drill processing time (seconds)]
Full table scan: 126.99
Join fact dimension: 83.97
Aggregate function: 250.19
Join fact to fact: 15.13
Text analytics: 0 (failed)
Log analytics: 0 (failed)
15. Presto
In-memory DAG - No Resilience
Table generating functions not supported; text analytics example failed
~5x to 60x
[Chart: Presto processing time (seconds)]
Full table scan: 4.69
Join fact dimension: 67
Aggregate function: 491
Join fact to fact: 233.66
Text analytics: 0 (failed)
Log analytics: 89.66
16. HPCC
All queries completed successfully
On-disk DAG runtime; reliable, complete, performant
Declarative ECL language; not SQL
No native support for HDFS
~2x to 20x
[Chart: HPCC processing time (seconds)]
Full table scan: 39.43
Join fact dimension: 51.16
Aggregate function: 305.5
Join fact to fact: 10.43
Text analytics: 315.5
Log analytics: 206.1
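The "~Nx" ranges on the per-engine slides can be compared against the raw timings with a short script. The sketch below assumes the ranges are speedups relative to the Hive baseline and that a time of 0 marks a failed query; the timings are the chart values from the slides, while the function and variable names are illustrative. The computed ranges need not match the rounded figures on the slides exactly.

```python
QUERIES = [
    "full table scan",
    "join fact dimension",
    "aggregate function",
    "join fact to fact",
    "text analytics",
    "log analytics",
]

# Processing time in seconds per query, in the order of QUERIES.
# A value of 0 is treated as "query failed" (assumption based on the slide notes).
TIMES = {
    "Hive": [670.99, 640.37, 1705.75, 983.73, 1298.56, 411.88],
    "Spark": [87.28, 192.88, 669.09, 231.55, 132.05, 285],
    "Impala": [29.06, 72.45, 222.98, 168.86, 0, 747.64],
    "Drill": [126.99, 83.97, 250.19, 15.13, 0, 0],
    "Presto": [4.69, 67, 491, 233.66, 0, 89.66],
    "HPCC": [39.43, 51.16, 305.5, 10.43, 315.5, 206.1],
}


def speedups(engine, baseline="Hive"):
    """Return {query: baseline_time / engine_time}, or None where the query failed."""
    result = {}
    for query, base, t in zip(QUERIES, TIMES[baseline], TIMES[engine]):
        result[query] = round(base / t, 1) if t > 0 else None
    return result


for engine in TIMES:
    if engine == "Hive":
        continue
    ratios = [r for r in speedups(engine).values() if r is not None]
    print(f"{engine}: {min(ratios)}x to {max(ratios)}x "
          f"over {len(ratios)} completed queries")
```

A ratio below 1 means the engine was slower than Hive on that query.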
17. Findings
Problem Context: Big data use-cases stretch beyond unstructured batch jobs. Can DAG runtimes meet the demand and deliver the performance?
DAG runtimes are still maturing
Spark comes closest
18. NODOOP = Not only Hadoop
22. Benchmark Environment
Cloudera Enterprise 5.3.2
4 Node Cluster [1 master + 3 workers]
Memory 62.9 GiB in each node
Cores 16
TPCDS Database with Scale of 250
Queries used
Full Table Scan
Fact and Dimension Join
Aggregate functions
Fact to Fact Join
Text Analytics
Log Analytics
Versions
Hadoop 2.5.0-cdh5.3.0
Hive 0.13.1-cdh5.3.0
presto-server-0.103
Apache Drill 0.9.0
impalad version 2.1.0-cdh5
Spark 1.3.1
HPCC 5.0.14.1
Data Volume
TPCDS Scale of 250 - 19.3 GB
Store Sales - 18.8 GB
Customer - 300.3 MB
Text Analytics (twitter) - 436.6 MB (CIKM twitter dataset)
Log Analytics (weblog) - 5.0 GB (HPCC ECL WLAM sample)
23. Completeness Scores
To-disk failover        2  3  0  3  0  3  4
HDFS Compatibility      4  4  4  4  4  4  2
Yarn Integration        4  0  0  3  1  4  0
File formats            4  4  4  3  2  4  1
Expressive language     3  3  3  4  3  3  3
Streaming support       0  0  0  4  0  4  0
Connectivity            4  4  4  4  4  2  3
Web UI                  2  3  4  4  4  3  3
Integrated Monitoring   2  3  4  4  4  3  4
Security                3  3  1  1  1  1  1
Hybrid Analytics        3  2  1  4  1  3  4
Seamless Dataframes     1  1  1  4  1  4  2
Total                  32 30 26 42 25 38 25
* Score: 0 (min) to 4 (max)
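The totals row can be checked by summing each column of the score table. The sketch below uses the scores exactly as transcribed above; the variable names are illustrative. With these values, six of the seven columns reproduce the stated totals, while the seventh column sums to 27 against a stated 25, so one transcribed cell or that total may be off.

```python
# Scores per criterion, one entry per evaluated product column (as transcribed).
SCORES = {
    "To-disk failover":      [2, 3, 0, 3, 0, 3, 4],
    "HDFS Compatibility":    [4, 4, 4, 4, 4, 4, 2],
    "Yarn Integration":      [4, 0, 0, 3, 1, 4, 0],
    "File formats":          [4, 4, 4, 3, 2, 4, 1],
    "Expressive language":   [3, 3, 3, 4, 3, 3, 3],
    "Streaming support":     [0, 0, 0, 4, 0, 4, 0],
    "Connectivity":          [4, 4, 4, 4, 4, 2, 3],
    "Web UI":                [2, 3, 4, 4, 4, 3, 3],
    "Integrated Monitoring": [2, 3, 4, 4, 4, 3, 4],
    "Security":              [3, 3, 1, 1, 1, 1, 1],
    "Hybrid Analytics":      [3, 2, 1, 4, 1, 3, 4],
    "Seamless Dataframes":   [1, 1, 1, 4, 1, 4, 2],
}

# Totals row as it appears on the slide.
STATED_TOTALS = [32, 30, 26, 42, 25, 38, 25]

# Transpose the rows into columns and sum each column.
column_sums = [sum(column) for column in zip(*SCORES.values())]

# 1-based positions of columns whose recomputed sum differs from the slide.
mismatches = [
    position
    for position, (computed, stated) in enumerate(zip(column_sums, STATED_TOTALS), start=1)
    if computed != stated
]
```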
Editor's Notes
DAG Definition:
directed = the connections between the nodes (edges) have a direction: A -> B is not the same as B -> A
acyclic = "non-circular" - moving from node to node by following the edges, you will never encounter the same node for the second time.
graph = structure consisting of nodes, that are connected to each other with edges
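A directed tree is a special case of a directed acyclic graph; in general, a node in a DAG may have more than one incoming edge, as long as no path leads back to the node it started from.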
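The definition can be made concrete with a small check that orders the nodes of a directed graph so that every edge points forward, and reports a cycle when no such order exists. This is a minimal sketch using Kahn's algorithm; the function name and the example stage names are illustrative assumptions, not taken from any of the evaluated runtimes. The diamond-shaped example is a valid DAG even though one node has two incoming edges.

```python
from collections import deque


def topological_order(edges):
    """Return the nodes in an order where every edge goes from earlier to later.

    Raises ValueError if the directed graph contains a cycle.
    """
    successors = {}
    indegree = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
        successors.setdefault(dst, [])
        indegree.setdefault(src, 0)
        indegree[dst] = indegree.get(dst, 0) + 1

    # Start from the nodes that nothing points to.
    ready = deque(sorted(node for node, degree in indegree.items() if degree == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Nodes left unprocessed can only be part of (or behind) a cycle.
    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle")
    return order


# Hypothetical query plan: "join" depends on both "filter" and "aggregate".
diamond = [
    ("scan", "filter"),
    ("scan", "aggregate"),
    ("filter", "join"),
    ("aggregate", "join"),
]
order = topological_order(diamond)
```

Passing edges such as `[("a", "b"), ("b", "a")]` raises `ValueError`, since A -> B -> A revisits a node.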
In-memory to on-disk failover [DAG v MR]
Storage Compatibility [HDFS v proprietary]
Resource Management [Yarn v other]
File format compatibility [Parquet/Columnar, Avro/Row, JSON/Hierarchic, Textfile/Linear]
Expressive language [Declarative/Functional v Imperative]
Streaming + batch support [Temporal/Dimensional Partitioning v Tabular Scans]
Connectivity to other systems [ODBC vs WS, Virtual vs Physical]
Ease of use – Web UIs [IDE vs Putty, Dashboard vs File-based aggregation]
Execution, monitoring, debugging, logging [Centralized v Decentralized; Integrated with CM v Fragmented]
Security [Authentication with LDAP/AD, Authorization/ACLs with Sentry/Posix/Kerberos, Encryption on disk v wire, Key Management central v isolated]
Integrated graph, temporal, and statistical analytics [one framework vs Multiple Libraries]
Integrated Files, Tables, Datasets, DataFrame “views” – Ability to Share Results
https://cwiki.apache.org/confluence/display/DRILL/Release+Notes
Drill now features complete support for UNION ALL and COUNT(DISTINCT). Drill 0.8 also includes new functions such as unix_timestamp and the window functions sum, count and rank. Note that these window functions should be considered beta.