N(ot)-o(nly)-(Ha)doop - the DAG showdown

Intel Corporation
Joydeep Ghosh & Seshu Edala
June, 2015

Copyright © 2015, Intel Corporation. All rights reserved.
Legal Message
THE INFORMATION PROVIDED IN THIS PRESENTATION IS INTENDED TO BE GENERAL IN NATURE AND IS NOT
SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON
INTEL'S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS
WILL OBTAIN SIMILAR RESULTS.
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN
THIS SUMMARY.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.

Key Messages
• Evolution of Hadoop to Big Data
• Introduction to DAG
• DAG runtimes
• Evaluation
• Performance
• Completeness
• Results
NODOOP still means not-only-Hadoop, but the time for "no-MR" looks closer.

slow
fragmented
skills gap
block-oriented
data mutability

Hadoop to Big Data
Processing Model: Complex Event, Batch, In-Memory
Analytical Model: Machine Learning, Textual, Spatial, Aggregate, Temporal, Graph
Storage Model: Unstructured, Relational, Columnar, Hierarchic, Graph
Language Model: MR, SQL, NOSQL, JSQL, NOSPARQL
Retrofitting Hadoop [unstructured batch analytics] to cater to the full big data demand

Map Reduce (MR) and Directed Acyclic Graph (DAG)
[Figure: dataflow through Stage 1, Stage 2, Stage 3]
DAG
• continuous dataflow
• relational semantics
• in-memory buffering
MR
• sequential dataflow
• MR semantics
• on-disk storage
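The contrast above can be sketched in plain Python. This is only an illustration under stated assumptions, not how any of the evaluated runtimes is implemented: the three stage functions, the `spill_log` list, and the use of `list(...)` as a stand-in for writing stage output to disk are all hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Stage = Callable[[Iterable[int]], Iterator[int]]


def stage_1(rows: Iterable[int]) -> Iterator[int]:
    """Hypothetical stage: double every record."""
    for r in rows:
        yield r * 2


def stage_2(rows: Iterable[int]) -> Iterator[int]:
    """Hypothetical stage: keep records divisible by 3."""
    for r in rows:
        if r % 3 == 0:
            yield r


def stage_3(rows: Iterable[int]) -> Iterator[int]:
    """Hypothetical stage: add one to every record."""
    for r in rows:
        yield r + 1


STAGES: List[Stage] = [stage_1, stage_2, stage_3]


def run_mr_style(rows: Iterable[int], stages: List[Stage],
                 spill_log: List[Tuple[int, int]]) -> List[int]:
    """Sequential dataflow: every stage finishes and materializes its
    full output (list() stands in for on-disk storage) before the next starts."""
    current = list(rows)
    for number, stage in enumerate(stages, start=1):
        current = list(stage(current))
        spill_log.append((number, len(current)))  # records "written" per stage
    return current


def run_dag_style(rows: Iterable[int], stages: List[Stage]) -> Iterator[int]:
    """Continuous dataflow: stages are chained lazily, so each record
    streams through all stages without an intermediate materialization."""
    stream: Iterable[int] = iter(rows)
    for stage in stages:
        stream = stage(stream)
    return iter(stream)
```

Both runners produce the same records; the difference is that the MR-style runner holds a complete intermediate result after every stage, while the DAG-style runner only buffers what is in flight.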

MR & DAG Runtimes
[Figure: evaluated runtimes grouped under MR and DAG*]
* Only a few products were chosen for evaluation.
Versions evaluated: Hadoop 2.5.0-cdh5.3.0, Hive 0.13.1-cdh5.3.0, presto-server-0.103, Apache Drill 0.9.0, impalad version 2.1.0-cdh5, Spark 1.3.1, HPCC 5.0.14.1
Note: Other names and brands may be claimed as the property of others.

Completeness Criteria
• On Disk failover
• HDFS Compatibility
• Yarn Integration
• File formats
• Expressive language
• Streaming support
• Connectivity
• Web UI
• Integrated Monitoring
• Security
• Hybrid Analytics
• Seamless Dataframes

Completeness Scores

Performance Criteria
[Chart: Performance Comparison. Processing time in seconds (axis 0 to 1800) for Hive, Impala, Spark, Drill, Presto and HPCC on six queries: full table scan, join fact dimension, aggregate function, join fact to fact, text analytics, log analytics.]

• All queries completed successfully
• A reliable baseline
Hive processing time (seconds)
Full table scan          670.99
Join fact dimension      640.37
Aggregate function      1705.75
Join fact to fact        983.73
Text analytics          1298.56
Log analytics            411.88

• Lack of window functions in Spark SQL makes moving average analytics challenging
• Mixed SQL & RDD programming
• Not DAG!
• ~2x to 8x
Spark processing time (seconds)
Full table scan           87.28
Join fact dimension      192.88
Aggregate function       669.09
Join fact to fact        231.55
Text analytics           132.05
Log analytics            285
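The moving-average point can be made concrete with a small sketch. In SQL, a trailing moving average is normally a single window expression; without window functions the same logic has to be written by hand in procedural code. The function below is plain Python, not Spark API, and its name and the sample values are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def trailing_moving_average(values: Iterable[float], window: int) -> Iterator[float]:
    """Yield the average of the current value and up to window - 1
    preceding values, in input order."""
    if window < 1:
        raise ValueError("window must be at least 1")
    buf: Deque[float] = deque(maxlen=window)
    total = 0.0
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # the oldest value is about to be evicted
        buf.append(v)
        total += v
        yield total / len(buf)
```

For example, `list(trailing_moving_average([10, 20, 30, 40], 2))` averages each value with its predecessor. In a distributed setting the input would also have to be partitioned and ordered first, which is part of what a window function handles declaratively.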

• In-memory DAG
• Table generating functions and array functions are not supported; text analytics example failed
• ~1x to 20x
Impala processing time (seconds)
Full table scan           29.06
Join fact dimension       72.45
Aggregate function       222.98
Join fact to fact        168.86
Text analytics             0 (failed)
Log analytics            747.64
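A table-generating function turns one input row into several output rows; Hive's `explode()` is a typical example. The sketch below shows that behaviour in plain Python for a word-count style text query. It is an assumption about the general shape of such a query: the slides do not show the benchmark's actual text analytics query, and the function names and sample rows are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def explode_words(rows: Iterable[Tuple[int, str]]) -> Iterator[Tuple[int, str]]:
    """Emit one (row_id, word) output row per word of each input row,
    which is what a table-generating function does with an array column."""
    for row_id, text in rows:
        for word in text.lower().split():
            yield row_id, word


def word_counts(rows: Iterable[Tuple[int, str]]) -> Dict[str, int]:
    """Aggregate over the exploded rows, as a GROUP BY would."""
    counts: Dict[str, int] = {}
    for _, word in explode_words(rows):
        counts[word] = counts.get(word, 0) + 1
    return counts
```

An engine without table-generating or array functions has no direct way to express the first step, which is consistent with the text analytics failures reported for several runtimes here.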

• In-memory DAG, no resilience
• Table generating functions not supported; text analytics example failed
• Window functions are still beta/unsupported; log analytics failed
• ~5x to 50x
Drill processing time (seconds)
Full table scan          126.99
Join fact dimension       83.97
Aggregate function       250.19
Join fact to fact         15.13
Text analytics             0 (failed)
Log analytics              0 (failed)

• In-memory DAG, no resilience
• Table generating functions not supported; text analytics example failed
• ~5x to 60x
Presto processing time (seconds)
Full table scan            4.69
Join fact dimension       67
Aggregate function       491
Join fact to fact        233.66
Text analytics             0 (failed)
Log analytics             89.66

• On-disk DAG runtime; reliable, complete, performant
• Declarative ECL language; not SQL
• No native support for HDFS
• ~2x to 20x
HPCC processing time (seconds)
Full table scan           39.43
Join fact dimension       51.16
Aggregate function       305.5
Join fact to fact         10.43
Text analytics           315.5
Log analytics            206.1
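The per-engine "Nx" figures can be related to the raw timings with a short calculation. The sketch below assumes two things the slides do not state explicitly: that each "Nx" range is Hive's time divided by the engine's time for the same query, and that the values on each slide are listed in the same order as the query labels. A time of 0 is treated as a failed query, as the slide notes indicate. The function name is hypothetical.

```python
from typing import Dict, List, Optional

QUERIES = [
    "Full table scan", "Join fact dimension", "Aggregate function",
    "Join fact to fact", "Text analytics", "Log analytics",
]

# Processing times in seconds, copied from the per-engine slides.
TIMES: Dict[str, List[float]] = {
    "Hive":   [670.99, 640.37, 1705.75, 983.73, 1298.56, 411.88],
    "Spark":  [87.28, 192.88, 669.09, 231.55, 132.05, 285.0],
    "Impala": [29.06, 72.45, 222.98, 168.86, 0.0, 747.64],
    "Drill":  [126.99, 83.97, 250.19, 15.13, 0.0, 0.0],
    "Presto": [4.69, 67.0, 491.0, 233.66, 0.0, 89.66],
    "HPCC":   [39.43, 51.16, 305.5, 10.43, 315.5, 206.1],
}


def speedups_vs_baseline(times: Dict[str, List[float]],
                         baseline: str = "Hive") -> Dict[str, List[Optional[float]]]:
    """Return baseline_time / engine_time per query; None marks a failed query."""
    base = times[baseline]
    result: Dict[str, List[Optional[float]]] = {}
    for engine, series in times.items():
        if engine == baseline:
            continue
        result[engine] = [b / t if t > 0 else None for b, t in zip(base, series)]
    return result


if __name__ == "__main__":
    for engine, ratios in speedups_vs_baseline(TIMES).items():
        cells = ", ".join("failed" if r is None else f"{r:.1f}x" for r in ratios)
        print(f"{engine}: {cells}")
```

Ratios below 1 mean the engine was slower than Hive on that query, which this calculation makes visible alongside the headline ranges.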

Findings
Problem Context
• Big data use cases stretch beyond unstructured batch jobs.
• Can DAG runtimes meet the demand and deliver the performance?
Findings
• DAG runtimes are still maturing
• Spark comes closest

NODOOP = Not only Hadoop

Questions

Backup

Benchmark Environment
• Cloudera Enterprise 5.3.2
• 4-node cluster [1 master + 3 workers]
• Memory: 62.9 GiB in each node
• Cores: 16
• TPC-DS database with scale of 250
• Queries used
  • Full Table Scan
  • Fact and Dimension Join
  • Aggregate functions
  • Fact to Fact Join
  • Text Analytics
  • Log Analytics
Versions
• Hadoop 2.5.0-cdh5.3.0
• Hive 0.13.1-cdh5.3.0
• presto-server-0.103
• Apache Drill 0.9.0
• impalad version 2.1.0-cdh5
• Spark 1.3.1
• HPCC 5.0.14.1
Data Volume
• TPC-DS scale of 250: 19.3 GB
  • Store Sales: 18.8 GB
  • Customer: 300.3 MB
• Text Analytics (twitter): 436.6 MB
  • CIKM twitter dataset
• Log Analytics (weblog): 5.0 GB
  • HPCC ECL WLAM sample

Completeness Scores
To-disk failover        2   3   0   3   0   3   4
HDFS Compatibility      4   4   4   4   4   4   2
Yarn Integration        4   0   0   3   1   4   0
File formats            4   4   4   3   2   4   1
Expressive language     3   3   3   4   3   3   3
Streaming support       0   0   0   4   0   4   0
Connectivity            4   4   4   4   4   2   3
Web UI                  2   3   4   4   4   3   3
Integrated Monitoring   2   3   4   4   4   3   4
Security                3   3   1   1   1   1   1
Hybrid Analytics        3   2   1   4   1   3   4
Seamless Dataframes     1   1   1   4   1   4   2
Total                  32  30  26  42  25  38  25
* Score: 0 [min] to 4 [max]
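The totals row can be recomputed from the per-criterion scores as a consistency check. The sketch below uses the values exactly as they appear in this table; the product names heading the columns are not present in the text, so columns are referred to by position only, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Scores per criterion, one value per (unlabeled) product column.
SCORES: Dict[str, List[int]] = {
    "To-disk failover":      [2, 3, 0, 3, 0, 3, 4],
    "HDFS Compatibility":    [4, 4, 4, 4, 4, 4, 2],
    "Yarn Integration":      [4, 0, 0, 3, 1, 4, 0],
    "File formats":          [4, 4, 4, 3, 2, 4, 1],
    "Expressive language":   [3, 3, 3, 4, 3, 3, 3],
    "Streaming support":     [0, 0, 0, 4, 0, 4, 0],
    "Connectivity":          [4, 4, 4, 4, 4, 2, 3],
    "Web UI":                [2, 3, 4, 4, 4, 3, 3],
    "Integrated Monitoring": [2, 3, 4, 4, 4, 3, 4],
    "Security":              [3, 3, 1, 1, 1, 1, 1],
    "Hybrid Analytics":      [3, 2, 1, 4, 1, 3, 4],
    "Seamless Dataframes":   [1, 1, 1, 4, 1, 4, 2],
}

STATED_TOTALS = [32, 30, 26, 42, 25, 38, 25]  # totals row as given


def column_totals(scores: Dict[str, List[int]]) -> List[int]:
    """Sum each product column across all criteria."""
    return [sum(column) for column in zip(*scores.values())]


def mismatches(scores: Dict[str, List[int]],
               stated: List[int]) -> List[Tuple[int, int, int]]:
    """Return (column_position, computed_total, stated_total) where they differ."""
    computed = column_totals(scores)
    return [(i, c, s) for i, (c, s) in enumerate(zip(computed, stated), start=1)
            if c != s]
```

With the values as they appear here, the first six columns match their stated totals, while the seventh sums to 27 against a stated 25. The text alone does not show whether a cell or the total is off, so that column is worth checking against the original slide.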
