N(ot)-o(nly)-(Ha)doop - the DAG showdown
Intel Corporation
Joydeep Ghosh & Seshu Edala
June, 2015
Copyright © 2015, Intel Corporation. All rights reserved.
Legal Message
THE INFORMATION PROVIDED IN THIS PRESENTATION IS INTENDED TO BE GENERAL IN NATURE AND IS NOT
SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON
INTEL'S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS
WILL OBTAIN SIMILAR RESULTS
This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN
THIS SUMMARY.
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer
systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
Key Messages
• Evolution of Hadoop to Big Data
• Introduction to DAG
• DAG runtimes
• Evaluation
• Performance
• Completeness
• Results
NODOOP is still not-only-Hadoop; the time for “no-MR” looks closer
slow
fragmented
skills gap
block-oriented
data mutability
Hadoop to Big Data
Processing Model: Complex Event, Batch, In-Memory
Analytical Model: Machine Learning, Textual, Spatial, Aggregate, Temporal, Graph
Storage Model: Unstructured, Relational, Columnar, Hierarchic, Graph
Language Model: MR, SQL, NOSQL, JSQL, NOSPARQL
Retrofitting Hadoop [unstructured batch analytics] to cater to the full big data demand
MapReduce (MR) and Directed Acyclic Graph (DAG)
[Diagram: Stage 1, Stage 2, Stage 3]
DAG
• continuous dataflow
• relational semantics
• in-memory buffering
MR
• sequential dataflow
• MR semantics
• on-disk storage
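The contrast between the two execution styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any of the evaluated runtimes is implemented: the three stages, the `spill` helper and the sample data are all hypothetical. The MR-style runner finishes each stage and stores its output on disk before the next stage starts; the DAG-style runner chains the same stages so records stream through memory.

```python
import json
import os
import tempfile


def parse(lines):
    """Stage 1: turn "key,value" text lines into (key, int) pairs."""
    for line in lines:
        key, value = line.split(",")
        yield key, int(value)


def keep_positive(pairs):
    """Stage 2: drop pairs whose value is not positive."""
    for key, value in pairs:
        if value > 0:
            yield key, value


def sum_by_key(pairs):
    """Stage 3: aggregate values per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


def spill(records, path):
    """Hypothetical helper: write a stage's full output to disk, then read it back."""
    with open(path, "w") as handle:
        json.dump(list(records), handle)
    with open(path) as handle:
        return [tuple(record) for record in json.load(handle)]


def run_mr_style(lines, workdir):
    """Sequential dataflow: each stage completes and is stored on disk before the next begins."""
    stage1 = spill(parse(lines), os.path.join(workdir, "stage1.json"))
    stage2 = spill(keep_positive(stage1), os.path.join(workdir, "stage2.json"))
    return sum_by_key(stage2)


def run_dag_style(lines):
    """Continuous dataflow: generators stream records between stages through memory."""
    return sum_by_key(keep_positive(parse(lines)))


sample = ["a,1", "b,-2", "a,3", "b,4"]
with tempfile.TemporaryDirectory() as workdir:
    mr_result = run_mr_style(sample, workdir)
dag_result = run_dag_style(sample)
print(mr_result, dag_result)
```

Both runners compute the same answer; the difference is where intermediate results live, which is the trade-off between on-disk storage and in-memory buffering listed above.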
MR & DAG Runtimes
DAG* and MR
* Only a few products were chosen for evaluation.
Versions: Hadoop 2.5.0-cdh5.3.0, Hive 0.13.1-cdh5.3.0, presto-server-0.103, Apache Drill 0.9.0, impalad version 2.1.0-cdh5, Spark 1.3.1, HPCC 5.0.14.1
Completeness Criteria
• On-disk failover
• HDFS compatibility
• YARN integration
• File formats
• Expressive language
• Streaming support
• Connectivity
• Web UI
• Integrated monitoring
• Security
• Hybrid analytics
• Seamless dataframes
Completeness Scores
Performance Criteria
[Chart: Performance comparison. Processing time in seconds (axis 0 to 1800) for Hive, Impala, Spark, Drill, Presto and HPCC across six queries: full table scan, join fact-dimension, aggregate function, join fact to fact, text analytics, log analytics.]
• All queries completed successfully
• A reliable baseline
Hive processing time (seconds):
Full table scan: 670.99
Join fact-dimension: 640.37
Aggregate function: 1705.75
Join fact to fact: 983.73
Text analytics: 1298.56
Log analytics: 411.88
• All queries completed successfully
• Lack of window functions in Spark SQL makes moving-average analytics challenging
• Mixed SQL & RDD programming
• Not DAG!
• ~2x to 8x faster than Hive
Spark processing time (seconds):
Full table scan: 87.28
Join fact-dimension: 192.88
Aggregate function: 669.09
Join fact to fact: 231.55
Text analytics: 132.05
Log analytics: 285
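To illustrate why missing window functions make moving-average analytics awkward: with SQL window functions the computation is a single expression such as `AVG(x) OVER (ORDER BY t ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)`, whereas without them the sliding window has to be hand-coded in procedural code. The sketch below is a minimal Python version of that hand-coded logic; the function name and sample values are assumptions for illustration and are not taken from the benchmark's log analytics query.

```python
from collections import deque


def moving_average(values, window):
    """Average of the current value and up to `window - 1` preceding values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # the oldest value drops out automatically
    running_sum = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            running_sum -= recent[0]  # value about to be evicted by append
        recent.append(value)
        running_sum += value
        averages.append(running_sum / len(recent))
    return averages


print(moving_average([1, 2, 3, 4], window=2))
```

In a distributed setting the same logic also needs the rows ordered and the window state carried across partition boundaries, which is part of what a built-in window function handles for the user.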
• In-memory DAG
• Table generating functions and array functions are not supported; text analytics example failed
• ~1x to 20x faster than Hive
Impala processing time (seconds):
Full table scan: 29.06
Join fact-dimension: 72.45
Aggregate function: 222.98
Join fact to fact: 168.86
Text analytics: 0 (query failed)
Log analytics: 747.64
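A table generating function turns one input row into several output rows; Hive's `explode` over an array is a common example. Text analytics typically relies on this to split each document into one row per word before counting. The slides do not show the actual text analytics query, so the following Python sketch only illustrates the row-expanding step, with hypothetical function names and sample tweets.

```python
def explode_words(rows):
    """Emit one (tweet_id, word) row for every whitespace-separated word."""
    for tweet_id, text in rows:
        for word in text.lower().split():
            yield tweet_id, word


def word_counts(rows):
    """Count word occurrences across all exploded rows."""
    counts = {}
    for _, word in explode_words(rows):
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample input: (tweet_id, text)
tweets = [(1, "DAG beats MR"), (2, "MR is slow")]
exploded = list(explode_words(tweets))
counts = word_counts(tweets)
print(exploded)
print(counts)
```

An engine without table generating or array functions has no direct way to express the `explode_words` step in its query language.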
• In-memory DAG – no resilience
• Table generating functions not supported; text analytics example failed
• Window functions are still beta/unsupported; log analytics failed
• ~5x to 50x faster than Hive
Drill processing time (seconds):
Full table scan: 126.99
Join fact-dimension: 83.97
Aggregate function: 250.19
Join fact to fact: 15.13
Text analytics: 0 (query failed)
Log analytics: 0 (query failed)
• In-memory DAG – no resilience
• Table generating functions not supported; text analytics example failed
• ~5x to 60x faster than Hive
Presto processing time (seconds):
Full table scan: 4.69
Join fact-dimension: 67
Aggregate function: 491
Join fact to fact: 233.66
Text analytics: 0 (query failed)
Log analytics: 89.66
• All queries completed successfully
• On-disk DAG runtime; reliable, complete, performant
• Declarative ECL language; not SQL
• No native support for HDFS
• ~2x to 20x faster than Hive
HPCC processing time (seconds):
Full table scan: 39.43
Join fact-dimension: 51.16
Aggregate function: 305.5
Join fact to fact: 10.43
Text analytics: 315.5
Log analytics: 206.1
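The per-engine multipliers quoted on the preceding slides can be recomputed from the chart values. The sketch below assumes the multipliers are relative to the Hive timings and that a value of 0 marks a failed query rather than a measured time; both are readings of the slides, not statements made by the authors. The function names are hypothetical.

```python
QUERIES = [
    "Full table scan",
    "Join fact-dimension",
    "Aggregate function",
    "Join fact to fact",
    "Text analytics",
    "Log analytics",
]

# Processing time in seconds, as printed on the per-engine slides.
SECONDS = {
    "Hive": [670.99, 640.37, 1705.75, 983.73, 1298.56, 411.88],
    "Spark": [87.28, 192.88, 669.09, 231.55, 132.05, 285.0],
    "Impala": [29.06, 72.45, 222.98, 168.86, 0, 747.64],
    "Drill": [126.99, 83.97, 250.19, 15.13, 0, 0],
    "Presto": [4.69, 67.0, 491.0, 233.66, 0, 89.66],
    "HPCC": [39.43, 51.16, 305.5, 10.43, 315.5, 206.1],
}


def speedups(baseline="Hive"):
    """Return {engine: {query: baseline_time / engine_time}}; None for failed queries."""
    result = {}
    for engine, times in SECONDS.items():
        if engine == baseline:
            continue
        result[engine] = {
            query: (base / seconds if seconds > 0 else None)
            for query, base, seconds in zip(QUERIES, SECONDS[baseline], times)
        }
    return result


def speedup_range(engine, baseline="Hive"):
    """Smallest and largest speedup over the queries the engine completed."""
    values = [v for v in speedups(baseline)[engine].values() if v is not None]
    return min(values), max(values)


for engine in speedups():
    low, high = speedup_range(engine)
    print(f"{engine}: {low:.1f}x to {high:.1f}x")
```

Ratios below 1 indicate a query on which the engine was slower than the baseline.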
Problem Context
• Big data use cases stretch beyond unstructured batch jobs.
• Can DAG meet the demand and performance?
Findings
• DAG runtimes are still maturing
• Spark comes closest
NODOOP = Not only Hadoop
Questions
Backup
Benchmark Environment
• Cloudera Enterprise 5.3.2
• 4-node cluster [1 master + 3 workers]
• Memory: 62.9 GiB in each node
• Cores: 16
• TPC-DS database with scale of 250
• Queries used
  • Full table scan
  • Fact and dimension join
  • Aggregate functions
  • Fact to fact join
  • Text analytics
  • Log analytics
Versions
• Hadoop 2.5.0-cdh5.3.0
• Hive 0.13.1-cdh5.3.0
• presto-server-0.103
• Apache Drill 0.9.0
• impalad version 2.1.0-cdh5
• Spark 1.3.1
• HPCC 5.0.14.1
Data Volume
• TPC-DS scale of 250: 19.3 GB
  • Store Sales: 18.8 GB
  • Customer: 300.3 MB
• Text analytics (Twitter): 436.6 MB
  • CIKM Twitter dataset
• Log analytics (weblog): 5.0 GB
  • HPCC ECL WLAM sample
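The slides name the query categories but do not show the query text. As a rough illustration of the first three categories, the sketch below runs a full table scan, a fact-to-dimension join and an aggregate over tiny in-memory tables using Python's built-in `sqlite3` module. The table and column names follow the TPC-DS schema (`store_sales`, `customer`); the rows and the queries themselves are made up for illustration and are not the benchmark queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (c_customer_sk INTEGER PRIMARY KEY, c_last_name TEXT);
    CREATE TABLE store_sales (ss_customer_sk INTEGER, ss_quantity INTEGER, ss_net_paid REAL);
    INSERT INTO customer VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO store_sales VALUES (1, 2, 10.0), (1, 1, 5.0), (2, 3, 7.5);
    """
)

# Full table scan: read every row of the fact table.
full_scan = conn.execute("SELECT * FROM store_sales").fetchall()

# Fact and dimension join: attach customer attributes to each sale.
fact_dim_join = conn.execute(
    """
    SELECT c.c_last_name, s.ss_net_paid
    FROM store_sales AS s
    JOIN customer AS c ON s.ss_customer_sk = c.c_customer_sk
    ORDER BY c.c_customer_sk, s.ss_net_paid
    """
).fetchall()

# Aggregate functions: total paid and average quantity per customer.
aggregate = conn.execute(
    """
    SELECT ss_customer_sk, SUM(ss_net_paid), AVG(ss_quantity)
    FROM store_sales
    GROUP BY ss_customer_sk
    ORDER BY ss_customer_sk
    """
).fetchall()

print(full_scan)
print(fact_dim_join)
print(aggregate)
conn.close()
```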
Completeness Scores
To-disk failover        2  3  0  3  0  3  4
HDFS Compatibility      4  4  4  4  4  4  2
Yarn Integration        4  0  0  3  1  4  0
File formats            4  4  4  3  2  4  1
Expressive language     3  3  3  4  3  3  3
Streaming support       0  0  0  4  0  4  0
Connectivity            4  4  4  4  4  2  3
Web UI                  2  3  4  4  4  3  3
Integrated Monitoring   2  3  4  4  4  3  4
Security                3  3  1  1  1  1  1
Hybrid Analytics        3  2  1  4  1  3  4
Seamless Dataframes     1  1  1  4  1  4  2
Total                  32 30 26 42 25 38 25
* Score: 0 (min) to 4 (max)
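The totals row is the column-wise sum of the twelve criteria scores. A small Python sketch, using only the numbers printed in the table, recomputes those sums and reports any column where the recomputed value differs from the printed total. Columns are referred to by position (1 to 7) because the product names in the header did not survive as text; the function names are hypothetical.

```python
SCORES = {
    "To-disk failover": [2, 3, 0, 3, 0, 3, 4],
    "HDFS Compatibility": [4, 4, 4, 4, 4, 4, 2],
    "Yarn Integration": [4, 0, 0, 3, 1, 4, 0],
    "File formats": [4, 4, 4, 3, 2, 4, 1],
    "Expressive language": [3, 3, 3, 4, 3, 3, 3],
    "Streaming support": [0, 0, 0, 4, 0, 4, 0],
    "Connectivity": [4, 4, 4, 4, 4, 2, 3],
    "Web UI": [2, 3, 4, 4, 4, 3, 3],
    "Integrated Monitoring": [2, 3, 4, 4, 4, 3, 4],
    "Security": [3, 3, 1, 1, 1, 1, 1],
    "Hybrid Analytics": [3, 2, 1, 4, 1, 3, 4],
    "Seamless Dataframes": [1, 1, 1, 4, 1, 4, 2],
}

# Totals row as printed on the slide.
STATED_TOTALS = [32, 30, 26, 42, 25, 38, 25]


def column_totals(scores):
    """Sum each column across all criteria."""
    return [sum(column) for column in zip(*scores.values())]


def mismatches(scores, stated):
    """Return (column_number, recomputed, printed) for columns that disagree."""
    return [
        (index + 1, computed, printed)
        for index, (computed, printed) in enumerate(zip(column_totals(scores), stated))
        if computed != printed
    ]


print(column_totals(SCORES))
print(mismatches(SCORES, STATED_TOTALS))
```

On the values as transcribed, the seventh column sums to 27 rather than the printed 25; whether the total or one of the individual scores is the transcription error cannot be determined from the slide.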
N(ot)-o(nly)-(Ha)doop - the DAG showdown

More Related Content

What's hot

Passing The Joel Test In The PHP World
Passing The Joel Test In The PHP WorldPassing The Joel Test In The PHP World
Passing The Joel Test In The PHP World
Lorna Mitchell
 

What's hot (20)

Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase – Big D...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase – Big D...Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase – Big D...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase – Big D...
 
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Telec...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Telec...Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Telec...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Telec...
 
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
Intel® Xeon® Processor E7-8800/4800 v4 EAMG 2.0
 
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Data ...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Data ...Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Data ...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Data ...
 
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Fin...
	 Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Fin...	 Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Fin...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Fin...
 
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Tec...
	 Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Tec...	 Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Tec...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Tec...
 
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Core ...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Core ...Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Core ...
Intel® Xeon® Processor E5-2600 v3 Product Family Application Showcase - Core ...
 
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSciStreamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
Streamline End-to-End AI Pipelines with Intel, Databricks, and OmniSci
 
IT@Intel: Creating Smart Spaces with All-in-Ones
IT@Intel:  Creating Smart Spaces with All-in-OnesIT@Intel:  Creating Smart Spaces with All-in-Ones
IT@Intel: Creating Smart Spaces with All-in-Ones
 
Using Xeon + FPGA for Accelerating HPC Workloads
Using Xeon + FPGA for Accelerating HPC WorkloadsUsing Xeon + FPGA for Accelerating HPC Workloads
Using Xeon + FPGA for Accelerating HPC Workloads
 
Intel HPC Update
Intel HPC UpdateIntel HPC Update
Intel HPC Update
 
QATCodec: past, present and future
QATCodec: past, present and futureQATCodec: past, present and future
QATCodec: past, present and future
 
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications ShowcaseIntel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
Intel® Xeon® Processor E5-2600 v4 Enterprise Database Applications Showcase
 
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
 
Passing The Joel Test In The PHP World
Passing The Joel Test In The PHP WorldPassing The Joel Test In The PHP World
Passing The Joel Test In The PHP World
 
Scale Up Performance with Intel® Development
Scale Up Performance with Intel® DevelopmentScale Up Performance with Intel® Development
Scale Up Performance with Intel® Development
 
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
A Dell Latitude 5420 laptop powered by a four-core Intel Core i5-1145G7 vPro ...
 
Intel Knights Landing Slides
Intel Knights Landing SlidesIntel Knights Landing Slides
Intel Knights Landing Slides
 
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
 
A Dell Latitude 7420 laptop powered by a four-core Intel Core i7-1185G7 vPro ...
A Dell Latitude 7420 laptop powered by a four-core Intel Core i7-1185G7 vPro ...A Dell Latitude 7420 laptop powered by a four-core Intel Core i7-1185G7 vPro ...
A Dell Latitude 7420 laptop powered by a four-core Intel Core i7-1185G7 vPro ...
 

Similar to N(ot)-o(nly)-(Ha)doop - the DAG showdown

Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Alluxio, Inc.
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI ConvergenceDAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
inside-BigData.com
 
QCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AIQCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AI
Lex Yu
 
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red_Hat_Storage
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Databricks
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthy
DataWorks Summit
 

Similar to N(ot)-o(nly)-(Ha)doop - the DAG showdown (20)

Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
Intel: How to Use Alluxio to Accelerate BigData Analytics on the Cloud and Ne...
 
Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster
Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster
Ceph Day Seoul - Delivering Cost Effective, High Performance Ceph cluster
 
Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster
Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster
Ceph Day Taipei - Delivering cost-effective, high performance, Ceph cluster
 
Ceph Day KL - Delivering cost-effective, high performance Ceph cluster
Ceph Day KL - Delivering cost-effective, high performance Ceph clusterCeph Day KL - Delivering cost-effective, high performance Ceph cluster
Ceph Day KL - Delivering cost-effective, high performance Ceph cluster
 
Python* Scalability in Production Environments
Python* Scalability in Production EnvironmentsPython* Scalability in Production Environments
Python* Scalability in Production Environments
 
Intel python 2017
Intel python 2017Intel python 2017
Intel python 2017
 
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph clusterCeph Day Tokyo - Delivering cost effective, high performance Ceph cluster
Ceph Day Tokyo - Delivering cost effective, high performance Ceph cluster
 
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
Red Hat Storage Day New York - Intel Unlocking Big Data Infrastructure Effici...
 
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI ConvergenceDAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence
 
QCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AIQCon2016--Drive Best Spark Performance on AI
QCon2016--Drive Best Spark Performance on AI
 
Denis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python PerformanceDenis Nagorny - Pumping Python Performance
Denis Nagorny - Pumping Python Performance
 
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
TDC2019 Intel Software Day - Tecnicas de Programacao Paralela em Machine Lear...
 
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
Red Hat Storage Day Atlanta - Designing Ceph Clusters Using Intel-Based Hardw...
 
Konsolidace Oracle DB na systémech s procesory M7
Konsolidace Oracle DB na systémech s procesory M7Konsolidace Oracle DB na systémech s procesory M7
Konsolidace Oracle DB na systémech s procesory M7
 
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent MemoryAccelerate Your Apache Spark with Intel Optane DC Persistent Memory
Accelerate Your Apache Spark with Intel Optane DC Persistent Memory
 
Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352Hadoop vs Java Batch Processing JSR 352
Hadoop vs Java Batch Processing JSR 352
 
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chip
Spring Hill (NNP-I 1000): Intel's Data Center Inference ChipSpring Hill (NNP-I 1000): Intel's Data Center Inference Chip
Spring Hill (NNP-I 1000): Intel's Data Center Inference Chip
 
FPGA MeetUp
FPGA MeetUpFPGA MeetUp
FPGA MeetUp
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthy
 
IBM Power for Life Sciences
IBM Power for Life SciencesIBM Power for Life Sciences
IBM Power for Life Sciences
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

N(ot)-o(nly)-(Ha)doop - the DAG showdown

  • 1. N(ot)-o(nly)-(Ha)doop - the DAG showdown Intel Corporation Joydeep Ghosh & Seshu Edala June, 2015
  • 2. Copyright © 2015, Intel Corporation. All rights reserved. Legal Message THE INFORMATION PROVIDED IN THIS PRESENTATION IS INTENDED TO BE GENERAL IN NATURE AND IS NOT SPECIFIC GUIDANCE. RECOMMENDATIONS (INCLUDING POTENTIAL COST SAVINGS) ARE BASED UPON INTEL'S EXPERIENCE AND ARE ESTIMATES ONLY. INTEL DOES NOT GUARANTEE OR WARRANT OTHERS WILL OBTAIN SIMILAR RESULTS This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. * Other names and brands may be claimed as the property of others. Copyright © 2015, Intel Corporation. All rights reserved. 2
  • 3. Key Messages: Evolution of Hadoop to Big Data; Introduction to DAG; DAG runtimes; Evaluation; Performance; Completeness; Results. nodoop is still not-only-hadoop; time for "no-MR" looks closer.
  • 4. slow; fragmented; skills gap; block-oriented; data mutability
  • 5. Hadoop to Big Data. Processing Model: Complex Event, Batch, In-Memory. Analytical Model: Machine Learning, Textual, Spatial, Aggregate, Temporal, Graph. Storage Model: Unstructured, Relational, Columnar, Hierarchic, Graph. Language Model: MR, SQL, NOSQL, JSQL, NOSPARQL. Retrofitting Hadoop [unstructured batch analytics] to cater to the full big data demand.
  • 6. Map Reduce (MR) and Directed Acyclic Graph (DAG). Stage 1, Stage 2, Stage 3. DAG: continuous dataflow; relational semantics; in-memory buffering. MR: sequential dataflow; MR semantics; on-disk storage.
  • 7. MR & DAG Runtimes: MR; DAG*. * Only a few products were chosen for evaluation. Versions: Hadoop 2.5.0-cdh5.3.0, Hive 0.13.1-cdh5.3.0, presto-server-0.103, Apache Drill 0.9.0, impalad version 2.1.0-cdh5, Spark 1.3.1, HPCC 5.0.14.1. Note: Other names and brands may be claimed as the property of others.
  • 8. Completeness Criteria: On Disk failover; HDFS Compatibility; Yarn Integration; File formats; Expressive language; Streaming support; Connectivity; Web UI; Integrated Monitoring; Security; Hybrid Analytics; Seamless Dataframes.
  • 9. Completeness Scores. Note: Other names and brands may be claimed as the property of others.
  • 10. Performance Criteria. [Chart: Performance Comparison. Y-axis: processing time (seconds). Queries: Full Table Scan, Join Fact Dimension, Aggregate Function, Join Fact to Fact, Text Analytics, Log Analytics. Series: Hive, Impala, Spark, Drill, Presto, HPCC.]
  • 11. Hive: All queries completed successfully. A reliable baseline. Hive processing time (seconds): full table scan 670.99; join fact-dimension 640.37; aggregate function 1705.75; join fact-to-fact 983.73; text analytics 1298.56; log analytics 411.88.
  • 12. Spark: All queries completed successfully. Lack of window functions in Spark SQL makes moving average analytics challenging. Mixed SQL & RDD programming. Not DAG! ~2x to 8x. Spark processing time (seconds): full table scan 87.28; join fact-dimension 192.88; aggregate function 669.09; join fact-to-fact 231.55; text analytics 132.05; log analytics 285.
  • 13. Impala: In-memory DAG. Table generating functions and array functions are not supported; text analytics example failed. ~1x to 20x. Impala processing time (seconds): full table scan 29.06; join fact-dimension 72.45; aggregate function 222.98; join fact-to-fact 168.86; text analytics 0 (failed); log analytics 747.64.
  • 14. Drill: In-memory DAG – No Resilience. Table generating functions not supported; text analytics example failed. Window functions are still beta/unsupported; log analytics failed. ~5x to 50x. Drill processing time (seconds): full table scan 126.99; join fact-dimension 83.97; aggregate function 250.19; join fact-to-fact 15.13; text analytics 0 (failed); log analytics 0 (failed).
  • 15. Presto: In-memory DAG – No Resilience. Table generating functions not supported; text analytics example failed. ~5x to 60x. Presto processing time (seconds): full table scan 4.69; join fact-dimension 67; aggregate function 491; join fact-to-fact 233.66; text analytics 0 (failed); log analytics 89.66.
  • 16. HPCC: All queries completed successfully. On-disk DAG runtime; reliable, complete, performant. Declarative ECL language; not SQL. No native support for HDFS. ~2x to 20x. HPCC processing time (seconds): full table scan 39.43; join fact-dimension 51.16; aggregate function 305.5; join fact-to-fact 10.43; text analytics 315.5; log analytics 206.1.
  • 17. Problem Context: Big data use-cases stretch beyond unstructured batch jobs. Can DAG meet the demand and performance? Findings: DAG runtimes are still maturing. Spark comes closest.
  • 18. NODOOP = Not only Hadoop
  • 19. Questions
  • 20.
  • 21. Backup
  • 22. Benchmark Environment: Cloudera Enterprise 5.3.2; 4 Node Cluster [1 master + 3 workers]; Memory 62.9 GiB in each node; Cores 16; TPCDS Database with Scale of 250. Queries used: Full Table Scan; Fact and Dimension Join; Aggregate functions; Fact to Fact Join; Text Analytics; Log Analytics. Versions: Hadoop 2.5.0-cdh5.3.0; Hive 0.13.1-cdh5.3.0; presto-server-0.103; Apache Drill 0.9.0; impalad version 2.1.0-cdh5; Spark 1.3.1; HPCC 5.0.14.1. Data Volume: TPCDS Scale of 250, 19.3 GB (Store Sales 18.8 GB; Customer 300.3 MB); Text Analytics (twitter), 436.6 MB, CIKM twitter dataset; Log Analytics (weblog), 5.0 GB, HPCC ECL WLAM sample.
  • 23. Completeness Scores (seven score columns; score range 0 [min] to 4 [max])
      To-disk failover: 2 3 0 3 0 3 4
      HDFS Compatibility: 4 4 4 4 4 4 2
      Yarn Integration: 4 0 0 3 1 4 0
      File formats: 4 4 4 3 2 4 1
      Expressive language: 3 3 3 4 3 3 3
      Streaming support: 0 0 0 4 0 4 0
      Connectivity: 4 4 4 4 4 2 3
      Web UI: 2 3 4 4 4 3 3
      Integrated Monitoring: 2 3 4 4 4 3 4
      Security: 3 3 1 1 1 1 1
      Hybrid Analytics: 3 2 1 4 1 3 4
      Seamless Dataframes: 1 1 1 4 1 4 2
      Total: 32 30 26 42 25 38 25
      Note: Other names and brands may be claimed as the property of others.
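
The "~Nx" ranges quoted on slides 12 to 16 can be checked against the per-query timings. The following is a minimal sketch, assuming the ranges are meant as speedups relative to the Hive baseline of slide 11 and that a time of 0 marks a query that failed. The `speedups` helper and the dictionary layout are illustrative and not part of the deck; the numbers are copied from the slides.

```python
# Processing times in seconds, copied from slides 11-16.
QUERIES = [
    "full table scan", "join fact-dimension", "aggregate function",
    "join fact-to-fact", "text analytics", "log analytics",
]
TIMES = {
    "Hive":   [670.99, 640.37, 1705.75, 983.73, 1298.56, 411.88],
    "Spark":  [87.28, 192.88, 669.09, 231.55, 132.05, 285.0],
    "Impala": [29.06, 72.45, 222.98, 168.86, 0, 747.64],
    "Drill":  [126.99, 83.97, 250.19, 15.13, 0, 0],
    "Presto": [4.69, 67, 491, 233.66, 0, 89.66],
    "HPCC":   [39.43, 51.16, 305.5, 10.43, 315.5, 206.1],
}


def speedups(engine, baseline="Hive"):
    """Return {query: baseline_time / engine_time}.

    Assumption: a time of 0 means the query failed, so it is skipped.
    """
    result = {}
    for query, base, elapsed in zip(QUERIES, TIMES[baseline], TIMES[engine]):
        if elapsed > 0:
            result[query] = base / elapsed
    return result


for engine in TIMES:
    if engine == "Hive":
        continue
    ratios = speedups(engine)
    print(f"{engine}: {min(ratios.values()):.1f}x to "
          f"{max(ratios.values()):.1f}x over {len(ratios)} completed queries")
```

A ratio below 1 means the engine was slower than Hive on that query, and the number of completed queries shows how much of the workload each range actually covers.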

Editor's Notes

  1. 2
  2. DAG definition: directed = the connections between the nodes (edges) have a direction, so A -> B is not the same as B -> A; acyclic = "non-circular", so moving from node to node by following the edges, you will never encounter the same node a second time; graph = a structure consisting of nodes that are connected to each other by edges. Every tree is a directed acyclic graph, but a DAG is more general than a tree: a node may have more than one incoming edge.
  3. In-memory to on-disk failover [DAG v MR]; Storage compatibility [HDFS v proprietary]; Resource management [Yarn v other]; File format compatibility [Parquet/Columnar, Avro/Row, JSON/Hierarchic, Textfile/Linear]; Expressive language [Declarative/Functional v Imperative]; Streaming + batch support [Temporal/Dimensional Partitioning v Tabular Scans]; Connectivity to other systems [ODBC v WS, Virtual v Physical]; Ease of use, Web UIs [IDE v Putty, Dashboard v File-based aggregation]; Execution, monitoring, debugging, logging [Centralized v Decentralized; Integrated with CM v Fragmented]; Security [Authentication with LDAP/AD, Authorization/ACLs with Sentry/Posix/Kerberos, Encryption on disk v wire, Key Management central v isolated]; Integrated graph, temporal, and statistical analytics [one framework v multiple libraries]; Integrated Files, Tables, Datasets, DataFrame "views" (ability to share results).
  4. https://cwiki.apache.org/confluence/display/DRILL/Release+Notes Drill now features complete support for UNION ALL and COUNT(DISTINCT). Drill 0.8 also includes new functions such as unix_timestamp and the window functions sum, count and rank. Note that these window functions should be considered beta.
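
The DAG definition in note 2 can be made concrete with a small example. The sketch below uses Kahn's algorithm to order the nodes of a directed graph so that every edge points forward, and it reports an error if the graph contains a cycle. The query plan and its stage names are hypothetical and chosen only for illustration; they do not come from any of the evaluated runtimes.

```python
from collections import deque


def topological_order(edges):
    """Kahn's algorithm: return nodes in dependency order.

    Raises ValueError if the directed graph has a cycle (i.e. is not a DAG).
    """
    nodes = set()
    successors = {}
    indegree = {}
    for src, dst in edges:
        nodes.update((src, dst))
        successors.setdefault(src, []).append(dst)
        indegree[dst] = indegree.get(dst, 0) + 1

    # Start with the nodes that have no incoming edges (sorted for determinism).
    ready = deque(sorted(n for n in nodes if indegree.get(n, 0) == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors.get(node, []):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


# Hypothetical plan: two scans feed a join, which feeds an aggregate.
plan = [("scan_sales", "join"), ("scan_customer", "join"), ("join", "aggregate")]
print(topological_order(plan))
# prints ['scan_customer', 'scan_sales', 'join', 'aggregate']
```

Here `join` has two incoming edges, which is allowed in a DAG but not in a tree. Adding an edge such as `("aggregate", "scan_sales")` would create a cycle and make the function raise `ValueError`.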