Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Page1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
LLAP: Sub-Second Analytical Queries in Hive
Gunther Hagleitner
Page2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Perf overview – a demo in a gif
Page3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive 2 with LLAP: 5x Performance Boost at 10 TB Scale
0
2
4
6
8
10
1...
5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
0
5
10
15
20
25
30
35
40
45
50
0
50
100
150
200
250
Speedup(xFactor)...
6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Apache Hive vs. Apache Impala at 10TB
 10TB scale on 10 identical
n...
7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Apache Hive vs. Presto on a partitioned 1TB dataset.
 Presto lacks ...
8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive LLAP: Stable Performance under High Concurrency
4x Queries,
2.8...
9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive with LLAP: Linear Scale
Average and Maximum query times
Mixed S...
Page10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Performance – cache on HDFS, 1Tb scale
20.2
18.3
17.6
16.2 16.2...
Page11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Performance – cache on cloud storage
• Cloud storage, slower th...
Page12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive with LLAP on text data (CSV, logs, etc)
-
5.00
10.00
15.00...
Page13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Hive with LLAP: Architecture Overview
Deep
Storage
YARN Cluster...
Page14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
I/OExecution
LLAP: In-memory processing
Decoder
Distributed FS
...
Page15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Parallel queries – priorities, preemption
• Lower-priority frag...
Page16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Sidebar: Relational data node
• LLAP can provide a "relational ...
Page17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Example - SparkSQL integration
• Allowing direct access to Hive...
Page18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Example - SparkSQL integration – execution flow
Page 18
HadoopR...
Page19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
LLAP Availability
• First version shipped in Apache Hive 2.0
• ...
Page20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Setup: Ambari
• Simple setup
• Integrated into Hive configurati...
Page21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Setup: Cloud
• Spin up LLAP clusters in minutes
• Exposes HS2 e...
Page22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Monitoring
• Simple UI for monitoring
• JMX endpoint with more ...
Page23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Monitoring
• HDP ships with number of default
Grafana dashboard...
Page24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Watching queries – Tez UI integration
Page25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Future work
• Fine-grained workload management
• Reservations &...
Page26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Summary
• LLAP is a new execution substrate for fast, concurren...
Page27 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Questions?
?
Interested? Stop by the Hortonworks booth to learn...
Upcoming SlideShare
Loading in …5
×

LLAP: Sub-Second Analytical Queries in Hive

2,923 views

Published on

We discuss the current state of LLAP (Live Long and Process) – the concurrent sub-second execution of analytical queries engine for Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of the time-tested fault tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.

Published in: Technology
  • Secrets to making $$$ with paid surveys... ★★★ https://tinyurl.com/realmoneystreams2019
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • I went from getting $3 surveys to $500 surveys every day!! learn more... ★★★ https://tinyurl.com/make2793amonth
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

LLAP: Sub-Second Analytical Queries in Hive

  1. 1. Page1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved LLAP: Sub-Second Analytical Queries in Hive Gunther Hagleitner
  2. 2. Page2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Perf overview – a demo in a gif
  3. 3. Page3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
  4. 4. 4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive 2 with LLAP: 5x Performance Boost at 10 TB Scale 0 2 4 6 8 10 12 14 16 18 0 500 1000 1500 2000 2500 Improvement(Factor) Querytime,s Hive 2 with LLAP: 5x Performance Improvement Across All Query Types (10 TB, 10 Nodes (256GB/32 cores), TPC-DS Queries) Hive 2 LLAP Improvement v Hive 1 (Factor)
  5. 5. 5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved 0 5 10 15 20 25 30 35 40 45 50 0 50 100 150 200 250 Speedup(xFactor) QueryTime(s)(LowerisBetter) Hive 2 with LLAP averages 26x faster than Hive 1 Hive 1 / Tez Time (s) Hive 2 / LLAP Time(s) Speedup (x Factor) Hive 2 with LLAP: 25+x Performance Boost: Interactive / 1TB Scale
  6. 6. 6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Apache Hive vs. Apache Impala at 10TB  10TB scale on 10 identical nodes.  Hive and Impala showed similar times on most smaller queries.  Hive scaled better, completing larger queries quicker with lower end-to- end time for all queries Highlights 1 10 100 1000 10000 Querytime,s Impala 2.8 Hive 2.1
  7. 7. 7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Apache Hive vs. Presto on a partitioned 1TB dataset.  Presto lacks basic performance optimizations like dynamic partition pruning.  On a real dataset / workload Presto perform poorly without full re-writes.  Example: Query 55 without re-writes = 185.17s, with re- writes = 16s. LLAP = 1.37s. Highlights
  8. 8. 8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive LLAP: Stable Performance under High Concurrency 4x Queries, 2.8x Runtime Difference 5x Queries, 4.6x Runtime Difference Mark Concurrent Queries Average Runtime 5 7.76s 25 36.24s 100 102.89s
  9. 9. 9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive with LLAP: Linear Scale Average and Maximum query times Mixed SQL workload of simple and complex queries 1TB Dataset Size 0 20 40 60 80 100 120 8 16 32 Time(s) Concurrent Users Average Query Time by Concurrency, 1TB Average Time: 8 Node Average Time: 16 Node 0 50 100 150 200 250 300 8 16 32 Time(s) Concurrent Users Maximum Query Time by Concurrency, 1TB Max Time: 8 Node Max Time: 16 Node Linear Scale: Double Nodes, Consistent Runtime
  10. 10. Page10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Performance – cache on HDFS, 1Tb scale 20.2 18.3 17.6 16.2 16.2 16.8 14.5 11.3 0.00% 20.00% 40.00% 60.00% 80.00% 100.00% 0 5 10 15 20 25 24 26 28 30 32 34 36 38 40 42 Cachehitrate Queryruntime,s Cache size, Gb Query runtime, after unrelated workload Query runtime, after related workload Metadata hit rate Data hit rate
  11. 11. Page11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Performance – cache on cloud storage • Cloud storage, slower than on-prem HDFS • Large scan on 1Tb scale • Local HD/SDD cache • Massive improvement on slow storage with little memory cost 0 50 100 150 200 250 300 350 400 Cold cache Hot cache Runtime,s Entire query Scan portion
  12. 12. Page12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive with LLAP on text data (CSV, logs, etc) - 5.00 10.00 15.00 20.00 25.00 30.00 35.00 40.00 45.00 50.00 0 200 400 600 800 1000 1200 1400 Text: No Cache Text: Cached Improvement (Factor)
  13. 13. Page13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Hive with LLAP: Architecture Overview Deep Storage YARN Cluster LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors LLAP Daemon Query Executors Query Coordinators Coord- inator Coord- inator Coord- inator HiveServer2 (Query Endpoint) ODBC / JDBC SQL Queries In-Memory Cache (Shared Across All Users) HDFS and Compatible S3 WASB Isilon
  14. 14. Page14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved I/OExecution LLAP: In-memory processing Decoder Distributed FS Fragment Hive operator Hive operator Vectorized processin g col1 col2 Low-level pushdown (e.g. filter)* SSD/HD cache Off-heap cache Compression codec Native data vectors Compact encoded data Compressed data col1 col2
  15. 15. Page15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Parallel queries – priorities, preemption • Lower-priority fragments can be preempted • For example, a fragment can start running before its inputs are ready, for better pipelining; such fragments may be preempted • LLAP work queue examines the DAG parameters to give preference to interactive (BI) queries LLAP QueueExecutor Executor Interactive query map 1/3 … Interactive query map 3/3 Executor Interactive query map 2/3 Wide query reduce waiting Time ClusterUtilization Long-Running Query Short-Running Query
  16. 16. Page16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Sidebar: Relational data node • LLAP can provide a "relational datanode" view of the data • Standard interface to read tables directly from LLAP nodes • Supports column-level security • LLAP handles ACID, schema-evolution • Optimized for fast access/scans • Read horizontal table partitions in parallel on compute nodes • Push projections, filters, aggregates into LLAP • Utilizes cache and LLAP IO subsystem
  17. 17. Page17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Example - SparkSQL integration • Allowing direct access to Hive table data from Spark breaks security model for warehouses secured using Ranger or SQL Standard Authorization • Other features may not work correctly unless support added in Spark • Row/Cell level security (when implemented), ACID, Schema Evolution, … • SparkSQL has interfaces for external sources; they can be implemented to access Hive data via HS2/LLAP • SparkSQL Catalyst optimizer hooks can be used to push processing into LLAP (e.g. filters on the tables) • Some optimizations, like dynamic partition pruning, can be used by Hive even if SparkSQL doesn't support them; if execution pushdown is advanced enough
  18. 18. Page18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Example - SparkSQL integration – execution flow Page 18 HadoopRDD.compute() LlapInputFormat.getRecordReader() - Open socket for incoming data - Send package/plan to LLAP Check permissions Generate splits w/LLAP locations Return securely signed splits HS2/Metastore Spark Executor HadoopRDD.getPartitions() Get Hive splits Cluster nodes Verify splits Run query plan Send data back LLAP Partitions/ Splits Request splits : : var llapContext = LlapContext.newInstance( sparkContext, jdbcUrl) var df: DataFrame = llapContext.sql("select * from tpch_text_5.region") DataFrame for Hive/LLAP data
  19. 19. Page19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved LLAP Availability • First version shipped in Apache Hive 2.0 • Hive 2.0.1 contains important bug fixes to make it production ready • Hive 2.1 has new features and further perf improvements • HDP 2.5 includes beta version of LLAP • HDP 2.6 includes GA version of Apache Hive 2.1 w/ LLAP
  20. 20. Page20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Setup: Ambari • Simple setup • Integrated into Hive configuration page • Launches HS2 instance and LLAP cluster • Handles configuration, slider, security, alerts and monitoring, … • Requires Ambari 2.5.x • Requires HDP 2.6.x stack (includes Apache Hive 2.1 w/ LLAP)
  21. 21. Page21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Setup: Cloud • Spin up LLAP clusters in minutes • Exposes HS2 endpoint • JDBC/ODBC • Zeppelin/Hive view also options • RDS + S3 friendly • Use HD cache for fast S3 access • RDS for shared metastore instance • HDP 2.6 available now!
  22. 22. Page22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Monitoring • Simple UI for monitoring • JMX endpoint with more data • Logs and jstack endpoints
  23. 23. Page23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Monitoring • HDP ships with number of default Grafana dashboards for LLAP • Show wealth of metrics for cache, compute and scheduling • Helps to identify bottlenecks and issues • Helps with capacity planning • Easy to extend or integrate with other systems
  24. 24. Page24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Watching queries – Tez UI integration
  25. 25. Page25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Future work • Fine-grained workload management • Reservations & priorities • Configurable guardrails • Maximum permissible scan range, runtime, cost • Integrated with CBO’s cost model • Admin tools • Identify hotspots & data layout issues • Easy reporting on health, performance and utilization
  26. 26. Page26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Summary • LLAP is a new execution substrate for fast, concurrent analytical workloads, harnessing Hive vectorized SQL engine and efficient in-memory caching layer • Provides secure relational view of the data through a simple API • Available in Hive 2, integrated with ecosystem (Ambari, Cloud, Grafana, jmx, Hive & Tez views)
  27. 27. Page27 © Hortonworks Inc. 2011 – 2017. All Rights Reserved Questions? ? Interested? Stop by the Hortonworks booth to learn more

×