Interactive query in hadoop

Hive 13 & Tez providing Human Interactive Query across petabytes of data.

Published in: Technology
  • http://hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
  • Interactive query in hadoop

    1. Interactive Query In Hadoop. Rommel Garcia, Solutions Engineer. May 3, 2014. Hortonworks. We do Hadoop. © Hortonworks Inc. 2014
    2. Hadoop 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
       • Redundant, Reliable Storage (HDFS)
       • Efficient Cluster Resource Management & Shared Services (YARN)
       • Standard Query Processing: Hive, Pig
       • Batch: MapReduce
       • Online Data Processing: HBase, Accumulo
       • Interactive: Tez
       • Real Time Stream Processing: Storm
       • others …
    3. The Interactive Query Tech Stack
       • Hive: SQL
       • Tez: DAG
       • YARN: Resource
       • HDFS: Storage
    4. Hive
    5. Hive: Open source project that
       • facilitates querying (SQL compliant)
       • projects structure onto data residing in distributed storage such as HDFS.
    6. Hive SQL Compliance
    7. Hive Performance (each entry: feature, description, benefit)
       • Tez Integration: Tez is a significantly better engine than MapReduce. Benefit: Latency
       • Vectorized Query: Takes advantage of modern hardware by processing thousand-row blocks rather than one row at a time. Benefit: Throughput
       • Query Planner: Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of the input (beyond partition pruning). Benefit: Latency
       • ORC File: Columnar, type-aware format with indices. Benefit: Latency
       • Cost Based Optimizer (Optiq): Join re-ordering and other optimizations based on column statistics, including histograms etc. Benefit: Latency
    8. Vectorization Using Modern CPU (diagram labels: CPU, 10K rows)
    9. Hive Optimizations
       • Pre-warmed Containers (Hive Query Server)
       • Low-latency Dispatch (Hive Query Server)
       • DAG utilization (Tez)
       • Buffer Caching (cache accessed data)
       • Predicate Pushdown
    10. Hive - ORCFile
    11. Tez
    12. Tez – Introduction
       • Distributed execution framework targeted towards data-processing applications.
       • Express computation as a dataflow graph.
       • Flexible Input-Processor-Output runtime model
       • Extensively use caching
       • Data type agnostic
       • Built on top of YARN
       • Apache licensed.
    13. Hive On Tez - Execution (each entry: feature, description, benefit)
       • Tez Session: Overcomes Map-Reduce job-launch latency by pre-launching the Tez AppMaster. Benefit: Latency
       • Tez Container Pre-Launch: Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. Benefit: Latency
       • Tez Container Re-Use: Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! Benefit: Latency
       • Runtime re-configuration of DAG: Runtime query tuning by picking aggregation parallelism using online query statistics. Benefit: Throughput
       • Tez In-Memory Cache: Hot data kept in RAM for fast access. Benefit: Latency
       • Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. Benefit: Throughput
    14. Comparing Tez vs. MR – running queries in Hive
        SELECT a.state, COUNT(*), AVG(c.price)
        FROM a
          JOIN b ON (a.id = b.id)
          JOIN c ON (a.itemId = c.itemId)
        GROUP BY a.state
       • To express the above query in MapReduce, Hive needs to compose and execute four separate MR jobs.
       • Each MR job pays for job start-up and for disk I/O, because results are written and re-read between MR jobs. This takes too long!
    15. Comparing Tez vs. MR – running queries in Hive (same query as slide 14)
       • Using the Tez framework, this query can be expressed as a single executing graph.
       • No wasted I/O. Each node in the graph streams results to the next node.
       • No wasted job start-up. Tez provides “hot containers” so that jobs can be submitted immediately.
    16. Tez – Deep Dive – API: Simple DAG definition API
        DAG dag = new DAG();
        Vertex map1 = new Vertex(MapProcessor.class);
        Vertex map2 = new Vertex(MapProcessor.class);
        Vertex reduce1 = new Vertex(ReduceProcessor.class);
        Vertex reduce2 = new Vertex(ReduceProcessor.class);
        Vertex join1 = new Vertex(JoinProcessor.class);
        …….
        Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
        Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
        …….
        dag.addVertex(map1).addVertex(map2)
           .addVertex(reduce1).addVertex(reduce2)
           .addVertex(join1)
           .addEdge(edge1).addEdge(edge2)
           .addEdge(edge3).addEdge(edge4);
       Diagram: vertices map1, map2, reduce1, reduce2, join1; edges labeled Scatter_Gather, Bipartite, Sequential.
    17. Demo: Hive 13 + Tez
    18. Multi-Tenancy with HiveServer2: Resource contention may exist when multiple users run very large queries simultaneously, which affects overall query latency. Apply these controls to resolve it.
       • Container re-use timeout
       • Tez split wave tuning
       • Round Robin Queuing setup
    19. Tez - Waves: tez.am.grouping.split-waves=3.0
       • Queue containers: C.1, C.2, C.3, C.4, C.5
       • 15 Tasks (shown: T.1, T.2, T.3, T.4, T.5)
    20. Thank You! Rommel Garcia, Hortonworks, @rommelgarcia
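
The Tez - Waves slide pairs five containers with tez.am.grouping.split-waves=3.0 and shows 15 tasks. One plausible reading is that the target task count is the number of available containers multiplied by the wave factor. The sketch below performs only that arithmetic. The class and method names are hypothetical, and this is not Tez's actual split-grouping code, which is assumed to take further inputs such as split sizes into account.

```java
class WaveEstimate {
    // Hypothetical helper: target task count = containers * waves, rounded down.
    static int estimateTasks(int availableContainers, double splitWaves) {
        if (availableContainers < 0 || splitWaves <= 0) {
            throw new IllegalArgumentException("containers must be >= 0 and waves must be > 0");
        }
        return (int) Math.floor(availableContainers * splitWaves);
    }

    public static void main(String[] args) {
        // Values from the slide: 5 containers, split-waves = 3.0.
        System.out.println(estimateTasks(5, 3.0)); // prints 15
    }
}
```

Under this reading, each container works through roughly three tasks in turn, so smaller wave factors produce fewer, larger tasks and larger factors produce more, smaller ones.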

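The vectorized-query entry on the Hive Performance slide contrasts processing thousand-row blocks with processing one row at a time. The following plain-Java sketch illustrates the difference in loop shape only. It does not use Hive's vectorization classes; the class name, the method names, and the block size of 1024 are assumptions made for illustration.

```java
class BatchSum {
    // Assumed block size; the slides only say "thousand-row blocks".
    static final int BATCH_SIZE = 1024;

    // Row-at-a-time style: one method call per row.
    static long addRow(long runningTotal, long priceCents) {
        return runningTotal + priceCents;
    }

    static long sumRowAtATime(long[] priceCents) {
        long total = 0;
        for (long p : priceCents) {
            total = addRow(total, p);
        }
        return total;
    }

    // Batch style: walk the column one block at a time with a tight inner loop
    // over a primitive array, so per-row call overhead is paid once per block.
    static long sumInBatches(long[] priceCents, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        long total = 0;
        for (int start = 0; start < priceCents.length; start += batchSize) {
            int end = Math.min(start + batchSize, priceCents.length);
            long batchTotal = 0;
            for (int i = start; i < end; i++) {
                batchTotal += priceCents[i];
            }
            total += batchTotal;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] column = new long[2500];
        for (int i = 0; i < column.length; i++) {
            column[i] = i;
        }
        // Both strategies must agree on the result.
        System.out.println(sumRowAtATime(column) == sumInBatches(column, BATCH_SIZE)); // prints true
    }
}
```

Prices are held as integer cents so that the two strategies produce identical sums regardless of the order of addition.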
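
The DAG on the Tez API slide wires map1 to reduce1, map2 to reduce2, and both reduces to join1. To show what expressing a query as a single graph implies for scheduling, here is a self-contained sketch that models the same five vertices in plain Java and derives an order in which they could run, using Kahn's topological sort. It does not use the Tez library; SimpleDag and its methods are hypothetical stand-ins for the DAG, Vertex, and Edge types named on the slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SimpleDag {
    // Vertex name -> names of its downstream vertices, in insertion order.
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    SimpleDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    SimpleDag addEdge(String from, String to) {
        addVertex(from);
        addVertex(to);
        downstream.get(from).add(to);
        return this;
    }

    // Kahn's algorithm: a vertex becomes runnable once all of its inputs have run.
    List<String> executionOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : downstream.keySet()) {
            pendingInputs.put(v, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String t : targets) {
                pendingInputs.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        pendingInputs.forEach((v, n) -> { if (n == 0) ready.add(v); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.poll();
            order.add(v);
            for (String t : downstream.get(v)) {
                if (pendingInputs.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    // The graph from the slide: two map/reduce branches feeding one join.
    static SimpleDag slideExample() {
        return new SimpleDag()
            .addVertex("map1").addVertex("map2")
            .addVertex("reduce1").addVertex("reduce2")
            .addVertex("join1")
            .addEdge("map1", "reduce1").addEdge("map2", "reduce2")
            .addEdge("reduce1", "join1").addEdge("reduce2", "join1");
    }

    public static void main(String[] args) {
        System.out.println(slideExample().executionOrder());
        // prints [map1, map2, reduce1, reduce2, join1]
    }
}
```

The two map/reduce branches have no edges between them, so a scheduler is free to run them concurrently; only join1 has to wait for both.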