Interactive query in hadoop
Hive 13 & Tez providing Human Interactive Query across petabytes of data.


Published in: Technology

  • Transcript

    • 1. Page1 © Hortonworks Inc. 2014 Interactive Query In Hadoop Rommel Garcia Solutions Engineer May 3, 2014 Hortonworks. We do Hadoop.
    • 2. Hadoop 2: Multi Use Data Platform (Batch, Interactive, Online, Streaming, …)
          • Redundant, Reliable Storage (HDFS)
          • Efficient Cluster Resource Management & Shared Services (YARN)
          • Standard Query Processing: Hive, Pig
          • Batch: MapReduce
          • Online Data Processing: HBase, Accumulo
          • Interactive: Tez
          • Real Time Stream Processing: Storm
          • others …
    • 3. The Interactive Query Tech Stack: Hive (SQL), Tez (DAG), YARN (Resource), HDFS (Storage)
    • 4. Hive
    • 5. Hive: an open source project that
          • facilitates querying (SQL compliant)
          • projects structure onto data residing in distributed storage such as HDFS
    • 6. Hive SQL Compliance
    • 7. Hive Performance (Feature: Description. Benefit)
          • Tez Integration: Tez is a significantly better engine than MapReduce. Benefit: Latency
          • Vectorized Query: Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. Benefit: Throughput
          • Query Planner: Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning). Benefit: Latency
          • ORC File: Columnar, type-aware format with indices. Benefit: Latency
          • Cost Based Optimizer (Optiq): Join re-ordering and other optimizations based on column statistics, including histograms etc. Benefit: Latency
    • 8. Vectorization Using Modern CPU (diagram labels: CPU, 10K rows)
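The difference between row-at-a-time and block-at-a-time processing can be illustrated with a minimal Java sketch. This is not Hive's vectorization code: the class name, the method names, and the batch size of 1024 are assumptions made for illustration, and the deck itself only says "thousand-row blocks". The sketch filters a price column and sums the surviving values, once per row and once per block.

```java
final class VectorizedSumSketch {
    // Assumed batch size for illustration; the deck only says "thousand-row blocks".
    static final int BATCH_SIZE = 1024;

    /** Row-at-a-time: the filter is evaluated through a method call for every row. */
    static long sumRowAtATime(long[] prices, long minPrice) {
        long total = 0;
        for (long price : prices) {
            if (passesFilter(price, minPrice)) {
                total += price;
            }
        }
        return total;
    }

    static boolean passesFilter(long price, long minPrice) {
        return price >= minPrice;
    }

    /** Block-at-a-time: each step runs a tight loop over one block of a primitive column. */
    static long sumVectorized(long[] prices, long minPrice) {
        int[] selected = new int[BATCH_SIZE]; // indices of rows that pass the filter
        long total = 0;
        for (int start = 0; start < prices.length; start += BATCH_SIZE) {
            int rowsInBatch = Math.min(BATCH_SIZE, prices.length - start);

            // Filter step: record which rows of this block survive.
            int selectedCount = 0;
            for (int i = 0; i < rowsInBatch; i++) {
                if (prices[start + i] >= minPrice) {
                    selected[selectedCount++] = start + i;
                }
            }

            // Aggregate step: sum only the selected rows of this block.
            for (int j = 0; j < selectedCount; j++) {
                total += prices[selected[j]];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[] prices = new long[2500];
        for (int i = 0; i < prices.length; i++) {
            prices[i] = i;
        }
        // Both strategies must agree on the result.
        System.out.println(sumRowAtATime(prices, 1000));
        System.out.println(sumVectorized(prices, 1000));
    }
}
```

Both methods return the same sum; the block version simply arranges the work as short loops over primitive arrays, which is the shape of code that modern CPUs tend to execute efficiently.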
    • 9. Hive Optimizations
          • Pre-warmed Containers (Hive Query Server)
          • Low-latency Dispatch (Hive Query Server)
          • DAG utilization (Tez)
          • Buffer Caching (cache accessed data)
          • Predicate Pushdown
    • 10. Hive - ORCFile
    • 11. Tez
    • 12. Tez – Introduction
          • Distributed execution framework targeted towards data-processing applications.
          • Expresses computation as a dataflow graph.
          • Flexible Input-Processor-Output runtime model.
          • Uses caching extensively.
          • Data type agnostic.
          • Built on top of YARN.
          • Apache licensed.
    • 13. Hive On Tez - Execution (Feature: Description. Benefit)
          • Tez Session: Overcomes Map-Reduce job-launch latency by pre-launching the Tez AppMaster. Benefit: Latency
          • Tez Container Pre-Launch: Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. Benefit: Latency
          • Tez Container Re-Use: Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! Benefit: Latency
          • Runtime re-configuration of DAG: Runtime query tuning by picking aggregation parallelism using online query statistics. Benefit: Throughput
          • Tez In-Memory Cache: Hot data kept in RAM for fast access. Benefit: Latency
          • Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. Benefit: Throughput
    • 14. Comparing Tez vs. MR – running queries in Hive
          SELECT a.state, COUNT(*), AVG(c.price)
          FROM a
          JOIN b ON (… = …)
          JOIN c ON (a.itemId = c.itemId)
          GROUP BY a.state
          • To express the above query in MapReduce, Hive needs to compose and execute four separate MR jobs.
          • Each MR job incurs job start-up cost and disk I/O, because results are written and re-read between MR jobs. This takes too long!
    • 15. Comparing Tez vs. MR – running queries in Hive
          SELECT a.state, COUNT(*), AVG(c.price)
          FROM a
          JOIN b ON (… = …)
          JOIN c ON (a.itemId = c.itemId)
          GROUP BY a.state
          • Using the Tez framework, this query can be expressed as a single executing graph.
          • No wasted I/O. Each node in the graph streams results to the next node.
          • No wasted job start-up. Tez provides “hot containers” so jobs can be submitted immediately.
    • 16. Tez – Deep Dive – API (simple DAG definition API)
          DAG dag = new DAG();
          Vertex map1 = new Vertex(MapProcessor.class);
          Vertex map2 = new Vertex(MapProcessor.class);
          Vertex reduce1 = new Vertex(ReduceProcessor.class);
          Vertex reduce2 = new Vertex(ReduceProcessor.class);
          Vertex join1 = new Vertex(JoinProcessor.class);
          …….
          Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
          Edge edge2 = new Edge(map2, reduce2, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
          Edge edge3 = new Edge(reduce1, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
          Edge edge4 = new Edge(reduce2, join1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
          …….
          dag.addVertex(map1).addVertex(map2)
             .addVertex(reduce1).addVertex(reduce2)
             .addVertex(join1)
             .addEdge(edge1).addEdge(edge2)
             .addEdge(edge3).addEdge(edge4);
          Diagram: map1 → reduce1 → join1 and map2 → reduce2 → join1, with edges labeled Scatter_Gather, Bipartite, Sequential.
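To make the dataflow-graph idea concrete, here is a small sketch in plain Java that builds the same five-vertex graph and derives an order in which the vertices could start. It does not use the Tez library: `DagOrderSketch` and all of its methods are hypothetical names, and the ordering uses Kahn's topological sort as a stand-in for whatever scheduling Tez actually performs. A real engine would also run independent vertices concurrently and stream data along the edges.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DagOrderSketch {
    // vertex name -> names of vertices that consume its output
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    // vertex name -> number of incoming edges
    private final Map<String, Integer> inputCount = new LinkedHashMap<>();

    DagOrderSketch addVertex(String name) {
        downstream.put(name, new ArrayList<>());
        inputCount.put(name, 0);
        return this;
    }

    DagOrderSketch addEdge(String from, String to) {
        if (!downstream.containsKey(from) || !downstream.containsKey(to)) {
            throw new IllegalArgumentException("unknown vertex in edge " + from + " -> " + to);
        }
        downstream.get(from).add(to);
        inputCount.merge(to, 1, Integer::sum);
        return this;
    }

    /** Kahn's algorithm: a vertex becomes ready once all of its inputs have been emitted. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inputCount);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : remaining.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            order.add(vertex);
            for (String next : downstream.get(vertex)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle");
        }
        return order;
    }

    /** The five-vertex graph from the slide, with the same vertex names. */
    static DagOrderSketch slideExample() {
        return new DagOrderSketch()
                .addVertex("map1").addVertex("map2")
                .addVertex("reduce1").addVertex("reduce2")
                .addVertex("join1")
                .addEdge("map1", "reduce1").addEdge("map2", "reduce2")
                .addEdge("reduce1", "join1").addEdge("reduce2", "join1");
    }

    public static void main(String[] args) {
        System.out.println(slideExample().executionOrder());
    }
}
```

The two map vertices have no inputs and are ready first, each reduce becomes ready after its map, and `join1` comes last because it waits on both reduces.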
    • 17. Demo: Hive 13 + Tez
    • 18. Multi-Tenancy with HiveServer2: Resource contention may exist when multiple users run very large queries simultaneously, which affects overall query latency. Apply these controls to mitigate it:
          • Container re-use timeout
          • Tez split wave tuning
          • Round Robin Queuing setup
    • 19. Tez - Waves (diagram labels: queue; containers C.1 to C.5; TEZ; 15 tasks; T.1 to T.5)
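One way to read the waves diagram is as simple arithmetic: with 15 tasks and 5 containers, the tasks run in successive rounds, or waves. The sketch below assumes that each container runs one task at a time and that all tasks take about the same time. Neither assumption comes from the deck, and `TezWaveSketch` and its methods are hypothetical names, not part of Tez.

```java
final class TezWaveSketch {
    /** Number of waves needed to run all tasks when each container runs one task at a time. */
    static int waves(int tasks, int containers) {
        if (tasks < 0 || containers <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and containers must be > 0");
        }
        return (tasks + containers - 1) / containers; // ceiling division
    }

    /** 1-based wave in which a task runs, given its 0-based index and in-order assignment. */
    static int waveOfTask(int taskIndex, int containers) {
        if (taskIndex < 0 || containers <= 0) {
            throw new IllegalArgumentException("taskIndex must be >= 0 and containers must be > 0");
        }
        return taskIndex / containers + 1;
    }

    public static void main(String[] args) {
        // The numbers from the slide: 15 tasks on 5 containers.
        System.out.println(waves(15, 5)); // prints 3
    }
}
```

Under these assumptions, adding a sixteenth task would force a fourth wave that keeps only one container busy, which is the kind of imbalance that split wave tuning is meant to reduce.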
    • 20. Thank You! Rommel Garcia, Hortonworks, @rommelgarcia