Successfully reported this slideshow.

More Related Content

More from DataWorks Summit

Related Books

Free with a 14 day trial from Scribd

See all

Related Audiobooks

Free with a 14 day trial from Scribd

See all

Apache Tez - A New Chapter in Hadoop Data Processing

  1. 1. © Hortonworks Inc. 2013 Page 1 Apache Tez : Accelerating Hadoop Query Processing Bikas Saha@bikassaha Hitesh Shah @hitesh1892
  2. 2. © Hortonworks Inc. 2013 Tez – Introduction Page 2 • Distributed execution framework targeted towards data-processing applications. • Based on expressing a computation as a dataflow graph. • Highly customizable to meet a broad spectrum of use cases. • Built on top of YARN – the resource management framework for Hadoop. • Open source Apache incubator project and Apache licensed.
  3. 3. © Hortonworks Inc. 2012© Hortonworks Inc. 2013. Confidential and Proprietary. Hadoop 1 -> Hadoop 2 HADOOP 1.0 HDFS (redundant, reliable storage) MapReduce (cluster resource management & data processing) Pig (data flow) Hive (sql) Others (cascading) HDFS2 (redundant, reliable storage) YARN (cluster resource management) Tez (execution engine) HADOOP 2.0 Data Flow Pig SQL Hive Others (Cascading) Batch MapReduce Real Time Stream Processing Storm Online Data Processing HBase, Accumulo Monolithic • Resource Management • Execution Engine • User API Layered • Resource Management – YARN • Execution Engine – Tez • User API – Hive, Pig, Cascading, Your App!
  4. 4. © Hortonworks Inc. 2013 Tez – Empowering Applications Page 4 • Tez solves hard problems of running on a distributed Hadoop environment • Apps can focus on solving their domain specific problems • This design is important to be a platform for a variety of applications App Tez • Custom application logic • Custom data format • Custom data transfer technology • Distributed parallel execution • Negotiating resources from the Hadoop framework • Fault tolerance and recovery • Horizontal scalability • Resource elasticity • Shared library of ready-to-use components • Built-in performance optimizations • Security
  5. 5. © Hortonworks Inc. 2013 Tez – End User Benefits • Better performance of applications • Built-in performance + Application define optimizations • Better predictability of results • Minimization of overheads and queuing delays • Better utilization of compute capacity • Efficient use of allocated resources • Reduced load on distributed filesystem (HDFS) • Reduce unnecessary replicated writes • Reduced network usage • Better locality and data transfer using new data patterns • Higher application developer productivity • Focus on application business logic rather than Hadoop internals Page 5
  6. 6. © Hortonworks Inc. 2013 Tez – Design considerations Don’t solve problems that have already been solved. Or else you will have to solve them again! • Leverage discrete task based compute model for elasticity, scalability and fault tolerance • Leverage several man years of work in Hadoop Map-Reduce data shuffling operations • Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN • Leverage built-in security mechanisms in Hadoop for privacy and isolation Page 6 Look to the Future with an eye on the Past
  7. 7. © Hortonworks Inc. 2013 Tez – Problems that it addresses • Expressing the computation • Direct and elegant representation of the data processing flow • Interfacing with application code and new technologies • Performance • Late Binding : Make decisions as late as possible using real data from at runtime • Leverage the resources of the cluster efficiently • Just work out of the box! • Customizable engine to let applications tailor the job to meet their specific requirements • Operation simplicity • Painless to operate, experiment and upgrade Page 7
  8. 8. © Hortonworks Inc. 2013 Tez – Simplifying Operations • No deployments to do. No side effects. Easy and safe to try it out! • Tez is a completely client side application. • Simply upload to any accessible FileSystem and change local Tez configuration to point to that. • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production. • Leverages YARN local resources. Page 8 Client Machine Node Manager TezTask Node Manager TezTaskTezClient HDFS Tez Lib 1 Tez Lib 2 Client Machine TezClient
  9. 9. © Hortonworks Inc. 2013 Tez – Expressing the computation Page 9 Aggregate Stage Partition Stage Preprocessor Stage Sampler Task-1 Task-2 Task-1 Task-2 Task-1 Task-2 Samples Ranges Distributed Sort Distributed data processing jobs typically look like DAGs (Directed Acyclic Graph). • Vertices in the graph represent data transformations • Edges represent data movement from producers to consumers
  10. 10. © Hortonworks Inc. 2013 Tez – Expressing the computation Page 10 Tez provides the following APIs to define the processing • DAG API • Defines the structure of the data processing and the relationship between producers and consumers • Enable definition of complex data flow pipelines using simple graph connection API’s. Tez expands the logical DAG at runtime • This is how all the tasks in the job get specified • Runtime API • Defines the interfaces using which the framework and app code interact with each other • App code transforms data and moves it between tasks • This is how we specify what actually executes in each task on the cluster nodes
  11. 11. © Hortonworks Inc. 2013 Tez – DAG API // Define DAG DAG dag = new DAG(); // Define Vertex Vertex Map1 = new Vertex(Processor.class); // Define Edge Edge edge = Edge(Map1, Reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class); // Connect them dag.addVertex(Map1).addEdge(edge)… Page 11 Defines the global processing flow Map1 Map2 Reduce1 Reduce2 Join Scatter Gather Scatter Gather
  12. 12. © Hortonworks Inc. 2013 Tez – DAG API Page 12 • Data movement – Defines routing of data between tasks – One-To-One : Data from the ith producer task routes to the ith consumer task. – Broadcast : Data from a producer task routes to all consumer tasks. – Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task. • Scheduling – Defines when a consumer task is scheduled – Sequential : Consumer task may be scheduled after a producer task completes. – Concurrent : Consumer task must be co-scheduled with a producer task. • Data source – Defines the lifetime/reliability of a task output – Persisted : Output will be available after the task exits. Output may be lost later on. – Persisted-Reliable : Output is reliably stored and will always be available – Ephemeral : Output is available only while the producer task is running Edge properties define the connection between producer and consumer tasks in the DAG
  13. 13. © Hortonworks Inc. 2013 Tez – Logical DAG expansion at Runtime Page 13 Reduce1 Map2 Reduce2 Join Map1
  14. 14. © Hortonworks Inc. 2013 Tez – Runtime API Flexible Inputs-Processor-Outputs Model • Thin API layer to wrap around arbitrary application code • Compose inputs, processor and outputs to execute arbitrary processing • Event routing based control plane architecture • Applications decide logical data format and data transfer technology • Customize for performance • Built-in implementations for Hadoop 2.0 data services – HDFS and YARN ShuffleService. Built on the same API. Your impls are as first class as ours! Page 14
  15. 15. © Hortonworks Inc. 2013 Tez – Library of Inputs and Outputs Page 15 Classical ‘Map’ Classical ‘Reduce’ Intermediate ‘Reduce’ for Map-Reduce-Reduce Map Processor HDFS Input Sorted Output Reduce Processor Shuffle Input HDFS Output Reduce Processor Shuffle Input Sorted Output • What is built in? – Hadoop InputFormat/OutputFormat – SortedGroupedPartitioned Key-Value Input/Output – UnsortedGroupedPartitioned Key-Value Input/Output – Key-Value Input/Output
  16. 16. © Hortonworks Inc. 2013 Tez – Performance • Benefits of expressing the data processing as a DAG • Reducing overheads and queuing effects • Gives system the global picture for better planning • Efficient use of resources • Re-use resources to maximize utilization • Pre-launch, pre-warm and cache • Locality & resource aware scheduling • Support for application defined DAG modifications at runtime for optimized execution • Change task concurrency • Change task scheduling • Change DAG edges • Change DAG vertices (TBD) Page 16
  17. 17. © Hortonworks Inc. 2013 Tez – Benefits of DAG execution Faster Execution and Higher Predictability • Eliminate replicated write barrier between successive computations. • Eliminate job launch overhead of workflow jobs. • Eliminate extra stage of map reads in every workflow job. • Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes. • Better locality because the engine has the global picture Page 17 Pig/Hive - MR Pig/Hive - Tez
  18. 18. © Hortonworks Inc. 2013 Tez – Container Re-Use • Reuse YARN containers/JVMs to launch new tasks • Reduce scheduling and launching delays • Shared in-memory data across tasks • JVM JIT friendly execution Page 18 YARN Container / JVM TezTask Host TezTask1 TezTask2 SharedObjects YARN Container Tez Application Master Start Task Task Done Start Task
  19. 19. © Hortonworks Inc. 2013 Tez – Sessions Page 19 Application Master Client Start Session Submit DAG Task Scheduler ContainerPool Shared Object Registry Pre Warmed JVM Sessions • Standard concepts of pre-launch and pre-warm applied • Key for interactive queries • Represents a connection between the user and the cluster • Multiple DAGs executed in the same session • Containers re-used across queries • Takes care of data locality and releasing resources when idle
  20. 20. © Hortonworks Inc. 2013 Tez – Customizable Core Engine Page 20 Vertex-2 Vertex-1 Start vertex Vertex Manager Start tasks DAG Scheduler Get Priority Get Priority Start vertex Task Scheduler Get container Get container • Vertex Manager • Determines task parallelism • Determines when tasks in a vertex can start. • DAG Scheduler Determines priority of task • Task Scheduler Allocates containers from YARN and assigns them to tasks
  21. 21. © Hortonworks Inc. 2013 Tez – Event Based Control Plane Page 21 Reduce Task 2 Input1 Input2 Map Task 2 Output1 Output2 Output3 Map Task 1 Output1 Output2 Output3 AM Router Scatter-Gather Edge • Events used to communicate between the tasks and between task and framework • Data Movement Event used by producer task to inform the consumer task about data location, size etc. • Input Error event sent by task to the engine to inform about errors in reading input. The engine then takes action by re-generating the input • Other events to send task completion notification, data statistics and other control plane information Data Event Error Event
  22. 22. © Hortonworks Inc. 2013 Tez – Automatic Reduce Parallelism Page 22 Map Vertex Reduce Vertex App Master Vertex Manager Data Size Statistics Vertex State Machine Set Parallelism Cancel Task Re-Route Event Model Map tasks send data statistics events to the Reduce Vertex Manager. Vertex Manager Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism
  23. 23. © Hortonworks Inc. 2013 Tez – Theory to Practice • Performance • Scalability Page 23
  24. 24. © Hortonworks Inc. 2013 Tez – Hive TPC-DS Scale 200GB latency
  25. 25. Tez – Pig performance gains • Demonstrates performance gains from a basic translation to a Tez DAG 0 20 40 60 80 100 120 140 160 Prod script 1 25m vs 10m 5 MR Jobs Prod script 2 34m vs 16m 5 MR Jobs Prod script 3 1h 46m vs 48m 12 MR Jobs Prod script 4 2h 22m vs 1h 21m 15 MR jobs Timeinmins MR Tez
  26. 26. © Hortonworks Inc. 2013 Tez – Observations on Performance • Number of stages in the DAG • Higher the number of stages in the DAG, performance of Tez (over MR) will be better. • Cluster/queue capacity • More congested a queue is, the performance of Tez (over MR) will be better due to container reuse. • Size of intermediate output • More the size of intermediate output, the performance of Tez (over MR) will be better due to reduced HDFS usage. • Size of data in the job • For smaller data and more stages, the performance of Tez (over MR) will be better as percentage of launch overhead in the total time is high for smaller jobs. • Offload work to the cluster • Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster. E.g. MR split calculation. • Vertex caching • The more re-computation can be avoided the better is the performance. Page 26
  27. 27. © Hortonworks Inc. 2013 Tez – Data at scale Page 27 Hive TPC-DS Scale 10TB
  28. 28. © Hortonworks Inc. 2013 Tez – DAG definition at scale Page 28 Hive : TPC-DS Query 88 Logical DAG ( 39 MR jobs with Hive-10)
  29. 29. © Hortonworks Inc. 2013 Tez – Container Reuse at Scale • 78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4) Page 29
  30. 30. © Hortonworks Inc. 2013 Tez – Real World Use Cases for the API Page 30
  31. 31. © Hortonworks Inc. 2013 Tez – Broadcast Edge SELECT ss.ss_item_sk, ss.ss_quantity, avg_price, inv.inv_quantity_on_hand FROM (select avg(ss_sold_price) as avg_price, ss_item_sk, ss_quantity_sk from store_sales group by ss_item_sk) ss JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk); Hive – MR Hive – Tez M M M M M HDFS Store Sales scan. Group by and aggregation reduce size of this input. Inventory scan and Join Broadcast edge M M M HDFS Store Sales scan. Group by and aggregation. Inventory and Store Sales (aggr.) output scan and shuffle join. R R R R RR M MMM HDFS Hive : Broadcast Join
  32. 32. © Hortonworks Inc. 2013 Tez – Multiple Outputs Page 32 Pig : Split & Group-by f = LOAD ‘foo’ AS (x, y, z); g1 = GROUP f BY y; g2 = GROUP f BY z; j = JOIN g1 BY group, g2 BY group; Group by y Group by z Load foo Join Load g1 and Load g2 Group by y Group by z Load foo Join Multiple outputs Reduce follows reduce HDFS HDFS Split multiplex de-multiplex Pig – MR Pig – Tez
  33. 33. © Hortonworks Inc. 2013 Tez – One to One Edge Page 33 Aggregate Sample L Join Stage sample map on distributed cache l = LOAD ‘left’ AS (x, y); r = LOAD ‘right’ AS (x, z); j = JOIN l BY x, r BY x USING ‘skewed’; Load & Sample Aggregate Partition L Join Pass through input via 1-1 edge Partition R HDFS Broadcast sample map Partition L and Partition R Pig – MR Pig – Tez Pig : Skewed Join
  34. 34. © Hortonworks Inc. 2013 Tez – Current status • Apache Incubator Project –Rapid development. Over 1100 jiras opened. Over 800 resolved –Growing community of contributors and users –Latest release is 0.4 • Support for a vast topology of DAGs • Being used by multiple applications such as Apache Hive, Apache Pig, Cascading. Page 34
  35. 35. © Hortonworks Inc. 2013 Tez – Adoption Path •Pre-requisite : Hadoop 2 with YARN •Simple client-side install (no admin support needed) –No side effects or traces left behind on your cluster. Low risk and low effort to try out. •Apache Hive – Already available in 0.13 •Apache Pig – Available on trunk ( slated for v0.14 ) •Cascading – version 3.0 ( coming soon ) •Run your MapReduce jobs using Tez runtime • Change “” to “yarn-tez” •ETL: Replace MR or custom pipelines with native Tez • Page 35
  36. 36. © Hortonworks Inc. 2013 Tez – Roadmap • Richer DAG support – Addition of vertices at runtime – Shared edges for shared outputs – Enhance Input/Output library • Performance optimizations – Improve support for high concurrency – Improve locality aware scheduling. – Add framework level data statistics – HDFS memory storage integration • Usability –Tez UI (coming soon) –Stability and testability – API ease of use – Tools for performance analysis and debugging Page 36
  37. 37. © Hortonworks Inc. 2013 Tez – Community • Early adopters and code contributors welcome – Adopters to drive more scenarios. Contributors to make them happen. • Tez meetup for developers and users – • Technical blog series – processing • Useful links – Work tracking: – Code: – Developer list: User list: Issues list: Page 37
  38. 38. © Hortonworks Inc. 2013 Tez – Takeaways • Distributed execution framework that works on computations represented as dataflow graphs • Naturally maps to execution plans produced by query optimizers • Customizable execution architecture designed to enable dynamic performance optimizations at runtime • Works out of the box with the platform figuring out the hard stuff • Span the spectrum of interactive latency to batch, small to large • Open source Apache project – your use-cases and code are welcome • It works and is already being used by Hive and Pig Page 38
  39. 39. © Hortonworks Inc. 2013 Tez Thanks for your time and attention! Video with Deep Dive on Tez Questions? @bikassaha, @hitesh1892 Page 39

Editor's Notes

  • For anyone who has been working on MapReduce, there is this age-old problem around “how do I figure out the correct number of reducers?”. We guess some number at compile-time and usually that turns out to be incorrect at run-time. Let’s see how we can use the Tez model to fix that. So here is this Map Vertex and this Reduce Vertex, which have these tasks running and you have the Vertex Manager running inside the framework …

    [CLICK] The Map Tasks can send Data Size Statistics to the Vertex Manager, which can then extrapolate those statistics to figure out “what would be the final size of the data when all of these Maps finish?”. Based on that, it can realize that the data size is actually smaller than expected, and I can actually run two reduce tasks instead of three.

    [CLICK] The Vertex Manager sends a Set Paralellism command to the framework which changes the routing information in-between these two tasks and also cancels the last task.
  • query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
  • 1.5x to 3x speedup on some of the Pigmix queries.
  • Hive has written it’s own processor
  • ×