YARN Ready: Integrating to YARN with Tez


The YARN Ready webinar series helps developers integrate their applications with YARN. Tez is one vehicle for doing that. We take a deep dive, including a code review, to help you get started.

Published in: Technology
  • 1.5x to 3x speedup on some of the Pigmix queries.
  • Register for Office Hours:

    1. Apache Tez: Accelerating Hadoop Data Processing
       Bikas Saha (@bikassaha). © Hortonworks Inc. 2013
    2. Tez – Introduction
       • Distributed execution framework targeted towards data-processing applications.
       • Based on expressing a computation as a dataflow graph.
       • Highly customizable to meet a broad spectrum of use cases.
       • Built on top of YARN – the resource management framework for Hadoop.
       • Open source Apache project and Apache licensed.
    3. Tez – Hadoop 1 ---> Hadoop 2
       Monolithic (Hadoop 1)
       • Resource Management – MR
       • Execution Engine – MR
       Layered (Hadoop 2)
       • Resource Management – YARN
       • Engines – Hive, Pig, Cascading, Your App!
    4. Tez – Empowering Applications
       • Tez solves hard problems of running on a distributed Hadoop environment
       • Apps can focus on solving their domain specific problems
       • This design is important to be a platform for a variety of applications
       App
       • Custom application logic
       • Custom data format
       • Custom data transfer technology
       Tez
       • Distributed parallel execution
       • Negotiating resources from the Hadoop framework
       • Fault tolerance and recovery
       • Horizontal scalability
       • Resource elasticity
       • Shared library of ready-to-use components
       • Built-in performance optimizations
       • Security
    5. Tez – End User Benefits
       • Better performance of applications
         – Built-in performance plus application-defined optimizations
       • Better predictability of results
         – Minimization of overheads and queuing delays
       • Better utilization of compute capacity
         – Efficient use of allocated resources
       • Reduced load on distributed filesystem (HDFS)
         – Reduce unnecessary replicated writes
       • Reduced network usage
         – Better locality and data transfer using new data patterns
       • Higher application developer productivity
         – Focus on application business logic rather than Hadoop internals
    6. Tez – Design considerations
       Look to the Future with an eye on the Past.
       Don’t solve problems that have already been solved. Or else you will have to solve them again!
       • Leverage discrete task based compute model for elasticity, scalability and fault tolerance
       • Leverage several man years of work in Hadoop Map-Reduce data shuffling operations
       • Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN
       • Leverage built-in security mechanisms in Hadoop for privacy and isolation
    7. Tez – Problems that it addresses
       • Expressing the computation
         – Direct and elegant representation of the data processing flow
         – Interfacing with application code and new technologies
       • Performance
         – Late binding: make decisions as late as possible, using real data at runtime
         – Leverage the resources of the cluster efficiently
         – Just work out of the box!
         – Customizable engine to let applications tailor the job to meet their specific requirements
       • Operational simplicity
         – Painless to operate, experiment and upgrade
    8. Tez – Simplifying Operations
       • No deployments to do. No side effects. Easy and safe to try it out!
       • Tez is a completely client side application.
       • Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
       • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
       • Leverages YARN local resources.
       [Diagram: TezClient on each client machine; TezTasks running on Node Managers; Tez Lib 1 and Tez Lib 2 stored on HDFS]
    9. Tez – Expressing the computation
       Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
       • Vertices in the graph represent data transformations
       • Edges represent data movement from producers to consumers
       [Diagram: Distributed Sort. Preprocessor Stage, Partition Stage and Aggregate Stage, each with Task-1 and Task-2, plus a Sampler; data labels: Samples, Ranges]
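The vertex-and-edge view above can be sketched in a few lines of plain Java. This is a minimal illustration and not the Tez API: the class `DagSketch` and its methods are hypothetical, and the edges between the Distributed Sort stages are an assumption read off the slide's diagram labels (Samples, Ranges). It records producer-to-consumer edges and computes one valid order in which the stages could start.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, illustrative DAG model; not a Tez class.
final class DagSketch {
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    DagSketch addVertex(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    // An edge represents data moving from a producer vertex to a consumer vertex.
    DagSketch addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        consumers.get(producer).add(consumer);
        inDegree.merge(consumer, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex becomes ready once all of its producers are ordered.
    List<String> topologicalOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : remaining.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.poll();
            order.add(vertex);
            for (String consumer : consumers.get(vertex)) {
                if (remaining.merge(consumer, -1, Integer::sum) == 0) ready.add(consumer);
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("graph has a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        // Assumed edges for the Distributed Sort example.
        DagSketch dag = new DagSketch()
            .addEdge("Preprocessor", "Sampler")   // samples
            .addEdge("Sampler", "Partition")      // ranges
            .addEdge("Preprocessor", "Partition")
            .addEdge("Partition", "Aggregate");
        System.out.println(dag.topologicalOrder());
    }
}
```

Keeping the graph acyclic is what lets a framework decide, for every vertex, when all of its inputs are available.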
    10. Tez – Expressing the computation
        Tez provides the following APIs to define the processing
        • DAG API
          – Defines the structure of the data processing and the relationship between producers and consumers
          – Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
          – This is how the connection of tasks in the job gets specified
        • Runtime API
          – Defines the interfaces through which the framework and app code interact with each other
          – App code transforms data and moves it between tasks
          – This is how we specify what actually executes in each task on the cluster nodes
    11. Tez – DAG API
        Defines the global processing flow.

            // Define DAG
            DAG dag = DAG.create();
            // Define Vertex
            Vertex Map1 = Vertex.create(Processor.class);
            // Define Edge
            Edge edge = Edge.create(Map1, Reduce1, SCATTER_GATHER, PERSISTED,
                                    SEQUENTIAL, Output.class, Input.class);
            // Connect them
            dag.addVertex(Map1).addEdge(edge)…

        [Diagram: vertices Map1, Map2, Reduce1, Reduce2 and Join, connected by Scatter-Gather edges]
    12. Tez – DAG API
        Edge properties define the connection between producer and consumer tasks in the DAG.
        Data movement – Defines routing of data between tasks
        • One-To-One: Data from the ith producer task routes to the ith consumer task.
        • Broadcast: Data from a producer task routes to all consumer tasks.
        • Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.
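The three routing rules can be made concrete with a small, self-contained Java sketch. It is not Tez code: `EdgeRouting`, `Movement` and `consumersFor` are hypothetical names used only to restate the slide's definitions. Given a producer task index and a shard index, the method returns the consumer task indices that receive the data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the three data movement types; not the Tez API.
final class EdgeRouting {
    enum Movement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    /**
     * Returns the consumer task indices that receive one shard of a producer's output.
     * The shard index only matters for SCATTER_GATHER.
     */
    static List<Integer> consumersFor(Movement type, int producer, int shard, int numConsumers) {
        List<Integer> targets = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // The ith producer feeds the ith consumer.
                targets.add(producer);
                break;
            case BROADCAST:
                // Every consumer receives the producer's data.
                for (int c = 0; c < numConsumers; c++) {
                    targets.add(c);
                }
                break;
            case SCATTER_GATHER:
                // The ith shard of every producer goes to the ith consumer.
                targets.add(shard);
                break;
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(consumersFor(Movement.ONE_TO_ONE, 2, 0, 4));
        System.out.println(consumersFor(Movement.BROADCAST, 2, 0, 4));
        System.out.println(consumersFor(Movement.SCATTER_GATHER, 2, 3, 4));
    }
}
```

Note that in the scatter-gather case the target depends only on the shard, not on the producer, which is why each consumer ends up gathering one shard from every producer.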
    13. Tez – Logical DAG expansion at Runtime
        [Diagram: logical DAG with vertices Map1, Map2, Reduce1, Reduce2 and Join]
    14. Tez – Runtime API
        Flexible Inputs-Processor-Outputs Model
        • Thin API layer to wrap around arbitrary application code
        • Compose inputs, processor and outputs to execute arbitrary processing
        • Applications decide logical data format and data transfer technology
        • Customize for performance
        • Built-in implementations for Hadoop 2.0 data services – HDFS and YARN ShuffleService. Built on the same API. Your implementations are as first class as ours!
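The Inputs-Processor-Outputs idea can be illustrated with a short Java sketch. The interfaces below are hypothetical stand-ins written for this example, not the actual Tez Runtime API; they only show how a task is composed from an input, a processor and an output that can each be swapped independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical interfaces for illustration; the real Tez runtime interfaces differ.
final class IpoSketch {
    interface Input { Iterable<String> read(); }
    interface Output { void write(String record); }
    interface Processor { void run(Input in, Output out); }

    // An input backed by an in-memory list (stands in for HDFS, shuffle, a database, ...).
    static final class ListInput implements Input {
        private final List<String> records;
        ListInput(List<String> records) { this.records = records; }
        public Iterable<String> read() { return records; }
    }

    // An output that collects records in memory (stands in for any sink).
    static final class ListOutput implements Output {
        final List<String> records = new ArrayList<>();
        public void write(String record) { records.add(record); }
    }

    // Application logic: it only sees the Input and Output abstractions.
    static final class UpperCaseProcessor implements Processor {
        public void run(Input in, Output out) {
            for (String record : in.read()) {
                out.write(record.toUpperCase(Locale.ROOT));
            }
        }
    }

    // A task is just a composition of one input, one processor and one output.
    static void runTask(Input in, Processor processor, Output out) {
        processor.run(in, out);
    }

    public static void main(String[] args) {
        ListOutput out = new ListOutput();
        runTask(new ListInput(Arrays.asList("tez", "yarn")), new UpperCaseProcessor(), out);
        System.out.println(out.records);
    }
}
```

Because the processor never names a concrete source or sink, replacing `ListInput` with a different implementation would not require touching the application logic, which is the point the slide makes about custom implementations being first class.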
    15. Tez – Library of Inputs and Outputs
        • Classical ‘Map’: HDFS Input, Map Processor, Sorted Output
        • Classical ‘Reduce’: Shuffle Input, Reduce Processor, HDFS Output
        • Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output
        What is built in?
        – Hadoop InputFormat/OutputFormat
        – SortedGroupedPartitioned Key-Value Input/Output
        – UnsortedGroupedPartitioned Key-Value Input/Output
        – Key-Value Input/Output
    16. Tez – Composable Task Model
        [Diagram: three task compositions labelled Adopt, Evolve and Optimize. Inputs shown: HDFS Input, Remote File Server Input, Native DB Input, RDMA Input. Processors shown: Hive Processor, Your Processor. Outputs shown: HDFS Output, Local Disk Output, Kafka Pub-Sub Output, Amazon S3 Output]
    17. Tez – Performance
        • Benefits of expressing the data processing as a DAG
          – Reducing overheads and queuing effects
          – Gives system the global picture for better planning
        • Efficient use of resources
          – Re-use resources to maximize utilization
          – Pre-launch, pre-warm and cache
          – Locality & resource aware scheduling
        • Support for application defined DAG modifications at runtime for optimized execution
          – Change task concurrency
          – Change task scheduling
          – Change DAG edges
          – Change DAG vertices (TBD)
    18. Tez – Benefits of DAG execution
        Faster Execution and Higher Predictability
        • Eliminate replicated write barrier between successive computations.
        • Eliminate job launch overhead of workflow jobs.
        • Eliminate extra stage of map reads in every workflow job.
        • Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
        • Better locality because the engine has the global picture
        [Diagram: Pig/Hive on MR compared with Pig/Hive on Tez]
    19. Tez – Container Re-Use
        • Reuse YARN containers/JVMs to launch new tasks
        • Reduce scheduling and launching delays
        • Shared in-memory data across tasks
        • JVM JIT friendly execution
        [Diagram: the Tez Application Master (in its own YARN container) sends Start Task and receives Task Done from a TezTask Host running in a YARN container/JVM; TezTask1 and TezTask2 run in that host and use Shared Objects]
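A rough way to see why re-use reduces launching delays is to count container launches. The Java sketch below is a simplified model, not Tez scheduling code: `ContainerReuseSketch` and its methods are hypothetical, and tasks are assigned round-robin purely for illustration. The numbers in `main` are taken from the deck's container re-use example (8374 tasks on 50 containers).

```java
// Simplified, hypothetical model of container launches; not Tez scheduler code.
final class ContainerReuseSketch {
    /** Without re-use, every task pays for its own container (JVM) launch. */
    static int launchesWithoutReuse(int tasks) {
        return tasks;
    }

    /** With re-use, a container is launched once and then runs further tasks. */
    static int launchesWithReuse(int tasks, int poolSize) {
        if (poolSize <= 0) {
            throw new IllegalArgumentException("poolSize must be positive");
        }
        boolean[] started = new boolean[poolSize];
        int launches = 0;
        for (int task = 0; task < tasks; task++) {
            int container = task % poolSize; // round-robin assignment, for illustration only
            if (!started[container]) {
                started[container] = true;   // first task on this container triggers a launch
                launches++;
            }
        }
        return launches;
    }

    public static void main(String[] args) {
        int tasks = 8374;
        int containers = 50;
        System.out.println("launches without re-use: " + launchesWithoutReuse(tasks));
        System.out.println("launches with re-use:    " + launchesWithReuse(tasks, containers));
    }
}
```

In this model the launch count drops from one per task to one per container; the real saving also depends on scheduling delays and JVM warm-up, which the sketch does not model.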
    20. Tez – Sessions
        • Standard concepts of pre-launch and pre-warm applied
        • Key for interactive queries
        • Represents a connection between the user and the cluster
        • Multiple DAGs executed in the same session
        • Containers re-used across queries
        • Takes care of data locality and releasing resources when idle
        [Diagram: the Client starts a session and submits DAGs to the Application Master, which holds the Task Scheduler, Container Pool, Shared Object Registry and pre-warmed JVMs]
    21. Tez – Current status
        • Apache Project
          – Rapid development. Over 1100 jiras opened. Over 800 resolved
          – Growing community of contributors and users
        • 0.5 being released. In voting for release
          – Developer release focused on ease of development
          – Stable API
          – Better debugging
          – Documentation and code samples
        • Support for a vast topology of DAGs
        • Being used by multiple applications such as Apache Hive, Apache Pig, Cascading, Scalding
    22. Tez – Adoption Path
        Pre-requisite: Hadoop 2 with YARN.
        Tez has zero deployment pain. No side effects or traces left behind on your cluster. Low risk and low effort to try out.
        • Using Hive, Pig, Cascading, Scalding
          – Try them with Tez as execution engine
        • Already have MapReduce based pipeline
          – Use configuration to change MapReduce to run on Tez by setting ‘mapreduce.framework.name’ to ‘yarn-tez’ in mapred-site.xml
          – Consolidate MR workflow into MR-DAG on Tez
          – Change MR-DAG to use more efficient Tez constructs
        • Have custom pipeline
          – Wrap custom code/scripts into Tez inputs-processor-outputs
          – Translate your custom pipeline topology into Tez DAG
          – Change custom topology to use more efficient Tez constructs
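For the MapReduce path, the setting named on the slide would look roughly like this in mapred-site.xml. This is a minimal fragment showing only that one property; it assumes the usual Hadoop `<property>` layout, and the rest of the file, along with the Tez library location discussed under Simplifying Operations, is left out.

```xml
<!-- mapred-site.xml: run existing MapReduce jobs on Tez (property and value as given on the slide) -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>
```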
    23. Tez – Theory to Practice
        • Performance
        • Scalability
    24. Tez – Hive TPC-DS Scale 200GB latency
    25. Tez – Pig performance gains
        • Demonstrates performance gains from a basic translation to a Tez DAG
        [Chart: time in minutes, MR vs Tez]
        Script          MR        Tez       MR jobs
        Prod script 1   25m       10m       5
        Prod script 2   34m       16m       5
        Prod script 3   1h 46m    48m       12
        Prod script 4   2h 22m    1h 21m    15
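The chart's timings translate into speedup factors with simple division. The Java sketch below is only a convenience for that arithmetic: the class name `PigSpeedup` is made up for this example, the durations are the slide's figures converted to minutes, and it assumes the first time in each pair is MR and the second is Tez. The results come out at roughly 2.5x, 2.1x, 2.2x and 1.8x.

```java
// Arithmetic helper for the slide's figures; names are illustrative only.
final class PigSpeedup {
    /** Speedup = MR running time divided by Tez running time. */
    static double speedup(int mrMinutes, int tezMinutes) {
        return (double) mrMinutes / tezMinutes;
    }

    public static void main(String[] args) {
        String[] scripts = {"Prod script 1", "Prod script 2", "Prod script 3", "Prod script 4"};
        int[] mrMinutes = {25, 34, 106, 142};  // 1h 46m = 106 min, 2h 22m = 142 min
        int[] tezMinutes = {10, 16, 48, 81};   // 1h 21m = 81 min
        for (int i = 0; i < scripts.length; i++) {
            System.out.printf("%s: %.2fx%n", scripts[i], speedup(mrMinutes[i], tezMinutes[i]));
        }
    }
}
```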
    26. Tez – Observations on Performance
        • Number of stages in the DAG
          – The more stages in the DAG, the greater Tez's advantage over MR.
        • Cluster/queue capacity
          – The more congested a queue is, the greater Tez's advantage over MR, due to container reuse.
        • Size of intermediate output
          – The larger the intermediate output, the greater Tez's advantage over MR, due to reduced HDFS usage.
        • Size of data in the job
          – For smaller data and more stages, Tez's advantage over MR is greater, because launch overhead makes up a larger share of total time for smaller jobs.
        • Offload work to the cluster
          – Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster, e.g. MR split calculation.
        • Vertex caching
          – The more re-computation can be avoided, the better the performance.
    27. Tez – Data at scale
        Hive TPC-DS Scale 10TB
    28. Tez – DAG definition at scale
        Hive: TPC-DS Query 64 Logical DAG
    29. Tez – Container Reuse at Scale
        • 78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4)
    30. Tez – Bridging the Data Spectrum
        • Typical pattern in a TPC-DS query
          [Diagram: a Fact Table is joined with Dimension Tables 1, 2 and 3 in turn, producing Result Tables 1, 2 and 3, using broadcast joins and a shuffle join]
        • Broadcast join for small data sets
          [Diagram: a Fact Table joined with dimension tables through a single broadcast join]
        • Based on data size, the query optimizer can run either plan as a single Tez job
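The choice between the two plans can be pictured with a small Java sketch. It is not Hive's optimizer and not Tez code: `JoinPlanSketch`, the size threshold and the row layout are all assumptions made for illustration. It picks a join type from the dimension table's size and shows what a broadcast join does, namely that every fact-side task holds the whole small table in memory and probes it locally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of size-based join selection; not Hive or Tez code.
final class JoinPlanSketch {
    enum JoinType { BROADCAST, SHUFFLE }

    /** Small dimension tables are broadcast; large ones are shuffled. The threshold is an assumed tunable. */
    static JoinType choose(long dimensionTableBytes, long broadcastThresholdBytes) {
        return dimensionTableBytes <= broadcastThresholdBytes ? JoinType.BROADCAST : JoinType.SHUFFLE;
    }

    /**
     * Broadcast join: the full dimension table is available to the task as a hash map,
     * so fact rows are joined without moving them across the network.
     * Each fact row is {dimensionKey, measure}.
     */
    static List<String> broadcastJoin(List<int[]> factRows, Map<Integer, String> dimension) {
        List<String> joined = new ArrayList<>();
        for (int[] row : factRows) {
            String name = dimension.get(row[0]);
            if (name != null) {              // inner join: drop rows without a match
                joined.add(name + ":" + row[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Integer, String> dimension = new HashMap<>();
        dimension.put(1, "books");
        dimension.put(2, "music");
        List<int[]> facts = Arrays.asList(new int[]{1, 30}, new int[]{2, 5}, new int[]{3, 7});
        System.out.println(choose(10_000L, 1_000_000L));
        System.out.println(broadcastJoin(facts, dimension));
    }
}
```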
    31. Tez – Community
        • Early adopters and code contributors welcome
          – Adopters to drive more scenarios. Contributors to make them happen.
        • Tez meetup for developers and users
          – http://www.meetup.com/Apache-Tez-User-Group
        • Technical blog series
          – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
        • Useful links
          – Work tracking: https://issues.apache.org/jira/browse/TEZ
          – Code: https://github.com/apache/tez
          – Developer list: dev@tez.apache.org
          – User list: user@tez.apache.org
          – Issues list: issues@tez.apache.org
    32. Tez – Takeaways
        • Distributed execution framework that works on computations represented as dataflow graphs
        • Customizable execution architecture designed to enable dynamic performance optimizations at runtime
        • Works out of the box with the platform figuring out the hard stuff
        • Span the spectrum of interactive latency to batch, small to large
        • Open source Apache project – your use-cases and code are welcome
        • It works and is already being used by Hive and Pig
    33. Tez – Thanks for your time and attention!
        Video with Deep Dive on Tez: http://goo.gl/BL67o7 and http://www.infoq.com/presentations/apache-tez
        Questions? @bikassaha
    34. Next Steps
    35. Next Steps
        1. Review YARN Resources
        2. Review webinar recording
        3. Attend Office Hours
        4. Attend the next YARN webinar
    36. Resources
        • Setup HDP 2.1 environment
          – Leverage Sandbox: Hortonworks.com/Sandbox
        • Get Started with YARN
          – http://hortonworks.com/get-started/YARN
        • Video Deep Dive on Tez
          – http://goo.gl/BL67o7 and http://www.infoq.com/presentations/apache-tez
        • Tez meetup for developers and users
          – http://www.meetup.com/Apache-Tez-User-Group
        • Technical blog series
          – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
        • Useful links
          – Work tracking: https://issues.apache.org/jira/browse/TEZ
          – Code: https://github.com/apache/tez
          – Developer list: dev@tez.apache.org
          – User list: user@tez.apache.org
    37. Hortonworks Office Hours
        YARN Office Hours: dial in and chat with YARN experts.
        • We plan Office Hours for September 11th and October 9th @ 10am PT (2nd Thursdays)
        • Invitations will go out to those who attended or reviewed YARN webinars
        • These will also be posted to hortonworks.com/webinars
        • Registration required.
    38. YARN Ready Webinar Schedule
        Timeline (upcoming webinars and office hours)
        • Native Integration
        • Slider
        • Office Hours: August 14
        • Tez: August 21
        • Ambari: Sept. 4
        • Office Hours: Sept. 11
        • Scalding: Sept. 18
        • Spark: Oct. 2
        • Office Hours: Oct. 9
        Recorded Webinars: Introduction to YARN Ready. You can also visit http://hortonworks.com/webinars/#library
        Sign up here for the series, an individual webinar or office hours: http://info.hortonworks.com/YarnReady-BigData-Webcast-Series.html
    39. Thank you!