Apache Tez: Accelerating Hadoop Query Processing

Tez is the next-generation Hadoop query processing framework, written on top of YARN. Computation topologies in higher-level languages like Pig and Hive can be expressed naturally in the new graph dataflow model exposed by Tez. A multi-stage query can be expressed as a single Tez job, resulting in lower latency for short queries and improved throughput for large-scale queries. MapReduce has been the workhorse for Hadoop, but its monolithic structure has slowed innovation. YARN separates resource management from application logic and thus enables the creation of Tez, a more flexible and generic framework for data processing that benefits the entire Hadoop query ecosystem.
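
To make the graph dataflow idea concrete, here is a minimal sketch in plain Python. It is not the Tez API: the stage names (`scan_orders`, `scan_customers`, `join`, `aggregate`) and the helper `execution_order` are hypothetical, chosen only to show how a multi-stage query can be described as one graph rather than as a chain of separate jobs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical query plan: two scans feed a join, and the join feeds an aggregate.
# Each key is a vertex; its value is the set of vertices whose output it consumes.
query_dag = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}


def execution_order(dag):
    """Return one valid order in which the vertices of the dataflow graph can run."""
    return list(TopologicalSorter(dag).static_order())


# The whole plan is a single graph, so it can be submitted as a single job.
order = execution_order(query_dag)
```

In this sketch every vertex appears after the vertices it reads from, so `join` always follows both scans and `aggregate` comes last. The relative order of the two scans is not fixed, since they do not depend on each other.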

Published in: Technology

  1. Apache Tez: Accelerating Hadoop Query Processing
      Arun C. Murthy, Bikas Saha
      Founder & Architect, Hortonworks
      @acmurthy, @bikassaha (@hortonworks)
  2. Hello! (© Hortonworks Inc. 2013)
      • Founder/Architect at Hortonworks Inc.
        – Lead - Map-Reduce/YARN/Tez
        – Formerly, Architect Hadoop MapReduce, Yahoo
        – Responsible for running Hadoop MapReduce as a service for all of Yahoo (~50k nodes footprint)
      • Apache Hadoop, ASF
        – Former VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
        – Long-term Committer/PMC member (full time for 7 years)
        – Release Manager for hadoop-2.x
  3. Once upon a time …
      … long, long ago, there was a kingdom we shall call Apache Hadoop
      Image: http://2.bp.blogspot.com/-hIp99urgxCk/UAsSFo4i8YI/AAAAAAAAAFg/IzjNDwrBBVg/s1600/magickingdo
  4. Hadoop begat …
      … a two-headed monster on every node in the kingdom; each belonged to a different clan and answered to a different master
      Image: http://4.bp.blogspot.com/_C7CsfdqySYc/TNSKvIwiFcI/AAAAAAAAAbs/2FSU2TV_rRA/s1600/Two-Headed+Monster+-+With+Identifiers+-+Jan+19,+2009_0.jpg
  5. Knights of Bytes - HDFS
      … stored data uncompromisingly in directories/files, nary a care about contents
      Image: http://whoiscraigmoser.com/Images/identity/knight.png
  6. Prince of Processing - MapReduce
      He ruled with an iron fist by mapping, and then by mercilessly reducing data
      Image: http://media.comicvine.com/uploads/14/144886/2868181-sauron.jpg
  7. Peace Reigned
      … for a while with the odd change in the direction of the wind
      Image: http://www.get-covers.com/wp-content/uploads/2012/07/Peace.jpg
  8. Slowly, but surely …
      Human beings define reality through misery and suffering. - Agent Smith
      Image: http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp
  10. Slowly, but surely …
      … people of the kingdom clamored for more. A palpable sense of greed & expectation.
      Image: http://sidoxia.files.wordpress.com/2011/11/wall-st-greed-st1.jpg
  11. Signs of Distress
      SQL said some, others said Machine Learning, still others said Real-Time Event Processing
      Image: http://www.truth-seeker.info/wp-content/uploads/2012/11/distress.jpg
  12. A Meeting at the Summit
      MapReduce is dead! Err… not quite. We need more options! We need more! True…
      Image: http://4.bp.blogspot.com/-oqr1t6avx6g/TW55kUnmQvI/AAAAAAAAMMk/q9Jc87MSG4g/s400/arab%2Bleague%2Bround%2Btable%2B%2Bbig%2Bgood%2B2011.bmp
  13. A Meeting at the Summit
      A common thread, YARN, running through all applications… Long live the King!
      Image: http://whipup.net/wp-content/images/2008/08/yarn.gif
  14. The Edict
      Henceforth, in the Kingdom of King YARN… MapReduce has been relegated to the status of, merely, one of the applications!
      Image: http://www.napavintners.org/images/winery_Labels/EdictWines-800HW.jpg
  15. Reign of King YARN
      King YARN came to the throne with promises to return power to all applications equally, lower performance taxes and resource management…
      Image: http://images.fineartamerica.com/images-medium-large/the-coronation-the-crown-that-queen-everett.jpg
  16. Oh the Shame!
      Well, at least, Prince MapReduce still had powerful allies like Highness Hive, Powerful Pig, Cheery Cascading…
      Image: http://www.gibbsmagazine.com/MPj03414090000%5B1%5D.jpg
  17. Things get worse before better
      Unfortunately, things got a lot worse for the Prince MapReduce…
      Image: http://www.deviantart.com/download/144412184/Smile__Tomorrow_will_be_worse__by_daGrevis.jpg
  18. Knight Tez
      He did MapReduce, and so much more… Smartly aligned himself to Kingdom YARN.
      Image: http://twomorrows.com/alterego/media/08shiningknight.gif
  19. Knight Tez
      Long-term alliances of MapReduce with Hive, Pig, Cascading etc. broke up…
      Image: http://www.officialpsds.com/images/thumbs/broken-glass-psd44132.png
      … they decided to throw in their lot with Knight Tez!
      Image: http://informatica.upg-ploiesti.ro/62689/img/partners.jpg
  20. Happily ever after…
      (nothing cute to say)
  21. On a more serious note…
  22. Every season has a flavor…
      SQL-on-Hadoop is the new black! SQL-on-Hadoop will be solved within the existing ecosystem
  23. Looking ahead
      What will it be next year? Real-time event processing? Machine Learning?
  24. Play to our strengths
      Invest in the Apache Hadoop platform and the ecosystem (Hive et al).
  25. Seriously… Technical Details
  26. Tez – Introduction
      • Distributed execution framework targeted towards data-processing applications.
      • Based on expressing a computation as a dataflow graph.
      • Built on top of YARN, the resource management framework for Hadoop.
      • Open-source Apache Incubator project, Apache licensed.
  27. Tez – Design Themes
      • Empowering End Users
      • Execution Performance
  28. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying deployment
  29. Tez – Empowering End Users
      • Expressive dataflow definition APIs
        – Enable definition of complex dataflow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
        – Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
      [Diagram: dataflow graph of task pairs TaskA-1/TaskA-2 through TaskE-1/TaskE-2]
  30. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      [Diagram: Distributed Sort. Stages: Preprocessor, Partition, Aggregate, each with Task-1 and Task-2. Other labels: Sampler, Samples, Ranges]
  31. Tez – Empowering End Users
      • Flexible Input-Processor-Output runtime model
        – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
        – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful operators.
      [Diagram: three example compositions]
        – IntermediateReduce: ShuffleInput, ReduceProcessor, FileSortedOutput
        – FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput
        – PairwiseJoin: Input1 and Input2, JoinProcessor, FileSortedOutput
  32. Tez – Empowering End Users
      • Data type agnostic
        – Tez is only concerned with the movement of data: files and streams of bytes.
        – Does not impose any data format on the user application. An MR application can use key-value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them.
      [Diagram: the Tez Task handles bytes (File, Stream); User Code handles Key-Value pairs and Tuples]
  33. Tez – Empowering End Users
      • Simplifying deployment
        – Tez is a completely client-side application.
        – No deployments to do. Simply upload to any accessible FileSystem and change the local Tez configuration to point to it.
        – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
        – Leverages YARN local resources and distributed cache.
      [Diagram: two Client Machines, each with a TezClient; Tez Lib 1 and Tez Lib 2 in HDFS; Node Managers running TezTasks]
  34. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying usage
      With great power (APIs) come great responsibilities
  35. Tez – Execution Performance
      • Performance gains over MapReduce
      • Plan reconfiguration at runtime
      • Optimal resource management
      • Dynamic physical data flow decisions
  36. Tez – Execution Performance
      • Performance gains over MapReduce
        – Eliminate the replicated write barrier between successive computations.
        – Eliminate the job launch overhead of workflow jobs.
        – Eliminate the extra stage of map reads in every workflow job.
        – Eliminate the queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
      [Diagram: Pig/Hive on MR compared with Pig/Hive on Tez]
  37. Tez – Execution Performance
      • Plan reconfiguration at runtime
        – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
        – Advanced changes in dataflow graph structure.
        – Progressive graph construction in concert with the user optimizer.
      [Diagram: inputs are HDFS Blocks and YARN Resources. As planned: Stage 1 with 50 maps and 100 partitions, Stage 2 with 100 reducers. At runtime, with only 10 GB of data: Stage 1 unchanged, Stage 2 cut from 100 to 10 reducers]
  38. Tez – Execution Performance
      • Optimal resource management
        – Reuse YARN containers to launch new tasks.
        – Reuse YARN containers to enable shared objects across tasks.
      [Diagram: the Tez Application Master, in its own YARN container, exchanges Start Task / Task Done / Start Task messages with a TezTask Host in another YARN container, which runs TezTask1 and TezTask2 over SharedObjects]
  39. Tez – Execution Performance
      • Dynamic physical data flow decisions
        – Decide the type of physical byte movement and storage on the fly.
        – Store intermediate data on a distributed store, a local store or in memory.
        – Transfer bytes via blocking files or streaming and the spectrum in between.
      [Diagram: decided at runtime. Producer (small size) to Consumer via In-Memory; Producer to Consumer via Local File]
  40. Tez – Current status
      • Apache Incubator project
        – Rapid development. Over 270 JIRAs opened. Over 170 resolved.
        – Growing community.
      • Focus on stability
        – Testing and quality are the highest priority.
        – Code ready and deployed on multi-node clusters.
      • DAG of MR processing is working
        – Already functionally equivalent to MapReduce. Existing MapReduce jobs can be executed on Tez with few or no changes.
        – Working Hive prototype that can target Tez for execution of queries.
        – Work started on a prototype of Pig that can target Tez.
  41. Tez – Current status
      [Diagram: Typical pattern in a TPC-DS query. A Fact Table goes through three successive joins with Dimension Tables 1, 2 and 3, producing Result Tables 1, 2 and 3]
      [Diagram: Optimization for small data sets. Fact Table and Dimension Table 1 (label repeated three times)]
      Both can now run as a single Tez job
  42. Tez – Looking ahead
      • Early adopters and contributors welcome
        – Adopters to drive more scenarios. Contributors to make them happen.
      • Stay tuned for Tez meetups with deep dives on Tez architecture and using Tez
      • Useful links
        – Work tracking: https://issues.apache.org/jira/browse/TEZ
        – Code: https://github.com/apache/incubator-tez
        – High-level design document and API specification: https://issues.apache.org/jira/browse/TEZ-65
        – Developer list: dev@tez.incubator.apache.org
        – User list: user@tez.incubator.apache.org
        – Issues list: issues@tez.incubator.apache.org
  43. Tez – Takeaways
      • Distributed execution framework that works on computations represented as dataflow graphs
      • Naturally maps to execution plans produced by query optimizers
      • Execution architecture designed to enable dynamic performance optimizations at runtime
      • Open-source Apache project: your use cases and code are welcome
      • It works and is already being used by Hive
  44. Tez
      Thanks for your time and attention! Questions?
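
The runtime plan reconfiguration on slide 37 (Stage 2 shrinking from 100 to 10 reducers once only 10 GB of data turns up) can be sketched roughly as follows. This is an illustration in plain Python, not Tez code: the function name `choose_reducer_count` and the 1 GB-per-reducer target are assumptions picked so that the slide's numbers fall out, and the deck does not say how Tez actually makes this decision.

```python
import math

GB = 1024 ** 3  # bytes in one gigabyte


def choose_reducer_count(planned_reducers, observed_bytes, target_bytes_per_reducer=GB):
    """Shrink the planned reducer count when the observed data is small.

    planned_reducers: upper bound fixed in the original plan (100 on the slide).
    observed_bytes: data actually produced by the upstream stage at runtime.
    target_bytes_per_reducer: assumed sizing target; 1 GB is a made-up default.
    """
    # Number of reducers needed to keep each one near the target size, at least one.
    needed = max(1, math.ceil(observed_bytes / target_bytes_per_reducer))
    # Never exceed what the plan originally asked for.
    return min(planned_reducers, needed)


# Slide 37's scenario: 100 reducers planned, only 10 GB of data observed.
small_input = choose_reducer_count(100, 10 * GB)
# A large input keeps the planned parallelism.
large_input = choose_reducer_count(100, 500 * GB)
```

Under these assumptions the small input ends up with 10 reducers and the large one stays at the planned 100. The slide lists data size as only one input to the decision; operator resources, available cluster resources and locality are left out of this sketch.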