February 2014 HUG : Tez Details and Insides

Transcript

  • 1. Apache Tez : Accelerating Hadoop Query Processing Bikas Saha @bikassaha © Hortonworks Inc. 2013 Page 1
  • 2. Tez – Introduction
      • Distributed execution framework targeted towards data-processing applications.
      • Based on expressing a computation as a dataflow graph.
      • Highly customizable to meet a broad spectrum of use cases.
      • Built on top of YARN, the resource management framework for Hadoop.
      • Open source Apache Incubator project, Apache licensed.
  • 3. Tez – Design Themes
      • Empowering End Users
      • Execution Performance
  • 4. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying deployment
  • 5. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      [Figure: Distributed Sort. Labels: Preprocessor Stage (Task-1, Task-2), Sampler (Samples, Ranges), Partition Stage (Task-1, Task-2), Aggregate Stage (Task-1, Task-2).]
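The stage names in the Distributed Sort figure hint at how such a dataflow graph might be declared. The following is a minimal sketch in plain Python, not the Tez API: `DAG`, `add_vertex` and `add_edge` are hypothetical names, and the edge set and the task counts are assumptions read off the figure labels.

```python
from graphlib import TopologicalSorter


class DAG:
    """Toy dataflow graph: named vertices joined by directed edges."""

    def __init__(self, name):
        self.name = name
        self.parallelism = {}  # vertex name -> number of tasks
        self.inputs = {}       # vertex name -> set of upstream vertex names

    def add_vertex(self, vertex, parallelism):
        self.parallelism[vertex] = parallelism
        self.inputs.setdefault(vertex, set())
        return self

    def add_edge(self, producer, consumer):
        if producer not in self.parallelism or consumer not in self.parallelism:
            raise KeyError("both vertices must be added before the edge")
        self.inputs[consumer].add(producer)
        return self

    def execution_order(self):
        # A vertex is listed only after every vertex that feeds it.
        return list(TopologicalSorter(self.inputs).static_order())


# Assumed wiring: the sampler turns samples into key ranges, and the
# partition stage uses those ranges to split the preprocessed data.
sort_dag = (
    DAG("distributed-sort")
    .add_vertex("preprocessor", 2)
    .add_vertex("sampler", 1)
    .add_vertex("partition", 2)
    .add_vertex("aggregate", 2)
    .add_edge("preprocessor", "sampler")
    .add_edge("preprocessor", "partition")
    .add_edge("sampler", "partition")
    .add_edge("partition", "aggregate")
)
order = sort_dag.execution_order()
```

With this wiring there is only one valid ordering, so `order` lists the preprocessor first and the aggregate stage last.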
  • 6. Tez – Empowering End Users
      • Flexible Input-Processor-Output runtime model
          – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
          – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.
      [Figure: three example tasks. Mapper: HDFSInput, MapProcessor, FileSortedOutput. FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput. IntermediateJoiner: Input1 and Input2, JoinProcessor, FileSortedOutput.]
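The idea of composing a task from an input, a processor and an output can be sketched as follows. This is an illustration only, under the assumption that a task is nothing more than its three parts; `ListInput`, `ListOutput` and `Task` are hypothetical stand-ins and do not correspond to Tez classes.

```python
class ListInput:
    """Hypothetical input that reads records from an in-memory list."""

    def __init__(self, records):
        self.records = records

    def read(self):
        return iter(self.records)


class ListOutput:
    """Hypothetical output that collects records in a list."""

    def __init__(self):
        self.records = []

    def write(self, record):
        self.records.append(record)


class Task:
    """A task is just inputs + processor + output wired together."""

    def __init__(self, name, inputs, processor, output):
        self.name = name
        self.inputs = inputs
        self.processor = processor
        self.output = output

    def run(self):
        readers = [source.read() for source in self.inputs]
        for record in self.processor(*readers):
            self.output.write(record)
        return self.output


def map_processor(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)


def reduce_processor(pairs):
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    yield from sorted(totals.items())


def join_processor(left, right):
    lookup = dict(right)
    for key, value in left:
        if key in lookup:
            yield (key, value, lookup[key])


mapper = Task("Mapper", [ListInput(["a b", "a"])], map_processor, ListOutput())
reducer = Task("FinalReduce", [ListInput(mapper.run().records)],
               reduce_processor, ListOutput())
counts = reducer.run().records

joiner = Task("IntermediateJoiner",
              [ListInput(counts), ListInput([("a", "x")])],
              join_processor, ListOutput())
joined = joiner.run().records
```

The same `Task` class yields a mapper, a reducer or a two-input joiner purely by swapping the parts, which is the point the slide makes about a composable library.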
  • 7. Tez – Empowering End Users
      • Data type agnostic
          – Tez is only concerned with the movement of data: files and streams of bytes.
          – Clean separation between the logical application layer and the physical framework layer. This design is important for Tez to be a platform for a variety of applications.
      [Figure labels: Tez Task, User Code; File, Stream, Bytes; Key Value, Tuples.]
  • 8. Tez – Empowering End Users
      • Simplifying deployment
          – Tez is a completely client-side application.
          – No deployments to do. Simply upload the Tez libraries to any accessible FileSystem and change the local Tez configuration to point to that location.
          – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
          – Leverages YARN local resources.
      [Figure labels: HDFS (Tez Lib 1, Tez Lib 2); two Client Machines (TezClient); two Node Managers (TezTask).]
  • 9. Tez – Empowering End Users
      • Expressive dataflow definition APIs
      • Flexible Input-Processor-Output runtime model
      • Data type agnostic
      • Simplifying usage
      With great power APIs come great responsibilities. Tez is a framework on which end user applications can be built.
  • 10. Tez – Execution Performance
      • Performance gains over Map Reduce
      • Optimal resource management
      • Plan reconfiguration at runtime
      • Dynamic physical data flow decisions
  • 11. Tez – Execution Performance
      • Performance gains over Map Reduce
          – Eliminate replicated write barrier between successive computations.
          – Eliminate job launch overhead of workflow jobs.
          – Eliminate extra stage of map reads in every workflow job.
          – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
      [Figure labels: Pig/Hive - Tez; Pig/Hive - MR.]
  • 12. Tez – Execution Performance
      • Plan reconfiguration at runtime
          – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
          – Advanced changes in dataflow graph structure.
          – Progressive graph construction in concert with user optimizer.
      [Figure: Stage 1 runs 50 maps over HDFS blocks and produces 100 partitions; Stage 2 is planned with 100 reducers. When Stage 1 produces only 10 GB of data, Stage 2 is cut from 100 reducers to 10. Other label: YARN Resources.]
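The figure's 100-to-10 reduction can be reproduced with a small calculation. This is a sketch under one loud assumption: that each reducer should handle roughly 1 GB, a target the slides do not state. `choose_reducers` is a hypothetical helper, not a Tez function.

```python
import math

GB = 1024 ** 3


def choose_reducers(observed_bytes, planned_reducers, bytes_per_reducer):
    """Shrink the planned reducer count to fit the data actually produced."""
    needed = max(1, math.ceil(observed_bytes / bytes_per_reducer))
    # Never exceed the plan: the upstream stage already wrote that many partitions.
    return min(planned_reducers, needed)


# Assumed target of 1 GB per reducer; the slide only gives the 10 GB total.
reducers = choose_reducers(10 * GB, planned_reducers=100, bytes_per_reducer=1 * GB)
```

Under that assumed target, 10 GB of map output maps to 10 reducers, matching the figure.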
  • 13. Tez – Execution Performance
      • Optimal resource management
          – Reuse YARN containers to launch new tasks.
          – Reuse YARN containers to enable shared objects across tasks.
          – TezSession to encapsulate all this for the user.
      [Figure labels: Tez Application Master (Start Task, Task Done, Start Task); YARN Container acting as TezTask Host (TezTask1, TezTask2, Shared Objects).]
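Container reuse with shared objects can be illustrated with a toy model. The sketch below assumes a container is a long-lived object that runs tasks one after another and keeps a small object cache between them; `Container` and `get_or_create` are hypothetical names, not the YARN or Tez API.

```python
class Container:
    """Hypothetical stand-in for a YARN container that outlives one task."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}  # survives across tasks in this container
        self.tasks_run = 0

    def get_or_create(self, key, factory):
        if key not in self.shared_objects:
            self.shared_objects[key] = factory()
        return self.shared_objects[key]

    def run(self, task):
        self.tasks_run += 1
        return task(self)


builds = []  # records how often the expensive object is built


def load_lookup_table():
    builds.append(1)
    return {"us": "United States", "in": "India"}


def make_task(code):
    def task(container):
        table = container.get_or_create("lookup", load_lookup_table)
        return table[code]
    return task


container = Container("container_01")
results = [container.run(make_task(code)) for code in ("us", "in", "us")]
```

Three tasks run in the same container, but the lookup table is built only once; later tasks find it in the shared cache.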
  • 14. Tez – Execution Performance
      • Dynamic physical data flow decisions
          – Decide the type of physical byte movement and storage on the fly.
          – Store intermediate data on distributed store, local store or in-memory.
          – Transfer bytes via blocking files or streaming and the spectrum in between.
      [Figure labels: Producer, Local File, Consumer; at runtime: Producer (small size), In-Memory, Consumer.]
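One way to picture such a runtime decision is a small policy function. The policy below is an assumption made for illustration (small outputs stay in memory, larger ones go to a local file, outputs that must survive node loss go to the distributed store); the slides do not specify the actual rules, and `choose_storage` is a hypothetical name.

```python
from enum import Enum


class Storage(Enum):
    IN_MEMORY = "in-memory"
    LOCAL_FILE = "local file"
    DISTRIBUTED_STORE = "distributed store"


def choose_storage(output_bytes, memory_budget_bytes, needs_durability=False):
    """Pick where a producer keeps its intermediate output (assumed policy)."""
    if needs_durability:
        return Storage.DISTRIBUTED_STORE
    if output_bytes <= memory_budget_bytes:
        return Storage.IN_MEMORY
    return Storage.LOCAL_FILE


MB = 1024 ** 2
small = choose_storage(8 * MB, memory_budget_bytes=64 * MB)
large = choose_storage(512 * MB, memory_budget_bytes=64 * MB)
durable = choose_storage(8 * MB, memory_budget_bytes=64 * MB, needs_durability=True)
```

The decision depends on the observed output size, so it can only be made once the producer has run, which is why the slide places it at runtime.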
  • 15. Tez – Automatic Reduce Parallelism
      • Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.
      • Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
      [Figure labels: App Master, Map Vertex, Vertex Manager, Vertex State Machine, Reduce Vertex, Cancel Task.]
  • 16. Tez – Automatic Reduce Parallelism
      • Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.
      • Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
      [Figure labels: App Master, Map Vertex, Data Size Statistics, Vertex Manager, Vertex State Machine, Reduce Vertex, Cancel Task.]
  • 17. Tez – Automatic Reduce Parallelism
      • Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.
      • Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
      [Figure labels: App Master, Map Vertex, Data Size Statistics, Vertex Manager, Set Parallelism, Re-Route, Vertex State Machine, Reduce Vertex, Cancel Task.]
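The event flow on these slides (statistics in, parallelism and re-routing out) can be sketched as below. Everything here is an assumption made for illustration: `ToyVertexManager` and `ReduceVertex` are hypothetical classes, the manager decides once half of the map tasks have reported, it extrapolates the total linearly, and it re-routes by giving each reducer a contiguous group of partitions. None of these choices is taken from the slides or from the Tez code.

```python
import math


class ReduceVertex:
    """Toy reduce vertex whose task count can be changed before it starts."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.routing = {}  # partition index -> reducer index

    def set_parallelism(self, parallelism, routing):
        self.parallelism = parallelism
        self.routing = routing


class ToyVertexManager:
    """Collects data-size events from map tasks, then advises the vertex."""

    def __init__(self, vertex, num_map_tasks, num_partitions,
                 bytes_per_reducer, min_fraction=0.5):
        self.vertex = vertex
        self.num_map_tasks = num_map_tasks
        self.num_partitions = num_partitions
        self.bytes_per_reducer = bytes_per_reducer
        self.min_fraction = min_fraction
        self.reported = {}  # map task id -> output bytes
        self.decided = False

    def on_statistics_event(self, map_task_id, output_bytes):
        self.reported[map_task_id] = output_bytes
        enough = len(self.reported) >= self.min_fraction * self.num_map_tasks
        if self.decided or not enough:
            return
        # Extrapolate total output from the tasks that have reported so far.
        estimated_total = (sum(self.reported.values())
                           * self.num_map_tasks / len(self.reported))
        wanted = math.ceil(estimated_total / self.bytes_per_reducer)
        wanted = max(1, min(self.vertex.parallelism, wanted))
        # Re-route: each reducer reads a contiguous group of partitions.
        group = math.ceil(self.num_partitions / wanted)
        routing = {p: p // group for p in range(self.num_partitions)}
        self.vertex.set_parallelism(routing[self.num_partitions - 1] + 1, routing)
        self.decided = True


MB = 1024 ** 2
reduce_vertex = ReduceVertex(parallelism=100)
manager = ToyVertexManager(reduce_vertex, num_map_tasks=50, num_partitions=100,
                           bytes_per_reducer=1000 * MB)
for task_id in range(25):
    manager.on_statistics_event(task_id, 200 * MB)
```

After the 25th event the manager estimates about 10,000 MB of map output and sets the reduce vertex to 10 tasks, each reading 10 of the 100 partitions.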
  • 18. Tez – Now and Next
  • 19. Tez – Bridge the Data Spectrum
      • Broadcast join for small data sets
      • Typical pattern in a TPC-DS query
      • Based on data size, the query optimizer can run either plan as a single Tez job
      [Figure labels: Fact Table; Dimension Table 1, 2 and 3; Broadcast Join; Shuffle Join; Result Table 1, 2 and 3.]
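The choice between the two join plans can be sketched in a few lines. This is an illustration, not how Hive or Tez implement it: the size threshold, the helper names and the sample rows are all assumptions, and both joins run in a single process here rather than across tasks.

```python
def broadcast_join(fact_rows, dim_rows, key):
    """Join by holding the whole (small) dimension table in a lookup dict."""
    lookup = {row[key]: row for row in dim_rows}
    return [{**fact, **lookup[fact[key]]}
            for fact in fact_rows if fact[key] in lookup]


def shuffle_join(fact_rows, dim_rows, key, num_partitions):
    """Join by hash-partitioning both sides so matching keys meet."""
    fact_parts = [[] for _ in range(num_partitions)]
    dim_parts = [[] for _ in range(num_partitions)]
    for row in fact_rows:
        fact_parts[hash(row[key]) % num_partitions].append(row)
    for row in dim_rows:
        dim_parts[hash(row[key]) % num_partitions].append(row)
    joined = []
    for facts, dims in zip(fact_parts, dim_parts):
        joined.extend(broadcast_join(facts, dims, key))
    return joined


def join(fact_rows, dim_rows, key, dim_bytes, broadcast_limit_bytes):
    """Pick the plan from the dimension table size (assumed threshold rule)."""
    if dim_bytes <= broadcast_limit_bytes:
        return "broadcast", broadcast_join(fact_rows, dim_rows, key)
    return "shuffle", shuffle_join(fact_rows, dim_rows, key, num_partitions=4)


# Made-up sample rows, purely for illustration.
sales = [{"sale_id": 1, "store_id": 10},
         {"sale_id": 2, "store_id": 20},
         {"sale_id": 3, "store_id": 10}]
stores = [{"store_id": 10, "city": "Pune"},
          {"store_id": 20, "city": "Oslo"}]

plan, rows = join(sales, stores, "store_id",
                  dim_bytes=2_000, broadcast_limit_bytes=10_000)
```

Both plans return the same rows; only the data movement differs, which is why an optimizer can pick either one based on size.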
  • 20. Tez – Current status
      • Apache Incubator Project
          – Rapid development. Over 800 JIRAs opened. Over 600 resolved.
          – Growing community of contributors and users.
      • Focus on stability
          – Testing and quality are highest priority.
          – Code ready and deployed on multi-node environments.
      • Support for a vast topology of DAGs
          – Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
          – Hive retargeted to use Tez for execution of queries (HIVE-4660).
          – Pig to use Tez for execution of scripts (PIG-3446).
  • 21. Tez – Roadmap
      • Richer DAG support
          – Support for co-scheduling
          – Efficient iterations
      • Performance optimizations
          – More efficiencies in transfer of data
          – Improve session performance
      • Usability
          – Stability and testability
          – Recovery and history
          – Tools for performance analysis and debugging
  • 22. Tez – Community
      • Early adopters and code contributors welcome
          – Adopters to drive more scenarios. Contributors to make them happen.
          – Hive and Pig communities are on board and making great progress: HIVE-4660 and PIG-3446.
      • Tez meetup for developers and users
          – http://www.meetup.com/Apache-Tez-User-Group
      • Technical blog series
          – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-dataprocessing/ (will soon be available on the Apache Wiki)
      • Useful links
          – Work tracking: https://issues.apache.org/jira/browse/TEZ
          – Code: https://github.com/apache/incubator-tez
          – Developer list: dev@tez.incubator.apache.org
          – User list: user@tez.incubator.apache.org
          – Issues list: issues@tez.incubator.apache.org
  • 23. Tez – Takeaways
      • Distributed execution framework that works on computations represented as dataflow graphs
      • Naturally maps to execution plans produced by query optimizers
      • Customizable execution architecture designed to enable dynamic performance optimizations at runtime
      • Works out of the box, with the platform figuring out the hard stuff
      • Spans the spectrum from interactive latency to batch
      • Open source Apache project: your use cases and code are welcome
      • It works and is already being used by Hive and Pig
  • 24. Tez
      Thanks for your time and attention!
      Deep dive on Tez video at http://www.infoq.com/presentations/apache-tez
      Questions? @bikassaha