Apache Tez: Accelerating Hadoop Query Processing
Jeff Markham
Technical Director, APAC
Hortonworks

Page 1
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache incubator project and Apache licensed.

© Hortonworks Inc. 2013

Page 2
YARN: Taking Hadoop Beyond Batch
[Figure: Hadoop stack comparison.]
• HADOOP 1.0 (MapReduce as Base): Pig (data flow), Hive (sql) and Others (cascading) run on MapReduce (cluster resource management & data processing), on top of HDFS (redundant, reliable storage).
• HADOOP 2.0 (Apache Tez as Base): Batch (MapReduce); Data Flow (Pig), SQL (Hive) and Others (cascading) on Tez (execution engine); Real Time Stream Processing (Storm); Online Data Processing (HBase, Accumulo). All run on YARN (cluster resource management), on top of HDFS2 (redundant, reliable storage).

Apache Tez (“Speed”)
•  Replaces MapReduce as primitive for Pig, Hive, Cascading etc.
– Smaller latency for interactive queries
– Higher throughput for batch queries
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft
Task with pluggable Input, Processor and Output
[Figure: a Task composed of an Input, a Processor and an Output.]
Tez Task - <Input, Processor, Output>
YARN ApplicationMaster to run DAG of Tez Tasks
Tez: Building blocks for scalable data processing
[Figure: tasks assembled from Input, Processor and Output blocks.]
• Classical ‘Map’: HDFS Input → Map Processor → Sorted Output
• Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
Hive-on-MR vs. Hive-on-Tez
Tez avoids unneeded writes to HDFS

SELECT a.x, AVG(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg
FROM c GROUP BY x
ORDER BY avg;

[Figure: one query plan drawn twice, as Hive – MR and as Hive – Tez, using map (M) and reduce (R) tasks. Operators shown: SELECT a.state; SELECT b.id; SELECT a.state, c.itemId; JOIN (a, c) SELECT c.price; JOIN(a, b) GROUP BY a.state, COUNT(*), AVERAGE(c.price). In the Hive – MR plan, HDFS appears between the successive map/reduce jobs; in the Hive – Tez plan, the same operators form a single graph of M and R tasks.]
Tez Sessions
… because Map/Reduce query startup is expensive
• Tez Sessions
– Hot containers ready for immediate use
– Removes task and job launch overhead (~5s – 30s)

• Hive
– Session launch/shutdown in background (seamless, user not
aware)
– Submits query plan directly to Tez Session

Native Hadoop service, not ad-hoc
Tez Delivers Interactive Query - Out of the Box!
Feature | Description | Benefit
Tez Session | Overcomes Map-Reduce job-launch latency by pre-launching Tez AppMaster. | Latency
Tez Container Pre-Launch | Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics. | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access. | Latency
Complex DAGs | Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. | Throughput

Page 8
Tez – Design Themes
• Empowering End Users
• Execution Performance


Page 9
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment


Page 10
Tez – Empowering End Users
• Expressive dataflow definition APIs
– Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime.
– Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance.
[Figure: example dataflow graph expanded into tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1 and TaskE-2.]
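
The idea of declaring a logical graph and expanding it into tasks at runtime can be sketched in a few lines of Python. This is an illustration only, not the Tez API: the `Vertex`, `Dag`, `add_vertex`, `connect` and `expand` names are hypothetical, and the edges chosen below are arbitrary because the figure's connections are not recoverable from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A logical processing step; `parallelism` is how many tasks it expands to."""
    name: str
    parallelism: int


@dataclass
class Dag:
    """Hypothetical logical dataflow graph: named vertices plus directed edges."""
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, name, parallelism):
        self.vertices[name] = Vertex(name, parallelism)
        return self

    def connect(self, src, dst):
        # Both endpoints must already be part of the graph.
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError(f"unknown vertex in edge {src} -> {dst}")
        self.edges.append((src, dst))
        return self

    def expand(self):
        """Expand every logical vertex into its physical task names."""
        return {
            v.name: [f"Task{v.name}-{i}" for i in range(1, v.parallelism + 1)]
            for v in self.vertices.values()
        }


dag = Dag()
for name in "ABCDE":
    dag.add_vertex(name, parallelism=2)
# Arbitrary example edges (assumption, not taken from the slide).
dag.connect("A", "D").connect("B", "D").connect("C", "E")

tasks = dag.expand()
print(tasks["A"])  # ['TaskA-1', 'TaskA-2']
```

The user only describes vertices and connections; the number of tasks per vertex is a separate property that a runtime could change before expansion.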

Page 11
Tez – Empowering End Users
• Expressive dataflow definition APIs
[Figure: Distributed Sort expressed as a dataflow graph. Stages shown: Preprocessor Stage (Task-1, Task-2), Sampler, Partition Stage (Task-1, Task-2) and Aggregate Stage (Task-1, Task-2). Edge labels: Samples, Ranges.]

Page 12
Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
– Construct physical runtime executors dynamically by connecting
different inputs, processors and outputs.
– End goal is to have a library of inputs, outputs and processors that
can be programmatically composed to generate useful tasks.

[Figure: example tasks composed from inputs, processors and outputs.]
• Mapper: HDFSInput → MapProcessor → FileSortedOutput
• Reducer: ShuffleInput → ReduceProcessor → HDFSOutput
• PairwiseJoin: Input1, Input2 → JoinProcessor → FileSortedOutput
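
To make the composition idea concrete, here is a small Python sketch in which a task is just an input, a processor and an output wired together. It does not use Tez classes; `make_task`, `list_input`, `upper_processor`, `sorted_output` and `plain_output` are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Iterable, List

# Each role is modelled as a plain callable. These are toy stand-ins,
# not the Tez Input/Processor/Output interfaces.
Input = Callable[[], Iterable[str]]
Processor = Callable[[Iterable[str]], Iterable[str]]
Output = Callable[[Iterable[str]], List[str]]


def make_task(read: Input, process: Processor, write: Output) -> Callable[[], List[str]]:
    """Compose one input, one processor and one output into a runnable task."""
    def run() -> List[str]:
        return write(process(read()))
    return run


def list_input(records: List[str]) -> Input:
    """Input that reads from an in-memory list."""
    return lambda: iter(records)


def upper_processor(records: Iterable[str]) -> Iterable[str]:
    """Processor that upper-cases every record."""
    return (r.upper() for r in records)


def sorted_output(records: Iterable[str]) -> List[str]:
    """Output that emits records in sorted order."""
    return sorted(records)


def plain_output(records: Iterable[str]) -> List[str]:
    """Output that emits records in arrival order."""
    return list(records)


# Same input and processor, different output: two different tasks.
sorting_task = make_task(list_input(["b", "a", "c"]), upper_processor, sorted_output)
passthrough_task = make_task(list_input(["b", "a", "c"]), upper_processor, plain_output)

print(sorting_task())      # ['A', 'B', 'C']
print(passthrough_task())  # ['B', 'A', 'C']
```

Swapping a single piece changes the behaviour of the task without touching the other two, which is the point of keeping a library of interchangeable parts.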

Page 13
Tez – Empowering End Users
• Data type agnostic
– Tez is only concerned with the movement of data. Files and
streams of bytes.
– Does not impose any data format on the user application. MR
application can use Key-Value pairs on top of Tez. Hive and Pig
can use tuple oriented formats that are natural and native to them.

[Figure: a Tez Task exchanges Bytes with a File or Stream; the User Code inside the task interprets those bytes as Key Value pairs or Tuples.]
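
A short Python sketch of this separation, under stated assumptions: the `transport` function stands in for the framework and only handles bytes, while two imaginary applications choose their own encodings (JSON key-value pairs and packed tuples). All function names and both encodings are illustrative choices, not anything Tez prescribes.

```python
import json
import struct


def transport(payload: bytes) -> bytes:
    """Stand-in for the framework: it only ever sees and moves bytes."""
    if not isinstance(payload, bytes):
        raise TypeError("the transport layer only accepts bytes")
    return payload


# Application A: key-value pairs, encoded as JSON (an arbitrary choice).
def encode_kv(key: str, value: int) -> bytes:
    return json.dumps([key, value]).encode("utf-8")


def decode_kv(data: bytes):
    key, value = json.loads(data.decode("utf-8"))
    return key, value


# Application B: fixed-width tuples of (int, int, float), packed big-endian.
TUPLE_FORMAT = ">iid"


def encode_tuple(row) -> bytes:
    return struct.pack(TUPLE_FORMAT, *row)


def decode_tuple(data: bytes):
    return struct.unpack(TUPLE_FORMAT, data)


print(decode_kv(transport(encode_kv("state", 42))))        # ('state', 42)
print(decode_tuple(transport(encode_tuple((1, 2, 0.5)))))  # (1, 2, 0.5)
```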


Page 14
Tez – Empowering End Users
• Simplifying deployment
– Tez is a completely client side application.
– No deployments to do. Simply upload to any accessible
FileSystem and change local Tez configuration to point to that.
– Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
– Leverages YARN local resources.
[Figure: Tez Lib 1 and Tez Lib 2 are both stored on HDFS. A TezClient on each Client Machine points at one of them, and the TezTasks started under the Node Managers use the corresponding library.]

Page 15
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With great power APIs come great responsibilities :-)
Tez is a framework on which end user applications can be built


Page 16
Tez – Execution Performance
• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions


Page 17
Tez – Execution Performance
• Performance gains over Map Reduce
– Eliminate replicated write barrier between successive
computations.
– Eliminate job launch overhead of workflow jobs.
– Eliminate extra stage of map reads in every workflow job.
– Eliminate queue and resource contention suffered by workflow
jobs that are started after a predecessor job completes.

[Figure: the same workflow drawn as Pig/Hive - MR and as Pig/Hive - Tez.]
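
The overheads listed above can be counted with a deliberately simple model. The Python sketch below assumes a linear workflow of `num_stages` computations, that each MapReduce job writes its result to HDFS, and that a single DAG writes to HDFS only at the end; the `WorkflowCost`, `mapreduce_chain` and `single_dag` names are hypothetical, and the counts are bookkeeping, not measurements.

```python
from dataclasses import dataclass


@dataclass
class WorkflowCost:
    job_launches: int
    replicated_hdfs_writes: int
    extra_map_stages: int


def mapreduce_chain(num_stages: int) -> WorkflowCost:
    """Each stage is its own MapReduce job that writes its result to HDFS.

    Every job after the first needs a map stage just to read the previous
    job's output back in.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return WorkflowCost(
        job_launches=num_stages,
        replicated_hdfs_writes=num_stages,
        extra_map_stages=num_stages - 1,
    )


def single_dag(num_stages: int) -> WorkflowCost:
    """All stages run inside one DAG; only the final result goes to HDFS."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    return WorkflowCost(job_launches=1, replicated_hdfs_writes=1, extra_map_stages=0)


mr = mapreduce_chain(3)
dag = single_dag(3)
print(mr.job_launches, mr.replicated_hdfs_writes, mr.extra_map_stages)     # 3 3 2
print(dag.job_launches, dag.replicated_hdfs_writes, dag.extra_map_stages)  # 1 1 0
```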


Page 18
Tez – Execution Performance
• Plan reconfiguration at runtime
– Dynamic runtime concurrency control based on data size, user
operator resources, available cluster resources and locality.
– Advanced changes in dataflow graph structure.
– Progressive graph construction in concert with user optimizer.

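[Figure: runtime reconfiguration. As planned: HDFS Blocks → Stage 1 (50 maps, 100 partitions) → Stage 2 (100 reducers). At runtime, Stage 1 produces only 10 GB of data, so Stage 2 is changed from 100 reducers to 10, taking the available YARN Resources into account.]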
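
The 100-to-10 reducer adjustment in the figure can be reproduced with a simple sizing rule. The Python sketch below is an assumption-laden illustration: the 1 GB-per-reducer target and the `choose_reducers` function are hypothetical, and a real decision would also weigh cluster resources and locality as the bullets note.

```python
GB = 1024 ** 3


def choose_reducers(observed_bytes: int, planned_reducers: int,
                    target_bytes_per_reducer: int = 1 * GB) -> int:
    """Shrink the planned reducer count when the upstream stage produced little data.

    Never returns more than the planned count and never fewer than one.
    The 1 GB default target is an illustrative assumption.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    needed = -(-observed_bytes // target_bytes_per_reducer)
    return max(1, min(planned_reducers, needed))


print(choose_reducers(observed_bytes=10 * GB, planned_reducers=100))  # 10
```

With 100 partitions already written by Stage 1, ten reducers would each fetch ten partitions, so the upstream stage does not need to be rerun.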


Page 19
Tez – Execution Performance
• Optimal resource management
– Reuse YARN containers to launch new tasks.
– Reuse YARN containers to enable shared objects across tasks.

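[Figure: the Tez Application Master sends Start Task to a TezTask Host running inside a YARN Container. After Task Done it sends Start Task again, so TezTask1 and TezTask2 run in the same YARN Container and can both use its Shared Objects.]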
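
A toy Python model of container reuse with shared objects follows. It is a sketch under assumptions, not Tez code: `Container`, `load_lookup_table` and `make_task` are hypothetical names, and the lookup table stands in for any expensive-to-build object that a later task in the same container can reuse.

```python
class Container:
    """Toy stand-in for a YARN container that stays alive between tasks."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}  # survives across tasks run in this container
        self.tasks_run = 0

    def run(self, task):
        self.tasks_run += 1
        return task(self.shared_objects)


def load_lookup_table(calls):
    """Pretend-expensive initialisation; `calls` records how often it runs."""
    calls.append(1)
    return {"CA": "California", "NY": "New York"}


def make_task(state_code, calls):
    def task(shared):
        # Build the table only if no earlier task in this container did so.
        if "lookup" not in shared:
            shared["lookup"] = load_lookup_table(calls)
        return shared["lookup"][state_code]
    return task


load_calls = []
container = Container("container_1")
results = [container.run(make_task(code, load_calls)) for code in ("CA", "NY")]

print(results)              # ['California', 'New York']
print(container.tasks_run)  # 2
print(len(load_calls))      # 1
```

Two tasks ran, but the expensive load happened once, which is the saving the second bullet describes.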


Page 20
Tez – Execution Performance
• Dynamic physical data flow decisions
– Decide the type of physical byte movement and storage on the fly.
– Store intermediate data on distributed store, local store or in-memory.
– Transfer bytes via blocking files or streaming and the spectrum in
between.
[Figure: the decision is made At Runtime. A Producer with small output hands its data to the Consumer In-Memory; another Producer writes a Local File that its Consumer reads.]
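
One way to picture such an on-the-fly decision is a small policy function. The Python sketch below is hypothetical throughout: the `Store` enum, the `choose_store` function, the 64 MB in-memory limit and the order of the checks are all assumptions made for illustration.

```python
from enum import Enum


class Store(Enum):
    IN_MEMORY = "in-memory"
    LOCAL_FILE = "local file"
    DISTRIBUTED = "distributed store"


MB = 1024 ** 2


def choose_store(output_bytes: int, free_memory_bytes: int,
                 must_survive_node_loss: bool = False,
                 in_memory_limit: int = 64 * MB) -> Store:
    """Pick where a producer's intermediate output should live.

    The 64 MB limit and the decision order are illustrative assumptions.
    """
    if must_survive_node_loss:
        return Store.DISTRIBUTED
    if output_bytes <= in_memory_limit and output_bytes <= free_memory_bytes:
        return Store.IN_MEMORY
    return Store.LOCAL_FILE


print(choose_store(output_bytes=8 * MB, free_memory_bytes=512 * MB).value)    # in-memory
print(choose_store(output_bytes=900 * MB, free_memory_bytes=512 * MB).value)  # local file
```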


Page 21
Tez – Sessions
[Figure: a Client issues Start Session and Submit DAG to the Application Master. Labelled components: Task Scheduler, Container Pool (pre-warmed JVM) and Shared Object Registry.]
• Key for interactive queries
• Analogous to a database session: represents a connection between the user and the cluster
• Run multiple DAGs / queries in the same session
• Maintains a pool of reusable containers for low latency execution of tasks within and across queries
• Takes care of data locality and releasing resources when idle
• Session caches in the Application Master and in the container pool reduce re-computation and re-initialization
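
The pooling behaviour described in these bullets can be modelled in a few lines of Python. This is a toy sketch, not the Tez client API: the `Session` class and its `start`, `submit_dag` and `release_idle` methods are hypothetical, and tasks are not actually executed.

```python
class Session:
    """Toy model of a session: started once, then used for many DAG submissions."""

    def __init__(self, pool_size):
        self.pool_size = pool_size
        self.containers_launched = 0
        self.idle = []  # containers kept warm for the next DAG
        self.started = False

    def start(self):
        self.started = True

    def _acquire(self):
        # Prefer a warm container; launch a new one only when none is idle.
        if self.idle:
            return self.idle.pop()
        self.containers_launched += 1
        return f"container-{self.containers_launched}"

    def submit_dag(self, num_tasks):
        """Run a DAG of `num_tasks` tasks and return how many containers it used."""
        if not self.started:
            raise RuntimeError("session not started")
        width = min(num_tasks, self.pool_size)
        in_use = [self._acquire() for _ in range(width)]
        # ... tasks would run here, in waves of `width` ...
        self.idle.extend(in_use)  # hand containers back instead of exiting
        return width

    def release_idle(self):
        """Give idle containers back to the cluster."""
        released = len(self.idle)
        self.idle.clear()
        return released


session = Session(pool_size=4)
session.start()
session.submit_dag(num_tasks=4)  # first query launches 4 containers
session.submit_dag(num_tasks=3)  # second query reuses warm containers
print(session.containers_launched)  # 4
print(session.release_idle())       # 4
```

The second submission launches nothing new, which is where the latency saving across queries comes from in this model.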

Page 33
Tez – Benchmark Performance

Significant (but not all) speed-ups due to Tez:
•  DAG support and runtime graph reconfiguration enable utilizing the
parallelism of the cluster
•  Tez Session and container re-use enable
efficient and low latency execution


Page 35
Tez – Performance Analysis
[Figure: TPC-DS – Query 27 with Hive on Tez, annotated as follows.]
• Tez Session populates container pool (AM)
• Dimension table calculation and HDFS split generation in parallel
• Dimension tables broadcast to Hive MapJoin tasks
• Final Reducer pre-launched and fetches completed inputs

Page 36
Tez – Current status
• Apache Incubator Project
– Rapid development. Over 600 JIRAs opened. Over 400 resolved.
– Growing community of contributors and users.

• Focus on stability
– Testing and quality are highest priority.
– Code ready and deployed on multi-node environments.

• Support for a wide range of DAG topologies
– Already functionally equivalent to Map Reduce. Existing Map
Reduce jobs can be executed on Tez with few or no changes.
– Hive re-targeted to use Tez for execution of queries (HIVE-4660).
– Work started on Pig to use Tez for execution of scripts (PIG-3446).

Page 37
Tez – Roadmap
• Richer DAG support
– Support for co-scheduling and streaming
– Better fault tolerance with checkpoints

• Performance optimizations
– More efficiencies in transfer of data
– Improve session performance

• Usability
– Stability and testability
– Recovery and history
– Tools for performance analysis and debugging


Page 38
Tez – Key Takeaways
• Distributed execution framework that works on
computations represented as dataflow graphs
• Naturally maps to execution plans produced by query
optimizers
• Customizable execution architecture designed to
enable dynamic performance optimizations at runtime
• Works out of the box with the platform figuring out
the hard stuff
• Spans the spectrum from interactive latency to batch
• Open source Apache project – your use-cases and
code are welcome
• It works and is already being used by Hive and Pig

Page 40
Thank You!


Page 41

Streamlining Python Development: A Guide to a Modern Project Setup
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 

Apache Tez: Accelerating Hadoop Query Processing

  • 1. Apache Tez: Accelerating Hadoop Query Processing. Jeff Markham, Technical Director, APAC, Hortonworks
  • 2. Tez – Introduction • Distributed execution framework targeted towards data-processing applications. • Based on expressing a computation as a dataflow graph. • Built on top of YARN, the resource management framework for Hadoop. • Open source Apache Incubator project, Apache licensed. © Hortonworks Inc. 2013
  • 3. YARN: Taking Hadoop Beyond Batch. [Diagram: Hadoop 1.0, MapReduce as base: Pig (data flow), Hive (SQL) and others (Cascading) run on MapReduce (cluster resource management & data processing), on top of HDFS (redundant, reliable storage). Hadoop 2.0, Apache Tez as base: batch MapReduce; data flow (Pig), SQL (Hive) and others (Cascading) on Tez (execution engine); real-time stream processing (Storm); online data processing (HBase, Accumulo); all on YARN (cluster resource management), on top of HDFS2 (redundant, reliable storage).]
  • 4. Apache Tez (“Speed”) • Replaces MapReduce as primitive for Pig, Hive, Cascading etc. – Smaller latency for interactive queries – Higher throughput for batch queries – 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo, Microsoft • Task with pluggable Input, Processor and Output: Tez Task = <Input, Processor, Output> • YARN ApplicationMaster to run DAG of Tez Tasks
  • 5. Tez: Building blocks for scalable data processing • Classical ‘Map’: HDFS Input → Map Processor → Sorted Output • Classical ‘Reduce’: Shuffle Input → Reduce Processor → HDFS Output • Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input → Reduce Processor → Sorted Output
  • 6. Hive-on-MR vs. Hive-on-Tez. Tez avoids unneeded writes to HDFS. Query: SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a UNION SELECT x, AVERAGE(y) AS AVG FROM c GROUP BY x ORDER BY AVG; [Diagram: two plans built from map (M) and reduce (R) nodes, labeled “Hive – MR” and “Hive – Tez”. The MR plan shows HDFS between its stages; the Tez plan does not. Operator labels in both plans: SELECT a.state; SELECT b.id; SELECT a.state, c.itemId; SELECT c.price; JOIN (a, c); JOIN(a, b) GROUP BY a.state COUNT(*) AVERAGE(c.price).]
  • 7. Tez Sessions … because Map/Reduce query startup is expensive • Tez Sessions – Hot containers ready for immediate use – Removes task and job launch overhead (~5s – 30s) • Hive – Session launch/shutdown in background (seamless, user not aware) – Submits query plan directly to Tez Session • Native Hadoop service, not ad-hoc
  • 8. Tez Delivers Interactive Query - Out of the Box! Feature: description (benefit)
      – Tez Session: overcomes Map-Reduce job-launch latency by pre-launching the Tez AppMaster (Latency)
      – Tez Container Pre-Launch: overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries (Latency)
      – Tez Container Re-Use: finished maps and reduces pick up more work rather than exiting; reduces latency and eliminates difficult split-size tuning; out of box performance! (Latency)
      – Runtime re-configuration of DAG: runtime query tuning by picking aggregation parallelism using online query statistics (Throughput)
      – Tez In-Memory Cache: hot data kept in RAM for fast access (Latency)
      – Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput (Throughput)
  • 9. Tez – Design Themes • Empowering End Users • Execution Performance
  • 10. Tez – Empowering End Users • Expressive dataflow definition APIs • Flexible Input-Processor-Output runtime model • Data type agnostic • Simplifying deployment
  • 11. Tez – Empowering End Users • Expressive dataflow definition APIs – Enable definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical plan at runtime. – Targeted towards data processing applications like Hive/Pig but not limited to them. Hive/Pig query plans naturally map to Tez dataflow graphs with no translation impedance. [Diagram: a graph of tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2.]
  • 12. Tez – Empowering End Users • Expressive dataflow definition APIs [Diagram: distributed sort. A Preprocessor Stage (Task-1, Task-2) sends samples to a Sampler, which sends ranges to the Partition Stage (Task-1, Task-2), which feeds the Aggregate Stage (Task-1, Task-2).]
  • 13. Tez – Empowering End Users • Flexible Input-Processor-Output runtime model – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs. – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks. [Diagram: Mapper = HDFSInput → MapProcessor → FileSortedOutput; Reducer = ShuffleInput → ReduceProcessor → HDFSOutput; PairwiseJoin = Input1, Input2 → JoinProcessor → FileSortedOutput.]
  • 14. Tez – Empowering End Users • Data type agnostic – Tez is only concerned with the movement of data: files and streams of bytes. – Does not impose any data format on the user application. An MR application can use key-value pairs on top of Tez. Hive and Pig can use tuple-oriented formats that are natural and native to them. [Diagram: a Tez task moves bytes as files or streams; user code interprets them as key-value pairs or tuples.]
  • 15. Tez – Empowering End Users • Simplifying deployment – Tez is a completely client side application. – No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that. – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production. – Leverages YARN local resources. [Diagram: two client machines, each running a TezClient, use different Tez libraries (Tez Lib 1, Tez Lib 2) stored on HDFS; TezTasks run on Node Managers.]
  • 16. Tez – Empowering End Users • Expressive dataflow definition APIs • Flexible Input-Processor-Output runtime model • Data type agnostic • Simplifying usage. With great power APIs come great responsibilities :) Tez is a framework on which end user applications can be built.
  • 17. Tez – Execution Performance • Performance gains over Map Reduce • Optimal resource management • Plan reconfiguration at runtime • Dynamic physical data flow decisions
  • 18. Tez – Execution Performance • Performance gains over Map Reduce – Eliminate replicated write barrier between successive computations. – Eliminate job launch overhead of workflow jobs. – Eliminate extra stage of map reads in every workflow job. – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes. [Diagram: Pig/Hive on MR compared with Pig/Hive on Tez.]
  • 19. Tez – Execution Performance • Plan reconfiguration at runtime – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality. – Advanced changes in dataflow graph structure. – Progressive graph construction in concert with user optimizer. [Diagram: Stage 1 (50 maps, 100 partitions, reading HDFS blocks) feeds Stage 2 (100 reducers, using YARN resources); with only 10 GB of data, Stage 2 is reconfigured from 100 to 10 reducers.]
  • 20. Tez – Execution Performance • Optimal resource management – Reuse YARN containers to launch new tasks. – Reuse YARN containers to enable shared objects across tasks. [Diagram: the Tez Application Master sends Start Task to a TezTask host running in a YARN container, receives Task Done, then sends Start Task again; TezTask1 and TezTask2 run in the same container and use shared objects.]
  • 21. Tez – Execution Performance • Dynamic physical data flow decisions – Decide the type of physical byte movement and storage on the fly. – Store intermediate data on distributed store, local store or in memory. – Transfer bytes via blocking files or streaming and the spectrum in between. [Diagram: Producer → Local File → Consumer; at runtime, for a small producer output, Producer → In-Memory → Consumer.]
  • 22. Tez – Sessions • Key for interactive queries • Analogous to database sessions and represents a connection between the user and the cluster • Run multiple DAGs / queries in the same session • Maintains a pool of reusable containers for low latency execution of tasks within and across queries • Takes care of data locality and releasing resources when idle • Session caches in the Application Master and in the container pool reduce recomputation and re-initialization [Diagram: the Client starts a session and submits DAGs to the Application Master, which contains the Task Scheduler and the Container Pool; containers hold a pre-warmed JVM and a Shared Object Registry.]
  • 23. Tez – Benchmark Performance. Significant (but not all) speed-ups due to Tez: • DAG support and runtime graph reconfiguration enable utilizing the parallelism of the cluster • Tez Session and container re-use enable efficient and low latency execution
  • 24. Tez – Performance Analysis. TPC-DS Query 27 with Hive on Tez. [Chart annotations: Tez Session populates container pool; AM; dimension table calculation and HDFS split generation in parallel; dimension tables broadcasted to Hive MapJoin tasks; final reducer prelaunched and fetches completed inputs.]
  • 25. Tez – Current status • Apache Incubator Project – Rapid development. Over 600 JIRAs opened. Over 400 resolved. – Growing community of contributors and users. • Focus on stability – Testing and quality are highest priority. – Code ready and deployed on multi-node environments. • Support for a vast topology of DAGs – Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes. – Hive re-targeted to use Tez for execution of queries (HIVE-4660). – Work started on Pig to use Tez for execution of scripts (PIG-3446).
  • 26. Tez – Roadmap • Richer DAG support – Support for co-scheduling and streaming – Better fault tolerance with checkpoints • Performance optimizations – More efficiencies in transfer of data – Improve session performance • Usability – Stability and testability – Recovery and history – Tools for performance analysis and debugging
  • 27. Tez – Key Takeaways • Distributed execution framework that works on computations represented as dataflow graphs • Naturally maps to execution plans produced by query optimizers • Customizable execution architecture designed to enable dynamic performance optimizations at runtime • Works out of the box with the platform figuring out the hard stuff • Spans the spectrum of interactive latency to batch • Open source Apache project: your use-cases and code are welcome • It works and is already being used by Hive and Pig
  • 28. Thank You!
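
Slide 19 describes runtime plan reconfiguration, with Stage 2 dropping from 100 reducers to 10 once only 10 GB of data turns out to be present. The Python sketch below illustrates that idea in isolation; it is not Tez code and does not use any Tez API. The function names and the 1 GB-per-reducer target are assumptions made for the example, and merging contiguous partitions onto fewer reducers is just one plausible way to keep the 100 partitions already written by Stage 1.

```python
GB = 1024 ** 3


def choose_reducer_count(input_bytes, planned_reducers, bytes_per_reducer=GB):
    """Size the reducer stage to the data actually produced upstream.

    bytes_per_reducer is a hypothetical tuning knob for this sketch,
    not a real Tez setting. The result never exceeds the planned count
    and is never less than 1.
    """
    if input_bytes <= 0:
        return 1
    # Integer ceiling division avoids float rounding on large sizes.
    needed = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(planned_reducers, needed))


def merge_partitions(num_partitions, num_reducers):
    """Give each reducer a contiguous range of upstream partitions.

    Partitions are spread as evenly as possible; the first few reducers
    take one extra partition when the split is not exact.
    """
    base, extra = divmod(num_partitions, num_reducers)
    ranges = []
    start = 0
    for i in range(num_reducers):
        size = base + (1 if i < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


# The numbers from slide 19: 100 partitions, 100 planned reducers, 10 GB of data.
reducers = choose_reducer_count(10 * GB, planned_reducers=100)
assignment = merge_partitions(100, reducers)
print(reducers, [len(r) for r in assignment])
```

With the slide's numbers, each of the resulting reducers reads ten of the original partitions, so the map output does not need to be repartitioned.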