Data Processing over YARN
Tez – Introduction 
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN, the resource management framework for Hadoop.
• Open source Apache project and Apache licensed.
© Hortonworks Inc. 2014
Hadoop 1 -> Hadoop 2
HADOOP 1.0
• Pig (data flow), Hive (SQL), Others (Cascading)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0
• Data flow: Pig; SQL: Hive; Others (Cascading), on Tez (execution engine)
• Batch: MapReduce
• Real-time stream processing: Storm
• Online data processing: HBase, Accumulo
• YARN (cluster resource management)
• HDFS2 (redundant, reliable storage)
Monolithic
• Resource Management
• Execution Engine
• User API
Layered
• Resource Management – YARN
• Execution Engine – Tez
• User API – Hive, Pig, Cascading, Your App!
Tez – Problems that it addresses 
• Expressing the computation
– Direct and elegant representation of the data processing flow
• Performance
– Late binding: make decisions as late as possible, using real data available at runtime
– Leverage the resources of the cluster efficiently
– Just work out of the box!
– Customizable engine to let applications tailor the job to meet their specific requirements
• Operational simplicity
– Painless to operate, experiment and upgrade
Tez – Simplifying Operations 
• Tez is a pure YARN application. Easy and safe to try it out! 
• No deployments to do, no servers to run 
• Enables running different versions concurrently. Easy to test new 
functionality while keeping stable versions for production. 
• Leverages YARN local resources. 
[Figure: Tez Lib 1 and Tez Lib 2 are stored on HDFS; a TezClient runs on each client machine, and TezTasks run under the Node Managers.]
Tez – Expressing the computation 
Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
• Vertices in the graph represent data transformations 
• Edges represent data movement from producers to consumers 
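The vertex and edge model above can be sketched in plain Java with no Tez dependency. The names below (`DagSketch`, `addEdge`, `executionOrder`) are hypothetical and chosen for illustration, and the wiring in `main` is an assumed reading of a distributed sort job. This is a minimal sketch of ordering a dataflow graph so that every producer precedes its consumers, not Tez's own scheduler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a dataflow DAG whose vertices are
// transformations and whose edges carry data from producer to consumer.
class DagSketch {
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();
    private final Map<String, Integer> pendingInputs = new LinkedHashMap<>();

    void addVertex(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        pendingInputs.putIfAbsent(name, 0);
    }

    void addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        consumers.get(producer).add(consumer);
        pendingInputs.merge(consumer, 1, Integer::sum);
    }

    // Kahn's algorithm: a vertex becomes runnable once all its producers are done.
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(pendingInputs);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((vertex, inputs) -> {
            if (inputs == 0) ready.add(vertex);
        });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String vertex = ready.poll();
            order.add(vertex);
            for (String consumer : consumers.get(vertex)) {
                if (remaining.merge(consumer, -1, Integer::sum) == 0) ready.add(consumer);
            }
        }
        if (order.size() != consumers.size()) {
            throw new IllegalStateException("graph has a cycle, so it is not a DAG");
        }
        return order;
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch();
        // Assumed wiring, for illustration only.
        dag.addEdge("preprocessor", "sampler");
        dag.addEdge("preprocessor", "partition");
        dag.addEdge("sampler", "partition");
        dag.addEdge("partition", "aggregate");
        System.out.println(dag.executionOrder());
    }
}
```

A graph with a cycle is rejected, which is the property that lets a framework always find some vertex whose inputs are complete.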
[Figure: Distributed Sort. Stages shown: Preprocessor Stage, Sampler, Partition Stage and Aggregate Stage, each stage with Task-1 and Task-2; edges are labelled Samples and Ranges.]
MR is a 2-vertex sub-set of Tez 
But Tez is so much more 
Tez – Expressing the computation 
Tez provides two APIs for defining the work:
• DAG API
– Defines the structure of the data processing and the relationship between producers and consumers
– Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
– This is how all the tasks in the job get specified
• Runtime API
– Defines the interface through which the framework and app code interact with each other
– App code transforms data and moves it between tasks
– This is how we specify what actually executes in each task on the cluster nodes
Tez – DAG API
[Figure: example DAG with vertices map1, map2, reduce1, reduce2 and join1; edges are labelled Scatter_Gather, Bipartite, Sequential.]
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;

// Define DAG
DAG dag = DAG.create("sessionize");
// Define Vertex-M1
Vertex m1 = Vertex.create("M_1",
    ProcessorDescriptor.create(MapProcessor_1.class.getName()));
// Define Vertex-R1
Vertex r1 = Vertex.create("R_1",
    ProcessorDescriptor.create(ReduceProcessor_1.class.getName()));
…
…
// Define Edge (edge between m1 and r1)
Edge e1 = Edge.create(m1, r1,
    OrderedPartitionedKVEdgeConfig.newBuilder(…).build()
        .createDefaultEdgeProperty());
// Connect them
dag.addVertex(m1).addVertex(r1).addEdge(e1)…
Defines the global processing flow
Tez – Different Edge Properties
• Scatter-Gather
• Broadcast
• One-to-One
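The three data movement patterns named on this slide differ in which pieces of a producer task's output reach which consumer task. With scatter-gather, each producer partitions its output and partition i goes to consumer i. With broadcast, the whole output goes to every consumer. With one-to-one, producer i feeds only consumer i. The sketch below models that routing in plain Java; `EdgeRoutingSketch`, `Pattern` and `route` are hypothetical names for illustration and are not Tez classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: which pieces of one producer task's output
// each consumer task receives under the three data movement patterns.
class EdgeRoutingSketch {
    enum Pattern { SCATTER_GATHER, BROADCAST, ONE_TO_ONE }

    // result.get(c) lists the output pieces of producer task `producer` that
    // consumer task c reads. Scatter-gather output has one piece per consumer;
    // broadcast and one-to-one output is a single piece numbered 0.
    static List<List<Integer>> route(Pattern pattern, int producer, int numConsumers) {
        List<List<Integer>> perConsumer = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            List<Integer> pieces = new ArrayList<>();
            switch (pattern) {
                case SCATTER_GATHER:
                    pieces.add(c);                      // partition c goes to consumer c
                    break;
                case BROADCAST:
                    pieces.add(0);                      // every consumer gets the whole output
                    break;
                case ONE_TO_ONE:
                    if (c == producer) pieces.add(0);   // only the matching consumer reads it
                    break;
            }
            perConsumer.add(pieces);
        }
        return perConsumer;
    }

    public static void main(String[] args) {
        for (Pattern p : Pattern.values()) {
            System.out.println(p + ": producer 1 -> " + route(p, 1, 3));
        }
    }
}
```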
Tez – Logical DAG expansion at Runtime 
[Figure: the logical DAG with vertices map1, map2, reduce1, reduce2 and join1, expanded into tasks at runtime.]
Tez – Runtime API building blocks 
Tez – Library of Inputs and Outputs 
• Classical ‘Map’: HDFS Input -> Map Processor -> Sorted Output
• Classical ‘Reduce’: Shuffle Input -> Reduce Processor -> HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output
• What is built in?
– Hadoop InputFormat/OutputFormat
– OrderedPartitioned Key-Value Input/Output
– UnorderedPartitioned Key-Value Input/Output
– Key-Value Input/Output
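The input, processor and output building blocks on this slide can be illustrated with a plain-Java sketch. The `Input`, `Processor` and `Output` interfaces below are simplified, hypothetical stand-ins rather than the Tez Runtime API types, and in-memory lists stand in for HDFS and the shuffle. Under those assumptions, a classical 'Map' task followed by a classical 'Reduce' task might be composed like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: simplified stand-ins, not the Tez Runtime API.
class TaskCompositionSketch {
    interface Input { List<String> read(); }
    interface Processor { List<String> process(List<String> records); }
    interface Output { void write(List<String> records); }

    // A task reads from its input, applies its processor, and writes to its output.
    static void runTask(Input input, Processor processor, Output output) {
        output.write(processor.process(input.read()));
    }

    // Classical 'Map' followed by classical 'Reduce', over in-memory lists.
    static List<String> wordCount(List<String> lines) {
        List<String> intermediate = new ArrayList<>();
        List<String> result = new ArrayList<>();

        runTask(
            () -> lines,                       // stands in for HDFS input
            records -> {                       // map processor: split lines into words
                List<String> words = new ArrayList<>();
                for (String line : records) words.addAll(Arrays.asList(line.split(" ")));
                return words;
            },
            records -> {                       // sorted output
                List<String> sorted = new ArrayList<>(records);
                Collections.sort(sorted);
                intermediate.addAll(sorted);
            });

        runTask(
            () -> intermediate,                // stands in for shuffle input
            records -> {                       // reduce processor: count runs of equal keys
                List<String> counts = new ArrayList<>();
                int i = 0;
                while (i < records.size()) {
                    int j = i;
                    while (j < records.size() && records.get(j).equals(records.get(i))) j++;
                    counts.add(records.get(i) + "=" + (j - i));
                    i = j;
                }
                return counts;
            },
            result::addAll);                   // stands in for HDFS output
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("b a", "c a")));
    }
}
```

Swapping the second task's output for another sorted output would give the intermediate 'Reduce' used in a Map-Reduce-Reduce chain, without changing the processor.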
Tez – Container Re-Use 
• Reuse YARN containers/JVMs to launch new tasks 
• Reduce scheduling and launching delays 
• Shared in-memory data across tasks 
• JVM JIT friendly execution 
[Figure: the Tez Application Master, in its own YARN container, sends Start Task to a TezTask Host running in a YARN container/JVM, receives Task Done, and sends Start Task again; TezTask1 and TezTask2 run in the same JVM and use Shared Objects.]
Container reuse 
• Tez specific feature 
• Run an entire DAG using the same containers 
• Different vertices use same container 
• Saves time talking to YARN for new containers
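The saving from container reuse can be illustrated with a small plain-Java simulation. `ContainerReuseSketch` and its methods are hypothetical names, and the model is deliberately simple: it assumes vertices run one after another and that each vertex needs all of its tasks' containers at once. It only counts how many containers have to be requested, not how long anything takes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only: counts how many containers must be requested
// when idle containers are handed to the next task instead of being released.
class ContainerReuseSketch {
    private final boolean reuse;
    private final Deque<Integer> idle = new ArrayDeque<>();
    private int launched = 0;

    ContainerReuseSketch(boolean reuse) {
        this.reuse = reuse;
    }

    // Returns a container id for a task, launching a new one only if none is idle.
    int acquire() {
        if (reuse && !idle.isEmpty()) return idle.pop();
        return launched++;
    }

    // A finished task hands its container back; without reuse it is simply dropped.
    void release(int containerId) {
        if (reuse) idle.push(containerId);
    }

    int containersLaunched() {
        return launched;
    }

    // Runs `vertices` vertices in sequence, each with `tasksPerVertex` concurrent tasks.
    static int run(boolean reuse, int vertices, int tasksPerVertex) {
        ContainerReuseSketch pool = new ContainerReuseSketch(reuse);
        for (int v = 0; v < vertices; v++) {
            int[] held = new int[tasksPerVertex];
            for (int t = 0; t < tasksPerVertex; t++) held[t] = pool.acquire();
            for (int id : held) pool.release(id);
        }
        return pool.containersLaunched();
    }

    public static void main(String[] args) {
        System.out.println("without reuse: " + run(false, 3, 4) + " containers");
        System.out.println("with reuse:    " + run(true, 3, 4) + " containers");
    }
}
```

In this model the number of containers requested with reuse depends only on the widest vertex, while without reuse it grows with the total number of tasks.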
Tez – Sessions
• Standard concepts of pre-launch and pre-warm applied
• Key for interactive queries
• Represents a connection between the user and the cluster
• Multiple DAGs/queries executed in the same AM
• Containers re-used across queries
• Takes care of data locality and releasing resources when idle
[Figure: the client starts a session and submits DAGs to the Application Master, which contains a Task Scheduler, a Container Pool, a Shared Object Registry and pre-warmed JVMs.]
Tez – Re-Use in Action (In Session) 
Task Execution Timeline 
Tez – Auto Reduce Parallelism 
© Hortonworks Inc. 2011
Tez – End User Benefits 
• Better performance
– Framework performance + application performance
• Better utilization of cluster resources
– Efficient use of allocated resources
• Better predictability of results
– Minimized queuing delays
• Reduced load on HDFS
– Removes unnecessary HDFS writes
• Reduced network usage
– Efficient data transfer using new data patterns
• Increased developer productivity
– Lets the user concentrate on application logic instead of Hadoop internals
Tez – Real World Examples 
Tez – Broadcast Edge 
SELECT ss.ss_item_sk, avg_price, inv.inv_quantity_on_hand 
FROM (select avg(ss_sales_price) as avg_price, ss_item_sk from store_sales 
group by ss_item_sk) ss 
JOIN inventory inv 
ON (inv.inv_item_sk = ss.ss_item_sk); 
[Figure: two execution plans for the query above, drawn with map (M) and reduce (R) tasks and HDFS. In the first plan, one stage scans Store Sales and performs the group by and aggregation, writes to HDFS, and a second stage scans Inventory and the aggregated Store Sales output and performs a shuffle join. The second plan shows the same two steps with an edge labelled broadcast.]
Tez – Multiple Outputs 
FROM (SELECT * FROM store_sales, date_dim WHERE ss_sold_date_sk = d_date_sk and d_year = 2000) 
INSERT INTO TABLE t1 SELECT distinct ss_item_sk 
INSERT INTO TABLE t2 SELECT distinct ss_customer_sk; 
Hive: multi-insert queries
• Hive – MR: map join of date_dim/store sales; join materialized on HDFS; two MR jobs to do the distinct
• Hive – Tez: broadcast join (scan date_dim, join store sales); distinct for customer + items
Tez – Data at scale 
[Figure: Hive TPC-DS results at 10 TB scale.]
Tez – what if you can’t get enough containers? 
• 78 vertices + 8374 tasks on 50 YARN containers
Tez – Designed for big, busy clusters 
• Number of stages in the DAG
– The more stages in the DAG, the better Tez performs relative to MR.
• Cluster/queue capacity
– The more congested a queue is, the better Tez performs relative to MR, due to container reuse.
• Size of intermediate output
– The larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage (cross-rack traffic).
• Size of data in the job
– For smaller data and more stages, Tez performs better relative to MR, because launch overhead is a larger share of the total time for smaller jobs.
• Move workloads from gateway boxes to the cluster
– Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster.
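The point about launch overhead and job size is simple arithmetic, sketched below in plain Java. `LaunchOverheadSketch` and `overheadShare` are hypothetical names, the model assumes a fixed launch cost per stage, and the inputs in `main` are made-up numbers chosen for illustration, not measurements from any cluster.

```java
// Hypothetical illustration only: share of total job time spent on per-stage
// launch overhead. All inputs are made-up numbers, not measurements.
class LaunchOverheadSketch {
    // overhead = stages * launchSecondsPerStage; share = overhead / (overhead + work)
    static double overheadShare(int stages, double launchSecondsPerStage,
                                double totalWorkSeconds) {
        double overhead = stages * launchSecondsPerStage;
        return overhead / (overhead + totalWorkSeconds);
    }

    public static void main(String[] args) {
        // Same assumed launch cost per stage; only data size or stage count changes.
        System.out.printf("small job, 4 stages: %.0f%%%n", 100 * overheadShare(4, 10.0, 40.0));
        System.out.printf("large job, 4 stages: %.0f%%%n", 100 * overheadShare(4, 10.0, 3960.0));
        System.out.printf("small job, 8 stages: %.0f%%%n", 100 * overheadShare(8, 10.0, 40.0));
    }
}
```

Under this model the overhead share rises as the work shrinks or the stage count grows, which is where removing per-stage launches helps most.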
Tez – Adoption 
• Hive
– Hadoop standard for declarative access via SQL-like interface
• Pig
– Hadoop standard for procedural scripting and pipeline processing
• Cascading
– Developer friendly Java API and SDK
– Scalding (Scala API on Cascading)
• Commercial Vendors
– ETL: use Tez instead of MR or custom pipelines
– Analytics vendors: use Tez as a target platform for scaling parallel analytical tools to large data-sets
Tez – Community 
• Early adopters and code contributors welcome 
– Adopters to drive more scenarios. Contributors to make them happen. 
• Technical blog series 
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing 
• Useful links 
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/tez
– Developer list: dev@tez.apache.org
– User list: user@tez.apache.org
– Issues list: issues@tez.apache.org

More Related Content

What's hot

Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache TezGetInData
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingDataWorks Summit
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hiverxu
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep DiveHortonworks
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache TezGal Vinograd
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataDataWorks Summit
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudDataWorks Summit/Hadoop Summit
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranMapR Technologies
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDataWorks Summit
 

What's hot (19)

Quick Introduction to Apache Tez
Quick Introduction to Apache TezQuick Introduction to Apache Tez
Quick Introduction to Apache Tez
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Stinger Initiative - Deep Dive
Stinger Initiative - Deep DiveStinger Initiative - Deep Dive
Stinger Initiative - Deep Dive
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Tuning up with Apache Tez
Tuning up with Apache TezTuning up with Apache Tez
Tuning up with Apache Tez
 
Pig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big DataPig on Tez - Low Latency ETL with Big Data
Pig on Tez - Low Latency ETL with Big Data
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Operationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the CloudOperationalizing YARN based Hadoop Clusters in the Cloud
Operationalizing YARN based Hadoop Clusters in the Cloud
 
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer ShiranThe Future of Hadoop: MapR VP of Product Management, Tomer Shiran
The Future of Hadoop: MapR VP of Product Management, Tomer Shiran
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
Hive Now Sparks
Hive Now SparksHive Now Sparks
Hive Now Sparks
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 

Similar to Data Processing over YARN with Tez Framework

Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoopRommel Garcia
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingHortonworks
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0Adam Muise
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR Technologies
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitData Con LA
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...Amazon Web Services
 

Similar to Data Processing over YARN with Tez Framework (20)

Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Interactive query in hadoop
Interactive query in hadoopInteractive query in hadoop
Interactive query in hadoop
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.02013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
2013 Nov 20 Toronto Hadoop User Group (THUG) - Hadoop 2.2.0
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
MapR-DB Elasticsearch Integration
MapR-DB Elasticsearch IntegrationMapR-DB Elasticsearch Integration
MapR-DB Elasticsearch Integration
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
AWS Partner Webcast - Hadoop in the Cloud: Unlocking the Potential of Big Dat...
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 

More from InMobi Technology

PostgreSQL 9.5 - Major Features
PostgreSQL 9.5 - Major FeaturesPostgreSQL 9.5 - Major Features
PostgreSQL 9.5 - Major FeaturesInMobi Technology
 
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQLToro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQLInMobi Technology
 
Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in CloudInMobi Technology
 
Building Machine Learning Pipelines
Building Machine Learning PipelinesBuilding Machine Learning Pipelines
Building Machine Learning PipelinesInMobi Technology
 
Ensemble Methods for Algorithmic Trading
Ensemble Methods for Algorithmic TradingEnsemble Methods for Algorithmic Trading
Ensemble Methods for Algorithmic TradingInMobi Technology
 
24/7 Monitoring and Alerting of PostgreSQL
24/7 Monitoring and Alerting of PostgreSQL24/7 Monitoring and Alerting of PostgreSQL
24/7 Monitoring and Alerting of PostgreSQLInMobi Technology
 
Reflective and Stored XSS- Cross Site Scripting
Reflective and Stored XSS- Cross Site ScriptingReflective and Stored XSS- Cross Site Scripting
Reflective and Stored XSS- Cross Site ScriptingInMobi Technology
 
Introduction to Threat Modeling
Introduction to Threat ModelingIntroduction to Threat Modeling
Introduction to Threat ModelingInMobi Technology
 
The Synapse IoT Stack: Technology Trends in IOT and Big Data
The Synapse IoT Stack: Technology Trends in IOT and Big DataThe Synapse IoT Stack: Technology Trends in IOT and Big Data
The Synapse IoT Stack: Technology Trends in IOT and Big DataInMobi Technology
 
What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014InMobi Technology
 
Security News Bytes Null Dec Meet Bangalore
Security News Bytes Null Dec Meet BangaloreSecurity News Bytes Null Dec Meet Bangalore
Security News Bytes Null Dec Meet BangaloreInMobi Technology
 
PCI DSS v3 - Protecting Cardholder data
PCI DSS v3 - Protecting Cardholder dataPCI DSS v3 - Protecting Cardholder data
PCI DSS v3 - Protecting Cardholder dataInMobi Technology
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformInMobi Technology
 
Shodan- That Device Search Engine
Shodan- That Device Search EngineShodan- That Device Search Engine
Shodan- That Device Search EngineInMobi Technology
 

More from InMobi Technology (20)

Optimizer Hints
Optimizer HintsOptimizer Hints
Optimizer Hints
 
Case Studies on PostgreSQL
Case Studies on PostgreSQLCase Studies on PostgreSQL
Case Studies on PostgreSQL
 
PostgreSQL 9.5 - Major Features
PostgreSQL 9.5 - Major FeaturesPostgreSQL 9.5 - Major Features
PostgreSQL 9.5 - Major Features
 
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQLToro DB- Open-source, MongoDB-compatible database,  built on top of PostgreSQL
Toro DB- Open-source, MongoDB-compatible database, built on top of PostgreSQL
 
Building Spark as Service in Cloud
Building Spark as Service in CloudBuilding Spark as Service in Cloud
Building Spark as Service in Cloud
 
Building Machine Learning Pipelines
Building Machine Learning PipelinesBuilding Machine Learning Pipelines
Building Machine Learning Pipelines
 
Ensemble Methods for Algorithmic Trading
Ensemble Methods for Algorithmic TradingEnsemble Methods for Algorithmic Trading
Ensemble Methods for Algorithmic Trading
 
Backbone & Graphs
Backbone & GraphsBackbone & Graphs
Backbone & Graphs
 
24/7 Monitoring and Alerting of PostgreSQL
24/7 Monitoring and Alerting of PostgreSQL24/7 Monitoring and Alerting of PostgreSQL
24/7 Monitoring and Alerting of PostgreSQL
 
Reflective and Stored XSS- Cross Site Scripting
Reflective and Stored XSS- Cross Site ScriptingReflective and Stored XSS- Cross Site Scripting
Reflective and Stored XSS- Cross Site Scripting
 
Introduction to Threat Modeling
Introduction to Threat ModelingIntroduction to Threat Modeling
Introduction to Threat Modeling
 
HTTP Basics Demo
HTTP Basics DemoHTTP Basics Demo
HTTP Basics Demo
 
The Synapse IoT Stack: Technology Trends in IOT and Big Data
The Synapse IoT Stack: Technology Trends in IOT and Big DataThe Synapse IoT Stack: Technology Trends in IOT and Big Data
The Synapse IoT Stack: Technology Trends in IOT and Big Data
 
What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014What's new in Hadoop Yarn- Dec 2014
What's new in Hadoop Yarn- Dec 2014
 
Attacking Web Proxies
Attacking Web ProxiesAttacking Web Proxies
Attacking Web Proxies
 
Security News Bytes Null Dec Meet Bangalore
Security News Bytes Null Dec Meet BangaloreSecurity News Bytes Null Dec Meet Bangalore
Security News Bytes Null Dec Meet Bangalore
 
Matriux blue
Matriux blueMatriux blue
Matriux blue
 
PCI DSS v3 - Protecting Cardholder data
PCI DSS v3 - Protecting Cardholder dataPCI DSS v3 - Protecting Cardholder data
PCI DSS v3 - Protecting Cardholder data
 
Running Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale PlatformRunning Hadoop as Service in AltiScale Platform
Running Hadoop as Service in AltiScale Platform
 
Shodan- That Device Search Engine
Shodan- That Device Search EngineShodan- That Device Search Engine
Shodan- That Device Search Engine
 

Recently uploaded

Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Data Processing over YARN with Tez Framework

  • 1. Data Processing over YARN
  • 2. Tez – Introduction
      • Distributed execution framework targeted towards data-processing applications.
      • Based on expressing a computation as a dataflow graph.
      • Highly customizable to meet a broad spectrum of use cases.
      • Built on top of YARN – the resource management framework for Hadoop.
      • Open source Apache project and Apache licensed.
  • 3. Hadoop 1 -> Hadoop 2
      • HADOOP 1.0: Pig (data flow), Hive (sql) and Others (cascading) run on MapReduce (cluster resource management & data processing), which runs on HDFS (redundant, reliable storage)
      • HADOOP 2.0: Data Flow (Pig), SQL (Hive) and Others (Cascading) run on Tez (execution engine), alongside Batch (MapReduce), Real Time Stream Processing (Storm) and Online Data Processing (HBase, Accumulo); all run on YARN (cluster resource management), which runs on HDFS2 (redundant, reliable storage)
      • Monolithic
          • Resource Management
          • Execution Engine
          • User API
      • Layered
          • Resource Management – YARN
          • Execution Engine – Tez
          • User API – Hive, Pig, Cascading, Your App!
  • 4. Tez – Problems that it addresses
      • Expressing the computation
          • Direct and elegant representation of the data processing flow
      • Performance
          • Late binding: make decisions as late as possible, using real data at runtime
          • Leverage the resources of the cluster efficiently
          • Just work out of the box!
          • Customizable engine to let applications tailor the job to meet their specific requirements
      • Operation simplicity
          • Painless to operate, experiment and upgrade
  • 5. Tez – Simplifying Operations
      • Tez is a pure YARN application. Easy and safe to try it out!
      • No deployments to do, no servers to run
      • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
      • Leverages YARN local resources.
      [Diagram labels: HDFS (Tez Lib 1, Tez Lib 2); two Client Machines, each with a TezClient; two Node Managers, each with a TezTask]
  • 6. Tez – Expressing the computation
      Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
      • Vertices in the graph represent data transformations
      • Edges represent data movement from producers to consumers
      [Diagram: Distributed Sort. Labels: Preprocessor Stage (Task-1, Task-2), Sampler, Samples, Ranges, Partition Stage (Task-1, Task-2), Aggregate Stage (Task-1, Task-2)]
  • 7. MR is a 2-vertex subset of Tez
  • 8. But Tez is so much more
  • 9. Tez – Expressing the computation
      Tez defines the following APIs to define the work:
      • DAG API
          • Defines the structure of the data processing and the relationship between producers and consumers
          • Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
          • This is how all the tasks in the job get specified
      • Runtime API
          • Defines the interface through which the framework and app code interact with each other
          • App code transforms data and moves it between tasks
          • This is how we specify what actually executes in each task on the cluster nodes
  • 10. Tez – DAG API
      Defines the global processing flow
      [Diagram: vertices map1, map2, reduce1, reduce2, join1; edges labelled Scatter_Gather, Bipartite, Sequential]
      // Define DAG
      DAG dag = DAG.create("sessionize");
      // Define Vertex-M1
      Vertex m1 = Vertex.create("M_1",
          ProcessorDescriptor.create(MapProcessor_1.class.getName()));
      // Define Vertex-R1
      Vertex r1 = Vertex.create("R_1",
          ProcessorDescriptor.create(ReduceProcessor_1.class.getName()));
      … …
      // Define Edge (edge between m1 and r1)
      Edge e1 = Edge.create(m1, r1,
          OrderedPartitionedKVEdgeConfig.newBuilder(…).build()
              .createDefaultEdgeProperty());
      // Connect them
      dag.addVertex(m1).addVertex(r1).addEdge(e1)…
  • 11. Tez – Different Edge Properties: Scatter-Gather, Broadcast, One-to-One
  • 12. Tez – Logical DAG expansion at Runtime
      [Diagram: vertices map1, map2, reduce1, reduce2, join1]
  • 13. Tez – Runtime API building blocks
  • 14. Tez – Library of Inputs and Outputs
      • Classical ‘Map’: HDFS Input -> Map Processor -> Sorted Output
      • Classical ‘Reduce’: Shuffle Input -> Reduce Processor -> HDFS Output
      • Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output
      • What is built in?
          – Hadoop InputFormat/OutputFormat
          – OrderedPartitioned Key-Value Input/Output
          – UnorderedPartitioned Key-Value Input/Output
          – Key-Value Input/Output
  • 15. Tez – Container Re-Use
      • Reuse YARN containers/JVMs to launch new tasks
      • Reduce scheduling and launching delays
      • Shared in-memory data across tasks
      • JVM JIT friendly execution
      [Diagram labels: Tez Application Master (YARN Container); TezTask Host (YARN Container / JVM) with TezTask1, TezTask2 and Shared Objects; messages Start Task, Task Done, Start Task]
  • 16. Container reuse
      • Tez specific feature
      • Run an entire DAG using the same containers
      • Different vertices use same container
      • Saves time talking to YARN for new containers
  • 17. Tez – Sessions
      • Standard concepts of pre-launch and pre-warm applied
      • Key for interactive queries
      • Represents a connection between the user and the cluster
      • Multiple DAGs/queries executed in the same AM
      • Containers re-used across queries
      • Takes care of data locality and releasing resources when idle
      [Diagram labels: Client (Start Session, Submit DAG); Application Master (Task Scheduler, Container Pool, Shared Object Registry, Pre Warmed JVM)]
  • 18. Tez – Re-Use in Action (In Session): Task Execution Timeline
  • 19. Tez – Auto Reduce Parallelism
  • 20. Tez – Auto Reduce Parallelism
  • 21. Tez – End User Benefits
      • Better performance
          • Framework performance + application performance
      • Better utilization of cluster resources
          • Efficient use of allocated resources
      • Better predictability of results
          • Minimized queuing delays
      • Reduced load on HDFS
          • Removes unnecessary HDFS writes
      • Reduced network usage
          • Efficient data transfer using new data patterns
      • Increased developer productivity
          • Lets the user concentrate on application logic instead of Hadoop internals
  • 22. Tez – Real World Examples
  • 23. Tez – Broadcast Edge
      SELECT ss.ss_item_sk, avg_price, inv.inv_quantity_on_hand
      FROM (select avg(ss_sales_price) as avg_price, ss_item_sk
            from store_sales group by ss_item_sk) ss
      JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk);
      [Diagram: two execution plans built from M (map) and R (reduce) stages. Stage labels in both: "Store Sales scan. Group by and aggregation." and "Inventory and Store Sales (aggr.) output scan and shuffle join." One plan has an extra HDFS step between the stages; the other has an edge labelled "broadcast".]
  • 24. Tez – Multiple Outputs
      Hive: Multi-insert queries
      FROM (SELECT * FROM store_sales, date_dim
            WHERE ss_sold_date_sk = d_date_sk and d_year = 2000)
      INSERT INTO TABLE t1 SELECT distinct ss_item_sk
      INSERT INTO TABLE t2 SELECT distinct ss_customer_sk;
      • Hive – MR: Map join date_dim/store sales; Materialize join on HDFS; Two MR jobs to do the distinct
      • Hive – Tez: Broadcast Join (scan date_dim, join store sales); Distinct for customer + items
  • 25. Tez – Data at scale: Hive TPC-DS, Scale 10TB
  • 26. Tez – what if you can’t get enough containers?
      • 78 vertices + 8374 tasks on 50 YARN containers
  • 27. Tez – Designed for big, busy clusters
      • Number of stages in the DAG
          • The more stages in the DAG, the better Tez performs relative to MR.
      • Cluster/queue capacity
          • The more congested a queue is, the better Tez performs relative to MR, due to container reuse.
      • Size of intermediate output
          • The larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage (cross-rack traffic).
      • Size of data in the job
          • For smaller data and more stages, Tez performs better relative to MR, because launch overhead is a larger share of the total time for smaller jobs.
      • Move workloads from gateway boxes to the cluster
          • Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster.
  • 28. Tez – Adoption
      • Hive
          • Hadoop standard for declarative access via SQL-like interface
      • Pig
          • Hadoop standard for procedural scripting and pipeline processing
      • Cascading
          • Developer friendly Java API and SDK
          • Scalding (Scala API on Cascading)
      • Commercial Vendors
          • ETL: Use Tez instead of MR or custom pipelines
          • Analytics Vendors: Use Tez as a target platform for scaling parallel analytical tools to large data-sets
  • 29. Tez – Community
      • Early adopters and code contributors welcome
          – Adopters to drive more scenarios. Contributors to make them happen.
      • Technical blog series
          – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
      • Useful links
          – Work tracking: https://issues.apache.org/jira/browse/TEZ
          – Code: https://github.com/apache/tez
          – Developer list: dev@tez.apache.org
          – User list: user@tez.apache.org
          – Issues list: issues@tez.apache.org
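
Slides 9 to 12 describe a logical DAG that Tez expands into tasks at runtime, with scatter-gather, broadcast and one-to-one edges. The following is a minimal sketch of that idea in plain Java. It deliberately does not use the Tez API: `DagSketch`, `Movement`, `addVertex`, `addEdge` and `expand` are hypothetical names invented for illustration, the task counts are picked by the caller, and no cycle check is performed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a logical DAG and its runtime expansion. This is not the Tez API. */
final class DagSketch {

    /** The three edge properties named on slide 11. */
    enum Movement { SCATTER_GATHER, BROADCAST, ONE_TO_ONE }

    /** A logical edge between two named vertices. */
    static final class Edge {
        final String from;
        final String to;
        final Movement movement;

        Edge(String from, String to, Movement movement) {
            this.from = from;
            this.to = to;
            this.movement = movement;
        }
    }

    // Vertex name -> number of parallel tasks (hypothetical values chosen by the caller).
    private final Map<String, Integer> tasksPerVertex = new LinkedHashMap<>();
    private final List<Edge> edges = new ArrayList<>();

    DagSketch addVertex(String name, int tasks) {
        if (tasks < 1) {
            throw new IllegalArgumentException("a vertex needs at least one task");
        }
        tasksPerVertex.put(name, tasks);
        return this;
    }

    DagSketch addEdge(String from, String to, Movement movement) {
        if (!tasksPerVertex.containsKey(from) || !tasksPerVertex.containsKey(to)) {
            throw new IllegalArgumentException("add both vertices before connecting them");
        }
        if (movement == Movement.ONE_TO_ONE
                && !tasksPerVertex.get(from).equals(tasksPerVertex.get(to))) {
            throw new IllegalArgumentException("one-to-one needs equal task counts");
        }
        edges.add(new Edge(from, to, movement));
        return this;
    }

    /** Expands every logical edge into producer-task -> consumer-task links. */
    List<String> expand() {
        List<String> links = new ArrayList<>();
        for (Edge e : edges) {
            int producers = tasksPerVertex.get(e.from);
            int consumers = tasksPerVertex.get(e.to);
            for (int p = 0; p < producers; p++) {
                if (e.movement == Movement.ONE_TO_ONE) {
                    links.add(link(e, p, p));
                    continue;
                }
                // Scatter-gather and broadcast both link every producer task to every
                // consumer task; they differ in the data sent (one partition vs. all output).
                for (int c = 0; c < consumers; c++) {
                    links.add(link(e, p, c));
                }
            }
        }
        return links;
    }

    private static String link(Edge e, int producer, int consumer) {
        return e.from + "[" + producer + "] -> " + e.to + "[" + consumer + "]";
    }

    public static void main(String[] args) {
        DagSketch dag = new DagSketch()
                .addVertex("M_1", 3)
                .addVertex("R_1", 2)
                .addEdge("M_1", "R_1", Movement.SCATTER_GATHER);
        dag.expand().forEach(System.out::println);
    }
}
```

The caller only states the logical structure (two vertices, one edge); the per-task links are derived from it, which mirrors the split the slides draw between the DAG API and what runs on the cluster.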

Editor's Notes

  1. Container Reuse, Fault Tolerance, Recovery, Routing Data Efficiently, Elasticity
  2. It is hard to expect the framework to do the last bit of optimization. Sometimes the user would like to instruct the framework on what needs to be done at runtime; Tez allows such customizations. A Tez deployment is easy to operate, experiment with and upgrade. This is hugely important, since the previous MR deployment required downtime for the entire cluster.
  3. Talk a little bit about custom edges as well
  4. Hive has written its own processor
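
Container reuse, listed in note 1 and described on slides 15, 16 and 26, can be illustrated with a small counting exercise. The sketch below is a toy model under stated assumptions, not the Tez scheduler: `ContainerReuseSketch` and both method names are hypothetical, every task is assumed to fit any container, and locality, failures and idle timeouts are ignored. It only counts how many containers have to be launched.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy count of container launches with and without reuse. This is not the Tez scheduler. */
final class ContainerReuseSketch {

    /** Without reuse, every task gets a freshly launched container. */
    static int launchesWithoutReuse(int tasks) {
        return tasks;
    }

    /** With reuse, a finished task hands its container to the next waiting task. */
    static int launchesWithReuse(int tasks, int maxContainers) {
        if (maxContainers < 1) {
            throw new IllegalArgumentException("need at least one container");
        }
        Deque<Integer> idle = new ArrayDeque<>(); // containers whose task has finished
        Deque<Integer> busy = new ArrayDeque<>(); // containers running a task, oldest first
        int launched = 0;
        for (int t = 0; t < tasks; t++) {
            if (idle.isEmpty() && launched == maxContainers) {
                // No capacity left: the oldest running task finishes and frees its container.
                idle.add(busy.remove());
            }
            // Prefer an idle container; launch a new one only while under the limit.
            int container = idle.isEmpty() ? launched++ : idle.remove();
            busy.add(container);
        }
        return launched;
    }

    public static void main(String[] args) {
        // Figures from slide 26: 8374 tasks on 50 YARN containers.
        System.out.println("without reuse: " + launchesWithoutReuse(8374));
        System.out.println("with reuse:    " + launchesWithReuse(8374, 50));
    }
}
```

With the slide 26 figures, the model launches 50 containers instead of 8374, which is the saving in "time talking to YARN for new containers" that slide 16 refers to.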