Data Processing over YARN
Tez – Introduction 
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN, the resource management framework for Hadoop.
• Open source Apache project and Apache licensed.
© Hortonworks Inc. 2014
Hadoop 1 -> Hadoop 2 
HADOOP 1.0
• Pig (data flow), Hive (sql), Others (cascading)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0
• Data Flow: Pig; SQL: Hive; Others (Cascading); all on Tez (execution engine)
• Batch: MapReduce
• Real Time Stream Processing: Storm
• Online Data Processing: HBase, Accumulo
• YARN (cluster resource management)
• HDFS2 (redundant, reliable storage)
Monolithic
• Resource Management
• Execution Engine
• User API
Layered
• Resource Management: YARN
• Execution Engine: Tez
• User API: Hive, Pig, Cascading, Your App!
Tez – Problems that it addresses 
• Expressing the computation
– Direct and elegant representation of the data processing flow
• Performance
– Late binding: make decisions as late as possible, using real data at runtime
– Leverage the resources of the cluster efficiently
– Just work out of the box!
– Customizable engine to let applications tailor the job to meet their specific requirements
• Operational simplicity
– Painless to operate, experiment and upgrade
Tez – Simplifying Operations 
• Tez is a pure YARN application. Easy and safe to try out!
• No deployments to do, no servers to run
• Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
• Leverages YARN local resources.
[Figure: HDFS holding Tez Lib 1 and Tez Lib 2; two client machines, each running a TezClient; two Node Managers, each running a TezTask]
Tez – Expressing the computation 
Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs).
• Vertices in the graph represent data transformations
• Edges represent data movement from producers to consumers
[Figure: Distributed Sort. A Preprocessor Stage, a Sampler, a Partition Stage and an Aggregate Stage; each stage runs Task-1 and Task-2; edge labels: Samples, Ranges]
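The vertex/edge model above can be sketched in a few lines of plain Java. This is not the Tez API: `DataflowGraphSketch` and its methods are hypothetical names, and the wiring of the distributed sort (preprocessor feeding both the sampler and the partition stage) is an assumption, since the figure's edges do not survive in the text. The sketch only shows that once a job is a DAG, a valid execution order falls out of the graph itself.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy dataflow graph: vertices are transformations, edges are producer -> consumer data movement. */
final class DataflowGraphSketch {
    // Adjacency list: producer vertex name -> names of its consumers (insertion-ordered).
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();

    DataflowGraphSketch addVertex(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    DataflowGraphSketch addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        consumers.get(producer).add(consumer);
        return this;
    }

    /** Kahn's algorithm: every producer is listed before the vertices that consume its output. */
    List<String> executionOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : consumers.keySet()) pendingInputs.put(v, 0);
        for (List<String> outs : consumers.values())
            for (String c : outs) pendingInputs.merge(c, 1, Integer::sum);

        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pendingInputs.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String c : consumers.get(v))
                if (pendingInputs.merge(c, -1, Integer::sum) == 0) ready.add(c);
        }
        if (order.size() != consumers.size())
            throw new IllegalStateException("graph has a cycle, so it is not a DAG");
        return order;
    }

    /** Assumed wiring of the distributed sort example. */
    static DataflowGraphSketch distributedSort() {
        return new DataflowGraphSketch()
            .addEdge("preprocessor", "sampler")    // samples
            .addEdge("sampler", "partition")       // ranges
            .addEdge("preprocessor", "partition")  // the data itself (assumed edge)
            .addEdge("partition", "aggregate");
    }

    public static void main(String[] args) {
        System.out.println(distributedSort().executionOrder());
    }
}
```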
MR is a 2-vertex sub-set of Tez 
But Tez is so much more 
Tez – Expressing the computation 
Tez defines the following APIs to define the work:
• DAG API
– Defines the structure of the data processing and the relationship between producers and consumers
– Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
– This is how all the tasks in the job get specified
• Runtime API
– Defines the interface through which the framework and app code interact with each other
– App code transforms data and moves it between tasks
– This is how we specify what actually executes in each task on the cluster nodes
Tez – DAG API
Defines the global processing flow
[Figure: example DAG with vertices map1, map2, reduce1, reduce2 and join1; edge annotations: Scatter_Gather, Bipartite, Sequential]
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.runtime.library.conf.OrderedPartitionedKVEdgeConfig;

// Define DAG
DAG dag = DAG.create("sessionize");
// Define Vertex-M1 (MapProcessor_1 is the application's own processor class)
Vertex m1 = Vertex.create("M_1",
    ProcessorDescriptor.create(MapProcessor_1.class.getName()));
// Define Vertex-R1
Vertex r1 = Vertex.create("R_1",
    ProcessorDescriptor.create(ReduceProcessor_1.class.getName()));
…
…
// Define Edge (edge between m1 and r1)
Edge e1 = Edge.create(m1, r1,
    OrderedPartitionedKVEdgeConfig.newBuilder(…).build()
        .createDefaultEdgeProperty());
// Connect them
dag.addVertex(m1).addVertex(r1).addEdge(e1)…
Tez - Different Edge Properties 
[Figure panels: Scatter-Gather, Broadcast, One-to-One]
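The three edge types differ in which consumer tasks see which part of a producer task's output. The following is a minimal sketch of that routing in plain Java, not Tez code: `EdgeRoutingSketch`, `route` and the modulo partitioner are illustrative assumptions, and keys are plain integers to keep the placement easy to follow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrates how one producer task's keys reach consumer tasks under each edge type. */
final class EdgeRoutingSketch {
    enum Movement { SCATTER_GATHER, BROADCAST, ONE_TO_ONE }

    /** Returns, for each consumer task, the keys it receives from ONE producer task. */
    static List<List<Integer>> route(Movement movement, int producerTask,
                                     List<Integer> keys, int consumerTasks) {
        if (movement == Movement.ONE_TO_ONE && (producerTask < 0 || producerTask >= consumerTasks))
            throw new IllegalArgumentException("one-to-one needs a matching consumer task");

        List<List<Integer>> perConsumer = new ArrayList<>();
        for (int c = 0; c < consumerTasks; c++) perConsumer.add(new ArrayList<>());

        for (int key : keys) {
            switch (movement) {
                case SCATTER_GATHER:
                    // Output is partitioned; consumer j gathers partition j from every producer.
                    perConsumer.get(Math.floorMod(key, consumerTasks)).add(key);
                    break;
                case BROADCAST:
                    // Every consumer reads the producer's complete output.
                    for (List<Integer> inbox : perConsumer) inbox.add(key);
                    break;
                case ONE_TO_ONE:
                    // Producer task i feeds only consumer task i.
                    perConsumer.get(producerTask).add(key);
                    break;
            }
        }
        return perConsumer;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(0, 1, 2, 3, 4, 5);
        for (Movement m : Movement.values())
            System.out.println(m + " -> " + route(m, 0, keys, 3));
    }
}
```

Scatter-gather is the pattern MapReduce's shuffle uses; broadcast and one-to-one are the extra patterns that let a DAG avoid a full shuffle when one is not needed.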
Tez – Logical DAG expansion at Runtime 
[Figure: the logical DAG with vertices map1, map2, reduce1, reduce2 and join1]
Tez – Runtime API building blocks 
Tez – Library of Inputs and Outputs 
• Classical ‘Map’: HDFS Input -> Map Processor -> Sorted Output
• Classical ‘Reduce’: Shuffle Input -> Reduce Processor -> HDFS Output
• Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output
• What is built in?
– Hadoop InputFormat/OutputFormat
– OrderedPartitioned Key-Value Input/Output
– UnorderedPartitioned Key-Value Input/Output
– Key-Value Input/Output
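The Input -> Processor -> Output composition above can be illustrated with a small, self-contained Java sketch. The interfaces below are deliberately simplified stand-ins and are not the real Tez Runtime API; `Input`, `Output`, `Processor`, `linesInput` and `SortedOutput` are all hypothetical names. The in-memory input plays the role of "HDFS Input", and the key-ordered output plays the role of "Sorted Output" in the classical 'Map' arrangement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

/** Simplified stand-ins for the three runtime building blocks (not the Tez interfaces). */
final class RuntimeBlocksSketch {
    interface Input { Iterator<String> read(); }
    interface Output { void write(String key, int value); }
    interface Processor { void run(Input in, Output out); }

    /** Stand-in for an "HDFS Input": serves lines from memory. */
    static Input linesInput(List<String> lines) {
        return lines::iterator;
    }

    /** Stand-in for a "Sorted Output": keeps values grouped under keys in key order. */
    static final class SortedOutput implements Output {
        final TreeMap<String, List<Integer>> sorted = new TreeMap<>();

        @Override
        public void write(String key, int value) {
            sorted.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
    }

    /** A 'Map'-style processor: emits (word, 1) for every word it reads. */
    static final Processor WORD_COUNT_MAP = (in, out) -> {
        Iterator<String> lines = in.read();
        while (lines.hasNext())
            for (String word : lines.next().trim().split("\\s+"))
                if (!word.isEmpty()) out.write(word, 1);
    };

    public static void main(String[] args) {
        SortedOutput out = new SortedOutput();
        WORD_COUNT_MAP.run(linesInput(Arrays.asList("b a", "a c")), out);
        System.out.println(out.sorted); // {a=[1, 1], b=[1], c=[1]}
    }
}
```

Because the processor only sees the two interfaces, swapping the sorted output for a different output changes the task's role without touching the processor logic, which is the point of shipping a library of inputs and outputs.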
Tez – Container Re-Use 
• Reuse YARN containers/JVMs to launch new tasks
• Reduce scheduling and launching delays
• Shared in-memory data across tasks
• JVM JIT friendly execution
[Figure: the Tez Application Master, in its own YARN container, exchanges Start Task / Task Done / Start Task messages with a TezTask Host in a YARN Container / JVM that runs TezTask1 and TezTask2 and holds Shared Objects]
Container reuse 
• Tez specific feature 
• Run an entire DAG using the same containers 
• Different vertices use same container 
• Saves time talking to YARN for new containers
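A rough back-of-the-envelope model shows why reuse matters. The sketch below is an assumption-laden simplification, not a description of the Tez scheduler: tasks are assumed to run in equal-length waves across a fixed number of container slots, and the launch and task durations are made-up numbers. The task and container counts are borrowed from the deck's own example of 8374 tasks on 50 YARN containers.

```java
/** Toy cost model for container reuse; all timings are hypothetical. */
final class ContainerReuseSketch {
    /** Number of container launches requested from YARN. */
    static long containerLaunches(int tasks, int slots, boolean reuse) {
        if (slots <= 0) throw new IllegalArgumentException("slots must be positive");
        return reuse ? Math.min(tasks, slots) : tasks;
    }

    /** Wall-clock estimate, assuming tasks run in full waves over the available slots. */
    static long makespanMillis(int tasks, int slots, long launchMillis, long taskMillis, boolean reuse) {
        if (slots <= 0) throw new IllegalArgumentException("slots must be positive");
        long waves = (tasks + slots - 1) / slots; // ceiling division
        return reuse
            ? launchMillis + waves * taskMillis     // pay the launch cost once per slot
            : waves * (launchMillis + taskMillis);  // pay the launch cost in every wave
    }

    public static void main(String[] args) {
        int tasks = 8374, slots = 50;               // figures quoted in the deck
        long launch = 2_000, run = 5_000;           // assumed durations in milliseconds
        System.out.println("launches without reuse: " + containerLaunches(tasks, slots, false));
        System.out.println("launches with reuse:    " + containerLaunches(tasks, slots, true));
        System.out.println("makespan without reuse: " + makespanMillis(tasks, slots, launch, run, false) + " ms");
        System.out.println("makespan with reuse:    " + makespanMillis(tasks, slots, launch, run, true) + " ms");
    }
}
```

Under these assumed timings the launch overhead accounts for the whole difference between the two estimates; the shorter the tasks, the larger that share becomes.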
Tez – Sessions
• Standard concepts of pre-launch and pre-warm applied
• Key for interactive queries
• Represents a connection between the user and the cluster
• Multiple DAGs/queries executed in the same AM
• Containers re-used across queries
• Takes care of data locality and releasing resources when idle
[Figure: a Client starts a session and submits DAGs to the Application Master, which contains a Task Scheduler, a Container Pool with pre-warmed JVMs, and a Shared Object Registry]
Tez – Re-Use in Action (In Session) 
[Figure: Task Execution Timeline]
Tez – Auto Reduce Parallelism 
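The slides for this topic are images, so only the title survives. As a hedged illustration of the idea, consistent with the deck's "late binding" point about deciding at runtime from real data: once some producer tasks have finished, their observed output size can be extrapolated to estimate the total, and the reducer count can be shrunk to fit. The method name, parameters and thresholds below are hypothetical and do not reproduce Tez's actual logic or configuration.

```java
/** Sketch of picking a reducer count at runtime from observed producer output. */
final class AutoReduceSketch {
    static int chooseReducers(long bytesSeen, int producersDone, int producersTotal,
                              long desiredBytesPerReducer, int configuredReducers) {
        if (producersDone <= 0 || producersDone > producersTotal)
            throw new IllegalArgumentException("need 1..producersTotal finished producers");
        if (desiredBytesPerReducer <= 0 || configuredReducers <= 0)
            throw new IllegalArgumentException("sizes and counts must be positive");

        // Extrapolate total output from the producers that have finished so far.
        double estimatedTotal = (double) bytesSeen * producersTotal / producersDone;
        long needed = (long) Math.ceil(estimatedTotal / desiredBytesPerReducer);

        // Only ever shrink the configured parallelism, and keep at least one reducer.
        return (int) Math.max(1, Math.min(configuredReducers, needed));
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // 25 of 100 producers done with 250 MB seen; aim for about 100 MB per reducer.
        System.out.println(chooseReducers(250 * mb, 25, 100, 100 * mb, 200));
    }
}
```

With the example inputs the estimate is about 1000 MB in total, so 10 reducers are enough even though 200 were configured.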
Tez – End User Benefits 
• Better performance
– Framework performance + application performance
• Better utilization of cluster resources
– Efficient use of allocated resources
• Better predictability of results
– Minimized queuing delays
• Reduced load on HDFS
– Removes unnecessary HDFS writes
• Reduced network usage
– Efficient data transfer using new data patterns
• Increased developer productivity
– Lets the user concentrate on application logic instead of Hadoop internals
Tez – Real World Examples 
Tez – Broadcast Edge 
SELECT ss.ss_item_sk, avg_price, inv.inv_quantity_on_hand
FROM (SELECT avg(ss_sales_price) AS avg_price, ss_item_sk
      FROM store_sales
      GROUP BY ss_item_sk) ss
JOIN inventory inv
ON (inv.inv_item_sk = ss.ss_item_sk);
[Figure: two execution plans for this query, drawn as map (M) and reduce (R) tasks with HDFS between or after stages. Both plans have a stage labelled "Store Sales scan. Group by and aggregation." and a stage labelled "Inventory and Store Sales (aggr.) output scan and shuffle join."; the second plan carries the label "broadcast"]
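A broadcast join can be sketched as follows: the small side is copied in full to every task, and each task joins its own slice of the large side locally, with no shuffle of the large side. The sketch assumes the aggregated store_sales result is the small, broadcast side and inventory is the large side; that reading of the figure, and every name in the code, is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One task's share of a broadcast join: local hash lookup against the full small side. */
final class BroadcastJoinSketch {
    /**
     * @param inventorySlice this task's rows of the large side, as {inv_item_sk, inv_quantity_on_hand}
     * @param avgPriceByItem the complete small side (ss_item_sk -> avg_price), broadcast to every task
     * @return joined rows formatted as "item,avg_price,quantity"
     */
    static List<String> joinSlice(List<int[]> inventorySlice, Map<Integer, Double> avgPriceByItem) {
        List<String> rows = new ArrayList<>();
        for (int[] inv : inventorySlice) {
            Double avg = avgPriceByItem.get(inv[0]);
            if (avg != null) rows.add(inv[0] + "," + avg + "," + inv[1]); // inner join: skip misses
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Integer, Double> avgPriceByItem = new HashMap<>();
        avgPriceByItem.put(1, 9.5);
        avgPriceByItem.put(2, 4.0);
        List<int[]> slice = Arrays.asList(new int[]{1, 10}, new int[]{3, 7}, new int[]{2, 5});
        System.out.println(joinSlice(slice, avgPriceByItem)); // [1,9.5,10, 2,4.0,5]
    }
}
```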
Tez – Multiple Outputs 
Hive: multi-insert queries
FROM (SELECT * FROM store_sales, date_dim
      WHERE ss_sold_date_sk = d_date_sk AND d_year = 2000)
INSERT INTO TABLE t1 SELECT DISTINCT ss_item_sk
INSERT INTO TABLE t2 SELECT DISTINCT ss_customer_sk;
• Hive – MR
– Map join of date_dim/store sales
– Materialize the join on HDFS
– Two MR jobs to do the distinct
• Hive – Tez
– Broadcast join (scan date_dim, join store sales)
– Distinct for customer + items
Tez – Data at scale 
[Figure: Hive TPC-DS, scale 10TB]
Tez – what if you can’t get enough containers? 
• 78 vertices + 8374 tasks on 50 YARN containers
Tez – Designed for big, busy clusters 
• Number of stages in the DAG
– The more stages in the DAG, the better Tez performs relative to MR.
• Cluster/queue capacity
– The more congested a queue is, the better Tez performs relative to MR, due to container reuse.
• Size of intermediate output
– The larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage (cross-rack traffic).
• Size of data in the job
– For smaller data and more stages, Tez performs better relative to MR, because launch overhead is a larger percentage of total time for smaller jobs.
• Move workloads from gateway boxes to the cluster
– Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster.
Tez – Adoption
• Hive
– Hadoop standard for declarative access via SQL-like interface
• Pig
– Hadoop standard for procedural scripting and pipeline processing
• Cascading
– Developer friendly Java API and SDK
– Scalding (Scala API on Cascading)
• Commercial Vendors
– ETL: Use Tez instead of MR or custom pipelines
– Analytics Vendors: Use Tez as a target platform for scaling parallel analytical tools to large data-sets
Tez – Community
• Early adopters and code contributors welcome
– Adopters to drive more scenarios. Contributors to make them happen.
• Technical blog series
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/tez
– Developer list: dev@tez.apache.org
– User list: user@tez.apache.org
– Issues list: issues@tez.apache.org

Tez Data Processing over Yarn

  • 1. Data Processing over YARN Page 1
  • 2. Tez – Introduction • Distributed execution framework targeted towards data-processing © Hortonworks Inc. 2014 applications. • Based on expressing a computation as a dataflow graph. • Highly customizable to meet a broad spectrum of use cases. • Built on top of YARN – the resource management framework for Hadoop. • Open source Apache project and Apache licensed. Page 2
  • 3. Hadoop 1 -> Hadoop 2 © Hortonworks Inc. 2014 HADOOP 1.0 Pig (data flow) Hive (sql) Others (cascading) MapReduce (cluster resource management & data processing) HDFS (redundant, reliable storage) HADOOP 2.0 Data Flow (execution engine) YARN Tez (cluster resource management) HDFS2 (redundant, reliable storage) Pig SQL Hive Others (Cascading) Batch MapReduce Real Time Stream Processing Storm Online Data Processing HBase, Accumulo Monolithic • Resource Management • Execution Engine • User API Layered • Resource Management – YARN • Execution Engine – Tez • User API – Hive, Pig, Cascading, Your App!
  • 4. Tez – Problems that it addresses • Expressing the computation • Direct and elegant representation of the data processing flow © Hortonworks Inc. 2014 • Performance • Late Binding : Make decisions as late as possible using real data from at runtime • Leverage the resources of the cluster efficiently • Just work out of the box! • Customizable engine to let applications tailor the job to meet their specific requirements • Operation simplicity • Painless to operate, experiment and upgrade Page 4
  • 5. Tez – Simplifying Operations • Tez is a pure YARN application. Easy and safe to try it out! • No deployments to do, no servers to run • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production. • Leverages YARN local resources. HDFS Tez Lib 1 Tez Lib 2 TezClient TezTask TezTask © Hortonworks Inc. 2014 Page 5 Client Machine Node Manager Node Manager TezClient Client Machine
  • 6. Tez – Expressing the computation Distributed data processing jobs typically look like DAGs (Directed Acyclic Graph). • Vertices in the graph represent data transformations • Edges represent data movement from producers to consumers © Hortonworks Inc. 2014 Page 6 Preprocessor Stage Partition Stage Aggregate Stage Sampler Task-1 Task-2 Task-1 Task-2 Task-1 Task-2 Sample s Ranges Distributed Sort
  • 7. MR is a 2-vertex sub-set of Tez © Hortonworks Inc. 2014 Page 7
  • 8. But Tez is so much more © Hortonworks Inc. 2014 Page 8
  • 9. Tez – Expressing the computation Tez defines the following APIs to define the work • DAG API • Defines the structure of the data processing and the relationship between producers and consumers • Enable definition of complex data flow pipelines using simple graph connection API’s. Tez expands the logical DAG at runtime • This is how all the tasks in the job get specified © Hortonworks Inc. 2014 • Runtime API • Defines the interface using which the framework and app code interact with each other • App code transforms data and moves it between tasks • This is how we specify what actually executes in each task on the cluster nodes Page 9
  • 10. © Hortonworks Inc. 2014 reduce1 map2 reduce2 join1 map1 Scatter_Gather Bipartite Sequential Scatter_Gather Bipartite Sequential Tez – DAG API // Define DAG DAG dag = DAG.create(“sessionize”); // Define Vertex-M1 Vertex m1 = Vertex.create("M_1", ProcessorDescriptor.create(MapProcessor_1.class.getName())); //Define Vertex-R1 Vertex r1 = Vertex.create(”R_1", ProcessorDescriptor.create(ReduceProcessor_1.class.getName())); … … // Define Edge (edge between m1 and r1) Edge e1 =Edge.create(m1, r1, OrderedPartitionedKVEdgeConfig.newBuilder(…).build() .createDefaultEdgeProperty()); // Connect them dag.addVertex(m1).addVertex(r1).addEdge(e1)… Page 10 Defines the global processing flow
  • 11. Tez - Different Edge Properties Scatter-Gather Broadcast One-to-One 11
  • 12. Tez – Logical DAG expansion at Runtime © Hortonworks Inc. 2014 Page 12 reduce1 map2 reduce2 join1 map1
  • 13. Tez – Runtime API building blocks 13
  • 14. Tez – Library of Inputs and Outputs Classical ‘Map’ Sorted Output © Hortonworks Inc. 2014 Page 14 Classical ‘Reduce’ Map Processor HDFS Input Intermediate ‘Reduce’ for Map-Reduce-Reduce Reduce Processor Shuffle Input HDFS Output Reduce Processor Shuffle Input Sorted Output • What is built in? – Hadoop InputFormat/OutputFormat – OrderedPartitioned Key-Value Input/Output – UnorderedPartitioned Key-Value Input/Output – Key-Value Input/Output
  • 15. Tez – Container Re-Use • Reuse YARN containers/JVMs to launch new tasks • Reduce scheduling and launching delays • Shared in-memory data across tasks • JVM JIT friendly execution © Hortonworks Inc. 2014 Page 15 TezTask Host TezTask1 TezTask2 Shared Objects YARN Container / JVM Tez Application Master YARN Container Start Task Task Done Start Task
  • 16. © Hortonworks Inc. 2014 Container reuse • Tez specific feature • Run an entire DAG using the same containers • Different vertices use same container • Saves time talking to YARN for new containers
  • 17. © Hortonworks Inc. 2014 Tez – Sessions Page 17 Client Start Session Application Master Submit DAG Task Scheduler Container Pool Shared Object Registry Pre Warmed JVM Sessions • Standard concepts of pre-launch and pre-warm applied • Key for Interactive queries • Represents a connection between the user and the cluster • Multiple DAGs/Queries executed in the same AM • Containers re-used across queries • Takes care of data locality and releasing resources when idle
  • 18. Tez – Re-Use in Action (In Session) Task Execution Timeline © Hortonworks Inc. 2014
  • 19. Tez – Auto Reduce Parallelism
  • 20. Tez – Auto Reduce Parallelism
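The diagrams for slides 19 and 20 did not survive extraction. As a hedged sketch of the general idea behind auto reduce parallelism (use the output sizes of finished upstream tasks to estimate the total, then shrink the number of reduce tasks to fit), consider the following. This is a simplified illustration under stated assumptions, not Tez's actual implementation; the class, method and parameter names are all hypothetical.

```java
/** Simplified estimate-and-shrink logic; an illustration, not Tez's implementation. */
class AutoReduceSketch {

    /**
     * @param observedBytes       output bytes reported by the source tasks that have finished
     * @param completedSources    number of finished source tasks
     * @param totalSources        total number of source tasks
     * @param desiredBytesPerTask target input size for each reduce task (assumed tunable)
     * @param minTasks            lower bound on reduce tasks (assumed tunable)
     * @param configuredTasks     reduce parallelism chosen before the job started
     */
    static int chooseParallelism(long observedBytes, int completedSources, int totalSources,
                                 long desiredBytesPerTask, int minTasks, int configuredTasks) {
        if (completedSources <= 0) {
            return configuredTasks;                // no statistics yet, keep the original plan
        }
        // Extrapolate from the finished producers to the whole upstream vertex.
        double estimatedTotal = (double) observedBytes * totalSources / completedSources;
        int needed = (int) Math.ceil(estimatedTotal / desiredBytesPerTask);
        // Respect the floor, and only ever shrink: never exceed the configured value.
        return Math.min(configuredTasks, Math.max(minTasks, needed));
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        long hundredMiB = 100L << 20;
        // 25 of 100 source tasks have produced 1 GiB so far; 200 reducers were configured.
        System.out.println(chooseParallelism(oneGiB, 25, 100, hundredMiB, 1, 200));
    }
}
```

In the example, 1 GiB from a quarter of the sources extrapolates to about 4 GiB in total, so at roughly 100 MiB per task far fewer than the 200 configured reducers are needed. This ties back to the deck's "late binding" theme: the decision is made at runtime from real data.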
  • 21. Tez – End User Benefits • Better performance: framework performance + application performance • Better utilization of cluster resources: efficient use of allocated resources • Better predictability of results: minimized queuing delays • Reduced load on HDFS: removes unnecessary HDFS writes • Reduced network usage: efficient data transfer using new data patterns • Increased developer productivity: lets the user concentrate on application logic instead of Hadoop internals
  • 22. Tez – Real World Examples
  • 23. Tez – Broadcast Edge. Query: SELECT ss.ss_item_sk, avg_price, inv.inv_quantity_on_hand FROM (select avg(ss_sales_price) as avg_price, ss_item_sk from store_sales group by ss_item_sk) ss JOIN inventory inv ON (inv.inv_item_sk = ss.ss_item_sk); [Two plan diagrams of map (M) and reduce (R) tasks reading from HDFS. Each carries the labels "Store Sales scan. Group by and aggregation." and "Inventory and Store Sales (aggr.) output scan and shuffle join."; the second plan is marked "broadcast".]
  • 24. Tez – Multiple Outputs (Hive: multi-insert queries). Query: FROM (SELECT * FROM store_sales, date_dim WHERE ss_sold_date_sk = d_date_sk and d_year = 2000) INSERT INTO TABLE t1 SELECT distinct ss_item_sk INSERT INTO TABLE t2 SELECT distinct ss_customer_sk; • Hive – MR: map join date_dim/store sales; materialize join on HDFS; two MR jobs to do the distinct • Hive – Tez: broadcast join (scan date_dim, join store sales); distinct for customer + items
  • 25. Tez – Data at scale (Hive, TPC-DS, scale 10 TB)
  • 26. Tez – What if you can’t get enough containers? • 78 vertices + 8374 tasks on 50 YARN containers
  • 27. Tez – Designed for big, busy clusters • Number of stages in the DAG: the more stages the DAG has, the better Tez performs relative to MR. • Cluster/queue capacity: the more congested a queue is, the better Tez performs relative to MR, thanks to container reuse. • Size of intermediate output: the larger the intermediate output, the better Tez performs relative to MR, due to reduced HDFS usage (cross-rack traffic). • Size of data in the job: for smaller data and more stages, Tez performs better relative to MR, because launch overhead is a larger share of the total time for small jobs. • Move workloads from gateway boxes to the cluster: move as much work as possible to the cluster by modelling it via the job DAG, and exploit the parallelism and resources of the cluster.
  • 28. Tez – Adoption • Hive: Hadoop standard for declarative access via a SQL-like interface • Pig: Hadoop standard for procedural scripting and pipeline processing • Cascading: developer-friendly Java API and SDK; Scalding (Scala API on Cascading) • Commercial vendors – ETL: use Tez instead of MR or custom pipelines – Analytics vendors: use Tez as a target platform for scaling parallel analytical tools to large data sets
  • 29. Tez – Community • Early adopters and code contributors welcome – Adopters to drive more scenarios; contributors to make them happen. • Technical blog series – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing • Useful links – Work tracking: https://issues.apache.org/jira/browse/TEZ – Code: https://github.com/apache/tez – Developer list: dev@tez.apache.org – User list: user@tez.apache.org – Issues list: issues@tez.apache.org

Editor's Notes

  1. Container Reuse, Fault Tolerance, Recovery, Routing Data Efficiently, Elasticity
  2. It is hard to expect the framework to perform every last optimization. Sometimes the user wants to instruct the framework on what needs to be done at runtime, and Tez allows such customizations. A Tez deployment is easy to operate, experiment with, and upgrade. This is hugely important, since the previous MR deployment required downtime for the entire cluster.
  3. Talk a little bit about custom edges as well
  4. Hive has written its own processor