© Hortonworks Inc. 2015 Page 1
Apache Tez – Present and Future
Jeff Zhang (@zjffdu)
Rajesh Balamohan (@rajeshbalamohan)
Agenda
• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap
Example of MR versus Tez
[Figure: Hive - MR runs the query as four jobs separated by I/O synchronization barriers: Job 1 (Join a & b), Job 2 (Group by of a Join b), Job 3 (Group by of c), Job 4 (Join of S & R). Hive - Tez runs the same steps as a single job: Join a & b, Group by of a Join b, Group by of c, Join of S & R.]
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph (DAG).
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN, the resource management framework for Hadoop.
• Open source Apache project, Apache licensed.
What is DAG & Why DAG
• Directed Acyclic Graph
• Any complicated DAG can be composed of the following 3 basic paradigms
– Sequential (Projection, Filter, GroupBy, …)
– Merge (Join, Union, Intersect, …)
– Divide (Split, …)
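The three paradigms above can be combined into one graph and checked for acyclicity. The following is a minimal sketch in plain Python, not the Tez API; the vertex names are hypothetical and the ordering uses Kahn's algorithm as an illustration.

```python
from collections import deque


def topological_order(edges):
    """Return the vertices so that every edge points from an earlier to a later vertex."""
    vertices = set()
    for src, dst in edges:
        vertices.update((src, dst))
    indegree = {v: 0 for v in vertices}
    downstream = {v: [] for v in vertices}
    for src, dst in edges:
        downstream[src].append(dst)
        indegree[dst] += 1
    # Start from vertices with no inputs; sort for a deterministic result.
    ready = deque(sorted(v for v in vertices if indegree[v] == 0))
    order = []
    while ready:
        vertex = ready.popleft()
        order.append(vertex)
        for nxt in sorted(downstream[vertex]):
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(vertices):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


# Hypothetical vertex names, chosen only for illustration.
edges = [
    ("load_a", "filter_a"),  # Sequential: one vertex feeds the next
    ("filter_a", "join"),    # Merge: two inputs meet in "join"
    ("load_b", "join"),
    ("join", "group_by"),    # Divide: "join" feeds two consumers
    ("join", "store_raw"),
]
order = topological_order(edges)
print(order)
```

A cycle such as `[("a", "b"), ("b", "a")]` raises `ValueError`, which is the "acyclic" part of the definition.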
Expressing DAG in Tez API
• DAG API (Logic View)
– Allows the user to build the DAG
– Topological structure of the data computation flow
• Runtime API (Runtime View)
– Application logic of each computation unit (vertex)
– How to move/read/write data between vertices
DAG API (Logic View)
• Vertex (Processor, Parallelism, Resource, etc.)
• Edge (EdgeProperty)
– DataMovement
  – Scatter Gather (Join, GroupBy, …)
  – Broadcast (Pig Replicated Join / Hive Broadcast Join)
  – One-to-One (Pig Order by)
  – Custom
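To make the three built-in data movement types concrete, here is a minimal sketch in plain Python (not the Tez API). The function name `deliver` and the modulo partitioner are assumptions made for illustration; it only shows which consumer task receives which records.

```python
def deliver(movement, producer_outputs, num_consumers):
    """Route records from producer tasks to consumer tasks.

    producer_outputs: one list of (key, value) records per producer task.
    Returns one list of records per consumer task.
    """
    consumers = [[] for _ in range(num_consumers)]
    if movement == "scatter_gather":
        # Each producer partitions its output; partition i goes to consumer i.
        for records in producer_outputs:
            for key, value in records:
                consumers[key % num_consumers].append((key, value))
    elif movement == "broadcast":
        # Every consumer receives the full output of every producer.
        for records in producer_outputs:
            for inbox in consumers:
                inbox.extend(records)
    elif movement == "one_to_one":
        # Producer task i feeds consumer task i only.
        if len(producer_outputs) != num_consumers:
            raise ValueError("one_to_one needs equal producer and consumer counts")
        for index, records in enumerate(producer_outputs):
            consumers[index].extend(records)
    else:
        raise ValueError("unknown movement: " + movement)
    return consumers


outputs = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
scattered = deliver("scatter_gather", outputs, 2)
broadcasted = deliver("broadcast", outputs, 2)
paired = deliver("one_to_one", outputs, 2)
```

With integer keys and two consumers, scatter-gather sends even keys to consumer 0 and odd keys to consumer 1, while broadcast gives both consumers all four records.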
Runtime API (Runtime View)
[Figure: Input → Processor → Output]
• Input
– Through which the processor receives data on an edge
– One vertex can have multiple inputs
• Processor
– Application logic (one processor per vertex)
– Consumes the inputs and produces the outputs
• Output
– Through which the processor writes data to an edge
– One vertex can have multiple outputs
• Examples of Input/Output/Processor
– MRInput & MROutput (InputFormat/OutputFormat)
– OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather)
– UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One)
– PigProcessor/HiveProcessor
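The Input / Processor / Output split can be pictured with a small Python analogue. This is a sketch only: the class names `ListInput`, `ListOutput` and `SumByKeyProcessor` are hypothetical and do not correspond to Tez classes. It shows one processor consuming several inputs and writing to several outputs.

```python
class ListInput:
    """Stand-in for an Input: exposes the records that arrived on one edge."""

    def __init__(self, records):
        self._records = list(records)

    def read(self):
        return iter(self._records)


class ListOutput:
    """Stand-in for an Output: collects the records written to one edge."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class SumByKeyProcessor:
    """Application logic of one vertex: sum values per key across all inputs."""

    def run(self, inputs, outputs):
        totals = {}
        for edge_input in inputs.values():
            for key, value in edge_input.read():
                totals[key] = totals.get(key, 0) + value
        # Write every result to every output edge.
        for key in sorted(totals):
            for edge_output in outputs.values():
                edge_output.write(key, totals[key])


inputs = {"left": ListInput([("a", 1), ("b", 2)]), "right": ListInput([("a", 3)])}
outputs = {"sink": ListOutput(), "audit": ListOutput()}
SumByKeyProcessor().run(inputs, outputs)
print(outputs["sink"].records)
```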
Benefit of DAG
• Easier to express computation in DAG
• No intermediate data written to HDFS
• Less pressure on NameNode
• No resource queuing effort & less resource contention
• More optimization opportunity with more global context
Container-Reuse
• Reuse the same container across DAGs/Vertices/Tasks
• Benefits of Container-Reuse
– Fewer resources consumed
– Reduces the overhead of launching JVMs
– Reduces the overhead of negotiating with the ResourceManager
– Reduces the overhead of resource localization
– Reduces network IO
– Object Caching (Object Sharing)
Tez Session
• Multiple Jobs/DAGs in one AM
• Container-reuse across Jobs/DAGs
• Data sharing between Jobs/DAGs
Dynamic Parallelism Estimation
• VertexManager
– Listens to the status of the other vertices
– Coordinates and schedules its tasks
– Communication between vertices
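One way to picture dynamic parallelism estimation is a manager that watches upstream task completions and sizes its own vertex from the data volume seen so far. The sketch below is plain Python under stated assumptions: the class `SimpleVertexManager`, the linear extrapolation and the `desired_bytes_per_task` parameter are hypothetical and are not taken from the Tez code base.

```python
import math


class SimpleVertexManager:
    """Hypothetical manager that sizes a vertex from observed upstream output."""

    def __init__(self, num_source_tasks, max_parallelism, desired_bytes_per_task):
        self.num_source_tasks = num_source_tasks
        self.max_parallelism = max_parallelism
        self.desired_bytes_per_task = desired_bytes_per_task
        self.completed = 0
        self.observed_bytes = 0

    def on_source_task_completed(self, output_bytes):
        # Called when an upstream task reports completion and its output size.
        self.completed += 1
        self.observed_bytes += output_bytes

    def estimate_parallelism(self):
        if self.completed == 0:
            return self.max_parallelism  # nothing observed yet, keep the initial setting
        # Extrapolate the total output from the tasks seen so far.
        estimated_total = self.observed_bytes * self.num_source_tasks / self.completed
        needed = math.ceil(estimated_total / self.desired_bytes_per_task)
        return max(1, min(self.max_parallelism, needed))


manager = SimpleVertexManager(num_source_tasks=10, max_parallelism=100,
                              desired_bytes_per_task=100)
for _ in range(5):
    manager.on_source_task_completed(output_bytes=40)
parallelism = manager.estimate_parallelism()
print(parallelism)
```

In this example half of the ten source tasks have reported 40 bytes each, so the extrapolated total is 400 bytes and four tasks of 100 bytes suffice instead of the configured maximum of 100.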
ATS Integration
• Tez is fully integrated with YARN ATS (Application Timeline Service)
– DAG Status, DAG Metrics, Task Status, Task Metrics are captured
• Diagnostics & Performance analysis
– Data Source for monitoring & diagnostics
– Data Source for performance analysis
Recovery
• AM can crash in corner cases
– OOM
– Node failure
– …
• Continue from the last checkpoint
• Transparent to end users
[Figure: AM Crash]
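Order By of Pig
f = Load 'foo' as (x, y);
o = Order f by x;
[Figure: two plans for the Order By. One passes the sample histogram through HDFS: Load, Sample (Calculate Histogram), HDFS, Partition, Sort. The other is a single DAG of Load, Sample (Calculate Histogram), Partition, Sort, with edges labeled Broadcast, One-to-One, Scatter Gather, Scatter Gather.]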
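The Sample, Partition and Sort stages of the Order By can be sketched in plain Python. This is an illustration of range partitioning in general, not Pig's implementation; the function names and the quantile rule for picking boundaries are assumptions.

```python
import bisect


def sample_boundaries(sample, num_partitions):
    """Sample stage: pick num_partitions - 1 cut points from a sample of x values."""
    if not sample:
        raise ValueError("sample must not be empty")
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition(records, boundaries):
    """Partition stage: send each (x, y) record to the range its x falls into."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for x, y in records:
        partitions[bisect.bisect_right(boundaries, x)].append((x, y))
    return partitions


def order_by_x(records, sample, num_partitions):
    boundaries = sample_boundaries(sample, num_partitions)
    partitions = range_partition(records, boundaries)
    # Sort stage: each partition is sorted independently; because the ranges
    # do not overlap, concatenating them yields a total order.
    sorted_parts = [sorted(part, key=lambda rec: rec[0]) for part in partitions]
    return [rec for part in sorted_parts for rec in part]


records = [(5, "e"), (1, "a"), (9, "i"), (3, "c"), (7, "g"), (2, "b")]
result = order_by_x(records, sample=[5, 1, 9, 3], num_partitions=2)
print(result)
```

The boundaries are small compared with the data, which is why the figure shows them reaching the Partition stage over a Broadcast edge label.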
• Performance
–Speculation
–Intermediate File Improvements
–Better use of JVM Memory
–Shuffle Improvements
• Debuggability
–Tez UI
–Local mode
–Job Analysis Tools
–Shuffle Performance Analysis Tool
Speculation
• Good for clusters having good/slow nodes or heterogeneous hardware.
• Maintains periodic runtime statistics of tasks
• Triggers speculative attempt when estimated runtime > mean runtime
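The trigger rule above (estimated runtime > mean runtime) can be sketched as follows. This is plain Python under stated assumptions: estimating runtime as elapsed time divided by progress, and taking the mean over completed tasks, are illustrative choices and not necessarily how Tez computes them.

```python
from statistics import mean


def estimate_runtime(elapsed_seconds, progress):
    """Extrapolate total runtime from elapsed time and progress in (0, 1]."""
    if progress <= 0:
        return None  # no estimate is possible yet
    return elapsed_seconds / progress


def tasks_to_speculate(running, completed_runtimes):
    """Return ids of running tasks whose estimated runtime exceeds the mean.

    running: dict of task id -> (elapsed_seconds, progress).
    completed_runtimes: runtimes of tasks that already finished.
    """
    if not completed_runtimes:
        return []
    mean_runtime = mean(completed_runtimes)
    slow = []
    for task_id, (elapsed, progress) in running.items():
        estimate = estimate_runtime(elapsed, progress)
        if estimate is not None and estimate > mean_runtime:
            slow.append(task_id)
    return sorted(slow)


running = {
    "t1": (50, 0.5),  # estimated 100 s
    "t2": (90, 0.3),  # estimated 300 s
    "t3": (10, 0.0),  # no progress reported yet
}
candidates = tasks_to_speculate(running, completed_runtimes=[100, 120, 110])
print(candidates)
```

Here the mean completed runtime is 110 seconds, so only `t2` becomes a candidate for a speculative attempt.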
Intermediate File Format Improvements
• Used for storing intermediate data in Tez
• Drawbacks of earlier format
– Needs larger buffer in memory (due to duplicate keys)
– Bigger file size on disk
– Not ideal for all use cases
• New Intermediate File Format
– Works based on (K, List<V>)
– Provides 57% memory efficiency and 23% improvement in disk storage
[Figure: a Task writes Spill 1, Spill 2 and Spill 3, which are combined into a Merged Spill. Old IFile Format: every record stores Key Len, Value Len, the key and the value, so the key repeats for each value (K1 V1, K1 V2, K1 V3, K2 V1, …, K2 V5, K2 V6). New IFile Format: each key is stored once (Key Len, K1) followed by its values (Value Len V1, Value Len V2, Value Len V3, …), with V_END and RLE markers.]
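The saving from a (K, List<V>) layout can be estimated with a rough size model. The sketch below is plain Python and does not reproduce the actual IFile byte encoding: the fixed 4-byte length fields and the single end-of-values marker per key are simplifying assumptions.

```python
from itertools import groupby

LEN_BYTES = 4  # assumption: every length field and marker takes 4 bytes


def flat_size(sorted_records):
    """Old-style layout: key length, value length, key and value per record."""
    return sum(2 * LEN_BYTES + len(key) + len(value)
               for key, value in sorted_records)


def grouped_size(sorted_records):
    """(K, List<V>) layout: each key once, then its values, then an end marker."""
    total = 0
    for key, group in groupby(sorted_records, key=lambda kv: kv[0]):
        total += LEN_BYTES + len(key)        # key written once
        for _, value in group:
            total += LEN_BYTES + len(value)  # each value with its length
        total += LEN_BYTES                   # end-of-values marker
    return total


records = sorted([
    (b"k1", b"v1"), (b"k1", b"v2"), (b"k1", b"v3"),
    (b"k2", b"v1"), (b"k2", b"v5"), (b"k2", b"v6"),
])
old_bytes = flat_size(records)
new_bytes = grouped_size(records)
print(old_bytes, new_bytes)
```

Under these assumptions the six records take 72 bytes in the flat layout and 56 bytes when grouped. With all-unique keys the grouped layout in this model is slightly larger because of the marker, so the benefit depends on how often keys repeat.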
Better use of JVM Memory
• BytesWritable Improvements
– Provides FastByteSerialization
– Saves 8 bytes per key-value pair
– Reduces IFile size by 25%
– Reduces SERDE costs
• PipelinedSorter can support > 2 GB sort buffers
– Containers with higher RAM are no longer limited by the 2 GB sort buffer limit
– Avoids unnecessary spills in large jobs
• Reduced key comparison costs in PipelinedSorter
[Figure: layout of a Key/Value pair when serialized to memory and to disk, comparing "Key Size, Bytes, Val Size, Bytes" with "Key Size, BytesSize, Val Size, BytesSize".]
Better use of JVM Memory - Contd
• Enabled RLE in reducer codepath
– Reduced key comparisons in merge codepath
– Improved job runtime (observed 10% improvement)
– Reduced CPU cost
[Figure: Without Fix: 691 seconds. With Fix: 621 seconds.]
Better use of JVM Memory - Contd
• WeightedMemoryDistributor for better memory management in tasks
– Observed 26% runtime improvement in tasks
Broadcast Shuffle Improvements
[Figure: Before Fix / After Fix. A Source Task broadcasts to several groups of Task 1, Task 2, …, Task N; two of the arrows in the diagram are labeled "From local disk".]
PipelinedShuffle Improvements
• Final merge in source task is avoided.
– Less IO
• Consumers are informed about spill events in advance
– Better usage of network bandwidth
– Overlap CPU with network
– For sorted/unsorted outputs, send data to consumers in chunks
• Observed 20% runtime improvement in queries involving heavy skews
[Figure: Normal Shuffle Path versus Pipelined Shuffle Path. In both, Task 1 and Task 2 write spills (Spill 1, Spill 2, Spill 3) that are consumed by Reduce Task 1 … Reduce Task N; the normal path includes a Merged Spill.]
PipelinedShuffle Improvements
[Figure: Job Runtime: 925 seconds versus Job Runtime: 680 seconds]
- 26% improvement
- Avoids final merge (less IO, CPU cost)
- Downstream can consume data whenever a spill is generated
Tez UI
[Figure: Tez UI screenshots, with a callout "Download data from ATS"]
Better Debuggability – Local Mode
• Test Tez Jobs without Hadoop Cluster
• Enables Fast Prototyping
• Fast Unit Testing
• Runs on Single JVM (easy for debugging)
• Scheduling / RPC invocations Skipped
Job Analysis Tools
• DAG Swimlane
– Run "sh yarn-swimlanes.sh <app_id>" in $TEZ_HOME/tez-tools/swimlanes/
[Figure: swimlane chart with annotations: Prewarm, Container Reuse, Remote Reads]
Shuffle Performance Analysis Tools
• Analyze Tez logs in Hadoop
• Analyze shuffle performance between source / destination nodes
[Figure callout: data transfer from node 7 to the rest of the nodes is slow]
RoadMap
• Shared output edges
–Same output to multiple vertices
• Local mode stabilization
• Optimizing (include/exclude) vertex at runtime
• Partial completion VertexManager
• Co-Scheduling
• Framework stats for better runtime decisions
Tez – Adoption
• Apache Hive
– Starting from Hive 0.13
– set hive.execution.engine=tez;
• Apache Pig
– Starting from Pig 0.14
– pig -x tez
• Cascading
• Flink
Tez Community
• Useful Links
–http://tez.apache.org/
–JIRA : https://issues.apache.org/jira/browse/TEZ
–Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git
–Mailing Lists
– Dev List: dev@tez.apache.org
– User List: user@tez.apache.org
– Issues List: issues@tez.apache.org
• Tez Meetup
–http://www.meetup.com/Apache-Tez-User-Group
Thank You!
Questions & Answers
Tobias Schneck
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 

Recently uploaded (20)

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 

Apache Tez – Present and Future

  • 1. © Hortonworks Inc. 2015 Page 1 Apache Tez – Present and Future Jeff Zhang (@zjffdu) Rajesh Balamohan (@rajeshbalamohan)
  • 2. Agenda • Tez Introduction • Tez Feature Deep Dive • Tez Improvement & Debuggability • Tez Status & Roadmap
  • 3. Example of MR versus Tez • Hive - MR: Job 1 (Join a & b), Job 2 (Group by of a Join b), Job 3 (Group by of c), Job 4 (Join of S & R), with I/O synchronization barriers between jobs • Hive - Tez: a single job covering Join a & b, Group by of a Join b, Group by of c, and Join of S & R
  • 4. Tez – Introduction • Distributed execution framework targeted towards data-processing applications. • Based on expressing a computation as a dataflow graph (DAG). • Highly customizable to meet a broad spectrum of use cases. • Built on top of YARN – the resource management framework for Hadoop. • Open source Apache project and Apache licensed.
  • 5. What is DAG & Why DAG • Directed Acyclic Graph • Any complicated DAG can be composed of the following three basic paradigms – Sequential (Projection, Filter, GroupBy, …) – Merge (Join, Union, Intersect, …) – Divide (Split, …)
  • 6. Expressing DAG in Tez API • DAG API (Logic View) – Allows the user to build a DAG – Topological structure of the data computation flow • Runtime API (Runtime View) – Application logic of each computation unit (vertex) – How to move/read/write data between vertices
  • 7. DAG API (Logic View) • Vertex (Processor, Parallelism, Resource, etc…) • Edge (EdgeProperty) – DataMovement – Scatter Gather (Join, GroupBy … ) – Broadcast ( Pig Replicated Join / Hive Broadcast Join ) – One-to-One ( Pig Order by ) – Custom
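The logic view on slides 6 and 7 (vertices with a parallelism, edges with a data-movement type) can be sketched with a small toy model. This is not the Tez DAG API: `ToyDAG`, `DataMovement`, and the vertex names are hypothetical and exist only to illustrate the structure and the topological order in which vertices could run.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum


class DataMovement(Enum):
    """The three built-in edge types named on slide 7 (Custom is omitted)."""
    SCATTER_GATHER = "scatter-gather"
    BROADCAST = "broadcast"
    ONE_TO_ONE = "one-to-one"


@dataclass
class ToyDAG:
    """Hypothetical stand-in for a DAG description; not the real Tez classes."""
    vertices: dict = field(default_factory=dict)  # vertex name -> parallelism
    edges: list = field(default_factory=list)     # (src, dst, DataMovement)

    def add_vertex(self, name, parallelism):
        self.vertices[name] = parallelism
        return self

    def add_edge(self, src, dst, movement):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges.append((src, dst, movement))
        return self

    def topological_order(self):
        """Kahn's algorithm; raises ValueError if the graph contains a cycle."""
        indegree = {name: 0 for name in self.vertices}
        for _, dst, _ in self.edges:
            indegree[dst] += 1
        ready = deque(name for name in self.vertices if indegree[name] == 0)
        order = []
        while ready:
            current = ready.popleft()
            order.append(current)
            for src, dst, _ in self.edges:
                if src == current:
                    indegree[dst] -= 1
                    if indegree[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.vertices):
            raise ValueError("graph has a cycle, so it is not a DAG")
        return order


# A merge (two scans feeding a join) followed by a sequential step.
dag = (
    ToyDAG()
    .add_vertex("scan_a", 4)
    .add_vertex("scan_b", 1)
    .add_vertex("join", 2)
    .add_vertex("group_by", 2)
    .add_edge("scan_a", "join", DataMovement.SCATTER_GATHER)
    .add_edge("scan_b", "join", DataMovement.BROADCAST)
    .add_edge("join", "group_by", DataMovement.SCATTER_GATHER)
)
print(dag.topological_order())  # ['scan_a', 'scan_b', 'join', 'group_by']
```

The example combines the merge and sequential paradigms from slide 5; a divide would be one vertex with edges to two or more downstream vertices.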
  • 8. Runtime API (Runtime View) [Figure: Input → Processor → Output] • Input – Through which processor receives data on an edge – Vertex can have multiple inputs • Processor – Application Logic (One vertex one processor) – Consume the inputs and produce the outputs • Output – Through which processor writes data to an edge – One vertex can have multiple outputs • Example of Input/Output/Processor – MRInput & MROutput (InputFormat/OutputFormat) – OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather) – UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One) – PigProcessor/HiveProcessor
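The runtime view on slide 8 can be illustrated with a toy processor that consumes several named inputs and writes to every output. The classes `ListInput`, `ListOutput`, and `SumByKeyProcessor` are hypothetical and are not Tez interfaces; they only mirror the Input / Processor / Output roles the slide describes.

```python
class ListInput:
    """Hypothetical input: yields the (key, value) pairs arriving on one edge."""

    def __init__(self, records):
        self._records = list(records)

    def read(self):
        yield from self._records


class ListOutput:
    """Hypothetical output: collects the (key, value) pairs written to one edge."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class SumByKeyProcessor:
    """One processor per vertex: consume all inputs, produce to all outputs."""

    def run(self, inputs, outputs):
        totals = {}
        for edge_input in inputs.values():
            for key, value in edge_input.read():
                totals[key] = totals.get(key, 0) + value
        for key in sorted(totals):
            for edge_output in outputs.values():
                edge_output.write(key, totals[key])


inputs = {
    "left": ListInput([("a", 1), ("b", 2)]),
    "right": ListInput([("a", 3)]),
}
outputs = {"sink": ListOutput()}
SumByKeyProcessor().run(inputs, outputs)
print(outputs["sink"].records)  # [('a', 4), ('b', 2)]
```

Keeping inputs and outputs as named maps reflects the slide's point that one vertex can have multiple inputs and multiple outputs while still running a single processor.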
  • 9. Benefit of DAG • Easier to express computation in DAG • No intermediate data written to HDFS • Less pressure on NameNode • No resource queuing effort & less resource contention • More optimization opportunity with more global context
  • 10. Agenda • Tez Introduction • Tez Feature Deep Dive • Tez Improvement & Debuggability • Tez Status & Roadmap
  • 11. Container-Reuse • Reuse the same container across DAG/Vertices/Tasks • Benefit of Container-Reuse – Less resources consumed – Reduce overhead of launching JVM – Reduce overhead of negotiating with the ResourceManager – Reduce overhead of resource localization – Reduce network IO – Object Caching (Object Sharing)
  • 12. Tez Session • Multiple Jobs/DAGs in one AM • Container-reuse across Jobs/DAGs • Data sharing between Jobs/DAGs
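The effect of container reuse (slides 11 and 12) can be made concrete by counting container launches in a toy scheduler. This is a minimal sketch under simplifying assumptions: tasks run in fixed-size waves, every task fits any container, and `ContainerPool` and `run_tasks` are hypothetical names rather than Tez components.

```python
class ContainerPool:
    """Hands out idle containers first and launches a new one only when none is idle."""

    def __init__(self):
        self._idle = []
        self.launched = 0

    def acquire(self):
        if self._idle:
            return self._idle.pop()
        self.launched += 1  # stands in for JVM start-up, negotiation, localization
        return f"container-{self.launched}"

    def release(self, container):
        self._idle.append(container)


def run_tasks(num_tasks, max_concurrent, reuse):
    """Run tasks in waves and return how many containers had to be launched."""
    pool = ContainerPool()
    remaining = num_tasks
    while remaining > 0:
        wave = min(max_concurrent, remaining)
        held = [pool.acquire() for _ in range(wave)]
        remaining -= wave
        if reuse:
            # Finished tasks hand their container back for the next wave.
            for container in held:
                pool.release(container)
    return pool.launched


print(run_tasks(10, 4, reuse=False))  # 10
print(run_tasks(10, 4, reuse=True))   # 4
```

With reuse, the launch count is bounded by the concurrency rather than by the number of tasks, which is where the savings listed on slide 11 come from.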
  • 13. Dynamic Parallelism Estimation • VertexManager – Listens to the status of other vertices – Coordinates and schedules its tasks – Communication between vertices
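Slide 13 does not spell out how the parallelism is estimated. One plausible policy, shown here purely as an assumption, is to size the downstream vertex from the output sizes reported by upstream tasks. The function name and the `desired_bytes_per_task` parameter are hypothetical.

```python
import math


def estimate_parallelism(observed_output_bytes, desired_bytes_per_task,
                         configured_parallelism, min_tasks=1):
    """Pick a task count from upstream output sizes, never above the configured one."""
    if desired_bytes_per_task <= 0:
        raise ValueError("desired_bytes_per_task must be positive")
    total_bytes = sum(observed_output_bytes)
    if total_bytes == 0:
        return min_tasks
    needed = math.ceil(total_bytes / desired_bytes_per_task)
    return max(min_tasks, min(configured_parallelism, needed))


# Three upstream tasks reported 100, 200 and 300 bytes; aim for 256 bytes per task.
print(estimate_parallelism([100, 200, 300], 256, configured_parallelism=10))  # 3
```

The sketch only reduces parallelism below the configured value; whether and how a real vertex manager increases it is outside what the slide states.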
  • 14. ATS Integration • Tez is fully integrated with YARN ATS (Application Timeline Server) – DAG Status, DAG Metrics, Task Status, Task Metrics are captured • Diagnostics & Performance analysis – Data Source for monitoring & diagnostics – Data Source for performance analysis
  • 15. Recovery • AM can crash in corner cases – OOM – Node failure – … • Continue from the last checkpoint • Transparent to end users [Figure: AM crash]
  • 16. Order By of Pig • f = Load 'foo' as (x, y); o = Order f by x; [Figure: two plans with the stages Load, Sample (Calculate Histogram), Partition, and Sort; one passes data through HDFS, the other connects the stages with Broadcast, One-to-One, and Scatter Gather edges]
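The stages on slide 16 (sample the keys, derive partition boundaries, partition, sort each partition) amount to a range-partitioned sort. The following is a single-process sketch of that idea, not Pig's implementation; the function names, the sample size, and the way cut points are chosen are assumptions.

```python
import bisect
import random


def sample_boundaries(keys, num_partitions, sample_size, seed=0):
    """Sample the keys and pick num_partitions - 1 cut points from the sorted sample."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    return [sample[(i * len(sample)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_of(key, boundaries):
    """Index of the range a key falls into; ranges are ordered by construction."""
    return bisect.bisect_right(boundaries, key)


def order_by(keys, num_partitions=3, sample_size=100):
    """Partition by sampled ranges, then sort each partition independently."""
    boundaries = sample_boundaries(keys, num_partitions, sample_size)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_of(key, boundaries)].append(key)
    return [sorted(partition) for partition in partitions]


keys = list(range(1000))
random.Random(42).shuffle(keys)
parts = order_by(keys, num_partitions=4, sample_size=50)
flattened = [key for part in parts for key in part]
print(flattened == sorted(keys))  # True
```

Because every key in partition i is smaller than every key in partition i + 1, concatenating the independently sorted partitions gives a total order. In the slide's terms, the boundaries are the small piece of data that would travel over the Broadcast edge.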
  • 17. Agenda • Tez Introduction • Tez Feature Deep Dive • Tez Improvement & Debuggability • Tez Status & Roadmap
  • 18. • Performance – Speculation – Intermediate File Improvements – Better use of JVM Memory – Shuffle Improvements • Debuggability – Tez UI – Local mode – Job Analysis Tools – Shuffle Performance Analysis Tool
  • 19. Speculation • Good for clusters with a mix of fast and slow nodes or heterogeneous hardware. • Maintains periodic runtime statistics of tasks • Triggers a speculative attempt when estimated runtime > mean runtime
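The trigger rule on slide 19 (estimated runtime > mean runtime) can be sketched as follows. The slide does not say how the estimate is computed; linear extrapolation from elapsed time and progress is an assumption here, and the function names are hypothetical.

```python
from statistics import mean


def estimated_runtime(elapsed_seconds, progress):
    """Extrapolate total runtime linearly from fractional progress in (0, 1]."""
    if not 0 < progress <= 1:
        raise ValueError("progress must be in (0, 1]")
    return elapsed_seconds / progress


def tasks_to_speculate(running, completed_runtimes):
    """Return ids of running tasks whose estimate exceeds the mean completed runtime."""
    if not completed_runtimes:
        return []  # no statistics yet, so nothing to compare against
    threshold = mean(completed_runtimes)
    return [
        task_id
        for task_id, (elapsed, progress) in running.items()
        if estimated_runtime(elapsed, progress) > threshold
    ]


completed = [10, 12, 14]        # mean runtime of finished tasks: 12 seconds
running = {
    "t1": (6, 0.5),             # estimate 12 s, not above the mean
    "t2": (9, 0.3),             # estimate about 30 s, a straggler
    "t3": (2, 0.5),             # estimate 4 s
}
print(tasks_to_speculate(running, completed))  # ['t2']
```

A production scheduler would likely add guards such as a minimum elapsed time or a cap on concurrent speculative attempts; those are left out because the slide does not mention them.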
  • 20. Intermediate File Format Improvements • Used for storing intermediate data in Tez • Drawbacks of earlier format – Needs larger buffer in memory (due to duplicate keys) – Bigger file size in disk – Not ideal for all use cases • New Intermediate File Format – Works based on (K, List<V>) – Provides 57% memory efficiency and 23% improvement in disk storage [Figure: a task writes Spill 1, Spill 2, Spill 3 and a Merged Spill. Old IFile format: (Key Len, Value Len, K1, V1), (Key Len, Value Len, K1, V2), (Key Len, Value Len, K1, V3), (Key Len, Value Len, K2, V1), …, (Key Len, Value Len, K2, V5), (Key Len, Value Len, K2, V6). New IFile format: Key Len, K1, Value Len, V1, Value Len, V2, V_END, RLE, Value Len, V3, …, Key Len, K2, Value Len, V1, Value Len, V5, V_END, RLE, Value Len, V6, …]
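The difference between the two layouts on slide 20 can be illustrated by encoding the same sorted records both ways and comparing sizes. This is a simplified sketch, not the IFile byte format: it uses fixed 4-byte lengths, an illustrative `V_END` value of -1, and ignores the RLE marker shown in the figure.

```python
import struct
from itertools import groupby

V_END = -1  # illustrative end-of-values marker; the real value is not on the slide


def encode_flat(records):
    """Old-style layout: key length, value length, key, value for every record."""
    out = bytearray()
    for key, value in records:
        out += struct.pack(">ii", len(key), len(value)) + key + value
    return bytes(out)


def encode_grouped(records):
    """(K, List<V>) layout: each key once, then its values, then an end marker."""
    out = bytearray()
    for key, group in groupby(records, key=lambda kv: kv[0]):
        out += struct.pack(">i", len(key)) + key
        for _, value in group:
            out += struct.pack(">i", len(value)) + value
        out += struct.pack(">i", V_END)
    return bytes(out)


def decode_grouped(data):
    """Read the grouped layout back into a list of (key, [values])."""
    pos, result = 0, []
    while pos < len(data):
        (key_len,) = struct.unpack_from(">i", data, pos)
        pos += 4
        key = data[pos:pos + key_len]
        pos += key_len
        values = []
        while True:
            (value_len,) = struct.unpack_from(">i", data, pos)
            pos += 4
            if value_len == V_END:
                break
            values.append(data[pos:pos + value_len])
            pos += value_len
        result.append((key, values))
    return result


# Records must already be sorted by key, as spill output is.
records = [(b"k1", b"v1"), (b"k1", b"v2"), (b"k1", b"v3"),
           (b"k2", b"v1"), (b"k2", b"v5"), (b"k2", b"v6")]
print(len(encode_flat(records)), len(encode_grouped(records)))  # 72 56
```

The saving grows with the number of values per key and with key size; with mostly unique keys the grouped layout pays for an extra marker per key, which is consistent with the slide's remark that one format is not ideal for all use cases.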
  • 21. Better use of JVM Memory • BytesWritable Improvements – Provides FastByteSerialization – Saves 8 bytes per key-value pair – Reduces IFile size by 25% – Reduces SERDE costs • PipelinedSorter can support > 2 GB sort buffers – Containers with higher RAM no longer limited by 2 GB sort buffer limits – Avoids unnecessary spills in large jobs • Reduced key comparison costs in PipelinedSorter [Figure: serialization of a key-value pair to memory and to disk; fields shown: Key, Value, Key Size, Bytes, Val Size, BytesSize]
  • 22. Better use of JVM Memory - Contd • Enabled RLE in reducer codepath – Reduced key comparisons in merge codepath – Improved Job Runtime (observed 10% improvement) – Reduced CPU cost • Without fix: 691 seconds • With fix: 621 seconds
  • 23. Better use of JVM Memory - Contd • WeightedMemoryDistributor for better memory management in tasks – Observed 26% runtime improvement in tasks
  • 24. Broadcast Shuffle Improvements [Figure: a source task broadcasting to Task 1 … Task N on each node, shown before and after the fix, with some fetches labeled "From local disk"]
  • 25. PipelinedShuffle Improvements • Final merge in source task is avoided. – Less IO • Consumers are informed about spill events in advance – Better usage of network bandwidth – Overlap CPU with network – For sorted/unsorted outputs, send data to consumers in chunks • Observed 20% runtime improvement in queries involving heavy skews [Figure: Normal Shuffle Path: source tasks write Spill 1, Spill 2, Spill 3, merge them into a Merged Spill, and Reduce Task 1 … Reduce Task N fetch the merged spill. Pipelined Shuffle Path: Reduce Task 1 … Reduce Task N fetch Spill 1, Spill 2, Spill 3 directly]
  • 26. PipelinedShuffle Improvements • Job runtime: 925 seconds vs. 680 seconds (26% improvement) • Avoids final merge (less IO, CPU cost) • Downstream can consume data whenever a spill is generated
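The two shuffle paths on slides 25 and 26 differ in who merges the spills and when data becomes visible to consumers. The sketch below models only that ordering, with no timing or network; the function names and event labels are hypothetical. It also reproduces the slide's improvement figure from the two runtimes.

```python
import heapq


def normal_shuffle(spills):
    """Source task merges all sorted spills first, then publishes one event."""
    merged = list(heapq.merge(*spills))
    yield ("merged", merged)


def pipelined_shuffle(spills):
    """Source task publishes each sorted spill as it is produced; no final merge."""
    for index, spill in enumerate(spills, start=1):
        yield (f"spill-{index}", list(spill))


def consume(events):
    """Consumer merges whatever sorted chunks arrive and counts the fetches."""
    chunks = [chunk for _, chunk in events]
    return list(heapq.merge(*chunks)), len(chunks)


def improvement_percent(before_seconds, after_seconds):
    """Relative runtime reduction, in percent."""
    return (before_seconds - after_seconds) / before_seconds * 100


spills = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(consume(normal_shuffle(spills)))     # ([1, 2, 3, 4, 5, 6, 7, 8, 9], 1)
print(consume(pipelined_shuffle(spills)))  # ([1, 2, 3, 4, 5, 6, 7, 8, 9], 3)
print(round(improvement_percent(925, 680)))  # 26
```

Both paths deliver the same sorted data; the pipelined path trades one large fetch for several smaller ones and moves the merge work to the consumer, which is what lets downstream tasks start while later spills are still being written.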
  • 27. • Performance – Speculation – Better use of JVM Memory – Intermediate File Improvements – Shuffle Improvements • Debuggability – Tez UI – Local mode – Job Analysis Tools – Shuffle Performance Analysis Tool
  • 28. Tez UI
  • 29. Tez UI
  • 31. Better Debuggability – Local Mode • Test Tez Jobs without Hadoop Cluster • Enables Fast Prototyping • Fast Unit Testing • Runs on Single JVM (easy for debugging) • Scheduling / RPC invocations Skipped
  • 32. Job Analysis Tools • DAG Swimlane – "sh $TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>" [Figure: swimlane chart annotated with Prewarm, Container Reuse, Remote Reads]
  • 33. Shuffle Performance Analysis Tools • Analyze Tez logs in Hadoop • Analyze shuffle performance between source / destination nodes [Figure annotation: data transfer from node 7 to the rest of the nodes is slow]
  • 34. Shuffle Performance Analysis Tools • Analyze shuffle performance between source / destination nodes
  • 35. RoadMap • Shared output edges – Same output to multiple vertices • Local mode stabilization • Optimizing (include/exclude) vertex at runtime • Partial completion VertexManager • Co-Scheduling • Framework stats for better runtime decisions
  • 36. Tez – Adoption • Apache Hive – Supported since Hive 0.13 – set hive.execution.engine=tez; • Apache Pig – Supported since Pig 0.14 – pig -x tez • Cascading • Flink
  • 37. Tez Community • Useful Links – http://tez.apache.org/ – JIRA: https://issues.apache.org/jira/browse/TEZ – Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git – Mailing Lists – Dev List: dev@tez.apache.org – User List: user@tez.apache.org – Issues List: issues@tez.apache.org • Tez Meetup – http://www.meetup.com/Apache-Tez-User-Group
  • 38. Thank You! Questions & Answers

Editor's Notes

  1. application_1428021179455_0281 vs application_1428021179455_0282 691 vs 626 seconds
  2. application_1428021179455_0240 680 seconds application_1428021179455_0257 925 seconds