Tuning Up With Apache Tez
Gal Vinograd @ Crosswise - 2016/03/09
Agenda
The Pipeline
The Problem
Why we chose Tez
Lessons Learned
Demo
The Batch
Internet
Labels
Data
~200
Scripts
250 c3.2xlarge x 30 hours
10TB per Batch
“Tez aims to be a general purpose execution
runtime that enhances various scenarios
that are not well served by classic
Map-Reduce. In the short term the major
focus is to support Hive and Pig ...”
Tez Design v1.1
Hortonworks
The Batch
Internet
Labels
Data
~200
Scripts
Tez Atomic Components
Tokenizer
Aggregator
Edge
Vertex
Vertex
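
The Tokenizer and Aggregator vertices joined by an edge can be pictured with a toy model. This is a minimal sketch in plain Python, not the Tez API: the functions `tokenizer`, `aggregator`, and `run_dag`, and the sample input, are hypothetical names chosen for illustration.

```python
from collections import Counter

def tokenizer(lines):
    """Tokenizer vertex: emit (word, 1) for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def aggregator(pairs):
    """Aggregator vertex: sum the counts per word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# A toy DAG: vertices by name, and edges as (producer, consumer) pairs.
vertices = {"Tokenizer": tokenizer, "Aggregator": aggregator}
edges = [("Tokenizer", "Aggregator")]

def run_dag(source):
    """Feed the source into the first producer, then follow each edge in order."""
    data = vertices[edges[0][0]](source)
    for _, consumer in edges:
        data = vertices[consumer](data)
    return data

word_counts = run_dag(["tez runs dags", "dags of vertices"])
print(word_counts)
```

In a real Tez job the edge would also carry a data movement type and the vertices would run as distributed tasks; the sketch only shows the producer/consumer relationship.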
Logical and Physical Graphs
Physical  Logical
Hortonworks
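
One way to read the logical/physical split: each logical vertex is expanded into a number of parallel tasks, and each logical edge into task-to-task connections. The sketch below assumes a scatter-gather edge (every producer task feeds every consumer task); the parallelism values and the `expand` helper are hypothetical, and this is plain Python rather than Tez code.

```python
from itertools import product

# Logical graph: vertex name -> parallelism (number of tasks). Values are made up.
logical_vertices = {"Tokenizer": 3, "Aggregator": 2}
logical_edges = [("Tokenizer", "Aggregator")]

def expand(vertices, edges):
    """Expand a logical graph into physical tasks and task-to-task edges."""
    tasks = {
        name: [f"{name}[{i}]" for i in range(parallelism)]
        for name, parallelism in vertices.items()
    }
    # Scatter-gather assumption: connect every producer task to every consumer task.
    physical_edges = [
        (src_task, dst_task)
        for src, dst in edges
        for src_task, dst_task in product(tasks[src], tasks[dst])
    ]
    return tasks, physical_edges

tasks, physical_edges = expand(logical_vertices, logical_edges)
for edge in physical_edges:
    print(edge)
```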
Optimizations
No “NOP” Map
Project
Distinct
GroupBy
NOP
Project
Distinct
GroupBy
Tez MR
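
The NOP map appears because classic MapReduce only runs map-then-reduce pairs, so a reduce-side stage that follows another reduce-side stage needs an empty map in front of it. The sketch below is a simplified planner written for illustration, assuming at most one map-side stage before each reduce-side stage; `plan_mapreduce` and the stage classification are hypothetical and do not reflect how Pig actually compiles plans.

```python
def plan_mapreduce(stages):
    """Pack (name, kind) stages into (map, reduce) jobs, padding with a NOP map
    whenever a reduce-side stage has no map-side stage in front of it."""
    jobs, pending_map = [], None
    for name, kind in stages:
        if kind == "map":
            pending_map = name
        else:
            jobs.append((pending_map or "NOP", name))
            pending_map = None
    return jobs

def plan_tez(stages):
    """In a DAG runtime the stages can simply be chained as vertices."""
    return [name for name, _ in stages]

stages = [("Project", "map"), ("Distinct", "reduce"), ("GroupBy", "reduce")]
mr_jobs = plan_mapreduce(stages)
tez_vertices = plan_tez(stages)
print(mr_jobs)
print(tez_vertices)
```

Under these assumptions the MapReduce plan needs two jobs, the second starting with a NOP map, while the DAG plan keeps the three stages as three connected vertices.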
Optimizations
No Barrier Between Jobs
Project
GroupBy
Project
Project
Distinct
Project
Distinct
GroupBy
Tez MR
Optimizations
No Redundant Resource Allocation
Project
Project
Distinct
GroupBy
Project
Project
Distinct
GroupBy
Pig
Process
Pig
Process
Tez MR
Optimizations
Sessions
Allocate
Submit 2
Submit 1
Cleanup
Client
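
The session flow on this slide (allocate once, submit several times, clean up once) can be sketched as a context manager. This is a toy model in plain Python, not the Tez client API; `session`, `submit`, and the `events` log are hypothetical names.

```python
from contextlib import contextmanager

events = []  # Records the order of operations for inspection.

@contextmanager
def session():
    """Allocate resources once, hand out a submit function, clean up at the end."""
    events.append("allocate")
    try:
        yield lambda dag: events.append(f"submit {dag}")
    finally:
        events.append("cleanup")

with session() as submit:
    submit(1)  # Both submissions reuse the same allocation.
    submit(2)

print(events)
```

The point of the pattern is that the allocation cost is paid once per session rather than once per submitted job.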
Lessons Learned
Some Pig Tasks Did Not Compile or Occasionally Froze
No DistributedCache Support For S3
Poor Amazon Support
No Pre-Built Releases
Additional Deployment for Tez UI
What is it good for?
Early
Adopters
Pig  Hive
Bounded
Thanks for Listening!

Editor's Notes

  • #19 -Dpig.tez.opt.union=false