In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. Based on our experience running Apache Pig and Apache Hive at scale on Apache Tez at Yahoo!, advanced features like auto-parallelism and session mode expose specific limitations in the shuffle service, which was not designed with these features in mind.
A highly auto-reduced job suffers from longer fetch times because the number of fetches per downstream task increases by the auto-reduction factor. The Apache Tez Shuffle Handler adds composite fetch, which supports multi-partition fetch, to mitigate this performance slowdown.
Also, since Apache Tez DAGs are run completely within a single application unlike their equivalent MapReduce jobs, intermediate shuffle data in Tez can linger beyond its usefulness. The Apache Tez Shuffle Handler provides deletion APIs to reduce disk usage for such long running Tez sessions.
We will also outline the future roadmap for the Apache Tez Shuffle Handler and present performance evaluation results from real-world jobs at scale.
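The fetch-count blow-up that composite fetch addresses can be sketched with a toy model (illustrative only; this is not the actual Tez wire protocol). With auto-reduction, each surviving reducer must fetch the partitions originally destined for the reducers it absorbed; the legacy handler issues one fetch per (map output, partition), while composite fetch covers the whole contiguous partition range in a single request per map output.

```python
# Toy fetch-count model, assuming each downstream task absorbs
# `partitions_per_reducer` original partitions after auto-reduction.

def legacy_fetch_count(num_maps, partitions_per_reducer):
    # legacy shuffle: one HTTP fetch per upstream map output per absorbed partition
    return num_maps * partitions_per_reducer

def composite_fetch_count(num_maps, partitions_per_reducer):
    # composite fetch: one HTTP fetch per upstream map output,
    # covering the full partition range in a single request
    return num_maps

num_maps = 1000
auto_reduction_factor = 8   # 8 original partitions now served by one reducer

print(legacy_fetch_count(num_maps, auto_reduction_factor))     # 8000
print(composite_fetch_count(num_maps, auto_reduction_factor))  # 1000
```

With these illustrative numbers, composite fetch cuts the number of HTTP fetches per downstream task by exactly the auto-reduction factor.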
CONCEPTS AND COMPONENTS
● Apache Tez Session Mode
• Runs multiple queries as part of a single session
• One Application Master, multiple DAGs
• Apache Hive and Hue primarily use session mode
• Many transactions = many DAGs
Fig. Multiple DAGs can form a session under the same AppMaster
CONCEPTS AND COMPONENTS (Contd.)
● What is Container Reuse?
• MapReduce uses single-use containers
• Apache Tez avoids initialization cost by reusing containers for tasks across DAGs and vertices
Fig. DAGs and tasks can share the same container
PROBLEM - THE LIFE OF SESSION INTERMEDIATE DATA
● DAG shuffle data lives beyond its usefulness.
● The bigger the shuffle, the larger the disk footprint.
Fig. Multiple DAGs can form a session under the same AppMaster; shuffle data from earlier DAGs sits unused for the rest of the session
PROBLEM - THE LIFE OF SESSION INTERMEDIATE DATA (Contd.)
● Session Bloating
• Shuffle data occupies a significant chunk of the disks
• May elbow out other apps on the same node
Fig. Session Bloating: shuffle data from multiple DAGs piles up on the node's disks
GETTING TO DAG DELETE
● Associate intermediate shuffle data with its DAG.
A SOLUTION : DAG DELETE
● A Shuffle Handler that understands DAG deletion requests.
• Container Launchers are the clients
• The Tez Shuffle Handler is the server
• DAG deletion queries are asynchronous HTTP requests sent over multiple threads
Fig. DAG Deletion Architecture and Control Flow: on DagComplete, the Tez AM's Container Launcher Manager launches Container Launchers, each with a Deletion Tracker that addresses the Tez Shuffle Handler on every NodeManager by <NodeId, ShufflePort>
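The fan-out described above can be sketched as follows (a minimal illustration, not the actual Tez implementation: the URL path and query parameters are hypothetical, and `send_delete` stands in for a real HTTP client call):

```python
# Sketch of the deletion-tracker side: on DagComplete, fan out asynchronous
# deletion requests to the shuffle handler on each node, over a thread pool.
from concurrent.futures import ThreadPoolExecutor

def deletion_url(node_id, shuffle_port, app_id, dag_id):
    # each target is addressed by <NodeId, ShufflePort>, as in the figure;
    # the path/params below are illustrative, not the real Tez endpoint
    return (f"http://{node_id}:{shuffle_port}/mapOutput"
            f"?delete=true&job={app_id}&dag={dag_id}")

def send_delete(url):
    # real code would issue the HTTP request and handle failures;
    # returning the URL keeps this sketch self-contained
    return url

def delete_dag_data(targets, app_id, dag_id, threads=4):
    # targets: iterable of (node_id, shuffle_port) pairs
    with ThreadPoolExecutor(max_workers=threads) as pool:
        urls = [deletion_url(n, p, app_id, dag_id) for n, p in targets]
        return list(pool.map(send_delete, urls))

urls = delete_dag_data([("nm1.example.com", 13562), ("nm2.example.com", 13562)],
                       "application_1_0001", "dag_1")
```

Sending the requests asynchronously over a pool keeps DAG completion from blocking on slow or unreachable NodeManagers.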
A SOLUTION : DAG DELETE
● The Deletion Policy is pluggable
• Write your own
• Container Launchers may or may not want one
• Allows different "types" of shuffle data deletion services
• Not every type of container may know its shuffle port
• Implement use-case-specific optimizations
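The pluggable-policy idea can be illustrated with a small sketch (the class and method names here are hypothetical, not the actual Tez interfaces): a Container Launcher installs a policy that decides which nodes receive a deletion request when a DAG completes.

```python
# Hypothetical sketch of a pluggable deletion policy.
from abc import ABC, abstractmethod

class DeletionPolicy(ABC):
    @abstractmethod
    def targets_for_dag(self, dag_id, nodes):
        """Return the (node_id, shuffle_port) pairs to send deletions to."""

class DeleteEverywhere(DeletionPolicy):
    # simplest policy: delete the DAG's shuffle data on every node
    def targets_for_dag(self, dag_id, nodes):
        return list(nodes)

class SkipUnknownPorts(DeletionPolicy):
    # not every container type knows its shuffle port; skip those nodes
    def targets_for_dag(self, dag_id, nodes):
        return [(n, p) for n, p in nodes if p is not None]

nodes = [("nm1", 13562), ("nm2", None), ("nm3", 13562)]
targets = SkipUnknownPorts().targets_for_dag("dag_1", nodes)  # skips nm2
```

A launcher that manages containers without a known shuffle port would pick `SkipUnknownPorts`-style behavior; others might plug in a policy that batches or throttles deletions.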
DAG DELETE EVALUATION
● Defining Waste
• More SHUFFLE_BYTES unused.
• More idle time for the same data.
● Defining the metric: Percentage Reserved Space-Time Savings
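One way to read the "reserved space-time" metric (a hypothetical reconstruction; the talk does not spell out the exact formula) is as the integral of reserved shuffle bytes over time, in byte-seconds, compared before and after DAG delete:

```python
# Reserved space-time as a step-function integral of shuffle bytes over time.
# Samples are (timestamp_sec, shuffle_bytes) points; bytes are held constant
# until the next sample. The numbers below are illustrative, not from the talk.

def space_time(samples):
    total = 0.0
    for (t0, b0), (t1, _) in zip(samples, samples[1:]):
        total += b0 * (t1 - t0)   # bytes held * seconds held
    return total

# DAG 1 writes 4 GB of shuffle data; DAG 2 later writes 8 GB.
# Without DAG delete, DAG 1's data lingers until the session ends.
before = [(0, 4e9), (500, 4e9), (1000, 8e9), (2000, 0)]
# With DAG delete, DAG 1's data is freed at t=500 when the DAG completes.
after  = [(0, 4e9), (500, 0),   (1000, 8e9), (2000, 0)]

savings = 100 * (1 - space_time(after) / space_time(before))  # ~16.7% here
```

Sessions where large shuffle outputs would otherwise linger for a long time score the highest savings, which matches the best/worst cases on the following slides.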
THE BEST OF TIMES
• Percentage Reserved Space-Time Savings: 54.4%
• Multiple DAGs with comparable sizes and runtimes = more savings
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
THE UNEXPECTED TIMES
• Percentage Reserved Space-Time Savings: 33.4%
• Multiple DAGs with nearly equal sizes and runtimes = an entire DAG's worth of savings each time
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
THE UNEXPECTED TIMES
• Percentage Reserved Space-Time Savings: 26.05%
• 9 DAGs, most of them with no shuffle data, interspersed with large shuffle DAGs = reasonable savings!
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
THE WORST TIMES
• Percentage Reserved Space-Time Savings: 0.00004%
• Multiple DAGs with almost no shuffle data, culminating in a shuffle-intensive DAG
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
CONCLUSION
• The Tez Shuffle Handler (TEZ-3334) shows a substantial performance gain through composite fetch
• Shuffle times drop by orders of magnitude
• Actual job times drop by several minutes
• More jobs per unit of time = more throughput
• Disk space utilization is markedly better with DAG Delete
• Shuffle-heavy, long-running sessions show space-time savings of ~50%
• Makes the Tez Shuffle Handler a better partner in a multi-tenant setting
FUTURE WORK
● Vertex Deletion (TEZ-3363)
• Can we predict when vertex intermediate data is stale enough to be deleted?
• How will it impact the worst case we saw with DAG delete?
• Think: delete data once downstream vertices reach a certain depth of completion
● Multi-file (Unordered) (TEZ-3367)
• The unordered case does not need multiple spill files
• How do we keep multiple output files from consuming all the inodes?
• Remove buffers for key-value reads
• Address skew scenarios
FUTURE WORK
● Slow Start with MRInput (TEZ-3274)
• Vertices with both shuffle input and MRInput do not respect slow start
● Empty partition improvements (TEZ-3605)
• The ordered case writes empty partitions to the IFiles
• In a heavily auto-reduced scenario, such partitions are fetched and then thrown away
ACKNOWLEDGEMENTS
● The Apache Tez Community
● Rohini Palaniswamy, PMC/Committer, Apache Pig, Apache Tez, Apache Oozie
● Jason Lowe, PMC/Committer, Apache Hadoop, Apache Tez
● Siddharth Seth, PMC/Committer, Apache Hadoop, Apache Tez, Apache Hive
● Hitesh Shah, PMC/Committer, Apache Hadoop, Apache Tez, Apache Ambari