In this talk we introduce a new Shuffle Handler for Tez, a YARN Auxiliary Service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. Based on our experience running Apache Pig and Apache Hive at scale on Apache Tez at Yahoo!, advanced features like auto-parallelism and session mode expose specific limitations in the shuffle service, which was not designed with these features in mind.
A highly auto-reduced job suffers from longer fetch times because the number of fetches per downstream task increases by the auto-reduction factor. The Apache Tez Shuffle Handler adds composite fetch, which supports multi-partition fetch, to mitigate this performance slowdown.
Also, since Apache Tez DAGs are run completely within a single application unlike their equivalent MapReduce jobs, intermediate shuffle data in Tez can linger beyond its usefulness. The Apache Tez Shuffle Handler provides deletion APIs to reduce disk usage for such long running Tez sessions.
We will also outline the future roadmap for the Apache Tez Shuffle Handler and present performance evaluation results from real-world jobs at scale.
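The fetch-count blow-up that composite fetch addresses can be sketched with a toy model (illustrative only; this is not the actual Tez wire protocol). With auto-reduction, each surviving reducer must fetch the partitions originally destined for the reducers it absorbed; the legacy handler issues one fetch per (map output, partition), while composite fetch covers the whole contiguous partition range in a single request per map output.

```python
# Toy fetch-count model, assuming each downstream task absorbs
# `partitions_per_reducer` original partitions after auto-reduction.

def legacy_fetch_count(num_maps, partitions_per_reducer):
    # legacy shuffle: one HTTP fetch per upstream map output per absorbed partition
    return num_maps * partitions_per_reducer

def composite_fetch_count(num_maps, partitions_per_reducer):
    # composite fetch: one HTTP fetch per upstream map output,
    # covering the full partition range in a single request
    return num_maps

num_maps = 1000
auto_reduction_factor = 8   # 8 original partitions now served by one reducer

print(legacy_fetch_count(num_maps, auto_reduction_factor))     # 8000
print(composite_fetch_count(num_maps, auto_reduction_factor))  # 1000
```

With these illustrative numbers, composite fetch cuts the number of HTTP fetches per downstream task by exactly the auto-reduction factor.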
CONCEPTS AND COMPONENTS
● Apache Tez Session Mode
• Runs multiple queries as part of a single session
• One Application Master, multiple DAGs
• Apache Hive and Hue primarily use session mode
• Many transactions = many DAGs
Fig. Multiple DAGs can form a session under the same AppMaster
CONCEPTS AND COMPONENTS (Contd.)
● What is Container Reuse?
• MapReduce uses single-use containers
• Apache Tez avoids initialization cost by reusing containers for tasks across DAGs and vertices
Fig. DAGs and tasks can share the same container
PROBLEM - THE LIFE OF SESSION INTERMEDIATE DATA
● DAG shuffle data lives beyond its usefulness.
● The bigger the shuffle, the larger the disk footprint.
Fig. Multiple DAGs can form a session under the same AppMaster; shuffle data from earlier DAGs sits unused for the rest of the session
PROBLEM - THE LIFE OF SESSION INTERMEDIATE DATA (Contd.)
● Session Bloating
• Shuffle data occupies a significant chunk of the disks
• May elbow out other apps on the same node
Fig. Session Bloating: shuffle data from multiple DAGs piles up on the node's disks
GETTING TO DAG DELETE
● Associate intermediate shuffle data with its DAG.
A SOLUTION : DAG DELETE
● A Shuffle Handler that understands DAG deletion requests.
• Container Launchers are the clients
• The Tez Shuffle Handler is the server
• DAG deletion queries are asynchronous HTTP requests sent over multiple threads
Fig. DAG Deletion Architecture and Control Flow: on DagComplete, the Tez AM's Container Launcher Manager launches Container Launchers, each with a Deletion Tracker that addresses the Tez Shuffle Handler on every NodeManager by <NodeId, ShufflePort>
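The fan-out described above can be sketched as follows (a minimal illustration, not the actual Tez implementation: the URL path and query parameters are hypothetical, and `send_delete` stands in for a real HTTP client call):

```python
# Sketch of the deletion-tracker side: on DagComplete, fan out asynchronous
# deletion requests to the shuffle handler on each node, over a thread pool.
from concurrent.futures import ThreadPoolExecutor

def deletion_url(node_id, shuffle_port, app_id, dag_id):
    # each target is addressed by <NodeId, ShufflePort>, as in the figure;
    # the path/params below are illustrative, not the real Tez endpoint
    return (f"http://{node_id}:{shuffle_port}/mapOutput"
            f"?delete=true&job={app_id}&dag={dag_id}")

def send_delete(url):
    # real code would issue the HTTP request and handle failures;
    # returning the URL keeps this sketch self-contained
    return url

def delete_dag_data(targets, app_id, dag_id, threads=4):
    # targets: iterable of (node_id, shuffle_port) pairs
    with ThreadPoolExecutor(max_workers=threads) as pool:
        urls = [deletion_url(n, p, app_id, dag_id) for n, p in targets]
        return list(pool.map(send_delete, urls))

urls = delete_dag_data([("nm1.example.com", 13562), ("nm2.example.com", 13562)],
                       "application_1_0001", "dag_1")
```

Sending the requests asynchronously over a pool keeps DAG completion from blocking on slow or unreachable NodeManagers.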
A SOLUTION : DAG DELETE
● The Deletion Policy is pluggable
• Write your own
• Container Launchers may or may not want one
• Allows different "types" of shuffle data deletion services
• Not every type of container may know its shuffle port
• Implement use-case-specific optimizations
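The pluggable-policy idea can be illustrated with a small sketch (the class and method names here are hypothetical, not the actual Tez interfaces): a Container Launcher installs a policy that decides which nodes receive a deletion request when a DAG completes.

```python
# Hypothetical sketch of a pluggable deletion policy.
from abc import ABC, abstractmethod

class DeletionPolicy(ABC):
    @abstractmethod
    def targets_for_dag(self, dag_id, nodes):
        """Return the (node_id, shuffle_port) pairs to send deletions to."""

class DeleteEverywhere(DeletionPolicy):
    # simplest policy: delete the DAG's shuffle data on every node
    def targets_for_dag(self, dag_id, nodes):
        return list(nodes)

class SkipUnknownPorts(DeletionPolicy):
    # not every container type knows its shuffle port; skip those nodes
    def targets_for_dag(self, dag_id, nodes):
        return [(n, p) for n, p in nodes if p is not None]

nodes = [("nm1", 13562), ("nm2", None), ("nm3", 13562)]
targets = SkipUnknownPorts().targets_for_dag("dag_1", nodes)  # skips nm2
```

A launcher that manages containers without a known shuffle port would pick `SkipUnknownPorts`-style behavior; others might plug in a policy that batches or throttles deletions.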
DAG DELETE EVALUATION
● Defining Waste
• More SHUFFLE_BYTES unused.
• More idle time for the same data.
● Defining the metric: Percentage Reserved Space-Time Savings
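One way to read the "reserved space-time" metric (a hypothetical reconstruction; the talk does not spell out the exact formula) is as the integral of reserved shuffle bytes over time, in byte-seconds, compared before and after DAG delete:

```python
# Reserved space-time as a step-function integral of shuffle bytes over time.
# Samples are (timestamp_sec, shuffle_bytes) points; bytes are held constant
# until the next sample. The numbers below are illustrative, not from the talk.

def space_time(samples):
    total = 0.0
    for (t0, b0), (t1, _) in zip(samples, samples[1:]):
        total += b0 * (t1 - t0)   # bytes held * seconds held
    return total

# DAG 1 writes 4 GB of shuffle data; DAG 2 later writes 8 GB.
# Without DAG delete, DAG 1's data lingers until the session ends.
before = [(0, 4e9), (500, 4e9), (1000, 8e9), (2000, 0)]
# With DAG delete, DAG 1's data is freed at t=500 when the DAG completes.
after  = [(0, 4e9), (500, 0),   (1000, 8e9), (2000, 0)]

savings = 100 * (1 - space_time(after) / space_time(before))  # ~16.7% here
```

Sessions where large shuffle outputs would otherwise linger for a long time score the highest savings, which matches the best/worst cases on the following slides.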
THE BEST OF TIMES
• Percentage Reserved Space-Time Savings: 54.4%
• Multiple DAGs with comparable sizes and runtimes = more savings
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
THE UNEXPECTED TIMES
• Percentage Reserved Space-Time Savings: 33.4%
• Multiple DAGs with nearly equal sizes and runtimes = an entire DAG's worth of savings each time
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
THE UNEXPECTED TIMES
• Percentage Reserved Space-Time Savings: 26.05%
• 9 DAGs, most of them with no shuffle data, interspersed with large shuffle DAGs = reasonable savings!
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
THE WORST TIMES
• Percentage Reserved Space-Time Savings: 0.00004%
• Multiple DAGs with almost no shuffle data, culminating in a shuffle-intensive DAG
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), before and after DAG delete
CONCLUSION
• The Tez Shuffle Handler (TEZ-3334) shows a substantial performance gain through composite fetch
• Shuffle times drop by orders of magnitude
• Actual job times drop by several minutes
• More jobs per unit of time = more throughput
• Disk space utilization is markedly better with DAG Delete
• Shuffle-heavy, long-running sessions show space-time savings of ~50%
• Makes the Tez Shuffle Handler a better partner in a multi-tenant setting
FUTURE WORK
● Vertex Deletion (TEZ-3363)
• Can we predict when vertex intermediate data is stale enough to be deleted?
• How will it impact the worst case we saw with DAG delete?
• Think: delete data once downstream vertices reach a certain depth of completion
● Multi-file (Unordered) (TEZ-3367)
• The unordered case does not need multiple spill files
• How do we keep multiple output files from consuming all the inodes?
• Remove buffers for key-value reads
• Address skew scenarios
FUTURE WORK
● Slow Start with MRInput (TEZ-3274)
• Vertices with both shuffle input and MRInput do not respect slow start
● Empty partition improvements (TEZ-3605)
• The ordered case writes empty partitions to the IFiles
• In a heavily auto-reduced scenario, such partitions are fetched and then thrown away
ACKNOWLEDGEMENTS
● The Apache Tez Community
● Rohini Palaniswamy, PMC/Committer, Apache Pig, Apache Tez, Apache Oozie
● Jason Lowe, PMC/Committer, Apache Hadoop, Apache Tez
● Siddharth Seth, PMC/Committer, Apache Hadoop, Apache Tez, Apache Hive
● Hitesh Shah, PMC/Committer, Apache Hadoop, Apache Tez, Apache Ambari