DAG Delete: Improving Reserved Space Utilization
CONCEPTS AND COMPONENTS
● Apache Tez Session Mode
• Runs multiple queries as part of a single session
• One Application Master, multiple DAGs
• Apache Hive and Hue primarily use session mode
• Many transactions = many DAGs (see the sketch below the figure)
Fig. Multiple DAGs can form a session under the same AppMaster
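A minimal sketch of session mode using the public TezClient API; the DAG construction is elided, and this is illustrative rather than the exact code behind Hive or Hue:

    import java.util.List;
    import org.apache.tez.client.TezClient;
    import org.apache.tez.dag.api.DAG;
    import org.apache.tez.dag.api.TezConfiguration;

    public class SessionExample {
      public static void runSession(TezConfiguration tezConf, List<DAG> dags) throws Exception {
        // The third argument 'true' enables session mode: one AM serves every DAG.
        TezClient tezClient = TezClient.create("example-session", tezConf, true);
        tezClient.start();
        for (DAG dag : dags) {
          // Each submitDAG reuses the same AppMaster; many transactions = many DAGs.
          tezClient.submitDAG(dag).waitForCompletion();
        }
        tezClient.stop();
      }
    }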
CONCEPTS AND COMPONENTS (Contd.)
● What is Container Reuse?
• MapReduce has single-use containers
• Apache Tez avoids initialization cost by reusing containers for tasks across DAGs and vertices (see the configuration sketch below)
Fig. DAGs and tasks (Task Attempt1, Task Attempt2) can share the same YARN container
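Container reuse is controlled by AM configuration; a small sketch using the property constants from org.apache.tez.dag.api.TezConfiguration (the held-containers setting is optional tuning, not required for reuse):

    import org.apache.tez.dag.api.TezConfiguration;

    public class ReuseConfig {
      public static TezConfiguration create() {
        TezConfiguration conf = new TezConfiguration();
        // Reuse containers for tasks across vertices and DAGs instead of
        // paying container launch and JVM initialization cost per task.
        conf.setBoolean(TezConfiguration.TEZ_AM_CONTAINER_REUSE_ENABLED, true);
        // Optionally keep a few idle containers alive between DAGs in a session.
        conf.setInt(TezConfiguration.TEZ_AM_SESSION_MIN_HELD_CONTAINERS, 4);
        return conf;
      }
    }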
PROBLEM - THE LIFE OF SESSION INTERMEDIATE DATA
● DAG shuffle data lives beyond its usefulness.
● The bigger the shuffle, the larger the disk footprint.
Fig. Multiple DAGs can form a session under the same AppMaster; shuffle data from completed DAGs sits unused for the rest of the session
PROBLEM - THE LIFE OF SESSION INTERMEDIATE DATA (Contd.)
● Session Bloating
• Shuffle data occupies a significant chunk of the disks
• May elbow out other apps on the same node
Fig. Session Bloating
A Solution: DAG Delete
GETTING TO DAG DELETE
● Associate intermediate shuffle data with its DAG (illustrated below).
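A hypothetical on-disk layout showing the idea (the actual NodeManager local-dir structure used by Tez may differ): once each DAG's intermediate outputs are grouped under a per-DAG directory, cleaning up after a completed DAG becomes a single recursive delete.

    <yarn-local-dir>/usercache/<user>/appcache/<appId>/
        dag_1/output/attempt_.../file.out   # deletable as soon as DAG 1 completes
        dag_2/output/attempt_.../file.out   # still live while DAG 2 runs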
8
A SOLUTION : DAG DELETE
● A Shuffle Handler that understands DAG deletion requests.
• Container Launchers are clients
• The Tez Shuffle Handler is the server
• DAG deletion requests are asynchronous HTTP requests sent over multiple threads (see the sketch below the figure)
Fig. DAG deletion architecture and control flow: on DagComplete, the Tez AM's Container Launcher Manager (with Container Launcher1/2 and Deletion Tracker1/2) notifies the Tez Shuffle Handler on each NodeManager via its tracked <NodeId, ShufflePort> pairs
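A hedged sketch of this control flow in Java; the endpoint path and query parameters are illustrative placeholders, not the real Tez Shuffle Handler URL scheme:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.Map;
    import java.util.concurrent.Executors;

    public class DagDeleteClient {
      private final HttpClient client = HttpClient.newBuilder()
          .executor(Executors.newFixedThreadPool(4))  // requests fan out over multiple threads
          .build();

      /** Fire-and-forget deletion requests to every node that holds shuffle data. */
      public void onDagComplete(String appId, int dagId, Map<String, Integer> nodeToShufflePort) {
        nodeToShufflePort.forEach((host, port) -> {
          // Placeholder URL: the real handler's path and params differ.
          URI uri = URI.create("http://" + host + ":" + port
              + "/shuffle?delete=dag&appId=" + appId + "&dagId=" + dagId);
          client.sendAsync(HttpRequest.newBuilder(uri).GET().build(),
              HttpResponse.BodyHandlers.discarding());
        });
      }
    }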
A SOLUTION : DAG DELETE
● Deletion Policy is pluggable (a hypothetical interface shape is sketched below)
• Write your own
• Container Launchers may or may not want one
• Allows different “types” of shuffle data deletion services
• Not every type of container may know its shuffle port
• Implement use-case-specific optimizations
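A hypothetical sketch of what a pluggable policy could look like; the names and signatures are invented for illustration, and the actual Tez interfaces differ:

    import java.util.Map;

    /** Hypothetical pluggable policy: decide when and where to delete a DAG's shuffle data. */
    public interface DagDeletionPolicy {
      /** Called by a Container Launcher's deletion tracker when a DAG completes.
       *  Implementations may skip nodes whose shuffle port is unknown. */
      void onDagComplete(String appId, int dagId, Map<String, Integer> nodeToShufflePort);
    }

    /** A trivial policy that deletes everywhere a shuffle port is known. */
    class DeleteEverywhere implements DagDeletionPolicy {
      private final DagDeleteClient client = new DagDeleteClient();
      public void onDagComplete(String appId, int dagId, Map<String, Integer> nodes) {
        client.onDagComplete(appId, dagId, nodes);
      }
    }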
Results
DAG DELETE EVALUATION
● Defining Waste
• More SHUFFLE_BYTES left unused.
• More idle time for the same data.
● Defining the metric (one plausible formalization below)
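The slides leave the metric implicit; a plausible formalization consistent with the charts (an assumption, not the authors' stated definition) treats reserved space-time as the area under the shuffle-bytes curve:

    \text{Savings (\%)} = \frac{\int_0^T B_{\text{before}}(t)\,dt - \int_0^T B_{\text{after}}(t)\,dt}{\int_0^T B_{\text{before}}(t)\,dt} \times 100

where B(t) is the SHUFFLE_BYTES held on disk at time t and T is the session length.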
THE BEST OF TIMES
• Percentage reserved space-time savings: 54.4%
• Multiple DAGs with comparable sizes and runtimes = more savings
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), after vs. before
THE UNEXPECTED TIMES
• Percentage reserved space-time savings: 33.4%
• Multiple DAGs with nearly equal sizes and runtimes = an entire DAG's worth of savings each time
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), after vs. before
THE UNEXPECTED TIMES (Contd.)
• Percentage reserved space-time savings: 26.05%
• 9 DAGs, most of them with no shuffle data, interspersed with large-shuffle DAGs = reasonable savings!
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), after vs. before
THE WORST TIMES
• Percentage reserved space-time savings: 0.00004%
• Multiple DAGs with almost no shuffle data, culminating in a shuffle-intensive DAG
Fig. DAG Shuffle Bytes over Time: SHUFFLE_BYTES (GB) vs. time (sec), after vs. before
CONCLUSION
• The Tez Shuffle Handler (TEZ-3334) shows a phenomenal performance gain through composite fetch.
• Shuffle times drop by orders of magnitude.
• Performance improves, with actual job times dropping by several minutes.
• More jobs per unit of time, more throughput.
• Disk-space utilization is markedly better with DAG Delete.
• Shuffle-heavy, long-running sessions show space-time savings of ~50%.
• Makes the Tez Shuffle Handler a better partner in a multi-tenant setting.
FUTURE WORK
● Vertex Deletion (TEZ-3363)
• Can we predict when vertex intermediate data is stale enough to be deleted?
• How will it impact the worst case we saw with DAG Delete?
• Think: delete data at a certain depth of a vertex's completion
● Multi-file (Unordered) (TEZ-3367)
• The unordered case does not need multiple spill files
• How do we keep multiple output files from consuming all the inodes?
• Remove buffers for key-value reads
• Address the skew scenario
FUTURE WORK (Contd.)
● Slow Start with MRInput (TEZ-3274)
• Vertices with both shuffle input and MRInput do not respect slow start
● Empty partition improvements (TEZ-3605)
• The ordered case writes empty partitions to the IFiles
• In a heavily auto-reduced scenario, such partitions are fetched and then thrown away
ACKNOWLEDGEMENTS
● The Apache Tez Community
● Rohini Palaniswamy, PMC/Committer, Apache Pig, Apache Tez, Apache Oozie
● Jason Lowe, PMC/Committer, Apache Hadoop, Apache Tez
● Siddharth Seth, PMC/Committer, Apache Hadoop, Apache Tez, Apache Hive
● Hitesh Shah, PMC/Committer, Apache Hadoop, Apache Tez, Apache Ambari