This document discusses scheduling data transfer operations with advance reservation and provisioning. It proposes dividing time into windows where network bandwidth availability is stable. When a data transfer request is received, the scheduler checks all possible time windows to see if the request can fit within bandwidth constraints. If no window is available, it tries shifting existing transfers to earlier windows if they have less "desire" based on number of occupied time slots and order of the window. This allows requests to be scheduled in advance while minimizing disruption to existing transfers.
DATA TRANSFER SCHEDULING WITH ADVANCE RESERVATION
1. DATA TRANSFER SCHEDULING WITH ADVANCE RESERVATION AND PROVISIONING
MEHMET BALMAN
Ph.D. defense: May 7, 2010 (11:30 am - 297 Coates Hall, LSU )
2. Motivation
Scientific applications are becoming more data intensive
(dealing with petabytes of data)
Complex middleware is required to manage the end-to-end
distribution of data
Need to orchestrate the use of system, storage and
network resources between collaborating parties
Need to organize data transfer operations according to
given user requirements
Need to plan in advance and reserve the time period
for the data movement operations
3. Thesis Statement
We need data transfer scheduling with advance
reservation and provisioning to allow researchers to use
data placement as-a-service where they can plan ahead
and reserve time/resources for their data movement
operations
5. Introduction
We are in a new era that offers new opportunities to
conduct scientific research with the help of computation
Computation intensive science: particle physics, climate
modelling, bio-informatics simulations
Scientific simulations and experimental facilities
generate massive data sets
Climate modeling data
35 terabytes shared by more than 2500 users worldwide
Next generation archive will be more than 650 terabytes
Large Hadron Collider
Expected to generate 100 gigabits per second
6. Introduction
Large scale applications necessitate collaborations
Require mass storage systems
Data need to be transferred to remote sites for
further analysis (validate with simulations)
Need on demand high speed data access between
collaborating parties
High performance visualization
Large volume data analysis
7. Existing systems
Next generation research networks such as ESNet and Internet2
provide high-speed on-demand data access between collaborating
institutions by delivering network-as-a-service
On-Demand Secure Circuits and Advance Reservation System
(OSCARS)
Guaranteed bandwidth (at certain time, for a certain bandwidth and
length of time)
Co-allocation for storage and network resources (HARC)
No scheduling or organization (interface to allocate resources at the same time)
Data Transfer Scheduling (Stork)
Storage Resource Management (SRM)
8. Use Case
A scientific application generates an immense amount of simulation
data using supercomputing resources
The generated data is stored in a temporary space and needs to
be moved to a data repository for further processing or archiving
Another application may be waiting for this generated data as its
input to start execution
Delaying the data transfer operation or completing the transfer
long after the expected time may create several problems
(other resources are waiting for this transfer operation to
complete)
When will the data be ready to move into a remote repository?
9. Problems in existing systems
Data Transfer Scheduling:
Optimizing for performance and resource utilization
What about user requirements and priorities ?
Advance Resource Allocation?
Deadline, allocated for future time (planning)
Coordination between resource managers (very little progress)
Time/Resource Conflicts
Time Constraints (using strict start/end times)
Users cannot allocate/reserve the data placement service in
advance (scheduling with advance reservation and provisioning)
Need to orchestrate advanced system and network allocation
together for data movements
11. Methodology
We developed a new data scheduling paradigm
accept time constraints
allow users to plan ahead
orchestrate resource allocation
provide advance resource reservation
reserve the scheduler’s time for future data movement
operations
Time Constraints:
Earliest start time
Latest completion time
Resource Constraints:
Data Volume source >network >destination
Source
Destination
12. Methodology
The scheduler checks the availability of resources in a given
time period and determines whether the requested operation can be
satisfied with the given time constraints
The server and network capacity are allocated for the
future time period in advance
The scheduler considers other requests reserved for future time
windows and re-orders operations in the current time period
Execution Phase: re-organization, tuning, and ordering
Failure-awareness
Job Aggregation
Dynamic Adaptation in data transfers
13. Problem
A data transfer job: ( earliest start time, latest completion
time, volume, source, destination)
Constraints:
server capacity (data transfer node)
network capacity (network link)
Single job
Advance Network Reservation
Multiple jobs
Scheduling with Time and Resource Constraints (literature)
Scheduling with Advance Reservation
15. Network Reservation
Bandwidth allocation between edge routers
Current systems provide yes/no answers to a reservation
request for (bandwidth, start_time, end_time).
Clients are not given other possible options
Does not provide an optimal choice for client
May cause ineffective use of overall system
Overload system with trial-and-error attempts
How can we enhance the reservation system?
Submit constraints and the system suggests possible
reservations satisfying requirements
16. End-to-end data movement
End-to-end High Performance Data Movement
Bandwidth network reservation
Bandwidth provisioning in client sites
Storage allocation
Therefore, we need coordination between Storage Resource
Managers and Network Resource Allocation
But the requested bandwidth cannot be guaranteed
Trial and error until an available reservation is found
17. Reservation Engine
Improve advance network reservation systems by presenting
clients with the possible reservation options and alternatives for
earliest completion time and shortest transfer duration.
A new service:
Users provide maximum bandwidth they can use, total size of the
data requested to be transferred, the earliest start time, and the
latest completion time.
Users can set criteria such that they would like to reserve a path
for earliest completion time or reserve a path for shortest transfer
duration.
The reservation engine finds out the reservation for the earliest
completion or for the shortest duration
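A minimal sketch of how such a service could choose between the two criteria, assuming the engine already holds a list of candidate intervals with a constant available bandwidth. The names `Candidate`, `options` and `pick`, the units, and the sample numbers are illustrative assumptions and are not taken from the original system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: float      # interval start (abstract time units)
    end: float        # interval end
    bandwidth: float  # bandwidth available throughout the interval


def options(candidates, volume, max_bw, earliest, latest):
    """Return (begin, finish, duration) for every candidate that can carry the volume."""
    result = []
    for c in candidates:
        rate = min(c.bandwidth, max_bw)   # the user cannot use more than max_bw
        if rate <= 0:
            continue
        begin = max(c.start, earliest)    # cannot start before the data is ready
        duration = volume / rate
        finish = begin + duration
        if finish <= min(c.end, latest):  # must end inside the interval and before the deadline
            result.append((begin, finish, duration))
    return result


def pick(opts, criterion):
    """criterion is 'earliest_completion' or 'shortest_duration' (hypothetical labels)."""
    if not opts:
        return None
    if criterion == "earliest_completion":
        return min(opts, key=lambda o: o[1])
    return min(opts, key=lambda o: o[2])


# Illustrative request: 500 units of data, user limit 1000, search interval 0..10.
cands = [Candidate(0, 10, 100), Candidate(5, 10, 500)]
opts = options(cands, volume=500, max_bw=1000, earliest=0, latest=10)
first_done = pick(opts, "earliest_completion")
shortest = pick(opts, "shortest_duration")
```

In this example the two criteria select different reservations: the low-bandwidth interval finishes first because it can start immediately, while the high-bandwidth interval gives the shorter transfer.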
19. Time dependent network flow (difference)
[Figure: timeline with time steps t1 to t6]
Not suitable for bandwidth guaranteed paths!
20. Approach
In our approach, the search interval is divided into time windows
A time window represents a period of time where we have a
stable status of available bandwidth of all related links
A snapshot of the network topology in this time window
• Search through these time windows to check whether we can
satisfy the requested allocation for that time window.
• First, check the duration of the time window
– Can we satisfy the user request in that time window?
(we know the max bandwidth user can support)
• Then, calculate the max bandwidth available in the time window
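The two checks above can be sketched as follows. This is an illustrative sketch only: `max_bandwidth` stands for whatever routine computes the available bandwidth in a window (a hypothetical callback here), and the function name and units are assumptions.

```python
def feasible_windows(windows, volume, user_max_bw, max_bandwidth):
    """Yield (start, end, bandwidth) for windows that can carry the volume.

    windows       -- iterable of (start, end) pairs
    volume        -- amount of data, in bandwidth units x time units
    user_max_bw   -- highest rate the user can sustain (must be > 0)
    max_bandwidth -- hypothetical callback: (start, end) -> available bandwidth
    """
    for start, end in windows:
        duration = end - start
        # Cheap check first: even at the user's top rate the window is too short.
        if volume / user_max_bw > duration:
            continue
        # Only now pay for the bandwidth computation.
        bw = min(max_bandwidth(start, end), user_max_bw)
        if bw > 0 and volume / bw <= duration:
            yield (start, end, bw)
```

Checking the duration before the bandwidth avoids a path computation for windows that are too short regardless of network state.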
21. Time Windows
Reservation 1: (time t1, t6) A -> B -> D (900 Mbps)
Reservation 2: (time t4, t7) A -> C -> D (400 Mbps)
Reservation 3: (time t9, t12) A -> B -> D (700 Mbps)
[Figure: network topology with nodes A, B, C, D and link capacities of 800, 900, 500, 1000 and 300 Mbps, above a timeline t1 to t13 showing when Reservations 1, 2 and 3 are active]
22. Time Steps and Time Windows
Time windows between t1 and t13
[Figure: the timeline t1 to t13 with Reservations 1, 2 and 3 is reduced to time steps at the boundaries t1, t4, t6, t7, t9, t12, t13]
Active reservations per time step:
t1-t4: Res 1
t4-t6: Res 1, 2
t6-t7: Res 2
t7-t9: none
t9-t12: Res 3
t12-t13: none
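The reduction from reservations to time steps and time windows can be sketched as below, using the three reservations of the example with t1..t13 written as the integers 1..13. The function names are illustrative assumptions.

```python
def time_steps(reservations, search_start, search_end):
    """Split the search interval at every reservation boundary inside it."""
    points = {search_start, search_end}
    for start, end in reservations:
        for t in (start, end):
            if search_start < t < search_end:
                points.add(t)
    bounds = sorted(points)
    return list(zip(bounds, bounds[1:]))


def time_windows(steps):
    """Every run of consecutive time steps is a candidate time window."""
    return [(steps[i][0], steps[j][1])
            for i in range(len(steps))
            for j in range(i, len(steps))]


reservations = [(1, 6), (4, 7), (9, 12)]   # Reservations 1, 2 and 3
steps = time_steps(reservations, 1, 13)
windows = time_windows(steps)
```

Within each time step the set of active reservations does not change, so the available bandwidth on every link is constant there.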
23. [Figure: network snapshots showing the available bandwidth on each link in every time step]
Available bandwidth (Mbps):
Time step   Active     A-B    A-C    B-C    B-D    C-D
t1-t4       Res 1      100    800    300      0    500
t4-t6       Res 1,2    100    400    300      0    100
t6-t7       Res 2     1000    400    300    900    100
t7-t9       none      1000    800    300    900    500
25. Time Windows
[Figure: timeline with the time steps Res 1 (t1-t4), Res 1,2 (t4-t6), Res 2 (t6-t7), none (t7-t9), Res 3 (t9-t12)]
Max bandwidth from A to D (window length in parentheses)
 1. t1-t4: 900 Mbps (3)
 2. t4-t6: 100 Mbps (2)
 3. t1-t6: 100 Mbps (5)
 4. t6-t7: 900 Mbps (1)
 5. t4-t7: 100 Mbps (3)
 6. t1-t7: 100 Mbps (6)
 7. t7-t9: 900 Mbps (2)
 8. t6-t9: 900 Mbps (3)
 9. t4-t9: 100 Mbps (5)
10. t1-t9: 100 Mbps (8)
Reservation: (A to D) (100 Mbps) start=t1 end=t9
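The listed values are consistent with treating a window's bandwidth as the minimum over the time steps it covers. The sketch below assumes exactly that, and takes the per-step values (900, 100, 900, 900 Mbps) from the single-step windows in the list; the function name is illustrative.

```python
def window_bandwidths(steps):
    """steps: consecutive (start, end, bandwidth) time steps.

    Returns {(window_start, window_end): bandwidth}, where a window's
    bandwidth is the bottleneck (minimum) of the steps it spans.
    """
    result = {}
    for i in range(len(steps)):
        bottleneck = float("inf")
        for j in range(i, len(steps)):
            bottleneck = min(bottleneck, steps[j][2])
            result[(steps[i][0], steps[j][1])] = bottleneck
    return result


# Per-step bandwidth from A to D, with t1..t9 written as 1..9.
steps = [(1, 4, 900), (4, 6, 100), (6, 7, 900), (7, 9, 900)]
table = window_bandwidths(steps)
```

Extending a window by one step costs a single `min`, so all windows are covered in time quadratic in the number of steps.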
27. Time and Resource Conflicts
File Transfer with start/end times - NP-hard!
How to represent time dependency?
Cannot benefit from known network algorithms
(max flow, min cut, shortest path)
NP-hard even for networks with a single link
Knapsack problem
Unsplittable flow problem (see also network coding in
routing)
Max edge disjoint path problem
Online / Offline ?
Greedy Approaches / practical ?
28. File Transfer Scheduling Demystified
A simple case
n nodes connected to each other
Each node can transfer maximum C(n) files at a time
There are m files to be transferred
A file needs to be sent from node i to node j
Files may have different sizes, which define the amount of time
required for the transfer
Objective is to minimize the total transfer time
This is a common type of assignment problem,
and it is NP-hard!
29. Special cases
Network is a bipartite graph
Max concurrency is 1, one file at a time
File sizes are the same; each file takes the same amount of time to
transfer
Graph coloring
Bipartite cardinality matching
What if each file has a specific cost
cost can be associated with the file size?
Hungarian problem?
But if we are able to transfer more than a single file at a
time using the same node (NP-hard)
If sharing bandwidth, it becomes even harder
30. Source > Network > Destination
[Figure: network topology with nodes A, B, C, D and link capacities of 800, 900, 500, 1000 and 300 Mbps, plus additional nodes labeled n1 and n2]
Node capacity?
Now we have multiple jobs, need to find a
schedule
32. With start/end times
Each transfer request has start and end times
n transfer requests are given (each request has a specific
amount of profit)
Objective is to maximize the profit
If the profit is the same for each request, then the objective is to
maximize the number of jobs in a given time period
Unsplittable Flow Problem:
An undirected graph,
route demand from source(s) to destination(s) and
maximize/minimize the total profit/cost
33. Why UFP?
We represent time as a discrete variable (recall time slots and time
windows in Network Reservation Engine)
Ex:
job1: (start time t1, end time t10)
job2: (start time t5, end time t20)
Time slots
1: (t1,t5)
2: (t5,t10)
3: (t10,t20)
Job 1 spans time slots 1 and 2
Job 2 spans time slots 2 and 3
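The example above can be reproduced with a short sketch, writing t1, t5, t10 and t20 as integers. The function names are illustrative assumptions.

```python
def time_slots(jobs):
    """Cut the timeline at every job start and end time."""
    points = sorted({t for start, end in jobs.values() for t in (start, end)})
    return list(zip(points, points[1:]))


def slots_spanned(job, slots):
    """Return the 1-based indices of the slots lying inside the job's interval."""
    start, end = job
    return [i for i, (a, b) in enumerate(slots, start=1)
            if a >= start and b <= end]


jobs = {"job1": (1, 10), "job2": (5, 20)}
slots = time_slots(jobs)
spans = {name: slots_spanned(job, slots) for name, job in jobs.items()}
```

Once time is discrete, each job is a demand over a set of consecutive slots, which is what allows the problem to be stated as an unsplittable flow problem.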
35. Knapsack Problem ?
If there is only one link, edge capacity is same
Profit is also same for each job
If there are start/end times, even for a single link, it is NP-hard!
Note that: UFP specializes to max edge disjoint path problem
Scheduling with conflicts is hard
Online scheduling is harder
36. Dynamic networks/ Job requirements
At each time slot we may have different edge/node capacity
Transfer request comes with
total amount of data to be sent (volume)
Desired time period (earliest start, latest finish)
Objective is to find a sequence of time slots (time window) in which this
transfer can be sent satisfying the given criteria
Each time slot has a specific capacity
Each time window consists of one or more time slots
Time window 1: time slot 1 to 5
Time window 3: time slot 3 to 10
….
What if there is node capacity?
Assigning a request will affect available capacity in two nodes and one
edge
38. Approach
Should make a decision quickly
Is it really a good idea to schedule many jobs at the same
time, so that they overlap and share the total
bandwidth?
39. Definition
A network with n nodes
Each connected to each other (mesh)
Each connection (edge) has a specific maximum
capacity
Each node has a maximum capacity separate for
incoming and outgoing transfers
This is implementation specific and does not change the
algorithm complexity ( O(n * s^2) )
40. Definition
Time constraints (Earliest start / latest complete)
When will the data be ready?
When is the deadline?
Find an allocation (start/end times) for the job
Can shift to another time slot or not
Locked or unlocked jobs
Online scheduling:
Displace other jobs to open space for the new request
we can shift max n jobs?
41. Methodology
Receive a job
Find all possible time windows for this job
If it fits in any of them, allocate
If not, try each time window starting from the earliest
If there is a job with less ‘desire/preference’ which can
shift and still satisfy its criteria, allocate the time window
If none found, extend latest finish time by adding time
slot(s)
Search new time windows to fit one
If none found, reject the job
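The admission steps above can be sketched for a deliberately simplified setting: one link, equal-length slots, and each job needing a fixed number of consecutive slots. Everything in this sketch (`Job`, `Scheduler`, the single-link model) is an assumption made for illustration. The step that extends the latest finish time is omitted, and any unlocked job is tried for displacement instead of ranking jobs by desire.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    earliest: int         # first slot the job may use
    latest: int           # slot index by which it must be finished (exclusive)
    length: int           # number of consecutive slots needed
    demand: int           # bandwidth needed in every slot
    locked: bool = False  # locked jobs are never shifted


class Scheduler:
    def __init__(self, capacity, num_slots):
        self.capacity = capacity
        self.used = [0] * num_slots   # bandwidth already reserved per slot
        self.placed = {}              # job name -> (job, start slot)

    def _fits(self, job, s):
        return all(self.used[t] + job.demand <= self.capacity
                   for t in range(s, s + job.length))

    def _windows(self, job):
        """All start slots that satisfy the job's time and capacity constraints."""
        return [s for s in range(job.earliest, job.latest - job.length + 1)
                if self._fits(job, s)]

    def _reserve(self, job, s):
        for t in range(s, s + job.length):
            self.used[t] += job.demand
        self.placed[job.name] = (job, s)

    def _release(self, job, s):
        for t in range(s, s + job.length):
            self.used[t] -= job.demand
        del self.placed[job.name]

    def submit(self, job):
        """Return True if the job is admitted, False if it is rejected."""
        windows = self._windows(job)
        if windows:                               # fits as is: take the earliest
            self._reserve(job, windows[0])
            return True
        for s in range(job.earliest, job.latest - job.length + 1):
            for other, old in list(self.placed.values()):
                if other.locked:
                    continue
                self._release(other, old)         # tentatively displace one job
                if self._fits(job, s):
                    self._reserve(job, s)
                    alternatives = self._windows(other)
                    if alternatives:              # displaced job still meets its criteria
                        self._reserve(other, alternatives[0])
                        return True
                    self._release(job, s)         # undo and try the next candidate
                self._reserve(other, old)
        return False                              # committed jobs are never broken
```

A rejected request leaves the existing schedule exactly as it was, which mirrors the rule that an admitted job is never dismissed.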
42. Methodology
Never accept a job if it causes other committed jobs to
break their criteria
A job’s reservation is locked if it is delayed, close to its
deadline, or has failed and been restarted
What if a job cannot be finished by its deadline?
Resubmit with the highest priority
43. Methodology
• For each job we calculate the possible time windows
• When a time window is reserved for a job:
• We keep track of the number of time slots in this time
window
• Ts_num
• The order of the time window (sooner is better)
• Tw_order = tw_id / total time windows for this job
• Desire/Preference is defined by both Ts_num and
Tw_order
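The slides do not say how Ts_num and Tw_order are combined, so the sketch below makes an assumption: it compares jobs by Ts_num first and uses Tw_order as a tie-breaker. The function name and the sample values are illustrative only.

```python
def desire(ts_num, tw_id, total_windows):
    """Preference of a reserved job; larger tuples mean 'less willing to shift'.

    ts_num        -- number of time slots in the reserved time window
    tw_id         -- position of the reserved window among the job's windows
    total_windows -- number of possible time windows for this job
    """
    tw_order = tw_id / total_windows
    return (ts_num, tw_order)   # assumed combination: lexicographic order


# Hypothetical reserved jobs; the one with the smallest desire is shifted first.
reserved = {
    "job_a": desire(2, 1, 4),
    "job_b": desire(2, 3, 4),
    "job_c": desire(5, 1, 2),
}
shift_candidate = min(reserved, key=reserved.get)
```

Any other total order over jobs could be substituted for the tuple comparison without changing the rest of the procedure.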
44. Methodology
Providing a framework for scheduling data transfers
with advance allocation
Ts_num shows overlaps with other transfers
The job with higher Ts_num has higher priority
Already overlapping with more transfers, don’t shift
Tw_order shows time slots left to deadline
The job with higher Tw_order has higher priority
Closer to the deadline (in terms of time slots, not real time)
Any preference model works (even random ranking)
45. Recall Time Windows
[Figure repeated from slide 25: the candidate time windows between t1 and t9 with the max bandwidth from A to D in each]
Reservation: (A to D) (100 Mbps) start=t1 end=t9
49. Evaluation
Not studied before (a special case of UFP)
UFP is already recent
Planning ahead (gives opportunity for co-allocation)
With the help of given search interval (earliest start /
latest complete)
flight reservation example
The solution uses a unique approach in preference
Time slots, time windows (novel approach)
Gives a polynomial approximation algorithm
The preference converts the UFP problem into Dijkstra path search
Uses failure-awareness, early error detection
50. Evaluation
Encourages users to submit reasonable time constraints
If no window is found in the first round, don’t try to displace any other job
Fair (never dismiss a previously admitted job)
Linear search (displace a job only once in a search round)
Utilizes time windows/time steps for ranking (better than
earliest deadline first)
Earliest completion + shortest duration
Minimize concurrency
Even random ranking would work (relaxation of an NP-hard
problem)
51. Evaluation
Network Reservation
Can list/search all possible time windows in polynomial time
Searching time windows is FAST!
r: reservation
time steps (s): 2r+1
Time windows (w): s(s+1)/2
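A short check of the two counts, read as upper bounds: r reservations contribute at most 2r distinct boundaries, which cut the timeline into at most 2r+1 time steps, and every run of consecutive steps is one time window. The function names are illustrative.

```python
def max_time_steps(r):
    """Upper bound on time steps for r reservations."""
    return 2 * r + 1


def max_time_windows(s):
    """Number of runs of consecutive steps among s time steps."""
    return s * (s + 1) // 2


def enumerate_windows(s):
    """Brute-force enumeration: (first boundary, last boundary) index pairs."""
    return [(i, j) for i in range(s) for j in range(i + 1, s + 1)]


s = max_time_steps(3)
w = max_time_windows(s)
```

Since w grows quadratically in r, listing all windows stays polynomial in the number of reservations.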
52. Time Window List (special data structures)
Time windows list (initially): now | infinite
New reservation: reservation 1, start t1, end t10
  now | t1 [Res 1] t10 | infinite
New reservation: reservation 2, start t12, end t20
  now | t1 [Res 1] t10 | t12 [Res 2] t20 | infinite
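One possible way to keep such a list is a sorted array of boundaries with the set of active reservations stored per segment. This is a sketch under that assumption; the class name `TimeWindowList` and its layout are illustrative and not taken from the NRE library.

```python
import bisect
import math


class TimeWindowList:
    def __init__(self, now=0):
        self.bounds = [now, math.inf]   # sorted segment boundaries
        self.active = [set()]           # active[i] covers [bounds[i], bounds[i+1])

    def _split(self, t):
        """Make t a boundary and return its index."""
        i = bisect.bisect_left(self.bounds, t)
        if i < len(self.bounds) and self.bounds[i] == t:
            return i                    # already a boundary
        self.bounds.insert(i, t)
        # The new segment inherits the reservations of the segment it was cut from.
        self.active.insert(i, set(self.active[i - 1]))
        return i

    def add(self, name, start, end):
        if not (self.bounds[0] <= start < end):
            raise ValueError("reservation must start at or after 'now' and end later")
        i = self._split(start)
        j = self._split(end)
        for k in range(i, j):
            self.active[k].add(name)


twl = TimeWindowList(now=0)
twl.add("res1", 1, 10)
twl.add("res2", 12, 20)
```

Each insertion touches only the segments between the two new boundaries, so the list never has to be rebuilt from scratch.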
53. Testing the NRE library
Each point is the average of 100 measurements
Set 1: sparse graph
Set 2: dense graph
Random graph:
58. Failure-Awareness and Error Detection
Dynamic Environment:
data transfers are prone to frequent failures
what went wrong during data transfer?
No access to the remote resources
Messages get lost due to system malfunction
Instead of waiting for failure to happen
Detect possible failures and malfunctioning services
Search for another data server
Alternate data transfer service
Use Network Exploration Techniques
Check availability of the remote service
Resolve host and determine connectivity failures
Detect available data transfer services
Error while transfer is in progress? Retry or not?
When to re-initiate the transfer? Use alternate protocols?
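A minimal sketch of an early check that separates name resolution failures from connectivity failures before a transfer is started. It uses only the Python standard `socket` module; the function name `probe` and the three result labels are illustrative assumptions, and this is not the detection code used in the original work.

```python
import socket


def probe(host, port, timeout=2.0):
    """Return 'ok', 'resolve-failed' or 'connect-failed' for a TCP service."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "resolve-failed"         # host name cannot be resolved
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(addr)         # service is reachable on this address
                return "ok"
        except OSError:
            continue                    # try the next resolved address
    return "connect-failed"             # resolved, but nothing is listening
```

A scheduler could use the result to decide whether to postpone the job, retry later, or look for an alternate data server.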
59. Error Classification
• Data transfer protocols do not always return appropriate error codes
• Using error messages generated by the data transfer protocol
• A better logging facility and classification
• Recover from Failure
  • Retry failed operation
  • Postpone scheduling of a failed operation
• Early Error Detection
  • Initiate transfer when erroneous condition recovered
  • Or use alternate options
60. Failure-aware scheduling
SCOOP data - Hurricane Gustav Simulations
Hundreds of files (250 data transfer operations)
Small (100 MB) and large files (1 GB, 2 GB)
61. Job Aggregation
Multiple data movement jobs are combined and processed as
a single transfer job
Information about the aggregated job is stored in the job queue
and it is tied to a main job which is actually performing the
transfer operation such that it can be queried and reported
separately.
Hence, aggregation is transparent to the user
We have seen vast performance improvement, especially with
small data files
decreasing the amount of protocol usage
reducing the number of independent network connections
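A sketch of the grouping step, assuming jobs are aggregated when they share the same source and destination host. The job tuple layout, the function name and the example host names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def aggregate(jobs):
    """Group job ids by (source host, destination host).

    jobs -- iterable of (job_id, source_url, destination_url)
    Each group can be handed to one main transfer job, while the
    individual job ids stay available for querying and reporting.
    """
    groups = defaultdict(list)
    for job_id, src, dst in jobs:
        key = (urlparse(src).netloc, urlparse(dst).netloc)
        groups[key].append(job_id)
    return dict(groups)


# Hypothetical queue contents.
queue = [
    (1, "gsiftp://src.example/data/f1", "gsiftp://dst.example/store/f1"),
    (2, "gsiftp://src.example/data/f2", "gsiftp://dst.example/store/f2"),
    (3, "gsiftp://other.example/data/f3", "gsiftp://dst.example/store/f3"),
]
groups = aggregate(queue)
```

Jobs 1 and 2 would share one connection, which is where the savings for small files come from.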
62. Job Aggregation
Experiments on LONI (Louisiana Optical Network Initiative) :
1024 transfer jobs from Ducky to Queenbee (rtt avg 5.129 ms) - 5MB data file
per job
63. Dynamic Tuning in Data Transfer
Operations
End-to-end bulk data transfer (latency wall)
Transfer data by chunks (partial transfers) and also set control parameters on
the fly.
Gradually increase the number of parallel streams until it reaches an
equilibrium point
No need to probe the system and make measurements with external profilers
Does not require any complex model for parameter optimization
Adapts to changing environment
But, overhead in changing parallelism level
Fast start (exponentially increase the number of parallel streams)
64. Adaptive Tuning
Start with single stream (n=1)
Measure instant throughput for every data chunk transferred
(fast start)
Increase the number of parallel streams (n=n*2),
transfer the data chunk
measure instant throughput
If current throughput value is better than previous one, continue
Otherwise, set n to the old value and gradually increase parallelism
level (n=n+1)
If no throughput gain by increasing number of streams (found the
equilibrium point)
Increase chunk size (delay measurement period)
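The search for the equilibrium point can be sketched as below. `measure` is a hypothetical callback that transfers one data chunk with n parallel streams and returns the observed throughput; the `max_streams` cap is an added assumption, and the final chunk-size increase is left out.

```python
def tune_streams(measure, max_streams=64):
    """Return (streams, throughput) at the equilibrium point."""
    n = 1
    best = measure(n)                  # start with a single stream
    # Fast start: double the stream count while throughput improves.
    while n * 2 <= max_streams:
        throughput = measure(n * 2)
        if throughput <= best:
            break                      # keep the old value of n
        n, best = n * 2, throughput
    # Gradual phase: add one stream at a time while throughput improves.
    while n + 1 <= max_streams:
        throughput = measure(n + 1)
        if throughput <= best:
            break                      # no gain: equilibrium reached
        n, best = n + 1, throughput
    return n, best


def simulated_throughput(n):
    """Made-up throughput curve that peaks at 5 streams."""
    return 100 * n if n <= 5 else 500 - 100 * (n - 5)


streams, throughput = tune_streams(simulated_throughput)
```

Every measurement doubles as a real chunk transfer, so no separate probing run is needed.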
66. New Transfer Modules
• Verify the successful completion of the operation by
checking the checksum and file size.
• Transfer module can recover from a failed operation by
restarting from the last transmitted file. In case of a retry
from a failure, scheduler informs the transfer module to
recover and restart the transfer using the information from
a rescue file created by the checkpoint-enabled transfer
module.
• An “intelligent” (dynamic tuning) alternative to Globus RFT
(Reliable File Transfer)
68. Conclusion
We developed a new data transfer scheduling paradigm in which
data movement operations are scheduled in advance
Analyze scheduling with time and resource constraints
Show a scheduling model with advance reservation
Our methodology provides a basis for provisioning end-to-end
high performance data transfers in which users submit their jobs
with time and resource constraints to make an advance schedule.
Our future work includes implementation and integration of the
online scheduling algorithm with advance reservation.
69. Contributions
Presented a novel approach for path finding in time-dependent
networks
By taking advantage of user provided parameters of total volume and
time constraints.
Presented a new algorithm to find reservation path with options
for earliest completion time and shortest transfer duration
Proposed an approximation algorithm using time windows and time
steps for data transfer scheduling with advance reservations
Coordination of system (data node) and network (link) resources
Some other contributions include reliability, adaptability, and
performance optimization of data placement tasks
70. Selected Publications
Error Detection and Error Classification: Failure Awareness in Data Transfer Scheduling,
IJAC 2010
Semantic Enabled Metadata Management in PetaShare, IJGUC 2009
A New Paradigm: Data-Aware Scheduling in Grid Computing, FGCS, Elsevier 2009
Data-Aware Distributed Computing with Stork Data Scheduler, SEE-Grid, 2010
Dynamic Adaptation of Parallelism Level in Data Transfer Scheduling, ASHEs-CISIS 2009
Early Error Detection and Classification in Data Transfer Scheduling, 3PGIC-CISIS 2009
Choosing Between Remote I/O versus Staging in Large Scale Distributed Applications, PDCCS 2008
Dynamically Tuning Level of Parallelism in Wide Area Data Transfers, DADC 2008
Data Scheduling for Large Scale Distributed Applications, ICEIS 2008
Intermediate Gateway Service to Aggregate and Cache I/O operations into Data Repositories, USENIX
FAST 2009
From Micro- to Macro-processing: A Generic Data Management Model, Grid 2007
An Efficient Reservation Algorithm for Advanced Network Provisioning, LBNL –TR 2010
A New Approach in Advance Network Reservation and Provisioning for High Performance Scientific
Data Transfers, LBNL-TR 2010
Advance Network Reservation and Provisioning for Science, LBNL-TR 2009