Balman dissertation Copyright @ 2010 Mehmet Balman

Published in: Technology, Education

  1. 1. DATA TRANSFER SCHEDULING WITH ADVANCE RESERVATION AND PROVISIONING. MEHMET BALMAN. Ph.D. defense: May 7, 2010 (11:30 am, 297 Coates Hall, LSU)
  2. 2. Motivation: Scientific applications are becoming more data intensive (dealing with petabytes of data). Complex middleware is required to manage the end-to-end distribution of data. Need to orchestrate the use of system, storage and network resources between collaborating parties. Need to organize data transfer operations according to given user requirements. Need to plan in advance and reserve the time period for the data movement operations.
  3. 3. Thesis Statement: We need data transfer scheduling with advance reservation and provisioning to allow researchers to use data placement as-a-service, where they can plan ahead and reserve time/resources for their data movement operations.
  4. 4. Outline: Introduction; Methodology; Advance Network Reservation; Scheduling with Time and Resource Constraints; Scheduling with Advance Reservation; Executing Data Transfer Operations; Conclusion
  5. 5. Introduction: We are in a new era that offers new opportunities to conduct scientific research with the help of computation. Computation-intensive science: particle physics, climate modelling, bio-informatics simulations. Scientific simulations and experimental facilities generate massive data sets. Climate modeling data: 35 terabytes shared by more than 2500 users worldwide; the next generation archive will be more than 650 terabytes. Large Hadron Collider: expected to generate 100 gigabits per second.
  6. 6. Introduction: Large scale applications necessitate collaborations. Require mass storage systems. Data need to be transferred to remote sites for further analysis (validate with simulations). Need on-demand high-speed data access between collaborating parties. High performance visualization. Large volume data analysis.
  7. 7. Existing systems: Next generation research networks such as ESnet and Internet2 provide high-speed on-demand data access between collaborating institutions by delivering network-as-a-service. On-Demand Secure Circuits and Advance Reservation System (OSCARS): guaranteed bandwidth (at a certain time, for a certain bandwidth and length of time). Co-allocation for storage and network resources (HARC): no scheduling or organization (an interface to allocate resources at the same time). Data Transfer Scheduling (Stork). Storage Resource Management (SRM).
  8. 8. Use Case: A scientific application generates an immense amount of simulation data using supercomputing resources. The generated data is stored in a temporary space and needs to be moved to a data repository for further processing or archiving. Another application may be waiting for this generated data as its input to start execution. Delaying the data transfer operation, or completing the transfer much later than expected, may create several problems (other resources are waiting for this transfer operation to complete). When will it be ready to move data into a remote repository?
  9. 9. Problems in existing systems: Data transfer scheduling: optimizing for performance and resource utilization. What about user requirements and priorities? Advance resource allocation? Deadline, allocated for future time (planning). Coordination between resource managers (very little progress). Time/resource conflicts. Time constraints (using strict start/end times). Users cannot allocate/reserve the data placement service in advance (scheduling with advance reservation and provisioning). Need to orchestrate advance system and network allocation together for data movements.
  10. 10. Outline: Introduction; Methodology; Advance Network Reservation; Scheduling with Time and Resource Constraints; Scheduling with Advance Reservation; Executing Data Transfer Operations; Conclusion
  11. 11. Methodology: We developed a new data scheduling paradigm: accept time constraints; allow users to plan ahead; orchestrate resource allocation; provide advance resource reservation; reserve the scheduler's time for future data movement operations. Time constraints: earliest start time, latest completion time. Resource constraints: data volume, source > network > destination.
  12. 12. Methodology: The scheduler checks the availability of resources in a given time period and determines whether the requested operation can be satisfied within the given time constraints. The server and network capacity are allocated for the future time period in advance. The scheduler considers other requests reserved for future time windows and re-orders operations in the current time period. Execution phase: re-organization, tuning, and ordering. Failure-awareness. Job aggregation. Dynamic adaptation in data transfers.
  13. 13. Problem: A data transfer job: (earliest start time, latest completion time, volume, source, destination). Constraints: server capacity (data transfer node); network capacity (network link). Single job: Advance Network Reservation. Multiple jobs: Scheduling with Time and Resource Constraints (literature); Scheduling with Advance Reservation.
  14. 14. Outline: Introduction; Methodology; Advance Network Reservation; Scheduling with Time and Resource Constraints; Scheduling with Advance Reservation; Executing Data Transfer Operations; Conclusion
  15. 15. Network Reservation: Bandwidth allocation between edge routers. Current systems provide yes/no answers to a reservation request for (bandwidth, start_time, end_time). Clients are not given other possible options. This does not provide an optimal choice for the client, may cause ineffective use of the overall system, and overloads the system with trial-and-error attempts. How can we enhance the reservation system? Submit constraints, and the system suggests possible reservations satisfying the requirements.
  16. 16. End-to-end data movement: End-to-end high performance data movement: network bandwidth reservation; bandwidth provisioning in client sites; storage allocation. Therefore, we need coordination between Storage Resource Managers and Network Resource Allocation. But the requested bandwidth cannot be guaranteed: trial and error until an available reservation is obtained.
  17. 17. Reservation Engine: Improve advance network reservation systems by presenting to the clients the possible reservation options and alternatives for earliest completion time and shortest transfer duration. A new service: Users provide the maximum bandwidth they can use, the total size of the data to be transferred, the earliest start time, and the latest completion time. Users can set criteria such that they would like to reserve a path for earliest completion time or reserve a path for shortest transfer duration. The reservation engine finds the reservation for the earliest completion or for the shortest duration.
  18. 18. Bandwidth Allocation (time-dependent): Bottleneck constraint (max bandwidth). QoS constraint is additive (shortest path, etc.).
  19. 19. Time-dependent network flow (difference): [Figure: timeline t1 to t6.] Not suitable for bandwidth-guaranteed paths!
  20. 20. Approach: In our approach, the search interval is divided into time windows. A time window represents a period of time in which the available bandwidth of all related links is stable, that is, a snapshot of the network topology in this time window. • Search through these time windows to check whether we can satisfy the requested allocation for that time window. • First, check the duration of the time window: can we satisfy the user request in that time window? (We know the max bandwidth the user can support.) • Then, calculate the max bandwidth available in the time window.
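The last step on this slide, calculating the maximum bandwidth available inside one time window, can be sketched as a widest-path (maximum bottleneck) search over that window's snapshot. This is only an illustrative sketch, not the dissertation's implementation: the function name `max_bottleneck_bandwidth`, the dict-of-dicts graph layout, and the link capacities in `snapshot` are all assumptions.

```python
import heapq
import math


def max_bottleneck_bandwidth(graph, src, dst):
    """Return the largest bandwidth a single path from src to dst can carry.

    graph maps node -> {neighbor: available bandwidth in Mbps} for ONE
    time window (a snapshot in which link bandwidth does not change).
    """
    best = {src: math.inf}          # best bottleneck found so far per node
    heap = [(-math.inf, src)]       # max-heap emulated with negated values
    while heap:
        neg_bw, node = heapq.heappop(heap)
        bw = -neg_bw
        if node == dst:
            return bw
        if bw < best.get(node, 0):
            continue                # stale heap entry
        for nxt, capacity in graph.get(node, {}).items():
            candidate = min(bw, capacity)   # bottleneck along this path
            if candidate > best.get(nxt, 0):
                best[nxt] = candidate
                heapq.heappush(heap, (-candidate, nxt))
    return 0                        # dst is unreachable


# Hypothetical snapshot; these capacities are NOT taken from the slides.
snapshot = {"A": {"B": 100, "C": 400}, "B": {"D": 500}, "C": {"D": 300}}
print(max_bottleneck_bandwidth(snapshot, "A", "D"))
```

Because bandwidth is a bottleneck (min) constraint rather than an additive one, the search maximizes the minimum link capacity along the path instead of minimizing a sum.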
  21. 21. Time Windows: Reservation 1: (time t1, t6) A -> B -> D (900Mbps). Reservation 2: (time t4, t7) A -> C -> D (400Mbps). Reservation 3: (time t9, t12) A -> B -> D (700Mbps). [Figure: network topology with nodes A, B, C, D and link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps; timeline t1 to t13 marking Reservations 1, 2, and 3.]
  22. 22. Time Steps and Time Windows: Time windows between t1 and t13. [Figure: timeline t1 to t13 marking Reservations 1, 2, and 3; the time steps t1, t4, t6, t7, t9, t12, t13 delimit time windows labelled Res 1, Res 1,2, Res 2, and Res 3.]
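The way the three reservations cut the search interval into time steps can be sketched as follows. This is a minimal sketch under stated assumptions: time points are plain integers (t1 becomes 1), intervals are half-open, and the helper names `time_steps` and `active_per_slot` are hypothetical.

```python
def time_steps(reservations, search_start, search_end):
    """Sorted time points at which the set of active reservations can change."""
    points = {search_start, search_end}
    for start, end in reservations.values():
        # Only reservations overlapping the search interval matter.
        if start < search_end and end > search_start:
            points.add(max(start, search_start))
            points.add(min(end, search_end))
    return sorted(points)


def active_per_slot(reservations, steps):
    """Pair each slot between consecutive time steps with its active reservations."""
    slots = []
    for lo, hi in zip(steps, steps[1:]):
        active = sorted(name for name, (start, end) in reservations.items()
                        if start < hi and end > lo)
        slots.append(((lo, hi), active))
    return slots


# Reservation times from the Time Windows slide, with t1..t13 written as 1..13.
reservations = {"Res 1": (1, 6), "Res 2": (4, 7), "Res 3": (9, 12)}
steps = time_steps(reservations, 1, 13)
for slot, active in active_per_slot(reservations, steps):
    print(slot, active)
```

With these inputs the time steps come out as t1, t4, t6, t7, t9, t12, t13, and the active sets per slot follow the Res 1, Res 1,2, Res 2, Res 3 labelling in the figure.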
  23. 23. [Figure: snapshots of the available link bandwidth among nodes A, B, C, D in the time windows t1-t4 (Res 1), t4-t6 (Res 1,2), t6-t7 (Res 2), and t7-t9.]
  24. 24. Time Steps and Time Windows
  25. 25. Time Windows: [Figure: timeline with time steps t1, t4, t6, t7, t9, t12, t13 and time windows labelled Res 1, Res 1,2, Res 2, Res 3.] Max bandwidth from A to D per time window:
      1. t1-t4: 900Mbps (3)
      2. t4-t6: 100Mbps (2)
      3. t1-t6: 100Mbps (5)
      4. t6-t7: 900Mbps (1)
      5. t4-t7: 100Mbps (3)
      6. t1-t7: 100Mbps (6)
      7. t7-t9: 900Mbps (2)
      8. t6-t9: 900Mbps (3)
      9. t4-t9: 100Mbps (5)
      10. t1-t9: 100Mbps (8)
      Reservation: (A to D) (100Mbps) start=t1 end=t9
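The ten windows on this slide can be reproduced by combining consecutive slots and taking the minimum bandwidth over the slots each window covers. The sketch below also picks the earliest-completion and shortest-duration options for a request. It rests on assumptions that are not in the slides: time points are integers in seconds, the data volume is given in megabits, and the names `enumerate_windows` and `candidate_reservations` as well as the request values are hypothetical.

```python
from itertools import combinations


def enumerate_windows(steps, slot_bw):
    """All contiguous windows as (start, end, bandwidth).

    slot_bw[i] is the max bandwidth in the slot between steps[i] and
    steps[i + 1]; a window can only sustain the minimum of its slots.
    """
    return [(steps[i], steps[j], min(slot_bw[i:j]))
            for i, j in combinations(range(len(steps)), 2)]


def candidate_reservations(windows, volume, max_bw):
    """Windows long enough to move `volume` at the usable rate."""
    options = []
    for start, end, bw in windows:
        rate = min(bw, max_bw)          # the user cannot exceed max_bw
        if rate <= 0:
            continue
        needed = volume / rate
        if needed <= end - start:
            options.append((start, start + needed, rate))
    return options


# Slot bandwidths read off the slide: t1-t4, t4-t6, t6-t7, t7-t9.
steps = [1, 4, 6, 7, 9]
slot_bw = [900, 100, 900, 900]
windows = enumerate_windows(steps, slot_bw)

# Hypothetical request: 1800 megabits, user can drive at most 900 Mbps.
options = candidate_reservations(windows, volume=1800, max_bw=900)
earliest = min(options, key=lambda o: o[1])          # earliest completion
shortest = min(options, key=lambda o: o[1] - o[0])   # shortest duration
print(len(windows), earliest, shortest)
```

Four slots yield 4 * 5 / 2 = 10 windows, the same count as the list on the slide.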
  26. 26. Outline: Introduction; Methodology; Advance Network Reservation; Scheduling with Time and Resource Constraints; Scheduling with Advance Reservation; Executing Data Transfer Operations; Conclusion
  27. 27. Time and Resource Conflicts: File transfer with start/end times is NP-hard! How to represent time dependency? Cannot benefit from known network algorithms (max flow, min cut, shortest path). NP-hard even for networks with a single link. Knapsack problem; unsplittable flow problem (see also network coding in routing); max edge-disjoint path problem. Online or offline? Are greedy approaches practical?
  28. 28. File Transfer Scheduling Demystified: A simple case: n nodes connected to each other. Each node can transfer a maximum of C(n) files at a time. There are m files to be transferred; a file needs to be sent from node i to node j. Files may have different sizes, which determine the amount of time required for the transfer. The objective is to minimize the total transfer time. This is a common type of assignment problem, and it is NP-hard!
  29. 29. Special cases: Network is a bipartite graph. Max concurrency is 1, one file at a time. File sizes are the same, so each file takes the same amount of time to transfer. Graph coloring; bipartite cardinality matching. What if each file has a specific cost (cost can be associated with the file size)? Hungarian algorithm? But if we are able to transfer more than a single file at a time using the same node, the problem is NP-hard. If sharing bandwidth, it becomes even harder.
  30. 30. Source > Network > Destination: [Figure: topology with nodes A, B, C, D, link capacities 800Mbps, 900Mbps, 500Mbps, 1000Mbps, 300Mbps, and nodes labelled n1 and n2.] Node capacity? Now we have multiple jobs and need to find a schedule.
  31. 31. Unsplittable Flow
  32. 32. With start/end times: Each transfer request has start and end times. n transfer requests are given (each request has a specific amount of profit). The objective is to maximize the profit. If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period. Unsplittable Flow Problem: given an undirected graph, route demand from source(s) to destination(s) and maximize/minimize the total profit/cost.
  33. 33. Why UFP? We represent time as a discrete variable (recall time slots and time windows in the Network Reservation Engine). Example: job1: (start time t1, end time t10); job2: (start time t5, end time t20). Time slots: 1: (t1,t5); 2: (t5,t10); 3: (t10,t20). Job 1 spans time slots 1 and 2. Job 2 spans time slots 2 and 3.
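The discretization in this example can be sketched in a few lines. The helper names `build_slots` and `slots_spanned` are hypothetical, and time points are written as integers (t1 becomes 1) with half-open slots; both are assumptions made for illustration.

```python
def build_slots(jobs):
    """Cut the timeline at every job start/end time."""
    points = sorted({t for start, end in jobs.values() for t in (start, end)})
    return list(zip(points, points[1:]))


def slots_spanned(job, slots):
    """1-based indices of the time slots a job's interval overlaps."""
    start, end = job
    return [index for index, (lo, hi) in enumerate(slots, start=1)
            if start < hi and end > lo]


jobs = {"job1": (1, 10), "job2": (5, 20)}
slots = build_slots(jobs)
for name, interval in jobs.items():
    print(name, slots_spanned(interval, slots))
```

Once each job is reduced to the set of slots it spans, the scheduling question no longer depends on real time, only on which jobs compete for capacity in the same slots.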
  34. 34.
  35. 35. Knapsack Problem? If there is only one link, the edge capacity is the same. Profit is also the same for each job. If there are start/end times, even for a single link, it is NP-hard! Note that UFP specializes to the max edge-disjoint path problem. Scheduling with conflicts is hard. Online scheduling is harder.
  36. 36. Dynamic networks / Job requirements: At each time slot we may have different edge/node capacity. A transfer request comes with the total amount of data to be sent (volume) and the desired time period (earliest start, latest finish). The objective is to find a sequence of time slots (a time window) in which this transfer can be sent while satisfying the given criteria. Each time slot has a specific capacity. Each time window consists of one or more time slots. Time window 1: time slot 1 to 5. Time window 3: time slot 3 to 10. …. What if there is node capacity? Assigning a request will affect the available capacity in two nodes and one edge.
  37. 37. Outline: Introduction; Methodology; Advance Network Reservation; Scheduling with Time and Resource Constraints; Scheduling with Advance Reservation; Executing Data Transfer Operations; Conclusion
  38. 38. Approach: Should make a decision quickly. Is it really a good idea to schedule many overlapping jobs at the same time so that they share the total bandwidth?
  39. 39. Definition: A network with n nodes, each connected to each other (mesh). Each connection (edge) has a specific maximum capacity. Each node has a maximum capacity, separate for incoming and outgoing transfers. This is implementation specific and does not change the algorithm complexity, O(n * s^2).
  40. 40. Definition: Time constraints (earliest start / latest completion): When will the data be ready? When is the deadline? Find an allocation (start/end times) for the job. Can it shift to another time slot or not? Locked or unlocked jobs. Online scheduling: displace other jobs to open space for the new request. Can we shift at most n jobs?
  41. 41. Methodology: Receive a job. Find all possible time windows for this job. If it fits into any of them, allocate. If not, try each time window starting from the earliest: if there is a job with less 'desire/preference' which can shift and still satisfy its criteria, allocate the time window. If none is found, extend the latest finish time by adding time slot(s) and search the new time windows for one that fits. If none is found, reject the job.
  42. 42. Methodology: Never accept a job if it causes other committed jobs to break their criteria. A job's reservation is locked if it has been delayed, is close to its deadline, or has failed and been restarted. What if a job cannot be finished by its deadline? Resubmit it with the highest priority.
  43. 43. Methodology: • For each job we calculate the possible time windows. • When a time window is reserved for a job, we keep track of: • the number of time slots in this time window (Ts_num); • the order of the time window, where sooner is better (Tw_order = tw_id / total time windows for this job). • Desire/Preference is defined by both Ts_num and Tw_order.
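The bookkeeping on this slide can be sketched as a small record per reserved job. The slide does not say how Ts_num and Tw_order are combined, so the lexicographic tuple used below is an assumption, as are the class name `Allocation`, the 1-based `tw_id`, the `locked` flag, and the sample values.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    job: str
    ts_num: int        # number of time slots in the reserved time window
    tw_id: int         # assumed 1-based position among the job's candidate windows
    tw_total: int      # total candidate time windows for this job
    locked: bool = False

    @property
    def tw_order(self) -> float:
        # Tw_order = tw_id / total time windows for this job
        return self.tw_id / self.tw_total

    def preference(self):
        # ASSUMPTION: compare Ts_num first, then Tw_order.
        return (self.ts_num, self.tw_order)


def first_to_displace(allocations):
    """Unlocked job with the lowest preference, or None if all are locked."""
    movable = [a for a in allocations if not a.locked]
    return min(movable, key=Allocation.preference, default=None)


# Hypothetical committed jobs.
committed = [
    Allocation("job A", ts_num=3, tw_id=1, tw_total=4),
    Allocation("job B", ts_num=1, tw_id=1, tw_total=2),
    Allocation("job C", ts_num=1, tw_id=2, tw_total=2, locked=True),
]
print(first_to_displace(committed).job)
```

Any other total order over the two values could be substituted for `preference` without changing the rest of the sketch.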
  44. 44. Methodology: Providing a framework for scheduling data transfers with advance allocation. Ts_num shows overlaps with other transfers: the job with higher Ts_num has higher priority (it already overlaps with more transfers, so don't shift it). Tw_order shows the time slots left to the deadline: the job with higher Tw_order has higher priority (it is closer to its deadline in terms of time slots, not real time). Any preference model works (even random ranking).
  45. 45. Recall Time Windows: [Figure: timeline with time steps t1, t4, t6, t7, t9, t12, t13 and time windows labelled Res 1, Res 1,2, Res 2, Res 3.] Max bandwidth from A to D per time window:
      1. t1-t4: 900Mbps (3)
      2. t4-t6: 100Mbps (2)
      3. t1-t6: 100Mbps (5)
      4. t6-t7: 900Mbps (1)
      5. t4-t7: 100Mbps (3)
      6. t1-t7: 100Mbps (6)
      7. t7-t9: 900Mbps (2)
      8. t6-t9: 900Mbps (3)
      9. t4-t9: 100Mbps (5)
      10. t1-t9: 100Mbps (8)
      Reservation: (A to D) (100Mbps) start=t1 end=t9
  46. 46.
  47. 47.
  48. 48.
  49. 49. Evaluation: Not studied before (a special case of UFP); UFP itself is a recent problem. Planning ahead gives an opportunity for co-allocation, with the help of the given search interval (earliest start / latest completion), as in the flight reservation example. The solution uses a unique approach in preference: time slots and time windows (novel approach). Gives a polynomial approximation algorithm. The preference converts the UFP problem into a Dijkstra path search. Uses failure-awareness and early error detection.
  50. 50. Evaluation: Encourages users to submit reasonable time constraints. If no fit is found in the first round, do not try to displace any other job. Fair (never dismiss a previously admitted job). Linear search (displace a job only once in a search round). Utilizes time windows/time steps for ranking (better than earliest deadline first). Earliest completion + shortest duration. Minimize concurrency. Even random ranking would work (relaxation in an NP-hard problem).
  51. 51. Evaluation: Network reservation can list/search all possible time windows in polynomial time. Searching time windows is FAST! For r reservations, the number of time steps is s = 2r + 1 and the number of time windows is w = s(s+1)/2.
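  52. 52. Time Window List (special data structures): Initially the time windows list spans [now, infinite]. New reservation: reservation 1, start t1, end t10; the list becomes now, t1, t10, infinite, with Res 1 between t1 and t10. New reservation: reservation 2, start t12, end t20; the list becomes now, t1, t10, t12, t20, infinite, with Res 1 between t1 and t10 and Res 2 between t12 and t20.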
  53. 53. Testing the NRE library: Each point is the average of 100 measurements. Set 1: sparse graph. Set 2: dense graph. Random graph:
  54. 54. Tests: No hop count limit in these tests. In real life, the hop count is limited.
  55. 55. Test: In real life, the number of nodes and the number of reservations in a given search interval are limited.
  56. 56. Outline: Introduction; Methodology; Advance Network Reservation; Scheduling with Time and Resource Constraints; Scheduling with Advance Reservation; Executing Data Transfer Operations; Conclusion
  57. 57. Data Scheduling
  58. 58. Failure-Awareness and Error Detection: Dynamic environment: data transfers are prone to frequent failures. What went wrong during the data transfer? No access to the remote resources. Messages get lost due to system malfunction. Instead of waiting for a failure to happen: detect possible failures and malfunctioning services; search for another data server; use an alternate data transfer service. Use network exploration techniques: check availability of the remote service; resolve the host and determine connectivity failures; detect available data transfer services. Error while a transfer is in progress? Retry or not? When to re-initiate the transfer? Use alternate protocols?
  59. 59. Error Classification: • Recover from failure • Retry the failed operation • Postpone scheduling of failed operations • Data transfer protocols do not always return appropriate error codes • Use the error messages generated by the data transfer protocol • A better logging facility and classification • Early error detection • Initiate the transfer when the erroneous condition has recovered • Or use alternate options
  60. 60. Failure-aware scheduling: SCOOP data: Hurricane Gustav simulations. Hundreds of files (250 data transfer operations). Small (100MB) and large files (1G, 2G).
  61. 61. Job Aggregation: Multiple data movement jobs are combined and processed as a single transfer job. Information about the aggregated job is stored in the job queue, and it is tied to a main job which actually performs the transfer operation, such that it can be queried and reported separately. Hence, aggregation is transparent to the user. We have seen vast performance improvement, especially with small data files: decreasing the amount of protocol usage; reducing the number of independent network connections.
  62. 62. Job Aggregation: Experiments on LONI (Louisiana Optical Network Initiative): 1024 transfer jobs from Ducky to Queenbee (rtt avg 5.129 ms), 5MB data file per job.
  63. 63. Dynamic Tuning in Data Transfer Operations: End-to-end bulk data transfer (latency wall). Transfer data by chunks (partial transfers) and also set control parameters on the fly. Gradually increase the number of parallel streams until it reaches an equilibrium point. No need to probe the system and make measurements with external profilers. Does not require any complex model for parameter optimization. Adapts to a changing environment. But there is overhead in changing the parallelism level. Fast start (exponentially increase the number of parallel streams).
  64. 64. Adaptive Tuning: Start with a single stream (n=1). Measure instant throughput for every data chunk transferred. Fast start: increase the number of parallel streams (n=n*2), transfer the data chunk, and measure instant throughput. If the current throughput value is better than the previous one, continue. Otherwise, set n to the old value and gradually increase the parallelism level (n=n+1). If there is no throughput gain from increasing the number of streams (the equilibrium point has been found), increase the chunk size (delay the measurement period).
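The tuning loop on this slide can be sketched as a doubling phase followed by a one-by-one phase. This is a sketch under assumptions, not the Stork implementation: `measure` is a hypothetical callback that transfers one chunk with n streams and returns the observed throughput, `max_streams` is an assumed safety cap, the throughput curve in `simulated_throughput` is invented, and the final chunk-size increase is left out.

```python
def tune_parallelism(measure, max_streams=64):
    """Return (streams, throughput) at the equilibrium point."""
    n = 1
    best = measure(n)                 # start with a single stream

    # Fast start: double the stream count while throughput improves.
    while n * 2 <= max_streams:
        throughput = measure(n * 2)
        if throughput <= best:
            break                     # keep the old value of n
        n, best = n * 2, throughput

    # Gradual phase: add one stream at a time from the old value.
    while n + 1 <= max_streams:
        throughput = measure(n + 1)
        if throughput <= best:
            break                     # no gain: equilibrium point found
        n, best = n + 1, throughput

    return n, best


def simulated_throughput(n):
    """Invented curve (Mbps): gains up to 11 streams, then overhead dominates."""
    return 100 * n if n <= 11 else 1100 - 100 * (n - 11)


print(tune_parallelism(simulated_throughput))
```

In a real transfer each call to `measure` moves an actual data chunk, so the search costs no extra probing traffic, which is the point made on the previous slide.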
  65. 65. Outline
  66. 66. New Transfer Modules: • Verify the successful completion of the operation by checking the checksum and file size. • The transfer module can recover from a failed operation by restarting from the last transmitted file. In case of a retry after a failure, the scheduler informs the transfer module to recover and restart the transfer using the information from a rescue file created by the checkpoint-enabled transfer module. • An "intelligent" (dynamic tuning) alternative to Globus RFT (Reliable File Transfer).
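The first bullet, verifying a completed transfer by checksum and file size, can be sketched as below. The slides do not name a checksum algorithm, so SHA-256 is an assumption, as are the function names. Both paths are treated as local files for simplicity; in a real transfer each end would compute its own checksum and only the digests would be compared.

```python
import hashlib
import os


def file_checksum(path, algorithm="sha256", block_size=1 << 20):
    """Hex digest of a file, read in blocks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def transfer_verified(source, destination):
    """True when size and checksum of the two files both match."""
    # Cheap check first: a size mismatch already proves failure.
    if os.path.getsize(source) != os.path.getsize(destination):
        return False
    return file_checksum(source) == file_checksum(destination)
```

Checking the size before hashing lets a truncated transfer be rejected without reading either file in full.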
  67. 67. Outline: Introduction; Methodology; Advance Network Reservation; Scheduling with Time and Resource Constraints; Scheduling with Advance Reservation; Executing Data Transfer Operations; Conclusion
  68. 68. Conclusion: We developed a new data transfer scheduling paradigm in which data movement operations are scheduled in advance. Analyze scheduling with time and resource constraints. Show a scheduling model with advance reservation. Our methodology provides a basis for provisioning end-to-end high performance data transfers in which users submit their jobs with time and resource constraints to make an advance schedule. Our future work includes implementation and integration of the online scheduling algorithm with advance reservation.
  69. 69. Contributions: Presented a novel approach for path finding in time-dependent networks by taking advantage of the user-provided parameters of total volume and time constraints. Presented a new algorithm to find a reservation path with options for earliest completion time and shortest transfer duration. Proposed an approximation algorithm using time windows and time steps for data transfer scheduling with advance reservations. Coordination of system (data node) and network (link) resources. Other contributions include reliability, adaptability, and performance optimization of data placement tasks.
  70. 70. Selected Publications: Error Detection and Error Classification: Failure Awareness in Data Transfer Scheduling, IJAC 2010; Semantic Enabled Metadata Management in PetaShare, IJGUC 2009; A New Paradigm: Data-Aware Scheduling in Grid Computing, FGCS, Elsevier 2009; Data-Aware Distributed Computing with Stork Data Scheduler, SEE-Grid, 2010; Dynamic Adaptation of Parallelism Level in Data Transfer Scheduling, ASHEs-CISIS 2009; Early Error Detection and Classification in Data Transfer Scheduling, 3PGIC-CISIS 2009; Choosing Between Remote I/O versus Staging in Large Scale Distributed Applications, PDCCS 2008; Dynamically Tuning Level of Parallelism in Wide Area Data Transfers, DADC 2008; Data Scheduling for Large Scale Distributed Applications, ICEIS 2008; Intermediate Gateway Service to Aggregate and Cache I/O operations into Data Repositories, USENIX FAST 2009; From Micro- to Macro-processing: A Generic Data Management Model, Grid 2007; An Efficient Reservation Algorithm for Advanced Network Provisioning, LBNL-TR 2010; A New Approach in Advance Network Reservation and Provisioning for High Performance Scientific Data Transfers, LBNL-TR 2010; Advance Network Reservation and Provisioning for Science, LBNL-TR 2009
  71. 71. Acknowledgements
