Presentation southernstork 2009-nov-southernworkshop


  1. Data Placement Scheduling between Distributed Repositories: Stork 1.0 and Beyond
      Mehmet Balman, Louisiana State University, Baton Rouge, LA
  2. Motivation
      • Scientific applications are becoming more data intensive (dealing with petabytes of data).
      • We use geographically distributed resources to satisfy immense computational requirements.
      • The distributed nature of the resources makes data movement a major bottleneck for end-to-end application performance.
      • Therefore, complex middleware is required to orchestrate the use of these storage and network resources between collaborating parties, and to manage the end-to-end distribution of data.
  3. Agenda
      • Data Movement using Stork
      • Data Scheduling
      • Tuning Data Transfer Operations
      • Failure-Awareness
      • Job Aggregation
      • Future Directions
  4. Moving Large Data Sets
      • Advanced data transfer protocols (e.g., GridFTP)
          – High-throughput data transfer
      • Data scheduler: Stork
          – Organizing data movement activities
          – Ordering data transfer requests
  5. Use Case
      • A scientific application generates an immense amount of simulation data using supercomputing resources.
      • The generated data is stored in a temporary space and needs to be moved to a data repository for further processing or archiving.
      • Another application may be waiting for this generated data as its input before it can start execution.
      • Delaying the data transfer operation, or completing it long after the expected time, may create several problems, because other resources are waiting for the transfer to complete.
  6. Scheduling Data Movement Jobs
      • Stork: a batch scheduler for data placement activities
      • Supports plug-in data transfer modules for specific protocols and services
      • Throttling: deciding the number of concurrent transfers
      • Keeps a log of data placement activities
      • Adds fault tolerance to data transfers
      • Tunes protocol transfer parameters (number of parallel TCP streams)
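The throttling item above amounts to capping how many transfers run at the same time. A minimal Python sketch of that idea follows, assuming a hypothetical `transfer` callable that stands in for a protocol plug-in. None of these names come from Stork itself.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_throttled(jobs, transfer, max_concurrent):
    """Run transfer(job) for every job, at most max_concurrent at a time.

    Results are returned in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(transfer, jobs))


class ConcurrencyProbe:
    """Hypothetical stand-in for a transfer module; records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, job):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # pretend to move data
        with self._lock:
            self._active -= 1
        return f"done:{job}"
```

Calling `run_throttled(range(8), ConcurrencyProbe(), max_concurrent=3)` processes all eight jobs while never running more than three at once. A real scheduler would choose `max_concurrent` from observed load rather than a fixed value.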
  7. Stork Job Submission
      [
        dest_url = "gsiftp://eric1.loni.org/scratch/user/";
        arguments = "-p 4 dbg -vb";
        src_url = "file:///home/user/test/";
        dap_type = "transfer";
        verify_checksum = true;
        verify_filesize = true;
        set_permission = "755";
        recursive_copy = true;
        network_check = true;
        checkpoint_transfer = true;
        output = "user.out";
        err = "user.err";
        log = "userjob.log";
      ]
  8. Fast Data Transfer
      • End-to-end bulk data transfer (latency wall)
          – TCP-based solutions: Fast TCP, Scalable TCP, etc.
          – UDP-based solutions: RBUDP, UDT, etc.
      • Most of these solutions require kernel-level changes
      • Not preferred by most domain scientists
  9. Application Level Tuning
      • Take an application-level transfer protocol (e.g., GridFTP) and tune it for better performance:
          – Use multiple (parallel) streams
          – Tune the buffer size (efficient utilization of available network capacity)
      • Level of parallelism in end-to-end data transfer:
          – the number of parallel data streams connected to a data transfer service, to increase utilization of network bandwidth
          – the number of concurrent data transfer operations initiated at the same time, to better utilize system resources
  10. Parallel TCP Streams
      • Instead of a single connection at a time, multiple TCP streams are opened to a single data transfer service on the destination host.
      • We gain more bandwidth with TCP, especially in a network with a low packet loss rate; parallel connections better utilize the TCP buffer available to the data transfer, such that N connections might be N times faster than a single connection.
      • Multiple TCP streams result in extra overhead in the system.
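The "N times faster" remark can be made concrete with the standard window-over-RTT bound: a single TCP stream cannot carry more than one window of data per round trip. The sketch below is a simplified model that ignores packet loss and congestion control. The function names and all numbers are illustrative assumptions, not measurements from the talk.

```python
def stream_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound for one TCP stream: one window per round trip."""
    return window_bytes * 8 / rtt_seconds


def aggregate_throughput_bps(n_streams, window_bytes, rtt_seconds, link_bps):
    """N streams scale linearly until the link itself becomes the limit."""
    per_stream = stream_throughput_bps(window_bytes, rtt_seconds)
    return min(n_streams * per_stream, link_bps)
```

With an assumed 64 KiB window and a 10 ms round-trip time, one stream tops out near 52 Mbps, so roughly 20 streams would be needed before a 1 Gbps link saturates. Beyond that point, extra streams add overhead without adding throughput.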
  11. Average Throughput Using Parallel Streams over 1 Gbps
      [Figure] Experiments in the LONI (www.loni.org) environment: file transfer to QB from a Linux machine.
  12. Adaptive Tuning
      • Instead of predictive sampling, use data from the actual transfer
      • Transfer data in chunks (partial transfers) and set control parameters on the fly
      • Measure throughput for every transferred data chunk
      • Gradually increase the number of parallel streams until it reaches an equilibrium point
  13. Adaptive Tuning
      • No need to probe the system and make measurements with external profilers
      • Does not require a complex model for parameter optimization
      • Adapts to a changing environment
      • But changing the parallelism level has overhead
      • Fast start (exponentially increase the number of parallel streams)
  14. Adaptive Tuning
      • Start with a single stream (n=1)
      • Measure instant throughput for every data chunk transferred (fast start):
          – Increase the number of parallel streams (n=n*2)
          – Transfer the data chunk
          – Measure instant throughput
      • If the current throughput is better than the previous one, continue
      • Otherwise, set n to the old value and increase the parallelism level gradually (n=n+1)
      • If increasing the number of streams yields no throughput gain, the equilibrium point has been found:
          – Increase the chunk size (delay the measurement period)
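The steps on this slide can be sketched as a small loop. This is one reading of the slide, not Stork's actual implementation. It assumes a hypothetical `send_chunk(size, n_streams)` callable that transfers one chunk and returns the measured throughput, and `max_streams` is an added safety cap that the slide does not mention.

```python
def adaptive_transfer(total_bytes, send_chunk, chunk_bytes, max_streams=64):
    """Transfer total_bytes in chunks, tuning the stream count on the fly.

    Returns a list of (n_streams, chunk_size, throughput) per chunk.
    """
    sent = 0
    n = 1                          # start with a single stream
    best_n, best_tput = 1, 0.0     # last setting that improved throughput
    phase = "fast"                 # "fast" -> "linear" -> "steady"
    history = []
    while sent < total_bytes:
        size = min(chunk_bytes, total_bytes - sent)
        tput = send_chunk(size, n)
        sent += size
        history.append((n, size, tput))
        if phase == "steady":
            continue
        if tput > best_tput:
            # Improvement: keep it, then double (fast start) or add one stream.
            best_n, best_tput = n, tput
            n = n * 2 if phase == "fast" else n + 1
        elif phase == "fast":
            # Doubling overshot: go back and probe one stream at a time.
            phase, n = "linear", best_n + 1
        else:
            # One more stream gave no gain: equilibrium point found.
            phase, n = "steady", best_n
            chunk_bytes *= 2       # measure less often from now on
        if n > max_streams:
            if phase == "fast" and best_n < max_streams:
                phase, n = "linear", best_n + 1
            else:
                phase, n = "steady", best_n
    return history
```

For a simulated link whose throughput stops improving at four streams, the stream counts tried are 1, 2, 4, 8, 5 and then 4 for the rest of the transfer. Because the search only moves upward from the last improving value, it settles on the first setting where adding streams stops helping.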
  15. Dynamic Tuning Algorithm
  16. Dynamic Tuning Algorithm
  17. Dynamic Tuning Algorithm
  18. Failure Awareness
      • Dynamic environment:
          – Data transfers are prone to frequent failures
          – What went wrong during the data transfer?
          – No access to the remote resources
          – Messages get lost due to system malfunction
      • Instead of waiting for a failure to happen:
          – Detect possible failures and malfunctioning services
          – Search for another data server
          – Use an alternate data transfer service
      • Classify erroneous cases to make better decisions
  19. Error Detection
      • Use network exploration techniques
          – Check availability of the remote service
          – Resolve the host and determine connectivity failures
          – Detect available data transfer services
          – Should be fast and efficient so as not to burden system and network resources
      • Error while a transfer is in progress?
          – Error_TRANSFER
      • Retry or not?
      • When to re-initiate the transfer?
      • Use alternate options?
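The pre-transfer checks listed above (resolve the host, then test whether the service port accepts connections) can be sketched with the standard `socket` module. The function name and the three result labels are illustrative assumptions, not Stork's own error codes.

```python
import socket


def probe_service(host, port, timeout=3.0):
    """Cheap availability check run before a transfer is scheduled.

    Returns "RESOLVE_FAILED", "CONNECT_FAILED", or "SERVICE_AVAILABLE".
    """
    try:
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "RESOLVE_FAILED"        # host name could not be resolved
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)   # keep the probe fast
                sock.connect(sockaddr)
                return "SERVICE_AVAILABLE"
        except OSError:
            continue                   # refused, unreachable, or timed out
    return "CONNECT_FAILED"
```

A scheduler could call this with the destination host and the service port (GridFTP servers conventionally listen on port 2811) and skip or postpone jobs whose target does not answer. The short timeout keeps the probe from burdening the network.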
  20. Error Classification
      • Data transfer protocols do not always return appropriate error codes
      • Use the error messages generated by the data transfer protocol
      • A better logging facility and classification
      • Recover from failure
          – Retry the failed operation
          – Postpone scheduling of a failed operation
      • Early error detection
          – Initiate the transfer when the erroneous condition has recovered
          – Or use alternate options
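One way to act on protocol error messages is a small rule table that maps message fragments to a scheduling decision. The sketch below is purely illustrative: the fragments, the `Action` names, and the retry limit are assumptions, and real GridFTP messages would need their own rules.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"                  # transient problem, try again soon
    POSTPONE = "postpone"            # wait for the condition to recover
    USE_ALTERNATE = "use_alternate"  # try another server or protocol
    GIVE_UP = "give_up"              # retrying cannot help


# Hypothetical message fragments; real protocol messages will differ.
RULES = [
    ("timed out", Action.RETRY),
    ("connection reset", Action.RETRY),
    ("connection refused", Action.USE_ALTERNATE),
    ("no route to host", Action.POSTPONE),
    ("permission denied", Action.GIVE_UP),
    ("no such file", Action.GIVE_UP),
]


def decide(message, attempts, max_retries=3):
    """Map a protocol error message to a scheduling action."""
    text = message.lower()
    for fragment, action in RULES:
        if fragment in text:
            if action is Action.RETRY and attempts >= max_retries:
                return Action.POSTPONE  # stop hammering a struggling service
            return action
    # Unclassified errors: retry a few times, then postpone.
    return Action.RETRY if attempts < max_retries else Action.POSTPONE
```

Keeping the rules in a table makes it easy to log which rule fired, which supports the better logging and classification the slide calls for.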
  21. Error Reporting
  22. Failure Aware Scheduling
      • Scoop data: Hurricane Gustav simulations
      • Hundreds of files (250 data transfer operations)
      • Small (100 MB) and large (1 GB, 2 GB) files
  23. New Transfer Modules
      • Verify the successful completion of an operation by checking the checksum and file size.
      • For GridFTP, the Stork transfer module can recover from a failed operation by restarting from the last transmitted file. On a retry after a failure, the scheduler tells the transfer module to recover and restart the transfer using information from a rescue file created by the checkpoint-enabled transfer module.
      • An "intelligent" (dynamic tuning) alternative to Globus RFT (Reliable File Transfer)
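The two ideas on this slide, verification by size and checksum and restart from a rescue file, can be sketched as follows. This is a sketch under stated assumptions: both files are assumed to be locally readable, SHA-256 is an arbitrary choice of checksum, and the one-path-per-line rescue file format is hypothetical rather than the format Stork writes.

```python
import hashlib
import os


def sha256_of(path, block_size=1 << 20):
    """Checksum a file in blocks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_copy(src_path, dst_path):
    """Cheap size check first, then the more expensive checksum."""
    if os.path.getsize(src_path) != os.path.getsize(dst_path):
        return False
    return sha256_of(src_path) == sha256_of(dst_path)


def pending_files(files, rescue_path):
    """Files still to send, given a rescue file listing completed ones."""
    if not os.path.exists(rescue_path):
        return list(files)
    with open(rescue_path) as fh:
        done = {line.strip() for line in fh if line.strip()}
    return [f for f in files if f not in done]
```

On a retry, a transfer module would call `pending_files` to skip what already arrived, and run `verify_copy` (or a remote equivalent) on each file as it completes.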
  24. Job Aggregation
      • Multiple data movement jobs are combined and processed as a single transfer job.
      • Information about each aggregated job is stored in the job queue and tied to a main job that actually performs the transfer, so that each job can still be queried and reported separately.
      • Hence, aggregation is transparent to the user.
      • We have seen a large performance improvement, especially with small data files:
          – decreasing the amount of protocol usage
          – reducing the number of independent network connections
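A simple way to picture aggregation is to group queued jobs that share the same source and destination endpoints, so that one connection can serve the whole group while each job keeps its own identifier. The sketch below assumes jobs are `(job_id, src_url, dest_url)` tuples. That layout and the grouping key are illustrative choices, not Stork's internal representation.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def aggregate_jobs(jobs):
    """Group jobs by (source endpoint, destination endpoint).

    Each group can run as one transfer; the member job ids are kept
    so every job can still be queried and reported separately.
    """
    groups = defaultdict(list)
    for job_id, src_url, dest_url in jobs:
        src, dest = urlsplit(src_url), urlsplit(dest_url)
        key = (src.scheme, src.netloc, dest.scheme, dest.netloc)
        groups[key].append((job_id, src_url, dest_url))
    return dict(groups)
```

With many small files between the same pair of hosts, the per-job cost of connection setup and protocol handshakes is paid once per group instead of once per file.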
  25. Job Aggregation
      [Figure] Experiments on LONI (Louisiana Optical Network Initiative): 1024 transfer jobs from Ducky to Queenbee (average RTT 5.129 ms), one 5 MB data file per job.
  26. Future Directions
      • We need priority-based data transfer scheduling with advance reservation and provisioning, so that researchers can use data placement as a service, plan ahead, and reserve a time period for their data movement operations.
      • Advance storage and network allocation need to be orchestrated together for data movement (very little progress on this in the literature).
  27. Network Reservation
      • Next-generation research networks such as ESnet and Internet2 provide high-speed, on-demand data access between collaborating institutions by delivering network as a service.
      • On-Demand Secure Circuits and Advance Reservation System (OSCARS)
          – Guaranteed bandwidth (at a certain time, for a certain bandwidth and length of time)
  29. Research Concept
      • Accept time constraints
      • Allow users to plan ahead
      • Orchestrate resource allocation
      • Provide advance resource reservation
      • Reserve the scheduler's time for future data movement operations
  30. Methodology
      • Two separate queues
      • Planning phase: resource reservation and time allocation
          – Preemption?
          – Confirm submission of a request?
      • Execution phase: re-organization, tuning, and ordering
          – Failure-awareness
          – Job aggregation
          – Dynamic adaptation in data transfers
          – Priority-based scheduling (earliest deadline?)
  31. Methodology
      • Phase 1: The scheduler checks the availability of resources in a given time period and determines whether the requested operation can be satisfied within the given time constraints.
          – The server and network capacity are allocated in advance for the future time period.
      • Phase 2: The scheduler considers other requests reserved for future time windows and re-orders operations in the current time period.
          – Aggregation
          – Pre-processing
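The two phases can be illustrated with a toy planner. Phase 1 is modeled as an admission check against per-slot bandwidth, and Phase 2 as ordering by earliest deadline, which is only one candidate policy (the previous slide leaves it as an open question). The class, the slot model, and the request fields are all assumptions made for the sketch.

```python
class ReservationPlanner:
    """Toy Phase 1: admit a request only if every slot has enough capacity."""

    def __init__(self, capacity_mbps, n_slots):
        self.free = [capacity_mbps] * n_slots   # unreserved bandwidth per slot

    def reserve(self, start, end, mbps):
        """Reserve mbps in slots [start, end); return True if admitted."""
        if not (0 <= start < end <= len(self.free)):
            return False
        if any(self.free[t] < mbps for t in range(start, end)):
            return False                        # cannot meet the constraint
        for t in range(start, end):
            self.free[t] -= mbps
        return True


def order_by_deadline(requests):
    """Toy Phase 2: run the request with the earliest deadline first."""
    return sorted(requests, key=lambda r: r["deadline"])
```

Rejecting a request at planning time, rather than letting it miss its deadline at execution time, is what lets users plan ahead around a confirmed reservation.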
  32. Questions?
      Mehmet Balman, balman@cct.lsu.edu
      www.petashare.org, www.cybertools.loni.org, www.storkproject.org, www.cct.lsu.edu
  33. Thank you
  34. Data Movement between Distributed Repositories for Large Scale Collaborative Science
      Mehmet Balman, Louisiana State University, Baton Rouge, LA
