
Berkeley Lab - Computing Sciences Seminar - Reminder

TOMORROW, June 24, 2:00pm - 3:00pm, Bldg. 50F, Room 1647


Berkeley Lab - Computing Sciences Seminar

Date:

Wednesday, June 24, 2009

Time:

2:00pm - 3:00pm

Location:

Bldg. 50F, Room 1647

Speaker:

Mehmet Balman
Department of Computer Science
Louisiana State University

Title:

Data Migration between Distributed Repositories for Collaborative
Research

Abstract:

Scientific applications, especially in areas such as physics,
biology, and astronomy, have become more complex and compute
intensive. Often, such applications require geographically
distributed resources to satisfy their immense computational
requirements. Consequently, these applications also have growing
distributed data-intensive requirements, dealing with petabytes of
data. The distributed nature of the resources has made data movement
the major bottleneck for end-to-end application performance. Our
approach is to treat the network as a dynamic layer to which data
placement middleware must adapt as conditions in the environment
change. Furthermore, heterogeneous resources and differing data
access and security protocols are among the challenges the data
placement middleware needs to deal with. Complex middleware is
required to orchestrate the use of these storage and network
resources between collaborating parties, and to manage the
end-to-end distribution of data.

We present a data placement scheduler for mitigating the data
bottleneck in collaborative petascale applications. In this talk,
we will give details on recent research in data scheduling, some use
cases for transferring very large data sets into distributed
repositories, and experiments on effective data movement over 1Gbps
and 10Gbps networks. We will also describe advanced features,
including aggregation of data placement jobs with small data files,
dynamic tuning of data transfer operations to minimize the effect of
network latency, error detection and classification, and restarting
of transfer operations after interruptions.

Host of Seminar:

Arie Shoshani

------------------------------------------------------------------------

For additional information, such as site access or directions to the
conference room, please contact CSSeminars-Help@hpcrd.lbl.gov.

Web Contact: CSSeminars-Help@hpcrd.lbl.gov


_______________________________________________
CSSeminars mailing list
CSSeminars@hpcrdm.lbl.gov
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/csseminars


  1. Data Movement between Distributed Repositories for Large Scale Collaborative Science. Mehmet Balman, Louisiana State University, Baton Rouge, LA
  2. Motivation: Scientific applications are becoming more data intensive (dealing with petabytes of data). We use geographically distributed resources to satisfy immense computational requirements, and the distributed nature of these resources has made data movement a major bottleneck for end-to-end application performance. Therefore, complex middleware is required to orchestrate the use of these storage and network resources between collaborating parties, and to manage the end-to-end distribution of data.
  3. Agenda: PetaShare environment (as an example) and distributed data management in Louisiana; data movement using Stork and data scheduling; tuning data transfer operations (prediction service, adaptive tuning); failure-awareness; job aggregation; future directions.
  4. PetaShare: a distributed storage system in Louisiana. It spans seven research institutions, with 300TB of disk storage and 400TB of tape (to be online soon), using IRODS (Integrated Rule-Oriented Data System), www.irods.org.
  5. PetaShare as an example: a global namespace among distributed resources. Client tools and interfaces: Pcommands, Petashell (Parrot), Petafs (FUSE), a Windows browser, and a web portal. The general scenario is to use an intermediate storage area (limited capacity) and then transfer files to remote storage for post-processing and long-term archival.
  6. PetaShare Architecture: how can we achieve fast and efficient data migration in PetaShare? LONI (Louisiana Optical Network Initiative), www.loni.org.
  7. PetaShare Client Tools: lightweight client tools for transparent access, namely Petashell (based on Parrot) and Petafs (a FUSE client). To improve throughput, we implemented an advance buffer cache in the Petafs and Petashell clients, aggregating I/O requests to minimize the number of network messages. Is it efficient for bulk data transfer? (A rough sketch of the buffering idea follows below.)
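
As a rough illustration of the buffering idea only (not the actual Petafs/Petashell code), a client-side cache can coalesce many small writes into one larger network message before it is sent; the class name and flush threshold below are hypothetical:

      # Hypothetical sketch: coalesce small writes into fewer, larger network messages.
      class WriteCoalescingBuffer:
          def __init__(self, send, threshold=4 * 1024 * 1024):
              self.send = send            # callable that ships one buffer over the network
              self.threshold = threshold  # flush once this many bytes have accumulated
              self.chunks, self.size = [], 0

          def write(self, data: bytes):
              self.chunks.append(data)
              self.size += len(data)
              if self.size >= self.threshold:
                  self.flush()

          def flush(self):
              if self.chunks:
                  self.send(b"".join(self.chunks))  # one message instead of many small ones
                  self.chunks, self.size = [], 0
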
  8. Client performance with the advance buffer.
  9. Agenda (outline slide repeated).
  10. Moving Large Data Sets: advanced data transfer protocols (e.g., GridFTP) for high-throughput data transfer, and a data scheduler, Stork, for organizing data movement activities and ordering data transfer requests.
  11. Scheduling Data Movement Jobs: Stork is a batch scheduler for data placement activities. It supports plug-in data transfer modules for specific protocols/services, throttling (deciding the number of concurrent transfers), keeping a log of data placement activities, adding fault tolerance to data transfers, and tuning protocol transfer parameters (such as the number of parallel TCP streams). An example submission file appears on the next slide.
  12. Stork job submission:

      [
        dest_url = "gsiftp://eric1.loni.org/scratch/user/";
        arguments = "-p 4 -dbg -vb";
        src_url = "file:///home/user/test/";
        dap_type = "transfer";
        verify_checksum = true;
        verify_filesize = true;
        set_permission = "755";
        recursive_copy = true;
        network_check = true;
        checkpoint_transfer = true;
        output = "user.out";
        err = "user.err";
        log = "userjob.log";
      ]
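
If the job description above is saved to a file (say transfer.stork, a file name chosen here only for illustration), it would normally be handed to the scheduler with Stork's command-line submission client, along the lines of:

      stork_submit transfer.stork

The exact client name and options depend on the Stork installation; the point is that the scheduler, not the user, then decides when and how the transfer is executed and retried.
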
  13. Agenda (outline slide repeated).
  14. Fast Data Transfer: end-to-end bulk data transfer runs into the latency wall. There are TCP-based solutions (Fast TCP, Scalable TCP, etc.) and UDP-based solutions (RBUDP, UDT, etc.), but most of these require kernel-level changes and are not preferred by most domain scientists.
  15. Application Level Tuning: take an application-level transfer protocol (e.g., GridFTP) and tune it for better performance by using multiple (parallel) streams and tuning the buffer size (efficient utilization of the available network capacity). Levels of parallelism in end-to-end data transfer: the number of parallel data streams connected to a data transfer service, to increase utilization of the network bandwidth, and the number of concurrent data transfer operations initiated at the same time, for better utilization of system resources.
  16. Parallel TCP Streams: instead of a single connection at a time, multiple TCP streams are opened to a single data transfer service on the destination host. We gain larger bandwidth in TCP, especially on a network with a low packet loss rate; parallel connections better utilize the TCP buffer available to the data transfer, such that N connections might be N times faster than a single connection. Multiple TCP streams do, however, introduce extra overhead in the system.
  17. Average throughput using parallel streams over 1Gbps. Experiments in the LONI (www.loni.org) environment: transferring a file to QB from a Linux machine.
  18. Average throughput using parallel streams over 1Gbps. Experiments in the LONI (www.loni.org) environment: transferring a file to QB from an IBM machine.
  19. Average throughput using parallel streams over 10Gbps.
  20. Average throughput using parallel streams over 10Gbps.
  21. Agenda (outline slide repeated).
  22. Parameter Estimation: can we predict this behavior? Yes, we can come up with a good estimate of the parallelism level using network statistics, extra measurements, and historical data.
  23. Parallel Stream Optimization: for a single stream, there is a theoretical calculation of throughput based on MSS, RTT, and the packet loss rate. Assuming that n streams gain as much as the total throughput of n single streams is not correct; a better model establishes a relation between RTT, the loss rate p, and the number of streams n. The equations themselves are not preserved in this transcript; a hedged reconstruction follows below.
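
The equations on the original slide are images and do not appear in this text. As a hedged reconstruction: the single-stream bound is most likely the widely cited Mathis model, and the "better model" is sketched here only in general form (treating the loss rate as a function of the stream count), since its exact expression is not shown:

      % Mathis et al.: single-stream TCP throughput bound (c is a constant, roughly sqrt(3/2))
      Th_{1} \le \frac{MSS \cdot c}{RTT \cdot \sqrt{p}}

      % Naive scaling for n streams: the assumption the slide marks as "not correct"
      Th_{n} \approx n \cdot \frac{MSS \cdot c}{RTT \cdot \sqrt{p}}

      % Refined view: the effective loss rate grows with n, so aggregate throughput saturates
      Th_{n} \le \frac{MSS \cdot c}{RTT} \cdot \frac{n}{\sqrt{p(n)}}, \qquad p(n)\ \text{increasing in}\ n
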
  24. Parameter Estimation Service.
  25. Parameter Estimation (limitations): it might not reflect the best possible current settings in a dynamic environment; what if network conditions change? It requires three sample transfers (for curve fitting), needs to probe the system and take measurements with external profilers, and does require a complex model for parameter optimization.
  26. Agenda (outline slide repeated).
  27. Adaptive Tuning: instead of predictive sampling, use data from the actual transfer. Transfer the data in chunks (partial transfers) and set control parameters on the fly; measure throughput for every transferred data chunk; gradually increase the number of parallel streams until it reaches an equilibrium point.
  28. Adaptive Tuning: no need to probe the system or take measurements with external profilers; does not require any complex model for parameter optimization; adapts to the changing environment. But there is overhead in changing the parallelism level, hence a fast start (exponentially increasing the number of parallel streams).
  29. Adaptive Tuning (algorithm): start with a single stream (n = 1) and measure instant throughput for every data chunk transferred. Fast start: increase the number of parallel streams (n = n*2), transfer the data chunk, and measure instant throughput; if the current throughput is better than the previous one, continue. Otherwise, set n back to the old value and increase the parallelism level gradually (n = n+1). If there is no throughput gain from increasing the number of streams, the equilibrium point has been found, so increase the chunk size (delaying the measurement period). A sketch of this loop follows below.
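
A minimal sketch of the tuning loop described above, with a hypothetical transfer_chunk(n, size) callback standing in for the real transfer module (this illustrates the algorithm on the slide, not Stork's actual implementation):

      def adaptive_tune(transfer_chunk, total_size, chunk_size=64 * 2**20):
          # transfer_chunk(n, size) transfers `size` bytes with n parallel streams
          # and returns the measured throughput (hypothetical callback).
          n, sent = 1, 0
          best = transfer_chunk(n, chunk_size)          # start with a single stream
          sent += chunk_size
          fast_start = True                             # exponential growth phase
          while sent < total_size:
              trial = n * 2 if fast_start else n + 1    # n = n*2, later n = n+1
              throughput = transfer_chunk(trial, chunk_size)
              sent += chunk_size
              if throughput > best:                     # gain: adopt the higher stream count
                  n, best = trial, throughput
              elif fast_start:
                  fast_start = False                    # overshoot: switch to gradual +1 probing
              else:
                  chunk_size *= 2                       # equilibrium found: delay measurements
          return n
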
  30. Adaptive tuning: number of parallel streams. Experiments in the LONI (www.loni.org) environment: transferring a file to QB from an IBM machine.
  31. Adaptive tuning: number of parallel streams. Experiments in the LONI (www.loni.org) environment: transferring a file to QB from a Linux machine.
  32. Adaptive tuning: number of parallel streams. Experiments in the LONI (www.loni.org) environment: transferring a file to QB from a Linux machine.
  33. Dynamic Tuning Algorithm.
  34. Dynamic Tuning Algorithm.
  35. Dynamic Tuning Algorithm.
  36. Agenda (outline slide repeated).
  37. Failure Awareness: in a dynamic environment, data transfers are prone to frequent failures. What went wrong during the data transfer? We have no access to the remote resources, and messages get lost due to system malfunctions. Instead of waiting for a failure to happen: detect possible failures and malfunctioning services, search for another data server or an alternate data transfer service, and classify erroneous cases to make better decisions.
  38. Error Detection: use network exploration techniques to check the availability of the remote service, resolve the host and determine connectivity failures, and detect available data transfer services; these checks should be fast and efficient so as not to burden system/network resources. If an error occurs while a transfer is in progress (Error_TRANSFER): retry or not? When should the transfer be re-initiated? Should alternate options be used? (A simple connectivity probe is sketched below.)
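
As a rough illustration of the "network exploration" step (not Stork's actual probe code), a lightweight check can resolve the host and attempt a TCP connection to the transfer service with a short timeout; the default port below assumes a GridFTP control channel and is otherwise a placeholder:

      import socket

      def probe_transfer_service(host, port=2811, timeout=5.0):
          # Return a coarse availability classification; 2811 is the conventional
          # GridFTP control port (an assumption for this sketch).
          try:
              addr = socket.gethostbyname(host)              # resolve host
          except socket.gaierror:
              return "RESOLVE_FAILED"                        # naming / DNS problem
          try:
              with socket.create_connection((addr, port), timeout=timeout):
                  return "SERVICE_REACHABLE"                 # control channel accepts connections
          except socket.timeout:
              return "CONNECT_TIMEOUT"                       # host resolved, service unresponsive
          except OSError:
              return "CONNECT_REFUSED"                       # connectivity or service failure
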
  39. Error Classification: the data transfer protocol does not always return appropriate error codes, so we use the error messages generated by the data transfer protocol, together with a better logging facility and classification. Recovering from failure: retry the failed operation, or postpone scheduling of a failed operation. Early error detection: initiate the transfer when the erroneous condition has recovered, or use alternate options.
  40. Error Reporting.
  41. Failure Aware Scheduling: SCOOP data from Hurricane Gustav simulations; hundreds of files (250 data transfer operations); small (100MB) and large (1GB, 2GB) files.
  42. New Transfer Modules: verify the successful completion of the operation by checking the checksum and file size. For GridFTP, the Stork transfer module can recover from a failed operation by restarting from the last transmitted file; in case of a retry after a failure, the scheduler tells the transfer module to recover and restart the transfer using information from a rescue file created by the checkpoint-enabled transfer module. An "intelligent" (dynamically tuned) alternative to Globus RFT (Reliable File Transfer).
  43. Agenda (outline slide repeated).
  44. Job Aggregation: multiple data movement jobs are combined and processed as a single transfer job. Information about each aggregated job is stored in the job queue and tied to the main job that actually performs the transfer, so it can be queried and reported separately; aggregation is therefore transparent to the user. We have seen vast performance improvements, especially with small data files, by decreasing protocol overhead and reducing the number of independent network connections. (A sketch of the grouping step follows below.)
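
A minimal sketch of the aggregation idea; the job fields and the size cap are hypothetical, and this is not Stork's internal representation. Jobs sharing the same source and destination endpoints are merged into one batch, and each member keeps its own queue entry so it can still be queried and reported separately:

      from collections import defaultdict

      def aggregate_jobs(jobs, max_batch_bytes=512 * 2**20):
          # jobs: list of dicts with 'src_host', 'dest_host', 'path', 'size' (hypothetical fields).
          # Each returned batch would be executed as a single transfer over one connection.
          by_endpoint = defaultdict(list)
          for job in jobs:
              by_endpoint[(job["src_host"], job["dest_host"])].append(job)

          batches = []
          for endpoint, group in by_endpoint.items():
              batch, size = [], 0
              for job in group:
                  if batch and size + job["size"] > max_batch_bytes:
                      batches.append({"endpoint": endpoint, "members": batch})
                      batch, size = [], 0
                  batch.append(job)      # member job stays visible in the queue for reporting
                  size += job["size"]
              if batch:
                  batches.append({"endpoint": endpoint, "members": batch})
          return batches
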
  45. Job Aggregation: experiments on LONI (Louisiana Optical Network Initiative) with 1024 transfer jobs from Ducky to Queenbee (average RTT 5.129 ms), 5MB data file per job.
  46. Agenda (outline slide repeated).
  47. Stork: Central Scheduling Framework. A performance bottleneck (hundreds of jobs submitted to a single batch scheduler, Stork) and a single point of failure.
  48. Future Plans: distributed data scheduling through interaction between data schedulers; manage data activities with lightweight agents at each site; job delegation; peer-to-peer data movement; data and server striping; use of replicas for multi-source downloads.
  49. Questions? Mehmet Balman, balman@cct.lsu.edu. www.petashare.org, www.cybertools.loni.org, www.storkproject.org, www.cct.lsu.edu.
  50. Thank you.
  51. Average Throughput of Concurrent Transfer Jobs.
  52. Average Throughput of Concurrent Transfer Jobs.
