Data Movement between
Distributed Repositories for
Large Scale Collaborative
Science
Mehmet Balman
Louisiana State University
Baton Rouge, LA
Motivation
 Scientific applications are becoming more data intensive
(dealing with petabytes of data)
 We use geographically distributed resources to satisfy
immense computational requirements
 The distributed nature of the resources makes data
movement a major bottleneck for end-to-end
application performance
Therefore, complex middleware is required to
orchestrate the use of these storage and network
resources between collaborating parties, and to manage
the end-to-end distribution of data.
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
PetaShare
• Distributed Storage System in Louisiana
• Spans seven research institutions
• 300TB of disk storage
• 400TB of tape (will be online soon)
using:
IRODS (Integrated Rule-Oriented Data System)
www.irods.org
PetaShare as an example
 Global Namespace among distributed resources
 Client tools and interfaces:
 Pcommands
 Petashell (parrot)
 Petafs (fuse)
 Windows Browser
 Web Portal
The general scenario is to use an intermediate storage
area (limited capacity) and then transfer files to
remote storage for post-processing and long-term
archival
PetaShare Architecture
Fast and Efficient Data Migration in PetaShare?
LONI (Louisiana Optical Network Initiative)
www.loni.org
Lightweight client tools for transparent access
 Petashell, based on Parrot
 Petafs, a FUSE client
In order to improve throughput performance, we
implemented an Advance Buffer Cache in the Petafs and
Petashell clients, aggregating I/O requests to minimize
the number of network messages (see the sketch below).
Is it efficient for bulk data transfer?
PetaShare Client Tools
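As a rough illustration of the buffering idea (not the actual Petafs/Petashell code; the flush_to_server callback and the 1 MB threshold are assumptions), a write-side buffer cache can coalesce small contiguous I/O requests into a single network message:

# Sketch: coalesce small sequential writes into fewer network messages.
# flush_to_server() stands in for the real server call; threshold is illustrative.

class AdvanceBufferCache:
    def __init__(self, flush_to_server, threshold=1 << 20):  # 1 MB
        self.flush_to_server = flush_to_server  # sends one network message
        self.threshold = threshold
        self.offset = None   # file offset where the buffered run starts
        self.chunks = []     # pending write payloads
        self.size = 0

    def write(self, offset, data):
        # Start a new run if this write is not contiguous with the buffer.
        if self.offset is not None and offset != self.offset + self.size:
            self.flush()
        if self.offset is None:
            self.offset = offset
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        if self.chunks:
            self.flush_to_server(self.offset, b"".join(self.chunks))
            self.offset, self.chunks, self.size = None, [], 0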
Client performance with Advance Buffer
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
 Advanced Data Transfer Protocols (e.g., GridFTP)
 High throughput data transfer
 Data Scheduler: Stork
 Organizing data movement activities
 Ordering data transfer requests
Moving Large Data Sets
 Stork: A batch scheduler for Data Placement
activities
 Supports plug-in data transfer modules for
specific protocols/services
 Throttling: deciding number of concurrent
transfers
 Keep a log of data placement activities
 Add fault tolerance to data transfers
 Tuning protocol transfer parameters (number
of parallel TCP streams)
Scheduling Data Movement Jobs
[ dest_url = "gsiftp://eric1.loni.org/scratch/user/";
arguments = "-p 4 -dbg -vb";
src_url = "file:///home/user/test/";
dap_type = "transfer";
verify_checksum = true;
verify_filesize = true;
set_permission = "755" ;
recursive_copy = true;
network_check = true;
checkpoint_transfer = true;
output = "user.out";
err = "user.err";
log = "userjob.log";
]
Stork Job submission
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
End-to-end bulk data transfer (latency wall)
 TCP based solutions
 FAST TCP, Scalable TCP, etc.
 UDP based solutions
 RBUDP, UDT, etc.
 Most of these solutions require kernel level
changes
 Not preferred by most domain scientists
Fast Data Transfer
 Take an application-level transfer protocol (e.g.,
GridFTP) and tune it for better performance:
 Using Multiple (Parallel) streams
 Tuning Buffer size
(efficient utilization of available network capacity)
Level of Parallelism in End-to-end Data Transfer
 number of parallel data streams connected to a data transfer
service for increasing the utilization of network bandwidth
 number of concurrent data transfer operations that are
initiated at the same time for better utilization of system
resources.
Application Level Tuning
 Instead of a single connection at a time, multiple
TCP streams are opened to a single data transfer
service in the destination host.
 We gain larger bandwidth in TCP, especially in a
network with a low packet loss rate; parallel
connections better utilize the TCP buffer available to
the data transfer, such that N connections might be N
times faster than a single connection
 But multiple TCP streams result in extra overhead in
the system (see the sketch below)
Parallel TCP Streams
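A minimal sketch of the parallel-stream idea, assuming a cooperating receiver that accepts an (offset, length) header on each connection; the framing is illustrative, not the GridFTP protocol:

# Sketch: push one file over n parallel TCP streams to a transfer service.
# host/port and the header framing are assumptions.

import os, socket, struct, threading

def send_partition(host, port, path, offset, length):
    with socket.create_connection((host, port)) as s, open(path, "rb") as f:
        f.seek(offset)
        s.sendall(struct.pack("!QQ", offset, length))  # where this data goes
        remaining = length
        while remaining > 0:
            chunk = f.read(min(1 << 16, remaining))
            s.sendall(chunk)
            remaining -= len(chunk)

def parallel_send(host, port, path, n):
    size = os.path.getsize(path)
    part = size // n
    threads = []
    for i in range(n):
        off = i * part
        ln = part if i < n - 1 else size - off  # last stream takes the remainder
        t = threading.Thread(target=send_partition,
                             args=(host, port, path, off, ln))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()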
Average Throughput using parallel streams over 1Gbps
Experiments in LONI (www.loni.org) environment - transfer file to
QB from Linux m/c
Average Throughput using parallel streams over 1Gbps
Experiments in LONI (www.loni.org) environment - transfer file to QB from
IBM m/c
Average Throughput using parallel streams over 10Gbps
Average Throughput using parallel streams over 10Gbps
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
 Can we predict this
behavior?
 Yes, we can come up with
a good estimate of the
parallelism level
 Network statistics
 Extra measurement
 Historical data
Parameter Estimation
For a single stream, the theoretical throughput is
calculated from MSS, RTT, and the packet loss rate:
The assumption that n streams gain n times the total
throughput of a single stream is not correct:
A better model establishes a relation between RTT, p,
and the number of streams n (reconstructed below):
Parallel Stream Optimization
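The formulas on this slide are images in the original deck. A plausible reconstruction, based on the models the bullets describe (the Mathis et al. single-stream bound, its naive n-stream extension, and a curve-fitting model in the style of Yildirim et al., whose three parameters match the "three sample transfers" noted on the next estimation slide):

\[ Th \le \frac{MSS \cdot C}{RTT \cdot \sqrt{p}} \]

\[ Th_n \le n \cdot \frac{MSS \cdot C}{RTT \cdot \sqrt{p}} \qquad \text{(not correct: assumes the loss rate } p \text{ is independent of } n\text{)} \]

\[ Th_n = \frac{MSS \cdot C}{RTT} \cdot \frac{n}{\sqrt{a\,n^{c} + b}} \]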
Parameter Estimation Service
 Might not reflect the best possible current settings
(dynamic environment)
 What if network conditions change?
 Requires three sample transfers (curve fitting)
 Needs to probe the system and make
measurements with external profilers
 Does require a complex model for parameter
optimization
Parameter Estimation
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
 Instead of predictive sampling, use data from
actual transfer
 transfer data in chunks (partial transfers) and
set control parameters on the fly
 measure throughput for every transferred data
chunk
 gradually increase the number of parallel
streams until reaching an equilibrium point
Adaptive Tuning
 No need to probe the system and make
measurements with external profilers
 Does not require any complex model for
parameter optimization
 Adapts to changing environment
 But, there is overhead in changing the parallelism level
 Fast start (exponentially increase the number
of parallel streams)
Adaptive Tuning
 Start with single stream (n=1)
 Measure instant throughput for every data chunk transferred
(fast start)
 Increase the number of parallel streams (n=n*2),
 transfer the data chunk
 measure instant throughput
 If the current throughput value is better than the previous one,
continue
 Otherwise, set n back to the old value and gradually increase the
parallelism level (n=n+1)
 If there is no throughput gain from increasing the number of
streams (the equilibrium point has been found)
 Increase the chunk size (delay the measurement period)
(A sketch of this loop follows below.)
Adaptive Tuning
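A sketch of the loop described above, assuming a hypothetical transfer_chunk(n, chunk_size) callback that transfers one chunk with n parallel streams and returns the measured instant throughput:

# Sketch of the adaptive tuning loop. transfer_chunk() is assumed, not
# part of the actual Stork module.

def adaptive_tune(transfer_chunk, chunk_size, max_streams=64):
    n = 1
    best = transfer_chunk(n, chunk_size)
    # Fast start: double the stream count while throughput keeps improving.
    while n * 2 <= max_streams:
        t = transfer_chunk(n * 2, chunk_size)
        if t <= best:
            break
        n, best = n * 2, t
    # Gradual phase: step by one around the last good value.
    while n + 1 <= max_streams:
        t = transfer_chunk(n + 1, chunk_size)
        if t <= best:
            break  # equilibrium point found
        n, best = n + 1, t
    # No further gain: enlarge the chunk to delay the next measurement period.
    chunk_size *= 2
    return n, chunk_size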
Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file
to QB from IBM m/c
Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file to
QB from Linux m/c
Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file to
QB from Linux m/c
Dynamic Tuning Algorithm
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
• Dynamic Environment:
• data transfers are prone to frequent failures
• what went wrong during the data transfer?
• No access to the remote resources
• Messages get lost due to system malfunction
• Instead of waiting for a failure to happen
• Detect possible failures and malfunctioning services
• Search for another data server
• Alternate data transfer service
• Classify erroneous cases to make better decisions
Failure Awareness
• Use Network Exploration Techniques (see the sketch below)
– Check availability of the remote service
– Resolve the host and determine connectivity failures
– Detect available data transfer services
– should be fast and efficient so as not to burden system/network
resources
• Error while a transfer is in progress?
– Error_TRANSFER
• Retry or not?
• When to re-initiate the transfer?
• Use alternate options?
Error Detection
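A minimal sketch of such a check, assuming the GridFTP control port (2811) and a short timeout; the error-class names are illustrative:

# Sketch: fast pre-transfer checks along the lines described above.

import socket

def classify_endpoint(host, port=2811, timeout=3.0):
    try:
        addr = socket.gethostbyname(host)          # resolve host
    except socket.gaierror:
        return "ERROR_HOST_RESOLUTION"
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return "SERVICE_AVAILABLE"             # transfer service reachable
    except ConnectionRefusedError:
        return "ERROR_SERVICE_DOWN"                # host up, service not listening
    except OSError:
        return "ERROR_CONNECTIVITY"                # timeout, unreachable, etc.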
• Data transfer protocols do not always return appropriate error codes
• Use the error messages generated by the data transfer protocol
• A better logging facility and classification (see the sketch below)
• Recover from Failure
• Retry the failed operation
• Postpone scheduling of a failed operation
• Early Error Detection
• Initiate the transfer when the erroneous condition has recovered
• Or use alternate options
Error Classification
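A sketch of how classified errors could map to scheduler decisions; the class names and actions are assumptions, not Stork's actual taxonomy:

# Sketch: map an error class to a scheduling decision (assumed names).

POLICY = {
    "ERROR_TRANSFER":        "retry",       # transient failure mid-transfer
    "ERROR_CONNECTIVITY":    "postpone",    # wait until connectivity recovers
    "ERROR_SERVICE_DOWN":    "alternate",   # try another data transfer service
    "ERROR_HOST_RESOLUTION": "alternate",   # try a replica on another server
}

def decide(error_class):
    return POLICY.get(error_class, "report")  # unknown errors go to the log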
Error Reporting
SCOOP data - Hurricane Gustav Simulations
Hundreds of files (250 data transfer operations)
Small (100MB) and large files (1GB, 2GB)
Failure Aware Scheduling
• Verify the successful completion of the operation by
checking checksum and file size (sketched below).
• For GridFTP, the Stork transfer module can recover
from a failed operation by restarting from the last
transmitted file. In case of a retry after a failure, the
scheduler informs the transfer module to recover
and restart the transfer using the information from
a rescue file created by the checkpoint-enabled
transfer module.
• An “intelligent” (dynamic tuning) alternative to
Globus RFT (Reliable File Transfer)
New Transfer Modules
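A sketch of the verification step, assuming an MD5 checksum; the actual module may use a different digest and rescue-file format:

# Sketch of the verify_checksum / verify_filesize step.

import hashlib, os

def verify_transfer(local_path, expected_size, expected_md5):
    if os.path.getsize(local_path) != expected_size:
        return False                      # short or padded file: retransfer
    h = hashlib.md5()
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == expected_md5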
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
• Multiple data movement jobs are combined and
processed as a single transfer job (sketched below)
• Information about the aggregated job is stored in the
job queue and tied to a main job that actually
performs the transfer operation, so that each job can
be queried and reported separately.
• Hence, aggregation is transparent to the user
• We have seen vast performance improvements,
especially with small data files
– decreasing the amount of protocol overhead
– reducing the number of independent network
connections
Job Aggregation
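A sketch of the aggregation step, grouping queued jobs that share source and destination endpoints under one main job; the job representation is an assumption:

# Sketch: combine queued transfer jobs that share endpoints into one
# aggregated job, keeping back-references for per-job reporting.

from collections import defaultdict
from urllib.parse import urlparse

def aggregate(jobs):
    groups = defaultdict(list)
    for job in jobs:  # job: dict with "id", "src_url", "dest_url"
        src, dst = urlparse(job["src_url"]), urlparse(job["dest_url"])
        groups[(src.scheme, src.netloc, dst.scheme, dst.netloc)].append(job)
    aggregated = []
    for key, members in groups.items():
        aggregated.append({
            "key": key,
            "files": [(j["src_url"], j["dest_url"]) for j in members],
            "member_ids": [j["id"] for j in members],  # queryable individually
        })
    return aggregated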
Experiments on LONI (Louisiana Optical Network Initiative):
1024 transfer jobs from Ducky to Queenbee (avg RTT 5.129 ms) - 5MB
data file per job
Job Aggregation
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
• Performance bottleneck
– Hundreds of jobs submitted to a single batch
scheduler, Stork
• Single point of failure
Stork: Central Scheduling Framework
• Interaction between data schedulers
– Manage data activities with lightweight agents in
each site
– Job Delegation
– Peer-to-peer data movement
– Data and server striping
– Make use of replicas for multi-source downloads
Distributed Data Scheduling
Future Plans
www.petashare.org
www.cybertools.loni.org
www.storkproject.org
www.cct.lsu.edu
Questions?
Mehmet Balman balman@cct.lsu.edu
Thank you
Average Throughput of Concurrent Transfer Jobs
Average Throughput of Concurrent Transfer Jobs
