Data Movement between
Distributed Repositories for
Large Scale Collaborative
Science
Mehmet Balman
Louisiana State University
Baton Rouge, LA
Motivation
 Scientific applications are becoming more data intensive
(dealing with petabytes of data)
 We use geographically distributed resources to satisfy
immense computational requirements
 The distributed nature of the resources makes data
movement a major bottleneck for end-to-end
application performance
Therefore, complex middleware is required to
orchestrate the use of these storage and network
resources between collaborating parties, and to manage
the end-to-end distribution of data.
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
PetaShare
• Distributed Storage System in Louisiana
• Spans seven research institutions
• 300TB of disk storage
• 400TB of tape (will be online soon)
using:
IRODS (Integrated Rule-Oriented Data System)
www.irods.org
PetaShare as an example
 Global Namespace among distributed resources
 Client tools and interfaces:
 Pcommands
 Petashell (parrot)
 Petafs (fuse)
 Windows Browser
 Web Portal
The general scenario is to use an intermediate storage
area (limited capacity) and then transfer files to
remote storage for post-processing and long-term
archival
PetaShare Architecture
Fast and Efficient Data Migration in PetaShare?
LONI (Louisiana Optical Network Initiative)
www.loni.org
Lightweight client tools for transparent access
 Petashell, based on Parrot
 Petafs, a FUSE client
In order to improve throughput performance, we
implemented an Advance Buffer Cache in the Petafs and
Petashell clients, aggregating I/O requests to minimize
the number of network messages (see the sketch below).
Is it efficient for bulk data transfer?
PetaShare Client Tools
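As a rough illustration of the buffering idea (not the actual Petafs/Petashell code; the flush_to_server callback and the 1 MB threshold are assumptions), a write-side buffer cache can coalesce small contiguous I/O requests into a single network message:

# Sketch: coalesce small sequential writes into fewer network messages.
# flush_to_server() stands in for the real server call; threshold is illustrative.

class AdvanceBufferCache:
    def __init__(self, flush_to_server, threshold=1 << 20):  # 1 MB
        self.flush_to_server = flush_to_server  # sends one network message
        self.threshold = threshold
        self.offset = None   # file offset where the buffered run starts
        self.chunks = []     # pending write payloads
        self.size = 0

    def write(self, offset, data):
        # Start a new run if this write is not contiguous with the buffer.
        if self.offset is not None and offset != self.offset + self.size:
            self.flush()
        if self.offset is None:
            self.offset = offset
        self.chunks.append(data)
        self.size += len(data)
        if self.size >= self.threshold:
            self.flush()

    def flush(self):
        if self.chunks:
            self.flush_to_server(self.offset, b"".join(self.chunks))
            self.offset, self.chunks, self.size = None, [], 0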
Client performance with Advance Buffer
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
 Advanced Data Transfer Protocols (e.g., GridFTP)
 High throughput data transfer
 Data Scheduler: Stork
 Organizing data movement activities
 Ordering data transfer requests
Moving Large Data Sets
 Stork: A batch scheduler for Data Placement
activities
 Supports plug-in data transfer modules for
specific protocols/services
 Throttling: deciding number of concurrent
transfers
 Keep a log of data placement activities
 Add fault tolerance to data transfers
 Tuning protocol transfer parameters (number
of parallel TCP streams)
Scheduling Data Movement Jobs
[ dest_url = "gsiftp://eric1.loni.org/scratch/user/";
arguments = "-p 4 -dbg -vb";
src_url = "file:///home/user/test/";
dap_type = "transfer";
verify_checksum = true;
verify_filesize = true;
set_permission = "755" ;
recursive_copy = true;
network_check = true;
checkpoint_transfer = true;
output = "user.out";
err = "user.err";
log = "userjob.log";
]
Stork Job submission
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
End-to-end bulk data transfer (latency wall)
 TCP based solutions
 FAST TCP, Scalable TCP, etc.
 UDP based solutions
 RBUDP, UDT, etc.
 Most of these solutions require kernel level
changes
 Not preferred by most domain scientists
Fast Data Transfer
 Take an application-level transfer protocol (e.g.,
GridFTP) and tune it for better performance:
 Using Multiple (Parallel) streams
 Tuning Buffer size
(efficient utilization of available network capacity)
Level of Parallelism in End-to-end Data Transfer
 number of parallel data streams connected to a data transfer
service for increasing the utilization of network bandwidth
 number of concurrent data transfer operations that are
initiated at the same time for better utilization of system
resources.
Application Level Tuning
 Instead of a single connection at a time, multiple
TCP streams are opened to a single data transfer
service in the destination host.
 We gain larger bandwidth in TCP, especially in a
network with a low packet loss rate; parallel
connections better utilize the TCP buffer available to
the data transfer, such that N connections might be N
times faster than a single connection
 But multiple TCP streams result in extra overhead in
the system (see the sketch below)
Parallel TCP Streams
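A minimal sketch of the parallel-stream idea, assuming a cooperating receiver that accepts an (offset, length) header on each connection; the framing is illustrative, not the GridFTP protocol:

# Sketch: push one file over n parallel TCP streams to a transfer service.
# host/port and the header framing are assumptions.

import os, socket, struct, threading

def send_partition(host, port, path, offset, length):
    with socket.create_connection((host, port)) as s, open(path, "rb") as f:
        f.seek(offset)
        s.sendall(struct.pack("!QQ", offset, length))  # where this data goes
        remaining = length
        while remaining > 0:
            chunk = f.read(min(1 << 16, remaining))
            s.sendall(chunk)
            remaining -= len(chunk)

def parallel_send(host, port, path, n):
    size = os.path.getsize(path)
    part = size // n
    threads = []
    for i in range(n):
        off = i * part
        ln = part if i < n - 1 else size - off  # last stream takes the remainder
        t = threading.Thread(target=send_partition,
                             args=(host, port, path, off, ln))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()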
Average Throughput using parallel streams over 1Gbps
Experiments in LONI (www.loni.org) environment - transfer file to
QB from Linux m/c
Average Throughput using parallel streams over 1Gbps
Experiments in LONI (www.loni.org) environment - transfer file to QB from
IBM m/c
Average Throughput using parallel streams over 10Gbps
Average Throughput using parallel streams over 10Gbps
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
 Can we predict this
behavior?
 Yes, we can come up with
a good estimate of the
parallelism level
 Network statistics
 Extra measurement
 Historical data
Parameter Estimation
For a single stream, the theoretical throughput is
calculated from MSS, RTT, and the packet loss rate:
The assumption that n streams gain n times the total
throughput of a single stream is not correct:
A better model establishes a relation between RTT, p,
and the number of streams n (reconstructed below):
Parallel Stream Optimization
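The formulas on this slide are images in the original deck. A plausible reconstruction, based on the models the bullets describe (the Mathis et al. single-stream bound, its naive n-stream extension, and a curve-fitting model in the style of Yildirim et al., whose three parameters match the "three sample transfers" noted on the next estimation slide):

\[ Th \le \frac{MSS \cdot C}{RTT \cdot \sqrt{p}} \]

\[ Th_n \le n \cdot \frac{MSS \cdot C}{RTT \cdot \sqrt{p}} \qquad \text{(not correct: assumes the loss rate } p \text{ is independent of } n\text{)} \]

\[ Th_n = \frac{MSS \cdot C}{RTT} \cdot \frac{n}{\sqrt{a\,n^{c} + b}} \]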
Parameter Estimation Service
 Might not reflect the best possible current settings
(dynamic environment)
 What if network conditions change?
 Requires three sample transfers (curve fitting)
 Needs to probe the system and make
measurements with external profilers
 Does require a complex model for parameter
optimization
Parameter Estimation
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
 Instead of predictive sampling, use data from
actual transfer
 transfer data in chunks (partial transfers) and
set control parameters on the fly
 measure throughput for every transferred data
chunk
 gradually increase the number of parallel
streams until reaching an equilibrium point
Adaptive Tuning
 No need to probe the system and make
measurements with external profilers
 Does not require any complex model for
parameter optimization
 Adapts to changing environment
 But, there is overhead in changing the parallelism level
 Fast start (exponentially increase the number
of parallel streams)
Adaptive Tuning
 Start with single stream (n=1)
 Measure instant throughput for every data chunk transferred
(fast start)
 Increase the number of parallel streams (n=n*2),
 transfer the data chunk
 measure instant throughput
 If the current throughput value is better than the previous one,
continue
 Otherwise, set n back to the old value and gradually increase the
parallelism level (n=n+1)
 If there is no throughput gain from increasing the number of
streams (the equilibrium point has been found)
 Increase the chunk size (delay the measurement period)
(A sketch of this loop follows below.)
Adaptive Tuning
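A sketch of the loop described above, assuming a hypothetical transfer_chunk(n, chunk_size) callback that transfers one chunk with n parallel streams and returns the measured instant throughput:

# Sketch of the adaptive tuning loop. transfer_chunk() is assumed, not
# part of the actual Stork module.

def adaptive_tune(transfer_chunk, chunk_size, max_streams=64):
    n = 1
    best = transfer_chunk(n, chunk_size)
    # Fast start: double the stream count while throughput keeps improving.
    while n * 2 <= max_streams:
        t = transfer_chunk(n * 2, chunk_size)
        if t <= best:
            break
        n, best = n * 2, t
    # Gradual phase: step by one around the last good value.
    while n + 1 <= max_streams:
        t = transfer_chunk(n + 1, chunk_size)
        if t <= best:
            break  # equilibrium point found
        n, best = n + 1, t
    # No further gain: enlarge the chunk to delay the next measurement period.
    chunk_size *= 2
    return n, chunk_size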
Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file
to QB from IBM m/c
Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file to
QB from Linux m/c
Adaptive Tuning: number of parallel streams
Experiments in LONI (www.loni.org) environment - transfer file to
QB from Linux m/c
Dynamic Tuning Algorithm
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
• Dynamic Environment:
• data transfers are prone to frequent failures
• what went wrong during the data transfer?
• No access to the remote resources
• Messages get lost due to system malfunction
• Instead of waiting for a failure to happen
• Detect possible failures and malfunctioning services
• Search for another data server
• Alternate data transfer service
• Classify erroneous cases to make better decisions
Failure Awareness
• Use Network Exploration Techniques (see the sketch below)
– Check availability of the remote service
– Resolve the host and determine connectivity failures
– Detect available data transfer services
– should be fast and efficient so as not to burden system/network
resources
• Error while a transfer is in progress?
– Error_TRANSFER
• Retry or not?
• When to re-initiate the transfer?
• Use alternate options?
Error Detection
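A minimal sketch of such a check, assuming the GridFTP control port (2811) and a short timeout; the error-class names are illustrative:

# Sketch: fast pre-transfer checks along the lines described above.

import socket

def classify_endpoint(host, port=2811, timeout=3.0):
    try:
        addr = socket.gethostbyname(host)          # resolve host
    except socket.gaierror:
        return "ERROR_HOST_RESOLUTION"
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            return "SERVICE_AVAILABLE"             # transfer service reachable
    except ConnectionRefusedError:
        return "ERROR_SERVICE_DOWN"                # host up, service not listening
    except OSError:
        return "ERROR_CONNECTIVITY"                # timeout, unreachable, etc.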
• Data transfer protocols do not always return appropriate error codes
• Use the error messages generated by the data transfer protocol
• A better logging facility and classification (see the sketch below)
• Recover from Failure
• Retry the failed operation
• Postpone scheduling of a failed operation
• Early Error Detection
• Initiate the transfer when the erroneous condition has recovered
• Or use alternate options
Error Classification
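A sketch of how classified errors could map to scheduler decisions; the class names and actions are assumptions, not Stork's actual taxonomy:

# Sketch: map an error class to a scheduling decision (assumed names).

POLICY = {
    "ERROR_TRANSFER":        "retry",       # transient failure mid-transfer
    "ERROR_CONNECTIVITY":    "postpone",    # wait until connectivity recovers
    "ERROR_SERVICE_DOWN":    "alternate",   # try another data transfer service
    "ERROR_HOST_RESOLUTION": "alternate",   # try a replica on another server
}

def decide(error_class):
    return POLICY.get(error_class, "report")  # unknown errors go to the log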
Error Reporting
SCOOP data - Hurricane Gustav Simulations
Hundreds of files (250 data transfer operations)
Small (100MB) and large files (1GB, 2GB)
Failure Aware Scheduling
• Verify the successful completion of the operation by
checking checksum and file size (sketched below).
• For GridFTP, the Stork transfer module can recover
from a failed operation by restarting from the last
transmitted file. In case of a retry after a failure, the
scheduler informs the transfer module to recover
and restart the transfer using the information from
a rescue file created by the checkpoint-enabled
transfer module.
• An “intelligent” (dynamic tuning) alternative to
Globus RFT (Reliable File Transfer)
New Transfer Modules
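A sketch of the verification step, assuming an MD5 checksum; the actual module may use a different digest and rescue-file format:

# Sketch of the verify_checksum / verify_filesize step.

import hashlib, os

def verify_transfer(local_path, expected_size, expected_md5):
    if os.path.getsize(local_path) != expected_size:
        return False                      # short or padded file: retransfer
    h = hashlib.md5()
    with open(local_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == expected_md5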
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
• Multiple data movement jobs are combined and
processed as a single transfer job (sketched below)
• Information about the aggregated job is stored in the
job queue and tied to a main job that actually
performs the transfer operation, so that each job can
be queried and reported separately.
• Hence, aggregation is transparent to the user
• We have seen vast performance improvements,
especially with small data files
– decreasing the amount of protocol overhead
– reducing the number of independent network
connections
Job Aggregation
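A sketch of the aggregation step, grouping queued jobs that share source and destination endpoints under one main job; the job representation is an assumption:

# Sketch: combine queued transfer jobs that share endpoints into one
# aggregated job, keeping back-references for per-job reporting.

from collections import defaultdict
from urllib.parse import urlparse

def aggregate(jobs):
    groups = defaultdict(list)
    for job in jobs:  # job: dict with "id", "src_url", "dest_url"
        src, dst = urlparse(job["src_url"]), urlparse(job["dest_url"])
        groups[(src.scheme, src.netloc, dst.scheme, dst.netloc)].append(job)
    aggregated = []
    for key, members in groups.items():
        aggregated.append({
            "key": key,
            "files": [(j["src_url"], j["dest_url"]) for j in members],
            "member_ids": [j["id"] for j in members],  # queryable individually
        })
    return aggregated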
Experiments on LONI (Louisiana Optical Network Initiative):
1024 transfer jobs from Ducky to Queenbee (avg RTT 5.129 ms) - 5MB
data file per job
Job Aggregation
➢PetaShare Environment – as an example
➢ Distributed Data Management in Louisiana
➢Data Movement using Stork
➢ Data Scheduling
➢Tuning Data Transfer Operations
➢ Prediction Service
➢ Adaptive Tuning
➢Failure-Awareness
➢Job Aggregation
➢Future Directions
Agenda
• Performance bottleneck
– Hundreds of jobs submitted to a single batch
scheduler, Stork
• Single point of failure
Stork: Central Scheduling Framework
• Interaction between data schedulers
– Manage data activities with lightweight agents in
each site
– Job Delegation
– Peer-to-peer data movement
– Data and server striping
– Make use of replicas for multi-source downloads
Distributed Data Scheduling
Future Plans
www.petashare.org
www.cybertools.loni.org
www.storkproject.org
www.cct.lsu.edu
Questions?
Mehmet Balman balman@cct.lsu.edu
Thank you
Average Throughput of Concurrent Transfer Jobs
Average Throughput of Concurrent Transfer Jobs
