SlideShare a Scribd company logo
1 of 26
Download to read offline
Stork 1.0 and Beyond
Data Scheduling for Large‐scale 
ll bCollaborative Science
Mehmet Balman
Louisiana State University, Baton Rouge, LA, USA
Presented at Condor Week 2009 April 20-April 23, 2009
Scheduling Data Placement JobsScheduling Data Placement Jobs
• Data Placement ActivitiesData Placement Activities
• Modular Architecture
– Data Transfer ModulesData Transfer Modules 
for specific protocols/services
• Throttle maximum transfer operations running
• Keep a log of data placement activities• Keep a log of data placement activities
• Add fault tolerance to data transfers
Job SubmissionJob Submission
[ dest_url = "gsiftp://eric1.loni.org/scratch/user/";
arguments = ‐p 4 dbg ‐vb";
src_url = "file:///home/user/test/";
dap_type = "transfer";
verify_checksum = true;
verify_filesize = true;
set_permission = "755" ;
i trecursive_copy = true;
network_check = true;
checkpoint_transfer = true;
output = "userout";output =  user.out ;
err = "user.err";
log = "userjob.log";
]]
AgendaAgenda
• Error Detection and Error ClassificationError Detection and Error Classification
• Data Transfer Operations
D i T i– Dynamic Tuning 
– Prediction Service
– Job Aggregation
• Data Migration using Stork
• Practical example in PetaShare Project
• Future Directions
Failure‐AwarenessFailure Awareness
• Dynamic Environment: 
• data transfers are prune to frequent failures
• what went wrong during data transfer?
• No access to the remote resourcesNo access to the remote resources
• Messages get lost due to system malfunction
• Instead of waiting failure to happen• Instead of waiting failure to happen
• Detect possible failures and malfunctioning services
• Search for another data server
• Alternate data transfer service• Alternate data transfer service
• Classify erroneous cases to make better decisions
Error DetectionError Detection
• Use Network Exploration Techniques
– Check availability of the remote service
– Resolve host and determine connectivity failures
– Detect available data transfers service
– should be Fast and Efficient not to bother system/network resources
• Error while transfer is in progress?
– Error_TRANSFER
• Retry or not?
• When to re‐initiate the transfer
• Use alternate options?• Use alternate options?
Error ClassificationError Classification
•Recover from Failure
•Retry failed operation
•Postpone scheduling of a 
failed operationsfailed operations
•Early Error Detection
I i i T f h•Initiate Transfer when 
erroneous condition 
recovered
•Or use Alternate options
• Data Transfer Protocol not always return appropriate error codes
• Using error messages generated by the data transfer protocol
p
• A better logging facility and classification
Error ReportingError Reporting
Failure‐Aware SchedulingFailure Aware Scheduling
Scoop data  ‐ Hurricane Gustov Simulationsp
Hundreds of files (250 data transfer operation)
Small (100MB) and large files (1G, 2G
New Transfer ModulesNew Transfer Modules
• Verify the successful completion of the operation by y p p y
controlling checksum and file size. 
f G idFTP S k f d l f• for GridFTP, Stork transfer module can recover from a 
failed operation by restarting from the last transmitted 
file. In case of a retry from a failure, scheduler informs 
the transfer module to recover and restart the transfer 
using the information from a rescue file created by the 
checkpoint‐enabled transfer module.checkpoint enabled transfer module.
• Replacing Globus RFT (Reliable File Transfer)
AgendaAgenda
• Error Detection and Error ClassificationError Detection and Error Classification
• Data Transfer Operations
D i T i– Dynamic Tuning 
– Prediction Service
– Job Aggregation
• Data Migration using Stork
• Practical example in PetaShare Project
• Future Directions
Tuning Data TransfersTuning Data Transfers
• Latency Wall
– Buffer Size Optimization
– Parallel TCP Streams
– Concurrent  Transfers
• User level end‐to‐end Tuning
P ll liParallelism
• (1) the number of parallel data streams connected to a data transfer 
service for increasing the utilization of network bandwidthservice for increasing the utilization of network bandwidth
• (2) the number of concurrent data transfer operations that are 
initiated at the same time for better utilization of system resourcesinitiated at the same time for better utilization of system resources.
Parameter EstimationParameter Estimation
• come up with a good estimation for the co e up t a good est at o o t e
parallelism level
– Network statistics
– Extra measurement
– Historical data 
• Might not reflect the best possible current 
settings (Dynamic Environment)
Optimization ServiceOptimization Service
Dynamic TuningDynamic Tuning
Average Throughput using Parallel Streamsg g p g
Experiments in LONI (www.loni.org) environment ‐ transfer file to QB from Linux m/c
Dynamic Setting of Parallel StreamsDynamic Setting of Parallel Streams
Experiments in LONI (www.loni.org) environment ‐ transfer file to QB from IBM m/c
Dynamic Setting of Parallel StreamsDynamic Setting of Parallel Streams
Experiments in LONI (www.loni.org) environment ‐ transfer file to QB from Linux m/c
Job AggregationJob Aggregation
• data placement jobs are combined and processed as a
single transfer job.
• Information about the aggregated job is stored in the job queue and
it is tied to a main job which is actually performing the transfer
operation such that it can be queried and reported separately.operation such that it can be queried and reported separately.
• Hence, aggregation is transparent to the user
W h t f i t i ll• We have seen vast performance improvement, especially
with small data files,
• simply by combining data placement jobs based on their
d ti ti ddsource or destination addresses.
– decreasing the amount of protocol usage
– reducing the number of independent network connections
Job AggregationJob Aggregation 
2000
2500
ec)
1000
1500
2000
time (se
single job at a time
2 parallel jobs
4 ll l j b
0
500
1000
total 
4 parallel jobs
8 parallel jobs
16 parallel jobs
32 parallel jobs
0 10 20 30 40
max aggregation count
32 parallel jobs
Experiments on LONI (Louisiana Optical Network Initiative) :
1024 transfer jobs from Ducky to Queenbee (rtt avg 5.129 ms) - 5MB
data file per job
AgendaAgenda
• Error Detection and Error ClassificationError Detection and Error Classification
• Data Transfer Operations
D i T i– Dynamic Tuning 
– Prediction Service
– Job Aggregation
• Data Migration using Stork
• Practical example in PetaShare Project
• Future Directions
PetaSharePetaShare
• Distributed Storage for Data 
Archive
• Global Namespace among 
distributed resources
• Client tools and interfaces
• Pcommands
• Petashell
• Petafs
• Windows Browser
• Web Portal
• Spans among seven LouisianaSpans among seven Louisiana 
research institutions
• Manages 300TB of disk storage, 
400TB of tape400TB of tape
Broader ImpactBroader Impact
Fast and Efficient Data Migration in PetaShareg
Future DirectionsFuture Directions
Stork: Central Scheduling Framework
f b l k
Stork: Central Scheduling Framework
• Performance bottleneck
– Hundreds of jobs submitted to a single batch 
h d l kscheduler, Stork
• Single point of failure
Future DirectionsFuture Directions
Distributed Data Scheduling
• Interaction between data scheduler
• Manage data activities with lightweight agents in each site
Distributed Data Scheduling
• Manage data activities with lightweight agents in each site
• Better parameter tuning and reordering of data placement 
jobs
– Job Delegation 
– peer‐to‐peer data movement 
– data and server striping 
– make use of replicas for multi‐source downloads
Questions?Questions?
Team:
Tevfik Kosar kosar@cct lsu eduTevfik Kosar kosar@cct.lsu.edu
Mehmet Balman balman@cct.lsu.edu
Dengpan Yin dyin@cct.lsu.edu
Jia "Jacob" Cheng jacobch@cct.lsu.edu
www.petashare.org www.cybertools.loni.org www.storkproject.orgwww.cct.lsu.edu

More Related Content

What's hot

Cenitpede: Analyzing Webcrawl
Cenitpede: Analyzing WebcrawlCenitpede: Analyzing Webcrawl
Cenitpede: Analyzing Webcrawl
Primal Pappachan
 
empirical analysis modeling of power dissipation control in internet data ce...
 empirical analysis modeling of power dissipation control in internet data ce... empirical analysis modeling of power dissipation control in internet data ce...
empirical analysis modeling of power dissipation control in internet data ce...
saadjamil31
 
Automating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with GlobusAutomating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with Globus
Globus
 

What's hot (20)

Globus status and publication plans
Globus status and publication plansGlobus status and publication plans
Globus status and publication plans
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
 
20090701 Climate Data Staging
20090701 Climate Data Staging20090701 Climate Data Staging
20090701 Climate Data Staging
 
SomeSlides
SomeSlidesSomeSlides
SomeSlides
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
Cenitpede: Analyzing Webcrawl
Cenitpede: Analyzing WebcrawlCenitpede: Analyzing Webcrawl
Cenitpede: Analyzing Webcrawl
 
Grid Computing July 2009
Grid Computing July 2009Grid Computing July 2009
Grid Computing July 2009
 
Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)
 
empirical analysis modeling of power dissipation control in internet data ce...
 empirical analysis modeling of power dissipation control in internet data ce... empirical analysis modeling of power dissipation control in internet data ce...
empirical analysis modeling of power dissipation control in internet data ce...
 
The DBpedia databus
The DBpedia databusThe DBpedia databus
The DBpedia databus
 
hadoop
hadoophadoop
hadoop
 
Automating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with GlobusAutomating Research Data Management at Scale with Globus
Automating Research Data Management at Scale with Globus
 
"Volunteer Computing With Boinc" por Diamantino Cruz e Ricardo Madeira
"Volunteer Computing With Boinc" por Diamantino Cruz e Ricardo Madeira"Volunteer Computing With Boinc" por Diamantino Cruz e Ricardo Madeira
"Volunteer Computing With Boinc" por Diamantino Cruz e Ricardo Madeira
 
contentDM
contentDMcontentDM
contentDM
 
Globus publication demo screenshots
Globus publication demo screenshotsGlobus publication demo screenshots
Globus publication demo screenshots
 
A Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials ScienceA Data Ecosystem to Support Machine Learning in Materials Science
A Data Ecosystem to Support Machine Learning in Materials Science
 
Understanding Big Data Platform from Patents
Understanding Big Data Platform from PatentsUnderstanding Big Data Platform from Patents
Understanding Big Data Platform from Patents
 
What's New in Globus - Internet2 TechEXtra
What's New in Globus - Internet2 TechEXtraWhat's New in Globus - Internet2 TechEXtra
What's New in Globus - Internet2 TechEXtra
 
Log everything! @DC13
Log everything! @DC13Log everything! @DC13
Log everything! @DC13
 

Viewers also liked

Aug17presentation.v2 2009-aug09-lblc sseminar
Aug17presentation.v2 2009-aug09-lblc sseminarAug17presentation.v2 2009-aug09-lblc sseminar
Aug17presentation.v2 2009-aug09-lblc sseminar
balmanme
 
Lblc sseminar jun09-2009-jun09-lblcsseminar
Lblc sseminar jun09-2009-jun09-lblcsseminarLblc sseminar jun09-2009-jun09-lblcsseminar
Lblc sseminar jun09-2009-jun09-lblcsseminar
balmanme
 
Presentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshopPresentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshop
balmanme
 
Presentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summerPresentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summer
balmanme
 

Viewers also liked (7)

Aug17presentation.v2 2009-aug09-lblc sseminar
Aug17presentation.v2 2009-aug09-lblc sseminarAug17presentation.v2 2009-aug09-lblc sseminar
Aug17presentation.v2 2009-aug09-lblc sseminar
 
Lblc sseminar jun09-2009-jun09-lblcsseminar
Lblc sseminar jun09-2009-jun09-lblcsseminarLblc sseminar jun09-2009-jun09-lblcsseminar
Lblc sseminar jun09-2009-jun09-lblcsseminar
 
Presentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshopPresentation southernstork 2009-nov-southernworkshop
Presentation southernstork 2009-nov-southernworkshop
 
Pdcs2010 balman-presentation
Pdcs2010 balman-presentationPdcs2010 balman-presentation
Pdcs2010 balman-presentation
 
Nersc dtn-perf-100121.test_results-nercmeeting-jan21-2010
Nersc dtn-perf-100121.test_results-nercmeeting-jan21-2010Nersc dtn-perf-100121.test_results-nercmeeting-jan21-2010
Nersc dtn-perf-100121.test_results-nercmeeting-jan21-2010
 
Presentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summerPresentation summerstudent 2009-aug09-lbl-summer
Presentation summerstudent 2009-aug09-lbl-summer
 
Sc10 nov16th-flex res-presentation
Sc10 nov16th-flex res-presentation Sc10 nov16th-flex res-presentation
Sc10 nov16th-flex res-presentation
 

Similar to Balman stork cw09

An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
Shiyong Lu
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013
Kirill Osipov
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
Ian Foster
 
Scalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehousesScalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehouses
Finalyear Projects
 

Similar to Balman stork cw09 (20)

An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
 
M.Phil Computer Science Cloud Computing Projects
M.Phil Computer Science Cloud Computing ProjectsM.Phil Computer Science Cloud Computing Projects
M.Phil Computer Science Cloud Computing Projects
 
M.Phil Computer Science Cloud Computing Projects
M.Phil Computer Science Cloud Computing ProjectsM.Phil Computer Science Cloud Computing Projects
M.Phil Computer Science Cloud Computing Projects
 
Subhabrata Deb Resume
Subhabrata Deb ResumeSubhabrata Deb Resume
Subhabrata Deb Resume
 
M.E Computer Science Cloud Computing Projects
M.E Computer Science Cloud Computing ProjectsM.E Computer Science Cloud Computing Projects
M.E Computer Science Cloud Computing Projects
 
IRJET- A Survey on Remote Data Possession Verification Protocol in Cloud Storage
IRJET- A Survey on Remote Data Possession Verification Protocol in Cloud StorageIRJET- A Survey on Remote Data Possession Verification Protocol in Cloud Storage
IRJET- A Survey on Remote Data Possession Verification Protocol in Cloud Storage
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013
 
My C.V
My C.VMy C.V
My C.V
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific Applications
 
Science as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryScience as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate Discovery
 
Data Mobility Exhibition
Data Mobility ExhibitionData Mobility Exhibition
Data Mobility Exhibition
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Estimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics PlatformEstimating the Total Costs of Your Cloud Analytics Platform
Estimating the Total Costs of Your Cloud Analytics Platform
 
Scalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehousesScalable scheduling of updates in streaming data warehouses
Scalable scheduling of updates in streaming data warehouses
 
REAL TIME PROJECTS IEEE BASED PROJECTS EMBEDDED SYSTEMS PAPER PUBLICATIONS M...
REAL TIME PROJECTS  IEEE BASED PROJECTS EMBEDDED SYSTEMS PAPER PUBLICATIONS M...REAL TIME PROJECTS  IEEE BASED PROJECTS EMBEDDED SYSTEMS PAPER PUBLICATIONS M...
REAL TIME PROJECTS IEEE BASED PROJECTS EMBEDDED SYSTEMS PAPER PUBLICATIONS M...
 
Venkatachandu rajana
Venkatachandu rajanaVenkatachandu rajana
Venkatachandu rajana
 
Cloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive ApplicationsCloud-Native Patterns for Data-Intensive Applications
Cloud-Native Patterns for Data-Intensive Applications
 
IRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop FrameworkIRJET- Big Data Processes and Analysis using Hadoop Framework
IRJET- Big Data Processes and Analysis using Hadoop Framework
 
Clearing Airflow Obstructions
Clearing Airflow ObstructionsClearing Airflow Obstructions
Clearing Airflow Obstructions
 
Real time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.jsReal time data viz with Spark Streaming, Kafka and D3.js
Real time data viz with Spark Streaming, Kafka and D3.js
 

More from balmanme

Network-aware Data Management for Large Scale Distributed Applications, IBM R...
Network-aware Data Management for Large Scale Distributed Applications, IBM R...Network-aware Data Management for Large Scale Distributed Applications, IBM R...
Network-aware Data Management for Large Scale Distributed Applications, IBM R...
balmanme
 
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...
balmanme
 
Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1
Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1
Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1
balmanme
 
A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...
A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...
A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...
balmanme
 
Available technologies: algorithm for flexible bandwidth reservations for dat...
Available technologies: algorithm for flexible bandwidth reservations for dat...Available technologies: algorithm for flexible bandwidth reservations for dat...
Available technologies: algorithm for flexible bandwidth reservations for dat...
balmanme
 
Berkeley lab team develops flexible reservation algorithm for advance network...
Berkeley lab team develops flexible reservation algorithm for advance network...Berkeley lab team develops flexible reservation algorithm for advance network...
Berkeley lab team develops flexible reservation algorithm for advance network...
balmanme
 
Cybertools stork-2009-cybertools allhandmeeting-poster
Cybertools stork-2009-cybertools allhandmeeting-posterCybertools stork-2009-cybertools allhandmeeting-poster
Cybertools stork-2009-cybertools allhandmeeting-poster
balmanme
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
balmanme
 
Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011
balmanme
 
Welcome ndm11
Welcome ndm11Welcome ndm11
Welcome ndm11
balmanme
 
2011 agu-town hall-100g
2011 agu-town hall-100g2011 agu-town hall-100g
2011 agu-town hall-100g
balmanme
 
Rdma presentation-kisti-v2
Rdma presentation-kisti-v2Rdma presentation-kisti-v2
Rdma presentation-kisti-v2
balmanme
 
APM project meeting - June 13, 2012 - LBNL, Berkeley, CA
APM project meeting - June 13, 2012 - LBNL, Berkeley, CAAPM project meeting - June 13, 2012 - LBNL, Berkeley, CA
APM project meeting - June 13, 2012 - LBNL, Berkeley, CA
balmanme
 
HPDC 2012 presentation - June 19, 2012 - Delft, The Netherlands
HPDC 2012 presentation - June 19, 2012 -  Delft, The NetherlandsHPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands
HPDC 2012 presentation - June 19, 2012 - Delft, The Netherlands
balmanme
 

More from balmanme (20)

Network-aware Data Management for Large Scale Distributed Applications, IBM R...
Network-aware Data Management for Large Scale Distributed Applications, IBM R...Network-aware Data Management for Large Scale Distributed Applications, IBM R...
Network-aware Data Management for Large Scale Distributed Applications, IBM R...
 
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...Network-aware Data Management for High Throughput Flows   Akamai, Cambridge, ...
Network-aware Data Management for High Throughput Flows Akamai, Cambridge, ...
 
Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1
Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1
Hpcwire100gnetworktosupportbigscience 130725203822-phpapp01-1
 
Experiences with High-bandwidth Networks
Experiences with High-bandwidth NetworksExperiences with High-bandwidth Networks
Experiences with High-bandwidth Networks
 
A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...
A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...
A 100 gigabit highway for science: researchers take a 'test drive' on ani tes...
 
Available technologies: algorithm for flexible bandwidth reservations for dat...
Available technologies: algorithm for flexible bandwidth reservations for dat...Available technologies: algorithm for flexible bandwidth reservations for dat...
Available technologies: algorithm for flexible bandwidth reservations for dat...
 
Berkeley lab team develops flexible reservation algorithm for advance network...
Berkeley lab team develops flexible reservation algorithm for advance network...Berkeley lab team develops flexible reservation algorithm for advance network...
Berkeley lab team develops flexible reservation algorithm for advance network...
 
Dynamic adaptation balman
Dynamic adaptation balmanDynamic adaptation balman
Dynamic adaptation balman
 
Cybertools stork-2009-cybertools allhandmeeting-poster
Cybertools stork-2009-cybertools allhandmeeting-posterCybertools stork-2009-cybertools allhandmeeting-poster
Cybertools stork-2009-cybertools allhandmeeting-poster
 
Balman dissertation Copyright @ 2010 Mehmet Balman
Balman dissertation Copyright @ 2010 Mehmet BalmanBalman dissertation Copyright @ 2010 Mehmet Balman
Balman dissertation Copyright @ 2010 Mehmet Balman
 
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation NetworksAnalyzing Data Movements and Identifying Techniques for Next-generation Networks
Analyzing Data Movements and Identifying Techniques for Next-generation Networks
 
MemzNet: Memory-Mapped Zero-copy Network Channel -- Streaming exascala data o...
MemzNet: Memory-Mapped Zero-copy Network Channel -- Streaming exascala data o...MemzNet: Memory-Mapped Zero-copy Network Channel -- Streaming exascala data o...
MemzNet: Memory-Mapped Zero-copy Network Channel -- Streaming exascala data o...
 
Opening ndm2012 sc12
Opening ndm2012 sc12Opening ndm2012 sc12
Opening ndm2012 sc12
 
Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011Balman climate-c sc-ads-2011
Balman climate-c sc-ads-2011
 
Welcome ndm11
Welcome ndm11Welcome ndm11
Welcome ndm11
 
2011 agu-town hall-100g
2011 agu-town hall-100g2011 agu-town hall-100g
2011 agu-town hall-100g
 
Rdma presentation-kisti-v2
Rdma presentation-kisti-v2Rdma presentation-kisti-v2
Rdma presentation-kisti-v2
 
Streaming exa-scale data over 100Gbps networks
Streaming exa-scale data over 100Gbps networksStreaming exa-scale data over 100Gbps networks
Streaming exa-scale data over 100Gbps networks
 
APM project meeting - June 13, 2012 - LBNL, Berkeley, CA
APM project meeting - June 13, 2012 - LBNL, Berkeley, CAAPM project meeting - June 13, 2012 - LBNL, Berkeley, CA
APM project meeting - June 13, 2012 - LBNL, Berkeley, CA
 
HPDC 2012 presentation - June 19, 2012 - Delft, The Netherlands
HPDC 2012 presentation - June 19, 2012 -  Delft, The NetherlandsHPDC 2012 presentation - June 19, 2012 -  Delft, The Netherlands
HPDC 2012 presentation - June 19, 2012 - Delft, The Netherlands
 

Recently uploaded

Recently uploaded (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Balman stork cw09

  • 2. Scheduling Data Placement JobsScheduling Data Placement Jobs • Data Placement ActivitiesData Placement Activities • Modular Architecture – Data Transfer ModulesData Transfer Modules  for specific protocols/services • Throttle maximum transfer operations running • Keep a log of data placement activities• Keep a log of data placement activities • Add fault tolerance to data transfers
  • 3. Job SubmissionJob Submission [ dest_url = "gsiftp://eric1.loni.org/scratch/user/"; arguments = ‐p 4 dbg ‐vb"; src_url = "file:///home/user/test/"; dap_type = "transfer"; verify_checksum = true; verify_filesize = true; set_permission = "755" ; i trecursive_copy = true; network_check = true; checkpoint_transfer = true; output = "userout";output =  user.out ; err = "user.err"; log = "userjob.log"; ]]
  • 4. AgendaAgenda • Error Detection and Error ClassificationError Detection and Error Classification • Data Transfer Operations D i T i– Dynamic Tuning  – Prediction Service – Job Aggregation • Data Migration using Stork • Practical example in PetaShare Project • Future Directions
  • 5. Failure‐AwarenessFailure Awareness • Dynamic Environment:  • data transfers are prune to frequent failures • what went wrong during data transfer? • No access to the remote resourcesNo access to the remote resources • Messages get lost due to system malfunction • Instead of waiting failure to happen• Instead of waiting failure to happen • Detect possible failures and malfunctioning services • Search for another data server • Alternate data transfer service• Alternate data transfer service • Classify erroneous cases to make better decisions
  • 6. Error DetectionError Detection • Use Network Exploration Techniques – Check availability of the remote service – Resolve host and determine connectivity failures – Detect available data transfers service – should be Fast and Efficient not to bother system/network resources • Error while transfer is in progress? – Error_TRANSFER • Retry or not? • When to re‐initiate the transfer • Use alternate options?• Use alternate options?
  • 7. Error ClassificationError Classification •Recover from Failure •Retry failed operation •Postpone scheduling of a  failed operationsfailed operations •Early Error Detection I i i T f h•Initiate Transfer when  erroneous condition  recovered •Or use Alternate options • Data Transfer Protocol not always return appropriate error codes • Using error messages generated by the data transfer protocol p • A better logging facility and classification
  • 9. Failure‐Aware SchedulingFailure Aware Scheduling Scoop data  ‐ Hurricane Gustov Simulationsp Hundreds of files (250 data transfer operation) Small (100MB) and large files (1G, 2G
  • 10. New Transfer ModulesNew Transfer Modules • Verify the successful completion of the operation by y p p y controlling checksum and file size.  f G idFTP S k f d l f• for GridFTP, Stork transfer module can recover from a  failed operation by restarting from the last transmitted  file. In case of a retry from a failure, scheduler informs  the transfer module to recover and restart the transfer  using the information from a rescue file created by the  checkpoint‐enabled transfer module.checkpoint enabled transfer module. • Replacing Globus RFT (Reliable File Transfer)
  • 11. AgendaAgenda • Error Detection and Error ClassificationError Detection and Error Classification • Data Transfer Operations D i T i– Dynamic Tuning  – Prediction Service – Job Aggregation • Data Migration using Stork • Practical example in PetaShare Project • Future Directions
  • 12. Tuning Data TransfersTuning Data Transfers • Latency Wall – Buffer Size Optimization – Parallel TCP Streams – Concurrent  Transfers • User level end‐to‐end Tuning P ll liParallelism • (1) the number of parallel data streams connected to a data transfer  service for increasing the utilization of network bandwidthservice for increasing the utilization of network bandwidth • (2) the number of concurrent data transfer operations that are  initiated at the same time for better utilization of system resourcesinitiated at the same time for better utilization of system resources.
  • 13. Parameter EstimationParameter Estimation • come up with a good estimation for the co e up t a good est at o o t e parallelism level – Network statistics – Extra measurement – Historical data  • Might not reflect the best possible current  settings (Dynamic Environment)
  • 16. Average Throughput using Parallel Streamsg g p g Experiments in LONI (www.loni.org) environment ‐ transfer file to QB from Linux m/c
  • 17. Dynamic Setting of Parallel StreamsDynamic Setting of Parallel Streams Experiments in LONI (www.loni.org) environment ‐ transfer file to QB from IBM m/c
  • 18. Dynamic Setting of Parallel StreamsDynamic Setting of Parallel Streams Experiments in LONI (www.loni.org) environment ‐ transfer file to QB from Linux m/c
  • 19. Job AggregationJob Aggregation • data placement jobs are combined and processed as a single transfer job. • Information about the aggregated job is stored in the job queue and it is tied to a main job which is actually performing the transfer operation such that it can be queried and reported separately.operation such that it can be queried and reported separately. • Hence, aggregation is transparent to the user W h t f i t i ll• We have seen vast performance improvement, especially with small data files, • simply by combining data placement jobs based on their d ti ti ddsource or destination addresses. – decreasing the amount of protocol usage – reducing the number of independent network connections
  • 20. Job AggregationJob Aggregation  2000 2500 ec) 1000 1500 2000 time (se single job at a time 2 parallel jobs 4 ll l j b 0 500 1000 total  4 parallel jobs 8 parallel jobs 16 parallel jobs 32 parallel jobs 0 10 20 30 40 max aggregation count 32 parallel jobs Experiments on LONI (Louisiana Optical Network Initiative) : 1024 transfer jobs from Ducky to Queenbee (rtt avg 5.129 ms) - 5MB data file per job
  • 21. AgendaAgenda • Error Detection and Error ClassificationError Detection and Error Classification • Data Transfer Operations D i T i– Dynamic Tuning  – Prediction Service – Job Aggregation • Data Migration using Stork • Practical example in PetaShare Project • Future Directions
  • 22. PetaSharePetaShare • Distributed Storage for Data  Archive • Global Namespace among  distributed resources • Client tools and interfaces • Pcommands • Petashell • Petafs • Windows Browser • Web Portal • Spans among seven LouisianaSpans among seven Louisiana  research institutions • Manages 300TB of disk storage,  400TB of tape400TB of tape
  • 24. Future DirectionsFuture Directions Stork: Central Scheduling Framework f b l k Stork: Central Scheduling Framework • Performance bottleneck – Hundreds of jobs submitted to a single batch  h d l kscheduler, Stork • Single point of failure
  • 25. Future DirectionsFuture Directions Distributed Data Scheduling • Interaction between data scheduler • Manage data activities with lightweight agents in each site Distributed Data Scheduling • Manage data activities with lightweight agents in each site • Better parameter tuning and reordering of data placement  jobs – Job Delegation  – peer‐to‐peer data movement  – data and server striping  – make use of replicas for multi‐source downloads
  • 26. Questions?Questions? Team: Tevfik Kosar kosar@cct lsu eduTevfik Kosar kosar@cct.lsu.edu Mehmet Balman balman@cct.lsu.edu Dengpan Yin dyin@cct.lsu.edu Jia "Jacob" Cheng jacobch@cct.lsu.edu www.petashare.org www.cybertools.loni.org www.storkproject.orgwww.cct.lsu.edu