SCIDAC CENTER FOR ENABLING DISTRIBUTED PETASCALE SCIENCE
STATUS REPORT FOR THE PERIOD
MAY 01, 2007 THROUGH NOVEMBER 30, 2008
The Center for Enabling Distributed Petascale Science Team:
Andrew Baranovski¹, Joshua Boverhof², Ann Chervenak³, Ian Foster⁴,
Dan Gunter², Kate Keahey⁴, Carl Kesselman³, Miron Livny⁵,
Tim Freeman⁴, Robert Schuler³, Ravi K. Madduri⁴,
Rajkumar Kettimuthu⁴, John Bresnahan⁴, Michael Link⁴, Nick LeRoy⁵
¹ Fermi National Accelerator Laboratory
² Lawrence Berkeley National Laboratory
³ University of Southern California, Information Sciences Institute
⁴ Argonne National Laboratory
⁵ University of Wisconsin, Madison
TABLE OF CONTENTS
1. Executive Summary
2.1. Data Area Highlights
2.2. Scalable Services Area Highlights
2.3. Troubleshooting Area Highlights
3. Data Area Progress
3.1. High-performance Transport: GridFTP
3.2. Data Replication and Placement
3.3. Resource Management: LotMan and Lease Manager
3.4. dCache Improvements
4. Services Area Progress
4.1. Nimbus
4.2. Service Construction Tools
5. Troubleshooting Area Progress
5.1. Collection and Archive Service Improvements
5.2. Log Data Analysis
5.3. Integration of log data with MDS4
6.1. National Energy Research Supercomputing Center (NERSC)
6.2. Argonne Leadership Computing Facility (ALCF)
6.3. Earth System Grid (ESG) Center for Enabling Technology
6.4. Pegasus group, USC/ISI
6.5. Scientific Data Management group, LBNL
6.6. Globus Team, ANL
6.7. Solenoidal Tracker At RHIC (STAR) experiment
6.8. Tech-X Corporation
6.9. Open Science Grid (OSG)
6.10. Advanced Photon Source (APS)
6.11. Nuclear Physics Groups
7. Presentations and Publications
1. EXECUTIVE SUMMARY
The SciDAC-funded Center for Enabling Distributed Petascale Science (CEDPS) was
established to address technical challenges that arise due to the frequent geographic
distribution of data producers (in particular, supercomputers and scientific instruments)
and data consumers (people and computers) within the DOE laboratory system. Its goal
is to produce technical innovations that meet DOE end-user needs for (a) rapid and
dependable placement of large quantities of data within a distributed high-performance
environment, and (b) the convenient construction of scalable science services that
provide for the reliable and high-performance processing of computation and data
analysis requests from many remote clients. The Center is also addressing (c) the
important problem of troubleshooting these and other related ultra-high-performance
distributed activities from the perspective of both performance and functionality.
This report summarizes work carried out by the CEDPS-CET during the period May,
2007 through November, 2008. It includes discussion of highlights, overall progress,
period goals, collaborations, papers, and presentations. The CEDPS-CET team brings
together researchers and scientists with diverse domain knowledge, whose home
institutions include three DOE laboratories and two universities: Argonne National
Laboratory (ANL), Fermi National Accelerator Laboratory (FNAL), Lawrence Berkeley
National Laboratory (LBNL), University of Wisconsin (Wisc) and University of Southern
California, Information Sciences Institute (USC/ISI). The CEDPS-CET PI is Ian Foster,
ANL; PI area leads are Ann Chervenak (USC/ISI), Ravi Madduri (ANL), and Dan Gunter
(LBNL). All work is accomplished in close collaboration with the project’s stakeholders
and domain researchers and scientists. To learn more about our project, please visit the
CEDPS website (http://cedps-scidac.org).
The CEDPS-CET team is working in three sub-areas: Data (CEDPS-Data), Scalable
Services (CEDPS-Services), and Troubleshooting (CEDPS-Troubleshooting). We list
highlights for each area in this section, and then provide details in the sections that
follow. While for convenience we present each area separately, there are numerous
cross-connections among the different activities, as we make clear in the text that follows.
2.1. DATA AREA HIGHLIGHTS
Data work in the CEDPS project takes place in several collaborating groups, including
the GridFTP team at ANL, the data replication and placement team at ISI, the storage
allocation and placement team at UW, and the dCache team at Fermi.
The work of this team is focused on enabling reliable, high-performance data placement
within high-end distributed systems. The word placement here is used to denote policy-
driven data movement—for example, to ensure that data is moved from an Advanced
Photon Source beamline to an end-user laboratory in a timely manner, or to ensure that
data produced by a supercomputer simulation is replicated to collaborator sites.
Challenges addressed in this work include efficient end-to-end transport over high-speed
networks; the management of scarce resources, such as space and bandwidth;
detection and recovery from failures; and high-level specification of user policies. A
workhorse for much of this work is the GridFTP data movement system, which provides
the basic data transport capabilities.
Highlights for this year include the following:
• An optimization for Lots of Small Files (LOSF) transfers, allowing multiple files in
transit at the same time. This optimization can improve performance by an order
of magnitude or more in some situations. The Advanced Photon Source has
used the concurrency optimization in conjunction with pipelining to transfer
terabytes of data (partitioned into lots of small files) to a user in Australia at a rate
30 times faster than standard FTP.
• Capabilities to dynamically improve the scalability of GridFTP servers. Additional
data mover nodes can be added to the GridFTP server at run time to handle
more transfer requests. This work addressed problems reported by DOE users
on Open Science Grid and a range of non-DOE users on TeraGrid.
• Deployment of GridFTP on the HPSS storage system at the Argonne Leadership
Computing Facility (ALCF), enabling high-performance remote access to and
from ALCF storage.
• The design of tools for supporting data replication capabilities, with the goal of
supporting the needs for data mirroring of application communities including
Earth System Grid, the STAR physics experiment, and the Spallation Neutron Source.
• Continued support for the Replica Location Service, which is used by a variety of
scientific collaborations, including Earth System Grid and the Nordugrid ATLAS
high energy physics application.
• Implementation of LotMan, a lightweight storage allocation software that has a
plug-in interface to GridFTP. (This work is an essential step towards enabling
space management in distributed systems.) The LotMan software was integrated
into the Virtual Data Toolkit (VDT), providing convenient access for Open
Science Grid and other users.
• Hardening and productization of data placement code that uses two components,
the Stork data placement service and a newly-developed Lease Manager
component, to provide dynamic match making for data placement jobs.
• Modifications to the dCache system to provide robust, end-to-end data integrity
verification. This work has proven beneficial to the CMS high energy physics
application, which uses the capability to verify checksums on approximately 10
Terabytes per day of data downloaded from CERN in Geneva, Switzerland, to
Fermi Lab in Illinois.
2.2. SCALABLE SERVICES AREA HIGHLIGHTS
Work in the scalable services area is motivated by the fact that moving data to
computation is not always feasible—and may be expected to be far less feasible in the
future, as data volumes continue to grow. Thus, we seek methods for enabling remote
access to code, and for moving computation easily to remote computers. The two major
initiatives are the grid Resource Allocation and Virtualization Environment (gRAVI) tools
for wrapping science applications as services, and the Nimbus infrastructure as a
service (IaaS: aka “Cloud”) software.
A major accomplishment for the CEDPS-Services area is completion of full
implementation of rapid creation and deployment of application services using gRAVI
and several releases of the Nimbus Toolkit providing tools that enable scientists to easily
leverage cloud computing capabilities. We were also able to successfully integrate
gRAVI with Nimbus, thus providing a full spectrum of scalable services functionality to
our user communities.
A major application success was enabling the first production run of the nuclear physics
STAR applications on Amazon’s EC2 cloud computing infrastructure, in September
2007. The deployment of the STAR cluster on EC2 was orchestrated by the Nimbus
Context Broker service that enables automatic and secure deployment of “turnkey”
virtual clusters, bridging the gap between functionality provided by EC2 and the “end
product” that scientific communities need to deploy their applications. Scientific
production runs require careful and involved environment preparation and reliability: this
run was a significant step towards convincing the broad STAR community that real
science can be done using cloud computing.
The gRAVI tools have been adopted enthusiastically by DOE groups at the Advanced
Photon Source at Argonne National Laboratory and the NERSC group at Lawrence
Berkeley Lab, enabling rapid application virtualization and provisioning of applications.
The tools we developed are also being used and adopted in communities such as the
NIH-sponsored cancer Biomedical Informatics Grid (caBIG) project, the Cardiovascular
Grid Research Project, and the OMII-UK team.
2.3. TROUBLESHOOTING AREA HIGHLIGHTS
The CEDPS-Troubleshooting area finished implementation of the prototype tools to
parse existing logs and load them into a SQL database, collectively called the "log
pipeline". The log parser framework was populated with over a dozen new parsers
covering software components including PBS, SGE, Condor, Globus, BeStMan, and HSI.
Documentation and internal logging were improved dramatically across the board.
Together, these improvements prepared the CEDPS troubleshooting tools to be
deployed and used on OSG.
Collaborations with a variety of groups have been very fruitful. The troubleshooting tools
are now in active use by at least four groups:
• The CEDPS log pipeline was deployed on the NERSC Parallel Distributed
Systems Facility (PDSF). Analysis of the data from the STAR BeStMan data
transfers has revealed unexpected network performance, which has triggered
upgrades and configuration changes on PDSF.
• The NERSC Project Accounts team is using CEDPS log parsing to normalize the
logs and log database, to perform traceability analysis.
• The Pegasus team at USC/ISI uses the CEDPS pipeline for large computational
seismology workflows. The CEDPS tools were able to efficiently analyze
execution logs of large earthquake science workflows.
• The Tech-X STAR job submission portal uses the CEDPS log database to drill
down to site-specific information for a portal job. A prototype of this functionality
was demonstrated at SC08.
Enhancements to the NetLogger log summarization library developed as part of CEDPS
are used in GridFTP to implement a "bottleneck detection" algorithm that answers the
(increasingly important) question of whether the disk or the network is the data transfer
bottleneck.
As part of our collaboration with ESG, we released a dramatically improved version of
the administrative interface to the MDS Trigger Service. This work was done in response
to feedback from ESG, who requested this capability to support a portal that they plan to
develop, as well as to provide a simpler command-line interface.
3. DATA AREA PROGRESS
This year, CEDPS-Data had accomplishments in several important areas.
3.1. HIGH-PERFORMANCE TRANSPORT: GRIDFTP
Optimization of many small file transfers. GridFTP has long been used to move large
files rapidly over wide area networks, with methods such as striping, parallelism, and
alternative protocols used to achieve high performance. Unfortunately, scientific data is
often partitioned into many small files. For example, microtomographic data produced at
the Advanced Photon Source is typically organized as a large number of slice files. In
these circumstances, GridFTP suffered from lower transfer rates due to synchronization
costs. The GridFTP team developed a pipelining solution last year to address this issue.
Though pipelining improved performance significantly, there was room for further
optimization. This year, the team developed an additional optimization for Lots of Small
Files (LOSF) transfers: concurrency.
Concurrency refers to having multiple control channel connections between the client
and the server, and thus having multiple files in transit at the same time. This is
equivalent to starting up n different clients for n different files, and having them all
running at the same time. The Advanced Photon Source has used the concurrency
optimization in conjunction with pipelining to transfer terabytes of data (partitioned into lots
of small files) to a user in Australia at a rate 30 times faster than standard FTP. In
addition, the LIGO project has used these optimizations to transfer large volumes of data
on a non-LHC type of network from Milwaukee to Germany at a sustained rate of 80
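The "n clients" picture above can be sketched in a few lines (Python; the `transfer` function is a hypothetical stand-in for a GridFTP client, not real Globus code, and the `concurrency` parameter plays the role that a concurrency option plays on the command line):

```python
from concurrent.futures import ThreadPoolExecutor

def transfer(filename):
    """Stand-in for one GridFTP client moving a single file.

    In a real deployment each call would open its own control channel
    to the server; here it is a no-op that just reports completion.
    """
    return (filename, "done")

def transfer_concurrently(filenames, concurrency=4):
    # Up to 'concurrency' files are in transit at the same time,
    # equivalent to starting 'concurrency' independent clients.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(transfer, filenames))

results = transfer_concurrently([f"slice_{i}.dat" for i in range(8)])
```

For many small files, the per-file synchronization cost is paid concurrently rather than serially, which is where the order-of-magnitude improvement comes from.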
Instrumentation with NetLogger and automated bottleneck detection. The team also
added a capability to instrument GridFTP with the Netlogger performance measurement
tools. This capability has been helpful to DOE ESNet users and TeraGrid users. The
GridFTP server now logs messages that can be postprocessed using Netlogger tools
and collected using syslog-ng logging records. Fine-grained disk and net IO
characteristics can be visualized and analyzed. The commonly used GridFTP client
called globus-url-copy takes advantage of this feature by telling its user which of the
following is the bottleneck for a transfer: Disk Read, Network Write, Network Read, or
Disk Write.
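The final classification step can be sketched as follows (Python; this is an illustration only, not the actual NetLogger algorithm, and the category names and timing values are invented):

```python
def classify_bottleneck(io_times):
    """Return the I/O category that dominated the transfer.

    io_times maps a category name to total seconds spent in it.
    The real NetLogger-based detector works from fine-grained
    event logs; this shows only the final comparison.
    """
    return max(io_times, key=io_times.get)

# Hypothetical per-category totals for one transfer.
sample = {
    "Disk Read": 12.5,
    "Disk Write": 3.1,
    "Network Read": 2.0,
    "Network Write": 41.7,
}
bottleneck = classify_bottleneck(sample)
```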
Dynamic scaling of GridFTP servers. Open Science Grid (OSG) participants reported
that their single biggest problem with running GridFTP servers is that the servers can
overwhelm the transfer host and/or the underlying storage system. TeraGrid reported
that their major problem with running striped GridFTP servers is the disruptions caused
due to the failure of one of the data mover nodes. In response to these problem reports,
the GridFTP team developed capabilities to dynamically improve the scalability of
GridFTP servers. Additional data mover nodes can be added to the GridFTP server at
run time to handle more transfer requests. The GridFTP group also improved the
resiliency of striped GridFTP servers. The GridFTP server now continues to operate
after any data mover node failure as long as at least one of the data mover nodes is
alive. The GridFTP server with these new features was released as part of OSG’s VDT.
Deployment of GridFTP at a leadership-class computing center. The GridFTP team has
worked closely with the Argonne Leadership Computing Facility on deploying GridFTP
on their HPSS storage system. A number of issues in IBM’s Parallel I/O interface for
HPSS and in GridFTP’s HPSS DSI have been uncovered, and most of them have been fixed.
Advertisement of GridFTP server properties. To further address the issue of
overwhelming the transfer hosts and to augment the overall quality of service with data
transfers for the DOE applications, the GridFTP team prototyped a GridFTP information
provider service that publishes information such as server load, the number of open
connections, and the maximum number of connections allowed. Higher-level services
can use this information to improve QoS for data transfers.
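The kind of record such a provider might publish can be sketched as follows (Python; the field names are illustrative assumptions, not the actual provider's schema):

```python
import json

def server_info(load, open_connections, max_connections):
    """Build a status record for one GridFTP server.

    A higher-level data movement service can use 'available' to
    decide whether to direct a new transfer at this server or to
    pick a less loaded one. Field names are invented for this sketch.
    """
    return {
        "load": load,
        "open_connections": open_connections,
        "max_connections": max_connections,
        "available": open_connections < max_connections,
    }

record = server_info(load=0.75, open_connections=48, max_connections=64)
published = json.dumps(record)  # e.g., pushed to an index service
```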
Year 2 Milestones
• MS2.2. Work with VDT team to include MOPS 1.0 in release
o Status: Complete
• MS2.5. Prototype a non-striped connection management capability using NeST
o Status: Complete. GridFTP is capable of passing the attributes of a
connection to NeST.
• MS2.8. Document use cases and performance of ways to manage transfers
o Status: Complete. Preliminary results are available at
• MS2.9. Design and prototype implementation of common interface to storage
o Status: Partially complete. We have created a design document, but the
prototype implementation is not complete.
• MS2.10. Prototype methods of incorporating troubleshooting into MOPS
o Status: Complete. The integrated software is available as part of Globus
4.2. More information on this is available at:
• MS2.12. Deliver a MOPS 2.0 release that includes additional optimizations
o Status: Complete. This functionality has been released as enhancements
to GridFTP as part of Globus 4.2.x.
Our plans for the coming year include the following:
• Augment the prototype information provider service and create a production
implementation, suitable for use by higher-level data movement services.
• Prototype a GridFTP control channel brokering service that controls access to a
GridFTP server. Such a service is needed to provide a better-than-best-effort
data movement service.
• Investigate ways to integrate bandwidth reservation services with GridFTP and
higher-level data movement services such as the Reliable File Transfer (RFT) service.
• Continue to work with APS to deploy new capabilities and obtain feedback.
• Plan to work with the Spallation Neutron Source (SNS) to identify how our data
movement tools can help SNS scientists and users.
• Work with the services team on prototyping a simple storage cloud using
GridFTP as an underlying data transfer mechanism.
3.2. DATA REPLICATION AND PLACEMENT
During the past year, the data replication and placement group has focused on three main
areas. In the area of data replication and placement services, the team did a significant
re-evaluation of the functionality that we should provide based on our interactions with
DOE users. We reoriented our work toward providing simple replication utilities rather
than higher-level data placement services, because the former better met the needs and
requirements of DOE application communities. In particular, we are focused on
supporting the needs of the Earth System Grid, which currently uses the Replica
Location Service to manage location information for data sets.
ESG is working on the next generation of replica management functionality for their
application domain, and the CEDPS group is participating in the design of replication
services that will be used by ESG. In addition, we had extensive discussions with the
STAR and SNS applications. All three application communities (ESG, STAR, and SNS)
identified a need for data mirroring functionality, and their requirements will drive the
ongoing implementation of data mirroring tools. Our design and development work is
focused on providing this functionality.
In addition, we continued to do research on data placement policies in two areas. First,
we looked at the requirements of DOE virtual organizations, such as the high-energy
physics community, to disseminate data according to policies at the Virtual Organization
level, such as the tiered distribution of data produced by the LHC at CERN. We looked
at whether we could use a policy engine to enforce similar policies and had some initial
success, as reflected in a poster at the SC2008 conference.
A second area of increasing interest to DOE communities is the use of workflow engines
to manage complex scientific workflows. We have done research to characterize realistic
scientific workflows (resulting in a paper at the Workshop on Workflows in Support of
Large-Scale Science (WORKS08)). Based on those scientific workflows, we have
simulated a variety of data placement strategies that could work in conjunction with a
workflow management system to improve the efficiency of execution of scientific
workflows. A paper on this work was recently submitted to the DADC 2009 Workshop.
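As a toy illustration of the kind of trade-off such simulations quantify (the numbers, file names, and strategies here are invented for this sketch, not taken from the paper): prestaging shared input files once versus fetching a copy for every task that needs them.

```python
def total_transfer(tasks, file_sizes, prestage):
    """Total bytes moved to the execution site for a toy workflow.

    tasks: list of per-task input file name lists.
    prestage=True  -> each distinct file is moved exactly once.
    prestage=False -> every task fetches its own copy on demand.
    """
    if prestage:
        needed = {f for inputs in tasks for f in inputs}
        return sum(file_sizes[f] for f in needed)
    return sum(file_sizes[f] for inputs in tasks for f in inputs)

sizes = {"mesh.dat": 100, "params.dat": 1}     # sizes in MB (invented)
tasks = [["mesh.dat", "params.dat"]] * 10      # ten tasks sharing inputs

ondemand = total_transfer(tasks, sizes, prestage=False)
prestaged = total_transfer(tasks, sizes, prestage=True)
```

Here prestaging moves 101 MB instead of 1010 MB; a real simulation would also account for storage limits, task ordering, and network contention.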
Year 2 Milestones
• MS2.4 Design and prototype of reliable distribution service
o Status: Complete. The BuTrS service is available
• MS2.13. Release version 2.0 of the DPS with additional functionality
o Based on our discussions with DOE user communities, we re-oriented our
design and development efforts to providing a simple data mirroring
capability. We produced a design document for the initial phase of this
work, and the implementation is in progress. An initial prototype of this
capability will be available in early 2009 (first quarter).
• MS2.14. Work with troubleshooting to include additional data in logs for
o This work is pending, since we are still implementing the new data
mirroring functionality. We are committed to
incorporating CEDPS troubleshooting interfaces into future data services.
Plans for 2009 include working closely with the Earth System Grid, SNS and other
application communities to understand their requirements in the areas of data replication
and mirroring and providing functionality that allows these groups to manage their data
better. Initially, we will provide a very simple data replication capability based on existing
GridFTP and SRM functionality. Over the coming year, we plan to add features to the
data replication and mirroring capabilities to provide richer functionality to DOE science
communities.
During the past year, we have integrated LotMan, a lightweight storage allocation
software that has a plug-in interface to GridFTP, into the Virtual Data Toolkit (VDT).
Through the VDT, this functionality was made available to many groups, including the
Open Science Grid (OSG). Some basic testing of LotMan has been added to confirm its
correct operation.
The other primary effort of this team has been the hardening and productizing of the
data placement code initially developed for the Supercomputing 2005 (SC05) conference.
This code uses two components, the Stork data placement service and a newly
developed Lease Manager, to provide dynamic matchmaking for data
placement jobs.
placement jobs. Significant effort has been put into converting the Lease Manager from
a proof-of-concept prototype into a mature component. The Lease Manager has been
integrated into Condor, and it is built and tested regularly as part of Condor's nightly
build and test.
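The matchmaking idea can be sketched as follows (Python; a drastic simplification of the ClassAd-style matching used in the Condor ecosystem, with invented attribute names):

```python
def match(job, leases):
    """Find the first storage lease that satisfies a placement job.

    A lease is usable if it is still valid and has enough free space.
    Attribute names here are illustrative only; the real Lease Manager
    matches richer job and resource descriptions.
    """
    for lease in leases:
        if lease["valid"] and lease["free_bytes"] >= job["size_bytes"]:
            return lease["id"]
    return None

leases = [
    {"id": "lease-a", "free_bytes": 10 * 2**30, "valid": False},
    {"id": "lease-b", "free_bytes": 50 * 2**30, "valid": True},
]
job = {"size_bytes": 20 * 2**30}
chosen = match(job, leases)  # lease-a is expired, so lease-b is chosen
```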
Stork development has continued in parallel in two different groups. In addition to the
CEDPS team, a group at Louisiana State University led by Tevfik Kosar, the former UW
student responsible for Stork's initial development, has also continued to add
functionality to Stork. Kosar’s group recently released Stork 1.0. At the same time, the
UW team has been working to harden the Stork / Lease Manager interface, and the
dynamic matching of data placement jobs. Several additional uses for the Lease
Manager as a part of Condor are planned. Going forward, the CEDPS team will work
with Kosar’s group to merge the two lines of Stork development, with the goal of
providing a single Stork distribution in the future.
Year 2 Milestones
• MS2.7. Develop a managed storage capability for non-striped MOPS.
Future plans in this area involve additional testing of both LotMan and the Lease
Manager. Further developments of LotMan will include development of an external
interface into LotMan, perhaps via a Web Services interface, to allow storage allocations
to be more easily created and managed. Work on the Lease Manager will include
performance measurements and enhancements to the Lease Manager / Stork interface.
Finally, the UW team will work with Kosar’s group to provide a single Stork release.
3.4. DCACHE IMPROVEMENTS
The CEDPS team has worked to modify the dCache system to provide robust end-to-
end data integrity verification. This work has proven beneficial to the CMS high energy
physics application, which uses the capability to verify checksums on approximately 10
Terabytes per day of data downloaded to Fermi Lab. CMS has not been willing to reveal
the number of what would otherwise have been undetected errors that were identified by
this method, but we understand that it is greater than zero.
dCache provides a system for storing, retrieving and managing petabytes of data
distributed among a large number of heterogeneous server nodes. dCache supports a
variety of management and access protocols, such as GridFTP, SRM, dccp, and xrootd, all
representing a single virtual filesystem tree. The project is a joint effort between the
DESY (Deutsches Elektronen-Synchrotron) in Hamburg and the FNAL (Fermi National
Accelerator Laboratory) near Chicago, and is aimed at serving the data needs of US-
and European-based LHC (Large Hadron Collider) experiments. The core part of the dCache
functionality is in combining separate disk storage systems of several hundred terabytes
into a uniformly accessible filesystem tree. In order to make this process manageable,
dCache performs load balancing among data nodes and data integrity verification,
detects failing hardware, and attempts to ensure that important data exists in multiple replicas.
End-to-end data integrity verification in dCache is designed to prevent propagation of
incorrect data. In order to implement this feature, the following work has been done:
• Storing of checksum values and their types inside the dCache metadata catalog
• Implementation of GridFTP version 2 standard extensions, specifically those that
communicate checksum data between client/server and server/server
• Server to server negotiation of the checksum type algorithm for verification of
integrity of subsequent transfer
• Extension to algorithms that calculate data checksum (or file digests) values.
Specifically, adding support for MD5, MD4 and CRC.
Before starting the data transfer of a file to dCache, the client computes the checksum
value over the original data file on its local disk. This value is sent to dCache
using GridFTP checksum protocol extensions. After the transfer completes, dCache
verifies the received data for consistency with the client checksum and either rejects or
accepts the transaction.
Before a file is read from dCache, a client or other server on its behalf negotiates the
checksum algorithm with dCache to ensure that dCache supports the type of checksum
consistent with client requirements. This process ends when dCache determines and
sends to the client the value of the checksum that reflects the true content of the original
file. After the file is transferred to the client, the client verifies the checksum of the data
on its local disk and either accepts or rejects the transaction.
This end-to-end data integrity verification process ensures that server to client, client to
server and server to server data movement operations preserve the content of the file
originally stored by a user.
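The upload-side check can be sketched in a few lines (Python; `hashlib` stands in for dCache's checksum machinery, and the function names are illustrative, not dCache APIs):

```python
import hashlib

def md5_of(data: bytes) -> str:
    """Checksum of a byte string; MD5 is one of the supported types."""
    return hashlib.md5(data).hexdigest()

def accept_upload(received: bytes, client_checksum: str) -> bool:
    """Server side: accept the transfer only if the checksum of the
    received bytes matches the value the client sent ahead of time
    via the GridFTP checksum extensions; otherwise reject it."""
    return md5_of(received) == client_checksum

payload = b"simulation output"
claimed = md5_of(payload)            # computed by the client before upload
ok = accept_upload(payload, claimed)          # intact transfer: accepted
bad = accept_upload(payload[:-1], claimed)    # corrupted in transit: rejected
```

The download direction mirrors this: the server sends the stored checksum, and the client recomputes it over the received file before accepting the transaction.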
In addition, dCache checksum failures on particular hardware are routinely used to flag
hardware for preemptive replacement or maintenance.
The first deployment of this checksum functionality revealed several deficiencies in a
partnering storage system in Europe. Enabling this new functionality triggered further
development in other storage products, such as CASTOR at CERN, which now ensure
better quality of served data. Currently, the data integrity verification code verifies
approximately 10TB of data a day incoming to the FNAL storage system.
An additional effort of the CEDPS data team focused on research on quality of service
and opportunistic use. Based on experience improving the efficiency of a large-scale
data reconstruction effort in the OSG opportunistic storage environment, the Fermi
group delivered a document outlining ideas and further work needed to virtualize grid
storage, with the goal of providing data storage with a predefined and sustainable
quality of service.
Year 2 Milestones
• MS2.11. Produce Design Document on incorporating mechanisms for quality of service
o Status: Complete
Our plans for the coming year: The scale of the CMS experiment’s data requirements
on dCache has shown that existing implementations of algorithms for replication of
high-demand data are too simplistic and create substantial inefficiencies
during peak usage of the storage system. With varying degrees of success, these
inefficiencies are manually addressed on a case-by-case basis. Our future work in this
area will focus on researching and then adding automations to optimally replicate “hot”
data. This should reduce the need for continuous and hence costly manual parameter
adjustments to the dCache system.
4. SERVICES AREA PROGRESS
During the report period the services area has developed and applied tools for the
construction, operation, and provisioning of scalable science services.
4.1. NIMBUS
The Nimbus system provides mechanisms for the dynamic allocation of virtual machine
images: what is sometimes referred to as a “private cloud.” It also provides mechanisms
for the creation of the required images, for the creation of virtual clusters based around
virtual machine images, and other management tasks.
During the evaluation period the focus of the Nimbus team has been on working with
application communities and providing tools that enable scientists to easily leverage
cloud computing capabilities.
Our particular focus was interaction with DOE-related communities as follows:
• We enabled the first production run of the nuclear physics STAR applications on
Amazon’s EC2 cloud computing infrastructure, which took place in September
2007. The deployment of the STAR cluster on EC2 was orchestrated by the
Nimbus Context Broker service, which enables automatic and secure deployment of
“turnkey” virtual clusters, bridging the gap between the functionality provided by
EC2 and the “end product” that scientific communities need to deploy their
applications. Scientific production runs require reliability and careful, involved
environment preparation: this run was a significant step towards convincing the
broad STAR community that real science can be done using cloud computing.
We further worked with STAR on evaluating I/O on EC2 STAR instances, which,
at 5 MB/s, was deemed adequate for production runs of I/O-intensive
applications. We continue to collaborate with the project to enable further runs.
• Using the Context Broker, we also implemented a proof-of-concept that enabled
the integration of dynamically provisioned environments (e.g., on EC2 or on
clouds created in the scientific domain) for the ALICE HEP experiment at CERN
(07/08, CHEP submission pending). This work was done in collaboration with the
CERNVM project, which produces VM images that support all four LHC
experiments. Our prototype dynamically deployed VMs that were automatically
added to the ALICE Alien infrastructure, registering their availability for job
execution.
• We interacted internationally with multiple members of the ATLAS HEP
experiment. Ian Gable’s group at the University of Victoria (UVIC) has long
been a demanding user of Nimbus, contributing bug fixes and thorough
testing of Nimbus capabilities. In the Fall of 2008 they contributed a Nagios-based
monitoring component, required to better adapt the project to their needs, and as
a result we invited them to join the committer team. We also initiated a
collaboration with a group of ATLAS scientists at the Max Planck Institute who
are interested in an open source implementation of EC2 to facilitate moving
environments between their resources and EC2.
In terms of software development, this project supported the following developments:
• It contributed towards the development of an EC2 gateway (06/07), allowing
scientists to submit resource requests to Amazon using grid interfaces and
credentials and credits associated with a specific project. This gateway enabled
scientists to move seamlessly from clouds configured in scientific space to a
commercial target (in this case EC2) for overflow demand.
• It contributed to the early design and development of the Context Broker
technology for virtual machine images, enabling automated creation of “turnkey”
virtual clusters.
In addition, the collaborations above contributed requirements and informed the
design of the following developments:
• They helped us define requirements that informed six software releases of the
Nimbus toolkit between 05/07 and 11/08. Among other features, the releases
contained the Context Broker service; the “workspace pilot”, a non-invasive
adaptation of batch schedulers that facilitates Nimbus adoption on existing
scientific platforms; EC2-compatible interfaces to our technology; improved
extensibility; and better configuration tools.
• From 03/08 onwards, we worked with site administrators at UC and other sites to
configure Science Clouds, cloud computing platforms available to science.
• We added several new images to the “workspace bookshelf”, including
contextualizable images that can be used for the creation of virtual clusters, most
recently an OSG virtual cluster (10/08).
Year 2 Milestones
• Develop protocols for specifying targets for scalable services, including
performance and resource provisioning targets; continue the implementation of
• Further work on “workspace bookshelf”: developing schemas for describing and
identifying execution environments.
• Release the first version of services for on-demand provisioning of workspaces.
4.2. SERVICE CONSTRUCTION TOOLS
The Grid Remote Application Virtualization Interfaces (gRAVI) is an extension to the
Introduce grid service authoring tool that adds capabilities for wrapping legacy
applications. The first stable release of gRAVI occurred last April, and it was well
received by the APS (DOE) and caBIG (NIH) communities who were able to make
immediate use of it and began building and deploying services. It has subsequently also
been adopted at NERSC. Another release was made this past August that included
several improvements suggested by users.
One feature added in the last release was a simple portal, generated as part of the
authored gRAVI application, which can be deployed into a Tomcat servlet container to
enable users to interact with the corresponding Web Service through a web browser.
This feature is based on the feedback we received from the early users of the gRAVI tool
that it would be useful if gRAVI would generate a simple web/portal client that could be
deployed in a web server and could be shared with the community quickly. After
providing users with a proof-of-concept we started mulling over requirements for a
production environment with a large user community. Out of these discussions evolved a
collaboration with NERSC’s Open Software & Programming Group to develop tools for
generating portals to expose various scientific applications on NERSC resources. In late
November, an initial portal was in place and was well received by group lead David
Skinner, who pointed out several improvements that would need to be made.
Year 2 Milestones
• Develop a preliminary architecture document integrating the Web service
application infrastructure with provisioning backends.
o Status: Completed. PhD student Ioan Raicu conducted extensive
investigations and experiments in this area.
• Work with biology applications on creating science services using initial AHS
job-management-based as well as resource-management-based solutions wherever possible.
o Status: Completed. We conducted a promising study with the Argonne
computational biology team.
• Continued development of pyGridWare to support new protocol versions.
o Status: Modified; completed. We chose to focus effort on gRAVI rather
than pyGridWare, and thus this milestone has been deleted. Equivalent
functionality is provided by gRAVI.
• Develop a version of PyCLST that supports wrapping non-command-line applications.
o Status: Modified; completed. We chose to focus effort on gRAVI rather
than PyCLST, and thus this milestone has been deleted. Equivalent
functionality is provided by Introduce.
• Continue to deploy new services on OSG, ESG, and others.
o Status: Good results achieved with APS, caBIG, and NERSC. (OSG and
ESG have proved to be less interested in the technology for the moment.)
o Researchers at the APS used gRAVI to generate secure grid services for
controlling a beamline experiment, data analysis, visualization, and modeling.
o The caBIG initiative (funded by National Cancer Institute) is using gRAVI
to create its “caGrid” infrastructure for creating, registering, discovering,
and invoking analytical routines.
o NERSC is using gRAVI portal generation tools to create and deploy a science gateway.
5. TROUBLESHOOTING AREA PROGRESS
In this reporting period, CEDPS-Troubleshooting began several new collaborations and
spent a significant amount of time engaging with these collaborators. Details are in
the Collaborations section.
Development work has focused on providing a production-ready version of the log
parsing and database loading tools. Software enhancements and bug fixes have
proceeded in parallel with deployment and feedback from collaborators who are using
the tools.
5.1. COLLECTION AND ARCHIVE SERVICE IMPROVEMENTS
We further designed, developed, documented, and tested the “log pipeline”, which
continuously parses and loads log data into a database, and wrote a variety of parsers to
normalize log data from Grid middleware. The log pipeline has three main components:
a manager, a log parser, and a database loader. Major improvements to the components
in this reporting period are given below. This development work was a necessary
prerequisite to deploying the software on OSG.
The major improvement in the manager component was the addition of a simple UDP
messaging protocol to allow it to tell the managed loader/parser components when to roll
over, re-parse configurations, or shut down cleanly. This is more robust and easier to
distribute than the previously used UNIX signals, though these still work.
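The control exchange described above can be sketched as a minimal UDP protocol. This is an illustrative sketch only: the command names, default port, and message format below are assumptions, not the actual CEDPS manager implementation.

```python
import socket

# Hypothetical one-word control commands, modeled on the report's description.
COMMANDS = {"rollover", "reconfig", "shutdown"}

def send_control(cmd, host="127.0.0.1", port=15000):
    """Manager side: send one control command datagram to a parser/loader."""
    if cmd not in COMMANDS:
        raise ValueError("unknown command: %s" % cmd)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(cmd.encode("ascii"), (host, port))
    finally:
        sock.close()

def recv_control(sock):
    """Component side: block for one command datagram and return its text."""
    data, _addr = sock.recvfrom(1024)
    return data.decode("ascii")
```

A datagram protocol suits this use well: each command is a single self-contained message, and a lost datagram simply means the component keeps its current behavior until the next command.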
Major improvements to the log parser were a number of new parser modules, including
Condor and Globus components, improved error handling, the ability to “throttle” the
parser so it doesn’t consume all the host CPU if it is pointed to a very large input file, and
other enhancements. The availability of a simple framework that makes developing new
parsers straightforward has been very useful: it has encouraged contributions from
NERSC and been the primary interaction point with other systems. We now have 15
parsers in all, covering software components including PBS, SGE, Condor, Globus,
SRM, and HIS. Some relatively
straightforward extensions in the framework made quick work of a tool for the Pegasus
workflow logs that could traverse tens of directories of hundreds of files each, loading
them all into the same database for analysis.
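A parser framework of this kind can be sketched in a few lines: each parser module turns one raw log line into a normalized record, and a common driver handles iteration and filtering. The class names and field layout below are invented for illustration; the real CEDPS parsers normalize many more fields than shown.

```python
class LineParser:
    """Minimal sketch of a pluggable parser: subclasses turn one raw log
    line into a normalized dict of fields (or None to skip the line)."""
    def parse(self, line):
        raise NotImplementedError

class SGEAccountingParser(LineParser):
    """Illustrative parser for a colon-delimited accounting line; the real
    SGE accounting format has many more fields than shown here."""
    def parse(self, line):
        fields = line.rstrip("\n").split(":")
        if len(fields) < 4:
            return None  # malformed or unrelated line: skip it
        return {"queue": fields[0], "host": fields[1],
                "group": fields[2], "owner": fields[3]}

def run_parser(parser, lines):
    """Driver: feed raw lines through a parser, collecting normalized records."""
    return [rec for rec in (parser.parse(l) for l in lines) if rec is not None]
```

With this structure, adding support for a new log format means writing only the per-line `parse` method, which is what makes contributed parsers cheap.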
Major improvements to the log loader were PostgreSQL (www.postgresql.org) support,
greatly improved performance for the SQLite (www.sqlite.org) module, CPU throttling,
and a more thorough treatment of the performance/safety tradeoff involved in unique
integrity constraints. The PostgreSQL support is important for wider deployments, but in
the near term is also the database of choice for the NERSC Project Accounts work. The
performance tradeoffs of data loading are particularly important for both the Pegasus
team, which needs speed at all costs, and for long-running OSG deployments, which
need consistency and robustness across system restarts.
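The performance/safety tradeoff around unique integrity constraints can be illustrated with SQLite. The schema below is hypothetical: with a uniqueness constraint plus `INSERT OR IGNORE`, reloading the same log data after a restart is safe but inserts cost more; without the constraint, loading is faster but duplicates accumulate.

```python
import sqlite3

def make_db(unique=True):
    """Create an in-memory event table, optionally with a uniqueness
    constraint on the event identifier (the safety side of the tradeoff)."""
    con = sqlite3.connect(":memory:")
    constraint = "UNIQUE" if unique else ""
    con.execute("CREATE TABLE event (id TEXT %s, ts REAL, name TEXT)" % constraint)
    return con

def load(con, records):
    """Batch-load records in one transaction; INSERT OR IGNORE lets the
    loader survive re-loading duplicate events instead of aborting."""
    with con:  # connection as context manager: one commit for the batch
        con.executemany("INSERT OR IGNORE INTO event VALUES (?, ?, ?)", records)
```

A speed-at-all-costs deployment (such as the Pegasus analysis case) would drop the constraint; a long-running deployment that must survive restarts would keep it.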
Year 2 Milestones
• Add authentication and authorization capability to the log Collection Service and
log Archive Service.
o Status: In progress. We chose to use the OGSA-DAI technology to
perform this function, but have not yet made much progress implementing it.
This delay was serendipitous, as in the meantime the OGSA-DAI software has
developed a much fuller implementation of the required "view" and
distributed join functionality.
• Develop tools to filter and feed log data from the Collection Service to the Archive Service.
o Status: Complete (see above for details).
• Continue to deploy new services on OSG, ESG, and others.
o Status: Ongoing. We have deployed on NERSC PDSF and are packaging
with VDT to deploy on other OSG resources in the near future.
• Continue outreach to Grid application developers to instrument their applications.
o Status: Ongoing. We have had very positive interactions in this regard
with members of the Pegasus team (see Section 6.3) and also the SDM
group at LBNL (see Section 6.5).
In addition to finishing necessary parts of the milestones above, there are a few minor
tasks and one large development task in the near future. The minor tasks are tools and scripts for
more robust operation of the log pipeline, including log and database rotation, and self-
monitoring. All these new capabilities will be documented and packaged using the
Virtual Data Toolkit (VDT) so they can be easily deployed on the Open Science Grid.
The major task is an OGSA-DAI interface that allows fully authenticated cross-site
database access. The state-of-the-art today is to either keep the logs on the site or to
centralize a subset of them (as OSG does today with most of its monitoring). Neither
approach is adequate for troubleshooting, as many current and potential users have
pointed out. What we are aiming for with OGSA-DAI is a way to control who views which
logs at a fine grain using the existing Grid credentials, and to be able to combine
information easily from more than one site in the process. We are fortunate to have a
project like OGSA-DAI which has done much of the difficult groundwork in mapping Grid
credentials to database roles and views, but there is still a considerable amount of work
to be done.
5.2. LOG DATA ANALYSIS
The data analysis work has focused on three tasks: analysis of complex Pegasus
workflows from the SCEC CyberShake computations, profiling of BeStMan data
transfers, and correlation of Sun Grid Engine and Globus Job Manager logs for the
TechX STAR job submission portal. These projects are all discussed in more detail in
the collaboration sections below: Pegasus workflows in Section 6.3, BeStMan in Section
6.5, and TechX/STAR in Section 6.7.
Year 2 Milestones
• Use the Archive Service to establish performance baselines, and trigger events if
performance deviates too much from the baseline.
o Status: In progress. We have demonstrated the ability of the Archive
Service to profile complex workflows in collaboration with the Pegasus
team. The resulting profiles do form, in a sense, a baseline. We are still
working on the triggering of events (see below) based on this information.
Future plans include establishing performance baselines for BeStMan transfers to and
from NERSC PDSF systems (primarily for the STAR project), and continuing the
analysis with Pegasus workflows for SCEC CyberShake workflows and other users of
the Pegasus technology. We are working towards a real-time view of the status of
Pegasus workflows for SCEC/CyberShake, which will be a huge improvement over their
current "run and come back 9 hours later" situation. These tools should apply with only
small modifications to other users of the Pegasus workflow tools, and pieces of the
functionality should be useful for monitoring Condor job DAGs. We thus expect a
much broader impact from what started as very focused work.
5.3. INTEGRATION OF LOG DATA WITH MDS4
To integrate the MDS4 with the CEDPS Log Collection and Log Archive services, we
have invested significant time enhancing the administrative interface to the MDS Trigger
service. This work lays the groundwork for "action scripts" that can query the CEDPS log
database and trigger alarms based on the results.
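An action script of the kind described above can be sketched as a query against the log database followed by an alarm decision. The table schema, threshold, and exit-code convention below are assumptions for illustration, not the actual MDS Trigger Service interface.

```python
import sqlite3

ERROR_THRESHOLD = 10  # hypothetical: alarms fire above this many errors

def check_errors(con, since_ts):
    """Count ERROR-level events since a timestamp in a hypothetical
    CEDPS log table; the real schema may differ."""
    (n,) = con.execute(
        "SELECT COUNT(*) FROM event WHERE level = 'ERROR' AND ts >= ?",
        (since_ts,)).fetchone()
    return n

def action_script(con, since_ts):
    """Sketch of an action-script convention: a nonzero return value
    signals the trigger service to raise an alarm."""
    n = check_errors(con, since_ts)
    if n > ERROR_THRESHOLD:
        print("ALARM: %d errors since %s" % (n, since_ts))
        return 1
    return 0
```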
Year 2 Milestones
• Develop MDS4 Trigger Service action scripts to securely restart failed services.
o Status: Dropped. Preliminary discussions with system administrators
have indicated that this is not properly in the scope of the CEDPS project.
We are instead investigating better integration with site “trouble ticket” systems.
• Develop MDS4 Triggers for missing log events (based on NetLogger anomaly detection).
o Status: In progress. MDS triggers for log events are a work in progress.
However, as part of our collaboration with ESG, we have released a
dramatically improved version of the administrative interface to the MDS
Trigger Service. See the ESG Collaboration for details.
• Integration of Log Collection Service with MDS4 to provide a log file location service.
o Status: Deferred. Because we plan to use OGSA-DAI as the access
mechanism for log data, we will implement the log file location service by
having each OGSA-DAI server register to a central MDS Index server.
Because OGSA-DAI already publishes resource properties, no additional
development work is needed to provide this functionality; this is a
deployment task that we plan to do as we deploy OGSA-DAI.
6. COLLABORATIONS
As is appropriate for a SciDAC Center for Enabling Technology, collaborations are the heart of
the CEDPS project, defining requirements for technology R&D and providing the context within
which new technologies are evaluated. We detail some of our major collaborations in the
sections below.
6.1. NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER (NERSC)
CEDPS-Troubleshooting worked with the National Energy Research Scientific Computing
Center (NERSC) on two fronts: log collection on PDSF, and help with the Project
Accounts auditing tasks.
The CEDPS log collection tools were deployed on the NERSC Parallel Distributed
Systems Facility (PDSF). Parsers have been developed for the Sun Grid Engine (SGE)
scheduler used on PDSF. The site administrators at PDSF use this information to track
the resource consumption "tokens" claimed by PDSF users. This deployment is also
used for tracking STAR data transfers and providing troubleshooting to the TechX/STAR
job submission portal, both discussed below.
The NERSC Project Accounts team is designing and implementing a framework to allow
NERSC users to run jobs under a shared, or project, account while retaining full
traceability to the actual user involved. This model would improve the usability of
NERSC resources for many users. In order to perform the necessary auditing of users'
actions in the systems, the Project Accounts team is using the CEDPS log parsing tools
and log database to normalize the logs and perform traceability analysis.
CEDPS-Services is developing portal generation tools that are being used to create science
gateways for NERSC resources. The initial target application is VASP, a package for performing
ab-initio quantum mechanical molecular dynamics, which has a relatively large user community
at NERSC. To date the portal has been available only internally to NERSC staff for
testing; early next year we plan to roll it out to a select group of community testers.
6.2. ARGONNE LEADERSHIP COMPUTING FACILITY (ALCF)
The focus of our collaboration here is on enabling ALCF to achieve high-speed remote
access to and from their HPSS mass store system. This work has involved close
collaboration with ALCF staff on the design, deployment, and evaluation of their GridFTP
solution, including extensive work with the HPSS interface. Initial results are extremely promising.
6.3. EARTH SYSTEM GRID (ESG) CENTER FOR ENABLING TECHNOLOGY
ESG-CET is a major user of CEDPS data movement technologies: in particular, GridFTP
and RLS. It is also a driver for CEDPS work on data replication.
ESG-CET has continued to make aggressive use of GridFTP, and thus has benefited
from the significant performance and functionality enhancements to GridFTP that have
been developed under the CEDPS project. ESG-CET uses the Storage Resource
Manager from LBNL to perform wide area bulk data movement of large (terabyte) data
sets, and SRM in turn calls GridFTP to perform these transfers among ESG sites. In
addition, OpenDAP-G uses GridFTP directly for high-performance transfers.
The ESG-CET and CEDPS projects also collaborate in several areas related to data management.
ESG uses the Replica Location Service to track and catalog data sets. For the next
generation ESG architecture, we are working with the ESG team to provide data
mirroring functionality. This new functionality will allow key sites on several continents
to host large capacity (terabytes) mirror sites for ESG data sets. We are investigating
whether data mirroring tools being developed under CEDPS can help ESG to manage
their data replication.
For monitoring, we have released a redesigned version of the MDS Trigger Service for
Globus Toolkit Version 4.2. This work provides an improved service interface for
administrative tasks such as modifying, enabling, and disabling existing triggers. This
work was done in response to feedback from ESG, who requested this capability to
support a portal that they plan to develop, as well as a simpler command-line interface.
In our initial plans, we also identified ESG server-side processing as an important driver
for CEDPS-Services work. However, while ESG continues to view server-side
processing as important for their long-term plans, they have not been able to prioritize
effort in this area, and thus collaboration in that area has not yet materialized.
6.4. PEGASUS GROUP, USC/ISI
The Pegasus group at USC/ISI provides a workflow engine called Pegasus-WMS that is
used by the Southern California Earthquake Center (SCEC) to run computational
simulations on their CyberShake platform. By combining Pegasus-WMS and CEDPS
logging tools, we were able to efficiently process execution logs of earthquake science
workflows consisting of hundreds of thousands to one million tasks. In an accepted
poster for the SC08 conference [7.4], we show results of processing logs of
CyberShake, a workflow application running on the TeraGrid.
Although workflow analysis was not in our list of milestones for this year, it has turned
out to be a fruitful area. Just as one can view any distributed job as a type of “workflow”,
solutions to the problems of scale and correlation found in the Pegasus workflow logs
are re-usable in the context of single Grid submissions. For example, the same types of
queries we developed for mining Pegasus logs were also used to correlate the TechX
job submissions with the SGE scheduler information (see Section 6.8, below).
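A correlation query of the kind just described can be illustrated with SQLite. The table names, columns, and data below are invented for the example; the real CEDPS log database schema is richer, but the principle is the same join on a shared job identifier.

```python
import sqlite3

def correlate(con):
    """Join portal submissions to scheduler records on the shared job id,
    deriving queue-wait and run times for each submission."""
    return con.execute("""
        SELECT p.user, s.host,
               s.start_ts - p.submit_ts AS wait_sec,
               s.end_ts - s.start_ts    AS run_sec
        FROM portal_submit p JOIN sge_job s ON p.job_id = s.job_id
    """).fetchall()

def demo_db():
    """Build a tiny in-memory database with one correlated job (invented data)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE portal_submit (job_id TEXT, user TEXT, submit_ts REAL);
        CREATE TABLE sge_job (job_id TEXT, host TEXT, start_ts REAL, end_ts REAL);
    """)
    con.execute("INSERT INTO portal_submit VALUES ('42', 'alice', 100.0)")
    con.execute("INSERT INTO sge_job VALUES ('42', 'node7', 130.0, 190.0)")
    return con
```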
We also collaborated with the Pegasus team on exploring Infrastructure-as-a-Service
(IaaS) cloud computing for scientific communities: a platform can be flexibly provisioned
from an academic or commercial provider in response to a developing resource need in a
workflow. Our first exploration – comparing performance of workflow-based scientific
applications on local platforms and platforms available in the Science Clouds – took
place in the summer of ’08 and was recently published [7.3].
6.5. SCIENTIFIC DATA MANAGEMENT GROUP, LBNL
CEDPS-Troubleshooting collaborated with Arie Shoshani’s SDM group, the developers
of the BeStMan implementation of the Storage Resource Manager (SRM) protocol, to
improve and normalize their log information. The short-term goal of this collaboration has
been to collect SRM logs from the STAR project's transfers between Brookhaven
National Laboratory and PDSF.
There is now a deployed version of BeStMan on PDSF that contains new and improved
logging, and the requisite parsers are in place in the CEDPS log collection tools to
process these logs and load them into our database for analysis. Impacts on the STAR
project are discussed below.
6.6. GLOBUS TEAM, ANL
CEDPS-Troubleshooting continued the effort to make Globus Toolkit logs follow the
CEDPS "Logging Best Practices" guidelines. Although the initial target for this was GT
4.1.3, deployment realities have pushed it back to GT 4.2.1. We continue to engage with
the Globus team and provide feedback on Globus Toolkit logs in order to make them as
useful as possible when GT 4.2.1 is deployed through OSG and elsewhere.
6.7. SOLENOIDAL TRACKER AT RHIC (STAR) EXPERIMENT
The Solenoidal Tracker At RHIC (STAR) nuclear physics experiment based at
Brookhaven National Laboratory (BNL) performs data analysis at several sites, including
PDSF. The input and output data are transferred between a node at BNL and a node at
PDSF using the BeStMan implementation of SRM data management protocols.
CEDPS-Troubleshooting has begun analyzing the actual end-to-end throughput
experienced by these transfers, and has so far found some surprising numbers:
transfers from BNL to PDSF showed reasonable (100-200Mbps) throughput, whereas
transfers from PDSF back to BNL were 1-2 orders of magnitude slower. The results from
GridFTP logs on BNL and BeStMan logs at PDSF were correlated (by the CEDPS
logging tools) to verify this result. Until the network bottleneck can be removed, this
information has been fed back to adjust the number of concurrent streams in the
BeStMan deployment and improve the transfer rate.
In addition, CEDPS-Data has had a series of discussions with the STAR team to
understand their data mirroring and replication requirements. The STAR team indicated
that simple replication tools would be very useful to their project in the future. STAR
requirements are feeding into current design work with CEDPS-Data.
CEDPS-Services began its collaboration with STAR early in the project and continued it
in the evaluation period. The objective of this collaboration is to demonstrate that STAR
applications can use available IaaS cloud resources for production runs and evaluate the
paradigm’s usefulness to the STAR community in comparison to existing resources.
Working with STAR scientists, we developed contextualization tools allowing for
automatic creation of tightly-coupled clusters on IaaS platforms (such as Amazon’s EC2
or Science Clouds). This tool enabled us to prepare the first significant STAR run on
EC2 in 09/07, supporting a platform for production codes. Subsequently, we worked with
STAR community members to evaluate the performance impact of I/O operations in the
cloud on STAR applications (found to be within 10% and deemed acceptable) and on
preparations for another STAR production run in the cloud, this time for a critical code
that will generate publication-worthy results.
6.8. TECHX CORPORATION
CEDPS-Troubleshooting has coordinated with the TechX team (www.txcorp.com), which
has developed a STAR job submission portal and also added custom application-level
monitoring to STAR middleware. The STAR production managers would like to be able
to "drill down" beyond the job start / job end type of monitoring provided by TechX to
site-specific logs and errors. The CEDPS log database provides this information, which
we have agreed to provide to the TechX portal.
A prototype version of the Tech-X portal can query the SGE log information stored in
the CEDPS log database. We are working together to finish a prototype of this
functionality by SC08. When complete, this will greatly enhance the usability of the portal
for STAR jobs.
CEDPS-Services worked with the TechX team on coordinating and enhancing their
collaboration with STAR. We are currently exploring two collaborations. The first
consists of supplying cloud computing expertise for the development of an elastic cloud
computing infrastructure to enhance access to a nuclear physics relational database
developed by TechX in collaboration with STAR. The second is a project exploring the
integration of cloud computing technology into the current OSG fabric, following up on
successful demonstrations of STAR production runs on IaaS infrastructure.
6.9. OPEN SCIENCE GRID (OSG)
CEDPS-Troubleshooting is continuing collaboration with OSG, with a goal of deploying
the CEDPS troubleshooting tools to provide a central log database for troubleshooting
and to provide an early warning system by detecting deviations from baseline
performance. Progress on this front has been delayed somewhat by delays in the
release of CEDPS logging in Globus Toolkit components, but also by slower than
expected progress on the integration of CEDPS logging with the MDS.
In a separate activity, the CEDPS-Troubleshooting log normalization tools are in
production use for the OSG accounting function to analyze GridFTP activity.
CEDPS-Services collaborates with OSG by providing tools that facilitate IaaS
exploration by OSG scientists. Specifically, we made available (10/08) an OSG virtual
cluster that can be deployed by OSG scientists on Science Clouds resources as well as
(via the use of Nimbus contextualization tools) EC2: this allows scientists to easily
deploy an OSG cluster in the cloud. We also interact with OSG through participation in
OSG events (e.g., we organized a cloud computing BOF at the OSG AHM in 03/08) to
explain and popularize cloud computing ideas.
6.10.ADVANCED PHOTON SOURCE (APS)
Researchers at APS beamlines have used gRAVI in a project aimed at automating large
parts of the end-to-end experiment operation, data analysis, data visualization, and data-
driven modeling workflows that define their work processes. This work has involved the
use of gRAVI to generate secure Web Services for controlling a beamline experiment,
and for data analysis, visualization and modeling. The results of this work have been
profiled within APS, presented at meetings, and highlighted on the DOE web site.
6.11.NUCLEAR PHYSICS GROUPS
CEDPS-Services established several collaborations with international nuclear physics
groups. In the summer of ’08 we enabled seamless integration of VMs deployed on the
Science Clouds platforms for the ALICE HEP experiment at CERN (07/08, CHEP
submission pending). This is significant because it shows how cloud computing can be
integrated into existing community computing mechanisms (the VMs served as platforms
for jobs from the ALICE production queue on a first-come, first-served basis). Further, we began a
collaboration with multiple members of the ATLAS HEP experiment. We work with a
group of scientists in the Max-Planck Institute to evaluate Nimbus as an open source
EC2-compatible platform. We also continued our collaboration with ATLAS scientists at
the University of Victoria (UVIC) exploring Nimbus as a platform for their community.
These interactions resulted, among other things, in open source contributions from the
group at UVIC, who joined us on the committer team.
7. PRESENTATIONS AND PUBLICATIONS
1. I. Foster, “Enabling Distributed Petascale Science,” Scientific Discovery through
Advanced Computing Conference, Boston, Mass., May 2007.
2. I. Foster, “Services for Science,” Keynote talk at INGRID Conference, Ischia,
Italy, April 2008.
3. K. Keahey and T. Freeman, "Cloud Computing and Virtualization with Globus
(Tutorial)," Open Source Grid and Cluster Conference, Oakland, CA, May 2008.
4. K. Keahey, "Globus Virtual Workspaces". HEPiX Fall 2007, St. Louis, MO.
5. R. Kettimuthu, "Data Movement Tools for Distributed Petascale Science,"
Maseeh College of Engineering and Computer Science, Portland State
University, Portland, OR, Sep 2008
6. R. Kettimuthu, "Reliable Data Movement Framework for Distributed Science
Environments," The 2008 International Conference on Parallel and Distributed
Processing Techniques and Applications, Las Vegas, NV, July 2008.
7. R. Kettimuthu, "Globus GridFTP and RFT: An Overview and New Features,"
National Energy Research Scientific Computing Center (NERSC), Oakland, CA.
8. R. Kettimuthu, J. Bresnahan and M. Link, "Configuring and Deploying GridFTP
for Managing Data Movement in Grid/HPC Environments," SC 2008, Austin, TX.
9. R. Kettimuthu and J. Bresnahan, "Managing Data Movement Using GridFTP in
Distributed Environments," Open Source Grid and Cluster Conference, Oakland,
CA, May 2008.
10. “Characterization of Scientific Workflows,” Shishir Bharathi, Ann Chervenak,
Ewa Deelman, Gaurang Mehta, Mei-Hui Su, Karan Vahi, The 3rd Workshop on
Workflows in Support of Large-Scale Science (WORKS08), in conjunction with
Supercomputing (SC08) Conference, Austin, Texas, November, 2008.
11. “Enabling petascale science: data management, troubleshooting, and scalable
science services,” A. Baranovski, K. Beattie, S. Bharathi, J. Boverhof, J.
Bresnahan, A. Chervenak, I. Foster, T. Freeman, D. Gunter, K. Keahey, C.
Kesselman, R. Kettimuthu, N. Leroy, M. Link, M. Livny, R. Madduri, G.Oleynik, L.
Pearlman, R. Schuler and B.Tierney, Journal of Physics: Conference Series,
Volume 125, 2008. (Also appeared in Proceedings of SciDAC 2008 Conference,
13-17 July, 2008, Seattle, Washington, USA.)
12. “Reducing Time-to-Solution Using Distributed High-Throughput Mega-Workflows
– Experiences from SCEC CyberShake", Scott Callaghan, Phil Maechling, Ewa
Deelman, Karan Vahi, Gaurang Mehta, Gideon Juve, Kevin Milner, Robert
Graves, Edward Field, David Okaya, Dan Gunter, Keith Beattie, Thomas Jordan.
Fourth IEEE International Conference on eScience (eScience 2008),
Indianapolis, IN, USA, December 2008.
13. "Exploration of the Applicability of Cloud Computing to Large-Scale Scientific
Workflows," C. Hoffa, T. Freeman, G. Mehta, E. Deelman, and K. Keahey, to be
submitted to SWBES08: Challenging Issues in Workflow Applications, 2008.
14. "Virtual Workspaces for Scientific Applications", Keahey, K., T. Freeman, J.
Lauret, D. Olson. SciDAC 2007 Conference, Boston, MA. June 2007
15. “Center for Enabling Distributed Petascale Science”, SciDAC Conference, July
16. “Policy-Driven Data Management for Distributed Scientific Collaborations Using
a Rule Engine”, Sara Alspaugh, Ann Chervenak, Ewa Deelman, Supercomputing
(SC08) Conference, Austin, Texas, November 2008. Received Best
Undergraduate Student Poster award in ACM Student Poster competition.
17. "When Workflow Management Systems and Logging Systems Meet: Analyzing
Large-Scale Execution Traces". Dan Gunter, Scott Callaghan, Gaurang Mehta,
Gideon Juve, Keith Beattie, Ewa Deelman, Phil Maechling, Brian Tierney, Karan
Vahi. SC08, Austin, TX