A talk given at the US Department of Energy, covering our work on research data management and analysis. Three themes:
(1) Eliminate data friction (use of SaaS for research data management)
(2) Liberate scientific data (research on data extraction, organization, publication)
(3) Create discovery engines at DOE facilities (services that organize data + computation)
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem... (Ian Foster)
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
Accelerating Discovery via Science Services (Ian Foster)
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
Streamlined data sharing and analysis to accelerate cancer research (Ian Foster)
Advances in genomics and data analytics create new opportunities for cancer research and personalized medical treatment via large-scale federation of genomic, clinical, imaging and other data from many thousands of patients across institutions around the world. Despite these opportunities and promising early results, cancer research is often stymied by information technology barriers. One major barrier is a lack of tools for the reliable, secure, rapid, and easy transfer, sharing, and management of large collections of human data. In the absence of such tools, security and performance concerns often prevent sharing altogether or force researchers to resort to slow and error-prone shipping of physical media. If data are received, timely analysis is further impeded by the difficulties inherent in verifying data integrity and managing who can access data and for what purpose. I will discuss how the mature Globus data management platform addresses these obstacles to discovery and explain how its intuitive, web-based interfaces enable use by researchers without specialized IT knowledge. I will also describe how Globus technologies can be extended to meet the security requirements of human data so as to enable use in data-intensive cancer research.
My talk at the Winter School on Big Data in Tarragona, Spain.
Abstract: We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers.
Discovery Engines for Big Data: Accelerating Discovery in Basic Energy Sciences (Ian Foster)
Argonne’s Discovery Engines for Big Data project is working to enable new research modalities based on the integration of advanced computing with experiments at facilities such as the Advanced Photon Source (APS). I review science drivers and initial results in diffuse scattering, high-energy diffraction microscopy, tomography, and ptychography. I also describe the computational methods and infrastructure that we leverage to support such applications, which include the Petrel online data store, ALCF supercomputers, Globus research data management services, and Swift parallel scripting. This work points to a future in which tight integration of DOE’s experimental and computational facilities enables both new science and more efficient and rapid discovery.
We've all heard about how on-demand computing and storage will transform scientific practice. But by focusing on resources alone, we're missing the real benefit of the large-scale outsourcing and consequent economies of scale that cloud is about. The biggest IT challenge facing science today is not volume but complexity. Sure, terabytes demand new storage and computing solutions. But they're cheap. It is establishing and operating the processes required to collect, manage, analyze, share, archive, etc., that data that is taking all of our time and killing creativity. And that's where outsourcing can be transformative. An entrepreneur can run a small business from a coffee shop, outsourcing essentially every business function to a software-as-a-service provider--accounting, payroll, customer relationship management, the works. Why can't a young researcher run a research lab from a coffee shop? For that to happen, we need to make it easy for providers to develop "apps" that encapsulate useful capabilities and for researchers to discover, customize, and apply these "apps" in their work. The effect, I will argue, will be a dramatic acceleration of discovery.
WoSC19: Serverless Workflows for Indexing Large Scientific Data (University of Chicago)
The use and reuse of scientific data is ultimately dependent on the ability to understand what those data represent, how they were captured, and how they can be used. In many ways, data are only as useful as the metadata available to describe them. Unfortunately, due to growing data volumes, large and distributed collaborations, and a desire to store data for long periods of time, scientific “data lakes” quickly become disorganized and lack the metadata necessary to be useful to researchers. New automated approaches are needed to derive metadata from scientific files and to use these metadata for organization and discovery. Here we describe one such system, Xtract, a service capable of processing vast collections of scientific files and automatically extracting metadata from diverse file types. Xtract relies on function as a service models to enable scalable metadata extraction by orchestrating the execution of many, short-running extractor functions. To reduce data transfer costs, Xtract can be configured to deploy extractors centrally or near to the data (i.e., at the edge). We present a prototype implementation of Xtract and demonstrate that it can derive metadata from a 7 TB scientific data repository.
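To make the extractor-orchestration idea concrete, here is a minimal Python sketch of the pattern the abstract describes: small, stateless, per-file-type extractor functions that a function-as-a-service executor could invoke once per file. The function names and dispatch table are illustrative assumptions, not the actual Xtract API.

```python
# Illustrative sketch of per-file-type metadata extractors (not the real Xtract API).
# Each extractor is a small, stateless function suitable for function-as-a-service execution.
import json
import os


def extract_csv(path):
    """Guess column names and count rows in a CSV file."""
    with open(path, "r", errors="ignore") as f:
        header = f.readline().strip().split(",")
        rows = sum(1 for _ in f)
    return {"type": "tabular", "columns": header, "rows": rows}


def extract_json(path):
    """Record the top-level keys of a JSON document."""
    with open(path, "r", errors="ignore") as f:
        doc = json.load(f)
    keys = sorted(doc.keys()) if isinstance(doc, dict) else []
    return {"type": "json", "top_level_keys": keys}


# Dispatch table mapping file extensions to extractor functions.
EXTRACTORS = {".csv": extract_csv, ".json": extract_json}


def extract_metadata(path):
    """Entry point a FaaS executor could invoke once per file, centrally or at the edge."""
    record = {"path": path, "size_bytes": os.path.getsize(path)}
    extractor = EXTRACTORS.get(os.path.splitext(path)[1].lower())
    if extractor:
        record.update(extractor(path))
    return record
```

In the pattern the paper describes, many such short-running functions would be dispatched in parallel across a repository, with the resulting metadata records fed into a search index.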
Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.
The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca... (Databricks)
The speaker will review case studies from real-world projects that built AI systems using Natural Language Processing (NLP) in healthcare. These case studies cover projects that deployed automated patient risk prediction, automated diagnosis, clinical guidelines, and revenue cycle optimization.
These slides were presented at AGU 2018 by Tanu Malik from DePaul University, in a session convened by Dr. Ian Foster, director of the Data Science and Learning division at Argonne National Laboratory.
Opening Keynote Lecture
15th Annual ON*VECTOR International Photonics Workshop
Calit2’s Qualcomm Institute
University of California, San Diego
February 29, 2016
As the volume and complexity of data from myriad Earth-observing platforms, both remote sensing and in situ, increase, so does the demand for access to both the data and the information products derived from them. The audience is no longer restricted to investigator teams with specialist science credentials: non-specialist users, from scientists in other disciplines to the science-literate public, teachers, the general public, and decision makers, want access. What prevents this access? The very complexity of specialist-developed data formats, data set organizations, and specialist terminology. What can be done in response? We must shift the burden from the user to the data provider. To achieve this, our data infrastructures will likely need greater internal code and data-structure complexity in order to present (relatively) greater simplicity to the end user. Evidence from numerous technical and consumer markets supports this scenario. We will cover the elements of modern data environments, what the new use cases are, and how we can respond to them.
The Discovery Cloud: Accelerating Science via Outsourcing and Automation (Ian Foster)
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
The Earth System Grid Federation: Origins, Current State, Evolution (Ian Foster)
I describe the origins, current state and potential future directions for the Earth System Grid Federation, an international consortium that develops infrastructure for sharing of climate simulation and related datasets.
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster (Globus)
This poster was presented at the 2019 NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium by Brigitte Raumann and Ian Foster of Globus, University of Chicago and Argonne National Lab.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Monitoring Exascale Supercomputers With Tim Osborne | Current 2022 (Hosted by Confluent)
The Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy Office of Science user facility, provides world-class high-performance computing (HPC) resources for open science as well as world-class expertise in scientific computing. The OLCF operates 2 of the top 5 supercomputers in the world: Frontier and Summit. Our Kafka cluster was built in 2018 to stream data from Summit, a 200 Petaflop system with 4,000 compute nodes—but is the cluster ready for Exascale? The OLCF has recently delivered Frontier, the world's first exascale system, and we engineered a significant increase in streaming bandwidth and volume to serve its performance metrics, system events, utilization metrics, job metadata, and facilities monitoring. Data is indexed and served through an Elasticsearch cluster and provided in real time to Grafana dashboards.
In this talk we will discuss scaling and planning a system to meet the streaming demands of the world’s only exascale and most energy efficient supercomputer. Tune in to learn more about HPC and how streaming fits in to monitoring large-scale systems. We will discuss aggregating data from many clusters into a central streaming system, shedding technical debt by pivoting to Confluent Operator on Kubernetes, and how we use real-time data to optimize supercomputer performance.
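For readers unfamiliar with how system telemetry enters a Kafka cluster in the first place, the sketch below shows the general producer pattern using the kafka-python client: serialize per-node metrics as JSON and publish them to a topic. The broker address, topic name, and metric fields are placeholders of my own; the talk does not describe OLCF's actual pipeline at this level of detail.

```python
# Minimal sketch of streaming node metrics into Kafka (illustrative; not OLCF's code).
# Assumes the kafka-python package and a broker at the placeholder address below.
import json
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.org:9092",       # placeholder broker address
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

for _ in range(10):  # in practice this loop would run continuously on each node
    metrics = {
        "node": "node0001",                             # placeholder node name
        "timestamp": time.time(),
        "gpu_power_watts": random.uniform(300, 500),    # fabricated sample values
        "cpu_utilization": random.uniform(0, 100),
    }
    producer.send("node-metrics", value=metrics)        # placeholder topic
    time.sleep(1)

producer.flush()
```

Downstream, such records would be consumed, indexed (e.g., into Elasticsearch), and visualized in dashboards, as the talk describes.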
Accelerating data-intensive science by outsourcing the mundane (Ian Foster)
Talk at eResearch New Zealand Conference, June 2011 (given remotely from Italy, unfortunately!)
Abstract: Whitehead observed that "civilization advances by extending the number of important operations which we can perform without thinking of them." I propose that cloud computing can allow us to accelerate dramatically the pace of discovery by removing a range of mundane but time-consuming research data management tasks from our consciousness. I describe the Globus Online system that we are developing to explore these possibilities, and propose milestones for evaluating progress towards smarter science.
Keynote presentation at GlobusWorld 2021. Highlights product updates and roadmap, as well as user success stories in research data management. Presented by Ian Foster, Rachana Ananthakrishnan, Kyle Chard and Vas Vasiliadis.
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
The computational requirements of next-generation sequencing are placing huge demands on IT organisations.
Building compute clusters is now a well understood and relatively straightforward problem. However, NGS applications require large amounts of storage and high I/O rates.
This talk details our approach for providing storage for next-gen sequencing applications.
Talk given at BIO-IT World, Europe, 2009.
Next-Generation Search Engines for Information Retrieval (Waqas Tariq)
In recent years there have been significant advancements in the areas of scientific data management and retrieval techniques, particularly in terms of standards and protocols for archiving data and metadata. Scientific data is generally rich, not easy to understand, and spread across different places. In order to integrate these pieces together, a data archive and associated metadata should be generated. These data should be stored in a format that is locatable, retrievable, and understandable; more importantly, they should be in a form, such as XML, that will continue to be accessible as technology changes. New search technologies are being implemented around these protocols, which make searching easy, fast, and yet robust. One such system is Mercury, a metadata harvesting, data discovery, and access system built for researchers to search for, share, and obtain spatiotemporal data used across a range of climate and ecological sciences.
Global Services for Global Science March 2023.pptx (Ian Foster)
We are on the verge of a global communications revolution based on ubiquitous high-speed 5G, 6G, and free-space optics technologies. The resulting global communications fabric can enable new ultra-collaborative research modalities that pool sensors, data, and computation with unprecedented flexibility and focus. But realizing these modalities requires new services to overcome the tremendous friction currently associated with any actions that traverse institutional boundaries. The solution, I argue, is new global science services to mediate between user intent and infrastructure realities. I describe our experiences building and operating such services and the principles that we have identified as needed for successful deployment and operations.
Keynote talk at 2022-10-11 ESnet6 launch. A lovely event by a great team. It was a pleasure to talk about how ESnet6 will enable new "smart instruments"--and some of the work that we are doing to that end.
Linking Scientific Instruments and Computation (Ian Foster)
[Talk presented at Monterey Data Conference, August 31, 2022]
Powerful detectors at modern experimental facilities routinely collect data at multiple GB/s. Online analysis methods are needed to enable the collection of only interesting subsets of such massive data streams, such as by explicitly discarding some data elements or by directing instruments to relevant areas of experimental space. Thus, methods are required for configuring and running distributed computing pipelines—what we call flows—that link instruments, computers (e.g., for analysis, simulation, AI model training), edge computing (e.g., for analysis), data stores, metadata catalogs, and high-speed networks. We review common patterns associated with such flows and describe methods for instantiating these patterns. We present experiences with the application of these methods to the processing of data from five different scientific instruments, each of which engages powerful computers for data inversion, machine learning model training, or other purposes. We also discuss implications of such methods for operators and users of scientific facilities.
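The abstract speaks of flows as configurable pipelines that link instruments, computing, storage, and catalogs. The sketch below illustrates that pattern only in the abstract: a flow expressed as a declarative, ordered list of named steps, each handing its context to the next. All step names and functions are hypothetical stand-ins; this is not the authors' flow system, which in practice builds on managed services rather than a local runner.

```python
# Pedagogical sketch of the "flow" pattern: a declarative sequence of steps that
# link data capture, movement, analysis, and cataloging. All steps are stand-ins.
def capture(ctx):
    ctx["raw_file"] = "/detector/scan_0001.h5"             # placeholder instrument output
    return ctx

def transfer(ctx):
    ctx["staged_file"] = "/cluster/scratch/scan_0001.h5"   # pretend the data were moved
    return ctx

def analyze(ctx):
    ctx["result"] = {"peaks_found": 42}                    # placeholder analysis output
    return ctx

def catalog(ctx):
    print("registering", ctx["staged_file"], ctx["result"])  # stand-in for a metadata catalog
    return ctx

# The flow itself is just configuration: an ordered list of (name, action) pairs.
FLOW = [("capture", capture), ("transfer", transfer),
        ("analyze", analyze), ("catalog", catalog)]

def run_flow(flow, context=None):
    context = context or {}
    for name, action in flow:
        print(f"running step: {name}")
        context = action(context)
    return context

if __name__ == "__main__":
    run_flow(FLOW)
```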
A Global Research Data Platform: How Globus Services Enable Scientific Discovery (Ian Foster)
Talk in the National Science Data Fabric (NSDF) Distinguished Speaker Series
The Globus team has spent more than a decade developing software-as-a-service methods for research data management, available at globus.org. Globus transfer, sharing, search, publication, identity and access management (IAM), automation, and other services enable reliable, secure, and efficient managed access to exabytes of scientific data on tens of thousands of storage systems. For developers, flexible and open platform APIs reduce greatly the cost of developing and operating customized data distribution, sharing, and analysis applications. With 200,000 registered users at more than 2,000 institutions, more than 1.5 exabytes and 100 billion files handled, and 100s of registered applications and services, the services that comprise the Globus platform have become essential infrastructure for many researchers, projects, and institutions. I describe the design of the Globus platform, present illustrative applications, and discuss lessons learned for cyberinfrastructure software architecture, dissemination, and sustainability.
Video is at https://www.youtube.com/watch?v=p8pCHkFFq1E
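As one concrete, hedged illustration of the platform APIs described above: the Globus Python SDK exposes the transfer service programmatically. The sketch below assumes you already hold a transfer access token from a Globus Auth flow; the token, endpoint UUIDs, and paths are placeholders to be replaced with your own.

```python
# Sketch of submitting a managed transfer via the Globus Python SDK (globus_sdk).
# TRANSFER_TOKEN, endpoint UUIDs, and paths are placeholders you would supply yourself.
import globus_sdk

TRANSFER_TOKEN = "..."                          # obtained via a Globus Auth flow
SRC = "SOURCE-ENDPOINT-UUID"                    # placeholder source endpoint/collection
DST = "DESTINATION-ENDPOINT-UUID"               # placeholder destination endpoint/collection

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: label it, request checksum verification, add a file.
tdata = globus_sdk.TransferData(tc, SRC, DST,
                                label="example transfer",
                                sync_level="checksum")
tdata.add_item("/project/raw/scan_0001.h5", "/archive/scan_0001.h5")

task = tc.submit_transfer(tdata)
print("submitted task:", task["task_id"])
```

Once submitted, the service manages retries, integrity checking, and notification without further involvement from the client, which is the "managed access" point the abstract makes.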
Daniel Lopresti, Bill Gropp, Mark D. Hill, Katie Schuman, and I put together a white paper on "Building a National Discovery Cloud" for the Computing Community Consortium (http://cra.org/ccc). I presented these slides at a Computing Research Association "Best Practices on using the Cloud for Computing Research Workshop" (https://cra.org/industry/events/cloudworkshop/).
Abstract from White Paper:
The nature of computation and its role in our lives have been transformed in the past two decades by three remarkable developments: the emergence of public cloud utilities as a new computing platform; the ability to extract information from enormous quantities of data via machine learning; and the emergence of computational simulation as a research method on par with experimental science. Each development has major implications for how societies function and compete; together, they represent a change in technological foundations of society as profound as the telegraph or electrification. Societies that embrace these changes will lead in the 21st Century; those that do not, will decline in prosperity and influence. Nowhere is this stark choice more evident than in research and education, the two sectors that produce the innovations that power the future and prepare a workforce able to exploit those innovations, respectively. In this article, we introduce these developments and suggest steps that the US government might take to prepare the research and education system for its implications.
Big Data, Big Computing, AI, and Environmental Science (Ian Foster)
I presented to the Environmental Data Science group at UChicago, with the goal of getting them excited about the opportunities inherent in big data, big computing, and AI--and to think about how to collaborate with Argonne in those areas. We had a great and long conversation about Takuya Kurihana's work on unsupervised learning for cloud classification. I also mentioned our work making NASA and CMIP data accessible on AI supercomputers.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and "where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Data Tribology: Overcoming Data Friction with Cloud Automation (Ian Foster)
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
Research Automation for Data-Driven Discovery (Ian Foster)
Talk presented at Workshop on Maximizing the Scientific Return of NASA Data. Makes the case that automation and outsourcing of data management tasks to cloud services is essential for effective data-driven discovery. Describes how the Globus research data management platform addresses this need.
Scaling collaborative data science with Globus and Jupyter (Ian Foster)
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
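A hedged sketch of what "existing institutional credentials" plus "data residing on disparate storage systems" can look like inside a notebook cell: a native-app Globus Auth login followed by a directory listing on a remote collection, using the Globus Python SDK. The client ID and collection UUID are placeholders, and production notebook deployments typically automate the login step rather than pasting a code by hand.

```python
# Notebook-style sketch: log in with Globus Auth, then list a remote collection.
# CLIENT_ID and ENDPOINT are placeholders; the authorization code is pasted back manually.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
ENDPOINT = "YOUR-COLLECTION-UUID"

auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in here and paste the code below:", auth_client.oauth2_get_authorize_url())

code = input("Authorization code: ").strip()
tokens = auth_client.oauth2_exchange_code_for_tokens(code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# List files on the remote collection as the logged-in user.
for entry in tc.operation_ls(ENDPOINT, path="/"):
    print(entry["type"], entry["name"])
```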
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
Going Smart and Deep on Materials at ALCF (Ian Foster)
As we acquire large quantities of science data from experiment and simulation, it becomes possible to apply machine learning (ML) to those data to build predictive models and to guide future simulations and experiments. Leadership Computing Facilities need to make it easy to assemble such data collections and to develop, deploy, and run associated ML models.
We describe and demonstrate here how we are realizing such capabilities at the Argonne Leadership Computing Facility. In our demonstration, we use large quantities of time-dependent density functional theory (TDDFT) data on proton stopping power in various materials maintained in the Materials Data Facility (MDF) to build machine learning models, ranging from simple linear models to complex artificial neural networks, that are then employed to manage computations, improving their accuracy and reducing their cost. We highlight the use of new services being prototyped at Argonne to organize and assemble large data collections (MDF in this case), associate ML models with data collections, discover available data and models, work with these data and models in an interactive Jupyter environment, and launch new computations on ALCF resources.
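To show the "simple linear models to complex artificial neural networks" range concretely, here is a small, self-contained sketch using scikit-learn. The features and targets below are synthetic placeholders standing in for stopping-power records, not MDF's actual TDDFT data, and the model choices are illustrative only.

```python
# Sketch: fit a linear model and a small neural network as surrogate models.
# The data are synthetic placeholders standing in for stopping-power records.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))        # fabricated features (e.g., velocity, density, charge)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] ** 2 + 0.3 * X[:, 2] + rng.normal(0, 0.05, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_train, y_train)
net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                   random_state=0).fit(X_train, y_train)

print("linear model R^2:", linear.score(X_test, y_test))
print("neural network R^2:", net.score(X_test, y_test))
```

In the workflow the abstract describes, models like these would be trained on curated MDF collections and then consulted to decide which new simulations are worth running.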
Software Infrastructure for a National Research Platform (Ian Foster)
A presentation at the First National Research Platform workshop. "The purpose of this workshop is to bring together representatives from interested institutions to discuss implementation strategies for deployment of interoperable Science DMZs at a national scale." I present eight desirable properties for a software infrastructure for such a platform, and describe our experience realizing these properties in the Globus system.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... (Ian Foster)
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
2. New tools are needed to answer the most pressing scientific questions
[Slide graphic: scientific domains paired with driving questions]
Domains: Life Sciences and Biology; Advanced Materials; Condensed Matter Physics; Chemistry and Catalysis; Soft Materials; Environmental and Geo Sciences.
Driving questions:
- Can we determine pathways that lead to novel states and nonequilibrium assemblies?
- Can we observe – and control – nanoscale chemical transformations in macroscopic systems?
- Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?
- Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?
- Can we unravel the secrets of biological function – across length scales?
- Can we understand physical and chemical processes in the most extreme environments?
3. The resulting data deluge
Spans biology, climate, cosmology, materials, physics, urban sciences, …
- Simulation data: petascale-to-exascale simulations; simulation datasets as laboratories; high-throughput characterization; etc.
- Experimental data: light sources, genome sequencing, next-gen ARM radar, sky surveys, high-throughput experiments, etc.
New research methods depend on coupling (1) computation and experiment (inverse problems, computer control) and (2) data sources and types (knowledge integration, analysis).
4. Scientific progress requires collaborative discovery engines
[Diagram (credit: Rick Stevens): a discovery loop linking problem specification, modeling and simulation, analysis & visualization, experimental design, high-throughput experiments, informatics, and analysis, all built around integrated databases.]
5. Example: A discovery engine for disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne.
[Diagram: experimental scattering from a sample (material composition, e.g., La 60% / Sr 40%) is compared with simulated scattering from a simulated structure to detect errors (seconds to minutes); knowledge-driven decision making and evolutionary optimization select new experiments (minutes to hours); simulations driven by experiments run for minutes to days; results contribute to a knowledge base of past experiments, simulations, literature, and expert knowledge.]
7. Eliminating data friction is essential to modern science
"Civilization advances by extending the number of important operations which we can perform without thinking about them." (Whitehead, 1912)
Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time. (DOE reports, 2005-2015)
8. Software as a service (SaaS) as lubricant
Customer relationship management (CRM) is a knowledge-intensive process, historically handled manually or via expensive, inflexible on-premise software.
SaaS has revolutionized how CRM is consumed: outsource to a provider who runs the software on the cloud, and access it via simple interfaces.
[Chart: SaaS vs. on-premise compared on ease of use, cost, and flexibility.]
9. Globus: Research data management as a service (globus.org)
Essential research data management services: file transfer, data sharing, data publication, identity and groups.
- Builds on 15 years of DOE research
- Outsourced and automated: high availability, reliability, performance, scalability
- Convenient for casual users (web interfaces), power users (APIs), and administrators (install, manage)
10. “I need to easily, quickly, & reliably move data to other locations.”
[Diagram: data moving among a research computing HPC cluster, a lab server, a campus home filesystem, a desktop workstation, a personal laptop, a DOE supercomputer, and the public cloud.]
11. “I need to get data from a scientific instrument to my analysis system.”
[Diagram: instruments such as a next-gen sequencer, a light sheet microscope, an MRI scanner, and the Advanced Light Source feeding an analysis system.]
12. “I need to easily and securely share my data with my colleagues.”
13. Globus and the research data lifecycle
1. Researcher initiates a transfer request from an instrument or compute facility, or the request is made automatically by a script or science gateway.
2. Globus transfers the files reliably and securely.
3. Researcher selects files to share, selects a user or group, and sets access permissions.
4. Globus controls access to the shared files on existing storage; there is no need to move files to cloud storage.
5. Collaborator logs in to Globus and accesses the shared files (for example, from a personal computer); no local account is required, and downloads go via Globus.
6. Researcher assembles a data set and describes it using metadata (Dublin Core and domain-specific).
7. Curator reviews and approves; the data set is published to a publication repository on campus or another system.
8. Peers and collaborators search for and discover data sets, then transfer and share them using Globus.
Stages: Transfer, Share, Publish, Discover
• SaaS: only a web browser is required
• Use the storage system of your choice
• Access using your campus credentials
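For step 1 ("requested automatically by script, science gateway") and steps 3-5, here is a minimal sketch of what such automation might look like using the Globus Python SDK (globus_sdk). The SDK shown is the current one, not necessarily what existed at the time of this talk, and every endpoint, token, and identity value is a placeholder:

    import globus_sdk

    # Placeholders: substitute real values
    TRANSFER_TOKEN = "..."       # OAuth2 access token for the Globus transfer service
    INSTRUMENT_ENDPOINT = "..."  # UUID of the instrument's Globus endpoint
    COMPUTE_ENDPOINT = "..."     # UUID of the compute facility's endpoint
    SHARED_ENDPOINT = "..."      # UUID of a Globus shared endpoint
    COLLABORATOR_ID = "..."      # Globus identity UUID of a collaborator

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN))

    # Steps 1-2: a script- or gateway-initiated, managed transfer from
    # instrument to compute facility, with checksum-based verification
    tdata = globus_sdk.TransferData(tc, INSTRUMENT_ENDPOINT, COMPUTE_ENDPOINT,
                                    label="beamline run", sync_level="checksum")
    tdata.add_item("/detector/run/", "/scratch/run/", recursive=True)
    task = tc.submit_transfer(tdata)
    print("transfer task:", task["task_id"])

    # Steps 3-5: grant a collaborator read access to data left in place on a shared endpoint
    tc.add_endpoint_acl_rule(SHARED_ENDPOINT, {
        "DATA_TYPE": "access",
        "principal_type": "identity",
        "principal": COLLABORATOR_ID,
        "path": "/run/",
        "permissions": "r",
    })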
14. Globus at a glance
4 major services
13 national labs use Globus services
100 PB transferred
8,000 active endpoints
20 billion files processed
>300 users active daily
25,000 registered users
99.95% uptime over the past two years
>30 subscribers
The biggest transfer to date is 1 petabyte
The longest-running transfer to date took 3 months
We’re eager to learn what you want to do with Globus services
20. Response has been gratifying
"Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory
"Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates, Professional Research Assistant, NOAA
“…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing
"... we just had a 153 TB transfer that got 20 Gb/s and another with 144 TB at 25 Gb/s! That's pretty insane!" - Jason Alt, Systems Management and Development Lead, National Center for Supercomputing Applications
"We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory
"The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user
"I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley National Laboratory
"We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator, Oak Ridge National Laboratory
"Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user
"The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC support team
21. Globus service APIs serve as a science platform
[Figure: the Globus platform stack. Globus APIs and Globus Connect, built atop the Globus Toolkit, expose identity, group, and profile management; file transfer and replication; file sharing; and data publication and discovery to science applications.]
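To illustrate what "APIs as a platform" means in practice, a portal or gateway can call the transfer service directly over HTTPS. The sketch below uses Python's requests library against the current transfer API; the bearer token and endpoint UUID are placeholders, and the two calls shown (directory listing and task listing) are only a small sample of what the platform exposes:

    import requests

    TOKEN = "..."     # bearer token obtained via Globus Auth (placeholder)
    ENDPOINT = "..."  # UUID of a Globus endpoint (placeholder)
    BASE = "https://transfer.api.globus.org/v0.10"
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    # List a directory on a Globus endpoint, e.g., to render a file browser in a portal
    listing = requests.get(f"{BASE}/operation/endpoint/{ENDPOINT}/ls",
                           headers=HEADERS, params={"path": "/data/"})
    for entry in listing.json()["DATA"]:
        print(entry["type"], entry["name"])

    # Enumerate the user's recent transfer tasks, e.g., to show status in a dashboard
    tasks = requests.get(f"{BASE}/task_list", headers=HEADERS,
                         params={"limit": 10, "orderby": "request_time DESC"})
    for task in tasks.json()["DATA"]:
        print(task["task_id"], task["status"], task["label"])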
25. Operating a sustainable service
Globus is a not-for-profit service for researchers.
We adopt a subscription-supported freemium model: subscribers get extra features and rapid support.
We are engaged in crossing the chasm.
Support from DOE will contribute to long-term success.
27. Q: What is the biggest obstacle to data sharing in science?
A: The vast majority of data is lost or never put online; if online, it is not described; if described, it is not indexed. It is therefore not accessible, not discoverable, and not used.
Contrast this with common practice for consumer photos (iPhoto): automated capture, publish then curate, processing to add value, outsourced storage.
28. We must automate the capture, linking, and indexing of all data
The Globus publication service encodes and automates data publication pipelines.
Example application: the Materials Data Facility for materials simulation and experiment data.
Proposed distributed virtual collections index, organize, tag, and manage distributed data.
Think iPhoto on steroids, backed by domain knowledge and supercomputing power.
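To make the notion of an automated publication pipeline concrete, here is a hypothetical metadata record of the kind a collection such as the Materials Data Facility might require. The field names and values are illustrative only; each collection defines its own required Dublin Core and domain-specific metadata:

    # Hypothetical submission record for a materials-data collection.
    # Every field name and value below is illustrative, not an actual schema.
    dataset_record = {
        "dublin_core": {
            "title": "Diffuse scattering from a manganite sample",
            "creator": ["A. Researcher (ORCID: 0000-0000-0000-0000)"],
            "date": "YYYY-MM-DD",
            "relation": "doi:10.xxxx/accompanying-paper",  # related publication
        },
        "domain_metadata": {
            "keywords": ["diffuse scattering", "disordered structures"],
            "experiment": "APS single-crystal diffuse scattering",
            "sponsors": ["DOE Office of Science"],
            "proposal_number": "GUP-00000",  # APS General User Proposal number
        },
    }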
29. We must automate the capture, linking, and indexing of all data
chiDB: human-computer collaboration to extract Flory-Huggins (𝞆) parameters from the polymers literature. (R. Tchoua et al.)
Plenario: a spatially and temporally integrated, linked, and searchable database of urban data. (C. Catlett, B. Goldstein, T. Malik et al.)
30. “I need to publish my data so that others can find it and use it.”
[Figure: publication destinations ranging from scholarly publications and reference datasets to the research community and collaborations.]
46. Recall: A discovery engine for disordered structures
[Figure: the closed-loop discovery engine shown earlier. A sample's experimental scattering and material composition (La 60% / Sr 40%) are compared against a simulated structure and its simulated scattering; errors are detected in seconds to minutes, new experiments are selected in minutes to hours, and experiment-driven simulations run in minutes to days. A knowledge base of past experiments, simulations, literature, and expert knowledge supports knowledge-driven decision making and evolutionary optimization, and each cycle contributes back to the knowledge base. Diffuse scattering images from Ray Osborn et al., Argonne.]
47. Towards discovery engines for energy science (Argonne LDRD)
[Figure: a coupled experiment-computation pipeline. Data analysis (reconstruction, feature detection, auto-correlation, particle distributions) feeds simulation (characterize, predict, assimilate, steer data acquisition) through integration steps (optimize, fit, configure, check, guide). Workloads run in batch or immediate mode across 0.001 to 100+ PFlops, with representative tasks including precomputing material databases, image reconstruction, auto-correlation, and feature detection. Science automation services provide scripting, security, storage, cataloging, and transfer. Today's flows are roughly 0.001-0.5 GB/s each, with ~10 concurrent flows, ~2 GB/s total burst, and ~200 TB/month; all are expected to grow tenfold within five years.]
Scientific opportunities: probe material structure and function at unprecedented scales.
Technical challenges: many experimental modalities; data rates and computation needs that vary widely and are increasing; knowledge management, integration, and synthesis.
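A quick back-of-the-envelope check of these figures (illustrative arithmetic only, using the approximate numbers from the slide) shows why the infrastructure must be sized for bursts rather than for the sustained average:

    GB = 1e9
    TB = 1e12

    # Quoted figures: ~0.001-0.5 GB/s per flow, ~10 concurrent flows,
    # ~2 GB/s total burst, ~200 TB/month
    peak_if_all_flows_max = 10 * 0.5 * GB   # 5 GB/s if every flow hit its maximum
    burst = 2 * GB                          # quoted aggregate burst

    monthly_volume = 200 * TB
    seconds_per_month = 30 * 24 * 3600
    sustained = monthly_volume / seconds_per_month   # average rate implied by 200 TB/month

    print(f"sustained average: {sustained / GB:.3f} GB/s")            # ~0.077 GB/s
    print(f"fraction of burst rate: {sustained / burst:.1%}")         # ~4% duty cycle
    print(f"peak if all flows max out: {peak_if_all_flows_max / GB:.1f} GB/s")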
48. Linking experiment and computation
Single-crystal diffuse scattering (APS 6-ID): defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Structure is estimated via inverse modeling: many-simulation evolutionary optimization (populate, simulate, select) on 100K+ BG/Q cores (Swift + OpenMP). [Image: experimental and simulated scattering from manganite]
Near-field high-energy X-ray diffraction microscopy (APS 1-ID): microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on the APS cluster or months if the data are taken home. In one run it detected errors that would otherwise have wasted the entire beamtime. [Image: microstructure of a copper wire, 0.2 mm diameter]
X-ray nano/microtomography (APS 2-BM): bio, geo, and materials science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). An innovative in-slice parallelization method reconstructs a 360x2048x1024 dataset in ~1 minute on 32K BG/Q cores, vs. many days on a cluster, enabling quasi-instant response.
49. Tying it all together: An energy sciences infrastructure
0. Researchers develop a new script or reuse one from shared script libraries.
1. Run the script (here, on EL1.layer).
2. Look up the file in the collaboration catalogs (name=EL1.layer, user=Anton, type=reconstruction) to resolve its storage locations.
3. Transfer the inputs to a compute facility.
4. Run the application.
5. Transfer the results back to storage.
6. Update the collaboration catalogs (files and metadata, provenance), making the results discoverable by external collaborators.
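To make the control flow of steps 0-6 concrete, here is a hypothetical sketch in plain Python. In the actual infrastructure these roles are played by the Swift parallel scripting language, the collaboration catalogs, and Globus transfer; every function below is a stand-in, not a real API:

    # Hypothetical stand-ins for the catalog, transfer, and execution services on the slide.

    def catalog_lookup(name):
        """Step 2: resolve a logical dataset name to its physical storage location."""
        return {"name": name, "location": "storage-site:/datasets/" + name}

    def submit_transfer(src, dst):
        """Steps 3 and 5: managed data movement (in practice, via Globus)."""
        print(f"transfer {src} -> {dst}")

    def run_app(app, inputs):
        """Step 4: run the application at a compute facility."""
        print(f"run {app} on {inputs}")
        return "compute-site:/scratch/EL1.layer.out"

    def catalog_update(record):
        """Step 6: record outputs, metadata, and provenance in the collaboration catalogs."""
        print("catalog record:", record)

    # Steps 1-6 for the invocation shown on the slide (EL1.layer)
    dataset = catalog_lookup("EL1.layer")
    submit_transfer(dataset["location"], "compute-site:/scratch/EL1.layer")
    result = run_app("reconstruction", "compute-site:/scratch/EL1.layer")
    submit_transfer(result, "storage-site:/results/EL1.layer.out")
    catalog_update({"input": dataset["name"], "output": result,
                    "type": "reconstruction", "user": "Anton"})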
50. Summary: Big opportunities and challenges for energy data
[Figure: the discovery-engine cycle of problem specification, modeling and simulation, analysis and visualization, experimental design, high-throughput experiments, informatics and analysis, and integrated databases.]
Immediate opportunities: reduce data friction and accelerate discovery by deploying Globus services across all DOE facilities; develop new services to capture and link energy data.
Important research agenda: discovery engines to answer major scientific questions; new research modalities linking computation and data; organization and analysis of massive science data.
51. Thank you to our sponsors!
U.S. Department of Energy
52. For more information: foster@anl.gov
Thanks to co-authors and the Globus team
Globus services (globus.org)
Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. IEEE Cloud Computing 1(3):46-55, 2014.
Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency and Computation: Practice and Experience 27(2):290-305, 2014.
Publication (globus.org/data-publication)
Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.
Discovery engines
Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.
Editor's Notes
One useful thing
One exciting initiative
New tools are needed to answer the most pressing scientific questions
Accelerate “knowledge turns.”
Unleash the 99% of not-easily-accessible data.
Integrate data and computation.
Fix IMAGE
“Most of materials science is bottlenecked by disordered structures”—Littlewood.
Solve inverse problem.
How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base.
Challenge: takes months to do a single loop through cycle.
Just as important, it is an incredibly labor intensive and expensive process.
Change to OLCF, NERSC
And sys admin response?
Other potential quotes:
"Thanks to Globus it's easy for our users to move big files. ...Globus is an awesome tool that is really helping our user community." - John Hanks, Senior HPC Analyst, University of Colorado
"I am very impressed - Globus is the most beneficial grid technology I have ever seen." - Steven Gottlieb, Indiana University
"With Globus, I’m averaging 40 Mb/s and can even reach 400 Mb/s on occasion – that’s insanely fast!" - Luke Van Roekel, University of Colorado
"I have been very impressed with Globus - both the speed and ease of use." - William Daughton, Los Alamos National Laboratory – User of the Month, May 2012
"Without a service such as Globus it would have been basically impossible to move this large amount of data." - Katrin Heitmann, LANL and Argonne National Laboratory
"The biggest benefit to Globus by far is the auto performance tuning. … Globus is an invaluable tool to me." - Luke Van Roekel, University of Colorado
Highlight XSEDE’s planned adoption of user, group and profile management
RDA: outsource data sharing and transfer
kBase: Outsource identity and group management
The publish dashboard shows all current submissions at any stage of the submission workflow. Here users can view accepted submissions, see a list of all submissions currently in the curation process, view/edit their unfinished submissions, and start a new submission.
"The Scientist" will now start a new submission.
The first step of submission is to select a collection. In this case "The Scientist" selects the “Center for Nanoscale Materials”, as this is the department through which he conducted his research.
Note: "The Scientist" can only see collections he is allowed to publish to.
"The Scientist" must first describe the dataset he is publishing. There are two types of metadata required for submission to the CNM collection: 1) Dublin core and 2) scientific metadata. These metadata requirements are defined by the collection and can be configured depending on the domain. Additional pages can also be defined.
Here, "The Scientist" enters information about the Authors, their ORCID (a unique researcher identity), the submission title, the date of publication, the accompanying publication to which this dataset is related, and the DOI for that publication.
Note: "The Scientist" has missed an ORCID for one of his co-authors.
The second type of metadata required by the CNM relates to the materials science research at the Advanced Photon Source.
Here, "The Scientist" enters information such as keywords describing the dataset, information about the sponsors who funded this research, a description of the dataset, the experiment name, the materials analyzed in this dataset, the energy density of the materials (this is important for research into battery development) and the Argonne General User Proposal (GUP) number. The GUP number is a unique identifier for all beam time allocations at the APS and is used by administrators to associate researchers, experiments, and allocations.
All of this entered information can be subsequently used by other researchers with appropriate access to discover this dataset.
Having described the dataset, "The Scientist" must now assemble the dataset. To do so, he first chooses to select the files to be published.
Using the familiar Globus interface, "The Scientist" is able to select files from multiple sources and transfer them to his unique submission endpoint (publish#submission_11).
This submission endpoint is created on shared Argonne storage resources, but is initially accessible only to "The Scientist"
The dataset may be assembled over any period of time. "The Scientist" can create new files and folders on the endpoint and he can arrange these files in any hierarchy.
At the completion of the submission the permissions on the endpoint will be changed such that the dataset is immutable. "The Scientist” will be given read access to the dataset, collection curators will also be given read access to the data so that they can view the contents.
When "The Scientist" is happy with his assembled dataset, he can return to the publication workflow. Here, he sees a summary of the dataset and may confirm the correct file sizes and names are associated. The system attempts to determine the file types for each of the dataset’s files.
"The Scientist" can choose to edit, remove or add files if necessary.
When submitted, the dataset now enters a pre-determined curation workflow. "The Scientist” can check the progress of the submission through his dashboard. If any further attention is required, it will be displayed through his dashboard.
“The Researcher” chooses to search for all published data in the CNM collection. The results show a brief summary of each published dataset including information about the publication time, collection, summary of number of files, name, authors, description and a set of keyword tags as well as key-value tags.
Each of these fields can be used to search for a particular dataset.
Knowing that other collections may well have datasets of interest, “The Researcher” may broaden the search context to all accessible collections and search for datasets related to “Li-ion” and “autonomic”. Here, the results show datasets from two collections: the CNM and the Chemical Sciences and Engineering collection (red boxes). Results are ranked according to their relevance to the search.
Going further, “The Researcher” can use different queries such as key-value and ranges. In this case, “The Researcher” searches for energy density > 1500 and microcapsules, and finds the dataset previously published in this demo with an associated key-value pair of energy-density:2000 that fits the range query criteria.
Having found the desired published dataset, “The Researcher” can navigate to the summary page.
Finally, “The Researcher” can view the downloaded dataset on their desktop PC.
Description: another aspect - general metadata (Dublin Core) and scientific metadata
Curation: another aspect – self, project owner, librarian
Fix IMAGE
“Most of materials science is bottlenecked by disordered structures”—Littlewood.
Solve inverse problem.
How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base.
Challenge: takes months to do a single loop through cycle.
Just as important, it is an incredibly labor intensive and expensive process.
Add CNM
- Innovative in-slice parallelization method permits reconstruction of 720x2160x2560 dataset (7-BM) in less than 3 minutes (for each iteration), using 34K BG/Q cores, vs. many days on typical cluster.
Innovative in-slice parallelization for iterative algorithms permits large-scale image reconstruction. Execution times are reduced to minutes for many large datasets and algorithms using 32K BG/Q cores, vs. many days on typical cluster.
This diagram shows the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity.
It shows the major components of the CMTS cyberinfrastructure we are integrating.
Here’s how CMTS will use it and how it will help them.
0. Develop script
A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services.
The Swift parallel scripting language is central to our approach, because it imparts a uniform, high-level interface to script components, and it is implicitly parallel.
Run script: When Swift runs a script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) it manages parallel execution (dataflow, throttling, etc.); 2) it abstracts the interfaces to diverse and distributed clusters; 3) it automates data transfer; 4) it records provenance; 5) it retries failing application runs. All of these would otherwise have to be programmed manually, if they were done at all.
Locate input file locations via dataset catalog: CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated.
Transfer inputs: Swift will automatically transport input datasets to the selected computational resource for an application run (if needed).
Run app: Swift will then run the application, retrying failures (if requested) and recording a provenance log that traces where the app ran, with what runtime and memory usage, and with what arguments and environment settings.
Transfer results: Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed).
Update catalogs: Swift will also update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs.
Collaborate! (2 clicks): All of this facilitates collaboration, both by project team members and by external collaborators, whether across the hall or across the world.
Talk about Globus as being part of UChicago + ANL, as well as other context setting about how this work came about and is funded.