This document discusses scientific workflows and the author's experience working with them. It begins by describing how the author got involved with scientific workflows through the SciDAC SDM Center. It then discusses how workflows help scientists by automating tasks to allow them to focus on science. The author describes some early challenges with workflows not being portable across machines. Finally, it discusses the need to better integrate workflows into the broader scientific process and allow scientists to more easily collect and share information about experiments.
2. This talk will provide answers to 4 questions:
Why did I get involved with scientific workflows?
How do scientific workflows help scientists?
What problems did I find when I first started working with scientific workflows?
Can scientific workflows be effectively integrated into the broader scientific process?
3. I became involved with scientific workflows through the SciDAC SDM Center
The Scientific Discovery through Advanced Computing (SciDAC) program was funded by DOE starting in 2001 with the goal of advancing scientific computing by having CS and domain science teams work together to address science questions using new HPC platforms
Application initiatives were funded in areas such as combustion, fusion, astrophysics, and groundwater
CS and math centers were funded in areas critical to the development of new, scalable capabilities, including solvers, AMR, visualization, performance, and data management
Focus was on science, not CS research
4. The Scientific Data Management (SDM) Center was the focal point for DOE data management activities
A large, multi-institutional collaboration
Led by Arie Shoshani (LBL)
5 labs and 5 universities
Funded for 10 years; the project concluded in 2011
The center had 3 research thrusts:
Storage and efficient access (Rob Ross, ANL)
Data Mining and Analysis (Nagiza Samatova, NCSU)
Scientific Process Automation (Terence Critchlow, PNNL)
The goal of the SPA team was to develop and deploy technology that would allow scientists to spend more time on science by reducing the data management overhead
Workflows had filled that niche in business but, in 2001, there was little usage in science applications
5. As lead for the SPA team, I had both management and research responsibilities
Team of 10-15 spread across NCSU, Univ. of Utah, UC Davis, SDSC, ORNL, and PNNL
Identify relevant technology
Work with science teams to design and deploy solutions
Identify areas requiring additional research
Perform research to improve the existing capabilities for our target customers
6. Workflow technology was selected because time-consuming, repetitive tasks dominate day-to-day computational science activity
By automating mundane tasks, we allow scientists to focus on science, not data management
Needed a general-purpose workflow engine that we could apply to an HPC-centric environment
Act as the orchestrator, coordinating the workflow execution
Allow processing of larger data sets
Support scientific reproducibility
Reduce waste of resources by allowing timely corrective action to be taken
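To make the orchestration idea concrete, here is a minimal sketch (illustrative only, not SDM Center or Kepler code) of a workflow expressed as tasks plus dependencies, with the engine's core job being to run each task only after its inputs exist. The task names and bodies are hypothetical stand-ins for real simulation and data-management steps.

# A workflow as tasks plus dependencies, run in dependency order.
from graphlib import TopologicalSorter

def submit_job():      print("submit batch request")
def monitor_job():     print("wait for job to finish")
def transfer_files():  print("move output files to archive")
def generate_plots():  print("produce diagnostic plots")

tasks = {
    "submit":   ([],           submit_job),
    "monitor":  (["submit"],   monitor_job),
    "transfer": (["monitor"],  transfer_files),
    "plots":    (["transfer"], generate_plots),
}

# The engine orchestrates: topological order guarantees each task's
# predecessors have completed before the task itself runs.
order = TopologicalSorter({name: deps for name, (deps, _) in tasks.items()})
for name in order.static_order():
    tasks[name][1]()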
7. The SDM Center was one of the founding organizations of the Kepler Consortium
In 2001 there were no widely used scientific workflow engines
Kepler is an open source workflow environment, based on the Ptolemy II system developed at UC Berkeley
Started with several projects coming together based on a need for a flexible workflow environment
Kepler-project.org
Kepler has become one of the best known and most widely used scientific workflow engines
8. This talk focuses on work that I was directly involved in
There was a lot of work performed by the SDM Center team that I managed but do not focus on here:
Provenance tracking
Dashboard
Templates
Patterns
Deployed workflows (ITER, CPES, combustion)
https://sdm.lbl.gov/sdmcenter/
My research focused on raising the level of abstraction within scientific workflows
9. Our first deployed workflow was managing a bioinformatics analysis pipeline (2002)
In collaboration with Matt Coleman (LLNL)
10. The TSI workflow was the first of our "standard" simulation workflows (2005)
In collaboration with Doug Swesty (Stony Brook)
(Workflow diagram: submit batch request at NERSC; check job status, delaying while queued and updating a web page while running; identify new complete files; transfer files to HPSS and to Stony Brook, checking that each transfer completed correctly before deleting the file; extract and get variables; remap coordinates; create chem vars and neutrino vars; derive other vars; write a diagnostic file; generate plots with Tool-1 through Tool-4; generate thumbnails and a movie)
11. The workflow can be broken into several general steps: Job Submission
12. The workflow can be broken into several general steps: Job Monitoring
13. The workflow can be broken into several general steps: Moving files
14. The workflow can be broken into several general steps: Data Analysis
(Slides 11-14 each repeat the slide 10 diagram with the named group of steps highlighted.)
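Taken together, the four phases form a simple submit / poll / archive / analyze loop. Here is a minimal sketch of that control flow, assuming purely for illustration a Slurm-style scheduler (sbatch/squeue) and the HPSS hsi client, with hypothetical script and file names; the 2005 workflow used the site-specific tools of the day.

import subprocess, time

def submit(script):
    # Job Submission: hand the batch script to the scheduler; the job id
    # is the last token of the "Submitted batch job NNN" reply.
    out = subprocess.run(["sbatch", script], capture_output=True, text=True, check=True)
    return out.stdout.split()[-1]

def is_active(job_id):
    # Job Monitoring: a non-empty queue listing means queued or running.
    out = subprocess.run(["squeue", "-h", "-j", job_id], capture_output=True, text=True)
    return bool(out.stdout.strip())

def archive(path):
    # Moving files: copy to HPSS, and delete the original only if the
    # transfer completed correctly (check=True raises on failure).
    subprocess.run(["hsi", "put", path], check=True)
    subprocess.run(["rm", path], check=True)

job = submit("tsi_run.sl")                      # hypothetical batch script
while is_active(job):
    time.sleep(60)                              # delay between status checks
for f in ["chem_vars.nc", "neutrino_vars.nc"]:  # stand-ins for "new complete files"
    archive(f)
# Data Analysis would follow: extract variables, remap coordinates,
# derive quantities, and generate plots / thumbnails / a movie.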
15. This translates into a complicated Kepler workflow
Extensive use of nested workflows to compartmentalize steps
160 instances of 18 distinct actors
Over a dozen parameters to control workflow execution
16. We ended up building several similar simulation science workflows
Fusion science
Combustion
Subsurface science
These all have the same general steps, but there are significant differences in the details
17. Unfortunately, workflows are not typically portable across machines
User authentication mechanisms depend on machine-specific policies
Job launch and monitoring features depend on the scheduler
File transfer mechanisms depend on available infrastructure
18. We developed generic actors as the first step in raising the level of abstraction for workflow design
Generic actors embody general functionality into actors that work across platforms / workflows
Improve workflow portability
Simplify creation of new workflows
Form the basis for sharing subworkflows
Reduce the number of actor choices
19. We identified several capabilities required across simulation workflows
User authentication
Job submission: submit a job scheduling request to the batch scheduler
Job monitoring: track the status of a job from submitted, to running, to completed
File transfer: move files, potentially between machines at different sites
We developed and deployed actors capable of performing the desired functionality using available infrastructure
Generalized to manage multiple implementations
Parameters and contextual information determine which options to utilize
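One way to picture a generic actor (a sketch with hypothetical class names, not actual Kepler actor APIs) is as a single workflow-facing operation that dispatches to a machine-specific implementation chosen from parameters and context.

# A "generic actor": one interface, multiple machine-specific
# implementations selected by contextual information.
class FileTransfer:
    def transfer(self, src, dst): raise NotImplementedError

class ScpTransfer(FileTransfer):
    def transfer(self, src, dst): print(f"scp {src} {dst}")

class GridFtpTransfer(FileTransfer):
    def transfer(self, src, dst): print(f"globus-url-copy {src} {dst}")

IMPLEMENTATIONS = {"scp": ScpTransfer, "gridftp": GridFtpTransfer}

def generic_transfer(src, dst, context):
    # Context (machine policies, available infrastructure) determines
    # which concrete mechanism the generic actor uses.
    impl = IMPLEMENTATIONS[context.get("transfer_mechanism", "scp")]()
    impl.transfer(src, dst)

generic_transfer("run42/out.nc", "hpss:/archive/out.nc", {"transfer_mechanism": "gridftp"})

Because the workflow references only the generic operation, moving to a new machine means supplying different context rather than rewiring actors.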
20. Use of generic actors improved workflow effectiveness
The same workflow could be used on all of the DOE leadership-class machines
Significantly less maintenance required
Fewer workflows needed per science team
Each workflow is simpler
Still requires parameters to manage details of execution
21. Workflow context can be used to reduce the number of explicit parameters
Workflows run in a context that provides certain preferences: systems, user accounts, configuration files
Information is requested / computed / bound at run time instead of design time
Initial results are promising, but more work is required to determine how effective run-time binding is for workflows
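A minimal sketch of run-time binding, assuming a simple layered lookup (explicit setting, then user configuration, then system defaults); the parameter and directory names here are hypothetical, not the actual implementation.

# Resolve a workflow parameter at run time from layered context,
# instead of hard-wiring it into the workflow at design time.
SYSTEM_DEFAULTS = {"scratch_dir": "/scratch", "scheduler": "pbs"}

def bind(param, explicit=None, user_config=None):
    for layer in (explicit or {}, user_config or {}, SYSTEM_DEFAULTS):
        if param in layer:
            return layer[param]
    raise KeyError(f"parameter {param!r} unbound in any context layer")

# At design time the workflow only names the parameter; the value is
# bound when the workflow runs on a particular machine for a user.
print(bind("scratch_dir", user_config={"scratch_dir": "/global/u/dsw"}))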
22. Scientific workflows still have major adoption challenges to overcome
The correlation between the scientific process and the executable workflow is loose at best
Executable workflows are extremely complex and usually require a dedicated workflow designer to create
The translation from idea, to napkin drawing, to executable workflow is challenging and lossy
23. The scientific process is collaborative, fluid, and time sensitive
Important decisions are made in meetings and conversations
Records are distributed and not easily associated with specific tasks
Decisions can be revisited and changed; science is inherently iterative
Executable workflows document the results of these decisions, but lack the broader context
Electronic lab notebooks provide some contextual information, but lack details and external information / links
Provenance provides some associations
24. Need a way to allow scientists to collect and share information about their experiments
A single location capable of collecting all relevant information about an experiment
Information needs to be related in a meaningful way
Temporal information must be preserved
Working with collaborators at UTEP, we developed a prototype of what this could look like
25. Our prototype is built on annotating abstract workflows
Design principles:
Workflow construction needs to be a byproduct of information collection
Information should not need to be entered more than once
Annotations should relate to specific steps in the process
26. The research hierarchy contains the steps in the abstract workflow
Steps are conceptual
At the top level, these outline the major steps in the experiment being performed:
Get data
Create conceptual model
Generate model input
Run simulation
Each step can have sub-steps within it to refine the concept further (nested structure)
27. Free-form text is associated with each step in the hierarchy
Allows scientists to easily describe a step's purpose
The top level describes the entire experiment
Decisions are captured under the research specs tab
28. The process view shows the steps as a workflow
Ports are used to identify inputs and outputs
Lines between steps indicate information flow between steps
Steps should, eventually, connect
29. Steps are connected by linking input and output parameters (ports)
Inputs and outputs are linked
A comment field holds assumptions and constraints from the "other side" of the line
Free-form text makes it easy to input information, but impossible to perform automatic verification
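The annotation model of slides 26-29 could be captured by a small data structure along these lines (a sketch; the class and field names are my own, not the prototype's).

# Conceptual steps with free-form text, nested sub-steps, input/output
# ports, and links that connect an output port to an input port and
# carry a comment holding assumptions and constraints.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    description: str = ""                         # free-form text describing purpose
    inputs: list = field(default_factory=list)    # input port names
    outputs: list = field(default_factory=list)   # output port names
    substeps: list = field(default_factory=list)  # nested Steps

@dataclass
class Link:
    src: tuple          # (step name, output port)
    dst: tuple          # (step name, input port)
    comment: str = ""   # assumptions and constraints, free-form

experiment = Step("Experiment", "Top level describes the entire experiment",
    substeps=[
        Step("Get data", outputs=["field data"]),
        Step("Create conceptual model", inputs=["field data"], outputs=["model"]),
        Step("Generate model input", inputs=["model"], outputs=["input deck"]),
        Step("Run simulation", inputs=["input deck"]),
    ])
links = [Link(("Get data", "field data"),
              ("Create conceptual model", "field data"),
              comment="assumes data has already been quality-checked")]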
30. Zooming in on a specific sub-step provides additional information about that step
A new tab provides (sub-)step-specific information
The process view is updated to reflect the sub-steps contained within this step
Note that the inputs and outputs to the workflow come from the higher-level workflow
31. Eventually, some steps correspond to executable (Kepler) workflows
The prototype expands the Kepler infrastructure
Executable workflows are (still) typically created by a dedicated workflow designer
This places the executable workflow in the broader context of the experiment it is supporting
Provenance can be linked into the overall experiment
32. Annotations are stored in RDF to support export / import
The Semantically Interlinked Online Communities (SIOC) format was chosen
Supports other tools using these annotations:
Report generation
Search / query
Experiment-level provenance information
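As an illustration of the export path, here is how one such step annotation might be serialized with rdflib using the SIOC vocabulary. This is an approximation: the base URI, the choice of sioc:Post and sioc:content, and the step name are assumptions, not the prototype's actual schema.

# Serialize a step annotation as RDF triples in the SIOC vocabulary.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SIOC = Namespace("http://rdfs.org/sioc/ns#")
EX = Namespace("http://example.org/experiment/")  # hypothetical base URI

g = Graph()
g.bind("sioc", SIOC)
step = EX["run-simulation"]                       # hypothetical step URI
g.add((step, RDF.type, SIOC.Post))
g.add((step, SIOC.content, Literal("Run simulation using the generated input deck")))
print(g.serialize(format="turtle"))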
33. This prototype represents a starting point for answering many interesting questions
How do you effectively link other sources of information into steps in an abstract workflow?
How do you select only the relevant information?
How do you manage provenance and attribution in a distributed environment?
What is the best way to organize this information for people filling a variety of roles? PIs need a different view than workflow designers or bench scientists
How do you effectively share (subsets of) this information?
How do you implement access controls effectively?
34. This prototype represents a starting point for answering many interesting questions
Are workflows the right abstraction for representing the scientific process? Representing evolution over time is challenging in workflows. Does everything have to correspond to a step?
Is there a way to generate parts of an executable workflow given an abstract definition? Can we match steps to specific actors? Could you develop a generic set of wizards or templates?
35. Conclusions
The SDM Center has been at the forefront of scientific workflow R&D
Workflows have been successfully deployed across a wide variety of scientific domains
Significant advances have been made in making workflow engines more reliable and useful
There remains significant work required to:
Fit workflows within the context of the overall scientific process
Allow scientists to design and implement their own workflows
36. This work involved many, many people
My team:
George Chin
Chandrika Sivaramakrishnan
Xiaowen Xin (LLNL)
Anand Kulkarni
Anne Ngu (TX State)
Paulo Pinheiro da Silva (UTEP)
Aida Gandara (UTEP)
Other SPA team members:
Ilkay Altintas
Bertram Ludaescher
Mladen Vouk
Claudio Silva
Scott Klasky
Norbert Podhorszki
Dan Crawl
Ayla Khan
Arie Shoshani
Plus other students and researchers who were involved for shorter times