The document outlines plans for improving access to geoscience data through the EarthCube initiative. It summarizes discussions from a workshop focused on developing use cases and standardizing the data lifecycle. The workshop highlighted that access is crucial to research and identified challenges around social and technical issues. It also described plans for a "DataSpace" platform to enable open, extensible sharing of data and services and emphasized usability over infrastructure development. Lastly, it discussed next steps from concept meetings focused on converging roadmaps, architectures, and future directions.
GeoDataspace: Simplifying Data Management Tasks with Globus (Tanu Malik)
This document describes GeoDataspace, a framework for enabling data and model sharing in computational geosciences. GeoDataspace uses geounits, which package code, data, and environment, to capture scientific activities and research outputs. The Globus Catalog provides a flexible metadata catalog for hosting and querying geounits. GeoDataspace aims to improve reproducibility and enable validation of shared models and data through re-executing geounits. Several geoscience applications including plate tectonics, hydrology, and space science are discussed as potential adopters.
The document proposes a reproducible framework powered by Globus to help researchers share and reproduce scientific models and simulations. It describes challenges with current methods for sharing work, like dependencies or configuration issues, that prevent easy reproducibility. The framework aims to [1] capture scientific activities, code, data and environments; [2] preserve them as standardized packages called "SciUnits"; [3] share and distribute SciUnits so others can [4] re-execute and re-analyze the work without installation or configuration problems. Key components are outlined for establishing this framework to support reproducibility across different scientific domains.
Benchmarking Cloud-based Tagging Services (Tanu Malik)
This document presents a framework for benchmarking cloud-based tagging services. It proposes a tagging data model and generates workloads using a Markov chain process. Experiments were conducted using OLTP-Bench to evaluate the performance of MySQL, Postgres and SQL Server under different tagging workloads and schemas on Amazon EC2. The results show that for sparse datasets, a vertically partitioned schema outperforms horizontal schemas, and the best database depends on the sparseness of the data. The benchmark aims to help determine the most efficient cloud platform for supporting dynamic, sparse tagging data.
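To make the schema trade-off concrete, here is a small self-contained sketch (not the paper's benchmark code) that contrasts a horizontal table, where sparse tags leave most columns NULL, with a vertical entity-attribute-value table, and that generates correlated tags with a tiny first-order Markov chain. The table names, tags, and transition weights are invented for illustration.

```python
# Horizontal vs. vertical schemas for sparse tags, plus a Markov-chain
# workload generator. All names and weights are illustrative.
import random
import sqlite3

conn = sqlite3.connect(":memory:")

# Horizontal schema: one column per tag; sparse data leaves most columns NULL.
conn.execute("CREATE TABLE items_h (id INTEGER PRIMARY KEY, color TEXT, size TEXT, owner TEXT)")

# Vertical (entity-attribute-value) schema: one row per (item, tag) pair.
conn.execute("CREATE TABLE items_v (id INTEGER, tag TEXT, value TEXT)")

# First-order Markov chain over tag keys: the next tag applied depends on
# the previous one, mimicking correlated tagging behavior.
transitions = {
    "color": {"size": 0.6, "owner": 0.4},
    "size": {"color": 0.3, "owner": 0.7},
    "owner": {"color": 0.5, "size": 0.5},
}

def tag_sequence(start="color", n=3):
    seq, state = [start], start
    for _ in range(n - 1):
        options = transitions[state]
        state = random.choices(list(options), weights=list(options.values()))[0]
        seq.append(state)
    return seq

for item_id in range(100):
    for tag in tag_sequence():
        conn.execute("INSERT INTO items_v VALUES (?, ?, ?)", (item_id, tag, "v"))

print(conn.execute("SELECT COUNT(*) FROM items_v").fetchone())
```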
GEN: A Database Interface Generator for HPC Programs (Tanu Malik)
GEN is a database interface generator that takes user-supplied C declarations and provides an interface to load scientific array data into databases without requiring changes to source code. It works by wrapping POSIX I/O calls at runtime to generate database schema definitions and load data. Experiments show it can reduce the time needed to reorganize data in the database compared to loading data from files and reorganizing outside the database. Current work aims to relax GEN's assumptions and improve data loading performance.
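GEN itself interposes on POSIX I/O calls in C programs; as a hedged illustration of just the schema-generation step, the sketch below maps a user-supplied C struct declaration to a SQL CREATE TABLE statement. The type mapping, regular expression, and struct are simplifications invented for this example, not GEN's code.

```python
# Derive a relational schema from a C struct declaration (illustrative only).
import re

C_TO_SQL = {"int": "INTEGER", "long": "BIGINT", "float": "REAL",
            "double": "DOUBLE PRECISION", "char": "TEXT"}

def schema_from_struct(decl: str) -> str:
    table = re.search(r"struct\s+(\w+)", decl).group(1)
    cols = [f"{field} {C_TO_SQL[ctype]}"
            for ctype, field in re.findall(r"(int|long|float|double|char)\s+(\w+)", decl)]
    return f"CREATE TABLE {table} ({', '.join(cols)});"

decl = """
struct particle {
    long id;
    double x;
    double y;
    float mass;
};
"""
print(schema_from_struct(decl))
# CREATE TABLE particle (id BIGINT, x DOUBLE PRECISION, y DOUBLE PRECISION, mass REAL);
```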
LDV: Light-weight Database Virtualization (Tanu Malik)
The document summarizes the Light-weight Database Virtualization (LDV) framework. LDV aims to enable easy and efficient sharing of database applications by capturing an application's execution provenance and dependencies. It uses application virtualization techniques to package the application binaries, libraries, and data. For applications that interact with a database, it also records the interactions between the application and database using system call monitoring and SQL logging. This combined provenance allows recreating the application's execution environment and replaying the database interactions to validate or reproduce results. Key components of LDV include provenance modeling, package creation with necessary files and traces, and runtime redirection to reconstruct the environment.
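As a rough sketch of the record-and-replay idea (not LDV's mechanism, which monitors system calls and the client-server protocol), the snippet below logs every SQL statement issued during a reference run and replays the log against a fresh database.

```python
# Record SQL interactions during a reference run, then replay them.
import json
import sqlite3

class LoggingConnection:
    """Wraps a connection and appends each statement to a JSON-lines trace."""
    def __init__(self, db_path, trace_path):
        self.conn = sqlite3.connect(db_path)
        self.trace = open(trace_path, "a")

    def execute(self, sql, params=()):
        self.trace.write(json.dumps({"sql": sql, "params": list(params)}) + "\n")
        return self.conn.execute(sql, params)

def replay(db_path, trace_path):
    """Re-issue the recorded statements to reconstruct the database state."""
    conn = sqlite3.connect(db_path)
    with open(trace_path) as f:
        for line in f:
            entry = json.loads(line)
            conn.execute(entry["sql"], entry["params"])
    return conn

db = LoggingConnection(":memory:", "trace.jsonl")
db.execute("CREATE TABLE runs (id INTEGER, result REAL)")
db.execute("INSERT INTO runs VALUES (?, ?)", (1, 3.14))
db.trace.flush()

fresh = replay(":memory:", "trace.jsonl")
print(fresh.execute("SELECT * FROM runs").fetchall())  # [(1, 3.14)]
```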
This document proposes an approach called PTU (Provenance-To-Use) to improve the repeatability of scientific experiments by minimizing computation time during repeatability testing. PTU builds a package containing the software, input data, and provenance trace from a reference execution. Testers can then selectively replay parts of the provenance graph using the ptu-exec tool, reducing testing time compared to full re-execution. The document describes the PTU components, including tools for auditing reference runs, building provenance packages, and selectively replaying parts of the provenance graph. Examples applying PTU to the PEEL0 and TextAnalyzer applications show reductions in testing time.
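The selective-replay idea can be sketched as a graph traversal: re-execute only the nodes downstream of the point under test, and reuse recorded outputs for everything upstream. The toy provenance graph below is invented; this shows the concept behind ptu-exec, not its implementation.

```python
# Selective replay over a toy provenance DAG (illustrative names only).
from collections import deque

edges = {"input.dat": ["clean"], "clean": ["analyze"], "analyze": ["plot"]}
topo_order = ["input.dat", "clean", "analyze", "plot"]
recorded = {"input.dat", "clean", "analyze"}  # outputs saved in the package

def downstream(node):
    """All nodes reachable from `node`, i.e., what must be re-run with it."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def replay_from(node):
    rerun = downstream(node) | {node}
    for n in topo_order:
        if n in rerun:
            print("re-executing", n)
        elif n in recorded:
            print("reusing recorded output of", n)

replay_from("analyze")  # only `analyze` and `plot` are re-executed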
Plenary talk at the international Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
Cytoscape Tutorial Session 1 at UT-KBRIN Bioinformatics Summit 2014 (4/11/2014) (Keiichiro Ono)
This document outlines a tutorial on biological data analysis and visualization using Cytoscape. The tutorial covers basic concepts like networks and tables in Cytoscape, data import, network analysis features, and visualization techniques. It discusses loading sample network data, calculating network statistics, filtering networks, basic search functionality, and applying visual styles. The tutorial is intended to provide a practical introduction to Cytoscape's core features through examples and demos.
Workshop: Introduction to Cytoscape at UT-KBRIN Bioinformatics Summit 2014 (4... (Keiichiro Ono)
This document summarizes a presentation given by Keiichiro Ono on the open source software platform Cytoscape. Ono introduced Cytoscape as a tool for biological network analysis and visualization. He discussed how it can integrate network and attribute data, perform network analysis functions like filtering and calculating statistics, and visualize networks through customizable layouts and visual styles. Ono also highlighted Cytoscape's ecosystem of apps that extend its functionality and its use of open standards to import a variety of network and attribute data formats.
Research Automation for Data-Driven Discovery (Globus)
This document discusses research automation and data-driven discovery. It notes that data volumes are growing much faster than computational power, creating a productivity crisis in research. However, most labs have limited resources to handle these large data volumes. The document proposes applying lessons from industry to create cloud-based science services with standardized APIs that can automate and outsource common tasks like data transfer, sharing, publishing, and searching. This would help scientists focus on their core research instead of computational infrastructure. Examples of existing services from Globus and the Materials Data Facility are presented. The goal is to establish robust, scalable, and persistent cloud platforms to help address the challenges of data-driven scientific discovery.
Scaling collaborative data science with Globus and Jupyter (Ian Foster)
The Globus service simplifies the utilization of large and distributed data on the Jupyter platform. Ian Foster explains how to use Globus and Jupyter to seamlessly access notebooks using existing institutional credentials, connect notebooks with data residing on disparate storage systems, and make data securely available to business partners and research collaborators.
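A minimal sketch of what this looks like from a notebook, following the Globus SDK for Python's documented native-app flow. The client ID, endpoint UUIDs, and paths are placeholders, and exact call signatures can vary across SDK versions.

```python
# Authenticate with Globus Auth and submit a transfer between two endpoints.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"  # placeholder

auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
code = input("Paste authorization code: ").strip()
tokens = auth_client.oauth2_exchange_code_for_tokens(code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

SRC, DST = "SRC-ENDPOINT-UUID", "DST-ENDPOINT-UUID"  # placeholders
tdata = globus_sdk.TransferData(tc, SRC, DST, label="notebook demo")
tdata.add_item("/shared/results.h5", "/home/me/results.h5")
task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])
```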
Deep learning is finding applications in science such as predicting material properties. DLHub is being developed to facilitate sharing of deep learning models, data, and code for science. It will collect, publish, serve, and enable retraining of models on new data. This will help address challenges of applying deep learning to science like accessing relevant resources and integrating models into workflows. The goal is to deliver deep learning capabilities to thousands of scientists through software for managing data, models and workflows.
Data Tribology: Overcoming Data Friction with Cloud Automation (Ian Foster)
A talk at the CODATA/RDA meeting in Gaborone, Botswana. I made the case that the biggest barriers to effective data sharing and reuse are often those associated with "data friction" and that cloud automation can be used to overcome those barriers.
The image on the first slide shows a few of the more than 20,000 active Globus endpoints.
In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and “where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum.
Jupyter notebooks are transforming the way we look at computing, coding and problem solving. But is this the only “data scientist experience” that this technology can provide?
In this webinar, Natalino will sketch how you could use Jupyter to create interactive and compelling data science web applications and provide new ways of data exploration and analysis. In the background, these apps are still powered by well understood and documented Jupyter notebooks.
They will present an architecture composed of four parts: a Jupyter kernel gateway (server only), a Scala/Spark Jupyter kernel, a Spark cluster, and an Angular/Bootstrap web application.
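As a hedged sketch of the gateway piece of that architecture, the snippet below drives a headless Jupyter Kernel Gateway through the standard Jupyter /api/kernels REST endpoints. The gateway URL is a placeholder, and actually executing code additionally requires the websocket channels endpoint and the Jupyter message protocol, omitted here.

```python
# Start, list, and shut down kernels on a Jupyter Kernel Gateway.
import requests

GATEWAY = "http://localhost:8888"  # placeholder gateway URL

# Start a kernel (any kernel spec installed on the gateway, e.g., Spark/Scala).
kernel = requests.post(f"{GATEWAY}/api/kernels", json={"name": "python3"}).json()
print("Started kernel:", kernel["id"])

# List running kernels, as the web application's backend might.
print(requests.get(f"{GATEWAY}/api/kernels").json())

# Shut the kernel down when the session ends.
requests.delete(f"{GATEWAY}/api/kernels/{kernel['id']}")
```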
2019 03-11 Bio-IT World West GenePattern Notebook slides (Michael Reich)
The GenePattern Notebook Environment for Open Science and Reproducible Research, presentation by Michael Reich, UCSD, at BioITWorld West, San Francisco, March 11, 2019
The Discovery Cloud: Accelerating Science via Outsourcing and Automation (Ian Foster)
Director's Colloquium at Los Alamos National Laboratory, September 18, 2014.
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the “cloud” (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, current, and potential future of large-scale outsourcing and automation for science.
Accelerating Discovery via Science Services (Ian Foster)
[A talk presented at Oak Ridge National Laboratory on October 15, 2015]
We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In big-science projects in high-energy physics, astronomy, and climate, thousands work daily within virtual computing systems with global scope. But we now face a far greater challenge: Exploding data volumes and powerful simulation tools mean that many more--ultimately most?--researchers will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to develop suites of science services to which researchers can dispatch mundane but time-consuming tasks, and thus to achieve economies of scale and reduce cognitive load. I explore the past, current, and potential future of large-scale outsourcing and automation for science, and suggest opportunities and challenges for today’s researchers. I use examples from Globus and other projects to demonstrate what can be achieved.
Building Reproducible Network Data Analysis / Visualization Workflows (Keiichiro Ono)
The document discusses building reproducible network data analysis and visualization workflows using REST APIs and containerization. It aims to solve problems with complex software stacks that are difficult to set up and not reproducible. The goal is to create reproducible and scalable "dry experiments" using Docker containers, GitHub for source code sharing, Jupyter notebooks as electronic lab notebooks, and the cyREST module for the Cytoscape network analysis software. Examples of scenarios using local workstations and cloud computing are presented, as well as a demo and future plans.
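A minimal example of the cyREST piece, assuming Cytoscape is running locally with cyREST on its default port 1234; the two-node network payload is invented.

```python
# Create a small network in a running Cytoscape instance via cyREST.
import requests

BASE = "http://localhost:1234/v1"  # cyREST default

print(requests.get(BASE).json())  # server status; confirms cyREST is up

network = {
    "data": {"name": "demo network"},
    "elements": {
        "nodes": [{"data": {"id": "a"}}, {"data": {"id": "b"}}],
        "edges": [{"data": {"source": "a", "target": "b"}}],
    },
}
resp = requests.post(f"{BASE}/networks", json=network)
print("Created network SUID:", resp.json()["networkSUID"])
```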
Introduction to Biological Network Analysis and Visualization with Cytoscape ... (Keiichiro Ono)
Introduction to biological network analysis and visualization with Cytoscape (using the latest version 3.4).
This is the first half of the lecture for the Applied Bioinformatics course at TSRI.
1) Scientists at the Advanced Photon Source use the Argonne Leadership Computing Facility for data reconstruction and analysis from experimental facilities in real-time or near real-time. This provides feedback during experiments.
2) Using the Swift parallel scripting language and ALCF supercomputers like Mira, scientists can process terabytes of data from experiments in minutes rather than hours or days. This enables errors to be detected and addressed during experiments.
3) Key applications discussed include near-field high-energy X-ray diffraction microscopy, X-ray nano/microtomography, and determining crystal structures from diffuse scattering images through simulation and optimization. The workflows developed provide significant time savings and improved experimental outcomes.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... (Ian Foster)
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
Working with Instrument Data (GlobusWorld Tour - UMich) (Globus)
This document discusses using Globus to automate the management and analysis of large scientific instrument data. It provides examples of challenges with managing large datasets from the Event Horizon Telescope and applying Globus services and automation to help address these challenges. Specific use cases discussed include building connectomes from microscopy data and applying deep learning to flag bad scanning electron microscope images. The document emphasizes that automation needs transparency, results need to be easily findable, and leveraging specialized services can help.
This document discusses using cloud computing and virtualization for scientific research. Some key points:
- Scientists can access remote sensors, share data and workflows, and store personal data in the cloud. Beginners can click to code, while experts can build complex workflows.
- Services allow publishing, finding, and binding to distributed resources through registries. Data can be queried through standards like the Simple Image Access Protocol (see the sketch after this list).
- Distributed registries from various organizations harvest metadata to enable semantic search across sky regions, identifiers, tags, vocabularies, schemas, and service descriptions.
- Tools provide code/presentation environments and access to distributed data in the cloud. Services include astronomical cross-matching and event notification through Sky
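For the Simple Image Access Protocol mentioned in this list, a query is an HTTP GET with positional parameters that returns a VOTable of matching images. A minimal sketch, with a placeholder service URL:

```python
# Query a Simple Image Access (SIA) service for FITS images around a position.
import requests

SIA_BASE = "https://example.org/sia"  # placeholder SIA service URL

params = {
    "POS": "180.0,-30.0",    # RA,Dec of the search position, in degrees
    "SIZE": "0.5",           # angular size of the search region, in degrees
    "FORMAT": "image/fits",  # restrict results to FITS images
}
resp = requests.get(SIA_BASE, params=params, timeout=30)
print(resp.headers.get("Content-Type"))  # typically a VOTable (XML)
print(resp.text[:200])
```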
We presented these slides at the NIH Data Commons kickoff meeting, showing some of the technologies that we propose to integrate in our "full stack" pilot.
This document discusses tools for distributed data analysis including Apache Spark. It is divided into three parts:
1) An introduction to cluster computing architectures like batch processing and stream processing.
2) The Python data analysis library stack including NumPy, Matplotlib, Scikit-image, Scikit-learn, Rasterio, Fiona, Pandas, and Jupyter.
3) The Apache Spark cluster computing framework and examples of its use including contexts, HDFS, telemetry, MLlib, streaming, and deployment on AWS (a minimal PySpark sketch follows this list).
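A minimal PySpark word count in the spirit of part 3, using the RDD API; the HDFS path and master URL are placeholders.

```python
# Count words across log files stored in HDFS with the Spark RDD API.
from pyspark import SparkContext

sc = SparkContext(master="local[*]", appName="wordcount-demo")

counts = (
    sc.textFile("hdfs:///data/telemetry/*.log")   # placeholder path
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)
for word, n in counts.take(10):
    print(word, n)
sc.stop()
```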
Sensors and Crowd - Steve Liang, GeoCENS Project (Cybera Inc.)
Steve Liang, assistant professor at the University of Calgary, presented these slides as part of the Cybera Summit 2010 session, Environmental Infrastructure: The Tools and Technologies Behind Water and Resource Management.
Strategies to foster OER and OER initiatives in developing regions (@cristobalcobo)
An OER action-research project, OportUnidad, funded by the European Commission, is presented. This study, led by a partnership of European and Latin American universities, aims to increase the awareness and institutional support of OER in Latin American higher education. Based on this action-research project, the article analyses the impact of digital technologies in education, particularly regarding the generation, adoption, and dissemination of educational content.
VIVO is a semantic web application that aims to address challenges in academia such as finding collaborators, generating lists of publications and inferring expertise. It extracts and connects data from various sources using ontologies and stores it as RDF triples. VIVO versions are released periodically with new features. A collaboration of organizations contribute to its development and many institutions have adopted it to make their research more visible and connectable.
This document introduces IkeWiki, a semantic wiki for collaborative knowledge management. It describes IkeWiki's vision of aligning technology with users rather than forcing predefined workflows. The document outlines IkeWiki's interface, which combines a normal wiki interface with semantic annotations. It also describes IkeWiki's architecture of separately storing content and metadata and linking them through rendering. Several current and potential applications of IkeWiki and the related KIWI project are presented.
IASSIST 2011 presentation: Problems with our Data Citation Solution (Heather Piwowar)
IASSIST 2011 presentation: "Problems with our Data Citation Solution" by Heather Piwowar, June 1 2011 #iassist
Note that these problems are solvable! Many are currently being solved... tool fixes are underway, etc. This presentation is to raise awareness of current state and challenges ahead.
Presented at STPcon 2011 on leveraging your automated tests by putting them under CI. Covers value proposition of CI, challenges to implement, and ideas for scaling.
!"#$%&'()*&#+,$)
!"#$$%&'%()*)%
• +,-%./$0$1%2,#""-34-$%#35%'676.-%&'%8/2.&'/3#32-9:#20-5%$-26./7/-$%#$%#%
$-;#.#7-%#$$-7%2"#$$%
• <#"#32-%$,--7%$6;;&.7%$7.6276.-$%'&.%'/3#32/#"%/3$7/767/&3$%
• =>#8/3/34%7,-%$&92#""-5%[email protected]$$%+-$7A%#35%7,-%2&8;-7/7/B-%/8;"/2#7/&3$%
• C#7/34%D-54-%E635$%
• +,-%F<@%!#$-G%!./$/$%#35%7,-%H#I%'&.H#.5%
• D-54-%E635$%J #%7-$7%2#$-%'&.%'676.-%.-46"#7/&3%#35%./$0%4&B-.3#32-%&'%$,#5&H%
:#30/34%#27/B/7/-$%%
!"#$$%&'%()**%
• K;7/8#"%E635/34%#35%L3B-$78-37%@7.#7-4I%'&.%#3%L37-.3#7/&3#"%M&.7'&"/&%&'%
!&88&5/7I%N-./B#7/B-$%#35%<&35$O!#$-%@765I%&'%=NM%C-3-H#:"-$%
• !&.;&.#7-%C/$0%P#3#4-8-37O=>;-./-32-$%N6./34%())Q%!./$/$%%
• C/$0%P#3#4-8-37%'&.%/3B-$7/34%/3%=8-.4/34%P#.0-7$%L32"65/34%D-54/34%
@7.#7-4/-$%
• M&H-.%@6.4-9R$$-79:#20-5%=3-.4I%+.#5/34%C/$0%#35%C-76.3%
• R$$-$$/34%#35%D-54/34%K;-.#7/&3#"%C/$09D-54/34%@I$7-8%N&H37/8-%
!"#$$%&'%()*(%
• R%E.#8-H&.0%'&.%K;-.#7/&3#"%C/$0%P/7/4#7/&3%
• @I$7-8/2%C/$0%@#'-46#.5$%'&.%!-37.#"%!"-#./34%!&637-.;#.7/-$%
• P#3#4-8-37%&'%C-;67#7/&3#"%C/$0G%R%M.#27/2#"%R;;.,%
• @7.-$$%+-$7/34%#35%C/$0%!&37.&"$G%!,#34-$%'.&8%())S%7&%()*)%
• @7.#7-4/2%P&.74#4-%N-'#6"7%/3%E/3#32/#"%L3$7/767/&3$%%
• =8-.4/34%T/U6/5/7I%C-U6/.-8-37$%#35%7,-%L8;#27%&3%<#30/34%#35%L3$6.#32-%
• L37.&562/34%=CPG%R%!#$-%@765I%&'%#%@,/;;/34%!&8;#3I%
• L8;#27%&'%C-46"#7/&3%&3%C/$0%%
%
%
!"#$$%&'%()*V%
• +,-%=''-27/B-3-$$%&'%7,-%C-46"#7&.I%@7.-$$%+-$7/34%N/$2"&$6.-%M.&2-$$%
• <.#W/"G%L$%+,/$%+/8-%N/''-.-37X%
• N&-$%C-;67#7/&3%&'%@LEL$%.-#""I%8#77-.X%
• @7.#7-4/2%D-54/34%!,&/2-$%/3%Y,&"-$#"-%#35%C-7#/"%<#30/34%
• P#3#4/34%!6..-32I%C/$0%/3%@8#""%=8-.4/34%P#.0-7$G%R%!#$-%@765I%/3%!&$7#%C/2#%
• @,#5&H%<#30/34G%!#$-%@765/-$%#35%Z"&:#"%C-46"#7&.I%C-'&.8%
• N&-$%C/$0%P#3#4-8-37%P#77-.%7&%@,#.-,&"5-.$X%
• +,-%L8;#27%&'%=>2,#34-%+.#5-5%M.&5627$%&3%7,-%M#.#8-7-.$%&'%Z&"5%
TIMELINE
Due Date | Deliverable | Comments
July 15, 2013 | Individual topic proposal | 1-2 paragraph write-up; a discussion forum is created to share ideas
July 29, 2013 | Pitching session | 5-6 minute presentation to the class
August 3, 2013 | Group formation and decision about the topic | One-page proposal stating the objectives and proposed methodology, plus a list of team members
August 26, 2013 | An adviser is assigned | Conference call/meeting with the adviser
October 2013 | Status report | Short report stating the work done so far and the steps planned for coming months
April 21, 2014 | Draft report | A complete report in all respects; any changes at this point should be limited to non-critical components
May 5, 2014 | Final report | The structure of the report must include an 'Executive Summary' followed by a detailed report, followed by appendices; the report should be as long as it needs to be and not longer
May 20, 2014 | Presentation | 20-minute presentation followed by 20-minute Q&A
Master of Science in Risk Management Program
Strategic Capstone Workshop
May 30, 2013
Agenda
• Introductio ...
Evidence for the Pareto principle in open source software activity (Tom Mens)
This document studies the distribution of activity among contributors to open source software projects. It analyzes three projects to determine if the Pareto principle applies, where 20% of contributors account for 80% of activity. Results show activity becomes more unequally distributed over time. For two projects, a limited core group performs most commits, emails, and bug changes. Future work aims to better understand how core groups evolve and identify new active contributors.
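The measurement itself is simple to sketch: given per-contributor activity counts, compute the share of activity performed by the most active 20% of contributors. The counts below are invented example data, not the paper's.

```python
# Share of total activity contributed by the top `fraction` of contributors.
def top_share(counts, fraction=0.2):
    counts = sorted(counts, reverse=True)
    k = max(1, int(len(counts) * fraction))
    return sum(counts[:k]) / sum(counts)

commits = [420, 310, 95, 40, 22, 15, 9, 5, 3, 1]  # invented example data
print(f"Top 20% of contributors made {top_share(commits):.0%} of commits")
# -> 79% for this example, close to the 80/20 Pareto split
```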
Presentation by Heather Piwowar at Simon Fraser University in October 2012 at the SFU Research Data Repository Project Launch.
Highlights current state of research data sharing. http://www.lib.sfu.ca/node/11510
Gil Elbaz, CEO and founder of Factual, gave a talk at the 2011 Web 2.0 Conference in San Francisco. His talk was entitled: "Big Data Challenges: Getting Some."
The term "life cycle" refers to the series of stages or phases that an organism, system, or product goes through from its beginning to its end. It is a concept that can be applied to various contexts, such as biology, ecology, business, technology, and project management. Here are a few examples of life cycles:
Biological Life Cycle: In biology, the life cycle refers to the sequence of stages that an organism undergoes from birth to reproduction and eventually death. This can include processes like birth or germination, growth and development, reproduction, and death.
Product Life Cycle: The product life cycle describes the stages a product goes through from its introduction to the market until its eventual decline. These stages typically include introduction, growth, maturity, and decline. Companies monitor the product life cycle to make strategic decisions regarding marketing, production, and product development.
Project Life Cycle: The project life cycle outlines the stages involved in the management and execution of a project. These stages typically include initiation, planning, execution, monitoring and control, and closure. Each phase has specific activities and deliverables, ensuring that the project progresses in a structured and organized manner.
Ecological Life Cycle: Ecological life cycles refer to the stages that ecosystems or species go through over time. This can involve the growth and decline of populations, adaptation to environmental changes, and interactions within the ecosystem.
Human Life Cycle: The human life cycle encompasses the different stages of development and growth that individuals go through from birth to death. This includes infancy, childhood, adolescence, adulthood, and eventually old age.
Understanding life cycles is important as it provides insight into the processes and changes that occur within various systems. It allows for better planning, decision-making, and adaptation to ensure sustainable growth, effective management, and optimal utilization of resources throughout the life cycle.
Big data visualization frameworks and applications at Kitware (bigdataviz_bay)
Big data visualization frameworks and applications at Kitware
Marcus Hanwell, Technical Leader at Kitware, Inc.
March 27th 2014
Kitware develops permissively licensed open source frameworks and applications for scientific data applications, and related areas. Some of the frameworks developed by our High Performance Computing and Visualization group address current challenges in big data visualization and analysis in a number of application domains including geospatial visualization, social media, finance, chemistry, biological (phylogenetics), and climate. The frameworks used to develop solutions in these areas will be described, along with the applications and the nature of the underlying data. These solutions focus on shared frameworks providing data storage, indexing, retrieval, client-server delivery models, server-side serial and parallel data reduction, analysis, and diagnostics. Additionally, they provide mechanisms that enable server-side or client-side rendering based on the capabilities and configuration of the system.
Big Data Visualization Meetup - South Bay
http://www.meetup.com/Big-Data-Visualisation-South-Bay/
This document discusses problems with traditional scholarly publishing and proposes solutions centered around open data and transparency. It notes that traditional publishing hinders reproducibility due to lack of access to data and methods. This has led to an increasing number of non-reproducible findings and retractions. The document advocates for incentivizing the publication of data, software, workflows and other research objects to improve reproducibility and transparency. It highlights several examples where making these elements openly available improved scrutiny and identified errors in published works.
ExLibris National Library Meeting @ IFLA-Helsinki - Aug 15th 2012 (Lee Dirks)
An invited talk to 40+ directors of national libraries worldwide at the annual ExLibris member meeting at IFLA (Helsinki, Finland) on August 15th, 2012.
Knowledge Infrastructure for Global Systems Science (David De Roure)
Presentation at the First Open Global Systems Science Conference, Brussels, 8-10 November 2012
http://www.gsdp.eu/nc/news/news/date/2012/10/31/first-open-global-systems-science-conference/
This document discusses information sources for research. It identifies 6 main types of information sources: 1) libraries and information centers, 2) publications such as books, journals, magazines, and newspapers, 3) experts and knowledgeable individuals, 4) organizations and associations, 5) the internet and search engines, and 6) other media such as television, radio, and films. Search engines are further divided into major search engines, meta search engines, and directory search engines. Examples are provided for each type of information source.
The document provides guidance on improving data management practices for scientists. It discusses common mistakes made, such as not teaching or practicing good data management. The document recommends reducing workload by reusing ideas and recycling data, and provides a toolbox of resources to improve data management skills.
The document discusses research data management and provides guidance on how to manage research data. It defines research data and explains why proper data management is important, such as ensuring data quality and access. It also outlines Oxford's activities to support data management, including interviews with researchers to understand challenges and requirements. Finally, it provides recommendations on developing a data management plan and offers services available at Oxford to help with file handling, metadata, storage, sharing, and long-term preservation of research data.
How e-infrastructure can contribute to Linked Germplasm Data (Stoitsis Giannis)
This document discusses how e-infrastructure can help link germplasm data. It describes the need for e-infrastructure to overcome issues like data silos and lack of interoperability. The agINFRA approach provides services like metadata aggregation, vocabulary publishing, and APIs that can transform and expose germplasm descriptions as linked data. This will help link germplasm databases and make the data more discoverable. Next steps include developing recommendations for publishing germplasm data and deploying transformers and APIs on the agINFRA platform.
EarthCube DDMA AGU
1. A Community Roadmap for Enabling Access to Geosciences Data
Tanu Malik and Ian Foster
Computation Institute, University of Chicago and Argonne National Laboratory
tanum@ci.uchicago.edu, foster@anl.gov
www.ci.anl.gov | www.ci.uchicago.edu
3. Access is Vital for EarthCube's Success
• The goal of EarthCube is to create a sustainable infrastructure that enables the sharing of all geosciences data, information, and knowledge in an open, transparent, and inclusive manner.
"I can't get access to *." "It is difficult for me to *." "I want to integrate data from other disciplines, but *."
Access refers to software and activities that make data and computational resources easily, efficiently, and reliably available to scientists across disciplines.
4. Access Workshop Goals
• Encourage discussions on emergent issues:
– Use of cloud computing
– Exploiting the general principle of moving computation to data
– A technological and governance framework for cross-disciplinary access: service architecture, brokering principles, real-time data, a uniform authentication and authorization environment, etc.
– Improving access to data in publications
• Bring some standardization to research data lifecycle issues:
– In general, data, once generated, follow a lifecycle: they are stored, described, processed, transformed, accessed, discovered, analyzed, and curated. In organized networks and campaigns, lifecycle stages are often documented and standardized, though they vary significantly across networks and campaigns. In individual initiatives, the lifecycle stages remain ad hoc and ill-defined. [RDLM-Workshop2011]
• Obtain community consensus on a few use cases
5. Workshop Activity Outcomes
• Use Case 1: Can I access "not large" but "big data" to conduct statistical analysis?
• Use Case 2: I have a hypothesis not tied to a physical instrument or geophysical parameter. Can I still access all the data, in an "interactive" fashion, to test my hypothesis?
• Use Case 3: The storm dust paper is vital to my research. Can I access the data in the publication and change parameters of experiments to understand the nature of storm dust?
6. Workshop Reflections
• It's all about data!
[Diagram: people import and export data, resources, and services to and from a shared space, with data at the center.]
7. Workshop Reflections-2
• Discussing technology issues in isolation is a recipe for disaster.
– Access is closely aligned with other subgroups
– It is important to organize in functional units
8. Workshop Reflections-3
• Challenges will continue
Social challenges: transparency; openness; establishing social ties; slow adoption; sustainability; establishing practices.
Changing requirements / changing technology: adoption culture; real-time data; cross-disciplinary data; high dimensionality; network bandwidth, computational resource, and data management constraints.
10. Enabling A Data Sharing Space: The DataSpace
• Embrace a "semi-structured" notion
• Ingest data in raw form, with ongoing structuring and refinement of the data and metadata
• Open, extensible architecture that supports the Software as a Service (SaaS) model, with a process for vetting contributed services prior to their incorporation, based on on-demand resources
• Emphasis on usability instead of developing technology/infrastructure
[Diagram: resources, services, and data imported into and exported from the DataSpace.]
11. Post-Charrette
• Two EarthCube PI meetings at the University of Colorado, Boulder:
– A Concept group meeting, with some representation from Community groups, July 10, 2012
– A Concept and Community group meeting, October 4-5, 2012
• Primary objective: convergence
– through roadmaps
– on architecture
– on future steps
12. Highlights: Summary of Roadmaps
• Workplace to collaborate
• Lower barriers for participation
• Openness and extensibility
• Feedback and reproducibility
• Discovery of materials held by long-tail scientists
• Education and reward system for scientists
• Cross-domain teams and broad collaboration
• A new community paradigm
15. Acknowledgements
• Don Middleton, NCAR
• Dave Fulker, OPeNDAP
• Robert Gibb, New Zealand Landcare Research
• Amarnath Gupta, UCSD
• Robert Jacob, ANL
• Jeff Heard, U. of North Carolina
• Chris Jenkins, JPL
• Doug Lindholm, U. of Colorado
• Craig Mattocks, U. Miami
• Joseph Baker, Virginia Tech
• Beth Plale, Indiana Univ.
• Anne Wilson, U. of Colorado
• Stephen M. Richard, AZGS
• Chris Lynnes, NASA/ESIP Federation
• Sameer Sirugeri, Microsoft
• Karsten Steinhauser, U. of Minnesota
• Zhangfan Xing, JPL
• John Williams, NCAR
• Ruth Duerr, NSIDC
Shared, standard, reusable software interfaces: for disparate data types, disparate storage, and varying protocols; deliver data in user-requested formats, with translation between standards.
Link various kinds of data: integration of high-resolution topography scans and geodetic data; integration of geologic data in deep time; geo-located and non-geo-located datasets; observation and simulation datasets for comparison.
Real-time access to data and facilities: capabilities within cloud and grid, such as shared storage and data spaces; in low-bandwidth settings; simulation and modeling capabilities within HPC and science portals.
Access refers to software and activities that make data and computational resources easily, efficiently, and reliably available to scientists.
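One way to read "deliver data in user-requested formats" is a registry of translators behind a single interface, so callers name a format rather than coding to each storage convention. A minimal sketch with invented formats and records:

```python
# A tiny format-translation registry behind one delivery interface.
import csv
import io
import json

TRANSLATORS = {}

def translator(fmt):
    def register(fn):
        TRANSLATORS[fmt] = fn
        return fn
    return register

@translator("json")
def to_json(records):
    return json.dumps(records)

@translator("csv")
def to_csv(records):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=sorted(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def deliver(records, fmt):
    """Return the same data in whichever registered format was requested."""
    return TRANSLATORS[fmt](records)

obs = [{"station": "S1", "temp_c": 12.5}, {"station": "S2", "temp_c": 9.8}]
print(deliver(obs, "csv"))
print(deliver(obs, "json"))
```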
Access paradigms: the SaaS model and the brokering approach. The SaaS model increases usage and adoption by making access to data and resources easy and convenient. The brokering approach implements mediation and distribution capabilities in a transparent way. Discuss these paradigms in the context of the needs of publishers of big data and the needs of the long-tail geoscientist, along with issues relating to access control, confidentiality, and the role of governance bodies for emerging access paradigms.
Structural data integration for access: issues relating to data, data models, and standards for data integration. Discuss novel data types needed by current science cases and their abstraction to data models and knowledge-based models based on space-time integration.
Scalable resource access: scalable access to resources such as HPC systems and cloud-based systems (parallel storage systems, parallel analysis systems such as MapReduce [8], Hadoop, SciDB [19]), especially at marginal cost; the ability to store and manipulate data even when the structure of the data is not fully known to the system; associating the cloud with a set of services for recognizing the structure of the wide variety of file types used in geoscience applications, extracting structure from the data, and traversing files to extract metadata.
However, in cases where researchers are interested in studying a phenomenon, can an EarthCube framework provide adequate semantics to express a search query, a generic model for data access of events, and a way to interactively discover 'events' within data and perform 'first look' analytics, while keeping provenance and history of all analyses?
Earlier, resources were at the center, and data was massaged so that resources and services could access it. But now data is going to be central, and services will feed into it, and so the ...
The Sher Dataspace embodies a "semi-structured" notion compared, on the one hand, with rigidly structured systems such as relational database systems, where a data schema must be specified before data can be stored, and, on the other hand, with filesystems, which are unstructured and do not support any notion of a schema or content-based metadata. In Sher, data can be ingested as a file (or a heterogeneous package, e.g., a folder) with minimal metadata. Services are provided for capturing this metadata as well as the package structure.
Further services are provided for ongoing structuring and refinement of the data and metadata. Examples include user-specified annotations; extraction of information for well-known file types (e.g., netCDF); extraction of metadata for proprietary file types using software libraries (e.g., NMR data); and structuring of data and associated information, e.g., associating a set of flat files with a database along with the set of data-cleaning routines and load scripts that were used to create the data. Thus, the Dataspace concept supports the model of data being transformed incrementally from a relatively unstructured state with minimal metadata to a highly structured form with rich metadata, using an array of structuring and refinement services.
A key enabling characteristic of Sher is its open, extensible architecture that supports the Software as a Service (SaaS) model, thereby removing the burden of maintaining software and software environments from the client [52]. Using this SaaS model, Sher facilitates creation of third-party services that can be contributed into the system, i.e., a SherStore, similar to the Apple AppStore, including the notion of vetting contributed services prior to their incorporation.
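One such refinement service, metadata extraction for a well-known file type, can be sketched with the netCDF4 library; the file path is a placeholder.

```python
# Extract global attributes, dimensions, and variable shapes from a netCDF file.
from netCDF4 import Dataset

def extract_netcdf_metadata(path):
    ds = Dataset(path)
    meta = {
        "global_attributes": {a: getattr(ds, a) for a in ds.ncattrs()},
        "dimensions": {name: len(dim) for name, dim in ds.dimensions.items()},
        "variables": {name: {"dtype": str(var.dtype), "shape": var.shape}
                      for name, var in ds.variables.items()},
    }
    ds.close()
    return meta

print(extract_netcdf_metadata("observations.nc"))  # placeholder path
```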