Fall 2016 TMS Webinar on Data Curation Tools. Slides for the Materials Data Facility presentation on data services (publish and discover) as described by Ben Blaiszik. See http://www.materialsdatafacility.org for more information.
I presented this keynote talk at the WorldComp conference in Las Vegas, on July 13, 2009. In it, I summarize what grid is about (focusing in particular on the "integration" function, rather than the "outsourcing" function--what people call "cloud" today), using biomedical examples in particular.
The document outlines the vision, mission, and strategy of the STFC (Science and Technology Facilities Council) in implementing e-Science technologies. The goals are to exploit data from STFC facilities through innovative infrastructure, integrate activities nationally and internationally, and improve computation and data management capabilities to enable new scientific discoveries.
Screenshots prepared by Ben Blaiszik and Kyle Chard, used in our Globus publication demo at GlobusWorld 2014. See https://www.globus.org/data-publication for more information and the notes on the slides for details.
“Open Data Web” – A Linked Open Data Repository Built with CKAN – Chengjen Lee
This document summarizes the development of an open linked data repository called Open Data Web (ODW) built using CKAN. Key points:
- ODW publishes structured data from a Taiwanese archive catalog as linked open data using the RDF data model.
- It provides features for browsing, spatial and temporal querying of the data through a SPARQL endpoint.
- The system was implemented by customizing CKAN using extensions to support linked data import/export, custom fields, spatial/temporal search.
- Future work includes improving import speed and providing native SPARQL queries in CKAN.
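As a hedged illustration of what querying such a SPARQL endpoint involves (the endpoint URL and predicate names below are hypothetical placeholders, not taken from ODW), a client sends the query text in the `query` parameter of an HTTP GET, per the SPARQL 1.1 Protocol:

```python
from urllib.parse import urlencode

# A simple SPARQL query over a hypothetical ODW-style dataset: find records
# whose title contains a keyword. Predicate names are illustrative only.
query = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?record ?title WHERE {
  ?record dcterms:title ?title .
  FILTER(CONTAINS(?title, "Taiwan"))
} LIMIT 10
"""

# Per the SPARQL 1.1 Protocol, a query may be sent as an HTTP GET with the
# query text URL-encoded in the 'query' parameter. Endpoint is a placeholder.
endpoint = "https://example.org/odw/sparql"
request_url = endpoint + "?" + urlencode({"query": query, "format": "json"})
print(request_url)
```

An HTTP library such as `urllib.request` or `requests` would then fetch this URL and parse the JSON result bindings.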
Globus Online provides services to enable easy and reliable data transfer between campus resources and national cyberinfrastructure. It uses Globus Transfer for simple file transfers and Globus Connect to easily integrate campus resources. Globus Connect Multi-User allows administrators to easily deploy GridFTP servers and authentication for multiple users, facilitating campus bridging. Several universities have found success using these Globus services to enable terabyte-scale data sharing across their campuses and with national resources.
Foundations for the future of science discusses using artificial intelligence and machine learning to advance scientific research. Key points discussed include using AI to analyze large datasets, develop scientific models, and automate experimental workflows. The document also outlines several examples of how the Globus data platform is currently enabling AI-powered scientific applications across multiple domains. Overall, the document advocates that embracing "AI for science" has the potential to accelerate scientific discovery by overcoming limitations in human analysis capabilities and computational resources.
This document summarizes a presentation on CKAN, an open-source data management system. It discusses CKAN's features for publishing, finding, and managing datasets. These include adding metadata and data, filtering datasets, previewing data types, and customizing CKAN. It also covers harvesting data from external sources, installing CKAN, and common issues. The goal of CKAN is to make data open and accessible on the web according to the 5 star open data model.
Research on vector spatial data storage scheme based on Hadoop – Anant Kumar
The document proposes a novel vector spatial data storage schema based on Hadoop to address problems with managing large-scale spatial data in cloud computing. It designs a vector spatial data storage scheme using column-oriented storage and key-value mapping to represent topological relationships. It also develops middleware to directly store spatial data and enable geospatial data access using the GeoTools toolkit. Experiments on a Hadoop cluster demonstrate the proposal is efficient and applicable for large-scale vector spatial data storage and expression of spatial relationships.
Russell 2012 introduction to Spring Integration and Spring Batch – GaryPRussell
This document introduces Spring Integration and Spring Batch. It discusses how Spring Integration provides an extension of the Spring programming model to support enterprise integration patterns using pipes and filters. It also explains that Spring Batch supports common batch processing concerns like retries and skipping through pluggable strategies. Finally, it describes how Spring Integration and Spring Batch can be used together, such as launching batch jobs through messages or providing feedback with messages.
Introduction to Linked Data Platform (LDP) – Hector Correa
The Linked Data Platform (LDP) defines rules for HTTP operations on web resources to provide an architecture for read-write Linked Data on the web. Key concepts include resources, RDF sources, non-RDF sources, and containers. LDP uses HTTP requests and responses to create, retrieve, update, and delete resources. Resources can be contained within different types of containers, including basic, direct, and indirect containers. LDP provides a standard way to manage Linked Data using HTTP.
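To make the HTTP mechanics concrete, here is a minimal sketch of the request an LDP client assembles to create a new RDF source inside a basic container. The container URL, slug, and Turtle body are hypothetical; a real client would send this with an HTTP library.

```python
# Sketch of the HTTP exchange for creating an RDF source inside an LDP
# Basic Container. We only assemble the request, without sending it.

def make_ldp_create_request(container_url, slug, turtle_body):
    """Build a POST request (method, url, headers, body) asking an LDP
    server to create a new RDF source inside a basic container."""
    headers = {
        "Content-Type": "text/turtle",
        # Declare the interaction model of the resource to be created.
        "Link": '<http://www.w3.org/ns/ldp#RDFSource>; rel="type"',
        # Hint at the name the server should use for the new resource.
        "Slug": slug,
    }
    return ("POST", container_url, headers, turtle_body)

method, url, headers, body = make_ldp_create_request(
    "https://example.org/contacts/",  # hypothetical container URL
    "alice",
    "<> a <http://xmlns.com/foaf/0.1/Person> .",
)
print(method, url)
```

On success the server answers `201 Created` with a `Location` header naming the new resource; retrieval, update, and deletion then use GET, PUT/PATCH, and DELETE on that URL.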
Instrument Data Orchestration with Globus Search and Flows – Globus
This document discusses various Globus services for instrument data orchestration including the Timer service, platform services, authentication, search, transfer, flows, and the upcoming Trigger service. The Timer service allows for scheduled and recurring transfers. Platform services provide comprehensive data and compute orchestration. Authentication is handled by Globus Auth. Search allows for data description and discovery. Transfer shares and moves data. Flows automate distributed research tasks. Triggers will start flows based on events.
Gateways 2020 Tutorial - Large Scale Data Transfer with Globus – Globus
We describe the large-scale data transfer scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform large-scale data transfers, and walk through a code repository with the web application’s code.
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus – Globus
We describe the automated data ingest scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform automated data ingest and present a faceted search interface that can be used by science gateways to simplify data discovery. We also walk through the application's GitHub repository and highlight relevant components.
Gateways 2020 Tutorial - Instrument Data Distribution with Globus – Globus
We describe the requirements for, and challenges of, distributing datasets at scale, e.g. from instruments such as CryoEM and advanced light sources. We demonstrate a web application that uses Globus to perform large-scale data distribution. We introduce and walk through a Jupyter notebook highlighting the relevant code to incorporate into a science gateway.
This slide deck examines remote data replication, including possible scenarios and how it compares to syncing. It also covers how data replication works across various operating systems and how to use HotFolder-to-HotFolder replication.
The document summarizes Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It describes the key components of Hadoop including the Hadoop Distributed File System (HDFS) which stores data reliably across commodity hardware, and the MapReduce programming model which allows distributed processing of large datasets in parallel. The document provides an overview of HDFS architecture, data flow, fault tolerance, and other aspects to enable reliable storage and access of very large files across clusters.
This document provides an overview and agenda for a presentation on Azure DocumentDB. It begins with an introduction to DocumentDB, then covers getting started by setting it up in Azure, how to work with it using C#, cost and usage details, use cases and limitations. Key points are that DocumentDB is a fully-managed NoSQL document database with horizontal scalability. It provides a familiar programming model and common database functions like indexing, consistency options, and stored procedures.
This document provides an overview of NoSQL databases. It begins with a brief history of relational databases and Edgar Codd's 1970 paper introducing the relational model. It then discusses modern trends driving the emergence of NoSQL databases, including increased data complexity, the need for nested data structures and graphs, evolving schemas, high query volumes, and cheap storage. The core characteristics of NoSQL databases are outlined, including flexible schemas, non-relational structures, horizontal scaling, and distribution. The major categories of NoSQL databases are explained - key-value, document, graph, and column-oriented stores - along with examples like Redis, MongoDB, Neo4j, and Cassandra. The document concludes by discussing use cases.
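To sketch how the four NoSQL categories shape the same data differently, here is an illustrative (not authoritative) example using plain Python structures as stand-ins for each store's native model:

```python
# The same "user" record in the shapes the four NoSQL categories favor.
# All names and values are hypothetical.

# Key-value: an opaque blob behind a single key (Redis-style).
kv_store = {"user:42": '{"name": "Ada", "follows": [7]}'}

# Document: a nested, schema-flexible document (MongoDB-style).
doc_store = {"users": [{"_id": 42, "name": "Ada",
                        "address": {"city": "London"}}]}

# Column-oriented: row keys grouped by column family (Cassandra-style).
column_store = {"users": {42: {"profile": {"name": "Ada"},
                               "stats": {"logins": 3}}}}

# Graph: nodes plus explicit, queryable relationships (Neo4j-style).
graph = {"nodes": {42: {"name": "Ada"}, 7: {"name": "Alan"}},
         "edges": [(42, "FOLLOWS", 7)]}

print(doc_store["users"][0]["address"]["city"])
```

The trade-off is roughly between lookup simplicity (key-value), nested-record flexibility (document), wide sparse rows (column), and relationship traversal (graph).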
A set of slides that provides a high-level overview of the W3C Linked Data Platform specification presented at the 4th Linked Data in Architecture and Construction Workshop.
For a more detailed and technical version of the presentation, please refer to
http://www.slideshare.net/nandana/learning-w3c-linked-data-platform-with-examples
LDAC 2016 programme
http://smartcity.linkeddata.es/LDAC2016/#programme
Gateways 2020 Tutorial - Introduction to Globus – Globus
Globus provides a platform and services for simplifying data management and sharing for science gateways and applications. It offers fast and reliable file transfers between any storage systems, secure data sharing without copying data, and APIs and SDKs for building applications. Globus uses OAuth authentication and supports a variety of interfaces like CLI, Python SDK, and Jupyter notebooks to enable access.
The document describes how to model an address book application using the Linked Data Platform (LDP) and Hydra Core Vocabularies. It provides examples of modeling an address book container and contacts as LDP resources, supporting common operations like GET, POST and PATCH. It also shows how to describe the application's API using the Hydra Core Vocabulary, including supported classes, operations and documentation. Potential conflicts between LDP and Hydra concepts like containers vs collections and paging are discussed.
The document discusses NoSQL databases and MapReduce. It provides historical context on how databases were not adequate for the large amounts of data being accumulated from the web. It describes Brewer's Conjecture and CAP Theorem, which contributed to the rise of NoSQL databases. It then defines what NoSQL databases are, provides examples of different types, and discusses some large-scale implementations like Amazon SimpleDB, Google Datastore, and Hadoop MapReduce.
This document provides an outline for a student talk on NoSQL databases. It introduces NoSQL databases and discusses their characteristics and uses. It then covers different types of NoSQL databases including key-value, column, document, and graph databases. Examples of specific NoSQL databases like MongoDB, Cassandra, HBase, Riak, and Neo4j are provided. The document also discusses concepts like CAP theorem, replication, sharding, and provides comparisons of different database types.
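One concept from that outline, sharding, can be sketched briefly: a key's hash decides which node stores it, spreading data and load across the cluster. This is a minimal illustration with hypothetical node names, using naive modulo placement rather than the consistent hashing production systems prefer.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members

def shard_for(key: str, nodes=NODES) -> str:
    """Map a key to a node via a stable hash. Naive modulo placement:
    consistent hashing would limit how many keys move when nodes
    join or leave, but this shows the core idea."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

placement = {k: shard_for(k) for k in ["user:1", "user:2", "user:3"]}
print(placement)
```

Because the hash is deterministic, any client can compute a key's home node without coordination, which is what lets these systems scale horizontally.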
The document discusses the target audience for a media thriller project. The target audience is 12-20 year old males based on survey results showing most thriller fans are in this age range and are male. To attract this audience, the media project includes a gun, a significant death of a main character, and a black male lead actor. These elements appeal to the target audience by relating to genres and media they enjoy like action games and subverting expectations of typical thrillers.
Kinh Do Corporation is a Vietnamese food production company seeking to expand into new product categories like noodles and cooking oil. The document provides a sales strategy for 2015, including segmenting customers, identifying strengths/weaknesses/opportunities/threats, and setting objectives. The objectives are to focus on "Food & Essential" products, target supermarkets and schools, and expand partnerships. Customer segmentation includes current large distributors and a plan to focus more on supermarkets and grocers to market new noodles and oil products.
Course Syllabus Tier 2 5 Day Syllabus Fall 2015 – David Bourque
This 5-day course focuses on repair and maintenance of four Hewlett Packard LaserJet printer models. Students will learn laser printer theory, control panel operations, firmware updates, and reset procedures. The schedule includes lectures on printer specifications and theory, followed by hands-on labs for disassembly, reassembly, and troubleshooting. Students' learning goals are to understand printer theory, perform control panel functions, disassemble and reassemble the printers, and address common issues. The course requires a laptop and basic mechanical understanding. Students will be quizzed during modules and receive a completion certificate; course surveys provide performance feedback.
The document gives instructions for writing an effective critical commentary. It should include an introduction stating the main topic and the author's intent, several body paragraphs expressing a personal opinion for or against, supported by arguments, and a conclusion summarizing the key idea. It also points out common mistakes to avoid, such as merely paraphrasing the text without adding critical analysis, or errors of writing and organization.
Care Santos is a Spanish writer of young-adult fiction. Born in Mataró, Barcelona, in 1970, she began writing at the age of eight and knew from then on that she wanted to be a writer. She is one of the most widely read young-adult authors in Spain and also works as a literary critic. The novel addresses themes such as relationships among young people and friendship, as well as death.
This resume summarizes Araceli Ulloa's work experience including positions in medical records, scanning, customer service, retail, and insurance. She has over 10 years of experience providing excellent customer service in both English and Spanish. She is organized, detail-oriented, and able to multi-task and work independently.
A Data Ecosystem to Support Machine Learning in Materials Science – Globus
This presentation was given at the 2019 GlobusWorld Conference in Chicago, IL by Ben Blaiszik from University of Chicago and Argonne National Laboratory Data Science and Learning Division.
Materials Data Facility: Streamlined and automated data sharing, discovery, ... – Ian Foster
Reviews recent results from the Materials Data Facility. Thanks in particular to Ben Blaiszik, Jonathon Goff, and Logan Ward, and to the Globus data search team. Some features shown here are still in beta. We are grateful to NIST for their support.
Simplified Research Data Management with the Globus Platform – Globus
Overview of the Globus research data management platform, as presented at the Fall 2018 Membership Meeting of the Coalition for Networked Information (CNI), held in Washington, D.C., December 10-11, 2018
This document summarizes a presentation about providing next-generation sequencing analysis capabilities using Globus Genomics. It outlines challenges with current manual approaches to sequencing data analysis, including difficulties moving large datasets between locations and maintaining complex analysis scripts. The presentation introduces Globus Genomics, which uses Globus data transfer services integrated with Galaxy to provide a workflow-based system for sequencing analysis without requiring local installation or configuration. Key benefits include on-demand access to scalable cloud resources, ability to easily modify and reuse analysis workflows, and integration with data sources. The system aims to accelerate genomic research by automating and simplifying analysis.
Data scientists spend too much of their time collecting, cleaning and wrangling data as well as curating and enriching it. Some of this work is inevitable due to the variety of data sources, but there are tools and frameworks that help automate many of these non-creative tasks. A unifying feature of these tools is support for rich metadata for data sets, jobs, and data policies. In this talk, I will introduce state-of-the-art tools for automating data science and I will show how you can use metadata to help automate common tasks in Data Science. I will also introduce a new architecture for extensible, distributed metadata in Hadoop, called Hops (Hadoop Open Platform-as-a-Service), and show how tinker-friendly metadata (for jobs, files, users, and projects) opens up new ways to build smarter applications.
Introduction to the Globus SaaS (GlobusWorld Tour - STFC) – Globus
This document summarizes a presentation about the Globus data management platform. It includes an agenda covering an introduction to the Globus Software as a Service and Platform as a Service, automating research data workflows, facilitating collaboration, and building services. There are demonstrations of file transfers, data sharing, publication, and high assurance endpoints. The sustainability model is discussed, with standard and high assurance subscriptions, branded websites, premium storage connectors, and identity providers. Support resources like documentation, email lists, and professional services are also mentioned.
The document provides an overview of Hadoop including:
- A brief history of Hadoop and its origins from Nutch.
- An overview of the Hadoop architecture including HDFS and MapReduce.
- Examples of how companies like Yahoo, Facebook and Amazon use Hadoop at large scales to process petabytes of data.
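The MapReduce model mentioned above can be sketched in-process: map emits (key, value) pairs, the shuffle groups pairs by key, and reduce aggregates each group. This is a minimal word-count illustration of the model Hadoop distributes across a cluster, not Hadoop's own API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between
    # the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values; here, sum the counts.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In Hadoop the same three stages run in parallel across HDFS blocks on many machines; the per-key grouping is what lets reducers work independently.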
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) – Amazon Web Services
In this talk, hear about two high-performance research services developed and operated by the Computation Institute at the University of Chicago, running on AWS. Globus.org, a high-performance, reliable, robust file transfer service, has over 10,000 registered users who have moved over 25 petabytes of data using the service. The Globus service is operated entirely on AWS, leveraging Amazon EC2, Amazon EBS, Amazon S3, Amazon SES, Amazon SNS, and other services. Globus Genomics is an end-to-end next-gen sequencing analysis service with state-of-the-art research data management capabilities. It uses Amazon EC2 for scaling out analysis, Amazon EBS for persistent storage, and Amazon S3 for archival storage. Attend this session to learn how to move data quickly at any scale, and how to use genomic analysis tools and pipelines for next-generation sequencers using Globus on AWS.
Introduces the Globus software-as-a-service for file transfer and data sharing. Includes step-by-step instructions for creating a Globus account, transferring a file, and setting up a Globus endpoint on your laptop.
This presentation was given at the GlobusWorld 2020 Virtual Conference, by Ian Foster, Rachana Ananthakrishnan, and Vas Vasiliadis from the University of Chicago.
The Materials Data Facility: A Distributed Model for the Materials Data Commu... – Ben Blaiszik
The Materials Data Facility (MDF) is a distributed model for the materials data community that aims to make materials data more shareable, open, accessible, computable, and valuable. The MDF indexes over 100 terabytes of materials data from various repositories and facilities. It provides services for data discovery, publication with DOIs, and integrates data with computing resources. The goal is to simplify critical tasks in materials science like finding relevant data, training machine learning models across multiple datasets, and reproducing results.
This is a talk that I gave at BioIT World West on March 12, 2019, titled "A Gen3 Perspective of Disparate Data: From Pipelines in Data Commons to AI in Data Ecosystems."
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem... (Ian Foster)
Ever more data- and compute-intensive science makes computing increasingly important for research. But for advanced computing infrastructure to benefit more than the scientific 1%, we need new delivery methods that slash access costs, new sustainability models beyond direct research funding, and new platform capabilities to accelerate the development of new, interoperable tools and services.
The Globus team has been working towards these goals since 2010. We have developed software-as-a-service methods that move complex and time-consuming research IT tasks out of the lab and into the cloud, thus greatly reducing the expertise and resources required to use them. We have demonstrated a subscription-based funding model that engages research institutions in supporting service operations. And we are now also showing how the platform services that underpin Globus applications can accelerate the development and use of an integrated ecosystem of advanced science applications, such as NCAR’s Research Data Archive and OSG Connect, thus enabling access to powerful data and compute resources by many more people than is possible today.
In this talk, I introduce Globus services and the underlying Globus platform. I present representative applications and discuss opportunities that this platform presents for both small science and large facilities.
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS (Ed Dodds)
1) Globus Genomics addresses challenges in sequencing analysis by providing a platform that integrates data transfer via Globus Online, workflow management in Galaxy, and scalable compute resources in AWS.
2) An example collaboration with the Dobyns Lab saw over a 10x speedup in exome data analysis by replacing a manual process with Globus Genomics.
3) Globus Genomics leverages XSEDE services like Globus Transfer and Nexus while integrating additional resources like sequencing centers and cloud computing, in order to reduce the costs and complexities of genomic research for communities not traditionally using advanced cyberinfrastructure.
Cloud-based Linked Data Management for Self-service Application Development (Peter Haase)
Peter Haase and Michael Schmidt of fluid Operations AG presented on developing applications using linked open data. They discussed the increasing amount of linked open data available and challenges in building applications that integrate data from different sources and domains. Their Information Workbench platform aims to address these challenges by allowing users to discover, integrate, and customize applications using linked data in a no-code environment. Key components of the platform include virtualized integration of data sources and the vision of accessing linked data as a cloud-based data service.
Enabling Secure Data Discoverability (SC21 Tutorial) (Globus)
Major research instruments are generating orders of magnitude more data in relatively short timeframes. As a result, the research enterprise is increasingly challenged by what should be mundane tasks: describing data for discovery and making data securely accessible to the broader research community. The ad hoc methods currently employed place undue burden on scientists and system administrators alike, and it is clear that a more robust, scalable approach is required.
Bespoke data portals (and science gateways/data commons) are becoming more prominent as a means of enabling access to large datasets. In this tutorial we demonstrate how services for authentication, authorization, metadata management, and search may be integrated with popular web frameworks and used in combination with fast, well-architected networks to make data discoverable and accessible. Outcomes: build a simple but functional data portal that facilitates flexible data description, faceted data search, and secure data access.
This document discusses a multi-tenant Hadoop-as-a-Service platform called HopsWorks that is available for free use at the SICS ICE research facility in Luleå, Sweden. Key points:
- SICS ICE is the world's first open data center dedicated to big data research, with resources like Hadoop/Spark/Flink available as a service.
- HopsWorks provides true multi-tenancy for Hadoop through project-specific user IDs and group IDs to isolate data and enforce access controls.
- Metadata is kept consistent through the use of a distributed database, with foreign keys ensuring integrity when projects or data sets are modified or deleted.
Data continues to grow exponentially, especially with the advent of social content. Approximately 70% of data is unstructured. This impacts storage costs and management, data protection, and SLAs.
New deployment options such as cloud provide alternatives, but how do you know what you should move to the cloud?
Introduction to Globus: Research Data Management Software at the ALCF (Globus)
This document provides an introduction and overview of Globus, a research data management platform. It discusses how Globus can be used to move, share, discover, and reproduce data across different storage tiers and resources. Globus delivers fast and reliable big data transfer, sharing, and platform services directly from existing storage systems via software-as-a-service using existing identities, with the goal of unifying access to data across different locations and resources. The document demonstrates how Globus can be used via its web interface, command line interface, REST API, and as a platform for building other research applications and workflows.
20160922 Materials Data Facility TMS Webinar
1. Ben Blaiszik (blaiszik@uchicago.edu), Kyle Chard, Rachana Ananthakrishnan, Michael Ondrejcek, Kenton McHenry
PIs: Ian Foster (foster@uchicago.edu), Steven Tuecke, John Towns
materialsdatafacility.org | globus.org
Materials Data Facility - Data Services to Advance Materials Science Research
4. Outline
• Overview
  § MDF Overview
  § Globus quick introduction
• MDF Data Publication Service
  § Key MDF data pub service features
  § Publication walk-through
• General Observations and Future Outlook
5. What is MDF?
We are developing production services to make it simpler for materials datasets and resources to be published, identified, described, curated, verifiable, accessible, preserved, discovered, searched, browsed, shared, recommended, and accessed.
[Figure: materials data lifecycle spanning SRD, publishable results, published results, resource data, reference data, derived data, and working data; adapted from Warren et al.]
7. Publication
• Identify datasets with persistent identifiers (e.g. DOI)
• Describe datasets with appropriate metadata and provenance
• Verify dataset contents over time
• Preserve critical datasets in a state that increases transparency and replicability, and encourages reuse
8. Discovery (Under Development)
• Search and query datasets in modern ways, e.g. via search against indexed metadata and harvested file contents, rather than by remembering opaque file paths
• Future: a Spotlight-like search over all data you have access to, regardless of location
9. Discovery (Under Development)
• SaaS cloud-hosted solution
• Logical metadata repository to index many external sources
• Flexible queries (boosting, full text, partial matches, etc.)
• Search results are limited by ACLs
10. Discovery (Under Development)
• All MDF-published datasets will be indexed
• May use common schemas (DataCite, Dublin Core, etc.) or domain-specific ones
• Globus endpoint contents may be indexed (owner-enabled)
• Index has the flexibility of no required schema
• Built on Elasticsearch for proven scalability and speed, hosted on scalable AWS resources
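Two of the design points above (a schema-free index and ACL-limited search results) can be sketched in a few lines. This is an illustrative toy, not the MDF/Elasticsearch implementation; the record layout and group names are invented:

```python
# Minimal sketch (not the actual MDF/Elasticsearch implementation) of two
# ideas above: metadata records carry no fixed schema, and search results
# are filtered by access-control lists before being returned.

def search(index, user_groups, term):
    """Return IDs of records visible to the user whose metadata mentions `term`."""
    hits = []
    for record in index:
        # ACL check: the record must be public or share a group with the user
        acl = set(record.get("acl", ["public"]))
        if "public" not in acl and not (acl & set(user_groups)):
            continue
        # Schema-free match: scan every metadata field for the term
        if any(term.lower() in str(v).lower() for v in record["metadata"].values()):
            hits.append(record["id"])
    return hits

index = [
    {"id": "ds1", "acl": ["public"],
     "metadata": {"title": "Steel fatigue dataset", "material": "steel"}},
    {"id": "ds2", "acl": ["polymer-group"],
     "metadata": {"title": "Polymer film scans", "instrument": "APS beamline"}},
]

print(search(index, user_groups=["students"], term="steel"))    # ['ds1']
print(search(index, user_groups=["students"], term="polymer"))  # [] (ACL blocks ds2)
```

A production index would also support boosting, partial matches, and full-text ranking, as the slide notes; the point here is only that the query layer, not the records, enforces visibility.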
13. Globus Platform-as-a-Service (PaaS)
Identity management
• Create and manage a unique identity linked to external identities for authentication
User groups
• Manage user group creation and administration flows
• Share data with user groups
Data transfer
• High-performance data transfer from a web browser
• Optimize transfer settings and verify transfer integrity
• Add your laptop to the Globus cloud with Globus Connect Personal
Data sharing
• Share directly from your storage device (laptop or cluster)
• File- and directory-level ACLs
Publication | Discovery
14. REST APIs, Clients, and Docs
• New version of core services released in Feb.
• New Python SDK available: https://github.com/globusonline/globus-sdk-python
• Jupyter notebook examples: https://github.com/globus/globus-jupyter-notebooks
• Sample data portal: https://github.com/globus/globus-sample-data-portal
• (alpha) MDF Data Publication Service API
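As a hedged illustration of what driving such REST APIs looks like, the sketch below builds (but does not send) an HTTP request for a transfer task. The endpoint IDs, token, and submission ID are placeholders, and the URL and body fields follow the publicly documented Globus Transfer API; verify against the current docs (or use the Python SDK linked above) before relying on them:

```python
# Hedged sketch of calling a REST API such as Globus Transfer directly over
# HTTP. The token, submission_id, and endpoint IDs are placeholders; the URL
# and field names follow the public Transfer API docs and should be checked
# against the current documentation.
import json
import urllib.request

TOKEN = "EXAMPLE_ACCESS_TOKEN"  # placeholder; obtain via Globus Auth

body = {
    "DATA_TYPE": "transfer",
    "submission_id": "EXAMPLE-SUBMISSION-ID",        # placeholder
    "source_endpoint": "EXAMPLE-SOURCE-ENDPOINT",    # placeholder UUID
    "destination_endpoint": "EXAMPLE-DEST-ENDPOINT", # placeholder UUID
    "DATA": [{"DATA_TYPE": "transfer_item",
              "source_path": "/~/dataset/",
              "destination_path": "/published/dataset/",
              "recursive": True}],
}

req = urllib.request.Request(
    "https://transfer.api.globus.org/v0.10/transfer",
    data=json.dumps(body).encode(),
    headers={"Authorization": f"Bearer {TOKEN}",
             "Content-Type": "application/json"},
    method="POST",
)
# req is only constructed here, not sent; submitting requires valid credentials
print(req.method, req.get_header("Content-type"))
```

In practice the Python SDK wraps exactly this kind of request, so application code rarely needs to build it by hand.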
15. Globus Background
[Diagram: you submit a transfer request between two secure endpoints (A, e.g. your laptop; B, e.g. midway); Globus moves the data for you and notifies you once the transfer is complete]
Endpoint
• E.g. a laptop or server running a Globus client (like a Dropbox client)
• Enables advanced file transfer and sharing
• Currently GridFTP; in the future, GridFTP + HTTP
Some Key Features
• REST API for automation and interoperability
• Web UI for convenience
• Optimizes and verifies transfers
• Handles auto-restarts
• Battle-tested with big data
16. Globus Web UI
19. Materials Data Publication/Discovery is Often a Challenge
[Diagram: Data Collection → Data Storage and Process → Publication, linking researchers who want to publish with those who want to discover/use. Don't put data under your desk!]
Needed to close the loop:
• Networked storage, sometimes many TB
• Unique identifiers for data search/citation
• Custom metadata descriptions
• A data curation workflow
• Automation capabilities
20. Materials Data Publication/Discovery is Often a Challenge
[Diagram: Data Collection → Data Storage and Process → Publication, linking researchers who want to publish with those who want to discover/use. Don't put data under your desk!]
• Need storage, sometimes many TB
• Need to uniquely identify data for search/cite
• Need custom metadata descriptions
• Need a data curation workflow
• Need automation capabilities
21. Collection Model
• Collections might be a research group or a research topic...
• Collections have specified:
  § Mapping to a storage endpoint (currently handled as automatically created shared endpoints)
  § Metadata schemas
  § Access control policies
  § Licenses
  § Curation workflows
• Collections contain:
  § Datasets (data + metadata)
• Metadata persistence:
  § Metadata log file kept with the dataset
  § Metadata replicated in the search index
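The metadata-persistence point can be sketched as follows. The log file name and entry layout are hypothetical, not MDF's format; the idea is simply that each record is appended to a log kept with the dataset and mirrored into a search index:

```python
# Sketch (hypothetical file name and format) of the metadata-persistence idea:
# every metadata record is appended to a log file kept alongside the dataset,
# and the same record is mirrored into an in-memory search index.
import json
import tempfile
from pathlib import Path

def record_metadata(dataset_dir, index, entry):
    """Append `entry` to the dataset's metadata log and mirror it in `index`."""
    log = Path(dataset_dir) / "metadata.log"   # name is illustrative
    with log.open("a") as f:
        f.write(json.dumps(entry) + "\n")      # one JSON object per line
    index[entry["doi"]] = entry                # replicate for search

index = {}
with tempfile.TemporaryDirectory() as d:
    record_metadata(d, index, {"doi": "10.5555/example", "title": "Steel fatigue"})
    lines = (Path(d) / "metadata.log").read_text().splitlines()
    print(len(lines), index["10.5555/example"]["title"])
```

Keeping the log with the dataset means the metadata survives even if the central index is rebuilt, which is the rationale for replicating rather than storing metadata only in the index.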
22. Hybrid Distributed Model
[Diagram: a centralized, cloud-hosted metadata index and tools connected to distributed Globus endpoints: Petrel @ Argonne (1.7 PB, DOE), a BlueWaters condo @ UIUC (100 TB, NSF/XSEDE), campus research data storage, and project endpoints such as ElectroCat]
23. Publish Large Datasets
• Distributed data model leverages Globus production capabilities for file transfer (i.e. dataset assembly), user authentication, and access control groups
• 100s of TB of reliable storage at NCSA, and more storage at Argonne
  § Globus endpoint at ncsa#mdf on Nebula
  § Expandable to many PBs as necessary
  § Automated tape backup for reliability (in progress)
• Researchers can optionally use their own local or institutional storage
24. Uniquely Identify Datasets
• Associate a unique identifier with a dataset (e.g. DOI, Handle)
• Improve dataset discovery and citability
  § Aligning incentives and understanding the culture will be critical to driving adoption
Future: [chart of dataset downloads over time] "Your work has been cited 153 times in the last year"; "Researchers from 30 institutions have downloaded your datasets"
25. Share Data with Flexible ACLs
• Share data publicly, with a set of users, or keep data private
Leverage Curation Workflows
• Collection administrators can specify the level of curation workflow required for a given collection, e.g.:
  § No curation
  § Curation of metadata only
  § Curation of metadata and files
26. Customize Metadata
• Build a custom metadata schema for your specific research data
• Re-use existing metadata schemas
• Working in conjunction with NIST researchers to define these schemas
Future: can we build a system that allows schema:
  § Inheritance (e.g. a schema "polymers" might inherit from and expand upon the NIST "base material" schema)
  § Versioning (e.g. understand contextually how to map fields between versions)
  § Dependence (e.g. the ability to build consensus around schemas)
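The inheritance idea can be sketched with plain dictionaries. The schemas below are invented for illustration and are not the NIST definitions:

```python
# Sketch of schema inheritance (hypothetical schemas, not NIST definitions):
# a child schema starts from its parent's fields and adds or overrides its own.

def resolve_schema(schemas, name):
    """Flatten a schema by merging in its ancestors' fields, child wins."""
    schema = schemas[name]
    fields = {}
    if schema.get("inherits"):
        fields.update(resolve_schema(schemas, schema["inherits"]))
    fields.update(schema["fields"])
    return fields

schemas = {
    "base_material": {"fields": {"composition": "string", "source": "string"}},
    "polymers": {"inherits": "base_material",
                 "fields": {"glass_transition_temp_K": "number"}},
}

print(sorted(resolve_schema(schemas, "polymers")))
# ['composition', 'glass_transition_temp_K', 'source']
```

Versioning and dependence would layer on top of the same resolution step, e.g. by keying `schemas` on (name, version) pairs.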
28. Example Use Case: Publishing Big, Remote Data
• Collected multiple TB of data at a light source
• Bundle the data with metadata and provenance
• Want a citable DOI to share the raw and derived data with the community
• Want the data to be discoverable by free-text search and custom metadata
30. MDF Collections
Recall: policies are set at the collection level
• Required metadata and schemas
• Data storage location
• Metadata curation policies
31. MDF Metadata Entry
• The scientist or a representative describes the data they are submitting
• For this collection, Dublin Core and a custom metadata template are required
32. MDF Custom Metadata
33. Dataset Assembly
• A shared endpoint is auto-created on the collection-specified data store (e.g. UIUC Nebula)
• The scientist transfers dataset files from their own endpoint (e.g. NU) to a unique publish endpoint
• The dataset may be assembled over any period of time
• When submission is finished, the dataset is rendered immutable via checksum
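The final step, rendering a finished dataset immutable via checksum, can be sketched like this; the manifest format is illustrative and MDF's actual mechanism may differ:

```python
# Sketch of rendering a dataset immutable via checksums: hash every file into
# a manifest at submission time, then verify later that nothing has changed.
# (Illustrative only; MDF's actual mechanism may differ.)
import hashlib
import tempfile
from pathlib import Path

def manifest(dataset_dir):
    """Map each file's relative path to its SHA-256 digest."""
    root = Path(dataset_dir)
    return {str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(root.rglob("*")) if p.is_file()}

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "data.csv").write_text("strain,stress\n0.01,210\n")
    frozen = manifest(d)                   # recorded when submission finishes
    unchanged = (manifest(d) == frozen)    # True: dataset verifies
    (Path(d) / "data.csv").write_text("tampered")
    detected = (manifest(d) != frozen)     # True: modification detected
print(unchanged, detected)                 # True True
```

The same manifest supports the "verify dataset contents over time" goal from the Publication slide: re-hashing and comparing is enough to prove integrity.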
35. Dataset Curation (Optional)
• Curation is optionally specified in the collection configuration
• Submissions can be approved or rejected (i.e. sent back to the submitter)
40. Publication Year 1 Milestones
• Opened to the public in March 2016
• Provisioned reliable storage to support researchers sharing open materials data (~200 TB)
• MDF data volume approaching ~6 TB of materials data
• Started building deep relationships with many of the key materials-data-generating groups and communities
• Ingested a dataset >1 TB in size
• Ingested a dataset with >1.5M files
41. Integration with the Community is Key
[Diagram: MDF exchanges metadata, publishing, and compute services with Materials Project, OQMD, Citrination, Materials Commons, NCSA-PIRE, HV/TMS, MBDH, other facilities (APS, SNS, NSLS, ...), institutional repositories, and publishers]
42. Understanding Incentives is Critical
Meeting award requirements
• Simplify DMP compliance
Smoothing dislocations
• [Distance] Enable simple sharing among collaborators (near and far)
• [Personnel] Ease transitions between students
• [Format] Lessen the need for ad hoc resource sharing (e.g. via group websites)
Increasing impact
• Increase paper citations1
• Add dataset citation capabilities
1 Citation increase of 30% (10.7717/peerj.175) to 60% (10.1371/journal.pone.0000308) [caveat: bio research]
43. Lessons Learned
• The demand is there from researchers and institutions
• Lots of cross-over with centers and projects
  § (NIST) CHiMaD
  § (DOE) ElectroCat, MICCoM, JCESR, PRISMS, Argonne IT, Integrated Imaging Institute
  § (NSF) T2C2 [DIBBS], AMI-CFP (PIRE), HV/TMS (I/UCRC), BD Hubs, IMaD BD Spoke*
• Data heterogeneity is a challenge
  § Metadata is the major sticking point
• Friction points
  § Need more flexible data objects, e.g. {"temperature": 100, "unit": "K"}
  § Need file- or directory-based metadata
  § Immutable datasets alone are not enough → versioning
  § Data gathering in retrospect
  § Schema generation and interoperability (working with and following developments at NIST, RDA, Citrine et al.)
  § Differing institutional approval processes
  § Lack of a programmatic interface (planned)
• Support for data interactivity and visualization
• Smart versioning for large file-based datasets
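The flexible-data-object friction point can be sketched as a small value type that carries its unit with it (illustrative only; MDF does not prescribe this representation):

```python
# Sketch of a "flexible data object": a value carrying its unit alongside the
# number, rather than a bare scalar. (Illustrative; not an MDF data model.)
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str

    def to(self, unit):
        """Convert between a few temperature units, enough for the example."""
        if (self.unit, unit) == ("K", "C"):
            return Quantity(self.value - 273.15, "C")
        if (self.unit, unit) == ("C", "K"):
            return Quantity(self.value + 273.15, "K")
        raise ValueError(f"no conversion {self.unit} -> {unit}")

t = Quantity(100, "K")            # the {"temperature": 100, "unit": "K"} example
print(round(t.to("C").value, 2))  # -173.15
```

Attaching the unit to the value is what lets an aggregator combine datasets recorded in different conventions without silent errors.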
44. Wider Data Community
The wider data community offers:
• Curated and described datasets
• Well-posed problems
• A community to share analyses
• Challenges to start "sprints"
• Great APIs and clients
• Examples to get started
• Hundreds of video tutorials
Materials repositories (Materials Project, OQMD, Citrination, Materials Commons), by contrast, have:
• Less inherently intuitive problems
• Sometimes need advanced compute capabilities
• Often many TB of data
45. Broader Trends
• Continuous integration, QA, and testing
• Containerized solutions, microservice architecture, abstracting software from hardware
• Automation
• Internet of Things (IoT): connect everything
• Machine learning / AI
• Natural language processing (Siri, chatbots or "slack"bots, etc.)
• Search rules the world (ok, this was 20 years ago...)
What are the analogs and applications in the materials community?
47. Use Case: Generator-Consumer Scenario
• Data generator
  § Generates data periodically (perhaps from an instrument)
  § Pushes data to a public channel
  § Schema is validated before inclusion in the channel stream
• Data consumer
  § Polls the channel periodically
  § Wants to pull datasets by property
[Diagram: a data generator creates datasets into an "MDF-composites" channel; a data consumer queries the channel and receives matching results]
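The scenario above can be sketched as follows; the channel name, schema fields, and API are all hypothetical, not an MDF interface:

```python
# Minimal sketch of the generator-consumer channel (hypothetical channel and
# schema, not an MDF API): datasets are schema-validated on the way in, and
# consumers pull by property.
REQUIRED_FIELDS = {"name", "material"}   # assumed schema for the example

class Channel:
    def __init__(self):
        self.datasets = []

    def push(self, dataset):
        """Validate against the schema before admitting to the stream."""
        if not REQUIRED_FIELDS <= dataset.keys():
            raise ValueError("dataset fails schema validation")
        self.datasets.append(dataset)

    def query(self, **props):
        """Pull datasets matching every given property."""
        return [d for d in self.datasets
                if all(d.get(k) == v for k, v in props.items())]

ch = Channel()                                    # e.g. "MDF-composites"
ch.push({"name": "run-001", "material": "CFRP"})  # generator side
ch.push({"name": "run-002", "material": "steel"})
print([d["name"] for d in ch.query(material="CFRP")])  # ['run-001']
```

Validating at push time keeps the stream clean for every consumer, which is the point of the "schema is validated before inclusion" step.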
49. Aggregate, Perform ML
• Combine a cloud-published dataset, scikit-learn, and pandas to predict steel fatigue and "reproduce" data from a journal publication
50. Aggregate, Perform ML, Visualize
• Combine a cloud-published dataset, scikit-learn, and pandas to predict steel fatigue and validate the journal publication
51. What’s Currently Available?
• Web interface to support data publication (public-facing APIs coming soon)
• 100s of TB of storage at NCSA (scalable to many PB); more at Argonne (1.7 PB total on Petrel, not all for materials...)
• Help with developing metadata schemas to describe your research datasets
MDF tutorial on GitHub: https://github.com/blaiszik/materials-data-facility-training
52. What Are We Looking For?
• Early adopters willing to get their hands dirty with the service and give honest feedback
• Key integration points where metadata is picked up automatically
• Key datasets and resources of all sizes and shapes, raw or derived, that might help us understand the process better
53. Thanks to Our Sponsors!
U.S. Department of Energy
54. Publication and Discovery REST APIs
Publication
• Identify datasets with persistent identifiers (e.g. DOI)
• Describe datasets with appropriate metadata and provenance
• Verify dataset contents over time
• Handle big (and small) data: we have already ingested datasets with >1.5M files and >1 TB in size
Discovery
• Search and query datasets in modern ways
• Index metadata and harvest file contents
• Simple user interfaces (i.e., after Google and Amazon)
Opened to external users in Mar. 2016; ~6 TB of data published
materialsdatafacility.org