Describes a set of scholarly communications use cases for ResourceSync and presents the development and integration of the Publisher Connector in CORE. By Petr Knoth
sparqlPuSH: Proactive notification of data updates in RDF stores using PubSub... (Alexandre Passant)
Presentation @ SFSW2010 (ESWC2010 Workshop). Paper available at semanticscripting.org/SFSW2010/papers/sfsw2010_submission_6.pdf + video at http://apassant.net/blog/2010/04/18/sparql-pubsubhubbub-sparqlpush#comments
This document summarizes a project to convert biomedical databases like Reactome, CHEBI, UniProt and GO into JSON-LD format and load them into Elasticsearch for full-text search and exploration in Siren Investigate. Key databases were extracted via APIs, converted to JSON-LD using Elasticsearch pipelines, and loaded into Elasticsearch. Visualizations and a relational data model were then created in Siren Investigate to allow faceted browsing and exploration of relationships between datasets. The project demonstrated an effective method for integrating and exploring life science knowledge graphs. Future work includes the Kibio.science project to apply these techniques on their own infrastructure.
Gateways 2020 Tutorial - Instrument Data Distribution with Globus (Globus)
We describe the requirements for, and challenges of, distributing datasets at scale, e.g. from instruments such as CryoEM and advanced light sources. We demonstrate a web application that uses Globus to perform large-scale data distribution. We introduce and walk through a Jupyter notebook highlighting the relevant code to incorporate into a science gateway.
Gateways 2020 Tutorial - Large Scale Data Transfer with Globus (Globus)
We describe the large-scale data transfer scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform large-scale data transfers, and walk through a code repository with the web application’s code.
Gateways 2020 Tutorial - Automated Data Ingest and Search with Globus (Globus)
We describe the automated data ingest scenario, referencing current and past research teams and their challenges. We demonstrate a web application that uses Globus to perform automated data ingest and present a faceted search interface that can be used by science gateways to simplify data discovery. We also walk through the application's GitHub repository and highlight relevant components.
This document discusses log aggregation tools and systems. It provides brief examples of Logstash, Splunk, and Fluentd which are open source tools that can centralize, transform, and store log data from various sources. These tools help create dashboards to more easily perform tasks like troubleshooting, audits, and refactoring using log data. The tools can obtain logs through REST APIs, syslog, or file forwarding.
The document summarizes a project analyzing connections between GitHub users by constructing a graph based on user collaborations on repositories. Over 1TB of data including users, followers, repositories and events was processed. A graph with users as vertices and collaborations as edges was created and analyzed to find clusters using connected components and influential users using PageRank. Challenges included unstructured data schemas and memory issues when processing the large dataset.
"What's New With Globus" Webinar: Spring 2018Globus
In this presentation from June 26, 2018, Globus co-founder Steve Tuecke discussed Globus Connect Server 5.1 with HTTPS file access; plans for new premium storage connectors; upcoming publication services including the new Globus Search and Identifiers services; the new Globus Web App, SSH with Globus Auth, and more.
Automating Research Data Management at Scale with Globus (Globus)
Research computing facilities, such as the national supercomputing centers, and shared instruments, such as cryo electron microscopes and advanced light sources, are generating large volumes of data daily. These growing data volumes make it challenging for researchers to perform what should be mundane tasks: move data reliably, describe data for subsequent discovery, and make data accessible to geographically distributed collaborators. Most employ some set of ad hoc methods, which are not scalable, and it is clear that some level of automation is required for these tasks.
Globus is an established service from the University of Chicago that is widely used for managing research data in national laboratories, campus computing centers, and HPC facilities. While its intuitive web app addresses simple file transfer and sharing scenarios, automation at scale requires integrating Globus data management platform services into custom science gateways, data portals and other web applications in service of research. Such applications should enable automated ingest of data from diverse sources, launching of analysis runs on diverse computing resources, extraction and addition of metadata for creating search indexes, assignment of persistent identifiers, faceted search for rapid data discovery, and point-and-click downloading of datasets by authorized users — all protected by an authentication and authorization substrate that allows the implementation of flexible data access policies for both metadata and data alike.
We describe current and emerging Globus services that facilitate these automated data flows while ensuring a streamlined user experience. We also demonstrate Petreldata.net, a data management portal and gateway to multiple computing resources, that supports large-scale research at the Advanced Photon Source.
This document describes a data pipeline for processing large amounts of web crawl data from Common Crawl. It discusses ingesting over 160 TB of data per month from Common Crawl into AWS S3 storage and then using batch processing with T4 instances to index the data in Elasticsearch and store metadata in Cassandra. It also describes querying the hybrid database and some of the engineering challenges around approximating page rank with low latency.
FY'16 Library of Congress Storage Environment (Carl Watts)
The Library of Congress is continuing to see significant growth in digital content. Over the past 12 months, online access grew by 48% to 525 terabytes. Long-term storage grew by 26% to 2.9 petabytes over the fiscal year. Migrations of data from older storage systems to new systems like the IBM GPFS and Oracle tape libraries are ongoing but challenging projects. The Library tested several object storage systems in 2016 and found that each system had different capabilities and results. Factors like consistency, intended use, and system sizing need to be considered when selecting an object storage solution.
Institutional repositories - Ken Scott (Georgetown University Qatar) - #OAWe... (QScience)
Presentation by Ken Scott (Associate Library Director - Access and Media Services) at Georgetown University - Qatar on Open Access Institutional Repositories -
Part of QScience.com's Open Access Week Event: Discover Open Access with QScience.com - held at Hamad bin Khalifa University Student Center, Education City, Doha on 22nd October 2014
http://www.qscience.com/page/OAweek2014
So, what is the ELK Stack? "ELK" is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server‑side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a "stash" like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch.
SHARE is building a free, open dataset about the entire research lifecycle. It uses the Open Science Framework (OSF) to collect and store this data. The presentation demonstrates SHARE's search API, which allows querying the dataset using Elasticsearch queries. An example shows aggregating the top tags used in the dataset. The results return the top tags and the number of documents associated with each tag, with "ecological" being the most common tag. SHARE is developing a Python library to make interacting with the search API easier by handling the JSON request/response. The library can convert the Elasticsearch response into a dataframe for further analysis or visualization of the results.
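A minimal sketch of the kind of aggregation query such a library wraps, assuming an Elasticsearch-compatible search endpoint; the URL, field name, and response shape are illustrative rather than SHARE's exact API:

```python
import requests
import pandas as pd

# Hypothetical SHARE search endpoint; the real URL and field names may differ.
SEARCH_URL = "https://share.osf.io/api/v2/search/creativeworks/_search"

# Elasticsearch terms aggregation: top tags and their document counts.
query = {
    "size": 0,
    "aggs": {"top_tags": {"terms": {"field": "tags", "size": 10}}},
}

resp = requests.post(SEARCH_URL, json=query, timeout=30)
resp.raise_for_status()
buckets = resp.json()["aggregations"]["top_tags"]["buckets"]

# Convert the aggregation buckets into a dataframe for further analysis.
df = pd.DataFrame(buckets).rename(columns={"key": "tag", "doc_count": "documents"})
print(df.head())
```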
The NISO Update provides the latest news about NISO's current efforts, including standards, recommended practices and community meetings covering many areas of interest to the library community. Working group members will provide updates on projects newly underway or recently completed
Logstash is an open source tool for collecting, parsing, and storing logs and other event data. It can input data from multiple sources, parse and transform the data, and output it to multiple destinations such as Elasticsearch. Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It allows storing, searching, and analyzing large volumes of data quickly and in near real-time. Together, Logstash can collect, parse, and enrich log files and output them to Elasticsearch for storage, search, and visualization, making log event data searchable and analyzable.
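As a rough illustration of that collect-parse-store flow, here is a minimal Python sketch that plays the role Logstash normally fills; the log format, index name, and local Elasticsearch node are assumptions:

```python
import re
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

# Assumed: a local Elasticsearch node and a simple "LEVEL message" log format.
es = Elasticsearch("http://localhost:9200")
LOG_PATTERN = re.compile(r"^(?P<level>[A-Z]+)\s+(?P<message>.*)$")

def ingest_line(line: str) -> None:
    """Parse one log line, enrich it with a timestamp, and index it."""
    match = LOG_PATTERN.match(line.strip())
    if not match:
        return
    event = match.groupdict()
    event["@timestamp"] = datetime.now(timezone.utc).isoformat()
    es.index(index="logs-demo", document=event)  # elasticsearch-py 8.x signature

ingest_line("ERROR disk quota exceeded on /data")
```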
The PoolParty Semantic Search Server is described from a technical perspective: how to use SKOS thesauri to map data from different sources, how to generate a semantic index, and how to build precise faceted search.
Visualize some of Austin's open source data using Elasticsearch with Kibana. ObjectRocket's Steve Croce presented this talk on 10/13/17 at the DBaaS event in Austin, TX.
This document discusses data munging and analysis for scientific applications using Apache Big Data tools. It evaluates tools like Apache Hadoop, YARN, and Spark, and explores using the Airavata science gateway platform to enable collection of resources and application-centric workflows. As a use case, it presents a text analysis project called TextRWeb that uses parallel R on the web for large-scale text mining and analytics. The goals are to support interactive and iterative text analysis while hiding computational complexity. It explores integrating TextRWeb with Spark and Airavata for high-performance computing jobs and developing Apache Thrift interfaces.
Gateways 2020 Tutorial - Introduction to Globus (Globus)
Globus provides a platform and services for simplifying data management and sharing for science gateways and applications. It offers fast and reliable file transfers between any storage systems, secure data sharing without copying data, and APIs and SDKs for building applications. Globus uses OAuth authentication and supports a variety of interfaces like CLI, Python SDK, and Jupyter notebooks to enable access.
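A hedged sketch of what such an integration looks like with the Globus Python SDK; the access token and endpoint UUIDs are placeholders (the UUIDs shown are the public Globus tutorial endpoints), and a real gateway would obtain tokens through Globus Auth:

```python
import globus_sdk

# Placeholders: a transfer access token (from Globus Auth) and endpoint UUIDs.
TRANSFER_TOKEN = "..."
SRC_ENDPOINT = "ddb59aef-6d04-11e5-ba46-22000b92c6ec"  # Globus Tutorial Endpoint 1
DST_ENDPOINT = "ddb59af0-6d04-11e5-ba46-22000b92c6ec"  # Globus Tutorial Endpoint 2

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(TRANSFER_TOKEN)
)

# Describe the transfer: one directory, copied recursively between endpoints.
task = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="gateway demo")
task.add_item("/datasets/run42/", "/ingest/run42/", recursive=True)

result = tc.submit_transfer(task)
print("Submitted transfer task:", result["task_id"])
```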
Populate your Search index, NEST 2016-01 (David Smiley)
This document discusses considerations for populating a search index. It covers topics like how to get data into the index, backups, scheduling and monitoring indexing, real-time search requirements, and common software used for crawlers and pipelines. Specific approaches are suggested for bulk indexing, incremental indexing, detecting deletes, and taking backups. The challenges of document transformations and mapping source data to search documents are also addressed. Open-source ETL software options like Clover ETL, Pentaho, and Talend are briefly summarized, with Talend and Apache NiFi receiving more detailed overviews.
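A small sketch of the bulk and incremental indexing approaches discussed, using the elasticsearch-py bulk helper; the index name and document shape are illustrative:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

def bulk_index(docs):
    """Initial bulk load: stream all source documents into the index."""
    actions = (
        {"_index": "catalog", "_id": doc["id"], "_source": doc} for doc in docs
    )
    bulk(es, actions)

def incremental_update(changed, deleted_ids):
    """Incremental pass: re-index changed documents and delete removed ones."""
    actions = [
        {"_op_type": "index", "_index": "catalog", "_id": d["id"], "_source": d}
        for d in changed
    ] + [
        {"_op_type": "delete", "_index": "catalog", "_id": i} for i in deleted_ids
    ]
    bulk(es, actions, raise_on_error=False)

bulk_index([{"id": "1", "title": "First record"}])
incremental_update([{"id": "1", "title": "First record (revised)"}], deleted_ids=[])
```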
BBC News Labs at ISKO Conference, UCL, London - July 2013 (BBC News Labs)
The document discusses the BBC moving to linked data and embedding metadata in news articles. It describes pilots tagging content with topics, using concept extraction tools, and publishing machine-readable metadata using RDFa. The next steps involve rolling out tagging to more journalists and integrating full metadata markup.
Logs, metrics and real-time data analytics (Ewere Diagboya)
The document discusses logs, metrics, real-time data analytics, and tools used for collecting and analyzing such data. It defines what logs and metrics are, and names several tools for collecting and visualizing metrics as well as analyzing real-time data, including the ELK stack. The ELK stack consists of Elasticsearch for storage, Logstash for parsing, and Kibana for analytics and visualization. The document also provides examples of how the ELK stack works and how Terragon has applied it to various applications.
We present a traffic analytics platform for servers that publish Linked Data. To the best of our knowledge, this is the first system that mines access logs of registered Linked Data servers to extract traffic insights on a daily basis and without human intervention.
An Open Talk at DeveloperWeek Austin 2017 by Kimberly Wilkins (@dba_denizen), Principal Engineer - Databases at ObjectRocket. Featuring new use cases like Bitcoin, AI, IoT, and all the cool things.
eXtensible Catalog (afternoon session) Integrated Search Towards Catalogue 2.0
July 31, 2009
Digital Libraries à la Carte 2009
Tilburg University, the Netherlands
OSFair2017 training | Machine accessibility of Open Access scientific publica... (Open Science Fair)
Petr Knoth talks about machine accessibility of Open Access scientific publications from publisher systems via ResourceSync
Training title: TDM unlocking a goldmine of information
Training overview:
Text and Data Mining (TDM) is a natural ‘next step’ in open science. It can lead to new and unexpected discoveries and increase the impact of publications and repositories. This workshop showcases examples of successful TDM and infrastructural solutions for researchers. We will also discuss what is needed to make the most of infrastructures and how publishers and repositories can open up their content.
DAY 2 - PARALLEL SESSION 4 & 5
COAR Next Generation Repositories WG - Text mining and Recommender system sto... (petrknoth)
One of the key aims of the COAR NGR group is to help us to overcome the challenges that still make it difficult to move beyond repositories as document silos. The group wants to see a globally interoperable network of repositories and global services built on top of repositories fulfilling the expectations of users in the 21st century. During this talk, I will address two use cases the COAR NGR working group aims to enable: text and data mining and recommender systems.
Presentation of the CORE APIv3 which provides seamless programmable access to the metadata and content from across the global repositories network delivered at Open Repositories 2022.
The document discusses the Open Archive Initiative Protocol for Metadata Harvesting (OAI-PMH). It describes OAI-PMH as a standard that allows data providers to make metadata available via HTTP so that service providers can harvest the metadata to develop value-added services. It provides details on the various requests and operations that are part of the OAI-PMH protocol. The document also discusses some implementation issues and examples of service providers that utilize OAI-PMH harvested metadata.
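A minimal sketch of an OAI-PMH harvest using the ListRecords verb and resumptionToken paging; the base URL is a placeholder and a production harvester would also handle protocol errors and retries:

```python
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://example.org/oai"  # placeholder OAI-PMH base URL

def list_records(endpoint, metadata_prefix="oai_dc"):
    """Yield <record> elements, following resumptionToken paging."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(requests.get(endpoint, params=params, timeout=60).content)
        for record in root.iter(f"{OAI}record"):
            yield record
        token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return
        # Per the protocol, only the resumptionToken is sent on follow-up requests.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for rec in list_records(ENDPOINT):
    header = rec.find(f"{OAI}header/{OAI}identifier")
    print(header.text if header is not None else "(no identifier)")
    break  # demo: stop after the first record
```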
The document discusses next generation repositories that aim to improve discovery, interoperability, and functionality of existing repository systems. It identifies key priorities like exposing identifiers, enabling batch and navigation discovery, supporting user interactions through annotation and commenting, and collecting usage activities. Technologies like ResourceSync and Signposting are highlighted to enhance areas like notification and metadata exposure. The goal is a global network of interoperable repositories that empower open scholarship.
Literature Services Resource Description Framework (Jee-Hyub Kim)
This document describes Europe PMC's text mining pipeline and efforts to publish the text mining data as RDF. Europe PMC mines over 30 million biomedical abstracts and 3 million full text articles to extract biological entities and link them to databases. It is developing an RDF service to provide programmatic access to the billion text mining annotations with sentence and section level contexts. The service aims to support database curation and interoperability with other text mining formats.
CORE aggregates open access content from repositories worldwide, enriches it through text extraction and metadata cleaning, and provides access through search APIs and other services. It currently indexes over 50 million records and aims to make repository content more discoverable and usable for applications like text mining. The CORE dashboard will give repositories more control over their harvested metadata and statistics on usage. CORE coordinates with other Jisc services like IRUS-UK and Publication Router to improve functionality.
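As a rough illustration, an application might query CORE's search API as sketched below; the endpoint, parameters, and response fields differ between API versions, so treat these names as assumptions and check the current CORE API documentation:

```python
import requests

API_KEY = "YOUR_CORE_API_KEY"  # assumed: a registered CORE API key
# Assumed v3-style endpoint; consult the current CORE API documentation.
SEARCH_URL = "https://api.core.ac.uk/v3/search/works"

resp = requests.get(
    SEARCH_URL,
    params={"q": "resourcesync repositories", "limit": 5},
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
for work in resp.json().get("results", []):
    # Field names are illustrative of the metadata CORE exposes.
    print(work.get("title"), "-", work.get("downloadUrl"))
```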
OpenAIRE Guidelines for data providers: new Metadata Application Profile for ... (OpenAIRE)
Presentation at the "OpenAIRE webinar series for repository managers 2017/2018" - Nov. 14, 2017 (11h00 CET) | "OpenAIRE Guidelines for data providers: new Metadata Application Profile for Literature Repositories", presented by Jochen Schirrwagen, Univ. Bielefeld.
OpenAIRE Content Providers Community Call, July 1st, 2020
This call was focused on Data Repositories namely the OpenAIRE Research Graph and Data Repositories, the OpenAIRE Content Acquisition Policy, and the Guidelines for Data Archive Managers.
Was also an opportunity to share the most recent updates and novelties in the OpenAIRE Content Provider Dashboard, and to get feedback from community.
Follow the Community activities at https://www.openaire.eu/provide-community-calls
CrossRef provides a text and data mining hub for researchers. It has built a cross-publisher API to allow researchers to access full text content from participating publishers for open access or subscribed content using a common protocol. The API addresses issues like negotiating permissions by including licensing information in article metadata and a registry of text and data mining terms and conditions. Over 14 million articles from publishers now include full-text links and license information to enable text and data mining through the CrossRef API.
The field of Text and Data Mining (TDM) is growing in importance with an increasing number of researchers interested in mining scholarly content. CrossRef Text and Data Mining Services launched in May 2014 and focuses on providing one common way to retrieve the full text of articles for the purposes of TDM for interested parties. This session will provide an introduction to and update on this service, and a short demonstration of it in action.
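A hedged sketch of how a TDM client can pull those full-text links and license terms from the Crossref REST API for a single DOI; the DOI is a placeholder, and the "link" and "license" fields appear only when the publisher has deposited them:

```python
import requests

DOI = "10.1016/j.example.2014.01.001"  # placeholder DOI
resp = requests.get(f"https://api.crossref.org/works/{DOI}", timeout=30)
resp.raise_for_status()
message = resp.json()["message"]

# Full-text links deposited by the publisher (e.g. intended for text mining).
for link in message.get("link", []):
    print(link.get("intended-application"), link.get("content-type"), link.get("URL"))

# License URLs, which a TDM client can check before downloading content.
for lic in message.get("license", []):
    print("license:", lic.get("URL"))
```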
The Other Side of the Journal ToCs Interface (Phil Barker)
Presentation given to Journal ToCs workshop on 20 Nov 2009, examining where the Journal ToCs API fits into the repository ecology: what is its role and how might it interact with institutional repository systems.
Open Archives Initiative for Metadata Harvesting (Nikesh Narayanan)
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) provides a simple but effective mechanism for metadata harvesting. It allows service providers to aggregate content from data providers to build value-added services. The OAI-PMH uses HTTP and XML to share metadata in any agreed format, with Dublin Core as a baseline. It defines a set of verbs and standards for harvesting metadata from repositories in a consistent way. This interoperability has helped surface resources and build services across independently developed digital libraries.
OpenAIRE guidelines and broker service for repository managers - OpenAIRE #OA... (OpenAIRE)
Presentation by Pedro Principe and Paolo Manghi at the OpenAIRE Open Access week webinar. Friday October 28, 2016. Webinar on Openaire compatibility guidelines and the dashboard for Repository Managers, with Pedro Principe (University of Minho) and Paolo Manghi (CNR/ISTI).
From Open Access Metadata to Open Access Content: Two Principles for Increase... (petrknoth)
1. The document discusses two principles for increasing the visibility of open access content by moving from just open access metadata to open access content.
2. The first principle is that repositories should always provide a link in the metadata from each record to the full content item. This ensures the content is discoverable and accessible.
3. The second principle is that repositories should provide universal access to machines to harvest and index the full content, similar to the level of access provided to humans, in order to fully realize the benefits of open access such as reuse and text mining.
Open Archives Initiative Object Reuse and Exchange (lagoze)
This document discusses infrastructure to support new models of scholarly publication by enabling interoperability across repositories through common data modeling and services. It proposes building blocks like repositories, digital objects, a common data model, serialization formats, and core services. This would allow components like publications and data to move across repositories and workflows, facilitating reuse and new value-added services that expose the scholarly communication process.
The Open Archives Initiative Protocol for Metadata Harvesting and ePrints UK (Andy Powell)
UKOLN is a center of expertise in digital information management supported by various organizations. The document discusses the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), including its history and how it allows harvesting of metadata from data providers by service providers through a simple protocol. It also discusses the potential impact of OAI-PMH on institutions, libraries, and researchers.
The Open Archives Initiative (OAI) is a framework that deals with interoperability standards for digital resources by defining a protocol called OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). OAI-PMH allows metadata to be harvested from data providers and aggregated by service providers to provide cross-repository searching. It uses HTTP and XML to make repositories and their metadata interoperable. Many digital repositories and libraries use OAI-PMH to make their metadata openly available and searchable across systems.
ResourceSync: Web-based Resource Synchronization (Simeon Warner)
ResourceSync is a framework for synchronizing web resources between systems. The core team is developing standards for baseline synchronization using inventories, incremental synchronization using changesets, and push notifications using XMPP. The framework is based on reusing and extending existing sitemap formats to describe resources and changes in a modular way. Experiments show it can scale to synchronize large datasets like DBpedia and arXiv. Feedback is being solicited throughout 2012 to finalize the specifications.
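A minimal sketch of the baseline synchronization step: fetch a Resource List (which reuses the sitemap format) and read each resource's URI and last-modified time; the list URL is a placeholder, and a complete client would start from the Capability List and also process Change Lists:

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RESOURCELIST_URL = "https://example.org/dataset1/resourcelist.xml"  # placeholder

root = ET.fromstring(requests.get(RESOURCELIST_URL, timeout=60).content)

# Each <url> entry describes one resource to copy during baseline synchronization.
for url in root.iter(f"{SITEMAP_NS}url"):
    loc = url.find(f"{SITEMAP_NS}loc")
    lastmod = url.find(f"{SITEMAP_NS}lastmod")
    print(loc.text, lastmod.text if lastmod is not None else "")
```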
Similar to Seamless access to the world's open access research papers via ResourceSync
This document summarizes a presentation about text and data mining of scientific literature. It discusses the large and growing amounts of digital content and data being produced, and challenges around making sense of it all. It introduces text mining as an emerging solution to analyze and extract insights from unstructured text sources. The presentation describes the OpenMinted framework, which aims to create an open infrastructure for text and data mining services, tools, and annotated corpora. It discusses registering and discovering services, running jobs, and sharing results. Finally, it covers challenges around interoperability, legal issues, policies, and sustainability.
Resource sync overview and real-world use cases for discovery, harvesting, an... (openminted_eu)
This document summarizes an overview presentation about ResourceSync and its implementations at Hyku and the Digital Public Library of America (DPLA). Some key points:
- ResourceSync was developed as an update to OAI-PMH for synchronizing web resources between systems in a more flexible way. It supports resource lists, change lists, and dumps.
- Hyku has implemented ResourceSync publishing capabilities, and the DPLA has developed a harvester for the Hyku endpoint. This allows for incremental metadata updates rather than full resynchronization of data sets.
- Next steps include potentially supporting resource dumps in Hyku and harvesting from 3 DPLA providers using ResourceSync by the end of the year
Webinar slides: Interoperability between resources involved in TDM at the lev... (openminted_eu)
OpenMinTeD hosted a series of webinars on interoperability. These slides are of the webinar on the level of metadata. Full webinar recording accessible through: https://www.fosteropenscience.eu/content/achieving-interoperability-between-resources-involved-tdm-level-metadata
Text Mining: the next data frontier. Beyond Open Access (openminted_eu)
1) The presentation discusses the need for text and data mining (TDM) tools to make sense of the vast amount of digital data and literature being produced. It notes there are over 1.8 billion websites and 3.46 billion internet users producing large amounts of data daily. 2) Similarly, the global research community produces around 2.5 million new scholarly articles per year, but much of this work is never read or cited. 3) The presentation proposes establishing an open TDM platform called "OpenMinted" that would allow researchers to discover, share, and reuse knowledge extracted from text-based sources through the use of shared TDM services and tools.
This document discusses the work of the WG3 Legal Interoperability working group for the OpenMinTeD project. The goal of the working group is to study copyright and related rights restrictions on text and data mining (TDM) activities and identify contractual and licensing tools to support TDM. It outlines legal barriers like copyright and database rights, as well as exceptions and limitations. It also discusses the use of licenses to enable access and how policy choices could address limitations of licenses. The working group's deliverables will include a compatibility matrix of licenses and ongoing analysis presented in academic papers.
How can repositories support the text mining of their content and why? (openminted_eu)
This document discusses how repositories can support text and data mining (TDM) of their content. It provides three principles for repositories to follow: (1) establish direct links from metadata to the full text content, (2) provide universal access to harvesting systems at the same level as humans, and (3) ensure metadata is correctly referenced and content is accessible. The role of repositories is to aggregate research papers at full text to enable large-scale TDM by external services. However, many repositories currently do not fully support this due to issues like incomplete metadata records and non-dereferenceable identifiers.
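A small sketch of checking the first principle, that the link in a metadata record dereferences to the actual full text rather than a landing page; the URL is a placeholder:

```python
import requests

FULLTEXT_URL = "https://repository.example.org/record/123/document.pdf"  # placeholder

# A harvester can verify that the metadata link dereferences to content, not HTML.
resp = requests.head(FULLTEXT_URL, allow_redirects=True, timeout=30)
content_type = resp.headers.get("Content-Type", "")
if resp.ok and "pdf" in content_type.lower():
    print("Full text is directly accessible:", resp.url)
else:
    print("Link does not dereference to full text:", resp.status_code, content_type or "unknown")
```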
The document discusses the potential value of text and data mining UK theses. It notes that UK theses represent unique cutting-edge research not published elsewhere. The EThOS database contains metadata on over 430,000 UK theses totaling around 6 million pages of research annually. Several examples are provided of text and data mining projects that have extracted useful information from UK theses, such as identifying trends in dementia research and discovering new chemical compounds. While thesis metadata is openly available, accessing the full texts requires permission due to copyright. Overall, the document argues that UK theses represent a valuable untapped resource for text and data mining research.
OpenMinTeD - Repositories in the centre of new scientific knowledge (openminted_eu)
OpenMinted aims to establish an open text and data mining platform for researchers to discover, create, share and reuse knowledge from scholarly sources. It will provide interoperable services for machine reading, information extraction and predictive analysis of structured data from unstructured text. Key challenges include making content and services discoverable, interoperable, and addressing intellectual property rights. OpenMinted will build on existing repositories and language resources and technologies, and involve stakeholders from its inception to evaluate outcomes.
Jisc has invested in text mining capabilities and established the National Centre for Text Mining (NaCTeM) to fund various text aggregation projects. Jisc provides open access, bibliographic, and subscription management services that include text mining of over 25 million records and 600 journal titles in CORE and journal archives. There is potential to develop user-facing text mining applications using these combined data sets to unlock hidden information and develop new knowledge.
OpenMinTeD: Its Uses and Benefits for the Social Sciences (openminted_eu)
Presentation as presented at the ITOC workshop in Philadelphia, 20 February 2016.
Uses and Benefits for the Social Sciences research community.
By GESIS - Leibniz Institute for the Social Sciences
The document discusses text and data mining (TDM) projects in Europe. It describes how TDM can be used to understand the past by mining historical books, predict the future by mining newspapers, and save lives by mining scientific publications about diseases. It also outlines some current barriers to TDM in Europe like a lack of awareness, skills and tools, licensing and copyright issues. Two EU projects are highlighted: FutureTDM which aims to identify TDM barriers and policy solutions, and OpenMinTeD which builds a collaborative TDM infrastructure.
Infrastructure crossroads... and the way we walked them in DKPro (openminted_eu)
The document discusses natural language processing (NLP) infrastructure and challenges in text and data mining. It describes DKPro, an open-source collection of NLP tools that provides interoperability between projects. DKPro Core allows running NLP pipelines with no installation through dependency fetching. Challenges discussed include balancing data protection with interoperability and moving data and analytics as needs change. The talk proposes addressing these through open APIs and repositories to discover, access, deploy and retrieve analytics and their results.
OpenMinTeD: Making Sense of Large Volumes of Data (openminted_eu)
The document discusses making scientific content more accessible and useful through text and data mining. It notes that the global research community generates over 1.5 million new articles per year but many are never read or cited. Emerging solutions like machine reading, understanding and predicting can help structure and mine textual data to extract meaningful insights. The OpenMinted project aims to establish an open text and data mining platform and infrastructure for researchers to collaboratively work with scientific sources. It outlines challenges around content, services and processing as well as main routes to make content more accessible through metadata, transfer protocols and licensing. The project involves various partners and use cases across domains like scholarly communication, life sciences, agriculture and social sciences.
Experiences of Text Mining; the National Library of Austria perspective (openminted_eu)
Max Kaiser discusses text mining challenges for cultural heritage institutions using the Austrian National Library as a case study. The library has digitized over 600,000 volumes and made them available online through partnerships. While technology exists for tasks like named entity recognition and topic modeling, challenges remain in integrating unstable OCR text data into production systems due to evolving source materials and algorithms. User needs must also be understood to ensure text mining benefits cultural heritage.
Text and Data Mining at the Royal Library in the Netherlands (openminted_eu)
The Koninklijke Bibliotheek has a large collection of machine readable structured and semi-structured data that is the result of over 200 years of collecting, 30 years of digitization, and 10 years of collecting born-digital content. Examples of datasets include newspapers from 1840-1995 made available through an ngram viewer, political speeches from 1814 to present enriched and visualized, and radio bulletins developed through collaborations. Lessons learned are that researchers use the data in unexpected ways, collaborations provide insights, opening data creates new opportunities, and connections are built with the research community.
OpenMinTeD is an EU infrastructure project that aims to establish an open and sustainable text mining infrastructure. It will bring together accessible content, discoverable text mining services, and efficient processing capabilities. This will allow researchers to collaboratively create, discover, share and reuse knowledge extracted from a wide range of scientific text sources. The project involves 16 partners from 6 countries and will run for 3 years, starting in June 2015.
Seamless access to the world's open access research papers via ResourceSync
1. Seamless access to the world's open access research papers via ResourceSync
Petr Knoth
2. Use Case 1: ResourceSync as a seamless layer over heterogeneous APIs
3. Use Case 1: What is CORE?
[Diagram: OA repositories and OA journals feed CORE, mostly via OAI-PMH]
CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals.
4. Use Case 1: What is CORE?
[Diagram: OA repositories and OA journals feed CORE, mostly via OAI-PMH]
CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals.
» Enrichment and harmonisation of aggregated data
» Products/services:
› Portal
› API
› Data dumps
› Recommendation system for libraries
› Repository dashboard
› B2B and analytical services
5. Use Case 1: What is CORE?
[Diagram: OA repositories and OA journals feed CORE, mostly via OAI-PMH]
CORE aggregates and provides free access to millions of research articles aggregated from thousands of OA repositories and journals.
» 70 million+ metadata records
» Over 6 million full texts hosted on CORE
» ~1.5 million monthly active users
» Aggregating from 2,500 repositories and 10k OA journals
6. Use Case 1: Key issue
Key players do not provide interoperability for machine access to metadata and content of research papers.
[Charts: survey percentages on publisher practices, broken down by: accessing full-text by harvesting (the website, major search engines, recognised services upon approval); restricting access to full-text (don't restrict access in any way, specify a crawl delay, allow access to specific robots); reference of an article's full-text in metadata (direct link to full-text, interface supporting full-text transfer); content access standards (OAI, own API, Z39.50); file formats (PDF, HTML, plain text, JSON); automated downloads of OA full-text (website, API, FTP)]
7. Use Case 1: Approach
[Diagram: OA repositories and OA journals (mostly OAI-PMH); key publishers (OA + hybrid OA) reached through a range of bespoke APIs via the Publisher Connector; + many others]
Provide seamless access over non-standardised APIs. What protocol?
8. Use Case 1: Approach
[Diagram: OA repositories and OA journals (mostly OAI-PMH); key publishers (OA + hybrid OA) reached through a range of bespoke APIs via the Publisher Connector; + many others]
Provide seamless access over non-standardised APIs. What protocol?
» Why not OAI-PMH?
› Slow and very inefficient for big repositories.
› Standardised for metadata transfer but not for content transfer.
› Very difficult to represent the richness of metadata from a broad range of data providers.
9. Use Case 1: ResourceSync as a seamless access layer
[Diagram: OA repositories and OA journals (mostly OAI-PMH); key publishers (OA + hybrid OA) reached through a range of bespoke APIs via the Publisher Connector, exposed over ResourceSync; + many others]
» Very scalable implementation on both the server and client side
» Interpretation of metadata happens using existing pipeline at the aggregator.
» 1.5 million OA publications from Elsevier, Springer and others already exposed.
» Available at: https://publisher-connector.core.ac.uk/resourcesync
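As a rough sketch of how a client could consume this endpoint, the snippet below walks a ResourceSync document and lists the capabilities it links to; the element and namespace names follow the ResourceSync specification, but the exact entry-point path on the Publisher Connector is an assumption:

```python
import requests
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.resourcesync.org/ns/}"
BASE = "https://publisher-connector.core.ac.uk/resourcesync"

def linked_capabilities(doc_url):
    """Return (capability, url) pairs found in a ResourceSync document."""
    root = ET.fromstring(requests.get(doc_url, timeout=60).content)
    pairs = []
    # Both <url> (in urlsets) and <sitemap> (in indexes) entries may carry <rs:md>.
    for entry in list(root.iter(f"{SM}url")) + list(root.iter(f"{SM}sitemap")):
        loc = entry.find(f"{SM}loc")
        md = entry.find(f"{RS}md")
        if loc is not None:
            pairs.append((md.get("capability") if md is not None else None, loc.text))
    return pairs

# Assumed entry point; the actual document name under BASE may differ.
for capability, url in linked_capabilities(f"{BASE}/sitemap.xml"):
    print(capability, url)
```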
10. Use Case 2: Exposing enriched data for Text and Data Mining (TDM) via ResourceSync
11. Use Case 2: Subscribing to ResourceSync
[Diagram: OA repositories and OA journals (mostly OAI-PMH); key publishers (OA + hybrid OA) reached through a range of bespoke APIs via the Publisher Connector, exposed over ResourceSync; + many others]
» Other aggregators can subscribe to the Publisher connector to make use of their ingestion pipelines and enrichment technologies
12. Use Case 2: Content ingestion in OpenMinTeD
[Diagram: OA repositories and OA journals (mostly OAI-PMH); key publishers (OA + hybrid OA) via bespoke APIs and the Publisher Connector over ResourceSync; OMTD-SHARE (over REST); + many others]
» CORE and OpenAIRE are content sources in the OpenMinTeD TDM platform (EU infrastructure project) being developed to enable the mining of scholarly literature.
13. Use Case 2: Exposing enriched data for TDM
[Diagram: OA repositories and OA journals (mostly OAI-PMH); key publishers (OA + hybrid OA) via bespoke APIs and the Publisher Connector over ResourceSync; the aggregated, enriched data exposed onwards over ResourceSync; + many others]
» But others want similar solutions … typically, they want to be able to sync and host the data.
14. Use Case 3: Make repositories and journals adopt ResourceSync
15. Use Case 3: Replace OAI-PMH with ResourceSync
[Diagram: OA repositories and OA journals, currently mostly OAI-PMH, moving to ResourceSync; key publishers (OA + hybrid OA) via bespoke APIs and the Publisher Connector over ResourceSync; OMTD-SHARE (over REST); + many others]
» Will be a game changer …
» Advocated by COAR Next Generation Repositories WG
17. What’s new about our implementation of ResourceSync?
»Scales to many millions of resources as required by
aggregators (as opposed to existing implementations for
repositories that are scalable for tens of thousands of
resources)
»Real-time updating of ResourceLists and ChangeLists
(avoiding unnecessary batch processes).
»Combination of real-time updates and scalability
18. Architectural choices
» Based on the principle of changes being communicated to a controller as they happen (rather than having to be detected prior to ResourceList/ChangeList updates)
» Uses Elasticsearch as a database
» Hashing mechanism to distribute the size of each ResourceList link and a clever mechanism for iterative updating of ResourceLists
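A hedged sketch of the idea behind these choices: changes are written to Elasticsearch as they are reported, and each resource is assigned to a ResourceList page by hashing its URI so the lists can be updated iteratively; the index name and page count are illustrative, not CORE's actual implementation:

```python
import hashlib
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
NUM_PAGES = 1024  # illustrative number of ResourceList pages

def page_for(uri: str) -> int:
    """Assign a resource to a ResourceList page by hashing its URI."""
    return int(hashlib.sha1(uri.encode("utf-8")).hexdigest(), 16) % NUM_PAGES

def record_change(uri: str, change: str) -> None:
    """Called as changes happen, instead of detecting them in a batch job."""
    es.index(
        index="resourcesync-resources",
        id=uri,
        document={
            "uri": uri,
            "change": change,  # "created" | "updated" | "deleted"
            "lastmod": datetime.now(timezone.utc).isoformat(),
            "page": page_for(uri),
        },
    )

record_change("https://example.org/articles/123.pdf", "updated")
```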
19. Conclusions
» ResourceSync:
› broad range of uses in scholarly communication.
› solves problems with aggregating content over OAI-PMH; faster & more efficient aggregation => fresher data in aggregators compared to OAI-PMH
» We used ResourceSync to "liberate" over 1.5 million OA papers (and growing) from key publishers
» CORE soon to provide access to over 8 million OA full texts via ResourceSync.
» CORE actively contributes to the adoption of ResourceSync in the repositories community (as part of OpenMinTeD and COAR NGR)