On Europe PubMed Central, we extract identifiers (e.g., accession numbers, data DOIs) from scientific articles. Recently, we started publishing the mined identifiers on a Linked Data platform to improve the connectivity of our mined data.
2. Contents
● Europe PubMed Central
● Linking Literature
● Mining Identifiers
● Publishing Mined Identifiers on RDF
● Web Annotation Data Model
● Use Case for Database Curation
3. Europe PubMed Central
● Europe PMC is a literature database
○ Abstracts: 30 million PubMed, Agricola and patent records, updated daily
○ Full text articles: over 3 million full text articles, of which over 900,000 are free to read and reuse, updated daily
4. Services in Europe PMC
● RESTful web service:
○ http://europepmc.org/RestfulWebService
○ Text-mined terms, metadata, full text
● ORCID article claiming tool
● Embassy Cloud for 3rd-party content providers
● BioJS literature module: http://biojs.io/d/biojs-vis-pmccitation
● RSS
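The RESTful web service above can be called over plain HTTP. The following is a minimal sketch, assuming the public search endpoint at ebi.ac.uk and its `query`/`format` parameters; check http://europepmc.org/RestfulWebService for the authoritative parameter list.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Europe PMC search module; verify against the
# RESTful web service documentation linked on the slide.
SEARCH_ENDPOINT = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_search_url(query: str, fmt: str = "json") -> str:
    """Build a search URL for the Europe PMC REST service."""
    params = urllib.parse.urlencode({"query": query, "format": fmt})
    return f"{SEARCH_ENDPOINT}?{params}"

def search(query: str) -> dict:
    """Run a search and return the decoded JSON response."""
    with urllib.request.urlopen(build_search_url(query)) as resp:
        return json.load(resp)

url = build_search_url("malaria")
print(url)
# results = search("malaria")  # requires network access
```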
5. Linking Literature
● Europe PMC provides various types of linking methods
○ By external links: to any URL (e.g., database, Wikipedia, press release, etc.)
○ By text mining
■ Biological entities
■ Identifiers (e.g., accession numbers)
○ By ORCID (article claims)
● 24 external links providers, 1 ORCID, 9 cross-reference DBs, 20 DB identifiers, 6 named entity types
6. Linking Examples
To | By | Relation | REST API
Wikipedia | Provider | Mention | labsLinks
Publons | Provider | Review | labsLinks
UniProt | Curator | Citation | databaseLinks
ORCID | Provider | Author | search
EFO | Named entity tagger | Recognition | textMinedTerms
PDB | Accession number tagger | Mention | textMinedTerms
7. Mining Identifiers in Free Text
● Motivation
○ Started for cross-linking with EBI databases
○ Data citation, impact analysis
○ Now moving to linked data
● We use patterns from identifiers.org and link back to it.
● An information extraction (IE) problem: ID matching + NER for resource names
● Some ambiguities
○ PDB: 4min (also reads as a time duration)
○ OMIM and ERC funding IDs: both 6-digit numbers
○ Resource name variations: UniProt, Swiss-Prot, etc.
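The ID-matching side of the problem can be sketched with a few regular expressions. These patterns are simplified stand-ins in the spirit of the identifiers.org registry, not the pipeline's actual rules; note how the slide's "4min" ambiguity shows up.

```python
import re

# Simplified identifier patterns; the real pipeline uses the
# identifiers.org registry's own regular expressions.
PATTERNS = {
    # PDB: one digit followed by three alphanumerics, e.g. "3NSS"
    "pdb": re.compile(r"\b[0-9][A-Za-z0-9]{3}\b"),
    # RefSNP: "rs" followed by digits, e.g. "rs334"
    "refsnp": re.compile(r"\brs\d+\b"),
    # ClinicalTrials.gov: "NCT" followed by 8 digits
    "nct": re.compile(r"\bNCT\d{8}\b"),
}

def mine_identifiers(sentence):
    """Return (database, identifier) pairs found in a sentence."""
    hits = []
    for db, pattern in PATTERNS.items():
        for match in pattern.finditer(sentence):
            hits.append((db, match.group()))
    return hits

print(mine_identifiers("The structure 3NSS and variant rs334 were studied in NCT00000102."))
# The ambiguity from the slide: "4min" matches the PDB pattern too,
# which is why NER for resource names is needed alongside ID matching.
print(mine_identifiers("samples were taken within 4min of dosing"))
```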
8. Identifiers in Literature Mentioned in Europe PMC Articles
● Databases: ENA, PDB, ArrayExpress, UniProt, RefSNP, OMIM, PFam, RefSeq, Ensembl, InterPro, BioProject, BioSample, EMDB, PXD, EGA, TreeFam
● Funding resources: European Research Council
● Ontologies: GO, UniProt, EFO, ChEBI, NCBI Taxonomy, UMLS
● Clinical trials: NCT, EudraCT
● Digital repositories (Dryad, figshare, etc.): Data DOI
9. Identifiers in Different Resources
Articles (978,605) | Patents 2014 (266,192) | Wiki pages (15,346,290)
ena/genbank/ddbj 23,295 | ena/genbank/ddbj 4,074 | pdb 4,265
pdb 15,544 | uniprot 1,387 | omim 2,226
nct 13,006 | pdb 1,093 | uniprot 1,712
refsnp 10,168 | refseq 1,002 | refseq 1,643
refseq 6,551 | refsnp 322 | ensembl 1,402
omim 5,093 | omim 254 | go 1,351
uniprot 2,865 | pfam 115 | pfam 582
go 1,900 | ensembl 97 | interpro 560
arrayexpress 1,832 | interpro 46 | ena/genbank/ddbj 396
10. Publishing Identifiers on RDF
● Goals
○ More connectivity
○ More provenance for each link
■ PMCID, sentence, section label, etc.
○ Links to share and comment on (e.g., hypothes.is)
● Challenges
○ How to model? The Web Annotation Data Model.
○ Dealing with nearly a billion annotations generated automatically at large scale
11. Web Annotation Data Model
● Built on top of RDF
● Annotations as resources
● Provides a standard description mechanism for sharing annotations between systems
● Designed for general-purpose use
○ Not only for text mining
○ For example, YouTube video comments (by people), image annotation, etc.
○ W3C Working Draft
12. Core Annotation Framework
● Typically an Annotation has a single Body, which is the comment or other descriptive resource, and a single Target that the Body is somehow "about".
● The Body provides the information which is annotating the Target.
● This "aboutness" may be further clarified or extended to notions such as classifying or identifying.
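The Body/Target pattern above can be made concrete with a minimal annotation serialized as JSON-LD. This is a sketch only: the annotation and article URLs are made-up placeholders, and the W3C context URL should be checked against the current draft.

```python
import json

# A minimal Web Annotation following the Body/Target pattern: the Body
# (a database concept) annotates the Target (an article). All URLs here
# are illustrative placeholders.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "http://example.org/annotations/1",
    "type": "Annotation",
    # The Body provides the information annotating the Target.
    "body": "http://identifiers.org/pdb/3NSS",
    # The Target is what the annotation is "about".
    "target": "http://europepmc.org/articles/PMC3382907",
}
print(json.dumps(annotation, indent=2))
```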
13.
14. Text-Mining RDF Service
● Running on EBI RDF Platform
● Stores 1,563,241,810 triples text-mined from 400,746 Open Access articles in Europe PubMed Central
● Provides
○ for each article, all the annotations linking to ontologies/databases
○ with contexts:
■ sentences
■ section information
15. Use Case for Database Curation
● Given a database identifier, provide sentence-level information for database curation.
○ Show all the articles where the PDB accession number 3NSS is mentioned.
○ Show all the annotations, each with its label, in PMC3382907.
○ Show all the articles where inflammatory bowel disease (C0021390) is mentioned.
● http://wwwdev.ebi.ac.uk/rdf/services/textmining/sparql
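The first use case above can be sketched as a SPARQL query sent to that endpoint. The property names are an assumption based on the Open Annotation vocabulary (oa:hasBody, oa:hasTarget), not the endpoint's documented schema, so treat the query shape as illustrative.

```python
import urllib.parse

# Assumed graph layout: each annotation links a Body (the database
# concept) to a Target (the article), per the Web Annotation model.
QUERY = """
PREFIX oa: <http://www.w3.org/ns/oa#>

SELECT DISTINCT ?article WHERE {
  ?annotation oa:hasBody <http://identifiers.org/pdb/3NSS> ;
              oa:hasTarget ?article .
}
"""

def sparql_url(endpoint, query):
    """Build a GET request URL for a SPARQL endpoint."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})

url = sparql_url("http://wwwdev.ebi.ac.uk/rdf/services/textmining/sparql", QUERY)
print(url)
```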
16.
17. Plans for BioHackathon 2015
● Integration with other SPARQL endpoints
● Interoperability with other formats used in the text-mining community
○ e.g., BioC, UIMA
● Produce more links on RDF
18. References
Europe PMC Consortium. Europe PMC: a full-text literature database for the life sciences and platform for innovation. Nucleic Acids Res. 2015 Jan;43(Database issue):D1042-8. doi:10.1093/nar/gku1061. PMID: 25378340; PMCID: PMC4383902.
Kafkas Ş, Kim JH, McEntyre JR. Database citation in full text biomedical articles. PLoS One. 2013;8(5):e63184. doi:10.1371/journal.pone.0063184. PMID: 23734176; PMCID: PMC3667078.
Juty N, Le Novère N, Laibe C. Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic Acids Res. 2012 Jan;40(Database issue):D580-6. doi:10.1093/nar/gkr1097. PMID: 22140103; PMCID: PMC3245029.