Recent advances in the project
EXCITE – Extraction of Citations
from PDF Documents
Philipp Mayr
GESIS – Leibniz Institute for the Social Sciences
2018-09-03, Bologna
http://excite.west.uni-koblenz.de/
#opencitations
#WOOC2018
EXCITE team
• PI: Steffen Staab (WeST), Philipp Mayr (GESIS)
• Researchers: Behnam Ghavimi, Zeyd Boukhers
• Developer: Azam Hosseini
• Collaborators: Heinrich Hartmann, Martin Körner
EXCITE: Background
• We run production search systems and conduct research in information
retrieval, recommender systems and knowledge discovery
− SSOAR https://www.gesis.org/ssoar/ (48K full texts)
− GESIS Search https://search.gesis.org/ (242K data sets +
further materials)
• National literatures are not well represented in major citation
indices (like WoS, Scopus)
• Shortage of citation data for the international and German
social sciences (Social Science Citation Index is not enough)
• Open availability of citation data is improving but still very
limited
EXCITE: Main objectives
• Develop web services to allow third-parties to
extract citation data from arbitrary publications
• Develop a toolchain of reference extraction and
matching software
• Integrate and publish the extracted citation data in
reusable formats
• Narrow the supply gap of citation data in the social
sciences
EXCITE: toolchain
(1) Extraction of text from source documents (PDFs),
(2) Identification of reference sections and other forms of embedded reference
information within the text,
(3) Segmentation of individual references into their constituent fields such as author, title,
etc.,
(4) Matching of reference strings against bibliographic databases,
(5) Export and publication of matched references to reusable formats (convert to OCC)
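The five steps above can be sketched as a chain of small functions. This is only an illustrative toy under stated assumptions, not the EXCITE implementation: all function names are hypothetical, the "PDF" is already plain text, reference identification is a simple heading heuristic, segmentation is a single regular expression, and the bibliographic database is an in-memory dictionary keyed by lower-cased title.

```python
import json
import re


def extract_text(pdf_text: str) -> str:
    # Step (1), placeholder: a real system would run a PDF-to-text tool here.
    return pdf_text


def identify_references(text: str) -> list:
    # Step (2), toy heuristic: every non-empty line after a "References" heading.
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if line.lower() == "references":
            return [l for l in lines[i + 1:] if l]
    return []


def segment_reference(ref: str) -> dict:
    # Step (3), toy pattern "Author (Year). Title." instead of a trained model.
    m = re.match(r"(?P<author>.+?) \((?P<year>\d{4})\)\. (?P<title>.+?)\.?$", ref)
    return m.groupdict() if m else {"raw": ref}


def match_reference(segments: dict, index: dict):
    # Step (4), exact lookup on the lower-cased title (hypothetical index).
    return index.get(segments.get("title", "").lower())


def run_pipeline(pdf_text: str, index: dict) -> str:
    # Step (5), export: serialise references, segments and matches as JSON.
    results = []
    for ref in identify_references(extract_text(pdf_text)):
        segments = segment_reference(ref)
        results.append({
            "raw": ref,
            "segments": segments,
            "match": match_reference(segments, index),
        })
    return json.dumps(results, indent=2)


if __name__ == "__main__":
    doc = "Body text\nReferences\nMayr, P. (2018). Recent advances in EXCITE.\n"
    index = {"recent advances in excite": "doc-1"}  # hypothetical document id
    print(run_pipeline(doc, index))
```

Keeping each step behind its own function mirrors the idea of a toolchain: any single stage can be replaced by a stronger component without touching the others.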
EXCITE: recent advances
• All components are available as reusable components, see
https://github.com/exciteproject
• EXparser – tool to extract and segment references (see the talk
by Zeyd Boukhers: “A Generic Approach for Reference
Extraction from PDF Documents” tomorrow)
• Annotators and Gold standards – tools to annotate references
and different gold standards to train and test the tools
• EXmatcher – tool to match references to bibliographic
databases built on Solr or Elasticsearch
• EXpublisher – tool to convert EXCITE data to JSON-LD
• Public demo http://excite.west.uni-koblenz.de/excite
• Extracted and matched data in production systems, e.g.
https://search.gesis.org/publication/gesis-ssoar-10004
EXAnnotators: Reference Identification
http://excite.west.uni-koblenz.de/refanno
EXAnnotators: Reference Segmentation
http://excite.west.uni-koblenz.de/seganno
EXCITE: Demo
Demo screenshots: Uploading File, Display References, Result
http://excite.west.uni-koblenz.de/excite
EXmatcher
• Input: segmented reference strings with probabilities
for each segment
• Output: matched document IDs
EXmatcher
• Hybrid approach: a combination of blocking techniques and a
classifier algorithm
• Input: strings, segments, probabilities
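A minimal sketch of how blocking and a classifier can be combined, assuming nothing about the actual EXmatcher code. All names are hypothetical. Blocking here groups database records by publication year and first-author surname, so that a reference is compared only with records in the same block. The "classifier" is a stand-in: a probability-weighted string similarity with a fixed threshold, where a trained model would normally sit.

```python
from difflib import SequenceMatcher


def block_key(fields: dict) -> tuple:
    # Blocking key: (year, first-author surname). Cheap to compute, so only
    # records sharing the key are compared in detail.
    surname = fields.get("author", "").split(",")[0].strip().lower()
    return (fields.get("year", ""), surname)


def build_blocks(database: dict) -> dict:
    # Map each blocking key to the ids of the records that share it.
    blocks = {}
    for doc_id, record in database.items():
        blocks.setdefault(block_key(record), []).append(doc_id)
    return blocks


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(reference: dict, record: dict) -> float:
    # reference maps field -> (text, probability); each field similarity is
    # weighted by the segmentation probability of that field.
    total = weight = 0.0
    for field, (text, prob) in reference.items():
        if field in record:
            total += prob * similarity(text, record[field])
            weight += prob
    return total / weight if weight else 0.0


def match(reference: dict, database: dict, blocks: dict, threshold: float = 0.8) -> list:
    # Stand-in for the classifier: accept candidates scoring above a threshold.
    plain = {field: text for field, (text, _) in reference.items()}
    candidates = blocks.get(block_key(plain), [])
    scored = sorted(((score(reference, database[d]), d) for d in candidates), reverse=True)
    return [doc_id for s, doc_id in scored if s >= threshold]


if __name__ == "__main__":
    database = {  # hypothetical records
        "doc-1": {"author": "Mayr, Philipp", "year": "2018",
                  "title": "Recent advances in the project EXCITE"},
        "doc-2": {"author": "Staab, Steffen", "year": "2018",
                  "title": "Another title"},
    }
    reference = {
        "author": ("Mayr, P.", 0.9),
        "year": ("2018", 0.99),
        "title": ("Recent advances in the project EXCITE", 0.8),
    }
    print(match(reference, database, build_blocks(database)))
```

Blocking keeps the number of pairwise comparisons small, while the scoring step decides which of the remaining candidates are accepted as matches.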
EXpublisher
• Converting extracted and matched data to the OCC ontology (incl.
EXCITE Identifier in the OCI)
• Enrichment of the reference information by external metadata
https://github.com/exciteproject/EXpublisher
EXmatcher and EXpublisher will be included in the demo soon!
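To illustrate what a matched reference could look like as JSON-LD, the sketch below builds one citation record using the CiTO vocabulary (`cito:Citation`, `cito:hasCitingEntity`, `cito:hasCitedEntity`). It is an assumption-laden example, not the actual EXpublisher output: the function name and the `example.org` IRIs are hypothetical, and the real OCC mapping and identifier scheme may differ.

```python
import json

CITO = "http://purl.org/spar/cito/"


def citation_jsonld(citation_iri: str, citing_iri: str, cited_iri: str) -> dict:
    # One citation as a JSON-LD node: a citing entity linked to a cited entity.
    return {
        "@context": {"cito": CITO},
        "@id": citation_iri,
        "@type": "cito:Citation",
        "cito:hasCitingEntity": {"@id": citing_iri},
        "cito:hasCitedEntity": {"@id": cited_iri},
    }


if __name__ == "__main__":
    record = citation_jsonld(
        "https://example.org/citation/1",   # hypothetical IRIs
        "https://example.org/paper/citing",
        "https://example.org/paper/cited",
    )
    print(json.dumps(record, indent=2))
```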
Next steps in EXCITE
• EXCITE references to be published in the OpenCitations Corpus (OCC)
• Public EXCITE API for testing (to be public soon)
• Reference Matching to Crossref to be added in the
demo/API
• Gold Standards (German/English/Reference
Section/Footnotes) to be completed
• Extraction models for German and English texts
• More Social Science data to be processed and released
• We are trying to process arXiv for the OCC
Thank you
Contact:
Dr Philipp Mayr
GESIS - Leibniz Institute for the Social Sciences, Germany
Email: philipp.mayr@gesis.org
Twitter: @philipp_mayr
• Project website
http://excite.west.uni-koblenz.de/
• EXCITE mailing list: Subscribe to our Newsletter.
• Demo http://excite.west.uni-koblenz.de/excite
• GIT https://github.com/exciteproject/
16

More Related Content

What's hot

ODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiersODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiers
datacite
 
skos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization Systemsskos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization Systems
Joachim Neubert
 
Dash UCCSC 2016
Dash UCCSC 2016Dash UCCSC 2016
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
National Information Standards Organization (NISO)
 
Meadows apr28-1
Meadows apr28-1Meadows apr28-1
ODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentresODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentres
datacite
 
Godby "'What are the 'entities that matter?' And how much should we say about...
Godby "'What are the 'entities that matter?' And how much should we say about...Godby "'What are the 'entities that matter?' And how much should we say about...
Godby "'What are the 'entities that matter?' And how much should we say about...
National Information Standards Organization (NISO)
 
Introduction to Data Journalism
Introduction to Data JournalismIntroduction to Data Journalism
Introduction to Data Journalism
Irina Radchenko
 
Federating Research Profiling Data
Federating Research Profiling DataFederating Research Profiling Data
Federating Research Profiling Data
ericmeeks
 
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
National Information Standards Organization (NISO)
 
ODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific DataODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific Data
datacite
 
What ami searching_hollis+articlestab
What ami searching_hollis+articlestabWhat ami searching_hollis+articlestab
What ami searching_hollis+articlestab
Emily Singley
 
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Digital Methods Initiative
 
Mining and Mapping the Research Landscape
Mining and Mapping the Research LandscapeMining and Mapping the Research Landscape
Mining and Mapping the Research Landscape
Simon Price
 
Introduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed PentzIntroduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed Pentz
Crossref
 
Dash: data sharing made easy
Dash: data sharing made easyDash: data sharing made easy
Dash: data sharing made easy
University of California Curation Center
 
ORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice MeadowsORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice Meadows
Crossref
 
Citations and References in DBpedia
Citations and References in DBpediaCitations and References in DBpedia
Citations and References in DBpedia
Krzysztof Wecel
 
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early AdoptersApril 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
National Information Standards Organization (NISO)
 
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
Robert H. McDonald
 

What's hot (20)

ODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiersODIN Final Event - Publishing and citing, and the role of persistent identifiers
ODIN Final Event - Publishing and citing, and the role of persistent identifiers
 
skos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization Systemsskos-history: Tracking the evolution of Knowledge Organization Systems
skos-history: Tracking the evolution of Knowledge Organization Systems
 
Dash UCCSC 2016
Dash UCCSC 2016Dash UCCSC 2016
Dash UCCSC 2016
 
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
Shieh "Enabling Descriptive Data to be Linked at the Smithsonian Libraries"
 
Meadows apr28-1
Meadows apr28-1Meadows apr28-1
Meadows apr28-1
 
ODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentresODIN Final Event - Submission to datacentres
ODIN Final Event - Submission to datacentres
 
Godby "'What are the 'entities that matter?' And how much should we say about...
Godby "'What are the 'entities that matter?' And how much should we say about...Godby "'What are the 'entities that matter?' And how much should we say about...
Godby "'What are the 'entities that matter?' And how much should we say about...
 
Introduction to Data Journalism
Introduction to Data JournalismIntroduction to Data Journalism
Introduction to Data Journalism
 
Federating Research Profiling Data
Federating Research Profiling DataFederating Research Profiling Data
Federating Research Profiling Data
 
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
Sparling and Cohen "BIBFRAME Implementation at the University of Alberta Libr...
 
ODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific DataODIN Final Event - The Care and Feeding of Scientific Data
ODIN Final Event - The Care and Feeding of Scientific Data
 
What ami searching_hollis+articlestab
What ami searching_hollis+articlestabWhat ami searching_hollis+articlestab
What ami searching_hollis+articlestab
 
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
Studying Facebook via Data Extraction: a Netvizz tutorial at the Digital Meth...
 
Mining and Mapping the Research Landscape
Mining and Mapping the Research LandscapeMining and Mapping the Research Landscape
Mining and Mapping the Research Landscape
 
Introduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed PentzIntroduction to Crossref, Seoul - Ed Pentz
Introduction to Crossref, Seoul - Ed Pentz
 
Dash: data sharing made easy
Dash: data sharing made easyDash: data sharing made easy
Dash: data sharing made easy
 
ORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice MeadowsORCID: An Overview - Alice Meadows
ORCID: An Overview - Alice Meadows
 
Citations and References in DBpedia
Citations and References in DBpediaCitations and References in DBpedia
Citations and References in DBpedia
 
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early AdoptersApril 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
April 8 NISO Webinar: Experimenting with BIBFRAME: Reports from Early Adopters
 
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
The HathiTrust Research Center: Enabling New Knowledge Through Shared Infras...
 

Similar to Recent advances in the project EXCITE – Extraction of Citations from PDF Documents

Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
Anja Jentzsch
 
Mendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and ReuseMendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and Reuse
Anita de Waard
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platform
Andrea Bollini
 
Research dissemination presentation
Research dissemination presentationResearch dissemination presentation
Research dissemination presentation
John Turner
 
2015 04-21-eexcess emtacl
2015 04-21-eexcess emtacl2015 04-21-eexcess emtacl
2015 04-21-eexcess emtacl
Tamara Pianos
 
6_ULiege_presentation.pdf
6_ULiege_presentation.pdf6_ULiege_presentation.pdf
6_ULiege_presentation.pdf
OpenAccessBelgium
 
Moving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & howMoving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & how
David T Palmer
 
Citation Management Using Mendeley Software
Citation Management  Using Mendeley SoftwareCitation Management  Using Mendeley Software
Citation Management Using Mendeley Software
Dave Marcial
 
EOSC and libraries
EOSC and librariesEOSC and libraries
EOSC and libraries
Sarah Jones
 
Focus on research workshop
Focus on research workshopFocus on research workshop
Focus on research workshop
bellalli
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
مركز البحوث الأقسام العلمية
 
Research Data Publishing
Research Data PublishingResearch Data Publishing
Research Data Publishing
Brian Hole
 
British Library
British LibraryBritish Library
British Library
clarivate
 
Open Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research LibrariesOpen Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research Libraries
Anup Kumar Das
 
MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017
Sarah Amrani
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data Commons
Anita de Waard
 
Library support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted ResearchLibrary support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted Research
Tim Wales
 
NISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to RealityNISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to Reality
National Information Standards Organization (NISO)
 
لتحليل الدراسات السابقة Nails محاضرة برنامج
  لتحليل الدراسات السابقة Nails محاضرة برنامج  لتحليل الدراسات السابقة Nails محاضرة برنامج
لتحليل الدراسات السابقة Nails محاضرة برنامج
مركز البحوث الأقسام العلمية
 
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Susanna-Assunta Sansone
 

Similar to Recent advances in the project EXCITE – Extraction of Citations from PDF Documents (20)

Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Mendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and ReuseMendeley Data: Enhancing Data Discovery, Sharing and Reuse
Mendeley Data: Enhancing Data Discovery, Sharing and Reuse
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platform
 
Research dissemination presentation
Research dissemination presentationResearch dissemination presentation
Research dissemination presentation
 
2015 04-21-eexcess emtacl
2015 04-21-eexcess emtacl2015 04-21-eexcess emtacl
2015 04-21-eexcess emtacl
 
6_ULiege_presentation.pdf
6_ULiege_presentation.pdf6_ULiege_presentation.pdf
6_ULiege_presentation.pdf
 
Moving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & howMoving from an IR to a CRIS, the why & how
Moving from an IR to a CRIS, the why & how
 
Citation Management Using Mendeley Software
Citation Management  Using Mendeley SoftwareCitation Management  Using Mendeley Software
Citation Management Using Mendeley Software
 
EOSC and libraries
EOSC and librariesEOSC and libraries
EOSC and libraries
 
Focus on research workshop
Focus on research workshopFocus on research workshop
Focus on research workshop
 
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرنمحاضرة برنامج Nails  لتحليل الدراسات السابقة د.شروق المقرن
محاضرة برنامج Nails لتحليل الدراسات السابقة د.شروق المقرن
 
Research Data Publishing
Research Data PublishingResearch Data Publishing
Research Data Publishing
 
British Library
British LibraryBritish Library
British Library
 
Open Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research LibrariesOpen Access to Scholarly Research: Implications for Research Libraries
Open Access to Scholarly Research: Implications for Research Libraries
 
MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017MyScienceWork & NFAIS - Webinar 07 11 2017
MyScienceWork & NFAIS - Webinar 07 11 2017
 
CNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data CommonsCNI 2018: A Research Object Authoring Tool for the Data Commons
CNI 2018: A Research Object Authoring Tool for the Data Commons
 
Library support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted ResearchLibrary support for the scientific publishing cycle @ Rothamsted Research
Library support for the scientific publishing cycle @ Rothamsted Research
 
NISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to RealityNISO Webinar: Library Linked Data: From Vision to Reality
NISO Webinar: Library Linked Data: From Vision to Reality
 
لتحليل الدراسات السابقة Nails محاضرة برنامج
  لتحليل الدراسات السابقة Nails محاضرة برنامج  لتحليل الدراسات السابقة Nails محاضرة برنامج
لتحليل الدراسات السابقة Nails محاضرة برنامج
 
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...Scientific Data overview of Data Descriptors - WT Data-Literature integration...
Scientific Data overview of Data Descriptors - WT Data-Literature integration...
 

More from GESIS

10th BIR Workshop @ECIR 2020: introduction
10th  BIR Workshop @ECIR 2020: introduction10th  BIR Workshop @ECIR 2020: introduction
10th BIR Workshop @ECIR 2020: introduction
GESIS
 
From closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journalsFrom closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journals
GESIS
 
Highly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over timeHighly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over time
GESIS
 
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
GESIS
 
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsBibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
GESIS
 
Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”
GESIS
 
Searching beyond datasets in the Social Sciences
Searching beyond datasets in the Social SciencesSearching beyond datasets in the Social Sciences
Searching beyond datasets in the Social Sciences
GESIS
 
Bedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der SozialwissenschaftenBedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der Sozialwissenschaften
GESIS
 
Contextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living LabContextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living Lab
GESIS
 
41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)
GESIS
 
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
GESIS
 
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
GESIS
 
Challenges in Extracting and Managing References
Challenges in Extracting and Managing ReferencesChallenges in Extracting and Managing References
Challenges in Extracting and Managing References
GESIS
 
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
GESIS
 
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...Measuring the usefulness of Knowledge Organization Systems in Information Ret...
Measuring the usefulness of Knowledge Organization Systems in Information Ret...
GESIS
 
Recent Advances in Bibliometric-Enhanced Information Retrieval
Recent Advances in Bibliometric-Enhanced Information RetrievalRecent Advances in Bibliometric-Enhanced Information Retrieval
Recent Advances in Bibliometric-Enhanced Information Retrieval
GESIS
 
Analyzing the research output presented at European Networked Knowledge Organ...
Analyzing the research output presented at European Networked Knowledge Organ...Analyzing the research output presented at European Networked Knowledge Organ...
Analyzing the research output presented at European Networked Knowledge Organ...
GESIS
 
Introduction to the 15th NKOS workshop @TPDL2016
Introduction to the 15th NKOS workshop @TPDL2016Introduction to the 15th NKOS workshop @TPDL2016
Introduction to the 15th NKOS workshop @TPDL2016
GESIS
 
Recent applications of Knowledge Organization Systems
Recent applications of Knowledge Organization SystemsRecent applications of Knowledge Organization Systems
Recent applications of Knowledge Organization Systems
GESIS
 
Using co-authorship networks for author name disambiguation
Using co-authorship networks for author name disambiguationUsing co-authorship networks for author name disambiguation
Using co-authorship networks for author name disambiguation
GESIS
 

More from GESIS (20)

10th BIR Workshop @ECIR 2020: introduction
10th  BIR Workshop @ECIR 2020: introduction10th  BIR Workshop @ECIR 2020: introduction
10th BIR Workshop @ECIR 2020: introduction
 
From closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journalsFrom closed to open access: A case study of flipped journals
From closed to open access: A case study of flipped journals
 
Highly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over timeHighly cited references in PLOS ONE and their in-text usage over time
Highly cited references in PLOS ONE and their in-text usage over time
 
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural...
 
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with BibliometricsBibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
Bibliometric-enhanced Information Retrieval: Connecting IR with Bibliometrics
 
Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”Analyzing the network structure and gender differences of the “NKOS community”
Analyzing the network structure and gender differences of the “NKOS community”
 
Searching beyond datasets in the Social Sciences
Searching beyond datasets in the Social SciencesSearching beyond datasets in the Social Sciences
Searching beyond datasets in the Social Sciences
 
Bedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der SozialwissenschaftenBedeutung von Text Mining am Beispiel der Sozialwissenschaften
Bedeutung von Text Mining am Beispiel der Sozialwissenschaften
 
Contextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living LabContextualised Browsing in a Digital Library’s Living Lab
Contextualised Browsing in a Digital Library’s Living Lab
 
41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)41st European Conference on Information Retrieval (ECIR 2019)
41st European Conference on Information Retrieval (ECIR 2019)
 
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
Offenes kollaboratives Schreiben: Eine „Open Science“-Infrastruktur am Beispi...
 
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
A Complete Year of User Retrieval Sessions in a Social Sciences Academic Sear...
 
Challenges in Extracting and Managing References
Challenges in Extracting and Managing ReferencesChallenges in Extracting and Managing References
Challenges in Extracting and Managing References
 
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
Opening Scholarly Communication in Social Sciences by Connecting Collaborativ...
 
Recent advances in the project EXCITE – Extraction of Citations from PDF Documents

  • 1. Recent advances in the project EXCITE – Extraction of Citations from PDF Documents
      Philipp Mayr, GESIS – Leibniz Institute for the Social Sciences
      2018-09-03, Bologna
      http://excite.west.uni-koblenz.de/
      #opencitations #WOOC2018
  • 2. EXCITE team
      • PI: Steffen Staab (WeST), Philipp Mayr (GESIS)
      • Researchers: Behnam Ghavimi, Zeyd Boukhers
      • Developer: Azam Hosseini
      • Collaborators: Heinrich Hartmann, Martin Körner
  • 3. EXCITE: Background
      • We run production search systems and conduct research in information retrieval, recommendation systems and knowledge discovery
          − SSOAR https://www.gesis.org/ssoar/ (48K full texts)
          − GESIS Search https://search.gesis.org/ (242K data sets + further materials)
      • National literatures are not well represented in major citation indices (like WoS, Scopus)
      • Shortage of citation data for the international and German social sciences (Social Science Citation Index is not enough)
      • Open availability of citation data is improving but is still very limited
  • 4. EXCITE: Main objectives
      • Develop web services to allow third parties to extract citation data from arbitrary publications
      • Develop a toolchain of reference extraction and matching software
      • Integrate and publish the extracted citation data in reusable formats
      • Narrow the supply gap of citation data in the social sciences
  • 5. EXCITE: toolchain
      (1) Extraction of text from source documents (PDFs),
      (2) Identification of reference sections and other forms of embedded reference information within the text,
      (3) Segmentation of individual references into their constituent fields such as author, title, etc.,
      (4) Matching of reference strings against bibliographic databases,
      (5) Export and publication of matched references to reusable formats (convert to OCC)
      [Figure label: Training data]
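Steps (2) and (3) of the toolchain can be illustrated with a toy sketch. This is not the EXparser implementation: the heading list, the `(Year)` pattern and the names `find_reference_section`, `segment` and `run_toolchain` are all assumptions made for illustration, and the sample reference is invented. A simple regular-expression heuristic stands in for whatever models the real tools use.

```python
import re

# Hypothetical heading names that mark the start of a reference section.
HEADINGS = {"references", "bibliography", "literatur"}
# Assumes an APA-like "Authors (Year). Title. Rest" layout.
YEAR = re.compile(r"\((\d{4})\)")


def find_reference_section(text):
    """Step 2 (toy heuristic): return the non-empty lines after a heading."""
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if line.lower() in HEADINGS:
            return [l for l in lines[i + 1:] if l]
    return []


def segment(raw):
    """Step 3 (toy heuristic): split one reference string into fields."""
    m = YEAR.search(raw)
    if not m:
        return {"raw": raw}  # leave unparsed strings untouched
    author = raw[:m.start()].strip()
    rest = raw[m.end():].lstrip(". ").strip()
    title = rest.split(". ")[0].rstrip(".")
    return {"raw": raw, "author": author, "year": m.group(1), "title": title}


def run_toolchain(text):
    """Chain steps 2 and 3 over already extracted plain text (step 1)."""
    return [segment(ref) for ref in find_reference_section(text)]


sample = "Some body text\nReferences\nDoe, J. (2001). An example title. Example Press.\n"
parsed = run_toolchain(sample)
print(parsed[0]["author"], parsed[0]["year"], parsed[0]["title"])
```

Real reference strings vary far more than this pattern allows, which is why the slides mention gold standards and training data for the actual tools.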
  • 6. EXCITE: recent advances
      • All components are available as reusable components, see https://github.com/exciteproject
      • EXparser – tool to extract and segment references (see talk by Zeyd Boukhers: “A Generic Approach for Reference Extraction from PDF Documents” tomorrow)
      • Annotators and Gold standards – tools to annotate references and different gold standards to train and test the tools
      • EXmatcher – tool to match references against bibliographic databases based on Solr or Elasticsearch
      • EXpublisher – tool to convert EXCITE data to JSON-LD
      • Public demo http://excite.west.uni-koblenz.de/excite
      • Extracted and matched data in production systems, e.g. https://search.gesis.org/publication/gesis-ssoar-10004
  • 10. EXCITE: Demo
      [Screenshots: Uploading File, Display References, Result]
      http://excite.west.uni-koblenz.de/excite
  • 11. EXmatcher
      • Input: segmented reference strings with probabilities for each segment
      • Output: matched document IDs
  • 12. EXmatcher
      • Hybrid approach: combination of blocking techniques and a classifier algorithm
      • Input: strings, segments, probabilities
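The hybrid idea of blocking followed by a classifier can be sketched as follows. This is a minimal illustration under stated assumptions, not EXmatcher's code: the blocking key (year plus first author surname), the probability-weighted Jaccard score and the fixed threshold that stands in for a trained classifier are all hypothetical choices, as are the record layout and the sample data.

```python
from collections import defaultdict


def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Token overlap between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def blocking_key(year, author):
    # Assumed key: publication year plus lowercased first surname.
    return (year, author.split(",")[0].strip().lower())


def build_index(records):
    """Group database records into blocks so only a few need comparing."""
    index = defaultdict(list)
    for rec in records:
        index[blocking_key(rec["year"], rec["author"])].append(rec)
    return index


def match(reference, index, threshold=0.5):
    """reference maps each field name to a (value, probability) pair."""
    year, _ = reference["year"]
    author, _ = reference["author"]
    title, p_title = reference["title"]
    matched = []
    for rec in index.get(blocking_key(year, author), []):
        # Stand-in for a classifier: similarity weighted by segment confidence.
        score = p_title * jaccard(title, rec["title"])
        if score >= threshold:
            matched.append(rec["id"])
    return matched


records = [
    {"id": "doc-1", "author": "Doe, J.", "year": "2001", "title": "An example title on citation data"},
    {"id": "doc-2", "author": "Doe, J.", "year": "2001", "title": "Something unrelated entirely"},
    {"id": "doc-3", "author": "Roe, A.", "year": "2001", "title": "An example title on citation data"},
]
reference = {
    "author": ("Doe, J", 0.9),
    "year": ("2001", 0.95),
    "title": ("An example title on citation data", 0.9),
}
result = match(reference, build_index(records))
print(result)
```

Blocking keeps the comparison cheap: `doc-3` is never scored because it falls into a different block, even though its title is identical.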
  • 13. EXpublisher
      • Converting extracted and matched data to the OCC ontology (incl. EXCITE Identifier in the OCI)
      • Enrichment of the reference information by external metadata
      https://github.com/exciteproject/EXpublisher
      EXmatcher and EXpublisher will be included in the demo soon!
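A conversion of matched references to JSON-LD might look roughly like the sketch below. It deliberately uses schema.org terms (`ScholarlyArticle`, `citation`) as a stand-in vocabulary, not the OCC ontology that EXpublisher targets, and the function name `to_jsonld` and the `urn:example:` identifiers are hypothetical.

```python
import json


def to_jsonld(citing_id, matched_ids):
    """Describe one citing document and the documents it cites."""
    return {
        "@context": "https://schema.org",  # assumed vocabulary, not OCC
        "@id": citing_id,
        "@type": "ScholarlyArticle",
        "citation": [{"@id": cited} for cited in matched_ids],
    }


doc = to_jsonld("urn:example:citing-1", ["urn:example:doc-1", "urn:example:doc-2"])
serialized = json.dumps(doc, indent=2)
print(serialized)
```

Each citing document becomes one JSON-LD node whose `citation` list points at the matched document identifiers, so the output can be loaded by any JSON-LD aware tool.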
  • 14. Next steps in EXCITE
      • EXCITE references to be published in the OpenCitations Corpus (OCC)
      • Public EXCITE API for testing (to be public soon)
      • Reference matching to Crossref to be added in the demo/API
      • Gold standards (German/English/Reference Section/Footnotes) to be completed
      • Extraction models for German and English texts
      • More social science data to be processed and released
      • We are trying to process arXiv for the OCC
  • 15. Thank you
      Contact: Dr Philipp Mayr, GESIS - Leibniz Institute for the Social Sciences, Germany
      Email: philipp.mayr@gesis.org
      Twitter: @philipp_mayr
      • Project website http://excite.west.uni-koblenz.de/
      • EXCITE mailing list: Subscribe to our Newsletter.
      • Demo http://excite.west.uni-koblenz.de/excite
      • GIT https://github.com/exciteproject/