Matching and merging anonymous terms from web sources (IJwest)
This paper describes a workflow of simplifying and matching special language terms in RDF generated…
Navigating science using citation networks (Carl Bergstrom)
Carl Bergstrom's presentation from Microsoft's Big Scholarly Data workshop held in Redmond on 7/10/2015. The presentation argues that the scholarly community is far too dependent on Google Scholar for the vital task of searching the scientific literature. Google Scholar uses an unknown algorithm on an unknown corpus and allows neither extension nor community development. Moreover, it may not persist forever. As a step toward a solution, I propose that the scholarly community take charge of its output. Creating an open repository of full text for all of scientific publication faces copyright issues that will be intractable in the short term, but creating an open citation graph for all of science is feasible today. At the Eigenfactor Project, we specialize in using citation networks to reveal the scientific landscape. I provide a series of quick illustrations of the kinds of things we can do with large-scale citation data.
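To make "using citation networks to reveal the scientific landscape" concrete, here is a minimal sketch of ranking papers in a toy citation graph with an eigenvector-style score (PageRank, a close relative of the Eigenfactor metric). The paper IDs and edges are invented for illustration.

```python
# A minimal sketch of ranking papers in a citation graph with an
# eigenvector-style metric (PageRank). Paper IDs and edges are hypothetical.
import networkx as nx

# Directed edge (a, b) means: paper a cites paper b.
citations = [
    ("paper_A", "paper_C"), ("paper_B", "paper_C"),
    ("paper_C", "paper_D"), ("paper_B", "paper_D"),
    ("paper_A", "paper_D"),
]

G = nx.DiGraph(citations)

# PageRank follows edge direction, so a paper's score grows with
# incoming citations, which is what we want here.
scores = nx.pagerank(G, alpha=0.85)

for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{paper}: {score:.3f}")
```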
Gergely Palla - Extracting tag hierarchies (knowescape2013)
The document discusses extracting hierarchies from tags. It introduces tags and tagging on blogs and news portals. The goal is to extract tag hierarchies to help with searching and recommendations. Several tag hierarchy extraction methods are mentioned. Benchmarks using Gene Ontology and synthetic hierarchies are used to test the methods. Quality is measured by comparing the reconstructed hierarchy to the original and calculating correctly identified, acceptable, and missing links.
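As an illustration of the link-level evaluation described above, the following sketch scores a reconstructed hierarchy against a gold one. Treating a predicted grandparent as an "acceptable" link is an assumption made here for the example, not necessarily the paper's exact criterion.

```python
# A sketch of link-level evaluation for a reconstructed tag hierarchy.
# Each hierarchy maps a tag to its parent (the root maps to None).
# Counting a predicted parent as "acceptable" when it is the true
# grandparent is an assumption for illustration only.

gold = {"jazz": "music", "rock": "music", "music": "culture", "culture": None}
pred = {"jazz": "music", "rock": "culture", "music": "culture", "culture": None}

exact = acceptable = missing = 0
for tag, true_parent in gold.items():
    if true_parent is None:
        continue  # skip the root: it has no parent link to evaluate
    p = pred.get(tag)
    if p == true_parent:
        exact += 1
    elif p is not None and p == gold.get(true_parent):
        acceptable += 1  # predicted the grandparent instead of the parent
    else:
        missing += 1

print(f"exact={exact} acceptable={acceptable} missing={missing}")
```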
Image to Text Converter PPT. The PPT contains step-by-step algorithms/methods for converting images into text. It especially covers algorithms for images that contain human handwriting, converting the writing into text (image to text).
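As a baseline for the image-to-text task, a minimal OCR call using the open source Tesseract engine via pytesseract might look as follows. Note that stock Tesseract handles printed text far better than handwriting; the handwriting-specific methods from the slides are not reproduced here, and the input file name is hypothetical.

```python
# A minimal image-to-text example using the Tesseract OCR engine via
# pytesseract. "page.png" is a hypothetical input file.
from PIL import Image
import pytesseract

image = Image.open("page.png")              # load the scanned page
text = pytesseract.image_to_string(image)   # run OCR
print(text)
```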
Extracting Data from Historical Documents: Crowdsourcing Annotations on Wikis... (Gaurav Vaidya)
This document discusses extracting data from historical documents through crowdsourcing annotations on Wikisource. It describes a project to digitize the field notebooks of Junius Henderson, an early 20th-century curator at the University of Colorado Museum of Natural History. Volunteers helped transcribe and annotate the notebooks by adding images, text, and metadata templates on Wikisource. This allowed the data within the notebooks to be extracted and linked to other online resources. The project demonstrates how small incremental steps over many years, like the initial scanning and transcription efforts, enabled the fully digitized, annotated, and data-linked version of the notebooks.
Big data deep learning: applications and challenges (fazail amin)
This document discusses big data, deep learning, and their applications and challenges. It begins with an introduction to big data that defines it in terms of large volume, high velocity, and variety of data types. It then discusses challenges of big data like storage, transfer, privacy, and analyzing diverse data types. Applications of big data analytics include sensor data analysis, trend analysis, and network intrusion detection. Deep learning algorithms can extract patterns from large unlabeled data and non-local relationships. Applications of deep learning in big data include semantic indexing for search engines, discriminative tasks using extracted features, and transfer learning. Challenges of deep learning in big data include learning from streaming data, high dimensionality, scalability, and distributed computing.
Data Science: Make Smarter Business Decisions (Edureka!)
Data Science training certifies you in in-demand Big Data technologies, helping you land a top-paying Data Science role with skills in R programming, Machine Learning, and the Hadoop framework. A Data Scientist deals with all phases of the data life cycle, from data acquisition and data storage using R-Hadoop concepts, through modelling in R with machine learning algorithms, to impeccable data visualization that leverages R's capabilities.
This slide deck comes from a seminar, "Machine Learning for Data Mining," arranged at Daffodil International University. The chief guest was Dr. Dewan Md. Farid, who made this slide deck to introduce us to data mining and also shared his research experience, which was amazing. His talk was wholly unpredictable, and he is a well-known researcher. I hope you will enjoy these slides. Details about Dr. Dewan Md. Farid are given at the link below:
https://ai.vub.ac.be/members/dewan-md-farid
Structured and Unstructured: Extracting Information From Classics Scholarly Texts (Matteo Romanello)
1) The document describes a project to develop an automatic system to extract semantic information from unstructured scholarly texts in classics, focusing on named entities and references.
2) The goal is to build knowledge bases integrating information from multiple sources to improve information retrieval over a classics corpus.
3) The project involves building corpora from online archives, processing texts to extract entities and references, and developing techniques to recognize canonical and bibliographic references.
Structured vs Unstructured: Extracting Information from Classics Scholarly Texts (Matteo Romanello)
This document outlines a project to develop tools to extract information from classics scholarly texts. It aims to improve information retrieval for classics researchers by automatically identifying mentions of realia (people, places, sources) and extracting canonical references to primary sources from unstructured texts. The methodology involves building corpora of classics articles, creating a knowledge base from existing structured classics data sources, and developing natural language processing tools trained on the knowledge base to extract entities and references from the text corpora. The expected results are improved access points to information for researchers through enriched full-text search and links to relevant primary sources.
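As a generic illustration of the entity-extraction step in such a pipeline, here is a minimal sketch with an off-the-shelf spaCy model. The project itself trains domain-specific tools on a classics knowledge base, which this stand-in does not replicate.

```python
# A generic illustration of the entity-extraction step using an
# off-the-shelf spaCy model; this only shows the shape of the pipeline,
# not the project's custom classics-trained system.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
doc = nlp("Wilamowitz discussed the proem of the Iliad in his 1884 study of Homer.")

for ent in doc.ents:
    print(ent.text, ent.label_)
```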
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas (Angelo Salatino)
Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 15K topics and 70K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO we have developed the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO at different levels. Users can use the portal to rate topics and relationships, suggest missing relationships, and visualise sections of the ontology. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various communities engaged with scholarly data.
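A sketch of how one might traverse a downloaded copy of CSO with rdflib is shown below. The schema namespace, topic IRI shape, and file name are assumptions to verify against the release you obtain from the CSO Portal.

```python
# A sketch of walking sub-topic links in a local copy of CSO using
# rdflib. The file name, schema namespace, and topic IRI shape below
# are assumptions; check them against the downloaded release.
from rdflib import Graph, Namespace, URIRef

CSO = Namespace("http://cso.kmi.open.ac.uk/schema/cso#")  # assumed namespace
g = Graph()
g.parse("CSO.ttl", format="turtle")  # hypothetical local copy of the ontology

topic = URIRef("https://cso.kmi.open.ac.uk/topics/machine_learning")  # assumed IRI
for sub in g.objects(topic, CSO.superTopicOf):
    print(sub)  # direct sub-topics of the chosen topic
```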
Scholarly citations from one publication to another, expressed as reference lists within academic articles, are core elements of scholarly communication. Unfortunately, they usually can be accessed en masse only by paying significant subscription fees to commercial organizations, while the few services that do make them available for free impose strict limitations on their reuse. In this paper we provide an overview of the OpenCitations Project (http://opencitations.net), undertaken to remedy this situation, and of its main product, the OpenCitations Corpus, which is an open repository of accurate bibliographic citation data harvested from the scholarly literature, made available in RDF under a Creative Commons public domain dedication.
Paper at: https://w3id.org/oc/paper/occ-lisc2016.html
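A minimal sketch of querying such citation data over SPARQL is shown below, assuming the public OpenCitations endpoint and the CiTO property cito:cites used by the corpus; check the current OpenCitations documentation for the exact endpoint and data model.

```python
# A sketch of querying the OpenCitations SPARQL endpoint for citation
# links expressed with cito:cites. Endpoint location and result shape
# should be verified against the current OpenCitations documentation.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://opencitations.net/sparql")
sparql.setQuery("""
    PREFIX cito: <http://purl.org/spar/cito/>
    SELECT ?citing ?cited
    WHERE { ?citing cito:cites ?cited . }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["citing"]["value"], "->", row["cited"]["value"])
```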
The common use by archaeologists of ubiquitous technologies such as computers and digital cameras means that archaeological research projects now produce huge amounts of diverse, digital documentation. However, while the technology is available to collect this documentation, we still largely lack community-accepted dissemination channels appropriate for such torrents of data. Open Context (http://www.opencontext.org) aims to help fill this gap by providing open access data publication services for archaeology. Open Context has a flexible and generalized technical architecture that can accommodate most archaeological datasets, despite the lack of common recording systems or other documentation standards. Open Context includes a variety of tools to make data dissemination easier and more worthwhile. Authorship is clearly identified through citation tools, a web-based publication system enables individuals to upload their own data for review, and collaboration is facilitated through easy download and other features. While we have demonstrated a potentially valuable approach for data sharing, we face significant challenges in scaling Open Context up to serve large quantities of data from multiple projects.
Global Library of Life: The Biodiversity Heritage Library (Martin Kalfatovic)
Global Library of Life: The Biodiversity Heritage Library. Martin R. Kalfatovic. Boston Library Consortium Meeting. Boston Public Library. 18 March 2008. Boston, MA.
Library Catalogues: from Traditional to Next-Generation (KC Tan)
Presented at a lecture on 13 Sep 2007 for CS3255 Information Organization, for 3rd-year IS students of the School of Computing, National University of Singapore.
Open Annotation Collaboration Introduction (Timothy Cole)
The Open Annotation Collaboration aims to develop a shared, interoperable data model for scholarly annotation. Phase I of the project created the OAC data model and integrated annotation tools. Phase II will deploy the model through demonstration projects to test its capabilities for annotating a variety of scholarly resources and use cases. The goal is to facilitate widespread adoption of interoperable annotation across different domains.
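For a feel of what interoperable annotation data looks like, here is a minimal annotation in the JSON-LD shape of the W3C Web Annotation Data Model, into which the OAC work eventually fed; the body text and target URL are invented.

```python
# A minimal annotation in the JSON-LD shape of the W3C Web Annotation
# Data Model. The body text and target URL are hypothetical; the
# @context URL is the W3C-published one.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {
        "type": "TextualBody",
        "value": "This passage alludes to Iliad 1.1.",
        "format": "text/plain",
    },
    "target": "http://example.org/editions/commentary/page3",  # hypothetical
}

print(json.dumps(annotation, indent=2))
```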
Ontologies and thesauri. How to answer complex questions using interoperability? (Equipex Biblissima)
Presentation on ontologies and thesauri given as part of the COST-IRHT Training School "La transmission des textes : nouveaux outils, nouvelles approches" (Paris), by Stefanie Gehrke.
Annotated Bibliographical Reference Corpora In Digital Humanities (Faith Brown)
This document describes the creation of new bibliographical reference corpora in digital humanities. It presents three corpora constructed of references from Revues.org, a French online journal platform. The first corpus contains references in bibliographies and is relatively simple. The second contains references in footnotes, which are less structured. The third contains partial references in article bodies, which are the most difficult to annotate. The corpora were manually annotated using TEI tags to label fields like authors, titles, dates. This will provide a valuable resource for natural language processing research on bibliographical reference annotation.
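A small sketch of what such TEI-tagged references can look like, and how to read the fields back out, follows. The reference is invented, and the element choices follow common TEI practice rather than the corpora's exact tagging guidelines.

```python
# A sketch of TEI markup for a bibliographical reference, parsed with
# the standard library. The reference is invented; bibl/author/title/date
# follow common TEI practice, not necessarily the corpora's guidelines.
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
sample = """
<bibl xmlns="http://www.tei-c.org/ns/1.0">
  <author>Mauss, Marcel</author>,
  <title level="m">Essai sur le don</title>,
  <date>1925</date>.
</bibl>
"""

bibl = ET.fromstring(sample)
for field in ("author", "title", "date"):
    el = bibl.find(TEI_NS + field)
    print(field, "->", el.text)
```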
Next Generation Catalogs: Extensible Catalog, David Lindahl (youthelectronix)
On Wednesday, November 7th, 2007, David Lindahl from the University of Rochester discussed his work on the eXtensible Catalog project as part of a program on Next Generation Library Catalogs held at the University of Massachusetts at Amherst and sponsored by the Five Colleges' Librarians Council and Simmons College Graduate School of Library & Information Science (GSLIS). More information is available here: http://www.smith.edu/libraries/staff/fivecoll/nextgen.htm
This document discusses a research project on early modern professorial career patterns that analyzes databases of academic histories. It proposes a methodology using the Heloise Common Research Model, which takes a service-based, layered approach to applying knowledge bases. A key part of the methodology is developing a domain-specific research ontology to model relevant concepts from the databases. Future work includes simplifying exploration of databases by aligning them to publishing standards, and documenting the research process using tools and infrastructures.
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage (Ontotext)
Many issues are faced by scholars, book researchers, and museum directors who try to find the underlying connections between resources. Scholars in particular continuously emphasize the role of digital humanities and the value of linked data in cultural heritage information systems.
The document discusses the evolution of library catalogs from traditional to next-generation systems. Traditional library catalogs were based on standards like MARC and had limitations like complex interfaces and inability to deliver online content. Next-generation catalogs address these through features like federated search across resources, enriched content with metadata, faceted navigation, and user contributions through tagging and reviews. They integrate these features through a unified discovery interface that provides simplified access to library resources and services.
One year ago we started ingesting citation data from the Open Access literature into the OpenCitations Corpus (OCC), creating an RDF dataset of scholarly citation data that is open to all. In this presentation we introduce the OCC and we discuss its outcomes and uses after the first year of life.
The document discusses developing ontologies, including:
1. What an ontology is and different types of ontologies such as taxonomies, thesauri, and reference models.
2. Representing ontologies using knowledge representation formalisms that have evolved from semantic networks to description logics.
3. The Semantic Web ontology language OWL, which extends RDFS and is divided into three species with different levels of expressivity.
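As a minimal illustration of points 2 and 3 above, the sketch below builds a tiny ontology fragment with rdflib, declaring OWL classes and an RDFS subclass link; the example classes are invented.

```python
# A tiny sketch of expressing an ontology fragment in RDFS/OWL with
# rdflib: a taxonomy-style class hierarchy. The classes are invented.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/onto#")
g = Graph()

g.add((EX.Document, RDF.type, OWL.Class))
g.add((EX.Thesaurus, RDF.type, OWL.Class))
g.add((EX.Thesaurus, RDFS.subClassOf, EX.Document))  # taxonomy-style link

print(g.serialize(format="turtle"))
```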
The document discusses OntoMaven repositories and the OMG API4KP standard. It describes two scenarios using distributed knowledge platforms: a connected patient system and semantic annotation of biodiversity data. The API4KP metamodel defines classes for knowledge sources, environments, operations and events. Distributed architecture styles for API4KP include direct access, remote invocation, and decoupled event-based systems. OntoMaven supports knowledge repositories with plug-ins for distributed scenarios.
The document provides an overview of the OAI-ORE (Object Reuse and Exchange) project, which aims to develop standards and protocols to facilitate discovery, linking, and reuse of compound digital objects across repositories. OAI-ORE complements OAI-PMH by allowing for richer descriptions of compound objects and relationships. Key concepts include modeling compound objects, their components and views as resources with defined boundaries and relationships. The OAI-ORE technical committee is defining use cases and a preliminary data model to represent these concepts and enable services like harvesting and obtaining compound object descriptions.
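A minimal sketch of the ORE pattern (a resource map describing an aggregation of component resources) might look as follows; the URIs are hypothetical, while the vocabulary terms come from the published ORE namespace.

```python
# A sketch of an OAI-ORE resource map: the map describes an aggregation,
# which aggregates the components of a compound object. URIs are
# hypothetical; the terms come from the published ORE namespace.
from rdflib import Graph, Namespace, RDF, URIRef

ORE = Namespace("http://www.openarchives.org/ore/terms/")
g = Graph()

rem = URIRef("http://example.org/rem/article-42")   # resource map (hypothetical)
agg = URIRef("http://example.org/agg/article-42")   # aggregation (hypothetical)

g.add((rem, RDF.type, ORE.ResourceMap))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((rem, ORE.describes, agg))
g.add((agg, ORE.aggregates, URIRef("http://example.org/article-42.pdf")))
g.add((agg, ORE.aggregates, URIRef("http://example.org/article-42-dataset.csv")))

print(g.serialize(format="turtle"))
```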
Towards the Automatic Retrieval of Cited Parallel Passages from Secondary Lit... (Matteo Romanello)
My presentation at the "Classical Philology Goes Digital" workshop in Potsdam (16-17 February 2017), www.dh.uni-leipzig.de/wo/events/global-philology-open-conference/
Scaling up the Extraction of Canonical Citations in Classics (Matteo Romanello)
The document discusses extracting canonical citations from classical texts at scale. It begins by explaining the importance of references in classics scholarship and trends toward enhanced reading. An approach is presented that uses named entity recognition, relation extraction, and disambiguation to extract citation components and assign identifiers. The extraction pipeline is evaluated on data from L'Année philologique, achieving a high F1 score. Overall, the approach aims to scale the extraction of citations to enable applications like search and network analysis over large corpora.
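The kind of evaluation mentioned (an F1 score against gold data) reduces to the standard precision/recall computation sketched below; the citation strings are invented stand-ins for structured citation objects.

```python
# A sketch of scoring extracted citations against a gold standard with
# precision, recall, and F1. The citation strings are invented.

gold = {"Hom. Il. 1.1", "Verg. Aen. 6.126", "Soph. Ant. 332"}
extracted = {"Hom. Il. 1.1", "Verg. Aen. 6.126", "Hdt. 1.1"}

tp = len(gold & extracted)              # correctly extracted citations
precision = tp / len(extracted)
recall = tp / len(gold)
f1 = 2 * precision * recall / (precision + recall)

print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```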
Transforming Indexes Locorum into Citation Networks (Matteo Romanello)
1) The document describes how indexes locorum can be transformed into citation networks extracted from classical texts.
2) Data such as L'Année philologique are processed to extract named entities and citation relations.
3) The extracted citations are used to build citation networks at the macro, meso, and micro levels, which provide different perspectives on intertextuality.
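A toy sketch of building networks at two of those levels from index-style records follows; the records are invented.

```python
# A sketch of building citation networks at two levels: micro
# (article -> specific passage) and macro (article -> ancient author),
# from invented index-style records.
import networkx as nx

# (citing article, ancient author, work, passage) - hypothetical records
entries = [
    ("article_1", "Homer", "Iliad", "1.1"),
    ("article_1", "Homer", "Iliad", "2.204"),
    ("article_2", "Homer", "Iliad", "1.1"),
    ("article_2", "Sophocles", "Antigone", "332"),
]

micro = nx.DiGraph()  # article -> specific passage
macro = nx.DiGraph()  # article -> ancient author
for article, author, work, passage in entries:
    micro.add_edge(article, f"{author}, {work} {passage}")
    macro.add_edge(article, author)

print("micro edges:", micro.number_of_edges())  # 4 distinct passage links
print("macro edges:", macro.number_of_edges())  # 3 distinct author links
```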
Enhancing and Extending the Digital Study of Intertextuality (pt. 2): Reveali... (Matteo Romanello)
Slides of my presentation at the Digital Classics Association's panel titled *Making Meaning from Data* at the 146th Annual Meeting of the Society for Classical Studies (formerly the American Philological Association).
1) The document discusses text reuse in the context of the digital humanities, defining it as the meaningful reiteration of text beyond the repetition of common language.
2) The panel organizers aim to share knowledge, discuss approaches, and foster future collaborative research on the topic.
3) Topics covered include the definition and types of text reuse, infrastructure for text reuse, and user engagement.
Exploring Citation Networks to Study Intertextuality in Classics (Matteo Romanello)
Referring is such an essential part of scholarly activity across disciplines that it has been regarded by John Unsworth (2000) as one of the scholarly primitives. There is, however, a kind of citation whose potential has not been fully exploited to date, despite the attention it has recently received within Digital Classics research (Romanello, Boschetti, and Crane 2009; Smith 2010; Romanello 2011). These "canonical citations" are the references commonly used to refer to passages of ancient texts. Given their importance to classicists, Crane et al. (2009) have argued, services for extracting and exploiting them should be part of the Cyberinfrastructure for Classics.
In this paper I discuss the various aspects of making such citations–together with the network of links they create–computable. Firstly, I will present the characteristics of such citations by showing how their semantics can be modeled by means of a formal ontology. Once such an ontology is created and populated, it can be used by a machine as a surrogate for domain knowledge in order to make inferences about texts and citations.
Secondly, I will illustrate how an expert system that captures canonical citations and their meaning from modern journal papers can be implemented by using Natural Language Processing techniques that are well known in Computer Science. I will then present two resources that were developed for this task and made available under Open Source licenses: 1) a manually corrected, multilingual corpus of approximately 30,000 tokens drawn from L’Année Philologique with annotated Named Entities; 2) a machine learning-based classifier that can be trained with this corpus to extract from texts canonical citations and mentions of ancient authors and works.
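A compact sketch of such a trainable, sequence-labelling extractor, using the sklearn-crfsuite library with invented training sentences and IOB-style labels, might look like the following; it is an illustrative stand-in, not the project's released classifier or corpus.

```python
# A compact sketch of training a CRF tagger for canonical citations
# with sklearn-crfsuite. The two training "sentences", features, and
# IOB-style labels are invented stand-ins for the annotated corpus.
import sklearn_crfsuite

def features(tokens, i):
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_dot": "." in tok,
    }

sents = [["See", "Hom.", "Il.", "1.1", "for", "the", "proem", "."],
         ["Cf.", "Verg.", "Aen.", "6.126", "."]]
labels = [["O", "B-REF", "I-REF", "I-REF", "O", "O", "O", "O"],
          ["O", "B-REF", "I-REF", "I-REF", "O"]]

X = [[features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # predictions on the (tiny) training data itself
```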
Finally, I will show some examples of how the citation network so extracted–consisting of journal papers and the ancient texts they refer to–can be exploited to offer scholars new ways and tools for studying intertextuality.
References
Crane, Gregory, Brent Seales, and Melissa Terras. 2009. “Cyberinfrastructure for Classical Philology.” Digital Humanities Quarterly 3.
Romanello, Matteo. 2011. “New Value-Added Services for Electronic Journals in Classics.” JLIS.it 2. doi:10.4403/jlis.it-4603.
Romanello, Matteo, Federico Boschetti, and Gregory Crane. 2009. “Citations in the digital library of classics: extracting canonical references by using conditional random fields.” In , 80–87. Morristown, NJ, USA: Association for Computational Linguistics.
Smith, Neel. 2010. “Digital Infrastructure and the Homer Multitext Project.” In Digital Research in the Study of Classical Antiquity, ed. Gabriel Bodard and Simon Mahony, 121–137. Burlington, VT: Ashgate Publishing.
Unsworth, John. 2000. “Scholarly Primitives: what methods do humanities researchers have in common, and how might our tools reflect this?.” http://www3.isrl.illinois.edu/~unsworth/Kings.5-00/primitives.html.
DARIAH Geo-browser: Exploring Data through Time and Space (Matteo Romanello)
This document discusses the DARIAH Geo-browser tool, which allows users to explore datasets with both temporal and spatial information by visualizing the data on a map. The tool is suitable for exploratory research to help users visualize patterns within their data. It can import data in KML format that contains time and place information. As an example, the document demonstrates how the tool can be used to explore publications related to places along the Roman Limes by mapping the publications to the locations and dates. The key benefits highlighted are the ability to conduct exploratory research through data visualization, and the interoperability of the tool through its use of APIs, common identifiers, and ability to import/export different data formats.
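A sketch of generating the kind of KML (time plus place) such a tool can import follows; the publication record is invented, while Placemark, TimeStamp, and Point are standard KML elements.

```python
# A sketch of producing KML with time and place information of the kind
# the Geo-browser can import. The record below is invented.
records = [
    # (title, year, longitude, latitude) - hypothetical
    ("Excavations at Vindolanda", "1995", -2.36, 54.99),
]

placemarks = "".join(
    f"""
  <Placemark>
    <name>{title}</name>
    <TimeStamp><when>{year}</when></TimeStamp>
    <Point><coordinates>{lon},{lat}</coordinates></Point>
  </Placemark>"""
    for title, year, lon, lat in records
)

kml = f"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>{placemarks}
  </Document>
</kml>"""

with open("publications.kml", "w", encoding="utf-8") as f:
    f.write(kml)
```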
This document discusses using computational grids for resource-intensive digital humanities projects. It provides two examples of projects that would benefit from using a grid: 1) comparing OCR text to ground truth, which requires large amounts of memory, and 2) aligning multiple OCR streams and generating error patterns, which is computationally intensive. The document raises questions about preparing code written in Java to run on a grid, whether the programming language matters, how input/output operations and storage work on a grid, and whether thread-based programs would be better suited for a grid.
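For the first example, the core comparison is an edit-distance computation; the sketch below computes Levenshtein distance while keeping only two rows in memory, which speaks directly to the memory concern raised above.

```python
# A sketch of the OCR-vs-ground-truth comparison: Levenshtein distance
# computed with two rows instead of a full matrix, since memory is the
# stated bottleneck for large texts.
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner dimension
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("0CR outpvt", "OCR output"))  # 2 character errors
```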
[poster] Extracting Information From Classics Scholarly Texts (Matteo Romanello)
This document provides an overview of a PhD research project aimed at developing an automatic system to extract structured information from a corpus of unstructured classics scholarly texts, in order to improve information retrieval capabilities. The project involves building a corpus from open access classics journal papers, applying natural language processing techniques to identify mentions of people, places, works, and other entities within texts, and using structured data from existing databases to disambiguate entity mentions and automatically generate new indices linking texts. The expected results are providing multiple meaningful access points to information within the corpus and demonstrating the scalability of the approach.
This document provides an introduction to digital humanities and philology. It discusses the history and methods of digital humanities, as well as key resources like journals, conferences, and projects. Examples of digital humanities applications for philology are described, such as parsing critical apparatuses and creating treebanks of ancient Greek. The document also outlines the digital tools available to philologists for finding, organizing, sharing, and reusing information in their work.
This document discusses extracting information from indices of quotations found in classical texts. It presents a parsing-based approach to extract data from the indices to support creating digital collections of ancient texts. Preliminary results show that applying a fuzzy parser to an OCR transcription of an index can extract information from potentially noisy input by representing the hierarchical structure of author names and referenced works. The parsing results can be used to automatically tag quotations in texts and reconstruct hyperlinks between the index and the text.
Rethinking Critical Editions of Fragments by Ontologies (Matteo Romanello)
This document discusses rethinking the representation of fragmentary classical texts in digital editions through the use of ontologies. It addresses problems with current editions, such as duplication of text. The authors analyze the domain to identify concepts like fragments as interpretations linked to evidence. They design an ontology with classes for interpretations, textual passages, and linking fragments to witness texts. The benefits cited include a solid architecture separating texts from interpretations, formalization of the domain, and improved data interoperability.
The document proposes a microformat to encode references to canonical texts in classics literature to enable linking references to primary source materials. It discusses preliminary definitions including reference linking, distinguishing between primary sources like ancient texts and secondary sources like commentaries and articles, and defining canonical text references that will be encoded in the proposed microformat. The proposal is for critical value-added services for e-journals on classics using this new microformat.
The document discusses using microformats and Canonical Text Services (CTS) Uniform Resource Names (URNs) to link primary sources like classical texts to secondary sources that discuss or reference them. It proposes encoding citation references to primary sources semantically as microformats to allow for loose coupling between systems. This would make the linking system open-ended, language-neutral and distributed across services. The CTS URN system provides unambiguous identifiers for authors, works and text passages. Together, microformats and CTS URNs could enable new services like semantic parsing of citations and aggregating related information from different sources to support research.
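To make the CTS URN scheme concrete, the sketch below pulls apart a URN of the form urn:cts:namespace:textgroup.work:passage; the example URN is the conventional identifier for Homer, Iliad 1.1.

```python
# A sketch of parsing a CTS URN of the form
# urn:cts:<namespace>:<textgroup>.<work>:<passage>.
def parse_cts_urn(urn: str) -> dict:
    parts = urn.split(":")
    if parts[:2] != ["urn", "cts"] or len(parts) < 4:
        raise ValueError(f"not a CTS URN: {urn}")
    work_parts = parts[3].split(".")
    return {
        "namespace": parts[2],
        "textgroup": work_parts[0],
        "work": work_parts[1] if len(work_parts) > 1 else None,
        "passage": parts[4] if len(parts) > 4 else None,
    }

print(parse_cts_urn("urn:cts:greekLit:tlg0012.tlg001:1.1"))
# {'namespace': 'greekLit', 'textgroup': 'tlg0012', 'work': 'tlg001', 'passage': '1.1'}
```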
M. Romanello, E-scholia: scenari digitali per la comunicazione scientifica in... (Matteo Romanello)
Slides presented during the defense of my master's thesis in Informatics for the Humanities (Informatica per le Discipline Umanistiche) at Università Ca' Foscari, Venice.
This document discusses linking references to ancient texts in secondary sources to relevant digital resources over the web. It proposes a microformat for embedding semantic information about canonical text references in XHTML documents. This would allow references to be mapped to requests to a text server using the Canonical Text Services protocol, in order to build a more distributed digital library and provide enhanced functionality like viewing referenced passages in context. Examples are given of how this could improve scholars' online research experience.
How to Fix the Import Error in the Odoo 17Celine George
An import error occurs when a program fails to import a module or library, disrupting its execution. In languages like Python, this issue arises when the specified module cannot be found or accessed, hindering the program's functionality. Resolving import errors is crucial for maintaining smooth software operation and uninterrupted development processes.
Strategies for Effective Upskilling is a presentation by Chinwendu Peace in a Your Skill Boost Masterclass organisation by the Excellence Foundation for South Sudan on 08th and 09th June 2024 from 1 PM to 3 PM on each day.
A workshop hosted by the South African Journal of Science aimed at postgraduate students and early career researchers with little or no experience in writing and publishing journal articles.
How to Build a Module in Odoo 17 Using the Scaffold MethodCeline George
Odoo provides an option for creating a module by using a single line command. By using this command the user can make a whole structure of a module. It is very easy for a beginner to make a module. There is no need to make each file manually. This slide will show how to create a module using the scaffold method.
This presentation includes basic of PCOS their pathology and treatment and also Ayurveda correlation of PCOS and Ayurvedic line of treatment mentioned in classics.
Walmart Business+ and Spark Good for Nonprofits.pdfTechSoup
"Learn about all the ways Walmart supports nonprofit organizations.
You will hear from Liz Willett, the Head of Nonprofits, and hear about what Walmart is doing to help nonprofits, including Walmart Business and Spark Good. Walmart Business+ is a new offer for nonprofits that offers discounts and also streamlines nonprofits order and expense tracking, saving time and money.
The webinar may also give some examples on how nonprofits can best leverage Walmart Business+.
The event will cover the following::
Walmart Business + (https://business.walmart.com/plus) is a new shopping experience for nonprofits, schools, and local business customers that connects an exclusive online shopping experience to stores. Benefits include free delivery and shipping, a 'Spend Analytics” feature, special discounts, deals and tax-exempt shopping.
Special TechSoup offer for a free 180 days membership, and up to $150 in discounts on eligible orders.
Spark Good (walmart.com/sparkgood) is a charitable platform that enables nonprofits to receive donations directly from customers and associates.
Answers about how you can do more with Walmart!"
How to Manage Your Lost Opportunities in Odoo 17 CRMCeline George
Odoo 17 CRM allows us to track why we lose sales opportunities with "Lost Reasons." This helps analyze our sales process and identify areas for improvement. Here's how to configure lost reasons in Odoo 17 CRM
Chapter wise All Notes of First year Basic Civil Engineering.pptxDenish Jangid
Chapter wise All Notes of First year Basic Civil Engineering
Syllabus
Chapter-1
Introduction to objective, scope and outcome the subject
Chapter 2
Introduction: Scope and Specialization of Civil Engineering, Role of civil Engineer in Society, Impact of infrastructural development on economy of country.
Chapter 3
Surveying: Object, principles & types of surveying; site plans, plans & maps; scales & units of different measurements.
Linear Measurements: Instruments used. Linear Measurement by Tape, Ranging out Survey Lines and overcoming Obstructions; Measurements on sloping ground; Tape corrections, conventional symbols. Angular Measurements: Instruments used; Introduction to Compass Surveying, Bearings and Longitude & Latitude of a Line, Introduction to total station.
Levelling: Instruments used, object of levelling, methods of levelling in brief, and contour maps.
Chapter 4
Buildings: Selection of site for buildings, layout of building plan, types of buildings, plinth area, carpet area, floor space index, introduction to building byelaws, concept of sunlight & ventilation. Components of buildings & their functions, basic concept of R.C.C., introduction to types of foundation.
Chapter 5
Transportation: Introduction to Transportation Engineering; Traffic and Road Safety: Types and Characteristics of Various Modes of Transportation; Various Road Traffic Signs, Causes of Accidents and Road Safety Measures.
Chapter 6
Environmental Engineering: Environmental Pollution, Environmental Acts and Regulations, Functional Concepts of Ecology, Basics of Species, Biodiversity, Ecosystem, Hydrological Cycle; Chemical Cycles: Carbon, Nitrogen & Phosphorus; Energy Flow in Ecosystems.
Water Pollution: Water quality standards, introduction to treatment & disposal of waste water. Reuse and saving of water, rain water harvesting. Solid Waste Management: Classification of solid waste; collection, transportation and disposal of solid waste; recycling of solid waste: energy recovery, sanitary landfill, on-site sanitation. Air & Noise Pollution: Primary and secondary air pollutants, harmful effects of air pollution, control of air pollution. Noise Pollution: Harmful effects of noise pollution, control of noise pollution. Global warming & climate change, ozone depletion, greenhouse effect.
Text Books:
1. Palancharmy, Basic Civil Engineering, McGraw Hill publishers.
2. Satheesh Gopi, Basic Civil Engineering, Pearson Publishers.
3. Ketki Rangwala Dalal, Essentials of Civil Engineering, Charotar Publishing House.
4. BCP, Surveying volume 1
Leveraging Generative AI to Drive Nonprofit Innovation (TechSoup)
In this webinar, participants learned how to utilize Generative AI to streamline operations and elevate member engagement. Amazon Web Services experts presented customer-specific use cases and dived into low/no-code tools that are quick and easy to deploy through Amazon Web Services (AWS).
A review of the growth of the Israel Genealogy Research Association Database Collection over the last 12 months. Our collection has now passed the 3-million mark and is still growing. See which archives have contributed the most, the different types of records we have, and which years have had records added. You can also see what we have in store for the future.
It describes the bony anatomy of the hip, including the femoral head, acetabulum and labrum, and discusses the capsule and ligaments. The muscles that act on the hip joint and its range of motion are outlined, and factors affecting hip joint stability and weight transmission through the joint are summarized.
Exploiting Artificial Intelligence for Empowering Researchers and Faculty, In... (Dr. Vinod Kumar Kanvaria)
Exploiting Artificial Intelligence for Empowering Researchers and Faculty,
International FDP on Fundamentals of Research in Social Sciences
at Integral University, Lucknow, 06.06.2024
By Dr. Vinod Kumar Kanvaria
Romanello tokyo
1. Structured Vs Unstructured: Extracting Information From Scholarly Texts in European Classical Studies
Matteo Romanello
Centre for Computing in the Humanities
EIRI - CCH Symposium on the Digitization in the Humanities
Keio University - Tokyo, 18th March 2010
2. Overview
1 Introduction
2 Motivations and Background
3 Methodology
4 Work Phases
5 Expected Results
4. Introduction
The Project at a glance
Project started in October 2009;
Field of application: Digital Humanities, Classics (particularly Greek literature);
Co-supervision between the CCH and the CS department at King's -> application of Computational Linguistics methods
5. Introduction
Focus
Scholarly texts from the European scholarly tradition in Classical Studies
Secondary sources (e.g. journal papers), as opposed to primary sources (i.e. ancient texts)
Sets of texts considered so far:
Princeton-Stanford Working Papers in Classics (PSWPC)
LEXIS online: a classics journal available online under an Open Access policy
goal -> information extraction
6. Introduction
Goal
Devising an automatic system to improve semantic information retrieval over a discipline-specific corpus of unstructured texts
focus on secondary sources
automatic -> scalable to huge amounts of data
information retrieval -> the task of retrieving information
unstructured texts -> raw texts (e.g. .txt files), as opposed to structured/encoded XML
8. Motivations
The Million Book Library
archive.org, Google Books -> growth of the volume of information publicly available in electronic format
longer "shelf-life" of books in Classics/Humanities
need for effective tools to access information for research purposes
9. Motivations
Information Extraction in Classics: challenges
lack of tools comparable to CiteSeerX, GoPubMed, etc.
results of traditional search engines -> high recall but low precision
need to go beyond TOCs or string-matching-based IR
still issues with the encoding of Ancient Greek
no ad hoc gold standards/training sets
lack of tools specifically tailored to Classics resources
electronically available text does not mean electronic text
11. Methodology
Named Entities as Access Point to Information
mentions of entities matter for Classicists -> hence the importance of print indexes in Classics
disambiguation: different spellings or translations of names; relating different expressions to the same entity
12. Methodology
Named Entities as Access Point to Information
Entities to be extracted:
1 Place Names (ancient and modern);
2 Relevant Person Names (mythological names, ancient authors, modern scholars);
3 References to primary and secondary sources (canonical texts and modern publications about them).
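To make the disambiguation idea on the last two slides concrete, here is a minimal, hypothetical Python sketch that maps variant spellings or translations of a name to a single canonical entity ID; the variant table and the URN are invented for illustration:

```python
# Minimal sketch of dictionary-based name disambiguation.
# The variant spellings and the entity ID below are invented examples,
# not data from the project's actual knowledge base.
VARIANTS = {
    "aeschylus": "urn:example:person:aeschylus",
    "aischylos": "urn:example:person:aeschylus",  # transliterated form
    "eschilo": "urn:example:person:aeschylus",    # Italian form
    "eschyle": "urn:example:person:aeschylus",    # French form
}

def resolve_entity(mention: str) -> str | None:
    """Map a surface mention to a canonical entity ID, if known."""
    return VARIANTS.get(mention.strip().lower())

print(resolve_entity("Eschilo"))  # -> urn:example:person:aeschylus
```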
13. Methodology
Reuse of Structured Information
Reuse of structured data sources (e.g. thesauri, authority lists, etc.) produced by scholars over the last two decades
-> to train machine-learning-based tools to mine unstructured texts.
Related work:
research in the AI field -> Semantic Integration
use of Wikipedia/DBpedia in NLP
related projects: EROCS by IBM
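One way to read this reuse idea, sketched under stated assumptions: treat an authority list as a gazetteer and use it to auto-label raw text, yielding training data for a machine-learned tagger. The gazetteer entries below are placeholders, not records from any real thesaurus:

```python
# Hypothetical sketch of "distant supervision": an authority list of
# place names labels tokens in raw text, producing training examples.
import re

GAZETTEER = {"Athens", "Thebes", "Delphi"}  # placeholder authority list

def label_tokens(sentence: str) -> list[tuple[str, str]]:
    """Tag each token as PLACE or O (outside) using the gazetteer."""
    tokens = re.findall(r"\w+", sentence)
    return [(t, "PLACE" if t in GAZETTEER else "O") for t in tokens]

print(label_tokens("The walls of Thebes were admired in Athens"))
```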
16. Work Phases
Corpus building
Getting materials
Crawling online archives
Extracting the text from collected documents
Tools for text extraction from PDF -> open issues with Ancient Greek encoding
re-OCR documents, even natively digital ones
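A hedged sketch of the extraction step, shelling out to the pdftotext command-line tool (part of Poppler) from Python; the file names are placeholders, and polytonic Greek may still come out garbled, which is exactly the open issue noted above:

```python
# Minimal sketch: extract raw text from a PDF via the `pdftotext` CLI.
# File names are placeholders; Ancient Greek passages may still be
# garbled, hence the re-OCR step mentioned on this slide.
import subprocess

def pdf_to_text(pdf_path: str, txt_path: str) -> None:
    subprocess.run(["pdftotext", "-enc", "UTF-8", pdf_path, txt_path],
                   check=True)

pdf_to_text("article.pdf", "article.txt")
```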
17. Work Phases
Corpus Building II
Corpora:
open access, multilingual
Princeton/Stanford Working Papers in Classics (PSWPC)
Lexis online
470 articles in 2 corpora
OCR:
FineReader
OCRopus (layout analysis)
text extracted from PDFs (tools like pdftotext etc.)
alignment of multiple OCR outputs
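As a rough illustration of aligning multiple OCR outputs, Python's difflib can align two candidate transcriptions of the same passage so that disagreements can be inspected or voted on; the two strings are invented examples of typical OCR noise:

```python
# Hypothetical sketch: align two OCR outputs of the same passage with
# difflib and report where they disagree.
from difflib import SequenceMatcher

ocr_a = "Aesch. Sept. 565-67"
ocr_b = "Aesch. Sept. 565·67"  # a middle dot misread for a hyphen

matcher = SequenceMatcher(None, ocr_a, ocr_b)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"{tag}: {ocr_a[i1:i2]!r} vs {ocr_b[j1:j2]!r}")
# -> replace: '-' vs '·'
```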
18. Work Phases
Building the Knowledge Base (KB)
Goal: integrate different data sources into a single KB
Why?
information about the same entities is spread over several data sources
data sources might use different output formats (raw text, DBs, HTML, XML, etc.)
partial overlaps, but no interoperability
How?
use of high-level ontologies to map records related to the same entity
Result: a KB containing semantic data
19. Work Phases
Building the Knowledge Base (KB) II
Ontologies -> in CS, a formalism to model data
Integrating data sources:
import each data source
map it to high-level ontologies (e.g., CIDOC-CRM)
find overlaps between data sources -> aligning the records
The obtained knowledge base will be used as support for all the text processing tasks
Implementation of the KB: RDF triple store with a SPARQL interface
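A minimal sketch of what such a store could look like, using the rdflib Python library; the namespace, entity and triples are invented for illustration, while the real KB maps records through ontologies such as CIDOC-CRM:

```python
# Hypothetical sketch of an RDF triple store queried through SPARQL,
# built with rdflib. All URIs and triples below are invented examples.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/kb/")
g = Graph()
g.add((EX.aeschylus, RDF.type, EX.AncientAuthor))
g.add((EX.aeschylus, EX.name, Literal("Aeschylus")))

# SPARQL query over the toy KB: list all ancient authors by name.
results = g.query("""
    SELECT ?name WHERE {
        ?author a <http://example.org/kb/AncientAuthor> ;
                <http://example.org/kb/name> ?name .
    }
""")
for row in results:
    print(row.name)  # -> Aeschylus
```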
20. Work Phases
Corpus Processing
1 sentence identification
2 entity extraction (named entity recognition + disambiguation): the KB is employed to build up an entity context
3 canonical reference extraction: the KB provides training data
4 modern bibliographic reference extraction: the KB provides lists of journals, place names and authors to improve the performance of the tool
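Read as code, the four steps form a simple pipeline. A hedged skeleton follows; every function body is a stub standing in for the real component, not the project's implementation:

```python
# Hypothetical skeleton of the four-step corpus-processing pipeline.
import re

def split_sentences(text: str) -> list[str]:
    # 1. sentence identification (naive placeholder rule)
    return re.split(r"(?<=[.!?])\s+", text)

def extract_entities(sentence: str) -> list[str]:
    # 2. NER + disambiguation against the knowledge base (stub)
    return []

def extract_canonical_refs(sentence: str) -> list[str]:
    # 3. canonical references, e.g. "Hom. Il. XII 1" (stub)
    return []

def extract_bibliographic_refs(sentence: str) -> list[str]:
    # 4. modern bibliographic references (stub)
    return []

def process(text: str) -> list[dict]:
    return [
        {
            "sentence": s,
            "entities": extract_entities(s),
            "canonical_refs": extract_canonical_refs(s),
            "bibliography": extract_bibliographic_refs(s),
        }
        for s in split_sentences(text)
    ]
```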
22. Work Phases
Canonical References Extraction
1 citations used specifically for primary sources (i.e. works of ancient authors)
2 an essential entry point to information: they refer to the research object, i.e. ancient texts
3 logical rather than physical citation scheme (e.g., chapter/paragraph vs. page)
4 variation across time, style and language (regular expressions alone are insufficient!)
Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
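To illustrate why regular expressions alone fall short, here is a hedged sketch of a naive pattern; the pattern is mine, not the project's CREX tool. It catches the first citation style above but misses the others:

```python
# Hypothetical sketch: a naive regex for canonical references. It
# matches the "Hom. Il. XII 1" style but fails on fragment numbers,
# editor sigla and non-English author names, as the slide argues.
import re

PATTERN = re.compile(r"\b[A-Z][a-z]+\.\s+[A-Z][a-z]+\.\s+[IVXLC]+\s+\d+")

for ref in ["Hom. Il. XII 1", "Hes. fr. 321 M.-W.", "Callimaco, 'ep.' 28 Pf., 5-6"]:
    print(ref, "->", bool(PATTERN.search(ref)))
# Hom. Il. XII 1 -> True; the other two variants -> False
```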
24. Expected Results
Results
Provide automatically multiple meaningful entry points to information
Enrich the corpus with links to resources (particularly primary sources)
Improve user access to the corpus
Demonstrate the scalability of the approach
Tools/Resources:
Knowledge Base for Classics
articles with improved text quality
(improved) corpora to be released
single tools for information extraction (e.g. CREX, the Canonical References EXtractor)
25. Expected Results
Possible Applications
Solutions to problems peculiar to Classics might help to improve the performance of existing tools/algorithms
Collections of secondary sources as corpora:
citation patterns
citation and co-citation networks
trends in Classics citation practice
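As a rough sketch of the co-citation idea (the per-article reference lists are invented), once canonical references have been extracted, a co-citation network links ancient works that are cited together:

```python
# Hypothetical sketch: a co-citation network of ancient works built
# from per-article extracted references, using networkx. The article
# names and reference lists are invented, not extracted project data.
from itertools import combinations
import networkx as nx

articles = {
    "paper_1": ["Hom. Il.", "Aesch. Sept.", "Hes. fr."],
    "paper_2": ["Hom. Il.", "Aesch. Sept."],
}

G = nx.Graph()
for refs in articles.values():
    for a, b in combinations(sorted(set(refs)), 2):
        weight = G.get_edge_data(a, b, default={}).get("weight", 0)
        G.add_edge(a, b, weight=weight + 1)

# "Hom. Il." and "Aesch. Sept." are co-cited in both papers.
print(G["Hom. Il."]["Aesch. Sept."]["weight"])  # -> 2
```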
26. Expected Results
Thanks for your attention!
matteo.romanello@kcl.ac.uk
http://uk.linkedin.com/in/matteoromanello