The document discusses semantic interoperability in the CLARIN infrastructure. CLARIN aims to provide researchers with access to digital language data and tools. It sets technical standards and recommendations to enable the discovery and sharing of resources held in different places. CLARIN achieves a degree of semantic interoperability through its Component Metadata Infrastructure (CMDI), which combines a metadata concept registry (ISOcat) with component-based metadata profiles. This infrastructure allows resources to be semantically mapped and searched despite differences in their metadata structures. While flexible, the system could be improved by simplifying the concept registry workflow and by integrating semantic annotation more tightly into researchers' tools and workflows.
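The core idea of concept-based metadata mapping can be sketched in a few lines: repositories keep their own field names but link each field to a shared concept identifier, so a harvester can align records across schemas. The schemas and concept IDs below are invented for illustration, not actual ISOcat entries.

```python
# Illustrative sketch: two repositories use different field names, but both
# link their fields to a shared concept identifier, so records can be
# aligned and searched uniformly. Concept IDs here are hypothetical.

SCHEMA_A = {"creator": "concept:contributor", "lang": "concept:language"}
SCHEMA_B = {"author": "concept:contributor", "language": "concept:language"}

def normalize(record, schema):
    """Re-key a metadata record by the shared concept IDs of its fields."""
    return {schema[k]: v for k, v in record.items() if k in schema}

rec_a = normalize({"creator": "Smith", "lang": "de"}, SCHEMA_A)
rec_b = normalize({"author": "Jones", "language": "nl"}, SCHEMA_B)

# Both records can now be queried by the same concept key, e.g.
# rec_a["concept:language"] and rec_b["concept:language"].
```

The point is that semantic agreement lives in the registry of concepts, not in a single enforced schema, which is what lets structurally different metadata remain searchable together.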
Wroclaw University Library presentation at the "Succeed in Digitisation. Spreading Excellence" conference, on the validation and take-up of text digitisation tools.
This document summarizes an experiment on automatically assigning topics to text from a historical encyclopedia using optical character recognition (OCR).
The researchers tested automated topic assignment on 14 OCR'd pages from an 18th-century German encyclopedia. They analyzed the recall and precision of topic assignment on the OCR'd text, the original text, and the original text with modernized spelling. Topic assignment proved challenging because of OCR errors and because some historical topics were not represented in their topic hierarchy.
While automated topic assignment showed some value in organizing the historical texts, errors limited its usefulness where high precision was required. The researchers identified ways to improve precision, such as updating the topic hierarchy, and proposed complementing automated assignment with social tagging.
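The recall/precision evaluation described above reduces to comparing the automatically assigned topic set against a human gold set. A minimal sketch, with invented topic sets:

```python
def precision_recall(assigned, gold):
    """Precision and recall of automatically assigned topics vs. a gold set."""
    tp = len(assigned & gold)                       # true positives
    precision = tp / len(assigned) if assigned else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

gold = {"medicine", "botany", "law"}        # topics a human indexer would assign
assigned = {"medicine", "botany", "music"}  # "music" is a spurious hit, "law" is missed
p, r = precision_recall(assigned, gold)     # both 2/3 in this toy case
```

OCR errors typically hurt recall (garbled words fail to trigger the right topics), while gaps in the topic hierarchy hurt precision (the system falls back on loosely related modern topics).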
1) The document outlines steps in a digitization workflow including scanning, image enhancement, page splitting, border and curl removal, layout analysis, segmentation, and optical character recognition.
2) It describes a fully automated tool for correcting arbitrary geometric distortions in documents, including those with multiple columns. The process is fully parameterized and reversible with no adverse effects on undistorted documents.
3) Preliminary results show the method more accurately corrects distortions compared to another method and the original images, by calculating deviations from straight lines.
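The evaluation criterion in point 3, deviation from straight lines, can be sketched as a simple baseline-straightness measure. The y-coordinates below are invented; the paper's actual metric may be computed differently.

```python
import statistics

def straightness_deviation(baseline_ys):
    """Mean absolute deviation of a text line's baseline y-coordinates from
    their mean. A perfectly straight horizontal baseline scores 0; larger
    values indicate more residual warp or curl."""
    mean_y = statistics.fmean(baseline_ys)
    return statistics.fmean(abs(y - mean_y) for y in baseline_ys)

warped = [10.0, 12.5, 14.0, 12.5, 10.0]     # curled line near the spine
flattened = [10.0, 10.1, 10.0, 9.9, 10.0]   # same line after dewarping
# straightness_deviation(warped) > straightness_deviation(flattened)
```

Comparing this score before and after correction, and against a competing method, gives the kind of quantitative evidence the preliminary results report.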
There are four main types of biomolecules that make up living things: proteins, carbohydrates, lipids, and nucleic acids. These large molecules are composed of carbon and other atoms bonded together. Energy stored in the covalent bonds of these biomolecules is released when they are broken down during chemical reactions in the body, allowing the body to use the parts to build new molecules and structures.
Governare Reti, Governare con le Reti (Governing Networks, Governing with Networks), with speaker notes, by Stefano Rossi
A critical reflection on Governance 2.0 proposals, presented at the Venezia BarCamp on 24/10/2009. I am adding the version with the notes I had prepared as an outline for the talk (which lasted 5 minutes, per the Ignite rules). I am doing so now because the M5S phenomenon seems to me to highlight all the issues I raised in that talk.
Slides of the paper Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts by Helmut Schmid at the 3rd Edition of the DATeCH2019 International Conference
This document discusses using text models to improve the accuracy of optical character recognition (OCR) on Chinese rare books. Experiments were conducted with n-gram, backward/forward n-gram, and LSTM models on OCR output from ancient medicine books. The backward/forward 4-gram model achieved the highest correction rate at 97.57%. Mixing the LSTM 6-gram model with the OCR's top 5 candidates and the probability of the top candidate further improved accuracy to 97.71%, demonstrating that combining text models with OCR probabilities corrects OCR errors better than text models alone. In conclusion, text models are effective for increasing OCR accuracy on rare books, with the backward/forward 4-gram and LSTM 6-gram models performing best.
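The mixing step, choosing among the OCR's top candidates by combining OCR confidence with a language-model score, can be sketched as a weighted rescoring. The weight, scores, and characters below are invented for illustration; the paper's exact mixing scheme may differ.

```python
def pick_character(ocr_candidates, lm_score, alpha=0.5):
    """Choose among the OCR's candidate characters by linearly mixing the
    OCR probability with a language-model score for the context.
    alpha (the OCR weight) is a hypothetical tuning parameter."""
    return max(ocr_candidates,
               key=lambda ch: alpha * ocr_candidates[ch]
                              + (1 - alpha) * lm_score(ch))

# Toy case: the OCR slightly prefers a visually similar wrong character,
# but the language model overrules it given the surrounding text.
candidates = {"人": 0.45, "入": 0.55}                  # OCR top candidates
lm = lambda ch: {"人": 0.9, "入": 0.1}.get(ch, 0.0)    # stand-in LM score
best = pick_character(candidates, lm)
```

This is why the mixed system beats the text model alone: visually ambiguous characters keep their OCR evidence instead of being decided purely from context.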
Slides of the paper Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project by Katrien Depuydt and Hennie Brugman at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Standoff Annotation for the Ancient Greek and Latin Dependency Treebank by Giuseppe Celano at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Using lexicography to characterise relations between species mentions in the biodiversity literature by Sandra Young at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability by Evagelos Varthis, Marios Poulos, Ilias Yarenis and Sozon Papavlasopoulos at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Curation Technologies for a Cultural Heritage Archive: Analysing and transforming a heterogeneous data set into an interactive curation workbench by Georg Rehm, Martin Lee, Julián Moreno Schneider and Peter Bourgonje at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Cross-disciplinary collaborations to enrich access to non-Western language material in the Cultural Heritage sector by Tom Derrick and Nora McGregor at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Tribunal Archives as Digital Research Facility (TRIADO): new ways to make archives accessible and useable by Anne Gorter, Edwin Klijn, Rutger Van Koert, Marielle Scherer and Ismee Tames at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Improving OCR of historical newspapers and journals published in Finland by Senka Drobac, Pekka Kauppinen and Krister Lindén at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Towards a generic unsupervised method for transcription of encoded manuscripts by Arnau Baró, Jialuo Chen, Alicia Fornés and Beáta Megyesi at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Towards the Extraction of Statistical Information from Digitised Numerical Tables - The Medical Officer of Health Reports Scoping Study by Christian Clausner, Apostolos Antonacopoulos, Christy Henshaw and Justin Hayes at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software by Kimmo Kettunen, Teemu Ruokolainen, Erno Liukkonen, Pierrick Tranouez, Daniel Antelme and Thierry Paquet at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper OCR-D: An end-to-end open-source OCR framework for historical documents by Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Kay-Michael Würzner, Matthias Boenig, Elisa Hermann and Volker Hartmann at the 3rd Edition of the DATeCH2019 International Conference
- The document describes a project to fill gaps in knowledge about diamond mining, trading, and polishing in Borneo by developing a workflow using various CLARIAH tools and resources.
- The workflow involved digitizing a diamond encyclopedia, extracting concepts and place names, linking the data to external sources to create linked open data, and querying newspaper archives to build a corpus of relevant articles.
- Promising results showed mining, trading, and polishing continued in Borneo for Southeast Asian customers, and described previously unknown diamond fields and polishing locations in Borneo. The project aims to apply the workflow to other commodities like sugar.
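The corpus-building step of the workflow, querying newspaper archives for articles relevant to the commodity and the extracted place names, can be sketched as a simple filter. The Dutch search terms, place names, and sentences below are invented examples, not the project's actual query lists.

```python
COMMODITY_TERMS = {"diamant", "diamanten"}   # hypothetical Dutch search terms
PLACES = {"martapura", "landak"}             # hypothetical Borneo place names

def is_relevant(article_text):
    """Keep newspaper articles mentioning both the commodity and one of the
    place names extracted earlier in the workflow (crude token match)."""
    tokens = set(article_text.lower().split())
    return bool(tokens & COMMODITY_TERMS) and bool(tokens & PLACES)

hit = is_relevant("Handel in diamanten te Martapura neemt toe")
miss = is_relevant("Suiker uit Java naar Amsterdam verscheept")
```

In the actual project, the linked open data produced in the previous step would supply the place-name list, and the matching would run against digitized newspaper archives rather than literal strings.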
Slides of the paper Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii by Juri Opitz, Leo Born, Vivi Nastase and Yannick Pultar at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification by Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner and Frank Puppe at the 3rd Edition of the DATeCH2019 International Conference
This document describes the SOS system for segmenting, stemming, and standardizing Arabic text. It presents the challenges of processing Arabic cultural heritage texts which contain orthographic variations. The system uses gradient boosting machines and achieves state-of-the-art performance on segmentation and derives stemming as a byproduct. It also standardizes orthography with high accuracy, which further improves segmentation. The system addresses issues like hamza forms and letter confusions that previous systems did not handle well.
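Framing segmentation as per-character boundary prediction means each position in a word is classified from features of its surrounding characters. A minimal sketch of such a feature extractor follows; the feature names and window size are illustrative, and the real SOS system feeds features like these to a gradient boosting classifier rather than inspecting them directly.

```python
def boundary_features(text, i, window=2):
    """Features for deciding whether a segmentation boundary follows text[i]:
    the characters in a small window around position i, padded at the edges.
    A classifier (gradient boosting in the SOS system) is trained on such
    feature dicts with boundary/no-boundary labels."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        feats[f"c{off:+d}"] = text[j] if 0 <= j < len(text) else "<pad>"
    return feats

feats = boundary_features("wktb", 1)   # window around the 2nd character
```

Because stems are just the spans between predicted boundaries, stemming falls out of segmentation as a byproduct, and normalizing hamza forms and commonly confused letters first makes these character windows more consistent, which is why standardization improves segmentation.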