Presentation of the paper "User-driven correction of OCR errors: Combining crowdsourcing and information retrieval technology" by Günter Mühlberger, Johannes Zelger, David Sagmeister and Albert Greinöcker in DATeCH 2014. #digidays
Presentation of the paper "PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text" by Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter and Klaus Schulz in DATeCH 2014. #digidays
Presentation of a paper from the Infobazy 2014 conference describing work carried out in the MARKOS project. The goal of the MARKOS project is to design and develop a web service that lets users search the global space of Open Source projects for components that optimally meet user-specified criteria. With this system, creators and users of Open Source Software (OSS) will be able to easily and automatically analyse dependencies between the OSS components they use, taking into account the functional, structural and licensing aspects of the source code.
The result of the project will be a prototype service deployed on the Internet by the project partners and made available through a set of interactive applications, both via a graphical user interface and via a semantic data access point following the linked data model. The service will be implemented by a set of internal MARKOS components responsible for multi-context analysis of information available on the web, and for processing and storing it in the system's internal semantic repository.
The MARKOS system will offer users semantic search and browsing of components and libraries, and navigation of code structure at a high level of abstraction. This will make it easier, particularly for architects and analysts, to find a component that meets a system's functional, technical and legal requirements, and it will help developers better understand the available interfaces and internal dependencies of the software. MARKOS will also take code-integration aspects into account, exposing and exploiting dependencies and relationships between software components from different projects, thereby providing an integrated, global view of existing Open Source software. It will further use inter-component dependencies for more effective and accurate license-compatibility analysis, providing grounds for legal argumentation and conflict resolution. To ease collaboration between projects, MARKOS will also provide tools for notifying dependent projects of significant component changes. With this functionality in a global context, MARKOS is expected to facilitate software development based on the Open Source paradigm and to contribute to the global community.
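The license-compatibility analysis described above can be illustrated as a check over a dependency graph. The sketch below is a minimal, hypothetical Python version: the compatibility table, component names and function are invented for illustration and are not MARKOS's actual rules or data model.

```python
# Minimal sketch of dependency-graph license checking, in the spirit of the
# MARKOS analysis described above. The compatibility table is illustrative
# only; real license compatibility is far more nuanced.

# Which licenses a component may depend on (simplified, hypothetical rules).
COMPATIBLE_WITH = {
    "GPL-3.0": {"GPL-3.0", "LGPL-3.0", "Apache-2.0", "MIT"},
    "Apache-2.0": {"Apache-2.0", "MIT"},
    "MIT": {"MIT"},
}

def find_conflicts(components, dependencies):
    """components: name -> license; dependencies: list of (user, used) pairs.
    Returns the dependency pairs whose licenses clash under the table above."""
    conflicts = []
    for user, used in dependencies:
        allowed = COMPATIBLE_WITH.get(components[user], set())
        if components[used] not in allowed:
            conflicts.append((user, used))
    return conflicts

components = {"app": "GPL-3.0", "libA": "MIT", "libB": "Apache-2.0"}
deps = [("app", "libA"), ("app", "libB"), ("libB", "libA")]
print(find_conflicts(components, deps))  # [] — no conflicts in this example
```

A real system would of course resolve transitive dependencies and handle dual licensing; this only shows the shape of the per-edge check.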
The document describes the activities and philosophy of Tecnilógica, an innovation company. Tecnilógica works on mobile development, web development, digital signage and innovative projects. Its philosophy rests on a passionate team, a sustainable model, and knowledge. The company aims to be a technology partner for its clients.
Purposeful Gaming, OCR Correction and Seed & Nursery Catalog Digitization (Marty Schlabach)
An online game will be developed to crowd-source the correction of OCRed content in the Biodiversity Heritage Library (BHL). Several additional content types will be digitized and added to BHL, namely seed lists, seed & nursery catalogs, and handwritten field notebooks.
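One common way to crowd-source such corrections is to accept a transcription only when enough independent players agree. The snippet below is a hedged sketch of that idea; the threshold, function name and examples are assumptions, not BHL's actual game logic.

```python
# Hedged sketch: aggregate crowd answers for one OCR-suspect word by
# majority vote, accepting the answer only above an agreement threshold.
from collections import Counter

def aggregate(transcriptions, min_agreement=0.5):
    """Return the majority transcription if enough players agree, else None."""
    counts = Counter(transcriptions)
    word, n = counts.most_common(1)[0]
    return word if n / len(transcriptions) > min_agreement else None

print(aggregate(["heritage", "heritage", "herilage"]))  # heritage
print(aggregate(["a", "b"]))  # None — no majority, needs more players
```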
The document provides information about IT innovation in Austria through the case study of Softwarepark Hagenberg. It discusses how Softwarepark Hagenberg, located near Linz, Austria, functions as an innovation spiral through its research, education, and business activities. It was founded in 1987 as a spin-off of Johannes Kepler University Linz to foster software development. It now hosts over 2,500 R&D coworkers and students from its 120 company tenants. It also discusses the university's academic programs in information technology and role in supporting the regional economy.
Shaping Collaboration at the University of Zurich (Roberto Mazzoni)
The document discusses the challenges faced by IT Services at the University of Zurich in implementing a collaboration platform. It describes the University, which has over 26,000 students and 35,000 total users across decentralized faculties and institutes. The University required a solution providing high availability, scalability, and disaster recovery across the variety of operating systems and devices used by its autonomous and mobile user base. After an evaluation process, IT Services selected IBM Notes for its ability to support the many operating systems and meet the required high standards without dictating client or system choices.
A discussion of Text and Data Mining in science and at Springer Nature in particular. As presented at the Frankfurt Book Fair 2018 by Markus Kaindl, Senior Manager Semantic Data, Springer Nature.
This document discusses metadata considerations for the Europeana Newspapers project. It begins with an introduction to the speaker and his background in digital library projects. It then covers general concepts of metadata, how metadata is important for digitized newspapers, and the Europeana Newspaper METS ALTO Profile (ENMAP) that is being developed to provide robust metadata for the project. The goal of ENMAP is to create a standardized format for metadata that can be used for preservation, access, and delivery of newspaper data to Europeana.
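Since ENMAP builds on METS/ALTO, a small example of reading OCR text out of an ALTO file may make the format concrete. The sketch below assumes only the standard ALTO convention of `String` elements carrying a `CONTENT` attribute; the sample document is invented, and the parsing is deliberately namespace-agnostic because ALTO has several schema versions.

```python
# Extract the plain text from an ALTO document (the OCR layout format used
# in the METS/ALTO profile discussed above). Sample data is a minimal stand-in.
import xml.etree.ElementTree as ET

def alto_words(alto_xml):
    """Yield the CONTENT of every ALTO <String> element, in document order."""
    root = ET.fromstring(alto_xml)
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] == "String":  # strip any XML namespace
            yield el.attrib.get("CONTENT", "")

sample = """<alto><Layout><Page><TextBlock>
  <TextLine><String CONTENT="Europeana"/><String CONTENT="Newspapers"/></TextLine>
</TextBlock></Page></Layout></alto>"""
print(" ".join(alto_words(sample)))  # Europeana Newspapers
```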
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner (Francesco Osborne)
The document summarizes research on automatically classifying Springer Nature proceedings using the Smart Topic Miner (STM). STM extracts topics from publications, maps them to a computer science ontology, selects the relevant topics using a greedy algorithm, and infers tags. It was evaluated with 8 Springer Nature editors, who found that STM accurately classified 75-90% of proceedings and improved their workflow. However, STM is currently limited to computer science, and occasional noisy results were found in books with few chapters. Future work aims to expand STM to characterize topic evolution over time and to directly support author tagging.
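The greedy topic-selection step described above can be sketched as greedy set cover: repeatedly pick the topic that explains the most not-yet-covered keywords. The topic-to-keyword mapping below is invented for illustration and is not STM's actual ontology or algorithm.

```python
# Greedy set-cover sketch of topic selection: at each step, choose the topic
# covering the most keywords that no already-chosen topic covers.

def greedy_topics(keyword_sets, max_topics=3):
    """keyword_sets: topic -> set of keywords it covers. Returns chosen topics."""
    uncovered = set().union(*keyword_sets.values())
    chosen = []
    while uncovered and len(chosen) < max_topics:
        best = max(keyword_sets, key=lambda t: len(keyword_sets[t] & uncovered))
        if not keyword_sets[best] & uncovered:
            break  # remaining topics add nothing new
        chosen.append(best)
        uncovered -= keyword_sets[best]
    return chosen

topics = {
    "machine learning": {"neural networks", "classification", "svm"},
    "databases": {"sql", "indexing"},
    "neural networks": {"neural networks"},
}
print(greedy_topics(topics))  # ['machine learning', 'databases']
```

Note how "neural networks" is never selected: its single keyword is already covered by the broader topic, which is exactly the pruning behaviour an editor wants.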
This is a copy of the presentation given by Ellen Fleurbaay and Marc Holtman of the Amsterdam City Archives at the MARAC Plenary Session in Jersey City on Friday, October 30, 2009.
The document summarizes the JISC HIKE Project at the University of Huddersfield which evaluated the Intota library management system from Serials Solutions and the JISC Knowledge Base+. The project aimed to understand current workflows, identify pain points, evaluate the new systems, provide guidance on integration, and assess the impact on workflows. Intota promises improved integrated workflows from discovery to acquisition and more automated processing. The project found opportunities to reduce duplication and break down silos through new interoperable systems.
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDB (MongoDB)
Grant Muller is the Vice President of Application Software and Architecture at Xylem, a water technology company. He has over 15 years of experience developing software for utilities and has been using MongoDB for around 10 years.
Xylem is a global water technology company with over 17,000 employees operating in over 50 countries. They have been using MongoDB since 2009 when they acquired Verdeeco, an analytics startup that was using MongoDB. Since then, they have continued adopting MongoDB and scaling their usage of it as their data and applications have grown significantly through acquisitions.
Xylem is now developing an IoT platform called Xylem IoT Cloud to connect their various water devices. They are storing the sensor
This document discusses optical character recognition (OCR) of historical newspapers. It describes the digitization process, which includes image capturing, text and structure recognition, natural language processing, and content representation. OCR accuracy can be improved through layout analysis, structural metadata extraction, and identifying different content units like articles, advertisements, and entertainment sections. The goal is to make the content and knowledge within digitized newspapers accessible beyond the scanned text.
Linda Treude, Sabine Wolf: Features for the Future Library #bcs2015 (KISK FF MU)
Talk given at the BOBCATSSS 2015 conference - http://www.bobcatsss2015.com/.
The contribution "Features for the Future Library" introduces the German project "mylibrARy", a cooperation between the University of Applied Sciences in Potsdam, a public library in Berlin, and metaio GmbH from Munich, one of the leading AR software companies. The conceptual process behind a library AR app is presented, along with the results of a user study that may answer the question of which app features library users actually want.
Furthermore, the possibilities of AR technology for libraries in general are discussed and placed in the context of a modern, user-friendly library.
Supporting Springer Nature Editors by means of Semantic Technologies (Francesco Osborne)
The Open University and Springer Nature have been collaborating since 2015 in the development of an array of semantically-enhanced solutions supporting editors in i) classifying proceedings and other editorial products with respect to the relevant research areas and ii) taking informed decisions about their marketing strategy. These solutions include i) the Smart Topic API, which automatically maps keywords associated with published papers to semantically characterized topics, which are drawn from a very large and automatically-generated ontology of Computer Science topics; ii) the Smart Topic Miner, which helps editors to associate scholarly metadata to books; and iii) the Smart Book Recommender, which assists editors in deciding which editorial products should be marketed in a specific venue.
Publishing conference proceedings internationally: how does it work (Aliaksandr Birukou)
In this presentation we look into the main elements one has to consider when organizing an international conference. First, we describe the role of conference proceedings in computer science and beyond. Second, we focus on the tasks of conference organizers. Third, we cover peer-review aspects and announce the new group that CrossRef and DataCite are starting in this area. We then cover indexing and dissemination, present several tips and guidelines for organizers of international conferences, and close with a word of warning about predatory publishers.
Grant presents a case study of the 19th Century Pamphlets digitisation project, covering the decisions made in planning the project, the challenges encountered, and key lessons learned.
With approximately 1.x years of delay relative to the US, the term "Data Science" is also gaining momentum in Europe. Every month we see more job openings for, and business cards of, data scientists, new events dedicated to the topic, and increased demand for related education. In response to this trend, Zurich University of Applied Sciences founded the ZHAW Data Science Laboratory (Datalab) last year.
This talk gives an updated overview of Data Science in Europe, using the Datalab's activities in Switzerland as an example. After a definition and classification of the field, a presentation of real technical projects sets the stage for what Data Science looks like here, away from internet behemoths and big-data clichés. Conclusions on the state of the art, at least in Switzerland, are then drawn from evaluating the recent "1st Swiss Workshop on Data Science" and ZHAW's professional education programme "DAS in Data Science".
With the help of the audience during the subsequent discussion, these results can then be extrapolated to the wider European community.
2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023) (gerogepatton)
The 2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Machine Learning, Natural Language Computing and Data Mining. Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, survey works and industrial experiences describing significant advances in, but not limited to, these areas.
A 4 hour hands on linked data workshop held at ELAG 2013 - http://elag2013.org/ws2-very-gentle-linked-data/. Resources at http://data.archiveshub.ac.uk/workshops/elag2013/
British Library Labs 21st Century Curatorship Talk (labsbl)
The document discusses the British Library Labs program and lessons learned. It provides an overview of how Labs works with stakeholders like researchers and developers. Labs runs competitions to fund projects that experiment with the Library's digital collections. Winners complete residencies to develop tools and services. Lessons include the need to filter large collections, engage curators, address metadata and system issues, and provide flexible access to support digital research.
Acquisition policy and business models of research libraries in a digital era... (dduin)
This document discusses how research libraries are adapting to the digital era. It notes that libraries have changed more in the last decade than the last century as they shift resources from print to digital. Libraries are expected to support teaching, research, and scholarly communication. The document recommends that natural history institutions go fully digital with their publications for increased visibility, accessibility, and cost savings. It encourages the use of open access models and infrastructure like the Directory of Open Access Journals to make publications more discoverable.
Training workshop on "Designing and conducting user studies"
Module 1 - Methods and Techniques (Kristien Ooms)
@ ICC&GIS
June 15th, 2016
Albena, Bulgaria
Innovation and project management at ETH Library (ETH-Bibliothek)
The document provides information about ETH Zurich Library and its efforts in innovation and project management. It discusses ETH Zurich as an institute of technology and science with over 18,500 students. It then describes ETH Library, which has main and special libraries containing over 7 million holdings. The library has undertaken various innovation initiatives like introducing an ideas management process, project management standards, and launching projects like refreshing the Knowledge Portal and developing the ETHorama tool to enhance access to electronic holdings. It also discusses piloting e-lending of e-books to external users, which started with 26,000 e-books and saw increasing uptake over time.
Part 1 of the printed publication "3D-ICONS Guidelines and Case Studies" First published in November 2014.
Public fascination with architectural and archaeological heritage is well known; according to the UN World Tourism Organisation, it is one of the main reasons for tourism. Historic buildings and archaeological monuments form a significant component of Europe's cultural heritage; they are the physical testimonies of European history and of the different events that led to the creation of the European landscape as we know it today.
The documentation of built heritage increasingly makes use of 3D scanning and other remote sensing technologies, which produce digital replicas accurately and quickly. Such digital models have a wide range of uses, from the conservation and preservation of monuments to the communication of their cultural value to the public. They may also support in-depth analysis of their architectural and artistic features, as well as allow the production of interpretive reconstructions of their past appearance.
The goal of the 3D-ICONS project, funded under the European Commission's ICT Policy Support Programme and building on the results of CARARE (www.carare.eu) and 3D-COFORM (www.3d-coform.eu), is to provide Europeana with 3D models of architectural and archaeological monuments of remarkable cultural importance. The project brings together 16 partners (see appendix 2) from 11 countries across Europe with relevant expertise in 3D modelling and digitization. Its main purpose is to produce around 4000 accurate 3D models, processed into a simplified form so that they can be visualized on low-end personal computers and on the web.
Slides of the paper Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts by Helmut Schmid at the 3rd Edition of the DATeCH2019 International Conference
This document discusses using text models to improve the accuracy of optical character recognition (OCR) on Chinese rare books. Experiments were conducted with n-gram, backward/forward n-gram, and LSTM models on OCR data from ancient medicine books. The backward/forward 4-gram model achieved the highest correction rate at 97.57%. Mixing the LSTM 6-gram model with the OCR's top 5 candidates and the probability of the top candidate further improved accuracy to 97.71%, demonstrating that combining text models with OCR probabilities corrects OCR errors better than text models alone. In conclusion, text models are effective for increasing OCR accuracy on rare books, with the backward/forward 4-gram and LSTM 6-gram models performing best.
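The mixing step, combining a text model's score with the OCR engine's candidate probabilities, can be sketched as log-linear rescoring of the candidate list. The toy word-frequency "language model", the words, and the weight below are placeholders for illustration, not the paper's actual models or parameters.

```python
# Sketch of the mixing idea: rescore the OCR engine's top-k candidates with
# a text-model score, instead of trusting either signal alone.
import math

LM = {"medicine": 0.01, "rnedicine": 1e-9, "ancient": 0.02}  # toy unigram LM

def rescore(candidates, lam=0.5, floor=1e-12):
    """candidates: list of (word, ocr_probability). Returns the word that
    maximizes a log-linear mix of OCR confidence and language-model score."""
    def score(word, p_ocr):
        return lam * math.log(max(p_ocr, floor)) + \
               (1 - lam) * math.log(LM.get(word, floor))
    return max(candidates, key=lambda c: score(*c))[0]

# The OCR engine slightly prefers the garbled "rnedicine"; the LM overrules it.
print(rescore([("rnedicine", 0.6), ("medicine", 0.4)]))  # medicine
```

In the paper's setup the text model would be an n-gram or LSTM model over context, not a unigram table, but the candidate-rescoring shape is the same.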
More Related Content
Similar to Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerFrancesco Osborne
The document summarizes research on automatically classifying Springer Nature proceedings using the Smart Topic Miner (STM). STM extracts topics from publications, maps them to a computer science ontology, selects relevant topics using a greedy algorithm, and infers tags. It was tested on 8 Springer Nature editors who found STM accurately classified 75-90% of proceedings and improved their work. However, STM is currently limited to computer science and occasional noisy results were found in books with few chapters. Future work aims to expand STM to characterize topic evolution over time and directly support author tagging.
This is a copy of the presentation given by Ellen Fleurbaay and Marc Holtman of the Amsterdam City Archives at the the MARAC Plenary Session in Jersey City on Friday October 30, 2009.
The document summarizes the JISC HIKE Project at the University of Huddersfield which evaluated the Intota library management system from Serials Solutions and the JISC Knowledge Base+. The project aimed to understand current workflows, identify pain points, evaluate the new systems, provide guidance on integration, and assess the impact on workflows. Intota promises improved integrated workflows from discovery to acquisition and more automated processing. The project found opportunities to reduce duplication and break down silos through new interoperable systems.
MongoDB.local Atlanta: MongoDB @ Sensus: Xylem IoT and MongoDBMongoDB
Grant Muller is the Vice President of Application Software and Architecture at Xylem, a water technology company. He has over 15 years of experience developing software for utilities and has been using MongoDB for around 10 years.
Xylem is a global water technology company with over 17,000 employees operating in over 50 countries. They have been using MongoDB since 2009 when they acquired Verdeeco, an analytics startup that was using MongoDB. Since then, they have continued adopting MongoDB and scaling their usage of it as their data and applications have grown significantly through acquisitions.
Xylem is now developing an IoT platform called Xylem IoT Cloud to connect their various water devices. They are storing the sensor
This document discusses optical character recognition (OCR) of historical newspapers. It describes the digitization process, which includes image capturing, text and structure recognition, natural language processing, and content representation. OCR accuracy can be improved through layout analysis, structural metadata extraction, and identifying different content units like articles, advertisements, and entertainment sections. The goal is to make the content and knowledge within digitized newspapers accessible beyond the scanned text.
Linda Treude, Sabine Wolf: Features for the Future Library #bcs2015KISK FF MU
Talk given at the BOBCATSSS 2015 conference - http://www.bobcatsss2015.com/.
In the contribution “Features for the Future Library” the German project “mylibrARy” will be introduced, which is a cooperation project between the University of Applied Sciences in Potsdam, a public library in Berlin and one of the leading AR-software companies metaio GmbH from Munich. The conceptual process of a library AR-app will be presented as well as the results of a user study, which might give an answer to the question, what features of an app the library users want.
Furthermore the possibilities of AR-technology for libraries in general will be discussed and contextualized within the concept of a modern user-friendly library.
Supporting Springer Nature Editors by means of Semantic TechnologiesFrancesco Osborne
The Open University and Springer Nature have been collaborating since 2015 in the development of an array of semantically-enhanced solutions supporting editors in i) classifying proceedings and other editorial products with respect to the relevant research areas and ii) taking informed decisions about their marketing strategy. These solutions include i) the Smart Topic API, which automatically maps keywords associated with published papers to semantically characterized topics, which are drawn from a very large and automatically-generated ontology of Computer Science topics; ii) the Smart Topic Miner, which helps editors to associate scholarly metadata to books; and iii) the Smart Book Recommender, which assists editors in deciding which editorial products should be marketed in a specific venue.
Publishing conference proceedings internationally: how does it workAliaksandr Birukou
In this presentation we look into main elements one has to consider when organizing an international conference. First, we describe the role of conference proceedings in CS and beyond. Second, we focus on the tasks of conference organizers. Third, we cover the peer review aspects and announce the new group CrossRef and DataCite start with this respect. We then cover indexing and dissemination as well as present several tips and guidelines for organizers of international conferences as well as the word of warning regarding predatory publishers.
В этой презентации мы рассмотрим основные элементы, которые необходимо учитывать при организации международной конференции. Во-первых, мы описываем роль материалов конференций в компьютерных науках и других областях. Во-вторых, мы концентрируемся на задачах организаторов конференции. В-третьих, мы рассмотрим аспекты рецензирования и расскажем о работе группы CrossRef и DataCite. Затем мы расскажем об индексировании и распространении, а также представим несколько советов и рекомендаций для организаторов международных конференций, а также предостережём о феномене хищнических издателей и конференций.
Grant presents a case study of the 19th Century Pamphlets digitisation project, covering the decisions made in planning the project, the challenges encountered, and key lessons learned.
With approximately 1.x years of delay to the US, the term "Data Science" is also gaining speed in Europe. We see more and more job openings for- and business cards of data scientists, new events dedicated to the topic and an increased demand in related education literally every month. In response to this trend, Zurich University of Applied Sciences founded the ZHAW Data Science Laboratory (Datalab) last year.
This talk is to give an updated overview of Data Science in Europe by the example of the Datalab's activities in Switzerland. After a definition and classification of the field, a presentation of real technical projects sets the stage for what Data Science looks like here, offside of internet behemoths and big data clichés. Then, conclusions on the state of the art at least in Switzerland are drawn from evaluating the recent "1st Swiss Workshop on Data Science" event and ZHAW's professional education programme "DAS in Data Science".
With the help of the audience during the subsequent discussion, these results can eventually be extrapolated to the wider European community.
2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023)
The 2nd International Conference on Machine Learning, NLP and Data Mining (MLDA 2023) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Machine Learning, Natural Language Computing and Data Mining. Authors are solicited to contribute to the conference by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the following areas, but are not limited to these topics only.
A 4 hour hands on linked data workshop held at ELAG 2013 - http://elag2013.org/ws2-very-gentle-linked-data/. Resources at http://data.archiveshub.ac.uk/workshops/elag2013/
British Library Labs 21st Century Curatorship Talk
The document discusses the British Library Labs program and lessons learned. It provides an overview of how Labs works with stakeholders like researchers and developers. Labs runs competitions to fund projects that experiment with the Library's digital collections. Winners complete residencies to develop tools and services. Lessons include the need to filter large collections, engage curators, address metadata and system issues, and provide flexible access to support digital research.
Acquisition policy and business models of research libraries in a digital era
This document discusses how research libraries are adapting to the digital era. It notes that libraries have changed more in the last decade than the last century as they shift resources from print to digital. Libraries are expected to support teaching, research, and scholarly communication. The document recommends that natural history institutions go fully digital with their publications for increased visibility, accessibility, and cost savings. It encourages the use of open access models and infrastructure like the Directory of Open Access Journals to make publications more discoverable.
Training workshop on "Designing and conducting user studies"
Module 1 - Methods and Techniques (Kristien Ooms)
@ ICC&GIS
June 15th, 2016
Albena, Bulgaria
Innovation and project management at ETH Library
The document provides information about ETH Zurich Library and its efforts in innovation and project management. It discusses ETH Zurich as an institute of technology and science with over 18,500 students. It then describes ETH Library, which has main and special libraries containing over 7 million holdings. The library has undertaken various innovation initiatives like introducing an ideas management process, project management standards, and launching projects like refreshing the Knowledge Portal and developing the ETHorama tool to enhance access to electronic holdings. It also discusses piloting e-lending of e-books to external users, which started with 26,000 e-books and saw increasing uptake over time.
Part 1 of the printed publication "3D-ICONS Guidelines and Case Studies" First published in November 2014.
Public fascination with the architectural and archaeological heritage is well known; according to the UN World Tourism Organisation, it is proven to be one of the main reasons for tourism. Historic buildings and archaeological monuments form a significant component of Europe's cultural heritage; they are the physical testimonies of European history and of the different events that led to the creation of the European landscape as we know it today.
The documentation of built heritage increasingly avails of 3D scanning and other remote sensing technologies, which produce digital replicas in an accurate and fast way. Such digital models have a large range of uses, from the conservation and preservation of monuments to the communication of their cultural value to the public. They may also support in-depth analysis of their architectural and artistic features as well as allow the production of interpretive reconstructions of their past appearance.
The goal of the 3D-ICONS project, funded under the European Commission’s ICT Policy Support Programme which builds on the results of CARARE (www.carare.eu) and 3D-COFORM (www.3d-coform.eu), is to provide Europeana with 3D models of architectural and archaeological monuments of remarkable cultural importance. The project brings together 16 partners (see appendix 2) from across Europe (11 countries) with relevant expertise in 3D modelling and digitization. The main purpose of this project is to produce around 4000 accurate 3D models which have to be processed into a simplified form in order to be visualized on low end personal computers and on the web.
Similar to Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology
Slides of the paper Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts by Helmut Schmid at the 3rd Edition of the DATeCH2019 International Conference
This document discusses using text models to improve the accuracy of optical character recognition (OCR) on Chinese rare books. It conducted experiments using n-gram, backward/forward n-gram, and LSTM models on OCR data from ancient medicine books. The backward and forward 4-gram model achieved the highest correction rate at 97.57%. Mixing the LSTM 6-gram model with the OCR's top 5 candidates and the probability of the top candidate further improved accuracy to 97.71%, demonstrating that combining text models with OCR probabilities can better correct OCR errors than text models alone. In conclusion, text models are effective for increasing OCR accuracy on rare books, with the backward/forward 4-gram and LSTM 6-gram models performing best.
Slides of the paper Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project by Katrien Depuydt and Hennie Brugman at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Standoff Annotation for the Ancient Greek and Latin Dependency Treebank by Giuseppe Celano at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Using lexicography to characterise relations between species mentions in the biodiversity literature by Sandra Young at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability by Evagelos Varthis, Marios Poulos, Ilias Yarenis and Sozon Papavlasopoulos at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Curation Technologies for a Cultural Heritage Archive: Analysing and transforming a heterogeneous data set into an interactive curation workbench by Georg Rehm, Martin Lee, Julián Moreno Schneider and Peter Bourgonje at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Cross-disciplinary collaborations to enrich access to non-Western language material in the Cultural Heritage sector by Tom Derrick and Nora McGregor at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Tribunal Archives as Digital Research Facility (TRIADO): new ways to make archives accessible and useable by Anne Gorter, Edwin Klijn, Rutger Van Koert, Marielle Scherer and Ismee Tames at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Improving OCR of historical newspapers and journals published in Finland by Senka Drobac, Pekka Kauppinen and Krister Lindén at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Towards a generic unsupervised method for transcription of encoded manuscripts by Arnau Baró, Jialuo Chen, Alicia Fornés and Beáta Megyesi at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Towards the Extraction of Statistical Information from Digitised Numerical Tables - The Medical Officer of Health Reports Scoping Study by Christian Clausner, Apostolos Antonacopoulos, Christy Henshaw and Justin Hayes at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software by Kimmo Kettunen, Teemu Ruokolainen, Erno Liukkonen, Pierrick Tranouez, Daniel Antelme and Thierry Paquet at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper OCR-D: An end-to-end open-source OCR framework for historical documents by Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Kay-Michael Würzner, Matthias Boenig, Elisa Hermann and Volker Hartmann at the 3rd Edition of the DATeCH2019 International Conference
- The document describes a project to fill gaps in knowledge about diamond mining, trading, and polishing in Borneo by developing a workflow using various CLARIAH tools and resources.
- The workflow involved digitizing a diamond encyclopedia, extracting concepts and place names, linking the data to external sources to create linked open data, and querying newspaper archives to build a corpus of relevant articles.
- Promising results showed mining, trading, and polishing continued in Borneo for Southeast Asian customers, and described previously unknown diamond fields and polishing locations in Borneo. The project aims to apply the workflow to other commodities like sugar.
Slides of the paper Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii by Juri Opitz, Leo Born, Vivi Nastase and Yannick Pultar at the 3rd Edition of the DATeCH2019 International Conference
Slides of the paper Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification by Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner and Frank Puppe at the 3rd Edition of the DATeCH2019 International Conference
This document describes the SOS system for segmenting, stemming, and standardizing Arabic text. It presents the challenges of processing Arabic cultural heritage texts which contain orthographic variations. The system uses gradient boosting machines and achieves state-of-the-art performance on segmentation and derives stemming as a byproduct. It also standardizes orthography with high accuracy, which further improves segmentation. The system addresses issues like hamza forms and letter confusions that previous systems did not handle well.
What is an RPA CoE? Session 1 – CoE Vision
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
How to Interpret Trends in the Kalyan Rajdhani Mix Chart
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way that breaks data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is repaid by taking even bigger "loans", resulting in ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
HCL Notes and Domino licence cost reduction in the world of DLAU
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and the licences under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and licence fees. You may be wondering how this new kind of licensing works and what benefit it brings you. Above all, you surely want to stay within budget and save costs wherever possible. We understand that, and we want to help!
We explain how to solve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts in order to save money. There are also approaches that can lead to unnecessary expense, for example when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new licence model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder introduce you to this new world. It gives you the tools and know-how to keep an overview. You will be able to reduce your costs through an optimised Domino configuration and keep them low in the future.
These topics are covered:
- Reducing licence costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licences really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices to implement immediately
The Microsoft 365 Migration Tutorial for Beginners
This presentation will help you understand the power of Microsoft 365. We cover every productivity app included in Office 365. Additionally, we outline common Office 365 migration scenarios and how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
Main news related to the CCS TSI 2023 (2023/1695) – Jakub Marek
An English 🇬🇧 translation of the presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/how-axelera-ai-uses-digital-compute-in-memory-to-deliver-fast-and-energy-efficient-computer-vision-a-presentation-from-axelera-ai/
Bram Verhoef, Head of Machine Learning at Axelera AI, presents the “How Axelera AI Uses Digital Compute-in-memory to Deliver Fast and Energy-efficient Computer Vision” tutorial at the May 2024 Embedded Vision Summit.
As artificial intelligence inference transitions from cloud environments to edge locations, computer vision applications achieve heightened responsiveness, reliability and privacy. This migration, however, introduces the challenge of operating within the stringent confines of resource constraints typical at the edge, including small form factors, low energy budgets and diminished memory and computational capacities. Axelera AI addresses these challenges through an innovative approach of performing digital computations within memory itself. This technique facilitates the realization of high-performance, energy-efficient and cost-effective computer vision capabilities at the thin and thick edge, extending the frontier of what is achievable with current technologies.
In this presentation, Verhoef unveils his company’s pioneering chip technology and demonstrates its capacity to deliver exceptional frames-per-second performance across a range of standard computer vision networks typical of applications in security, surveillance and the industrial sector. This shows that advanced computer vision can be accessible and efficient, even at the very edge of our technological ecosystem.
Dandelion Hashtable: beyond a billion requests per second on a commodity server
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server with a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Building Production Ready Search Pipelines with Spark and Milvus
Spark is a widely used ETL tool for processing, indexing and ingesting data into a serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data, extract vector representations, and push the vectors to the Milvus vector database for search serving.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-Efficiency
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Digital Marketing Trends in 2024 | Guide for Staying Ahead
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
"Choosing the proper type of scaling", Olena Syrota
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk we will first analyze scaling approaches and then select the proper ones for our system.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
TrustArc Webinar - 2024 Global Privacy Survey
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Datech2014 - Session 3 - User-driven correction of OCR errors. Combining crowdsourcing and information retrieval technology
1. Universität Innsbruck
Christoph-Probst-Platz, Innrain 52
6020 Innsbruck
http://info.uibk.ac.at
User-driven correction of OCR errors.
Combining crowdsourcing and information retrieval technology
Günter Mühlberger, Johannes Zelger
David Sagmeister, Albert Greinöcker
Universität Innsbruck / Höhere Technische Bundeslehranstalt Anichstraße - Innsbruck
2. Agenda
Günter Mühlberger | Universitätsbibliothek Innsbruck | Abt. f. Digitalisierung & elektr. Archivierung | DATech 2014 - Madrid
• Introduction
• Crowdsourcing approaches for OCR correction
• Our approach
• Evaluation
• Future work
3. Introduction
4. Digitisation and OCR quality
• Digitisation of historical printed material
– Google: billions of files; libraries: millions of files
– Still hard to get access to these files
• OCR quality
– There is only little reliable data on the accuracy of OCR on large-scale datasets
– E.g. we do not know "how good the Google collection" is as a whole, or per language, per century, decade or year, per text type, etc.
• Tanner (2009)
– Evaluated OCR accuracy on British newspapers
– Differences per newspaper are stronger than per publishing date
– Overall we are speaking about 10% to 40% Word Error Rate, with an average of 22% WER for standard words and 31% for significant words
– Evaluation done within the IMPACT project has shown similar figures
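Word Error Rate, as used in the figures above, is the word-level edit distance between OCR output and ground truth, divided by the number of ground-truth words. A minimal sketch (an illustration, not the evaluation code used by Tanner or IMPACT):

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("neue" misread as "nelle") in 3 reference words ≈ 0.33 WER.
print(word_error_rate("die neue Zeitung", "die nelle Zeitung"))
```

A 22% average WER, in these terms, means roughly one in five words of the ground truth is missing or wrong in the OCR text.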
5. End-users and OCR quality
• What does this mean for the end-user?
– End-users are either searching a collection or reading an interesting item (which they may have found by searching).
– But for reading a page/book they have the original image – so the full-text is much less important to them
• If we take the figures from above:
– End-users will miss e.g. 20% or 30% of all occurrences of a search term that would be interesting to them, simply because the OCR is wrong.
• Maybe acceptable to occasional users, but surely not to humanities researchers or family historians: they want to get "all relevant occurrences"
– What is "relevant" is decided by the user; some may be interested only within a specific time period, or periodical, or collection of documents
– Note: not all words are frequent in all collections ("London" is seldom in a Tyrolean newspaper collection, whereas it is frequent in a British newspaper collection)
6. Crowdsourcing for OCR
7. Approaches
• OCR as an "ideal" field for crowdsourcing
– Simple to realize: provide a link between image and text and let the user correct it
• Three (and a half) main approaches
– reCAPTCHA
– Australian National Library (Newspaper Digitization Project)
– National Library of Finland (gamification)
– IBM: CONCERT (Collaborative Correction Platform)
8. reCAPTCHA
9. Australian National Library
10. Australian National Library
11. National Library of Finland: Digitalkoot
12. IBM CONCERT (COoperative eNgine for Correction of ExtRacted Text)
13. Conclusion
• OCR correction with the support of the crowd does work (but not always)!
• In the case of reCAPTCHA and Digitalkoot, users have no influence on what they correct (de-motivating)
– reCAPTCHA is successful due to the sheer size of interactions
• User-specific benefit is provided mainly by the approach of the Australian National Library
– User reads the text carefully when editing
– Finds corrected words immediately after submitting the correct text
– Can decide what to correct
• Power users vs. crowd users
– A very small segment of all users carries out the actual work
– Australia: top 6 users corrected about 25% of the texts
– Transcribe Bentham project: top 7 users produced 70% of all transcripts
14. Proposed approach
15. Searching AND correcting
• Let's combine searching and crowd-based correction!
• Provide users with a powerful instrument to correct exactly those words they are interested in (searching for)
• Relieve users from actually editing words; let them just approve or reject the results of the OCR engine
16. Search interface
17. Search interface: Features
• User has the chance to
– select the Edit Distance (ED): 0-2
– display already approved words
– search only within the index (without showing word snippets)
• In this way users can play around and
– influence the recall of the system
– see the index (which is very helpful to get an impression of the OCR errors)
– see what has already been done
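The ED 0-2 search can be pictured as a Levenshtein-distance filter over the index vocabulary. The sketch below is only an illustration of the idea; the actual system presumably relies on a search engine's fuzzy matching rather than a linear scan of the index:

```python
def levenshtein(a, b):
    """Iterative two-row edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_search(index, term, max_ed):
    """Return index tokens within the chosen edit distance of the search term."""
    return sorted(t for t in index if levenshtein(term, t) <= max_ed)

# Made-up index vocabulary for illustration.
index = {"neue", "nelle", "nette", "London", "Lontion"}
print(fuzzy_search(index, "nelle", 2))  # → ['nelle', 'nette', 'neue']
```

With ED 0 only exact index matches are returned; raising the ED pulls in likely OCR variants of the query, which is exactly how the user controls recall.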
18. Result page: Features
19. Günter Mühlberger | Universitätsbibliothek Innsbruck | Abt. f. Digitalisierung & elektr. ArchivierungDATech 2014 - Madrid
• Users see the word snippets matching their search
• Buttons
– Select all as "false" or "correct"
• Red: the word snippet does not represent the correct text
• Green: the word snippet represents the correct text (match between search term and word snippet)
– Deselect all
– Reverse selection
– Save
• Save
– Green word snippets: the text is either approved (if it is the same as in the OCR text) or the wrong OCR text is corrected to the search term
– Red word snippets: nothing is changed in the OCR text
Features
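The save rule described above can be sketched as a small decision function (the names are ours, not the authors'): green with matching OCR text means approve, green with differing text means correct, red means leave untouched.

```java
public class SaveRule {
    enum Action { APPROVE, CORRECT, LEAVE }

    // Decide what saving does for one word snippet.
    static Action onSave(boolean markedGreen, String ocrText, String searchTerm) {
        if (!markedGreen) return Action.LEAVE;                  // red: nothing changes
        if (ocrText.equals(searchTerm)) return Action.APPROVE;  // OCR text already correct
        return Action.CORRECT;                                  // replace OCR text by search term
    }

    public static void main(String[] args) {
        System.out.println(onSave(true, "Feuerwehr", "Feuerwehr"));  // prints APPROVE
        System.out.println(onSave(true, "Feuenvehr", "Feuerwehr"));  // prints CORRECT
        System.out.println(onSave(false, "Feuerwerk", "Feuerwehr")); // prints LEAVE
    }
}
```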
Result page (2)
• Result set (on the left-hand side)
– 150 word snippets are currently shown in the standard view
– Can be parameterized
– Currently ordered by file path (another criterion could be word confidence)
• Index (on the right-hand side)
– All index terms "behind" a fuzzy search are listed
– The number of occurrences in this result set is shown
– Users get an overview of which tokens are behind these snippets
– Users are able to decide quickly which tokens are "real" words
Additional features
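The index panel on the right-hand side can be derived from the fuzzy result set by counting how often each distinct token occurs among the retrieved snippets. A sketch of that aggregation (our own illustration, not the paper's code):

```java
import java.util.*;

public class IndexPanel {
    // Count occurrences of each distinct token in the result set;
    // a TreeMap keeps the index terms alphabetically sorted for display.
    static Map<String, Integer> termCounts(List<String> snippetTokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : snippetTokens) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("Feuerwehr", "Feuerwerk", "Feuerwehr", "Feuenvehr");
        // prints {Feuenvehr=1, Feuerwehr=2, Feuerwerk=1}
        System.out.println(termCounts(hits));
    }
}
```

Such a histogram is what lets the user decide at a glance which tokens are "real" words and which are OCR errors.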
• Improve precision
– Search with ED0
– All word snippets should display the search term
– Those which do not are classic OCR errors
– If snippets are selected they get the status "approved"
– Those which are errors are currently just deselected (and not marked as false)
• Approvals are directly written into the ALTO file
– Correction status: true, "approved"
Correction strategies (1)
Example 1: Search for "nelle"
OCR errors
[Word-snippet images showing OCR confusions between "neue" and "nelle"]
Select correct word images = green = approved
• Search for a word with ED1 or ED2
– The number of hits (and word snippets) increases significantly
– Sometimes more, sometimes less, depending very much on the search string and its length
• Strategy
– One may go through all word snippets and deselect wrong ones or select correct ones, but this takes some time and is boring
• But: due to ED2, many other correct words are included in the result set
• Therefore another correction strategy may be more interesting
Correction strategy (2): Improve recall
• Recommended method
– Go through all tokens representing "real words" which appear in the index on the right-hand side
– Clicking on a word in the index triggers an ED0 search
– In many cases ED0 searches retrieve good results with just a few OCR errors, so approval is very simple and fast
• Once the "real words" are done, only those word snippets with "real" OCR errors of the search term remain, which is our real objective to correct
Correction strategies (3)
Example: Search for "Feuerwehr", ED2
"Feuerwehr" (fire brigade)
Index terms retrieved: "Feuenvehr", "Fenerwehr", "Feuerwehr,", "Feuermeh", "Feuerwehr-", "Feuerweh,", "Feuerwehr.", "Feuerwerk", "Feuerweh", "Feuerwehren", "Feuerwehr-,", "Feuerwehr^", "Feuerweh?", "Feuerwehr", "Feuermehr", "Feueràhr", "Feuerwert", "Feuerweihe", "Fenerwchr"
• Examples of erroneous words in red
• These words are the "rest" which appears after the "real" words (green) have been approved
• They will finally be replaced by the correct word:
• In ALTO: correction status true; substitute: Feuerwehr
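In ALTO terms, the replacement could look like the following fragment. This is a hypothetical sketch: CONTENT and WC are standard ALTO attributes for a word and its confidence, but the correction-status and substitute markup is our illustration of what the slides describe, not the authors' exact schema.

```xml
<!-- Hypothetical illustration, not the authors' exact markup -->
<String CONTENT="Feuenvehr" WC="0.43">
  <!-- correction status: true; substitute: Feuerwehr -->
  <ALTERNATIVE>Feuerwehr</ALTERNATIVE>
</String>
```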
Validating "real" words from the index
• Snippets which were approved in the previous steps are hidden from the user.
– But users are able to see them if interested or if they want to do a final check
– Overwriting is possible; the status has to be changed
• Therefore the final correction screen now shows, instead of 324 word snippets for "Feuerwehr" ED2, only those which were not approved before.
Repeated search for "Feuerwehr", ED2
Finally, the "real" OCR errors are replaced by the correct word
• Test set
– From the Europeana Newspapers Project
– 16,000 pages from the Tessmann Library; several million are waiting to be indexed
– METS/ALTO files
• Standard technology
– Java, JavaScript (Ajax), Lucene
• Images are cropped on the fly
– "Hardest" task: takes some seconds on a 4-core machine
– The first batch of 150 snippets is done immediately; the second batch is preprocessed in the background
• A test set is available online
– http://dbis-faxe.uibk.ac.at/Website%202.0/CorrectionServlet
– Attention: not a stable link!
Implementation
• Our method provides the chance to improve precision and recall of search terms in a rather quick and straightforward way.
• Fuzzy search makes it possible to increase the recall of search terms significantly and to "correct" erroneous terms quickly
• No need to edit text – only typing a search term once and then clicking on the index terms for new searches
• Snowball effect: approved words are stored permanently and are reused in subsequent correction sessions as well
Conclusion
Evaluation
• Currently there is not enough data to provide good figures on the evaluation of the tool – implementation in a real-world scenario will be necessary
• But: Doan, A. et al. 2011. Crowdsourcing systems on the World-Wide Web. Communications of the ACM.
• Four main criteria for crowdsourcing projects:
(1) How to recruit and retain users?
(2) What contributions can users make?
(3) How to combine user contributions to solve the target problem?
(4) How to evaluate users and their contributions?
Evaluation
• Users are searching anyway!
• Those who are searching have a specific interest!
• Satisfaction will be higher if precision and especially recall are higher for noisy OCR text
→ motivation should be there
• Power users of the archive may be willing to contribute a good deal of their time to improve the full-text search
→ working power should be there
• Our tool piggybacks on the search interface – it can be integrated in a simple way (e.g. as an extra tab next to the search users perform anyway, so they may try out what is behind it)
• Searching the index provides useful insights to the user
→ learning curve (get to know your full-text archive!)
(1) How to recruit and retain users?
• Contributions of users:
– Improve precision
– Improve recall by correcting OCR errors of search terms
– All these words are significant and meaningful to a user
• Only a small portion of words is interesting!
– Text contains a lot of words which are not meaningful or are very seldom part of a search
– Austrian Newspapers Online: 50% of all full-text searches are for person names, 20% for geo-names, only a small portion for keywords
– This means that the corrections/approvals done by the user with our method are more valuable than corrections of running text
– The total number of corrected words may not be very high, but these should be significant and relevant words
(2) What contributions can users make?
• Storage of contributions
– All contributions are stored in two ways:
• The Lucene index is immediately updated so that the next search already benefits from approvals/corrections
• Approvals/corrections are directly stored in the OCR XML files (in this case ALTO): words are either marked as correction status true, "approved", or the new alternative of the word is included as well.
• Main benefit for the next user
– The next user will see which word snippets are already approved (shown in blue and gray) – in other words, the contributions are visible to everyone even though they are distributed among large amounts of text
– This should give users the feeling that someone has already worked in this field as well
(3) How to combine user contributions to solve the target problem?
• We have not tackled this field so far
• A strategy could be:
– Randomly select approved or corrected words and provide them to other users for review
– If specific users produce too many errors, a log file could be utilized to reset the correction status within the ALTO files
(4) How to evaluate users and their contributions?
Future work
• Improve the user interface
– Allow word snippets to also be marked as "false"
• Release as an Open Source package
– Will be done during 2014
– Java, Ajax, Lucene – only open-source components
• Implementation of the tool in a real-world scenario
• Include an edit distance that is more meaningful for OCR errors than the fuzzy search of Lucene
– E.g. an ED larger than 2, but based on typical OCR confusions (c-e, etc.)
• Use the data for machine learning
– For all word snippets, metadata such as title of the publication, size of the print, language, date of printing, etc. is available
– Use it to discriminate "hard" cases by asking users to go for specific sets (which are selected automatically)
Further work and improvements
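An OCR-aware edit distance could be a weighted Levenshtein in which typical OCR confusions cost less than an arbitrary substitution. The sketch below is our assumption of what such a measure could look like; the confusion set and the 0.25 weight are invented for illustration, not taken from the paper.

```java
import java.util.*;

public class OcrDistance {
    // Typical single-character OCR confusions (ordered pairs), cheap to substitute.
    static final Set<String> CONFUSIONS =
            new HashSet<>(Arrays.asList("ce", "ec", "un", "nu", "il", "li"));

    static double subCost(char x, char y) {
        if (x == y) return 0.0;
        return CONFUSIONS.contains("" + x + y) ? 0.25 : 1.0;
    }

    // Weighted Levenshtein: insertions/deletions cost 1, substitutions
    // cost 0.25 for known OCR confusions and 1 otherwise.
    static double distance(String a, String b) {
        double[] prev = new double[b.length() + 1];
        double[] curr = new double[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + subCost(a.charAt(i - 1), b.charAt(j - 1)));
            }
            double[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        // "Fenerwehr" (u misread as n) is one cheap confusion away.
        System.out.println(distance("Feuerwehr", "Fenerwehr")); // prints 0.25
    }
}
```

Under such a measure, errors like "Fenerwehr" rank much closer to the search term than unrelated real words at the same plain edit distance.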
Thank you for your attention!