Presentation "Evaluation and post-correction of OCR of digitised historical newspapers" by Lotte Wilms, Koninklijke Bibliotheek, at IMPACT Annual Members' Meeting 2017. www.delpher.nl
Evaluation and post-correction of OCR of digitised historical newspapers
1. Evaluation and post-correction of OCR of digitised historical newspapers
A research project
Lotte Wilms (KB) & Janneke van der Zwaan (NL eScience Center)
@lottewilms @jvdzwa
2. Delpher newspaper corpus
• Digitised Dutch newspapers
• 1618-1995
• Images + metadata + text
• now: 11 million pages (in 1,351,123 issues)
• prognosis 2020: 20 million pages
• Full text searchable on: www.delpher.nl
4. The data

type/format  | level        | comments
PDF          | issue        | Searchable text + scan
JPEG-2000    | page         | Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm). Master: JPEG 2000 part 1, lossless compression, greyscale or colour
Dublin Core  | iss./p./art. | Descriptive metadata
OCR          | article      | XML
ALTO         | page         | XML
mpeg21-didl  | issue        | Structural metadata
5. Aims of project
• Insight into quality of our OCR
• Insight into automated methods of post-correction
  • Reprocessing images
  • Machine learning approach
  • Other?
6. Output of the project
• Representative sample set of digitised newspapers, with ground truth
• Report on quality of OCR of Delpher’s digitised newspapers
• Report on post-correction possibilities of OCR using automatic techniques
• Impact analysis of most likely method of improvement
• Prototype for OCR post-correction and evaluation using deep learning
7. Sample set
• 2000 pages
• Representative of the whole collection, taking into account:
  • Date of publication
  • Date of production
  • Software used
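The stratification criteria above can be sketched as a proportional stratified sampler. This is a minimal illustration, not the project's actual sampling procedure; the field names (`decade`, `software`) and the toy corpus are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(pages, total, keys, seed=42):
    """Draw roughly `total` pages, proportional to the size of each
    stratum defined by the metadata fields in `keys`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for page in pages:
        strata[tuple(page[k] for k in keys)].append(page)
    sample = []
    for members in strata.values():
        # At least one page per stratum, otherwise a proportional share.
        n = max(1, round(total * len(members) / len(pages)))
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample

# Toy corpus: decade of publication and OCR software define the strata.
pages = [{"decade": d, "software": s, "id": i}
         for i, (d, s) in enumerate(
             [(1890, "ABBYY 8.1")] * 80 + [(1950, "ABBYY 9.0")] * 20)]
sample = stratified_sample(pages, total=10, keys=("decade", "software"))
```

With an 80/20 split in the toy corpus, the 10-page sample keeps the same 8/2 proportions.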
9. Database with production information
• Extracted from the metadata:
  • Issue identifier
  • Newspaper title
  • Publication date
  • Producer
  • Production date
  • Software used
NB: some issues were processed twice, with both ABBYY 8.1 and 9.0
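Building such a database amounts to pulling a handful of fields out of each descriptive metadata record. A minimal sketch with Python's standard XML parser, assuming a Dublin Core record; the sample record and its values are hypothetical, and the real Delpher records are richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical Dublin Core fragment for one newspaper issue.
RECORD = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>ddd:010097934</dc:identifier>
  <dc:title>Algemeen Handelsblad</dc:title>
  <dc:date>1895-03-12</dc:date>
</record>
"""

DC = "{http://purl.org/dc/elements/1.1/}"

def extract_row(xml_text):
    """Pull the fields needed for the production database from one record."""
    root = ET.fromstring(xml_text)
    return {
        "issue_id": root.findtext(f"{DC}identifier"),
        "title": root.findtext(f"{DC}title"),
        "publication_date": root.findtext(f"{DC}date"),
    }

row = extract_row(RECORD)
```

Producer, production date, and software used would come from the technical/administrative metadata in the same way.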
11. Post-correction methods
• Deep learning (by Janneke van der Zwaan from the Netherlands eScience Center)
• PICCL by Martin Reynaert (UvT & RU)
  • https://github.com/LanguageMachines/PICCL
• Proprietary software from a startup?
• More?
12. Improve OCR using Deep Learning
• Character-level language models
• Long Short-Term Memory (LSTM)
Also applicable for OCR evaluation without ground truth (GT)!
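The idea behind using a character-level language model for evaluation without ground truth: text that the model finds implausible is likely mis-recognised. The slides propose an LSTM; as a self-contained stand-in, the same scoring idea with a character-bigram model (the training text and OCR strings below are made up):

```python
import math
from collections import Counter

def train_char_bigrams(corpus):
    """Character-bigram and unigram counts from clean text (a stand-in
    for the character-level LSTM language model in the slides)."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    return bigrams, unigrams

def avg_log_prob(text, bigrams, unigrams, vocab_size=128):
    """Average per-character log-probability with add-one smoothing;
    low scores flag implausible (likely mis-recognised) text."""
    total = 0.0
    for a, b in zip(text, text[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return total / max(1, len(text) - 1)

clean = "the quick brown fox jumps over the lazy dog " * 50
bi, uni = train_char_bigrams(clean)
good = avg_log_prob("the lazy fox", bi, uni)
bad = avg_log_prob("t1e l4zy f0x", bi, uni)  # typical OCR confusions
```

The garbled string scores markedly lower than the clean one, without ever consulting a ground-truth transcription.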
13. Reprocessing images
• Service provider
• Access or master images?
  • Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm)
  • Master: JPEG 2000 part 1, lossless compression, greyscale and colour
• Standard software or new solutions?
14. Evaluation
• Focused on word-based searching
• Bag of words?
• Use existing tools or create our own?
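A bag-of-words evaluation for the word-based search scenario could look like this minimal sketch: score the fraction of ground-truth word occurrences that survive in the OCR output, ignoring word order. The example strings are made up; this is one possible metric, not the project's chosen one.

```python
from collections import Counter

def bag_of_words_accuracy(ocr_text, gt_text):
    """Order-insensitive word accuracy: the fraction of ground-truth
    word occurrences also present in the OCR output. A word lost to an
    OCR error is a word a user can no longer find via full-text search."""
    ocr = Counter(ocr_text.lower().split())
    gt = Counter(gt_text.lower().split())
    matched = sum(min(n, ocr[w]) for w, n in gt.items())
    return matched / max(1, sum(gt.values()))

acc = bag_of_words_accuracy(
    "de krant van qisteren meldt",   # one OCR confusion: g -> q
    "de krant van gisteren meldt")
```

Here four of the five ground-truth words match, so the score is 0.8.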
15. Impact analysis
• Most likely scenario
• Impact on the organisation
• Percentage of improvement on OCR
• Effort needed to implement method
• How to handle different versions of OCR