Evaluation and post-correction of OCR of digitised historical newspapers

Presentation "Evaluation and post-correction of OCR of digitised historical newspapers" by Lotte Wilms, Koninklijke Bibliotheek, at IMPACT Annual Members' Meeting 2017. www.delpher.nl

Evaluation and post-correction of OCR of digitised historical newspapers
A research project
Lotte Wilms (KB) & Janneke van der Zwaan (NL eScience Center)
@lottewilms @jvdzwa

Delpher newspaper corpus
• Digitised Dutch newspapers
• 1618-1995
• Images + metadata + text
• Now: 11 million pages (in 1,351,123 issues)
• Forecast for 2020: 20 million pages
• Full text searchable on: www.delpher.nl

The data
type/format    level          comments
PDF            issue          Searchable text + scan
JPEG-2000      page           Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm)
                              Master: JPEG 2000 part 1, lossless compression, greyscale or colour
Dublin Core    iss./p./art.   Descriptive metadata
OCR            article        XML
ALTO           page
mpeg21-didl    issue          Structural metadata

Aims of project
• Insight into quality of our OCR
• Insight into automated methods of post-correction
• Reprocessing images
• Machine learning approach
• Other?

Output of the project
• Representative sample set of digitised newspapers, with ground truth
• Report on quality of OCR of Delpher’s digitised newspapers
• Report on post-correction possibilities of OCR using automatic techniques
• Impact analysis of most likely method of improvement
• Prototype for OCR post-correction and evaluation using deep learning

Sample set
• 2000 pages
• Representative of the whole collection, taking into account:
• Date of publication
• Date of production
• Software used
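One way to draw a sample that stays representative across the criteria above is proportional stratified sampling. The sketch below is only an illustration, not the project's actual procedure: the page records, the `stratified_sample` helper and the choice of (decade, software) as the stratum key are all assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(pages, key, total, seed=0):
    """Draw about `total` pages, giving each stratum a share proportional to its size."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for page in pages:
        strata[key(page)].append(page)
    sample = []
    for members in strata.values():
        # Proportional allocation, with at least one page per stratum.
        quota = max(1, round(total * len(members) / len(pages)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical records: (page id, publication decade, OCR software).
pages = (
    [(f"a{i}", 1890, "ABBYY 8.1") for i in range(600)]
    + [(f"b{i}", 1930, "ABBYY 9.0") for i in range(300)]
    + [(f"c{i}", 1970, "ABBYY 9.0") for i in range(100)]
)
sample = stratified_sample(pages, key=lambda p: (p[1], p[2]), total=20)
```

Because quotas are rounded per stratum, the sample size can differ slightly from `total` when the strata do not divide evenly.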

Database with production information
• Extracted from the metadata:
• Issue identifier
• Newspaper title
• Publication date
• Producer
• Production date
• Software used
• Note: some issues were processed twice, with ABBYY 8.1 and 9.0
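A database holding the fields listed above could look roughly like the following sketch, which uses Python's built-in `sqlite3` module. The table name, column names and example rows are hypothetical and are not taken from the project; the query shows how issues processed with more than one software version might be found.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE production (
        issue_id TEXT, title TEXT, publication_date TEXT,
        producer TEXT, production_date TEXT, software TEXT
    )
""")
# Hypothetical rows; real values would come from the issue metadata.
rows = [
    ("issue-001", "Example Courant", "1890-01-02", "Producer A", "2010-05-01", "ABBYY 8.1"),
    ("issue-001", "Example Courant", "1890-01-02", "Producer A", "2011-03-15", "ABBYY 9.0"),
    ("issue-002", "Example Courant", "1890-01-03", "Producer B", "2011-03-16", "ABBYY 9.0"),
]
conn.executemany("INSERT INTO production VALUES (?, ?, ?, ?, ?, ?)", rows)

# Issues that were processed with more than one software version.
reprocessed = conn.execute("""
    SELECT issue_id, COUNT(DISTINCT software)
    FROM production
    GROUP BY issue_id
    HAVING COUNT(DISTINCT software) > 1
""").fetchall()
```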

Post-correction methods
• Deep learning (by Janneke van der Zwaan from the Netherlands eScience Center)
• PICCL by Martin Reynaert (UvT & RU)
• https://github.com/LanguageMachines/PICCL
• Proprietary software from a startup?
• More?

Improve OCR using Deep Learning
• Character-level language models
• Long Short-Term Memory (LSTM)
→ Also applicable to OCR evaluation without ground truth (GT)!

Reprocessing images
• Service provider
• Access or master images?
• Access: JPEG 2000 lossy compression, colour (or greyscale in case of original from microfilm)
• Master: JPEG 2000 part 1, lossless compression, greyscale or colour
• Standard software or new solutions?

Evaluation
• Focused on word-based searching
• Bag of words?
• Use existing tools or create our own?
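A bag-of-words comparison, one of the options raised above, ignores word order and only asks which words a search could find. The sketch below is one possible reading of that idea, not an existing tool: `bag_of_words` and `bow_scores` are hypothetical helpers and the two strings are invented.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts, ignoring word order and punctuation."""
    return Counter(re.findall(r"\w+", text.lower()))

def bow_scores(ocr_text, gt_text):
    """Word-level precision and recall of the OCR text against the ground truth."""
    ocr, gt = bag_of_words(ocr_text), bag_of_words(gt_text)
    matched = sum((ocr & gt).values())  # multiset intersection
    precision = matched / max(1, sum(ocr.values()))
    recall = matched / max(1, sum(gt.values()))
    return precision, recall

gt = "de krant van vandaag brengt nieuws"
ocr = "de krnnt van vandaag brengt nieuws"
precision, recall = bow_scores(ocr, gt)
```

Recall is the more relevant figure for search: it estimates the share of ground-truth words a full-text query could still match in the OCR output.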

Impact analysis
• Most likely scenario
• Impact on the organisation
• Percentage of improvement on OCR
• Effort needed to implement method
• How to handle different versions of OCR

3. Crowdsourced corrections

8. Production information <OCRProcessing>

10. Ground truth

16. Any tips or questions?
