This document presents methods for digitizing historical documents and assessing optical character recognition (OCR) quality. It proposes using machine learning to classify bounding boxes from OCR output as either text or noise, in order to estimate OCR quality without ground truth text. It also describes using active learning and features extracted from word images to identify fonts in documents from only a small number of labeled examples. The methods assess OCR quality by detecting non-text regions, and identify document fonts from word-image (bag-of-word) features while labeling fewer than 450 samples.
Predicting Optimal Parallelism for Data Analytics (Databricks)
A key benefit of serverless computing is that resources can be allocated on demand, but the quantity of resources to request, and allocate, for a job can profoundly impact its running time and cost. For a job that has not yet run, how can we provide users with an estimate of how the job’s performance changes with provisioned resources, so that users can make an informed choice upfront about cost-performance tradeoffs?
This talk will describe several related research efforts at Microsoft to address this question. We focus on optimizing the amount of computational resources that control a data analytics query’s achieved intra-parallelism. These use machine learning models on query characteristics to predict the run time or Performance Characteristic Curve (PCC) as a function of the maximum parallelism that the query will be allowed to exploit.
The AutoToken project uses models to predict the peak number of tokens (resource units) that is determined by the maximum parallelism that the recurring SCOPE job can ever exploit while running in Cosmos, an Exascale Big Data analytics platform at Microsoft. AutoToken_vNext, or TASQ, predicts the PCC as a function of the number of allocated tokens (limited parallelism). The AutoExecutor project uses models to predict the PCC for Apache Spark SQL queries as a function of the number of executors. The AutoDOP project uses models to predict the run time for SQL Server analytics queries, running on a single machine, as a function of their maximum allowed Degree Of Parallelism (DOP).
We will present our approaches and prediction results for these scenarios, discuss some common challenges that we handled, and outline some open research questions in this space.
Presentation of the paper PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text by Thorsten Vobl, Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter and Klaus Schulz in DATeCH 2014. #digidays
Deep learning is having a profound impact on AI applications. With the future of neural network-inspired computing in mind, re:Invent is hosting the first ever Deep Learning Summit. Designed for developers to learn about the latest in deep learning research and emerging trends, attendees will hear from industry thought leaders—members of the academic and venture capital communities—who will share their perspectives in 30-minute Lightning Talks.
The Summit will be held on Thursday, November 30th at the Venetian from 1-5pm.
The Deep Learning Revolution - Terrence Sejnowski, The Salk Institute for Biological Studies
Eye, Robot: Computer Vision and Autonomous Robotics - Aaron Ames & Pietro Perona, California Institute of Technology
Exploiting the Power of Language - Alexander Smola, Amazon Web Services
Reducing Supervision: Making More with Less - Martial Herbert, Carnegie Mellon University
Learning Where to Look in Video - Kristen Grauman, University of Texas
Look, Listen, Learn: The Intersection of Vision and Sound - Antonio Torralba, MIT
Investing in the Deep Learning Future - Matt Ocko, Data Collective Venture Capital
Semantic Retrieval and Automatic Annotation: Linear Transformations, Correlat... (Jonathon Hare)
Multimedia Content Access: Algorithms and Systems IV (SPIE Electronic Imaging 2010). January 2010.
http://eprints.soton.ac.uk/268496/
This paper proposes a new technique for auto-annotation and semantic retrieval based upon the idea of linearly mapping an image feature space to a keyword space. The new technique is compared to several related techniques, and a number of salient points about each of the techniques are discussed and contrasted. The paper also discusses how these techniques might actually scale to a real-world retrieval problem, and demonstrates this through a case study of a semantic retrieval technique being used on a real-world data-set (with a mix of annotated and unannotated images) from a picture library.
LACS System Analysis on Retrieval Models for the MediaEval 2014 Search a... (multimediaeval)
We describe the LACS submission to the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Our experiments investigate how different retrieval models interact with word stemming and stopword removal. On the development data, we segment the subtitle and Automatic Speech Recognition (ASR) transcripts into fixed-length time units and examine the effect of different retrieval models. We find that stemming provides consistent improvement; stopword removal is more sensitive to the retrieval model on the subtitles. These manipulations do not contribute to stable improvement on the ASR transcripts. Our experiments on the test data focus on the subtitles. The gap in performance between different retrieval models is much smaller than on the development data. We achieved 0.477 MAP on the test data.
http://ceur-ws.org/Vol-1263/mediaeval2014_submission_27.pdf
Measuring Search Engine Quality using Spark and Python (Sujit Pal)
Presented at PyData Amsterdam 2016. Describes the Rewinder tool, to compare search engine configuration performance between Microsoft FAST and Apache Solr for the ScienceDirect search backend migration.
Design and implementation of optical character recognition using template mat... (eSAT Journals)
Abstract
Optical character recognition (OCR) is an efficient way of converting a scanned image into machine-encoded text that can then be edited. A variety of methods have been implemented in the field of character recognition. This paper proposes optical character recognition using template matching, with templates covering a variety of fonts and sizes. In the proposed system, image pre-processing, feature extraction and classification algorithms are implemented so as to build a good character recognition technique for different scripts. Results of this approach are also discussed in the paper. The system is implemented in Matlab.
Keywords: OCR, Feature Extraction, Classification
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co... (Databricks)
Boosted by Apache Spark’s data processing engine, machine learning as a service (MLaaS) is now faster and more powerful. However, Spark MLlib is still developing and is limited in its data preprocessing algorithms. In this session, learn how Suning R&D’s MLaaS platform abstracted, standardized and implemented a very rich machine learning pipeline on top of Spark, from data pre-processing, supervised and unsupervised modeling, and performance evaluation, to model deployment. Their featured Spark extensions are: (1) a rich function set for data pre-processing, such as missing data treatment, many types of sampling, outlier detection, advanced binning, etc.; (2) time series analysis/modeling algorithms; (3) a domain-specific library for finance, such as a cost-sensitive decision tree for fraud detection; (4) a user-friendly drag-and-play codeless modeling canvas.
Introduction to R for Learning Analytics Researchers (Vitomir Kovanovic)
The slides from my 2hr tutorial organised at 2018 Learning Analytics Summer Institute (LASI) at Teachers College, Columbia University on June 11, 2018.
Roger Labahn (University of Rostock, DE): Handwritten Text Recognition. Key concepts
co:op-READ-Convention Marburg
Technology meets Scholarship, or how Handwritten Text Recognition will Revolutionize Access to Archival Collections.
With a special focus on biographical data in archives
Hessian State Archives Marburg Friedrichsplatz 15, D - 35037 Marburg
19-21 January 2016
The talk explores the following topics:
- What is search relevance and why is it important?
- Relevance scoring in Elasticsearch
- Manipulating relevance with Query DSL structure
- Pros and cons in using Machine Learning for improving search relevance
- Using Learning to Rank (aka Machine Learning for better relevance) in Elasticsearch
Apache Solr is a powerful search and analytics engine with features such as full-text search, faceting, joins and sorting, capable of handling large amounts of data across a large number of servers. However, with all that power and scalability comes complexity. Solr 6 supports a Parallel SQL feature which provides a simplified, well-known interface to your data in Solr, performs key operations such as sorts and shuffling inside Solr for massive speedups, provides best-practices-based query optimization and, by leveraging the scalability of SolrCloud and a clever implementation, allows you to throw massive amounts of computation power behind analytical queries.
In this talk, we will explore the why, what and how of Parallel SQL and its building block Streaming Expressions in Solr 6 with a hint of the exciting new developments around this feature.
The Power of Auto ML and How Does it Work (Ivo Andreev)
Automated ML is an approach to minimize the need for data science effort by enabling domain experts to build ML models without deep knowledge of algorithms, mathematics or programming. The mechanism works by allowing end users to simply provide data, and the system automatically does the rest by determining the approach to perform a particular ML task. At first this may sound discouraging to those aspiring to the “sexiest job of the 21st century”, the data scientists. However, Auto ML should be considered a democratization of ML rather than automatic data science.
In this session we will talk about how Auto ML works, how it is implemented by Microsoft, and how it could improve the productivity of even professional data scientists.
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt... (Sérgio Sacani)
Since volcanic activity was first discovered on Io from Voyager images in 1979, changes on Io’s surface have been monitored from both spacecraft and ground-based telescopes. Here, we present the highest spatial resolution images of Io ever obtained from a ground-based telescope. These images, acquired by the SHARK-VIS instrument on the Large Binocular Telescope, show evidence of a major resurfacing event on Io’s trailing hemisphere. When compared to the most recent spacecraft images, the SHARK-VIS images show that a plume deposit from a powerful eruption at Pillan Patera has covered part of the long-lived Pele plume deposit. Although this type of resurfacing event may be common on Io, few have been detected due to the rarity of spacecraft visits and the previously low spatial resolution available from Earth-based telescopes. The SHARK-VIS instrument ushers in a new era of high resolution imaging of Io’s surface using adaptive optics at visible wavelengths.
Seminar of U.V. Spectroscopy (SAMIR PANDA)
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V... (Wasswaderrick3)
In this book, we use conservation-of-energy techniques on a fluid element to derive the Modified Bernoulli equation of flow with viscous or friction effects. We derive the general equation of flow/velocity, and from this we derive the Poiseuille flow equation, the transition flow equation and the turbulent flow equation. In situations where there are no viscous effects, the equation reduces to the Bernoulli equation. From experimental results, we are able to include other terms in the Bernoulli equation. We also look at cases where pressure gradients exist. We use the Modified Bernoulli equation to derive equations of flow rate for pipes of different cross-sectional areas connected together. We also extend our energy-conservation techniques to a sphere falling in a viscous medium under the effect of gravity. We demonstrate Stokes' equation of terminal velocity and the turbulent flow equation. We look at a way of calculating the time taken for a body to fall in a viscous medium, and at the general equation of terminal velocity.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN (Sérgio Sacani)
The return of a sample of near-surface atmosphere from Mars would facilitate answers to several first-order science questions surrounding the formation and evolution of the planet. One of the important aspects of terrestrial planet formation in general is the role that primary atmospheres played in influencing the chemistry and structure of the planets and their antecedents. Studies of the martian atmosphere can be used to investigate the role of a primary atmosphere in its history. Atmosphere samples would also inform our understanding of the near-surface chemistry of the planet, and ultimately the prospects for life. High-precision isotopic analyses of constituent gases are needed to address these questions, requiring that the analyses are made on returned samples rather than in situ.
What are greenhouse gases and how many gases are there that affect the Earth? (moosaasad1975)
What are greenhouse gases, how do they affect the Earth and its environment, what is the future of the environment and the Earth, and how are weather and climate affected?
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini... (Scintica Instrumentation)
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows ultra-fast, high-resolution imaging of cellular processes over time and space, studied in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system’s unique features and user-friendly software enables researchers to probe fast dynamic biological processes such as immune cell tracking, cell-cell interaction as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allows for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancements of novel therapeutic strategies.
Nutraceutical market, scope and growth: Herbal drug technology (Lokesh Patil)
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Assessment of OCR quality and font identification in historical documents
2. What are historical documents?
– Correspondence
– Diaries
– Newspapers
– Government Documents
– Books
3. Digitizing historical documents
• Why?
– Historical records are in analog form
– Due to their fragility, most of them are not accessible
– Not searchable
• How to make them accessible?
– Digital text transcription
• Ways of digitization
– Hand transcribing each book: resource intensive
– OCR (optical character recognition): high error rate in text transcription
• Mass digitization projects
4. Early modern OCR project (eMOP)
• Goal
– Improve OCR accuracy for early modern texts
• 300k documents, 45M pages
– Open source OCR tools
• Challenges
– Early modern printing
• Irregular fonts
• Decorative page elements
– Document image problems
– Problems get severe
• Images are binarized
[Figure: example page highlighting pictures and decorative page elements]
5. Goals
[Diagram: Goal 1, automatic quality assessment: hOCR* goes in, a quality score comes out, and documents are split into good and bad; the good documents, with denoised hOCR and document images, feed Goal 2, active font identification, which produces font metadata (black font, roman, mixed).]
*hOCR output from Tesseract OCR.
6. Why do we want to assess OCR quality?
• Improve runtime
– Focus on documents with good OCR quality
– Send bad quality documents to a separate diagnostics pipeline
• How to measure OCR quality?
– A number of methods exist
• eMOP uses the Juxta score
• Measures similarity between the OCR output and the ground truth text
– But such scores need ground truth
• Not available for all documents
• We need an automated way to assess OCR quality
7. Our approach
• Post-process the OCR output
– Page segmentation results such as bounding box (BB) coordinates
– OCR word confidence
• Build ML models to remove noise
– Binary classification: classify each BB either as text or noise
• Quality_OCR ∝ 1 / (% noise BBs)
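A minimal sketch of this page-level quality measure, assuming per-BB text/noise predictions are already available; the function names and the epsilon guard are illustrative, not eMOP's code:

```python
# Turn per-bounding-box text/noise labels into a page-level quality estimate,
# following the slide's relation Quality_OCR ∝ 1 / (% noise BBs).

def noise_fraction(bb_labels):
    """bb_labels: iterable of 0 (noise) or 1 (text) predictions for one page."""
    labels = list(bb_labels)
    if not labels:
        return 1.0  # no boxes at all: treat the page as fully noisy
    return labels.count(0) / len(labels)

def quality_score(bb_labels, eps=1e-6):
    """Higher is better; inversely proportional to the noise fraction."""
    return 1.0 / (noise_fraction(bb_labels) + eps)

if __name__ == "__main__":
    page = [1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical predictions for 8 BBs
    print(noise_fraction(page), quality_score(page))
```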
10. Prefiltering
– Provides initial labels to be refined in later stages
– Rule-based classifier
• Uses BB properties and OCR word confidence
• Conjunction of rules
– Problems
• Many text BBs classified as noise
• Need a way to recover mis-classified text BBs
[Decision tree over BB height, width, area and OCR word confidence: Area of BB > 1st percentile? OCR word confidence in (0, 0.95)? Height/Width < 2? Boxes failing the tests are labelled non-text, the rest text.]
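A sketch of a conservative rule-based prefilter in this spirit. The thresholds mirror the ones named on the slide (1st-percentile area, confidence in (0, 0.95), height/width below 2), but the exact branch order of the original tree is hard to recover from the extracted text, so the conjunction below is an assumption:

```python
import numpy as np

def prefilter_labels(boxes):
    """boxes: list of dicts with 'w', 'h', 'conf' (OCR word confidence in [0, 1]).
    Returns initial labels: 1 = text, 0 = non-text."""
    areas = np.array([b["w"] * b["h"] for b in boxes], dtype=float)
    area_floor = np.percentile(areas, 1)        # tiny specks are treated as non-text
    labels = []
    for b, area in zip(boxes, areas):
        is_text = (
            area > area_floor
            and 0.0 < b["conf"] < 0.95          # confidence band named on the slide
            and b["h"] / max(b["w"], 1) < 2     # very tall, thin boxes are suspect
        )
        labels.append(1 if is_text else 0)
    return labels
```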
11. Column extraction
– Extract individual columns and then process each column
[Figure: bounding-box density profile between the leftmost and rightmost text BBs; troughs in the profile mark the column separators.]
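A sketch of column extraction as described in the talk: accumulate a bounding-box density profile along the x-axis and split at its troughs. The smoothing window and trough threshold below are illustrative choices, not values from the talk:

```python
import numpy as np

def split_columns(boxes, page_width, smooth=25):
    """boxes: list of (x0, x1) horizontal extents of text BBs.
    Returns x positions at which to split the page into columns."""
    profile = np.zeros(page_width, dtype=float)
    for x0, x1 in boxes:
        profile[int(x0):int(x1)] += 1.0           # BB coverage per pixel column
    kernel = np.ones(smooth) / smooth
    profile = np.convolve(profile, kernel, mode="same")
    occupied = np.nonzero(profile > 0)[0]
    if occupied.size == 0:
        return []
    left, right = occupied[0], occupied[-1]       # leftmost / rightmost text extent
    threshold = 0.05 * profile.max()
    in_trough, cuts = False, []
    for x in range(left, right):
        if profile[x] < threshold and not in_trough:
            in_trough, start = True, x
        elif profile[x] >= threshold and in_trough:
            in_trough = False
            cuts.append((start + x) // 2)         # cut in the middle of the trough
    return cuts
```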
12. Local iterative relabeling
– Refines the initial labels
– Based on BB properties and its neighborhood
– Applies an MLP classifier iteratively to refine the BB labels (text/noise)
Features used during local iterative relabeling:
– S: score from nearest neighbors; see eq. (1)
– C_OCR: OCR word confidence*
– H/W: height-to-width ratio of the BB*
– A: area of the BB*
– H_norm: normalized height, H_norm = (H − H_med) / H_IQR
– H_dist: horizontal distance from the middle of the page
– V_dist: vertical distance from the top margin
*available from the pre-filtering stage
Neighborhood score over the P nearest BBs within distance D_max (eq. 1):
S = (Σ_{k=1..P} w_k L_k) / (Σ_{k=1..P} w_k)
[Diagram: the BBs of a column and their initial labels from the pre-filtering stage are fed, together with the geometric features, to a multi-layer perceptron; if the new labels differ from the old labels, the process repeats until the labels stop changing. Labels: text or noise.]
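A sketch of the relabeling loop, assuming an already-trained MLP (for example scikit-learn's MLPClassifier) and a matrix of the geometric features listed above; the inverse-distance weighting and function names are assumptions for illustration:

```python
import numpy as np

def neighbour_score(i, centers, labels, P=8):
    """Weighted average of neighbouring labels: S = sum(w_k * L_k) / sum(w_k)."""
    d = np.linalg.norm(centers - centers[i], axis=1)
    d[i] = np.inf
    P = min(P, len(labels) - 1)
    if P <= 0:
        return float(labels[i])
    nn = np.argsort(d)[:P]
    w = 1.0 / (d[nn] + 1e-6)            # closer boxes weigh more (illustrative weighting)
    return float(np.sum(w * labels[nn]) / np.sum(w))

def iterative_relabel(mlp, geom_feats, centers, init_labels, max_iter=10):
    """geom_feats: (n, k) geometric features per BB; init_labels: 0/1 from prefiltering."""
    labels = np.asarray(init_labels, dtype=float)
    for _ in range(max_iter):
        S = np.array([neighbour_score(i, centers, labels) for i in range(len(labels))])
        X = np.hstack([S[:, None], geom_feats])   # [S, C_OCR, H/W, A, H_norm, H_dist, V_dist]
        new_labels = mlp.predict(X).astype(float)
        if np.array_equal(new_labels, labels):    # stop once labels no longer change
            break
        labels = new_labels
    return labels
```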
13. Final output
[Figure: an example page with the algorithm's predictions overlaid; bounding boxes are marked as text or noise, with numbered callouts and OCR confidence values shown in the legend.]
14. Results
• Label refinement: local iterative relabeling.
[Bar chart: precision, recall and F1 score (y-axis 0.85 to 1) for pre-filtering vs. after iterative relabeling]
• Dataset
– Binarized page images
– Images are selected to represent variety in the eMOP corpora
• Multi-page; single column; ink bleed-through; multiple skew; warping; printed margins
• Label creation
– Each BB returned by OCR is manually labelled as 0 (noise) or 1 (text)
– 72,366 BBs are labelled
[Plot: proportion of documents (0 to 100%) vs. number of iterations (1 to 5) until relabeling converges]
15. Quality assessment result
– % noise BBs = BB_noise
– Juxta score
• S_JW: similarity between the OCR output and the ground truth text
• eMOP uses juxta-cl* to generate S_JW
– 6,775 test documents with ground truth text
– Compare % noise BBs (BB_noise) with the Juxta score S_JW
[Scatter plot: BB_noise vs. S_JW, both on a 0 to 1 scale; correlation ρ = −0.7038]
*implementations from juxtacommons.org
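A sketch of this evaluation step: correlate the predicted noise fraction per document with its ground-truth-based Juxta similarity score. The slide does not say whether ρ is Pearson or Spearman, so both are computed here:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlate_with_juxta(noise_fracs, juxta_scores):
    """noise_fracs: predicted % noise BBs per document; juxta_scores: S_JW per document."""
    noise_fracs = np.asarray(noise_fracs, dtype=float)
    juxta_scores = np.asarray(juxta_scores, dtype=float)
    return {
        "pearson": pearsonr(noise_fracs, juxta_scores)[0],
        "spearman": spearmanr(noise_fracs, juxta_scores)[0],
    }
```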
17. Recap
[Diagram, as on slide 5: Goal 1, automatic quality assessment: hOCR* in, quality score out, documents split into good (yes) and bad (no); the denoised hOCR and document images feed Goal 2, active font identification, which produces font metadata (black font, roman, mixed).]
18. Why do we need font identification?
• Improve OCR quality
– The eMOP collections have documents in multiple fonts
– OCR systems work best when knowledge of the font is available
– We don't have a font database for the eMOP collections
• How?
– Manual tagging
• A human can label/tag each document
– Automatic tagging
• Machine learning models that can recognize fonts
– But font identification for eMOP
• Needs labeled data (a training set)
• Getting labeled data from millions of page images is expensive
– We need an efficient way to train supervised ML models
19. Our approach
• Active learning
– Allow the ML algorithm to acquire its own training data
– Select the most informative examples for labelling
– Build ML models using as little labeled data as possible
20. Active learning
• A learning paradigm
– Train a classifier using labelled data
– Sample the most informative instances: active sampling
– Ask for labels from a human
[Diagram: a small labelled set L trains the ML algorithm; the algorithm queries the most informative instances {X, ?} from the unlabelled pool U, receives {X, label} from the human, and adds them to L.]
21. Active learning for font identification
[Pipeline: TIFF page images are OCRed to hOCR; features are extracted from the TIFF and hOCR; a font classifier is trained (training); the most informative samples are selected and tagged (active sampling).]
22. Font classes and characteristics
• Blackletter: examples and characteristics — alternating thick and thin strokes, angled strokes
• Roman: examples and characteristics — horizontal serifs, similar vertical stroke width
23. Feature extraction
[Pipeline: denoised hOCR → preprocess word images → extract features from word images: mean and IQR character widths, slant line density, Zernike moments.]
24. Preprocess word image
• Normalize the height of word images
– Resize each word image to have the same height
• Remove salt-and-pepper noise
• Correct skew
– Calculate a time frequency distribution for different skew angles
– Pick the skew angle at which the distribution shows a peak
[Figures (a) and (b): example word images before and after preprocessing]
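A sketch of these preprocessing steps on a grayscale word image: height normalization, a median filter for salt-and-pepper noise, and deskewing by searching candidate angles. The projection-profile variance used here as the peak criterion is my stand-in for the slide's "distribution shows a peak" description, and the angle range and step are illustrative:

```python
import cv2
import numpy as np

def preprocess_word(img, target_h=64):
    """img: grayscale (uint8) word image."""
    # 1. normalise height, keeping the aspect ratio
    scale = target_h / img.shape[0]
    img = cv2.resize(img, (max(1, int(img.shape[1] * scale)), target_h))
    # 2. remove salt-and-pepper noise
    img = cv2.medianBlur(img, 3)
    # 3. deskew: try candidate angles, keep the one with the sharpest row profile
    best_angle, best_score = 0.0, -1.0
    h, w = img.shape[:2]
    for angle in np.arange(-15, 15.5, 0.5):
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        rot = cv2.warpAffine(img, M, (w, h), borderValue=255)
        profile = (255 - rot).sum(axis=1)          # ink per row
        score = profile.var()
        if score > best_score:
            best_angle, best_score = angle, score
    M = cv2.getRotationMatrix2D((w / 2, h / 2), best_angle, 1.0)
    return cv2.warpAffine(img, M, (w, h), borderValue=255)
```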
25. Feature extraction (recap of the pipeline): denoised hOCR → preprocess word images → extract features: mean and IQR character widths, slant line density, Zernike moments.
26. Mean and IQR character width
• Roman fonts have smaller vertical stroke width than Blackletter
– Mean character width
• Blackletter fonts have drastic differences in the stroke widths
– IQR character width
• How to capture these characteristics?
[Figure: a horizontal band around the middle of the word image (Mid − 20 to Mid + 20) used for measurement]
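A sketch of these stroke-width features: measure horizontal ink run lengths in a band around the vertical middle of the word image (the figure's Mid ± 20 band), then take the mean and inter-quartile range. The binarisation threshold and band size are illustrative assumptions:

```python
import numpy as np

def stroke_width_features(gray, band=20, ink_threshold=128):
    """gray: grayscale word image (2D uint8 array). Returns (mean width, IQR width)."""
    mid = gray.shape[0] // 2
    rows = gray[max(0, mid - band):mid + band]
    runs = []
    for row in rows:
        run = 0
        for px in row:
            if px < ink_threshold:                 # dark pixel: part of a stroke
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    if not runs:
        return 0.0, 0.0
    runs = np.array(runs)
    q75, q25 = np.percentile(runs, [75, 25])
    return float(runs.mean()), float(q75 - q25)
```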
27. Feature extraction (recap of the pipeline): denoised hOCR → preprocess word images → extract features: mean and IQR character widths, slant line density, Zernike moments.
28. Slant line density
• Blackletter fonts are characterized by angled lines and serifs
– Capture the amount of angled straight lines in a word image
– Density of angled lines per character
• How?
– Hough transform
– Number of lines with slope between 45° ± 5° and −45° ± 5°
– Divide by the number of characters (from the hOCR)
[Pipeline: word image → edge detection → edge image → Hough transform]
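A sketch of the slant-line-density feature using OpenCV: edge-detect the word image, run a probabilistic Hough transform, count lines whose slope is within 5° of ±45°, and divide by the character count from the hOCR. The Canny and Hough parameters here are illustrative:

```python
import cv2
import numpy as np

def slant_line_density(gray, n_chars):
    """gray: grayscale word image; n_chars: number of characters reported in the hOCR."""
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=15,
                            minLineLength=8, maxLineGap=2)
    if lines is None or n_chars == 0:
        return 0.0
    slanted = 0
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
        if 40 <= angle <= 50 or 130 <= angle <= 140:   # within 5 degrees of +/-45
            slanted += 1
    return slanted / n_chars
```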
29. Feature extraction (recap of the pipeline): denoised hOCR → preprocess word images → extract features: mean and IQR character widths, slant line density, Zernike moments.
30. Zernike Moments
• Zernike Moments (ZMs) are shape descriptors
– Used to capture the visual appearance of the text (words)
• 6 ZMs along with their transformations
– A total of 15 features, similar to the ones used for a tumor classification problem by Tahmasbi et al. (2011).
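A sketch of extracting Zernike-moment descriptors from a binarised word image using the mahotas library. The radius, degree and binarisation threshold are illustrative; the talk's specific 6 moments and their transformations (15 features) are not reproduced here:

```python
import mahotas
import numpy as np

def zernike_features(gray, ink_threshold=128, degree=8):
    """gray: grayscale word image; returns an array of Zernike moments."""
    binary = (gray < ink_threshold)                # ink pixels as foreground
    radius = max(binary.shape) // 2                # radius covering the word image
    return mahotas.features.zernike_moments(binary, radius, degree=degree)
```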
33. Classifier
• Label propagation
– Graph-based semi-supervised classifier
– Uses labeled and unlabeled data to form a graph structure
– Labeled data act like sources that transmit labels to unlabeled data according to similarity w_ij
[Diagram: a graph with two labeled examples and several unlabeled examples, connected by edges weighted w_ij]
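A sketch of such a graph-based semi-supervised classifier using scikit-learn's LabelPropagation, where unlabeled points are marked with -1; the kernel choice, gamma value and toy data are illustrative:

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def fit_label_propagation(X, y_partial):
    """X: (n, d) feature matrix; y_partial: labels with -1 for unlabelled rows."""
    model = LabelPropagation(kernel="rbf", gamma=20)
    model.fit(X, y_partial)
    return model

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
    y = -np.ones(40, dtype=int)
    y[0], y[20] = 0, 1                     # two labelled examples, as in the slide's diagram
    model = fit_label_propagation(X, y)
    print(model.transduction_)             # propagated labels for all points
```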
34. Recap
[Pipeline: TIFF → OCR → hOCR; feature extraction; train font classifier; select samples (active sampling).]
35. Active sampling
[Diagram of the feature space with the classifier decision boundary between class 1 and class 2, illustrating uncertainty-based sampling (HS), dissimilarity-based sampling (DS), and diversity (D′).]
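A sketch of one active-sampling round combining the three criteria named on the slide: uncertainty near the decision boundary (HS), dissimilarity from the already-labelled pool (DS), and diversity within the selected batch (D′). The scoring weights and greedy batch construction are assumptions, not the talk's exact formulation:

```python
import numpy as np
from scipy.spatial.distance import cdist

def select_batch(probs, X_unlabelled, X_labelled, batch_size=20, alpha=0.5):
    """probs: (n, n_classes) class probabilities for the unlabelled pool."""
    uncertainty = 1.0 - probs.max(axis=1)                        # HS: low-confidence points
    dissimilarity = cdist(X_unlabelled, X_labelled).min(axis=1)  # DS: far from labelled data
    dissimilarity = dissimilarity / (dissimilarity.max() + 1e-9)
    score = alpha * uncertainty + (1 - alpha) * dissimilarity
    chosen = []
    for _ in range(min(batch_size, len(score))):
        i = int(np.argmax(score))
        chosen.append(i)
        score[i] = -np.inf
        # D': down-weight points close to what is already in the batch
        d_to_batch = cdist(X_unlabelled, X_unlabelled[[i]]).ravel()
        score = score - 0.1 * np.exp(-d_to_batch)
    return chosen
```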
36. Recap
[Pipeline: TIFF → OCR → hOCR; feature extraction; select samples; tag samples; train font classifier.]
37. Results
• Dataset
– 3272 documents from ECCO and EEBO collections
– eMOP experts labeled documents
• 1005 Black documents
• 1768 Roman documents
• 498 Mixed documents – text printed in both fonts
38. Experiment 1
• Quality of extracted features: word level
[Setup: 500 Roman and 500 blackletter word images → feature extraction → classifier → predicted label (blackletter word vs. roman word). Feature sets compared: ALL; Zernike Moments (ZMs) only; mean and IQR CW plus SLD.]
39. Result 1
Cross-validation F1 score by feature set:
– ALL: 0.8433
– Only ZMs: 0.805
– CW (mean & IQR) and SLD: 0.6717
42. Experiment 3
• Performance of the active learning model
[Protocol: start with 3 labeled examples in L; train the ML model; select the most informative instances {X, ?} from the unlabeled pool U; query labels for 20 instances; add {X, label} to L; repeat 20 times, storing validation accuracy on the test set T each round.]
45. Future work
• Automatic assessment of OCR quality
– Linguistic features can be explored
– The denoised hOCR can be used to detect unknown kinds of noise
• Bleed-through, irregular fonts, speckle noise, etc.
• Are there any other types of page problems in the eMOP collections?
• Active learning based font identification
[Diagram: the active-learning pipeline (TIFF → OCR → hOCR; feature extraction; select samples; tag samples; update the font classifier), instantiated per problem type: font, bleed-through, musical scripts, pictures.]
46. Conclusion
• Summary
– Automatic assessment of OCR quality
• Non-text OCR outputs suffice to
– identify text and noise in a document image
– estimate the document's overall quality
– improve OCR transcription performance when image-processing-based preprocessing is prohibitive
– Active learning based font identification
• Word image features capture the font characteristics
• Bag-of-word features show good class separability
• A robust font classifier is trained using just 443 labeled instances
Good afternoon everyone! I am Anshul Gupta, and today I am going to present my work on automatic assessment of OCR quality in historical documents.
So... what are historical documents? Anything that can give us information about a certain event in the past or about a particular period.
Some examples of historical documents are correspondence, diaries, newspapers, government documents, and books.
In this presentation we will focus on old printed books, along with algorithms to improve their digitization quality.
Since these documents are in printed form, they degrade with time. Due to their fragility, most of these documents are not accessible.
So, basically, digitization helps to preserve these documents and also makes them searchable.
How can we make them searchable?
We can get digital text for these documents and then plug this text into a search engine. Suddenly we can make billions of historical documents searchable. But the challenge here is: how can we get this digital text transcription?
One naive way is to hand-transcribe each document; given billions of documents, that is definitely not a feasible option.
We need to use an automated method, that is, optical character recognition. These systems are great but generate high-error text output. Hence, we need to customize these systems for historical documents.
Some of the successful mass digitization projects are from the Library of Congress, Google Books, ProQuest, Gale, and the Early Modern OCR Project.
The eMOP project is still in progress, and the work that I am presenting today is a part of eMOP.
So, let's see what eMOP is all about.
Explain the font identification… introduce
The two goals of eMOP are:
To improve OCR of early texts, that is, texts printed between the 14th century and the 18th century.
The second goal is to produce open source tools such as font databases, crowdsourcing correction tools, and post-processing tools.
The database of images contains 45 million page images, and these images have a variety of problems.
So, the first set of problems arises due to early modern printing. At that time the printing process was not formalized. They used very odd fonts, as shown in the zoomed picture. This is called blackletter font, and these fonts vary from image to image.
As shown in the highlighted region, these are decorative elements, and when we OCR this page, the OCR system sometimes recognizes these elements as valid text.
Other issues are related to the degradation of these documents.
So when the images of these documents are generated, we get issues such as faded fonts, black patches due to torn pages, and multiple skews on the same page image.
All these issues get more severe because all the images that we have are low-quality binarized images.
Hence, when we OCR these documents we get lots of junk text.
But the fact is that not all the documents are of such bad quality. So, the challenge here is: can we separate the good quality documents from the bad ones?
Talk about binarized images, so you need not talk about it later.
Talk about two main goals
First part is…
Animate it
Increase the fonts
So, our approach to measuring OCR quality is to post-process OCR outputs such as the OCR bounding box coordinates and the OCR word confidence.
We basically pose this problem of measuring quality as a binary classification problem where we want to classify each bounding box either as text or noise.
Once we have labels (noise or text) for all bounding boxes, we can get the OCR quality from the percentage of predicted noise BBs.
Also, our approach does not depend on the text written on the page image.
Here, when we passed this document image through the OCR pipeline, we got these green bounding boxes as output.
When we pass this OCR output to our algorithm, we are just passing these rectangles. Hence, this makes our algorithm language-agnostic.
So, here is the block diagram showing the steps of the assessment algorithm.
In step 1, the algorithm generates an initial set of labels. Then it divides the page image into its constituent columns. It then locally refines the bounding box labels.
As I mentioned, prefiltering generates the initial set of labels. It uses a rule-based classifier, as represented by this tree.
Since we designed our algorithm to be conservative in predicting text, it loses many text BBs at this point. Hence, we need to recover these text BBs.
In order to extract contextual information, the page image is divided into its constituent columns. For this, we first generate a bounding box density profile along the x-axis. The troughs in this profile represent the column separators.
In this step we process each column separately.
So, the idea behind this step is that in a book, a word is usually surrounded by more words. Hence, we embedded this local information into our algorithm by constructing a local feature. With this local feature and other geometric features, we trained a multi-layer perceptron. We then used this trained MLP model to iteratively refine the BB labels.
The process goes like this: we get the bounding boxes for a column, and for each bounding box we calculate the local feature as a weighted average of the labels of the neighbouring boxes. We then pass this local feature and the other features to the MLP model, which outputs new labels. If the new labels are not equal to the old labels, we use the new labels to recalculate the local feature, and the process is repeated until the labels stop changing.
So the final output looks like this. Here the predicted noise is in red and the predicted text is in green. We can see that the algorithm has done a good job of predicting non-text as noise; for example, here a picture is classified as noise. Also, it has found noise even when it is buried among text bounding boxes.
Now let's see how well the proposed algorithm works.
To evaluate the algorithm, we selected a set of images that represents the variety in the eMOP database.
Then we hand labelled around 72,000 bounding boxes.
So now let's look at how well local iterative relabeling works.
In this plot, the blue bars are the prefiltering results and the red bars are the results after local iterative relabeling. We can see that both precision and recall have improved after local refinement. This means that local refinement makes the algorithm more precise in predicting what is text and also recovers text lost in the prefiltering step.
Another important aspect of the local refinement is its convergence rate. For that, we plotted the proportion of documents for which local refinement converged within a certain number of iterations.
We can see that for almost all documents the local refinement step converged within 4 iterations.
So the classification problem that I presented here is basically a filtering problem: we are trying to filter out noise. Hence, it makes sense to see how good the filtering quality is.
For this, we selected around 6700 documents to generate our test set. For all these documents we had ground truth text transcription.
Also, whenever ground truth text is available, eMOP compares the OCR text output for that document image with its ground truth.
A similarity score S_JW is generated by eMOP.
So, for each of these test documents, we calculated this similarity score before and after applying our algorithm. Then we calculated the change in similarity. We plotted that along the y-axis versus the noisiness present on a document image.
We can see that the filtering has a large effect on very noisy documents and a small effect on good quality ones.
Also, for 85% of these 6,700 documents, our algorithm gave a positive change.
Correct the lp diagram
Feature works
Features
Get results for just zms, cw+sl+iqr; combine
So, to summarize
This work shows that OCR outputs such as BB coordinates and word confidence can be used to identify text and noise, can also be used to measure a document's overall quality, and, wherever preprocessing-based filtering is prohibitive, this algorithm can be used in the post-processing stage to remove noise.
Currently, I am working on building a diagnostic pipeline using active learning.
Also, adding linguistic features such as character n-grams can give us cues about certain kinds of noise.
Thank you all for your attention.
And now I am open to questions.