The document summarizes a post-correction system for historical optical character recognition (OCR) documents, developed within the IMPACT project, which is supported by the European Community under the FP7 ICT Work Programme and coordinated by the National Library of the Netherlands. The system enables efficient post-correction of OCR errors through a customizable user interface that displays OCR text, image snippets, and correction candidates. Underlying language models help identify historical spelling variants and distinguish them from OCR errors.
Slides of the second paper on the ULiS project, available at http://maxime-lefrancois.info/Publications
We are interested in bridging the world of natural language and the world of the semantic web, in particular to support multilingual access to the web of data. In this paper we introduce the ULiS project, which aims at designing a pivot-based NLP technique called the Universal Linguistic System, built entirely on semantic web formalisms and compliant with the Meaning-Text theory. Through the ULiS, a user could interact with an interlingual knowledge base (IKB) in controlled natural language. Linguistic resources are themselves part of a specific IKB, the Universal Lexical Knowledge base (ULK), so that actors may enhance their controlled natural language through requests in controlled natural language. We describe a basic interaction scenario at the system level and provide an overview of the architecture of the ULiS. We then introduce the core of the ULiS: the interlingual lexical ontology (ILexicOn), in which each interlingual lexical unit class (ILUc) supports the projection of its semantic decomposition on itself. We validate our model with a standalone ILexicOn, and introduce and explain a concise human-readable notation for it.
A plethora of programming languages have been and continue to be developed to keep pace with hardware advancements and the ever more demanding requirements of software development. As these increasingly sophisticated languages need to be well understood by both programmers and implementors, precise specifications are increasingly required. Moreover, the safety of programs written in these languages, and their adequacy with respect to requirements, need to be tested, analyzed and, if possible, proved. This dissertation proposes a rigorous, rewriting-based approach to defining programming languages, which makes it easy to design and test language extensions, and to specify and analyze the safety and adequacy of program executions.
To this aim, this dissertation describes the K Framework, an executable semantic framework inspired by rewriting logic but specialized and optimized for programming languages.
The K Framework consists of three components: (1) a language definitional technique; (2) a specialized notation; and (3) a resource-sharing concurrent rewriting semantics. The language definitional technique is a rewriting technique built upon the lessons learned from capturing and studying existing operational semantics frameworks within rewriting logic, and upon attempts to combine their strengths while avoiding their limitations. The specialized notation makes the technical details of the technique transparent to the language designer, and enhances modularity, by allowing the designer to specify the minimal context needed for a semantic rule. Finally, the resource-sharing concurrent semantics relies on the particular form of the semantic rules to enhance concurrency, by allowing overlapping rule instances (e.g., two threads writing in different locations in the store, which overlap on the store entity) to apply concurrently as long as they only overlap on the parts they do not change.
The main contributions of the dissertation are:
(1) a uniform recasting of the major existing operational semantics techniques within rewriting logic;
(2) an overview description of the K Framework and how it can be used to define, extend and analyze programming languages;
(3) a semantics for K concurrent rewriting obtained through an embedding in graph rewriting; and
(4) a description of the K-Maude tool, a tool for defining programming languages using the K technique on top of the Maude rewriting language.
Project number: 224348
Project acronym: AEGIS
Project title: Open Accessibility Everywhere: Groundwork, Infrastructure, Standards
Starting date: 1 September 2008
Duration: 48 Months
AEGIS is an Integrated Project (IP) within the ICT programme of FP7
Development, distribution and use of open source software comprise a market of data (source code, bug reports, documentation, number of downloads, etc.) from projects, developers and users. This large amount of data makes it difficult for the people involved to make sense of implicit links between software projects, e.g., dependencies, patterns, licenses. This context raises the question of what techniques and mechanisms can be used to help users and developers link related pieces of information across software projects. In this paper, we propose a framework for a marketplace enhanced using linked open data (LOD) technology for linking software artifacts within projects as well as across software projects. The marketplace provides the infrastructure for collecting and aggregating software engineering data as well as for developing services for mining, statistics, analytics and visualization of software data. Based on cross-linking software artifacts and projects, the marketplace enables developers and users to understand the individual value of components and their relationship to bigger software systems. Improved understanding creates new business opportunities for software companies: users will be better able to analyze and compare projects, developers can increase the visibility of their products, and hosts may offer plug-ins and services over the data to paying customers.
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U... - IJCI JOURNAL
Recent advancements in the field of natural language processing have markedly enhanced the capability of machines to comprehend human language. However, as language models progress, they require continuous architectural enhancements and different approaches to text processing. One significant challenge stems from the rich diversity of languages, each characterized by its distinctive grammar, resulting in decreased accuracy of language models for specific languages, especially for low-resource languages. This limitation is exacerbated by the reliance of existing NLP models on rigid tokenization methods, rendering them susceptible to issues with previously unseen or infrequent words. Additionally, models based on word and subword tokenization are vulnerable to minor typographical errors, whether they occur naturally or result from adversarial misspellings. To address these challenges, this paper utilizes a recently proposed tokenization-free method, CANINE, to enhance the comprehension of natural language. Specifically, we employ this method to develop a tokenization-free Arabic language model. In this research, we evaluate our model's performance across a range of eight tasks using the Arabic Language Understanding Evaluation (ALUE) benchmark. Furthermore, we conduct a comparative analysis, pitting our tokenization-free model against existing Arabic language models that rely on sub-word tokenization. By making our pre-training and fine-tuning models accessible to the Arabic NLP community, we aim to facilitate the replication of our experiments and contribute to the advancement of Arabic language processing capabilities. To further support reproducibility and open-source collaboration, the complete source code and model checkpoints will be made publicly available on our Hugging Face page.
In conclusion, the results of our study demonstrate that the tokenization-free approach exhibits performance comparable to established Arabic language models that utilize sub-word tokenization techniques. Notably, in certain tasks, our model surpasses the performance of some of these existing models. This evidence underscores the efficacy of tokenization-free processing for the Arabic language, particularly in specific linguistic contexts.
Learning Usage of English KWICly with WebLEAP/DSR - Takashi Yamanoue
WebLEAP (Web Language Evaluation Assistant Program) is a system that helps us with writing in English. It informs us about the popularity of expressions by displaying the frequencies of subsequences of words that are included in the given sentences or expressions. It collects these data from the Internet by calling a Web search engine. We have reported its system organization and basic features in our previous papers, including at the ICITA2002 conference.
In this paper, we first summarize our motivations and the basic features of WebLEAP. Then we describe some of the new features of WebLEAP together with its new interface. The most significant features are "KWIC (Key Word in Context)" and "domain specification." We can see how expressions are actually used in context with the KWIC feature. We can specify the search domain, such as "uk," "cn," "jp," and "us," so that we can compare usage between different countries. Finally, we demonstrate its usefulness by giving some examples.
178 - A replicated study on duplicate detection: Using Apache Lucene to searc... - ESEM 2014
Context: Duplicate detection is a fundamental part of issue management. Systems able to predict whether a new defect report will be closed as a duplicate may decrease costs by limiting rework and collecting related pieces of information. Goal: Our work explores using Apache Lucene for large-scale duplicate detection based on textual content. Also, we evaluate the previous claim that results are improved if the title is weighted as more important than the description. Method: We conduct a conceptual replication of a well-cited study conducted at Sony Ericsson, using Lucene for searching in the public Android defect repository. In line with the original study, we explore how varying the weighting of the title and the description affects the accuracy. Results: We show that Lucene obtains the best results when the defect report title is weighted three times higher than the description, a bigger difference than has been previously acknowledged. Conclusions: Our work shows the potential of using Lucene as a scalable solution for duplicate detection.
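The 3:1 title-to-description weighting can be illustrated outside Lucene (in Lucene itself it would typically be expressed as a query-time field boost such as `title:...^3`). The following sketch is purely illustrative: it uses a plain bag-of-words cosine rather than Lucene's scoring, and all report data is made up.

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words representation of a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def duplicate_score(query, report, title_weight=3.0):
    """Combined score with the title weighted higher than the description,
    mirroring the study's finding that a 3:1 ratio works best."""
    return (title_weight * cosine(bow(query["title"]), bow(report["title"]))
            + cosine(bow(query["description"]), bow(report["description"])))

new = {"title": "crash on rotate", "description": "app crashes when rotating the screen"}
old = {"title": "crash when rotating", "description": "rotating the device crashes the app"}
unrelated = {"title": "battery drain", "description": "battery drains fast overnight"}

# The likely duplicate outranks the unrelated report:
assert duplicate_score(new, old) > duplicate_score(new, unrelated)
```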
Lynx Webinar #4: Lynx Services Platform (LySP) - Part 2 - The Services - Lynx Project
Free webinar on the Lynx Services Platform (LySP): architecture and basic services.
The main objective of the Lynx research and innovation project is to create an ecosystem of smart cloud services to better manage compliance, based on a Legal Knowledge Graph (LKG) which integrates and links multilingual and heterogeneous compliance data sources, including legislation, case law, standards, regulations and private contracts, among others.
This webinar provides insights into the smart services of the Lynx Services Platform (LySP), including demos of these LySP services, for instance: Named Entity Recognition (NER) by DFKI, Relation Extraction and Question Answering by SWC, Machine Translation by Tilde, and the Lexicala cross-lingual lexical data service by KDictionaries.
Presentation given by Ricardo Santos, member of the VIAF GDPR Working Group, at the VIAF annual meeting. The presentation shows the results of a survey on the privacy of author data in authority files.
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The talk covers the key trends across hardware, cloud and open source, explores how these areas are likely to mature and develop over the short and long term, and considers how organisations can position themselves to adapt and thrive.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Essentials of Automations: Optimizing FME Workflows with Parameters - Safe Software
Are you looking to streamline your workflows and boost your projects' efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you're in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part "Essentials of Automation" series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here's what you'll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We'll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don't miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What's changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and Grafana - RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring of JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We also held a lovely workshop in which the participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
Neuro-symbolic is not enough, we need neuro-*semantic* - Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These gains will only materialize when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
TR5 Profiler and Post-Correction System - Ludwig-Maximilians-Universität München
IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
TR5 Profiler and Post-Correction System
Ludwig-Maximilians-UniversitΓ€t MΓΌnchen
Centrum fΓΌr Informations- und Sprachverarbeitung
TR5 Post-Correction System
- User interface for easy post-correction of historical OCR'd documents
- Stand-alone user interface
- Innovative language technology enables identification and presentation of recognition errors, and efficient correction
Customizable user interface
Freely rearrangeable interface elements:
- OCR with image snippets
- Complete image
- Correction candidates / special functions
View: OCR and image clippings
Word-by-word presentation of recognized text and image clippings. Comparison of text and image follows reading order and is much easier than a side-by-side presentation of image and text.
View: Original image
- For difficult cases
- When word segmentation by the OCR fails
- Current word is highlighted
Word-by-word correction of text
- Correction by manual text entry
- Choosing correction candidates
- Faster correction thanks to candidates proposed by the post-correction system
Batch correction: efficient post-correction
Batch correction of:
- Several occurrences of an identical word
Batch correction: efficient post-correction
Batch correction of:
- Classes of systematic errors
- Errors where the correction candidate has a high degree of certainty
- Further possibilities: frequent errors, for instance location names
Post-correction system: Evaluation
User experiment with 14 individual instances.
Result: error correction with text and error profiling is 2.7 times faster.
Ulrich Reffle
Correction system (screenshots)
Why another post-correction system?
Targets a more specialist audience.
Thanks to the underlying language technology:
- Historical variants are recognized and not marked as errors, even when not in the historical lexicon
- Historical variants are proposed as correction candidates
- Typical error patterns are exploited
- Ranking of correction candidates
Underlying language technology
- Lexica and language models help in dealing with orthographical variants and unknown words.
- Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology.
- Approximate search in "hypothetical lexica".
- An analysis of the whole work (the "text and error profile") produces document-specific information about the language and the type of OCR errors.
Text and error profiles

Text profile:
- Coverage of lexica
- Typical variant patterns
→ Targeted selection of lexica
→ Better language models
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Recall and precision in IR

Error profile:
- Estimate of error rate
- Typical OCR errors
→ Better modeling of the error channel
→ Distinguishing historical variants and OCR errors
→ Ranking of correction candidates
→ Treatment of systematic errors
Underlying logic: dual noisy channel model
Interpretation of OCR output tokens as the result of two "noisy channels":

modern word u → (variant patterns) → historical variant v → (OCR errors) → OCR result w

Given an OCR token w, give possible interpretations of w in terms of:
- the "underlying" modern word u (relevant for IR!)
- the correct historical word v and its derivation from u via "patterns"
- the OCR errors garbling v into w
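The two-channel interpretation above can be sketched in miniature. This is not the IMPACT profiler's implementation (which uses approximate search in large lexica); it brute-forces single-pattern derivations over a toy lexicon, with made-up pattern sets.

```python
# Toy data: a tiny modern lexicon and one pattern per channel (illustrative).
MODERN_LEXICON = {"teil", "tier"}
VARIANT_PATTERNS = {("t", "th")}   # historical spelling: t -> th
OCR_ERROR_PATTERNS = {("t", "i")}  # OCR confusion: t -> i

def apply_once(word, patterns):
    """All words reachable by applying one pattern at one position, plus the
    unchanged word (the channel may also pass the word through)."""
    out = {word}
    for left, right in patterns:
        i = word.find(left)
        while i != -1:
            out.add(word[:i] + right + word[i + len(left):])
            i = word.find(left, i + 1)
    return out

def interpretations(w):
    """All triples (u, v, w): modern word u, historical variant v, OCR token w."""
    results = []
    for u in MODERN_LEXICON:
        for v in apply_once(u, VARIANT_PATTERNS):       # channel 1: variant patterns
            if w in apply_once(v, OCR_ERROR_PATTERNS):  # channel 2: OCR errors
                results.append((u, v, w))
    return results

print(interpretations("iheil"))  # -> [('teil', 'theil', 'iheil')]
```

The token "iheil" is thus explained as modern "teil", rewritten historically to "theil", garbled by the OCR to "iheil", matching the example on the next slide.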
Historical variant and OCR error patterns
Example: the historical variant pattern t→th rewrites modern "teil" to "theil"; the OCR error pattern t→i then garbles "theil" into "iheil".
Relative frequency: 2.9% of all "t" are rewritten to "th".
Absolute frequency: the pattern was found 120 times in the current document.
Local view: interpretations of tokens
The "meaningful interpretations" for all tokens of the OCR text are the matches in all attached lexicons, using the given settings. The same surface match can be an occurrence of the spelling variant i→y or an occurrence of the OCR error i→y.
Global view: pattern frequencies
Increment counters to estimate (relative) frequencies:
- Occurrences of spelling variant i→y: +0.999771
- Occurrences of OCR error i→y: +0.000224948
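The fractional increments (+0.999771 vs. +0.000224948 for the same token) suggest that a token's competing interpretations are weighted by their probability under the current model rather than counted as whole occurrences. A minimal sketch of such soft counting; the pattern labels are hypothetical, the two numbers are taken from the slide, and the normalization step is our assumption:

```python
from collections import Counter

def add_fractional_counts(counters, interps):
    """interps: list of (pattern_labels, unnormalized probability) pairs for one
    token. Counters are incremented by the normalized posterior of each
    interpretation rather than by 1, so a single ambiguous token contributes
    fractional counts to competing patterns."""
    total = sum(p for _, p in interps)
    for patterns, p in interps:
        for pat in patterns:
            counters[pat] += p / total

counters = Counter()
# Two competing readings of one token (labels/structure are illustrative):
token_interps = [
    (["variant:i->y"], 0.999771),    # reading as a historical spelling variant
    (["ocr:i->y"], 0.000224948),     # reading as an OCR error
]
add_fractional_counts(counters, token_interps)
print(round(counters["variant:i->y"], 4))  # -> 0.9998
```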
Computation of profile: initialization
Initial global profile: a non-specific model with probabilities for
- words
- variant patterns
- errors
applied to the OCR result w0, w1, w2, w3, …
Computation of profile: global to local
The global profile (a model with probabilities for words, variant patterns and errors) is applied to the OCR result w0, w1, w2, w3, … to compute a local profile: a ranked list of interpretations for each token.
[Figure: global profile feeding per-token interpretation lists]
Computation of profile: local to global
[Diagram: the local profiles of the tokens w0, w1, w2, w3, … are aggregated into an improved global profile with updated probabilities for words, variant patterns, and errors.]
Computation of profile: iteration
[Diagram: the global-to-local and local-to-global steps are repeated; each pass yields an improved global profile and refined local profiles for the tokens w0, w1, w2, w3, …]
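The initialization / global-to-local / local-to-global / iteration sequence of the preceding slides is an EM-style loop. A toy sketch of just the control flow, with a one-pattern model and made-up per-token evidence weights (nothing here reflects the Profiler's actual channel model):

```python
# Toy EM-style loop mirroring the profile computation: a global probability
# that the pattern is a historical variant is pushed down to per-token
# posteriors (global -> local), and the posteriors are averaged back into a
# new global estimate (local -> global), repeatedly. Purely illustrative.

def local_profile(weight_variant, global_p):
    """Posterior that one pattern occurrence is a historical variant."""
    pv = global_p * weight_variant              # evidence for "variant"
    pe = (1 - global_p) * (1 - weight_variant)  # evidence for "OCR error"
    return pv / (pv + pe)

def iterate_profile(token_weights, global_p=0.5, rounds=10):
    for _ in range(rounds):
        # global -> local: per-occurrence posteriors under the current profile
        posteriors = [local_profile(w, global_p) for w in token_weights]
        # local -> global: fractional counts re-estimate the global probability
        global_p = sum(posteriors) / len(posteriors)
    return global_p

# hypothetical per-occurrence evidence (e.g. lexicon support) for "variant"
print(iterate_profile([0.9, 0.8, 0.95, 0.2]))
```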
Profiler Evaluation
Measure the quality
1. of global profiles
2. of OCR error detection
Challenges:
• Appropriate measures are not obvious
• Good evaluation data is difficult to gather
• Results need interpretation
Evaluation: Measures
(1) Global Profiles
Percentage of matches for the first 10 patterns in the ranked output lists
Two Values: Historical Patterns, OCR Patterns
(2) OCR Error Detection
Precision and Recall for the OCR errors detected by the Profiler
(3) Indirect evaluation
(for instance, by means of the post-correction system)
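Measure (2) is standard precision and recall over the set of tokens the Profiler flags as OCR errors versus the gold annotation. A minimal sketch with invented data:

```python
# Sketch of measure (2): precision and recall of OCR-error detection
# against gold labels. Token positions below are made up.

def precision_recall(predicted_errors, gold_errors):
    tp = len(predicted_errors & gold_errors)     # true positives
    precision = tp / len(predicted_errors) if predicted_errors else 0.0
    recall = tp / len(gold_errors) if gold_errors else 0.0
    return precision, recall

pred = {3, 7, 8, 12}        # token positions flagged as OCR errors
gold = {3, 7, 9, 12, 15}    # token positions that truly are OCR errors
print(precision_recall(pred, gold))  # (0.75, 0.6)
```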
Evaluation: Data preparation
(1) Deep Evaluation:
For each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated.
++ fully accurate  -- manual work
(2) Shallow Evaluation:
The OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretations are automatically assigned from the ground truth.
++ no manual work  -- not completely accurate
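The automatic alignment step can be sketched with Python's stdlib sequence matcher, which stands in here for whatever aligner the project actually uses; the sample sentence and historical spellings are invented:

```python
import difflib

# Sketch of the shallow-evaluation setup: align the OCR'ed token sequence
# with its re-typed ground truth so each OCR token inherits its gold form.
# difflib.SequenceMatcher is a stand-in for the project's aligner.

ocr  = "Die Sonne gehet auf vnd vnter".split()   # invented OCR output
gold = "Die Sonne gehet auf und unter".split()   # invented ground truth

matcher = difflib.SequenceMatcher(a=gold, b=ocr, autojunk=False)
pairs = []
for op, g0, g1, o0, o1 in matcher.get_opcodes():
    # keep 1:1 alignments; equal spans and same-length substitutions
    if op in ("equal", "replace") and (g1 - g0) == (o1 - o0):
        pairs += list(zip(gold[g0:g1], ocr[o0:o1]))

print(pairs)   # each OCR token paired with its ground-truth form
```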
Evaluation: Data
Deep: Eckartshausen (100 pages), Briefkunst (40 pages)
Shallow: 5 books each from the 16th, 17th, and 18th centuries
Evaluation: Eckartshausen
(1) Historical patterns: matches (first 10) 70%, precision (all) 68%, recall (all) 73%
(2) OCR patterns: matches (first 6) 67%, precision (all) 59%, recall (all) 19%
(3) OCR error detection: precision 86%, recall 46%
Graphical Evaluation: Eckartshausen
Graphical Evaluation: diacritics
[Plot legend: Hist. Var. / OCR]
Shallow Evaluation Results
                                     16th     17th     18th
HIST Patterns (first 10)             60%      74%      78%
OCR Patterns (first 10)              48%      70%      50%
Error Detection Precision            95%      92%      81%
Error Detection Recall               49%      43%      45%
Content Word Errors                  64%      44%      16%
Easy Interactive Correction
  (per 10,000 words)                 ≈3000    ≈1892    ≈720
Global Profile: Spelling variation patterns
Spelling variation profile
OCR Error Profile