IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




TR5 Profiler and Post-Correction System
Ludwig-Maximilians-Universität München
Centrum für Informations- und Sprachverarbeitung




    TR5 Post-Correction System

User interface for easy postcorrection of historical OCR'd documents
Stand-alone user interface
Innovative language technology enables identification and presentation of recognition errors and efficient correction




      Customizable user interface

Freely rearrangeable interface elements:
 ––   OCR with image snippets
 ––   Complete image
 ––   Correction candidates / special functions

[Screenshot with labeled areas: Font size; OCR and image fragments; Complete image; Correction candidates, Special functions]




    View: OCR and Image clippings
Word by word presentation of recognized text and image clippings.
Comparison of text and image follows reading order and is much easier than side-by-side presentation of image and text.




     View: Original image

–– For difficult cases
–– When word segmentation by OCR fails
–– Current word is highlighted




    Word by word correction of text
Correction by manual text entry
Choosing correction candidates
Faster correction thanks to candidates proposed by the postcorrection system




Batch correction: efficient postcorrection
 Batch correction
    –– Several occurrences of an identical word




Batch correction: efficient postcorrection
Batch correction
  –– classes of systematic errors
  –– errors where the correction candidate has a high degree of certainty
  –– further possibilities
                  Frequent errors
                  For instance location names
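
The simplest batch case listed on these slides, several occurrences of an identical word, can be sketched in a few lines. This is only an illustration of the idea, not the IMPACT tool's code: the function names `group_occurrences` and `batch_correct` and the sample tokens are assumptions made up for this example.

```python
from collections import defaultdict


def group_occurrences(tokens):
    """Map each distinct OCR token to the list of positions where it occurs."""
    positions = defaultdict(list)
    for index, token in enumerate(tokens):
        positions[token].append(index)
    return positions


def batch_correct(tokens, wrong, right):
    """Replace every occurrence of `wrong` by `right`.

    Returns a corrected copy of the token list and the number of
    replacements, so the user confirms the correction only once.
    """
    positions = group_occurrences(tokens)
    corrected = list(tokens)
    hits = positions.get(wrong, [])
    for index in hits:
        corrected[index] = right
    return corrected, len(hits)


# Hypothetical OCR output in which the same misrecognition occurs twice.
tokens = ["der", "iheil", "und", "der", "iheil"]
corrected, count = batch_correct(tokens, "iheil", "theil")
```

Here `corrected` holds "theil" at both positions and `count` is 2; the original token list is left untouched so the correction can be reviewed or undone.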




Postcorrection system: Evaluation
User Experiment with 14 individual instances


        Result:
        Error correction thanks to text and error profiling is 2.7 times faster




                                                                                                                                              Ulrich Reffle, 4,




Correction system








Correction system








     Why another postcorrection system?

   Targets a more specialist audience

Thanks to underlying language technology:
   Historical variants are recognized and not marked as errors, even when not in the historical lexicon
   Historical variants are proposed as correction candidates
   Typical error patterns are exploited
   Ranking of correction candidates




Underlying language technology
  Lexica and language models help in dealing with orthographical variants and unknown words.
  Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology
           Approximate search in “hypothetical lexica”
           An analysis of the whole work (“text and error profile”) produces document-specific information about the language and the type of OCR errors
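
As a rough illustration of approximate search in a "hypothetical lexicon", the sketch below expands a modern lexicon with forms produced by variant patterns and then looks up an OCR token within a small edit distance. It is a toy version under stated assumptions: the function names, the two-word lexicon and the single pattern t→th are invented for the example, and a plain dynamic-programming Levenshtein distance stands in for whatever index structure the LMU technology actually uses.

```python
def apply_pattern(word, left, right):
    """Yield each word obtained by rewriting one occurrence of `left` to `right`."""
    start = word.find(left)
    while start != -1:
        yield word[:start] + right + word[start + len(left):]
        start = word.find(left, start + 1)


def hypothetical_lexicon(modern_words, patterns):
    """Map every modern word and every hypothetical variant to its modern word."""
    lexicon = {word: word for word in modern_words}
    for word in modern_words:
        for left, right in patterns:
            for variant in apply_pattern(word, left, right):
                lexicon.setdefault(variant, word)
    return lexicon


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def approximate_search(token, lexicon, max_distance=1):
    """Return (distance, form, modern word) for all entries close to `token`."""
    matches = [(edit_distance(token, form), form, modern)
               for form, modern in lexicon.items()]
    return sorted(match for match in matches if match[0] <= max_distance)


# Assumed toy data: two modern words and one variant pattern t -> th.
lexicon = hypothetical_lexicon(["teil", "land"], [("t", "th")])
candidates = approximate_search("iheil", lexicon)
```

The token "iheil" is not in the modern lexicon, but it lies one substitution away from the hypothetical form "theil", which in turn points back to the modern word "teil".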




Text and error profiles
Text profile
  Coverage of lexica
  Typical variant patterns
  → Targeted selection of lexica
  → Better language models
           → Distinguishing historical variants and OCR errors
           → Ranking of correction candidates
           → Recall and Precision in IR

Error profile
  Estimate of error rate
  Typical OCR errors
  → Better modeling of error channel
           → Distinguishing historical variants and OCR errors
           → Ranking of correction candidates
           → Treatment of systematic errors








Underlying logic: Dual noisy channel model
Interpretation of OCR output tokens as result of two “noisy channels”


            modern word u  → [patterns] →  historical variant v  → [OCR errors] →  OCR result w




Given an OCR token w, give possible interpretations of w in terms of
         • “underlying” modern word u (IR!)
         • correct historical word v and its derivation from u via “patterns”
         • OCR errors garbling v into w
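
One way to make the two channels concrete is to score each interpretation of a token w as the product of a word probability for u, the probabilities of the variant patterns leading from u to v, and the probabilities of the OCR errors leading from v to w. The sketch below assumes exactly that product form, which the slides do not spell out, and every probability in it is an illustrative value, not a figure from the project.

```python
def interpretation_score(word_prob, variant_patterns, ocr_errors,
                         pattern_probs, error_probs):
    """Score = P(u) * product of pattern probs * product of OCR error probs."""
    score = word_prob
    for pattern in variant_patterns:
        score *= pattern_probs[pattern]
    for error in ocr_errors:
        score *= error_probs[error]
    return score


# Illustrative (made-up) model parameters.
WORD_PROBS = {"teil": 0.002, "heil": 0.001}
PATTERN_PROBS = {("t", "th"): 0.029}
ERROR_PROBS = {("t", "i"): 0.01, ("", "i"): 0.0001}

# Two interpretations of the OCR token w = "iheil".
candidates = [
    {"u": "teil", "v": "theil", "w": "iheil",
     "patterns": [("t", "th")], "errors": [("t", "i")]},
    {"u": "heil", "v": "heil", "w": "iheil",
     "patterns": [], "errors": [("", "i")]},
]


def rank(candidates):
    """Sort interpretations by descending score."""
    def key(candidate):
        return interpretation_score(WORD_PROBS[candidate["u"]],
                                    candidate["patterns"],
                                    candidate["errors"],
                                    PATTERN_PROBS, ERROR_PROBS)
    return sorted(candidates, key=key, reverse=True)


ranked = rank(candidates)
```

With these assumed numbers the interpretation teil → theil → iheil ranks first, which yields both the correction candidate "theil" and the modern word "teil" useful for retrieval.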




Historical variant and OCR error patterns

Historical variants:    teil → theil
OCR error patterns:     theil → iheil




  Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’.
  Absolute frequency: the pattern was found 120 times in the current document.




 Local view: interpretations of tokens
     –         Local view: “Meaningful interpretations” for all tokens of the OCR text are the matches in all attached lexicons, using the given settings.

               [Figure labels: occurrence of spelling variant “i→y”; occurrence of OCR error “i→y”]




 Global view: pattern frequencies
     –         Global view: Increment counters to estimate (relative) frequencies.

               Occurrences of spelling variant “i→y”: +0.999771
               Occurrences of OCR error “i→y”: +0.000224948
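
The fractional increments shown on this slide suggest that each interpretation of a token adds its normalized weight, rather than a full count of 1, to the global counters. A minimal sketch of that bookkeeping follows; the function name, the tuple layout and the scores 0.9 and 0.1 are assumptions for illustration and do not reproduce the slide's figures.

```python
from collections import Counter


def add_fractional_counts(interpretations, variant_counts, error_counts):
    """Add each interpretation's normalized weight to the pattern counters.

    `interpretations` is a list of (score, variant_patterns, ocr_errors)
    tuples for ONE token; the scores need not sum to 1.
    """
    total = sum(score for score, _, _ in interpretations)
    for score, variant_patterns, ocr_errors in interpretations:
        weight = score / total
        for pattern in variant_patterns:
            variant_counts[pattern] += weight
        for error in ocr_errors:
            error_counts[error] += weight


variant_counts = Counter()
error_counts = Counter()

# One token with two competing readings of the same surface change i -> y:
# either a historical spelling variant or an OCR misrecognition.
interpretations = [
    (0.9, [("i", "y")], []),   # spelling variant, no OCR error
    (0.1, [], [("i", "y")]),   # no variant, OCR error
]
add_fractional_counts(interpretations, variant_counts, error_counts)
```

After this call the variant counter for i→y has grown by about 0.9 and the error counter by about 0.1, so the total contribution of the token is 1.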




      Computation of profile: initialization

       Initial global profile
       Non-specific model with probabilities for
       • Words
       • Variant patterns
       • Errors

       OCR result
       w0, w1, w2, w3, …




      Computation of profile: global to local

       Initial global profile
       Non-specific model with probabilities for
       • Words
       • Variant patterns
       • Errors

       OCR result
       w0, w1, w2, w3, …

       Local profile
       [Figure: for each token w0, w1, w2, w3, … a list of candidate interpretations, each of the form … → … → …]




     Computation of profile: local to global

      Global profile
      Improved model with probabilities for
      • Words
      • Variant patterns
      • Errors

      OCR result
      w0, w1, w2, w3, …

      Local profile
      [Figure: for each token w0, w1, w2, w3, … a list of candidate interpretations, each of the form … → … → …]




     Computation of profile: iteration

      Global profile
      Improved model with probabilities for
      • Words
      • Variant patterns
      • Errors

      OCR result
      w0, w1, w2, w3, …

      Local profile
      [Figure: for each token w0, w1, w2, w3, … a list of candidate interpretations, each of the form … → … → …]
     0   1   2   3
                                                                                                                                            23
                                                                                                                                                   Ulrich Reffle, 4,

Profiler Evaluation

Measure the quality
1.   of global profiles
2.   of OCR error detection

   Challenges
      Measures not obvious
      Good evaluation data is difficult to gather
      Results need interpretation

Evaluation: Measures
(1) Global Profiles
    Percentage of matches for the first 10 patterns in the ranked output lists
    Two Values: Historical Patterns, OCR Patterns

(2) OCR Error Detection
    Precision and Recall for the OCR errors detected by the Profiler

(3) Indirect evaluation
    (For instance, by means of the postcorrection system)
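
The first two measures can be sketched in a few lines of Python. This is a minimal illustration, not the Profiler's evaluation code: the function names are hypothetical, errors are assumed to be identified by token position, and "matches for the first 10 patterns" is assumed to mean the share of the first k predicted patterns that also occur among the first k reference patterns.

```python
def precision_recall(detected, actual):
    """Precision and recall of detected error positions against the true error positions."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_match_rate(ranked_predicted, ranked_reference, k=10):
    """Share of the first k predicted patterns found among the first k reference patterns.

    This reading of "matches for the first 10 patterns" is an assumption.
    """
    top_predicted = ranked_predicted[:k]
    top_reference = set(ranked_reference[:k])
    if not top_predicted:
        return 0.0
    hits = sum(1 for pattern in top_predicted if pattern in top_reference)
    return hits / len(top_predicted)


# Hypothetical token positions, not data from the evaluation.
detected_errors = [2, 5, 7, 11]
actual_errors = [2, 3, 5, 7, 8, 11, 13, 17]
precision, recall = precision_recall(detected_errors, actual_errors)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A detector that flags few tokens but flags them correctly scores high precision and low recall, which is the pattern to keep in mind when reading the error detection figures.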

Evaluation: Data preparation
(1) Deep Evaluation:
    For each token of the evaluation document, the historical interpretation and the
    OCR interpretation have been manually annotated.
    ++ fully accurate   -- manual work

(2) Shallow Evaluation:
    The OCR'ed document is automatically aligned with its re-typed ground truth.
    For each token of the evaluation document, the historical and the OCR
    interpretations are automatically assigned from the ground truth.
    ++ no manual work   -- not completely accurate
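
The automatic alignment used in the shallow evaluation can be illustrated with Python's standard `difflib` module. This is only a sketch under stated assumptions: the slides do not say which alignment method was used, the function name `align_tokens` is hypothetical, and the example tokens are invented.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, gt_tokens):
    """Pair OCR tokens with ground-truth tokens; None marks a token without a counterpart."""
    pairs = []
    matcher = SequenceMatcher(a=ocr_tokens, b=gt_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # Identical runs, or differing runs of equal length: pair tokens one by one.
            pairs.extend(zip(ocr_tokens[i1:i2], gt_tokens[j1:j2]))
        else:
            # Insertions, deletions and uneven replacements (e.g. split or merged words).
            pairs.extend((token, None) for token in ocr_tokens[i1:i2])
            pairs.extend((None, token) for token in gt_tokens[j1:j2])
    return pairs


# Invented example: one misrecognized word in the OCR output.
ocr = ["Die", "Knnst", "zu", "schreiben"]
ground_truth = ["Die", "Kunst", "zu", "schreiben"]
aligned = align_tokens(ocr, ground_truth)
suspected_errors = [(o, g) for o, g in aligned if o != g]
print(suspected_errors)
```

Unpaired tokens (those with `None`) show why this route is "not completely accurate": when the OCR splits or merges words, the token-level assignment from the ground truth becomes unreliable.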

Evaluation: Data


Deep:    Eckartshausen   100 pages
         Briefkunst       40 pages
Shallow: 5 books each from the 16th, 17th and 18th century

Evaluation: Eckartshausen


     (1)  historical patterns
          matches first 10      70%
          precision all         68%
          recall all            73%
     (2)  OCR patterns
          matches first 6       67%
          precision all         59%
          recall all            19%
     (3)  OCR error detection
          precision             86%
          recall                46%

Graphical Evaluation: Eckartshausen

Graphical Evaluation: diacritics


[Figure: two panels, labelled "Hist. Var." and "OCR"]

 Shallow Evaluation Results


                                  16th century    17th century    18th century
HIST Patterns first 10            60%             74%             78%
OCR Patterns first 10             48%             70%             50%
Error Detection Prec              95%             92%             81%
Error Detection Recall            49%             43%             45%
Content Words Errors              64%             44%             16%
Easy Interactive Correction
per 10,000 words                  ≈ 3000 words    ≈ 1892 words    ≈ 720 words

Global Profile: Spelling variation patterns

Spelling variation profile

OCR Error Profile
TR5 Prolifer and Post-Correction System. Ludwig Maximilians

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. TR5 Profiler and Post-Correction System Ludwig-Maximilians-Universität München Centrum für Informations- und Sprachverarbeitung
  • 2. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. TR5 Post-Correction System User interface for easy postcorrection of User interface for easy postcorrection of historical OCR'd documents historical OCR'd documents Stand-alone user interface Stand-alone user interface Innovative language technology enables Innovative language technology enables identification, presentation of recognition identification, presentation of recognition errors and efficient correction errors and efficient correction
  • 3. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Customizable user interface Font size Freely rearrangeable interface Freely rearrangeable interface elements: elements: –– OCR with Image snippets OCR with Image snippets –– Complete image Complete image –– Correction candidates/ Special OCR and image fragments Correction candidates/ Special functions functions Complete image Correction candidates, Special functions
  • 4. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. View: OCR and Image clippings Word by word presentation of Word by word presentation of recognized text and image clippings. recognized text and image clippings. Comparison of text and image follows Comparison of text and image follows reading order and isismuch easier than reading order and much easier than side-by-side presentation of image and side-by-side presentation of image and text. text.
  • 5. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. View: Original image –– For difficult cases For difficult cases –– When word segmentation by OCR When word segmentation by OCR fails fails –– Current word isis highlighted Current word highlighted
  • 6. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Word by word correction of text Correction by manual text entry Correction by manual text entry Choosing correction candidates Choosing correction candidates Faster correction thanks to candidates Faster correction thanks to candidates proposed by the postcorrection system proposed by the postcorrection system
  • 7. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Batch correction: efficient postcorrection Batch correction Batch correction –– Several occurences of identical Several occurences of identical word word
  • 8. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Batch correction: efficient postcorrection Batch correction Batch correction –– classes of systematic errors classes of systematic errors –– errors where the correction errors where the correction candidate has aa high degree of candidate has high degree of certainty certainty –– further possilities further possilities Frequent errors Frequent errors For instance Location names For instance Location names
  • 9. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Postcorrection system: Evaluation User Experiment with 14 individual instances Result: Result: Error correction thanks to text and error Error correction thanks to text and error profiling is 2.7 times faster profiling is 2.7 times faster 9 Ulrich Reffle, 4,
  • 10. Correction system
  • 11. Correction system
  • 12. Why another postcorrection system? Targets a more specialist audience. Thanks to underlying language technology: historical variants are recognized and not marked as errors, even when not in the historical lexicon; historical variants are proposed as correction candidates; typical error patterns are exploited; ranking of correction candidates.
  • 13. Underlying language technology. Lexica and language models help in dealing with orthographical variants and unknown words. Recognition of OCR errors and proposal of correction candidates depend on specially developed LMU language technology. Approximate search in “hypothetical lexica”. An analysis of the whole work (“text and error profile”) produces document-specific information about the language and the type of OCR errors.
  • 14. Text and error profiles. Text profile: coverage of lexica; typical variant patterns; → targeted selection of lexica; → better language models; → distinguishing historical variants and OCR errors; → ranking of correction candidates; → recall and precision in IR. Error profile: estimate of error rate; typical OCR errors; → better modeling of error channel; → distinguishing historical variants and OCR errors; → ranking of correction candidates; → treatment of systematic errors.
  • 15. Underlying logic: Dual noisy channel model. Interpretation of OCR output tokens as the result of two “noisy channels”: modern word u → (patterns) → historical variant v → (OCR errors) → OCR result w. Given an OCR token w, give possible interpretations of w in terms of: the “underlying” modern word u (IR!); the correct historical word v and its derivation from u via “patterns”; the OCR errors garbling v into w.
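The dual noisy channel reading of a token can be illustrated with a small sketch. It is not the Profiler's implementation: the lexicon, the pattern lists, the restriction to at most one pattern per channel, and the names `rewrites` and `interpretations` are all assumptions chosen for illustration.

```python
# Toy data, all assumed for illustration only.
MODERN_LEXICON = ["teil", "tier"]
VARIANT_PATTERNS = [("t", "th")]   # modern spelling -> historical spelling
OCR_PATTERNS = [("t", "i")]        # correct text -> OCR output


def rewrites(word, patterns):
    """Yield (new_word, applied) for zero or one pattern application.

    `applied` is None when nothing was rewritten, otherwise a tuple
    (left, right, position).
    """
    yield word, None
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            rewritten = word[:start] + right + word[start + len(left):]
            yield rewritten, (left, right, start)
            start = word.find(left, start + 1)


def interpretations(w):
    """Return all (u, v, variant_pattern, ocr_pattern) with u -> v -> w."""
    result = []
    for u in MODERN_LEXICON:
        for v, variant in rewrites(u, VARIANT_PATTERNS):
            for candidate, ocr_error in rewrites(v, OCR_PATTERNS):
                if candidate == w:
                    result.append((u, v, variant, ocr_error))
    return result


print(interpretations("iheil"))
```

For the token "iheil" this toy setup yields a single interpretation: modern word "teil", historical variant "theil" via t→th, and the OCR error t→i. A real system would search the lexicon approximately instead of enumerating rewrites.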
  • 16. Historical variant and OCR error patterns. Historical variants: teil → theil. OCR error patterns: theil → iheil.
  • 17. Relative frequency: 2.9% of all ‘t’ are rewritten to ‘th’. Absolute frequency: the pattern was found 120 times in the current document.
  • 18. Local view: interpretations of tokens. “Meaningful interpretations” for all tokens of the OCR text are the matches in all attached lexicons, using the given settings. Occurrence of spelling variant “i→y”; occurrence of OCR error “i→y”.
  • 19. Global view: pattern frequencies. Increment counters to estimate (relative) frequencies. Occurrences of spelling variant “i→y”: +0.999771. Occurrences of OCR error “i→y”: +0.000224948.
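The fractional increments on the global-view slide suggest that each token distributes one unit of count over its competing interpretations. The sketch below shows one way to do that, together with the relative frequency of a pattern. The scoring values, the pattern labels, and the function names are assumptions; the slides do not specify how the weights are computed.

```python
from collections import defaultdict


def add_fractional_counts(counters, scored_interpretations):
    """Spread one unit of count over a token's competing interpretations.

    `scored_interpretations` is a list of (pattern_label, score) pairs;
    scores are normalized so that the increments for one token sum to 1.
    """
    total = sum(score for _, score in scored_interpretations)
    if total == 0:
        return
    for pattern, score in scored_interpretations:
        counters[pattern] += score / total


def relative_frequency(pattern_count, left_side_count):
    """Share of left-hand-side occurrences that were rewritten by the pattern."""
    return pattern_count / left_side_count if left_side_count else 0.0


counters = defaultdict(float)
# Hypothetical scores for two tokens that contain "y".
add_fractional_counts(counters, [("variant i→y", 3.0), ("ocr i→y", 1.0)])
add_fractional_counts(counters, [("variant i→y", 1.0)])
print(dict(counters))

# Hypothetical numbers: a pattern seen 120 times among 4000 candidate positions.
print(relative_frequency(120, 4000))
```

Normalizing per token keeps the counters comparable across tokens with different numbers of candidate interpretations.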
  • 20. Computation of profile: initialization. Initial global profile: non-specific model with probabilities for words, variant patterns, and errors. OCR result: w0, w1, w2, w3, …
  • 21. Computation of profile: global to local. Initial global profile: non-specific model with probabilities for words, variant patterns, and errors. Local profile: for each token w0, w1, w2, w3, … of the OCR result, a list of interpretations of the form … → … → …
  • 22. Computation of profile: local to global. Global profile: improved model with probabilities for words, variant patterns, and errors. Local profile: for each token w0, w1, w2, w3, … of the OCR result, a list of interpretations of the form … → … → …
  • 23. Computation of profile: iteration. Global profile: improved model with probabilities for words, variant patterns, and errors. Local profile: for each token w0, w1, w2, w3, … of the OCR result, a list of interpretations of the form … → … → …
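The alternation between global and local profile can be sketched as an iterative re-estimation loop. The slides do not state the update rule, so the EM-style normalization below is an assumption, as are the function name `compute_profile`, the uniform initialization, and the toy input in which every interpretation is reduced to a single pattern label.

```python
def compute_profile(token_candidates, patterns, iterations=5):
    """Estimate pattern probabilities by alternating local and global steps.

    token_candidates: one list of candidate pattern labels per OCR token.
    patterns: all pattern labels that may occur.
    """
    # Initialization: non-specific global profile (uniform, by assumption).
    prob = {p: 1.0 / len(patterns) for p in patterns}
    for _ in range(iterations):
        counts = {p: 0.0 for p in patterns}
        # Global to local: weight each token's candidates with the current profile.
        for candidates in token_candidates:
            weights = [prob[c] for c in candidates]
            total = sum(weights)
            if total == 0:
                continue
            for candidate, weight in zip(candidates, weights):
                counts[candidate] += weight / total
        # Local to global: renormalize the fractional counts.
        norm = sum(counts.values())
        if norm == 0:
            return prob
        prob = {p: counts[p] / norm for p in patterns}
    return prob


# Two unambiguous tokens and one token that is ambiguous between
# a spelling variant and an OCR error (hypothetical data).
tokens = [["variant i→y"], ["variant i→y"], ["variant i→y", "ocr i→y"]]
profile = compute_profile(tokens, ["variant i→y", "ocr i→y"])
print(profile)
```

In this toy input the unambiguous tokens pull the ambiguous one toward the spelling-variant reading with every iteration, which is the effect a document-specific profile is meant to achieve.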
  • 24. Profiler Evaluation. Measure the quality (1) of global profiles, (2) of OCR error detection. Challenges: measures are not obvious; good evaluation data is difficult to gather; results need interpretation.
  • 25. Evaluation: Measures. (1) Global profiles: percentage of matches for the first 10 patterns in the ranked output lists; two values: historical patterns, OCR patterns. (2) OCR error detection: precision and recall for the OCR errors detected by the Profiler. (3) Indirect evaluation (for instance, by means of the postcorrection system).
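The first two measures can be written down in a few lines. Precision and recall follow their standard definitions. The exact definition of "percentage of matches for the first 10 patterns" is not given in the slides, so `top_k_match` below assumes it means the share of the top-k predicted patterns that also appear among the top-k ground-truth patterns. All names and the sample data are hypothetical.

```python
def precision_recall(detected, actual):
    """Standard precision and recall over sets of token positions."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


def top_k_match(ranked_predicted, ranked_gold, k=10):
    """Share of the top-k predicted patterns found in the top-k gold patterns.

    This reading of the slide's measure is an assumption.
    """
    top_predicted = ranked_predicted[:k]
    top_gold = set(ranked_gold[:k])
    if not top_predicted:
        return 0.0
    return sum(p in top_gold for p in top_predicted) / len(top_predicted)


# Hypothetical token positions flagged as OCR errors vs. true error positions.
print(precision_recall({1, 2, 3, 4}, {2, 3, 5, 6, 7}))
# Hypothetical ranked pattern lists.
print(top_k_match(["t→th", "i→y", "u→v"], ["i→y", "u→v", "ei→ey"], k=3))
```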
  • 26. Evaluation: Data preparation. (1) Deep evaluation: for each token of the evaluation document, the historical interpretation and the OCR interpretation have been manually annotated. Advantage: fully accurate. Drawback: manual work. (2) Shallow evaluation: the OCR'ed document is automatically aligned with its re-typed ground truth; for each token of the evaluation document, the historical and the OCR interpretation are automatically assigned from the ground truth. Advantage: no manual work. Drawback: not completely accurate.
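The automatic alignment used for shallow evaluation can be approximated with Python's standard `difflib`. The slides do not say which alignment method was used, so this is a stand-in technique, and the function name `align_tokens` and the sample sentence are hypothetical. Spans where OCR and ground truth differ in token count are skipped, which is one reason such an alignment is not completely accurate.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, truth_tokens):
    """Pair OCR tokens with ground-truth tokens.

    Equal spans and same-length replaced spans are paired one to one;
    insertions, deletions, and uneven replacements are left unaligned.
    """
    pairs = []
    matcher = SequenceMatcher(None, ocr_tokens, truth_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            pairs.extend(zip(ocr_tokens[i1:i2], truth_tokens[j1:j2]))
    return pairs


ocr = "der iheil des buches".split()
truth = "der theil des buches".split()
for ocr_token, truth_token in align_tokens(ocr, truth):
    status = "ok" if ocr_token == truth_token else "OCR error"
    print(ocr_token, truth_token, status)
```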
  • 27. Evaluation: Data. Deep: Eckartshausen (100 pages), Briefkunst (40 pages). Shallow: 5 books each from the 16th, 17th and 18th century.
  • 28. Evaluation: Eckartshausen. (1) Historical patterns: matches among first 10: 70%; precision (all): 68%; recall (all): 73%. (2) OCR patterns: matches among first 6: 67%; precision (all): 59%; recall (all): 19%. (3) OCR error detection: precision 86%; recall 46%.
  • 29. Graphical Evaluation: Eckartshausen
  • 30. Graphical Evaluation: diacritics. Panels: Hist. Var., OCR.
  • 31. Shallow Evaluation Results (16th / 17th / 18th century). HIST patterns, first 10: 60% / 74% / 78%. OCR patterns, first 10: 48% / 70% / 50%. Error detection precision: 95% / 92% / 81%. Error detection recall: 49% / 43% / 45%. Content words errors: 64% / 44% / 16%. Easy interactive correction per 10,000 words: ≈ 3000 words / ≈ 1892 words / ≈ 720 words.
  • 32. Global Profile: Spelling variation patterns
  • 33. Spelling variation profile
  • 34. OCR Error Profile