How to create
a corpus of
machine-readable texts:
challenges and solutions
What is OCR and how does it work?
Definition of OCR according to the Oxford
Dictionary of Computer Science, p. 379:
„OCR = optical character recognition; a
process in which a machine scans,
recognizes, and encodes information
printed or typed in alphanumerical
characters. (…) OCR software is now
readily available for many low-cost
scanners giving good recognition rates for
printed material using the Latin
alphabet. The more difficult problems
posed by other character sets and
handwriting are areas of ongoing
research.“
When was OCR software invented?
Ca. 1955: early OCR
devices only recognised a
limited set of characters in
machine-optimised fonts
mid-1970s: OCR A font and
OCR B font (similar to
normal letter-press
appearance)
Do we encounter OCR in everyday life?
High accuracy rates have popularised OCR in the following areas:
• banking (machines „reading“ paper cheques and transfer forms)
• public administration
• health-care (e.g. machine-readable prescriptions)
NOTE:
In cases where absolute perfection is needed,
OCR A and OCR B fonts are still used.
If sensitive information is handled, OCR
technology can be combined with the so-called
MICR technology (magnetic-ink character
recognition) checking the legitimacy or
originality of paper documents.
Are humanities tools using OCR?
Google Books: full-text search +
highlighting of text results
HathiTrust full-text view
What are OCR problems in historical research?
„Hannoverisches Magazin”, 1776
Best historical OCR results:
texts in standardised formats (e.g. periodicals)
Improving results for minority languages and old fonts –
an on-going challenge
Recent innovation: merging OCR and handwriting
recognition technologies (HWR/HTR)
“Handwriting recognition (HWR), also known as Handwritten Text
Recognition (HTR), is the ability of a computer to receive and interpret
intelligible handwritten input from sources such as paper documents,
photographs, touch-screens and other devices. The image of the written
text may be sensed "off line" from a piece of paper by optical scanning
(optical character recognition) or intelligent word recognition. Alternatively,
the movements of the pen tip may be sensed "on line", for example by a
pen-based computer screen surface, a generally easier task as there are
more clues available. A handwriting recognition system handles formatting,
performs correct segmentation into characters, and finds the most plausible
words.”
Wikipedia.org
The machine learning revolution in OCR
How does machine learning work?
Cf. Stanford OCR pipeline:
• text detection (layout recognition)
• character segmentation (using
„sliding window“ technique)
• character classification
• spell correction
(http://doremi2016.logdown.com/posts/2017/01/20/standford-machine-learning-photo-ocr-machine-learning-pipeline)
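The four pipeline stages listed above can be illustrated with a toy sketch. It assumes a tiny binary "scan" given as strings, a fixed-width sliding window over the columns, a hypothetical glyph lookup table in place of a trained classifier, and a small word list for spell correction. None of these names come from the Stanford material; real systems learn each stage from data.

```python
from difflib import get_close_matches

# Hypothetical 3x3 glyph templates: '#' = ink, '.' = background.
GLYPHS = {
    ("#.#", "###", "#.#"): "H",
    ("###", ".#.", "###"): "I",
    ("###", "#.#", "###"): "O",
}
LEXICON = ["HI", "OH", "HO"]  # assumed word list for the spell-correction step

IMAGE = [
    ".......",
    "#.#.###",
    "###..#.",
    "#.#.###",
    ".......",
]


def detect_text_rows(image):
    """Text detection: keep only the rows that contain ink."""
    return [row for row in image if "#" in row]


def segment_characters(rows, width=3):
    """Character segmentation: slide a fixed-width window across the columns
    and cut out a window whenever its first column contains ink."""
    if not rows:
        return []
    windows, col, n_cols = [], 0, len(rows[0])
    while col + width <= n_cols:
        if any(row[col] == "#" for row in rows):
            windows.append(tuple(row[col:col + width] for row in rows))
            col += width  # jump past the character just found
        else:
            col += 1  # blank column: move the window one step
    return windows


def classify(window):
    """Character classification: template lookup, '?' for unknown shapes."""
    return GLYPHS.get(window, "?")


def spell_correct(word):
    """Spell correction: replace the word by its closest lexicon entry, if any."""
    matches = get_close_matches(word, LEXICON, n=1, cutoff=0.5)
    return matches[0] if matches else word


def run_pipeline(image):
    rows = detect_text_rows(image)
    raw = "".join(classify(w) for w in segment_characters(rows))
    return spell_correct(raw)


result = run_pipeline(IMAGE)
print(result)
```

Each function maps to one bullet of the pipeline, so a weak stage can be swapped out (e.g. a neural classifier instead of the lookup table) without touching the others.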
New OCR tools based on machine learning
E.g. OCR-D project
in Germany:
• improved visual
character
recognition
• context analysis of
n-grams
• trainer feedback to
exclude potential
mistakes
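The idea behind "context analysis of n-grams" can be sketched as follows: when the recogniser is unsure between two readings, the reading whose character sequences are more typical of the language is preferred. This is a minimal illustration with a character bigram model, add-one smoothing and an invented sample corpus; it is not the OCR-D implementation, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(corpus):
    """Count character bigrams and single characters in a training text."""
    text = corpus.lower()
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text)
    return bigrams, unigrams


def log_score(word, bigrams, unigrams):
    """Add-one smoothed log-probability of the word's character bigrams."""
    vocab = len(unigrams) or 1
    w = word.lower()
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(w, w[1:])
    )


def pick_reading(candidates, bigrams, unigrams):
    """Return the candidate reading with the highest bigram score."""
    return max(candidates, key=lambda c: log_score(c, bigrams, unigrams))


# Invented training text; a real model would be trained on a large corpus.
CORPUS = "the magazine of the modern reader and the modern writer"
BIGRAMS, UNIGRAMS = train_bigrams(CORPUS)

# 'm' is often misread as 'rn'; the bigram model prefers the familiar sequence.
best = pick_reading(["modern", "rnodern"], BIGRAMS, UNIGRAMS)
print(best)
```

Note that this simple score penalises longer candidates; a production system would normalise for length and use word-level context as well.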
Current range of OCR-tools for researchers
• Transkribus.eu (free of charge, cloud-based, each user contributes training data to the
community)
• OCR4all (free command-line OCR software for desktop-installation, difficult set-up,
does not run smoothly on Windows)
• KRAKEN (Python package for OCR, usage not monitored, data do not need to be
shared with developers or other users)
• ABBYY FineReader (one of the most popular proprietary OCR tools)
• Tesseract (originally developed as proprietary software at Hewlett Packard labs in
England and the US, released as open source in 2005, supported by Google since 2006,
available for Linux as well as Windows and Mac OS X, high pre-processing
requirements)
• PICCL/TICCL (free corpus building and corpus clean-up system performing spelling
correction and OCR post-correction, developed for LINUX, requires virtual machine
on Windows)
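As an illustration of what command-line operation looks like in practice, here is a minimal sketch that calls the Tesseract command-line tool from Python. It assumes Tesseract 4 or later is installed and on the PATH and that the German language model ("deu") is available; the file names are hypothetical, and older Tesseract versions spell the page-segmentation flag differently.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="deu", psm=3):
    """Assemble the CLI call: tesseract IMAGE OUTPUTBASE -l LANG --psm N."""
    return [
        "tesseract", str(image_path), str(output_base),
        "-l", lang,          # language model to use (assumed to be installed)
        "--psm", str(psm),   # page segmentation mode; 3 = automatic
    ]


def ocr_page(image_path, output_base, lang="deu"):
    """Run Tesseract if possible; return the path of the text file or None."""
    if shutil.which("tesseract") is None or not Path(image_path).exists():
        return None  # tool or input missing: nothing to do
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt")  # Tesseract appends .txt to the output base


# Hypothetical file names for one scanned page.
command = build_tesseract_command("page_001.png", "page_001")
print(" ".join(command))
```

Building the command separately from running it makes it easy to log the exact call for each page, which helps when results need to be reproduced later.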
GBV-Verbund: Intranda OCR Service
And the development continues…
PROs and CONs of open-source OCR software:
CONs:
• takes up a lot of storage space
• difficult installation
• often limited performance on
Windows and Mac
• usually requires command-line
operation (no GUI)
• conducting own training can be
time-consuming
• copyright issues if software
provider requires you to ingest
your (training) data into a public
pool
PROs:
• flexible integration of historical
texts in different digital formats
• adaptable to multiple languages
and new fonts / page layouts
Photo by Luca Bravo on Unsplash
How reliable are current OCR tools?
Results of an OCR test on a US
driver's licence, published on September 18,
2019:
https://mobidev.biz/blog/ocr-machine-learning-implementation
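To check reliability on your own sources, one common measure is the character error rate (CER): the edit distance between a manual reference transcription and the OCR output, divided by the length of the reference. The sketch below is a plain-Python illustration; the sample strings are invented and are not taken from the test cited above.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Invented example: a long s in an old print misread as 'f'.
reference = "Hannoverisches Magazin"
ocr_output = "Hannoverifches Magazin"
cer = character_error_rate(reference, ocr_output)
print(f"CER: {cer:.3f}")
```

A CER computed on a few manually transcribed sample pages gives a quick estimate of how much post-correction a whole collection will need.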
How can we integrate OCR into our own workflow?
Export scan as PDF or
image files to perform OCR!
Analyse non-coded
plain text with topic
modelling or
stylometry tools not
requiring structured
data!
Code information (e.g. in XML/TEI or
JSON) and use software to analyse
networks between specific tagged
entities or visualise geographic data!
Export scan as PDF or image file to let
humans extract metadata and
transcribe the text!
Original manuscript
or print
Use transcriptions to train OCR-
software and improve results for
similar sources (e.g. issues of the
same newspaper)!
Perform quantitative analysis on
more texts in less time and
generate more reliable results!
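The "code information" step can be sketched with the Python standard library. The example assumes a line of OCR text and entity positions that someone has already identified; the sample sentence, the offsets and the record layout are invented for illustration. The element name placeName follows TEI conventions, but the output is only a fragment, not a complete TEI document.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical OCR output and manually identified entities (character offsets).
text = "Letter from Hannover to Bremen"
entities = [
    {"start": 12, "end": 20, "type": "placeName"},
    {"start": 24, "end": 30, "type": "placeName"},
]


def to_tei_paragraph(text, entities):
    """Wrap each entity span in an element named after its type inside a <p>."""
    p = ET.Element("p")
    cursor, last = 0, None
    for ent in sorted(entities, key=lambda e: e["start"]):
        between = text[cursor:ent["start"]]
        if last is None:
            p.text = between      # text before the first entity
        else:
            last.tail = between   # text between two entities
        last = ET.SubElement(p, ent["type"])
        last.text = text[ent["start"]:ent["end"]]
        cursor = ent["end"]
    rest = text[cursor:]
    if last is None:
        p.text = rest
    else:
        last.tail = rest
    return p


def to_json_record(text, entities):
    """Same information as JSON, with the surface form added to each entity."""
    enriched = [dict(e, surface=text[e["start"]:e["end"]]) for e in entities]
    return json.dumps({"text": text, "entities": enriched}, ensure_ascii=False)


tei_xml = ET.tostring(to_tei_paragraph(text, entities), encoding="unicode")
record_json = to_json_record(text, entities)
print(tei_xml)
print(record_json)
```

Once places or persons are tagged like this, network or mapping software can read the tagged entities directly instead of working on plain text.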
Testing a cloud-based OCR tool: transkribus.eu
