Bratislava WS - Mühlberger - OCR in libraries
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
OCR in libraries – some practical remarks
Günter Mühlberger
Department for Digitisation and Digital Preservation
University Innsbruck Library
OCR in Libraries
Not an easy chapter...
Is the glass half empty or half full?
Historical fonts: blackletter, Gothic, Old Cyrillic, ...
Major full-text initiatives
– JSTOR (1994)
– Google (2004)
But: Still many digital libraries without integrated full-text
OCR and Digitization
OCR changes everything!
Workflow has to be adapted at all steps
– Preparation and selection of material
– Image processing & scanning
– Quality control
– Storage and preservation
– Correction and user involvement
– Full-text search
– Web interfaces for digital libraries
Significant increase in complexity
Preparation
Which material will be taken for scanning? Options:
– Bound volumes?
– Microfilm?
– Loose folios?
Option: Bound volumes
Bound volumes
– Pros:
This is how books, journals and newspapers are held in the library
– Cons:
Often narrow binding, especially with newspapers
Often warping due to humidity
– Remark
Technical solution: ScanRobots make life easier and double the speed
compared to manual operation, e.g. 700 – 1000 pages per hour
The investment for ScanRobots must not be underestimated
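The throughput figures above can be turned into a rough time estimate. The following is a minimal sketch, assuming that "double the speed" means manual scanning runs at half the quoted ScanRobot rate; the page count used in the usage note is a hypothetical example, not a figure from the talk.

```python
# Rough scanning-time estimate based on the quoted 700-1000 pages per hour.
# Assumption: manual scanning runs at half the ScanRobot rate.

ROBOT_PAGES_PER_HOUR = (700, 1000)  # (low, high) range quoted on the slide


def scanning_hours(pages: int, pages_per_hour: float) -> float:
    """Return the hours needed to scan `pages` at a given hourly rate."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages / pages_per_hour


def compare(pages: int) -> dict:
    """Return (best case, worst case) hours for robot and manual scanning."""
    low, high = ROBOT_PAGES_PER_HOUR
    return {
        "robot_hours": (scanning_hours(pages, high), scanning_hours(pages, low)),
        "manual_hours": (scanning_hours(pages, high / 2), scanning_hours(pages, low / 2)),
    }


if __name__ == "__main__":
    # 70,000 pages is a hypothetical collection size chosen for illustration.
    print(compare(70_000))
```

For a hypothetical 70,000 pages this gives 70 to 100 hours with a ScanRobot against 140 to 200 hours manually, which is the kind of difference that has to be weighed against the investment.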
Option: Microfilm
Microfilm
– Pros:
If a microfilm is available it is a cheap alternative
Easy option (no handling of volumes)
– Cons:
Microfilms have the same problems as bound volumes
Microfilms were often produced with minimum quality control
Microfilms from before 1990 are often in poor condition
Remark
– If the microfilm was produced with good quality, then there is no significant
difference in the OCR quality
Case study with BL material will be published on IMPACT site
Option: Loose folios
Pros
– No narrow binding, less warping
– Extremely fast performance with industry scanners – low price
– Duplicates can be sent to off-shore providers in huge packages
Cons
– Not feasible for material before 1850 – libraries would run into justification problems
– Organisational effort to obtain and manage duplicates (but completeness has to be checked
anyway)
Remark
– By far the best option to produce high quality with the lowest resources
– Especially interesting for newspapers, 20th century material and grey literature
– Used e.g. by MOA, JSTOR
Good, bad and ugly images
Careful scanning is the alpha and omega
– ScanRobots and document scanners reduce the demands on the
operator, but individual skill is still decisive
Criteria for a good page image are simple:
– sharp
– distinct characters with clear contours
– clear background, no show-through from the reverse side
– no warping of the page and no geometrical distortions
– complete shot with some white frame around the text borders
– text lines parallel or at right angles to the page borders
– no noise from users
If you have perfect images you can wait until OCR technology
improves; with bad images you will never get good results
Bad print – broken characters
„und wenn“ (German: "and if")
Bitonal or 8/24 Bit – 300 or 400 ppi – JPEG or TIFF?
Bitonal vs. 8/24 bit
– Rose Holley, D-Lib paper (2009): greyscale scanning does not lead to better results
– Experiment: Microfilm scanned bitonal or greyscale – no difference
Simple experiments show the opposite
– Innsbrucker Zeitungsarchiv: bitonal and 24 bit
– Results are clearly better with colour
Resolution: 300 or 400 ppi
– Very small fonts, e.g. Word text in a 4-point font
JPEG vs. TIFF RGB
– Tests with the Treventus ScanRobot and with other material show
no advantage of TIFF RGB images over compressed
JPEGs
Modern documents with medium-sized fonts can be scanned at 300
ppi and bitonal, but documents with small fonts, challenging paper
quality etc. should be scanned at 400 ppi and 8 or 24 bit and can be
stored as JPEGs with e.g. 90% compression rate
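The rule of thumb above can be written down as a small decision function, together with the uncompressed image size that each setting implies. This is a sketch only: the function names are hypothetical, the choice of 24 bit (rather than 8 bit) for difficult material is an arbitrary pick within the recommended range, and the size formula is the plain pixels times bits calculation before any compression.

```python
# Sketch of the scanning rule of thumb plus raw (uncompressed) image size.
# Names are illustrative; 24 bit is one option within the "8 or 24 bit" advice.


def recommend_scan_settings(small_fonts: bool, challenging_paper: bool) -> dict:
    """Return resolution and bit depth following the rule of thumb."""
    if small_fonts or challenging_paper:
        # Difficult material: higher resolution, greyscale or colour.
        return {"ppi": 400, "bit_depth": 24}
    # Modern documents with medium-sized fonts: bitonal is sufficient.
    return {"ppi": 300, "bit_depth": 1}


def raw_size_mb(width_in: float, height_in: float, ppi: int, bit_depth: int) -> float:
    """Uncompressed size in megabytes (10**6 bytes) of a scanned page."""
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return pixels * bit_depth / 8 / 1_000_000


if __name__ == "__main__":
    for small, difficult in [(False, False), (True, False)]:
        settings = recommend_scan_settings(small, difficult)
        # 10 x 10 inches is a hypothetical page size that keeps the numbers round.
        size = raw_size_mb(10, 10, settings["ppi"], settings["bit_depth"])
        print(settings, f"{size:.3f} MB uncompressed")
```

The gap in raw size between the two settings is large, which is why storing the 400 ppi colour images as compressed JPEGs rather than TIFF RGB matters in practice.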
Accuracy
Is the glass half full or half empty?
– Rose Holley: below 90% word recognition is a poor result
– Google: OCR every image, so every correctly recognized word is better
than nothing
– Painful errors?
– Mature users?
Character vs. word accuracy
– Word accuracy is far more informative and much easier to determine: each word
that would be found correctly in a full-text search can be counted as
correct.
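The difference between the two measures can be illustrated with a short sketch. It assumes one possible reading of the search-based definition above: words are split on whitespace, compared case-insensitively, and counted as correct when they also occur in the OCR output. Character accuracy is computed here from the Levenshtein edit distance, which is a common convention and not something the talk prescribes.

```python
# Word accuracy (search-based reading) versus character accuracy.
# Tokenisation and the edit-distance definition are assumptions of this sketch.
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_accuracy(truth: str, ocr: str) -> float:
    """1 minus edit distance relative to the ground-truth length."""
    if not truth:
        return 1.0
    return max(0.0, 1 - levenshtein(truth, ocr) / len(truth))


def word_accuracy(truth: str, ocr: str) -> float:
    """Share of ground-truth words that also occur in the OCR output."""
    truth_words = Counter(truth.lower().split())
    ocr_words = Counter(ocr.lower().split())
    total = sum(truth_words.values())
    if total == 0:
        return 1.0
    found = sum(min(n, ocr_words[w]) for w, n in truth_words.items())
    return found / total


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the qnick brown fox"  # one misrecognised character
    print(f"character accuracy: {character_accuracy(truth, ocr):.3f}")
    print(f"word accuracy:      {word_accuracy(truth, ocr):.3f}")
```

A single wrong character barely moves the character accuracy but makes a whole word unfindable, which is why the word-level figure says more about what users experience in a full-text search.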
Examples from real world projects
Based on: ABBYY Recognition Server 2
– Reichstagsprotokolle, 1925
– Zedler, 1744
– Coburger Zeitung, 1808
– Judentum, 1803
– Eckartshausen, 1792
– Landesbauernkammer, 1921
– Galvani, 1793
– Hieber, 1722
– Hofmann, 1875
– Buschendorf, 1805
– Schreiben, 1689
– Latin texts
Correction of OCR text
Until recently regarded as „absurd“
But:
– Crowd sourcing
– New technologies
Crowd sourcing
– Figures from the Australian Newspaper Project:
– Correction via a simple editor: line-by-line correction
– Since August 2008, 6,000 users have contributed
– 7 million lines in 318,000 articles were corrected
– Assuming 50 characters per line, this is worth about 200,000 EUR
(compared to the prices of service providers)
New technologies
– IBM: CONCERT Tool, LMU: PostCorrection Tool
– Productivity compared to simple rekeying is expected to increase
several-fold (at least by a factor of 5)
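The crowd-sourcing figures above can be checked with simple arithmetic. The sketch below uses only the numbers quoted on the slide; the price per 1,000 characters is derived from them and is not a figure stated in the talk.

```python
# Back-of-the-envelope check of the crowd-sourcing figures quoted above.
# The per-1,000-character price is derived here, not taken from the talk.

LINES_CORRECTED = 7_000_000
ARTICLES = 318_000
USERS = 6_000
CHARS_PER_LINE = 50          # assumption stated on the slide
ESTIMATED_VALUE_EUR = 200_000


def crowd_figures() -> dict:
    """Return derived totals and averages from the quoted figures."""
    total_chars = LINES_CORRECTED * CHARS_PER_LINE
    return {
        "total_chars": total_chars,
        "eur_per_1000_chars": ESTIMATED_VALUE_EUR / (total_chars / 1000),
        "lines_per_user": LINES_CORRECTED / USERS,
        "lines_per_article": LINES_CORRECTED / ARTICLES,
    }


if __name__ == "__main__":
    for name, value in crowd_figures().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the corrected text amounts to 350 million characters, so the 200,000 EUR estimate corresponds to roughly 0.57 EUR per 1,000 characters.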
What to do with OCR results?
Structural enhancement
– INEX: competition based on OCR files
– Functional Extension Parser
Preservation
– Complexity is significantly increased
– Output: TXT, PDF, ABBYY XML
– ALTO Format
– How to integrate user corrections?
– Proposal for extending the ALTO format
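To make the ALTO point more concrete, here is a minimal sketch of reading words, positions and word confidence from an ALTO file with Python's standard library. The embedded sample is hypothetical and heavily shortened; `CONTENT`, `HPOS`, `VPOS` and `WC` are attributes of the ALTO `String` element, while the idea of flagging low-confidence words as candidates for user correction, and the 0.5 threshold, are assumptions of this example and not part of any proposed extension.

```python
# Minimal sketch: extract words and word confidence from ALTO XML.
# The sample document is hypothetical and shortened for illustration.
import xml.etree.ElementTree as ET

SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="OCR" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="20" WC="0.98"/>
            <SP/>
            <String CONTENT="in" HPOS="170" VPOS="200" WIDTH="25" HEIGHT="20" WC="0.95"/>
            <SP/>
            <String CONTENT="libranes" HPOS="205" VPOS="200" WIDTH="110" HEIGHT="20" WC="0.41"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml: str) -> list:
    """Return one dict per String element with text, position and confidence."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) != "String":
            continue
        wc = element.get("WC")
        words.append({
            "text": element.get("CONTENT", ""),
            "hpos": float(element.get("HPOS", "0")),
            "vpos": float(element.get("VPOS", "0")),
            "confidence": float(wc) if wc is not None else None,
        })
    return words


if __name__ == "__main__":
    words = extract_words(SAMPLE_ALTO)
    print(" ".join(w["text"] for w in words))
    # Hypothetical threshold: words below 0.5 are offered for correction.
    print([w["text"] for w in words if (w["confidence"] or 0) < 0.5])
```

Because every word keeps its coordinates, a correction interface can show the relevant image snippet next to the text, which is what makes line-by-line user correction practical.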
Digital library applications
Full-text search
– JSTOR, Google, publishers
– Faceted search (Solr)
Indexing through search engines
– Site XML
Visibility of the OCR text
– User training (by doing)
– Necessary if user correction is to be included
New research fields
– Text mining
– Linking of texts
– Near duplicates, similarity and new identifiers
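For indexing through search engines, one common mechanism is an XML sitemap listing the pages a crawler should visit. The sketch below assumes that "Site XML" refers to such a sitemap following the Sitemaps protocol; the URLs are hypothetical placeholders.

```python
# Minimal sketch: build an XML sitemap for digitised documents.
# Assumption: "Site XML" means a sitemap as defined by the Sitemaps protocol.
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list) -> str:
    """Return a sitemap document with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical document URLs of a digital library.
    print(build_sitemap([
        "https://example.org/newspapers/1808/issue-1",
        "https://example.org/newspapers/1808/issue-2",
    ]))
```

Each listed page still has to expose the OCR text in its HTML for the full text to become searchable through external search engines.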
Summary
OCR is a „must“
– For documents of the 19th and 20th centuries OCR generally provides
useful or even very good results
– Before 1800: improvements can be expected from IMPACT
– Careful and exact scanning is always the main prerequisite, preferably
at 400 ppi and 8 or 24 bit
– Test runs with random sets
Modern applications
– Full-text search
– Visibility of the erroneous text
– Options for correcting the text by users
– Several export formats (also for end-users)
– Site XML for search engines
Thank you for your attention!