How can Optical Character 
Recognition technology help users in 
their research? 
Günter Mühlberger, Innsbruck University 
Digitisation and Digital Preservation group
Agenda 
Part 1: Optical Character Recognition – Some basics 
Part 2: Users – The Unknown Creature? 
Part 3: Some ideas!
Part 1 
Some basics on 
Optical Character Recognition (OCR) 
(a story about errors…) 
Berufsgenossenschaften 
IMPACT 
EVA/MIN 
ERVA 
12th Nov. 
2008 
Digitisation and OCR 
• Digitisation of historical printed material 
• Google: Billions of files, libraries: Millions of files 
• Google Books: Would never have started without full-text 
• BnF: Partner in EU Project METADATA ENGINE (2000-2003, ABBYY Historical OCR) 
• OCR quality 
• There is little reliable data on the accuracy of OCR on large-scale datasets 
• E.g. we do not know how good the Google collection is as a whole, or per language, per 
century, decade or year, per text type, etc. 
• Simon Tanner (2009) 
• Evaluated OCR accuracy on British newspapers 
• Differences between newspapers are larger than differences between publishing dates 
• Overall we are speaking about a Word Error Rate (WER) of 10% to 40%, with an average of 22% 
for standard words and 31% for significant words 
• Evaluation done within the IMPACT project has shown similar figures
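The slides do not say how the Word Error Rate figures were computed. A minimal sketch, assuming the common definition (word-level edit distance between ground truth and OCR output, divided by the number of ground-truth words), could look like this. The function name and the sample sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # word missing in OCR output
                           cur[j - 1] + 1,       # extra word in OCR output
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the evaluated collections).
ground_truth = "die neue Zeitung erscheint heute"
ocr_output = "die nelle Zeitung erscheint heute"
print(word_error_rate(ground_truth, ocr_output))  # 1 of 5 words is wrong
```

Restricting the comparison to a chosen subset of words (for example, leaving out very common function words) is one possible way to look at "significant" words separately. How the cited evaluations defined that subset is not stated here.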
und wenn 
???
83.4% correct words for French in EU News 
Part 2 
Users 
(the unknown creature) 
(1) 
Occasional users 
Typology of users
Occasional users 
Occasional users – Google Analytics 
Occasional users 
•Occasional users 
• Come by chance or out of curiosity 
• Just type something in without real interest in the results 
• Developers of websites 
• Test users for new websites 
• Decision makers for digital library projects 
• More interested in features than in content 
(2) 
Researchers 
Researchers  Scholars 
Researchers 
•Definition 
• Anyone who is actually looking for specific content and 
invests a reasonable amount of time in the search 
• Professional researchers (e.g. historians,…) 
• Students (e.g. writing their thesis) 
• Family historians (e.g. searching for their family members) 
•Citizen scientists (e.g. writing Wikipedia articles) 
• Volunteers (e.g. contributing to improve OCR text) 
• Teachers (e.g. preparing lessons) 
• School pupils (e.g. doing their homework) 
• Etc. 
Researchers 
• Researchers are not searching a collection because they WANT 
to search the full-text – it is just a tool to satisfy their need for 
information! 
• Researchers are looking for answers to their specific questions! 
• Was my grandfather mentioned in the local newspaper when he returned 
from the First World War? 
• What was written about my village in 1870? 
• Is there interesting news about the French Revolution in a newspaper 
from Vienna in 1789? 
• How were companies advertising their products in 1750, 1850 and 1950 
in newspapers? 
• How did newspapers write about “sex and crime” in 1900? 
• How did people find new jobs in the early 19th century? 
What researchers are doing with their sources 
•Read articles 
• Researchers want to know what is written in an article 
•Download – Collect – Print out 
• Researchers are conservative and pragmatic in organising their 
work 
• Want to work on their own computers, want to read offline, etc. 
•Work 
• Collecting the material is just the beginning 
Annotate 
Excerpt 
Arrange 
Fill databases 
Analyse 
Draw conclusions 
Exchange with others 
Cite text 
Link sources 
Write publications 
And many, many other activities… 
(3) 
Machines 
Machines as users 
• Google 
• Is just the beginning (though an important one) 
• Facebook, LinkedIn, Academia.edu,… 
• Imagine you could see the social network affiliation of every 
Gallica user! 
• You would get the “social graph” of these users and therefore also see 
(understand) all connected users 
• Machines very much like 
• Rich data (machine generated) 
• Standardized formats (XML) 
• Normalized data 
• Clear distinction of metadata and content data 
• Permanent links 
• Open Data 
• … 
Part 3 
Some ideas… 
(1) 
Back to the sources! 
Ideas
Source criticism 
•Get to know your source! 
• Attitude of historians: Don’t trust your source! 
•Researchers need to know “What is my source? How 
reliable is it? What can I find, what not?” 
•Needs to be applied to OCR as well! 
• Simple information: 
• Average number of pages per day, month, year, decade, century 
• Number of words/articles on a page 
• Number of words missed on a page due to OCR errors 
• Etc… 
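Such "simple information" can be derived from very little data. The following is a minimal sketch, assuming a hypothetical list of page records consisting of an issue date and the number of words recognised on the page. The records and function names are invented for illustration and do not describe any particular library system.

```python
from collections import Counter
from datetime import date

# Hypothetical page records: (issue date, words recognised on the page).
pages = [
    (date(1870, 1, 1), 900),
    (date(1870, 1, 1), 1100),
    (date(1870, 1, 2), 1000),
    (date(1871, 1, 1), 1500),
]


def pages_per_year(records):
    """Count how many digitised pages exist for each year."""
    return Counter(issue_date.year for issue_date, _ in records)


def mean_words_per_page(records):
    """Average number of recognised words per page, grouped by year."""
    totals, counts = Counter(), Counter()
    for issue_date, words in records:
        totals[issue_date.year] += words
        counts[issue_date.year] += 1
    return {year: totals[year] / counts[year] for year in counts}


print(pages_per_year(pages))
print(mean_words_per_page(pages))
```

The same grouping works per day, month or decade by changing the key taken from the issue date.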
Tools 
•Users need to know more about the quantitative shape of the 
collection they are searching 
• The number of pages increases over the centuries 
• The number of words on a page increases until the 1950s 
• The number of photos increases from the 1920s onwards 
• The number of OCR errors (missed hits when searching) generally 
decreases over time, but depends on many other factors as well 
Mapping Texts – Univ. North Texas and Stanford 
(2) 
Natural involvement – search and correct! 
10-30% errors… 
•What does this mean for the researcher? 
• For reading a page, they have the original image 
• But when searching, OCR errors mean they will miss e.g. 20% of all 
occurrences of a search term! 
•Maybe acceptable for specific use cases, but surely not for 
humanities scholars or family historians: They want to get 
“all relevant occurrences” 
• What is “relevant” is decided by the user: some may be interested 
only in a specific time period, periodical, or collection of 
documents 
• Note: Not all words are equally frequent in all collections (“London” is rare in a 
Tyrolean newspaper collection, whereas it is frequent in 
a British newspaper collection) 
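To make the effect on searching concrete, here is a rough sketch of how many occurrences an exact-match search can be expected to find. It rests on two simplifying assumptions that are not from the slides: every word is recognised correctly with the same probability, and errors are independent of each other. The function name is hypothetical.

```python
def expected_hits(true_occurrences: int, word_accuracy: float,
                  query_words: int = 1) -> float:
    """Expected number of occurrences found by an exact-match search.

    Assumes each word is recognised correctly with probability
    `word_accuracy`, independently of all other words, so a query of
    `query_words` words only matches if every one of them is correct.
    """
    if not 0.0 <= word_accuracy <= 1.0:
        raise ValueError("word_accuracy must be between 0 and 1")
    if query_words < 1:
        raise ValueError("query_words must be at least 1")
    return true_occurrences * word_accuracy ** query_words


# A term occurring 100 times, searched in text with 20% word errors.
print(expected_hits(100, 0.8))                 # single-word query
print(expected_hits(100, 0.8, query_words=2))  # two-word phrase query
```

Under these assumptions a single-word query finds about 80 of 100 occurrences, and a two-word phrase about 64, so multi-word searches such as full names suffer more. Real error rates vary by word and by page, so these numbers are only an indication.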
Australian National Library 
Searching AND correcting 
• Let's combine searching and crowd-based correction! 
• Provide users with a powerful instrument to correct exactly 
the words they are interested in (i.e. searching for) 
•Relieve users from actually editing words, but let them just 
approve or reject the results of the OCR engine 
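The approve-or-reject idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `Snippet` class, the similarity cutoff and the use of Python's `difflib` to find look-alike OCR results are choices made for this example, not a description of the actual interface.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Snippet:
    snippet_id: int
    ocr_text: str                      # what the OCR engine recognised
    approved_as: Optional[str] = None  # word a user confirmed on the image


def candidates(snippets, query, cutoff=0.6):
    """Snippets whose OCR text equals the query or looks similar to it."""
    return [s for s in snippets
            if SequenceMatcher(None, query, s.ocr_text).ratio() >= cutoff]


def approve(shown, query, approved_ids):
    """Record the user's decision: approved snippets really show the query word."""
    for s in shown:
        if s.snippet_id in approved_ids:
            s.approved_as = query


def search(snippets, query):
    """Hits are snippets where the OCR text or a user approval matches."""
    return [s.snippet_id for s in snippets
            if s.ocr_text == query or s.approved_as == query]


snippets = [Snippet(1, "neue"), Snippet(2, "nelle"), Snippet(3, "Welle")]
shown = candidates(snippets, "neue")         # word images offered to the user
approve(shown, "neue", approved_ids={1, 2})  # user marks both images as "neue"
print(search(snippets, "neue"))
```

The user never types a correction. Approving the word image for "nelle" is enough to make it a hit for "neue" in later searches, which is also how other users would benefit.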
Interface with Word Snippets 
OCR errors 
neue nelle neue nelle
Select correct word images = green = approved 
Consequences 
•The user corrects exactly the words they are looking for 
• Together with an annotation tool, they will be able to find ALL 
OCCURRENCES of a search term and e.g. tag them as 
important, less important, etc. 
•Other users will see, and benefit from, the corrections carried out 
by another user 
• An export feature that collects all occurrences in one 
PDF would be a next step… 
(3) 
Knowledge based searching 
What users get now with full-text searching 
What they would like to get: Overview 
AND detail 
Named Entities and Wikipedia Linking 
Search for “Vranitzky” 
Number of hits in full-text 
and on article level 
List of Persons, Institutions and geographical 
Names appearing in the articles with “Vranitzky”
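A list of entities appearing alongside a search term can be built once named entities have been extracted per article. The sketch below assumes hypothetical article records that already carry such annotations. The sample texts, field names and function name are invented for illustration.

```python
from collections import Counter

# Hypothetical articles with pre-extracted named entities as (type, name).
articles = [
    {"id": "a1", "text": "Vranitzky and Schüssel in parliament",
     "entities": [("person", "Franz Vranitzky"),
                  ("person", "Wolfgang Schüssel"),
                  ("institution", "Parliament")]},
    {"id": "a2", "text": "Vranitzky in Vienna",
     "entities": [("person", "Franz Vranitzky"), ("place", "Vienna")]},
    {"id": "a3", "text": "Weather in Innsbruck",
     "entities": [("place", "Innsbruck")]},
]


def cooccurring_entities(articles, query):
    """Return the number of matching articles and, for each entity,
    in how many of those articles it appears."""
    hits = [a for a in articles if query.lower() in a["text"].lower()]
    counts = Counter(e for a in hits for e in set(a["entities"]))
    return len(hits), counts


n_hits, counts = cooccurring_entities(articles, "Vranitzky")
print(n_hits)
for (kind, name), n in counts.most_common():
    print(kind, name, n)
```

Grouping the counts by entity type gives the separate lists of persons, institutions and geographical names described on the slide.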
Named Entities integrated into search interface 
Search for “Vranitzky” AND 
“Wolfgang Schüssel”
Article about Schüssel AND Vranitzky 
Wikipedia Categories 
Search for “Vranitzky” also retrieves 
(1) the fact that it is the person 
“Franz Vranitzky” 
(2) the categories in Wikipedia of 
this person
Utilizing Wikipedia Knowledge 
Search for 
“Bundeskanzler_Österreich” 
(Chancellor of Austria) retrieves 
(1) all other chancellors of 
Austria appearing in the 
newspaper 
(2) all articles connected with 
this category
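Category-based searching can be sketched as a two-step lookup: expand the category to its members, then find the articles mentioning any of them. The two mappings below are hypothetical stand-ins for data that would come from Wikipedia categories and from entity linking. The category key simply reuses the string shown on the slide.

```python
# Hypothetical data: category membership and linked persons per article.
category_members = {
    "Bundeskanzler_Österreich": {
        "Franz Vranitzky", "Wolfgang Schüssel", "Bruno Kreisky",
    },
}
article_persons = {
    "a1": {"Franz Vranitzky"},
    "a2": {"Wolfgang Schüssel", "Franz Vranitzky"},
    "a3": {"Some Journalist"},
}


def search_by_category(category, category_members, article_persons):
    """Return the category members found in the collection and the
    ids of all articles mentioning at least one of them."""
    members = category_members.get(category, set())
    found_persons, matching_articles = set(), []
    for article_id, persons in sorted(article_persons.items()):
        overlap = persons & members
        if overlap:
            matching_articles.append(article_id)
            found_persons |= overlap
    return found_persons, matching_articles


persons, matches = search_by_category(
    "Bundeskanzler_Österreich", category_members, article_persons)
print(sorted(persons))
print(matches)
```

Members of the category that never appear in the collection (here Bruno Kreisky) are not returned, which matches the slide's wording of chancellors "appearing in the newspaper".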
The new Encyclopédie  Gallica 
(4) 
Let machines play – and learn! 
Let ‘em play! 
• Progress/Innovation 
• = Computer Scientists + User needs + Data (from libraries) 
• Computer Science 
• Breakthroughs in face and speech recognition, big data analysis, recommender systems and 
information retrieval are based on statistical methods! 
• Statistical algorithms need data 
• Metadata are not enough (though important)! 
• Sample data are not enough! 
• The more data the better 
• An example 
• If you have 10 million digitised newspaper pages published over 200 years, how many pages 
do you have on average per day? 
• 136! 
• We have done 2 million pages for the BnF within EU Newspapers! 
• The easier the data is to access, the better! 
• Download (simple, easy, fast, cheap!) 
• Nice to have: APIs and dedicated web-services (something for real experts) 
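The figure of 136 pages per day in the example above follows from simple division. A small sketch, assuming an average year length of 365.25 days (the slide does not state which value was used):

```python
def average_pages_per_day(total_pages: int, years: int,
                          days_per_year: float = 365.25) -> float:
    """Average number of pages per calendar day over the whole period."""
    return total_pages / (years * days_per_year)


per_day = average_pages_per_day(10_000_000, 200)
print(int(per_day))  # whole pages per day, rounded down
```

With 365.25 days per year the exact value is just under 137, so 136 is the result rounded down. For a statistical method this is a small sample per day, which is the point of "the more data the better".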
What machines (computer scientists) can do… 
• Information extraction 
• Get names of persons, locations, 
• Images within printed text (photos…) 
• Book titles (reviewed), theatre plays, advertisements,… 
• But also: facts about car accidents, sex and crime, stock exchange rates, 
• And: Sentiment analysis… 
• Linking of text with external sources 
• A lot of the information in (historical) newspapers is covered much better 
elsewhere 
• Start of World War I 
• Dreyfus affair 
• German “Reichstagswahl” in March 1933 
• Wikipedia was just a simple example… 
Machine learning 
Thank you for your attention! 
Contact: Günter Mühlberger 
<guenter.muehlberger@uibk.ac.at> 
www.europeana-newspapers.eu

Field Attribute Index Feature in Odoo 17
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 

Présentation Günter Mühlberger, BnF Information Day

  • 1. How can Optical Character Recognition technology help users in their research? Günter Mühlberger, Innsbruck University Digitisation and Digital Preservation group
  • 2. Agenda Part 1: Optical Character Recognition – Some basics Part 2: Users – The Unknown Creature? Part 3: Some ideas!
  • 3. Part 1: Some basics on Optical Character Recognition (OCR) (a story about errors…)
  • 5. Digitisation and OCR • Digitisation of historical printed material • Google: billions of files; libraries: millions of files • Google Books would never have started without full text • BnF: partner in the EU project METADATA ENGINE (2000-2003, ABBYY Historical OCR) • OCR quality • There is little reliable data on the accuracy of OCR on large-scale datasets • E.g. we do not know how good the Google collection is as a whole, or per language, per century, decade or year, per text type, etc. • Simon Tanner (2009) • Evaluated OCR accuracy on British newspapers • Differences between newspapers are stronger than differences between publishing dates • Overall, the Word Error Rate (WER) ranges from 10% to 40%, with an average of 22% for standard words and 31% for significant words • Evaluation done within the IMPACT project has shown similar figures
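The Word Error Rate quoted on slide 5 is commonly computed as the word-level edit distance between the OCR output and a ground-truth transcription, divided by the number of ground-truth words. The following is a minimal sketch of that common definition, not necessarily the exact procedure used by Tanner or by IMPACT; the sample strings are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than in the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing in OCR
                            curr[j - 1] + 1,      # extra word in OCR output
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: two of five words are misrecognised.
ground_truth = "die neue Zeitung von heute"
ocr_output = "die nelle Zeitung von hcute"
wer = word_error_rate(ground_truth, ocr_output)
print(f"WER: {wer:.0%}")
```

Restricting both inputs to "significant" words (for example by dropping stop words first) would give the second kind of figure mentioned on the slide; how exactly significant words were selected in the cited study is not stated here.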
  • 6. IMPACT EVA/MIN ERVA 12th Nov. 2008 6 und wenn ???
  • 7. 83.4% correct words for French in EU News
  • 8. Part 2: Users (the unknown creature)
  • 9. Typology of users: (1) Occasional users
  • 11. Occasional users – Google Analytics
  • 12. Occasional users • Come by coincidence or curiosity • Just typing in something without real interest in the results • Developers of websites • Test users for new websites • Decision makers for digital library projects • More interested in features than in content
  • 13. Typology of users: (2) Researchers
  • 15. Researchers • Definition • Anyone who is actually looking for some specific content and invests a reasonable amount of time in the search • Professional researchers (e.g. historians, …) • Students (e.g. writing their thesis) • Family historians (e.g. searching for their family members) • Citizen scientists (e.g. writing Wikipedia articles) • Volunteers (e.g. contributing to improve OCR text) • Teachers (e.g. preparing lessons) • School pupils (e.g. doing their homework) • Etc.
  • 16. Researchers • Researchers are not searching a collection because they WANT to search the full text – it is just a tool to satisfy their need for information! • Researchers are looking for answers to their specific questions! • Was my grandfather mentioned in the local newspaper when he returned from the First World War? • What was written about my village in 1870? • Is there interesting news about the French Revolution in a newspaper from Vienna in 1789? • How did companies advertise their products in newspapers in 1750, 1850 and 1950? • How did newspapers write about “sex and crime” in 1900? • How did people find new jobs in the early 19th century?
  • 17. What researchers are doing with their sources • Read articles • Researchers want to know what is written in an article • Download – Collect – Print out • Researchers are conservative and pragmatic in organising their work • They want to work on their own computers, read offline, etc. • Work • Collecting the material is just the beginning
  • 28. And many, many other activities…
  • 29. Typology of users: (3) Machines
  • 31. Machines as users • Google • Is just the beginning (though an important one) • Facebook, LinkedIn, Academia.edu, … • Imagine you could see the social network affiliation of all Gallica users! • You would get the “social graph” of these users and therefore also see (and understand) all connected users • Machines very much like • Rich data (machine generated) • Standardized formats (XML) • Normalized data • Clear distinction between metadata and content data • Permanent links • Open Data • …
  • 32. Part 3: Some ideas…
  • 33. Ideas: (1) Back to the sources!
  • 34. Source criticism • Get to know your source! • Attitude of historians: Don’t trust your source! • Researchers need to know: “What is my source? How reliable is it? What can I find, and what not?” • This needs to be applied to OCR as well! • Simple information: • Average number of pages per day, month, year, decade, century • Number of words/articles on a page • Number of words missed on a page due to OCR errors • Etc.
  • 35. Tools • Users need to know more about the quantitative shape of the collection they are searching • The number of pages increases over the centuries • The number of words on a page increases until the 1950s • The number of photos increases from the 1920s onwards • The number of OCR errors (missed hits when searching) generally decreases, but depends on many other factors as well
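The quantitative overview asked for on slides 34 and 35 could be derived from simple per-page records. This is a minimal sketch, assuming a hypothetical record layout (`PageRecord` with a year, a word count and an estimated number of misrecognised words) that is not taken from any real library system; the numbers are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PageRecord:          # hypothetical layout, for illustration only
    year: int
    words: int             # words recognised on the page
    suspect_words: int     # words the OCR engine probably got wrong


def profile_by_decade(pages):
    """Return {decade: (page_count, avg_words_per_page, share_of_suspect_words)}."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page.year // 10 * 10].append(page)
    profile = {}
    for decade, group in sorted(grouped.items()):
        total_words = sum(p.words for p in group)
        total_suspect = sum(p.suspect_words for p in group)
        profile[decade] = (
            len(group),
            total_words / len(group),
            total_suspect / total_words if total_words else 0.0,
        )
    return profile


# Invented sample data.
sample = [
    PageRecord(1871, 1800, 540),
    PageRecord(1875, 2200, 460),
    PageRecord(1923, 3000, 300),
]
profile = profile_by_decade(sample)
for decade, (count, avg_words, suspect_share) in profile.items():
    print(f"{decade}s: {count} pages, {avg_words:.0f} words/page, "
          f"{suspect_share:.0%} suspect words")
```

Showing such a profile next to the search box would let a researcher judge, per decade, how much of the collection a full-text query can actually reach.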
  • 36. Mapping Texts – Univ. North Texas and Stanford
  • 37. Ideas: (2) Natural involvement – search and correct!
  • 38. 10-30% errors… • What does this mean for the researcher? • For reading a page they have the original image • But simply because the OCR has errors, they will miss e.g. 20% of all occurrences of a search term! • This may be acceptable for specific use cases, but surely not for humanities scholars or family historians: they want to get “all relevant occurrences” • What is “relevant” is decided by the user; some may be interested only in a specific time period, periodical, or collection of documents • Note: Not all words are frequent in all collections (“London” is rare in a Tyrolean newspaper collection, whereas it is frequent in a British newspaper collection)
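The arithmetic behind the "20%" on slide 38 can be made explicit. This is a rough sketch, assuming that every occurrence of the search term is misrecognised independently and with the same probability; that is a simplification, since error rates vary between collections and between words.

```python
def expected_hits(true_occurrences: int, word_error_rate: float) -> tuple[float, float]:
    """Return (expected hits found, expected hits missed) for an exact-match search."""
    if not 0.0 <= word_error_rate <= 1.0:
        raise ValueError("word_error_rate must be between 0 and 1")
    missed = true_occurrences * word_error_rate
    return true_occurrences - missed, missed


# Invented example: a term that really occurs 500 times in the collection.
for rate in (0.10, 0.20, 0.30):
    found, missed = expected_hits(500, rate)
    print(f"WER {rate:.0%}: about {found:.0f} of 500 occurrences found, "
          f"{missed:.0f} missed")
```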
  • 41. Searching AND correcting • Let’s combine searching and crowd-based correction! • Provide users with a powerful instrument to correct exactly those words they are interested in (searching for) • Relieve users from actually editing words; let them just approve or reject the results of the OCR engine
  • 42. Interface with Word Snippets
  • 43. OCR errors: neue nelle neue nelle
  • 44. Select correct word images (green = approved)
  • 45. Consequences • The user corrects exactly those words they are looking for • Together with an annotation tool, they will be able to find ALL OCCURRENCES of a search term and e.g. tag them as important, less important, etc. • Other users will see and benefit from the corrections carried out by another user • An export feature that puts all occurrences together in one PDF would be a next step…
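The approve-or-reject workflow from slides 41 to 45 can be sketched as a small data model. All names here (`WordSnippet`, `search`, `review`, `confirmed_hits`) are hypothetical and only illustrate the idea; they do not describe an existing system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordSnippet:                    # hypothetical record for one word image
    page_id: str
    ocr_text: str                     # what the OCR engine read
    approved: Optional[bool] = None   # None = not yet reviewed by any user


def search(snippets, term):
    """Exact-match search on the OCR text, as a plain full-text index would do."""
    return [s for s in snippets if s.ocr_text == term]


def review(snippet, is_correct):
    """Record a user's decision; the user approves or rejects, never edits text."""
    snippet.approved = is_correct


def confirmed_hits(snippets, term):
    """Hits for a term that some user has approved, visible to all later users."""
    return [s for s in search(snippets, term) if s.approved]


snippets = [
    WordSnippet("p1", "neue"),
    WordSnippet("p2", "neue"),
    WordSnippet("p3", "Zeitung"),
]
hits = search(snippets, "neue")
review(hits[0], True)    # the word image really shows "neue"
review(hits[1], False)   # the word image shows something else; the OCR was wrong
print([s.page_id for s in confirmed_hits(snippets, "neue")])
```

Because the decision is stored on the snippet rather than per user, every correction made during one person's search is available to the next person searching for the same term.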
  • 46. Ideas: (3) Knowledge-based searching
  • 47. What users get now with full-text searching
  • 48. What they would like to get: Overview AND detail
  • 49. Named Entities and Wikipedia Linking • Search for “Vranitzky” • Number of hits in full text and on article level • List of persons, institutions and geographical names appearing in the articles with “Vranitzky”
  • 50. Named Entities integrated into search interface • Search for “Vranitzky” AND “Wolfgang Schüssel”
  • 51. Article about Schüssel AND Vranitzky
  • 52. Wikipedia Categories • Search for “Vranitzky” also retrieves (1) the fact that it is the person “Franz Vranitzky” (2) the categories of this person in Wikipedia
  • 53. Utilizing Wikipedia Knowledge • Search for “Bundeskanzler_Österreich” (chancellor of Austria) retrieves (1) all other chancellors of Austria appearing in the newspaper (2) all articles connected with this category
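The entity overview on slides 49 to 53 can be sketched as a co-occurrence count over articles that already carry named-entity annotations. The article records, entity lists and category mapping below are invented for illustration; a real system would obtain them from a named-entity recogniser and from Wikipedia.

```python
from collections import Counter

# Hypothetical, pre-annotated articles: id -> set of entity names found in the text.
articles = {
    "a1": {"Franz Vranitzky", "Wolfgang Schüssel", "Wien"},
    "a2": {"Franz Vranitzky", "Wien"},
    "a3": {"Wolfgang Schüssel", "Innsbruck"},
}
# Hypothetical mapping from a Wikipedia-style category to its members.
categories = {"Bundeskanzler_Österreich": {"Franz Vranitzky", "Wolfgang Schüssel"}}


def articles_with(*entities):
    """Ids of articles that mention all of the given entities."""
    wanted = set(entities)
    return sorted(aid for aid, found in articles.items() if wanted <= found)


def co_occurring(entity):
    """Count the other entities appearing in articles that mention `entity`."""
    counts = Counter()
    for found in articles.values():
        if entity in found:
            counts.update(found - {entity})
    return counts


def articles_in_category(category):
    """Ids of articles that mention at least one member of the category."""
    members = categories[category]
    return sorted(aid for aid, found in articles.items() if members & found)


print(articles_with("Franz Vranitzky", "Wolfgang Schüssel"))
print(co_occurring("Franz Vranitzky").most_common())
print(articles_in_category("Bundeskanzler_Österreich"))
```

The first call corresponds to the combined search on slide 50, the second to the entity list on slide 49, and the third to the category search on slide 53.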
  • 54. The new Encyclopédie  Gallica
  • 55. Ideas: (4) Let machines play – and learn!
  • 56. Let ’em play! • Progress/Innovation • = Computer scientists + user needs + data (from libraries) • Computer Science • Breakthroughs in face and speech recognition, big data analysis, recommender systems and information retrieval are based on statistical methods! • Statistical algorithms need data • Metadata are not enough (though important)! • Sample data are not enough! • The more data the better • An example • If you have 10 million digitized newspaper pages published within 200 years, how many pages do you have on average per day? • 136! • We have done 2 million pages for BnF within EU Newspapers! • The easier it is to access the data, the better! • Download (simple, easy, fast, cheap!) • Nice to have: APIs and dedicated web services (something for real experts)
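The figure on slide 56 follows from a simple division. A quick check, assuming 365 days per year and ignoring leap days:

```python
pages = 10_000_000
years = 200
days = years * 365              # leap days ignored
pages_per_day = pages / days
print(f"{pages_per_day:.1f} pages per day on average")
```

The result is just under 137, which matches the 136 quoted on the slide when rounded down.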
  • 57. What machines (computer scientists) can do… • Information extraction • Get names of persons, locations, … • Images within printed text (photos…) • Book titles (reviewed), theatre plays, advertisements, … • But also: facts about car accidents, sex and crime, stock exchange rates, … • And: Sentiment analysis… • Linking of text with external sources • A lot of the information in (historical) newspapers can be found elsewhere in a much better way • Start of World War I • Dreyfus Affair • German “Reichstagswahl” in March 1933 • Wikipedia was just a simple example…
  • 59. Thank you for your attention! Contact: Günter Mühlberger <guenter.muehlberger@uibk.ac.at> www.europeana-newspapers.eu