Scanned texts as corpora — a case study
Janusz S. Bień
Formal Linguistics Department, University of Warsaw
SLAVICORP. CORPORA OF SLAVIC LANGUAGES
University of Warsaw, 22-23 November 2010
(presented by Alicja Wójcicka)
http://bc.klf.uw.edu.pl/173/ 1/34
Preliminaries
Absence excuse
IMPACT (http://www.impact-project.eu/)
All Staff Meeting, Alicante, Spain, 23–25 November 2010
Acknowledgment
Digitalization tools for philological research
The Ministry of Science and Higher Education’s grant
no. N N519 384036
May 2009 — November 2011
Janusz S. Bień (project leader), Jakub Wilk and others
A result:
Lexicographical search engine
http://poliqarp.wbl.klf.uw.edu.pl/
DjVu
DjVu and DjVuLibre
Yann Le Cun, Léon Bottou, Patrick Haffner, and Paul G. Howard
1996
What is DjVu? More than just a format for scans. . .
an image compression technique, a document format,
and a software platform for delivering documents images
over the Internet
OCR, searching and indexing
DjVu pages can contain a "hidden text" chunk which
includes the recognized text as well as the coordinates of
each word on the page in a compressed form.
Quoted from:
http://leon.bottou.org/papers/lecun-2001
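To make the hidden text layer more concrete: DjVuLibre's djvused tool can dump it (print-txt command) as nested parenthesised records, each with a bounding box and, for words, the recognized string. The sketch below assumes that output shape and parses a small hand-made sample. The sample text, the coordinates and the helper names are illustrative assumptions, not data from a real scan.

```python
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


# Hand-made sample in the shape assumed for a hidden-text dump (hypothetical data).
SAMPLE = '''
(page 0 0 2480 3508
 (line 300 3100 900 3160
  (word 300 3100 520 3160 "skarga")
  (word 560 3100 900 3160 "fkarga")))
'''

# One word record: four integer coordinates followed by a quoted string.
WORD_RE = re.compile(r'\(word (\d+) (\d+) (\d+) (\d+) "((?:[^"\\]|\\.)*)"\)')


def extract_words(dump: str) -> List[Word]:
    """Collect every word record with its bounding box.

    Escape sequences inside the quoted string are kept as-is, not decoded.
    """
    return [
        Word(m.group(5), *(int(m.group(i)) for i in range(1, 5)))
        for m in WORD_RE.finditer(dump)
    ]


words = extract_words(SAMPLE)
for w in words:
    print(w)
```

The coordinates are what allows a search engine to highlight a hit directly on the page image.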
Some design principles
Action                Real-world equivalent   Acceptable delay
Zooming/Panning       Moving the eyes         Immediate
Next/Previous Page    Turning a page          < 1 second
Random Page Access    Finding a page          < 3 seconds
Quoted from:
http://leon.bottou.org/papers/lecun-2001
GNU GPL
GNU General Public License
4 freedoms (http://www.gnu.org/philosophy/free-sw.html):
The freedom to run the program, for any purpose.
The freedom to study how the program works,
and adapt it to your needs.
The freedom to redistribute copies
so you can help your neighbor.
The freedom to improve the program,
and release your improvements to the public,
so that the whole community benefits.
DjVu and GPLed tools
DjVuLibre
Open Source DjVu library and viewer
maintained by the original inventors of DjVu
http://djvu.sourceforge.net/
Jakub Wilk’s software
pdf2djvu
(http://code.google.com/p/pdf2djvu/)
Debian/Ubuntu GNU/Linux, . . . , MS Windows
ocrodjvu
(http://jwilk.net/software/ocrodjvu)
Debian/Ubuntu GNU/Linux, . . .
djvusmooth, didjvu
cf. http://jwilk.net/software/
Poliqarp for DjVu
A new DjVu search engine needed
The goal
Efficient search in the results of dirty OCR
(Optical Character Recognition without proof-reading)
Highlighting the hits on the page images
Existing solutions
closed source
(e.g. http://www.global-language.com/CENTURY/)
not extensible
(e.g. http://jssindex.sourceforge.net/)
queries not powerful enough
Poliqarp
Polyinterpretation Indexing Query and Retrieval Processor
Open source (GNU GPL)
set of tools for searching large corpora:
http://poliqarp.sourceforge.net/
Originally developed for The IPI PAN Corpus
(http://korpus.pl/).
Now used for The National Corpus of Polish
(http://nkjp.pl/).
Notable features:
polyinterpretation and two-level regular expressions.
Poliqarp for DjVu
An extension of Poliqarp
User requirements specified by Janusz S. Bień.
Implemented and maintained by Jakub Wilk.
Operational since December 2009
at http://poliqarp.wbl.klf.uw.edu.pl/
At present supports 4 large dictionaries
(including a few digitally born volumes).
Poliqarp for DjVu — welcome screen
A dictionary (a gazetteer: słownik geograficzny)
A non-DjVu version of the gazetteer at ICM UW
Another dictionary (‘słownik warszawski’)
A query (in ‘słownik warszawski’)
A context with metadata (in ‘słownik warszawski’)
A hit (in ‘słownik warszawski’)
Poliqarp for DjVu
Primary design goal achieved
Hits are linked to page images with highlighting
(a DjVu viewer required).
Essential new features
Hits can be uniquely bookmarked
(with some Web browsers).
Concordances can be displayed in the graphical mode
(no DjVu viewer required).
Graphical concordances
Piotr Wierzchoń’s suggestion (11 Dec 2009)
Graphical concordances (in ‘słownik warszawski’)
Settings
Regular expressions
Poliqarp
Reference and tutorials
Adam Przepiórkowski (2004)
The IPI PAN Corpus: Preliminary Version
http://nlp.ipipan.waw.pl/~adamp/Papers/2004-corpus/
Adam Przepiórkowski, Aleksander Buczyński, Jakub Wilk (2010)
The National Corpus of Polish Cheatsheet
http://nkjp.pl/poliqarp/help/en.html
Character equivalence (locale dependent)
Equivalence classes can be used only in bracket expressions!
Some examples (The dictionary of the 16th century Polish)
"[[=s=]f]k[[=a=]]rg[[=a=]]" within body
"[[=s=]]k[[=a=]]rg[[=a=]]" meta vol=xxxi
"[[=s=]]k[[=a=]]rg[[=a=]]" within body meta vol=xxxi
"[[=s=]f]k[[=a=]]rg[[=a=]]" within body
The dictionary of the 16th century Polish
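The effect of an equivalence class such as [[=a=]] can be approximated outside Poliqarp. Python's re module has no equivalence-class syntax, so the sketch below swaps in a different technique: it groups letters by the first code point of their canonical Unicode decomposition and builds an ordinary bracket expression from the group. The alphabet and the helper names are assumptions made for illustration; the real behaviour depends on the locale used by the search engine.

```python
import re
import unicodedata

# Assumed alphabet; NFC guarantees precomposed letters regardless of file encoding.
ALPHABET = unicodedata.normalize("NFC", "aąábcćdeęéfghijklłmnńoóprsśtuwyzźż")


def base_letter(ch: str) -> str:
    """First code point of the canonical decomposition, e.g. 'ą' -> 'a'."""
    return unicodedata.normalize("NFD", ch)[0]


def equivalence_class(letter: str, alphabet: str = ALPHABET) -> str:
    """All letters of the alphabet that share the given base letter."""
    return "".join(sorted(c for c in alphabet if base_letter(c) == letter))


s_class = equivalence_class("s")
a_class = equivalence_class("a")

# Rough counterpart of "[[=s=]f]k[[=a=]]rg[[=a=]]": the extra 'f' in the first
# bracket also accepts tokens that begin with 'f' instead of 's'.
pattern = re.compile(f"[{s_class}f]k[{a_class}]rg[{a_class}]")

for token in ["skarga", "fkarga", "skárgá", "ikarga"]:
    print(token, bool(pattern.fullmatch(token)))
```

Including a plain letter next to an equivalence class in one bracket expression is what makes the query tolerant of both spelling variants and a specific recognition error.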
OCR in The dictionary of the 16th century Polish
"[[=s=]]k[[=a=]]rg[[=a=]]" meta vol=xxxi
The dictionary of the 16th century Polish, only digitally born volumes
"[[=s=]]k[[=a=]]rg[[=a=]]" within body meta vol=xxxi
The dictionary of the 16th century Polish, only entries in digitally born volumes
Character references (in Poliqarp [for DjVu])
Unicode standard
www.unicode.org
Version 6.0.0 of 11th October 2010
Escape sequences
\x5c
REVERSE SOLIDUS,
\u1E83
LATIN SMALL LETTER W WITH ACUTE,
\U00010300
OLD ITALIC LETTER A
(Supplementary Multilingual Plane).
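Python string literals accept the same three escape forms (\xhh, \uhhhh and \Uhhhhhhhh), so the character names listed here can be checked with the standard unicodedata module. This is only a cross-check in Python, not a description of how Poliqarp processes the escapes.

```python
import unicodedata

# Two, four and eight hex digits respectively.
samples = ["\x5c", "\u1E83", "\U00010300"]

names = [unicodedata.name(ch) for ch in samples]
for ch, name in zip(samples, names):
    print(f"U+{ord(ch):04X} {name}")
```

The eight-digit form is needed for the last character because its code point lies beyond U+FFFF.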
Character class references (locale dependent)
Character classes
[:alnum:],
[:alpha:],
[:blank:],
[:cntrl:],
[:digit:],
. . .
Bracket expressions
Character classes can be used only in bracket expressions!
An example (Linde’s dictionary)
Syr "." "[^[:digit:]].*"
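One way to read the example query is as a sequence of three token patterns: the word Syr, a one-character token, and a token that does not start with a digit. The sketch below models that reading in Python under simplifying assumptions: every pattern must match a whole token, and [0-9] stands in for [:digit:], which Python's re does not support. The token list and the function name are made up for illustration.

```python
import re
from typing import List

# One regular expression per query position (simplified model of the query).
QUERY = ["Syr", r".", r"[^0-9].*"]


def find_hits(tokens: List[str], query: List[str]) -> List[int]:
    """Return start indexes where each token fully matches its pattern."""
    compiled = [re.compile(p) for p in query]
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        window = tokens[start:start + len(query)]
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            hits.append(start)
    return hits


# Hypothetical token stream: the first "Syr ." is followed by a number.
tokens = ["Syr", ".", "12", ",", "Syr", ".", "Jak", "mówi"]
print(find_hits(tokens, QUERY))
```

Only the second occurrence is reported, because the third pattern rejects the token that begins with a digit.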
Syr "."
Linde’s dictionary
Syr "." "[^[:digit:]].*"
Linde’s dictionary
Final remark
Thank you for your attention!
The present slides are available at
http://bc.klf.uw.edu.pl/173/
Contact: jsbien@uw.edu.pl