This document discusses using scanned texts as corpora and searching them with the Poliqarp search engine. It describes how DjVu and DjVuLibre were used to compress scanned documents while still allowing text extraction and searching. It also explains how Poliqarp was adapted to index scanned texts in DjVu format, highlight search results directly in the scans, and display concordance lines. Regular expressions are also discussed as a way to improve searching in noisy OCR text.
Scanned texts as corpora - a case study
1. Scanned texts as corpora — a case study
Janusz S. Bień
Formal Linguistics Department, University of Warsaw
SLAVICORP. CORPORA OF SLAVIC LANGUAGES
University of Warsaw, 22-23 November 2010
(presented by Alicja Wójcicka)
http://bc.klf.uw.edu.pl/173/ 1/34
2. Scanned texts as corpora — a case study
Preliminaries
Absence excuse
IMPACT (http://www.impact-project.eu/)
All Staff Meeting, Alicante, Spain, 23–25 November 2010
3. Scanned texts as corpora — a case study
Preliminaries
Acknowledgment
Digitalization tools for philological research
The Ministry of Science and Higher Education’s grant
no. N N519 384036
May 2009 — November 2011
Janusz S. Bień (project leader), Jakub Wilk and others
A result:
Lexicographical search engine
http://poliqarp.wbl.klf.uw.edu.pl/
4. Scanned texts as corpora — a case study
DjVu
DjVu and DjVuLibre
Yann Le Cun, Léon Bottou, Patrick Haffner, and Paul G. Howard
1996
What is DjVu? More than just a format for scans...
an image compression technique, a document format,
and a software platform for delivering documents images
over the Internet
OCR, searching and indexing
DjVu pages can contain a "hidden text" chunk which
includes the recognized text as well as the coordinates of
each word on the page in a compressed form.
Quoted from:
http://leon.bottou.org/papers/lecun-2001
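The hidden text layer described above can be illustrated with a minimal sketch. It assumes the layer has already been dumped as an S-expression of the kind printed by DjVuLibre's djvused (print-txt), where each word carries xmin, ymin, xmax, ymax coordinates. The sample string, the Word type and both helper functions are hypothetical and only serve the illustration.

```python
import re
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


# Hypothetical sample of a hidden text dump: one page, one line, two words.
SAMPLE = '''
(page 0 0 2550 3300
 (line 300 3000 900 3060
  (word 300 3000 520 3060 "skarga")
  (word 560 3000 900 3060 "sirot")))
'''

# Matches (word xmin ymin xmax ymax "text"); escaped quotes inside the text
# are tolerated but not unescaped in this sketch.
WORD_RE = re.compile(
    r'\(word\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"((?:[^"\\]|\\.)*)"\s*\)'
)


def extract_words(sexpr):
    """Collect every word together with its bounding box."""
    return [
        Word(m.group(5), *map(int, m.groups()[:4]))
        for m in WORD_RE.finditer(sexpr)
    ]


def find_boxes(words, query):
    """Return the bounding boxes of all words equal to the query."""
    return [(w.xmin, w.ymin, w.xmax, w.ymax) for w in words if w.text == query]


words = extract_words(SAMPLE)
print(find_boxes(words, "skarga"))
```

Because every word keeps its coordinates, a search hit in the text can be mapped back to a rectangle on the page image.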
5. Scanned texts as corpora — a case study
DjVu
DjVu and DjVuLibre
Some design principles
Action               Real-world equivalent   Acceptable delay
Zooming/Panning      Moving the eyes         Immediate
Next/Previous Page   Turning a page          < 1 second
Random Page Access   Finding a page          < 3 seconds
Quoted from:
http://leon.bottou.org/papers/lecun-2001
6. Scanned texts as corpora — a case study
DjVu
GNU GPL
GNU General Public License
4 freedoms (http://www.gnu.org/philosophy/free-sw.html):
The freedom to run the program, for any purpose.
The freedom to study how the program works,
and adapt it to your needs.
The freedom to redistribute copies
so you can help your neighbor.
The freedom to improve the program,
and release your improvements to the public,
so that the whole community benefits.
7. Scanned texts as corpora — a case study
DjVu
DjVu and GPLed tools
DjVuLibre
Open Source DjVu library and viewer
maintained by the original inventors of DjVu
http://djvu.sourceforge.net/
Jakub Wilk’s software
pdf2djvu
(http://code.google.com/p/pdf2djvu/)
Debian/Ubuntu GNU/Linux, . . . , MS Windows
ocrodjvu
(http://jwilk.net/software/ocrodjvu)
Debian/Ubuntu GNU/Linux, . . .
djvusmooth, didjvu
cf. http://jwilk.net/software/
8. Scanned texts as corpora — a case study
Poliqarp for DjVu
A new DjVu search engine needed
The goal
Efficient search in the results of dirty OCR
(Optical Character Recognition without proof-reading)
Highlighting the hits on the page images
Existing solutions
closed source
(e.g. http://www.global-language.com/CENTURY/)
not extensible
(e.g. http://jssindex.sourceforge.net/)
queries not powerful enough
9. Scanned texts as corpora — a case study
Poliqarp for DjVu
Poliqarp
Polyinterpretation Indexing Query and Retrieval Processor
Open source (GNU GPL)
set of tools for searching large corpora:
http://poliqarp.sourceforge.net/
Originally developed for The IPI PAN Corpus
(http://korpus.pl/).
Now used for The National Corpus of Polish
(http://nkjp.pl/).
Notable features:
polyinterpretation and two-level regular expressions.
10. Scanned texts as corpora — a case study
Poliqarp for DjVu
Poliqarp for DjVu
An extension of Poliqarp
User requirements specified by Janusz S. Bień.
Implemented and maintained by Jakub Wilk.
Operational since December 2009
at http://poliqarp.wbl.klf.uw.edu.pl/
At present supports 4 large dictionaries
(including a few digitally born volumes).
11. Scanned texts as corpora — a case study
Poliqarp for DjVu
Poliqarp for DjVu — welcome screen
12. Scanned texts as corpora — a case study
Poliqarp for DjVu
A dictionary (a gazetteer — słownik geograficzny)
13. Scanned texts as corpora — a case study
Poliqarp for DjVu
A non-DjVu version of the gazetteer at ICM UW
14. Scanned texts as corpora — a case study
Poliqarp for DjVu
Another dictionary (‘słownik warszawski’)
15. Scanned texts as corpora — a case study
Poliqarp for DjVu
A query (in ‘słownik warszawski’)
16. Scanned texts as corpora — a case study
Poliqarp for DjVu
A context with metadata (in ‘słownik warszawski’)
17. Scanned texts as corpora — a case study
Poliqarp for DjVu
A hit (in ‘słownik warszawski’)
18. Scanned texts as corpora — a case study
Poliqarp for DjVu
Poliqarp for DjVu
Primary design goal achieved
Hits are linked to page images with highlighting
(a DjVu viewer required).
Essential new features
Hits can be uniquely bookmarked
(with some Web browsers).
Concordances can be displayed in the graphical mode
(no DjVu viewer required).
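Linking a hit to a highlighted region of a page image can be sketched as follows. This is only an illustration, assuming the viewer accepts CGI-style options after a djvuopts marker in the URL, with a page option and a highlight option given as x,y,w,h in page image coordinates (as the djview viewer from DjVuLibre documents). The function name and the example document URL are hypothetical, and the URLs actually generated by Poliqarp for DjVu may differ.

```python
def highlight_url(document_url, page, x, y, width, height):
    """Build a viewer URL that opens a page and highlights one rectangle.

    Assumes the viewer understands the 'djvuopts' convention with the
    'page' and 'highlight' options; both are assumptions of this sketch.
    """
    options = [
        f"page={page}",
        f"highlight={x},{y},{width},{height}",
    ]
    return f"{document_url}?djvuopts&" + "&".join(options)


# Hypothetical document URL and hit coordinates.
print(highlight_url("http://example.org/dictionary.djvu", 12, 100, 200, 350, 40))
```

Since the whole hit is encoded in the URL, such a link can also be stored as a bookmark.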
19. Scanned texts as corpora — a case study
Poliqarp for DjVu
Graphical concordances
Piotr Wierzchoń’s suggestion (11 Dec 2009)
20. Scanned texts as corpora — a case study
Poliqarp for DjVu
Graphical concordances (in ‘słownik warszawski’)
21. Scanned texts as corpora — a case study
Poliqarp for DjVu
Settings
22. Scanned texts as corpora — a case study
Regular expressions
Poliqarp
Reference and tutorials
Adam Przepiórkowski (2004)
The IPI PAN Corpus: Preliminary Version
http://nlp.ipipan.waw.pl/~adamp/Papers/2004-corpus/
Adam Przepiórkowski, Aleksander Buczyński, Jakub Wilk (2010)
The National Corpus of Polish Cheatsheet
http://nkjp.pl/poliqarp/help/en.html
23. Scanned texts as corpora — a case study
Regular expressions
Character equivalence (locale dependent)
Equivalence classes can be used only in bracket expressions!
Some examples (The dictionary of the 16th century Polish)
"[[=s=]f]k[[=a=]]rg[[=a=]]" within body
"[[=s=]]k[[=a=]]rg[[=a=]]" meta vol=xxxi
"[[=s=]]k[[=a=]]rg[[=a=]]" within body meta vol=xxxi
24. Scanned texts as corpora — a case study
Regular expressions
"[[=s=]f]k[[=a=]]rg[[=a=]]" within body
The dictionary of the 16th century Polish
25. Scanned texts as corpora — a case study
Regular expressions
"[[=s=]f]k[[=a=]]rg[[=a=]]" within body
The dictionary of the 16th century Polish
26. Scanned texts as corpora — a case study
Regular expressions
OCR in The dictionary of the 16th century Polish
27. Scanned texts as corpora — a case study
Regular expressions
"[[=s=]]k[[=a=]]rg[[=a=]]" meta vol=xxxi
The dictionary of the 16th century Polish, only digitally born volumes
28. Scanned texts as corpora — a case study
Regular expressions
"[[=s=]]k[[=a=]]rg[[=a=]]" within body meta vol=xxxi
The dictionary of the 16th century Polish, only entries in digitally born volumes
29. Scanned texts as corpora — a case study
Regular expressions
Character references (in Poliqarp [for DjVu])
Unicode standard
www.unicode.org
Version 6.0.0 of 11th October 2010
Escape sequences
\x5c
REVERSE SOLIDUS,
\u1E83
LATIN SMALL LETTER W WITH ACUTE,
\U00010300
OLD ITALIC LETTER A
(Supplementary Multilingual Plane).
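The three escape forms listed above, written with a leading backslash, are two, four and eight hexadecimal digits long. Python string literals happen to use the same notation, so the character names can be checked with a small sketch. It says nothing about how Poliqarp itself parses these escapes.

```python
import unicodedata

# Each key is the escape as typed; each value is the character it denotes.
ESCAPES = {
    r"\x5c": "\x5c",
    r"\u1E83": "\u1E83",
    r"\U00010300": "\U00010300",
}

for written, char in ESCAPES.items():
    # Print the escape, its code point and its Unicode character name.
    print(written, f"U+{ord(char):04X}", unicodedata.name(char))
```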
30. Scanned texts as corpora — a case study
Regular expressions
Character class references (locale dependent)
Character classes
[:alnum:],
[:alpha:],
[:blank:],
[:cntrl:],
[:digit:],
. . .
31. Scanned texts as corpora — a case study
Regular expressions
Bracket expressions
Character classes can be used only in bracket expressions!
An example (Linde’s dictionary)
Syr "." "[^[:digit:]].*"
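The query above can be read as a sequence of three token patterns. A minimal sketch of that reading follows, assuming each quoted expression must match one whole token and that a bare word matches itself. Python's re module has no POSIX character classes, so [:digit:] is translated to \d first. The token list and both helper functions are hypothetical.

```python
import re


def translate_posix(pattern):
    """Rewrite the POSIX class [:digit:] for Python's re module.

    Only [:digit:] is handled in this sketch; other classes are left as-is.
    """
    return pattern.replace("[:digit:]", r"\d")


def find_sequences(tokens, patterns):
    """Return every run of consecutive tokens matching the patterns in order."""
    compiled = [re.compile(p) for p in patterns]
    n = len(compiled)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, tokens[i:i + n]))
    ]


# Hypothetical tokenized text: "Syr." followed once by a word, once by a number.
TOKENS = ["Syr", ".", "Ben", "Syr", ".", "12", "w"]
PATTERNS = ["Syr", ".", translate_posix("[^[:digit:]].*")]

print(find_sequences(TOKENS, PATTERNS))
```

The third pattern rejects tokens that start with a digit, so only the occurrence followed by "Ben" is returned. Note that an unescaped "." matches any single-character token, not just a full stop.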
32. Scanned texts as corpora — a case study
Regular expressions
Syr "."
Linde’s dictionary
33. Scanned texts as corpora — a case study
Regular expressions
Syr "." "[^[:digit:]].*"
Linde’s dictionary
34. Scanned texts as corpora — a case study
Regular expressions
Final remark
Thank you for your attention!
The present slides are available at
http://bc.klf.uw.edu.pl/173/
Contact: jsbien@uw.edu.pl