Belgrade City Library, 10-11 December 2015
Service Tailored to the 21st-Century User
In-house development of digital tools for handling of METS/ALTO files at University Library “Svetozar Markovic”
Nikola Smolenski, University Library “Svetozar Markovic”, Belgrade
Milena Kostic, University Library “Svetozar Markovic”, Belgrade
Adam Sofronijevic, University Library “Svetozar Markovic”, Belgrade
Summary
As a partner in the CIP ICT-PSP Europeana Newspapers project, the University Library “Svetozar Markovic” acquired
METS/ALTO files for 400,000 pages of historical Serbian newspapers. This was the first time that librarians in
Serbia had encountered files of this kind, and neither the University Library nor any other institution in Serbia had
the digital tools or expertise to handle them. Acquiring such tools on the international market was far beyond the
University Library’s budget, so in-house development was the only available solution.
This paper presents the digital tools developed, tested and implemented at the University Library, which now
support every step of a newspaper digitization workflow based on METS/ALTO files. The search interface and the
search engine behind it are described in some detail, along with the presentation interface that lets users read,
copy and paste the full text of newspaper articles while viewing the images of the newspaper pages. A correction
tool that improves the accuracy of OCRed text through manual correction of errors is presented, with some
statistics based on one year of work with the tool. Finally, a solution for automatic detection of pages that were
scanned or printed upside down, based on analysis of the OCRed text, is presented together with an analysis of its
application across the whole collection of Serbian historical newspapers. The paper also reports user demands,
complaints and comments gathered after the collection was offered through the in-house interface and enhanced
with the in-house tools. Some general conclusions and recommendations for future work on the digitization of
newspaper collections are drawn from these experiences.
Key words
METS, ALTO, OCR, BnLViewer, Lucene/Solr
1. Introduction
The presentation of digital materials is a challenging task for libraries and other heritage
institutions. There are different approaches to evaluating the interfaces of digital libraries, and an
extensive literature presents various stances and experiences. The University Library
“Svetozar Markovic” has more than eight years of experience in presenting digitized materials, and
the most valuable lesson is that users value full-text documents much more than scanned images.
An important milestone in the development of the digital library of the University Library in
Belgrade was therefore 2013, when the first METS/ALTO files were acquired through the
Europeana Newspapers CIP ICT-PSP project. The full potential of these files can be realized only
through a proper user interface, which was not available at the University Library at the time.
Through an exciting and sometimes difficult process of in-house development, such an interface
was created and implemented in 2015. Before presenting the features of this interface, we give a
short overview of the key terms that appear in the paper and are crucial for understanding the
tools and processes described.
Optical character recognition (OCR) is the mechanical or electronic conversion of scanned
images of handwritten, typewritten or printed text into machine-encoded text. OCR makes it
possible to process scanned books, screenshots and photos containing text and obtain editable
documents such as TXT, DOC or PDF files, which can be searched electronically, stored more
compactly, displayed online and used in machine processes. The most advanced OCR systems
can handle almost any type of image, even complex ones such as scanned magazine pages with
images and columns, or photos taken with a mobile phone. The OCR process consists of several
steps, each a set of related algorithms that performs one part of the job. Every step is equally
important and must handle the given image correctly; otherwise the whole process fails.¹,²
METS (Metadata Encoding and Transmission Standard), established in 2001, is an XML-based
open standard. Its schema is hosted³ at the Library of Congress (LOC) and maintained by the
METS Editorial Board. METS files describe digital objects that should be preserved for a long
time. They can embed different kinds of metadata, both descriptive and administrative, and may
link to any digital objects.
ALTO (Analyzed Layout and Text Object) is also an XML-based open standard; its schema is
likewise hosted⁴ at the Library of Congress and maintained by the ALTO Board. An ALTO file
usually contains the content of a single page and describes the layout of the printed page (styles,
layout and block type information) in enough detail to rebuild the original page. It may also
contain tags that carry more information about the content (e.g. named entities).
The benefit of METS/ALTO files is that they are widely used by libraries and content
providers, i.e. they represent a standard for digitization. In addition, they secure the long-term
sustainability of digital objects, which can be handled easily and exchanged between parties. They
support article and chapter segmentation, and PDF, EPUB, DAISY and other formats can be
created from METS/ALTO files.⁵
¹ Wikipedia, the free encyclopedia, “Optical character recognition explained”, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=686827150 (accessed 23 Nov 2015).
² Nicomsoft, “Optical Character Recognition (OCR) – How it works”, http://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ (accessed 23 Nov 2015).
³ Library of Congress, “Metadata Encoding and Transmission Standard (METS) Official Web Site”, http://www.loc.gov/standards/mets/ (accessed 23 Nov 2015).
⁴ Library of Congress, “ALTO: Technical Metadata for Optical Character Recognition (Standards, Library of Congress)”, http://www.loc.gov/standards/alto/ (accessed 23 Nov 2015).
⁵ CCS Content Conversion Specialists, “METS / ALTO introduction”, http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140909.pdf (accessed 23 Nov 2015).
2. Document search and display
A diagram of our document search and display infrastructure is shown in Figure 1.
Figure 1: Diagram of the search and display servers.
The infrastructure consists of a Windows server that holds the repository of METS/ALTO files and hosts
their viewer, BnLViewer; a web server that hosts the search interface; and a backend Lucene/Solr server
used to search the documents.
2.1. BnLViewer
BnLViewer is a rich, interactive viewer for METS/ALTO files developed and maintained
by the National Library of Luxembourg; screenshots can be seen at
http://sourceforge.net/p/bnlviewer/home/screenshots/.⁶ Here we briefly describe what it looks
like. Thumbnails of all the pages are displayed alongside the table of contents. The full page is
displayed, and articles can be selected with the cursor. In the article view, one article is cut out so
that it can be presented as a long vertical strip even if it spans several columns (or even pages) in
the original layout. The text of the article is presented as well, as it was recognized by the OCR
engine (which explains the errors). When a search is performed, the terms that are found are
highlighted on the image and in the OCR text.⁷
The viewer is very well received and widely used by researchers, who would like, in addition
to its current functionality, to have access to the full-text data as a linked data set, to be able to
print individual articles, and to have a simple “microfilm-like” viewer for efficient viewing of an
entire collection.⁸
⁶ National Library of Luxembourg, http://www.eluxemburgensia.lu (accessed 23 Nov 2015).
2.2. Search interface
The search interface offered to users at http://www.unilib.rs/istorijske-novine/search has
three sections: “Search”, “Advanced Search” and “Browse”. The backend search engine of the
interface is Lucene/Solr,⁹ developed by the Apache Software Foundation.
“Search” presents users with a text field in which they can enter the desired keywords.
The interface converts the user query into an appropriate Solr query¹⁰ that is sent to the Solr
server, and displays the results returned by Solr as a preview with hyperlinks to the full documents
in BnLViewer. For example, if a user searches for “beograd”, the corresponding Solr query is
“+text:( beograd )”. The supported advanced query options¹¹ are searching for a phrase, searching
with OR, and searching for pages without a phrase. The more advanced Lucene capabilities,¹² such
as searching by wildcard or by Levenshtein distance,¹³ are not offered to end users, but are
available to the library staff.
“Advanced Search”¹⁴ extends the search with options to search within a single periodical
or within a range of dates, and to sort the results in various ways. It can construct more complex
Solr queries. For example, a search for “beograd ILI zemun” (“ILI” is the Serbian-localized OR
operator) in the periodical “Beogradske opštinske novine” in the year 1900 is expanded to
“+text:( beograd OR zemun ) +collection_id:00004 +date:[1900-01-01T12:00:00Z TO 1900-12-31T12:00:00Z]”.
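The query expansion described above can be illustrated with a minimal sketch. It is not the library's actual code: the function name `build_solr_query` and the `OR_WORDS` set are hypothetical, and the German operator spelling is an assumption. The field names `text`, `collection_id` and `date`, and the noon timestamps, are copied from the examples in the text. Escaping of special characters in user input is deliberately left out.

```python
from datetime import date

# Localized spellings of the OR operator. "ILI" comes from the example in the
# text; the German "ODER" is an assumption.
OR_WORDS = {"OR", "ILI", "ODER"}


def build_solr_query(user_query, collection_id=None, year=None):
    """Translate a user query into a Solr query string (illustrative only)."""
    # Replace localized OR operators and leave every other word untouched.
    terms = ["OR" if word in OR_WORDS else word for word in user_query.split()]
    clauses = ["+text:( " + " ".join(terms) + " )"]
    if collection_id is not None:
        # Restrict the search to a single periodical.
        clauses.append("+collection_id:" + collection_id)
    if year is not None:
        # Restrict the search to one calendar year, first to last day.
        first, last = date(year, 1, 1), date(year, 12, 31)
        clauses.append(
            "+date:[{}T12:00:00Z TO {}T12:00:00Z]".format(
                first.isoformat(), last.isoformat()
            )
        )
    return " ".join(clauses)
```

Called as `build_solr_query("beograd ILI zemun", "00004", 1900)`, the sketch reproduces the expanded query quoted in the text.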
“Browse”¹⁵ lets users select a periodical and a publication date; the corresponding issue is
then displayed in BnLViewer. Information about the existing periodicals and dates is kept in a
relational database, separately from the search index.
⁷ “BnLViewer: A newspaper in the viewer”, http://sourceforge.net/p/bnlviewer/home/screenshots/ (accessed 23 Nov 2015).
⁸ Maurer, Yves, Presentation on BNL viewer. Amsterdam, 16 September 2013.
⁹ The Apache Software Foundation, “Apache Solr”, http://lucene.apache.org/solr/ (accessed 23 Nov 2015).
¹⁰ “SolrQuerySyntax”, https://wiki.apache.org/solr/SolrQuerySyntax (accessed 23 Nov 2015).
¹¹ University Library “Svetozar Markovic”, “Search Guide”, http://www.unilib.rs/istorijske-novine/about#search-guide (accessed 23 Nov 2015).
¹² The Apache Software Foundation, “Apache Lucene - Query Parser Syntax”, http://lucene.apache.org/core/3_6_0/queryparsersyntax.html (accessed 23 Nov 2015).
¹³ Levenshtein distance is the minimal number of operations necessary to transform one string of characters into another, which can be used to search for words similar to each other, since they will have small Levenshtein distance; Wikipedia, the free encyclopedia, “Levenshtein distance”, https://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=688663203 (accessed 23 Nov 2015).
¹⁴ University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/advanced-search (accessed 23 Nov 2015).
¹⁵ University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse (accessed 23 Nov 2015).
The interface is available in Serbian, English and German, including localization of the
query syntax.
2.3. Search engine improvements
Lucene/Solr is highly satisfactory for our purposes with regard to its storage capacity¹⁶ and
search speed.¹⁷ However, its support for the Serbian language had to be improved so that both
Cyrillic and Latin texts can be searched using either the Cyrillic or the Latin alphabet. Furthermore,
as a number of Internet users in Serbia write in “bald” Latin (Latin script without diacritics), we
wanted to support that as well. This was done by creating a Lucene filter¹⁸ that converts Cyrillic
and Latin text to “bald” Latin (the filter does not affect the display of the searched text). The use
of the filter is explained in detail at https://wiki.apache.org/solr/SerbianLanguageSupport.
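The idea behind such a filter can be sketched in a few lines. This is an illustration, not the Lucene filter itself: the name `to_bald_latin` is hypothetical, the mapping of “đ” to “dj” is an assumption about how users type that letter, and lowercasing is folded into the same function for brevity.

```python
# Serbian Cyrillic letters and their Latin equivalents.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

# Latin letters with diacritics and their "bald" replacements.
# Mapping "đ" to "dj" is an assumption.
LATIN_TO_BALD = {"đ": "dj", "ž": "z", "ć": "c", "č": "c", "š": "s"}


def to_bald_latin(text):
    """Lowercase the text and reduce Cyrillic or Latin script to bald Latin."""
    # First pass: Cyrillic to Latin; characters outside the table pass through.
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())
    # Second pass: strip the diacritics, which also covers the output of pass one.
    return "".join(LATIN_TO_BALD.get(ch, ch) for ch in latin)
```

Because the same function is applied to both the indexed text and the query, “Шабац”, “Šabac” and “sabac” all reduce to the same indexed form and therefore match each other.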
A stemmer¹⁹ for the Serbian language is currently being developed, which will further
improve Lucene’s search capabilities. An experimental stemmer that stems only the most common
Serbian names has been observed to return 100% more search results (i.e. twice as many) in
searches for supported names, with no reduction in relevance.
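A name-only stemmer of this kind could work roughly as sketched below. Everything in the sketch is hypothetical, since the text does not describe the experimental stemmer's internals: the name stems, the list of case endings and the function name `stem_name` are placeholders chosen for illustration.

```python
# Hypothetical list of supported name stems; the real list is not given.
NAME_STEMS = {"nikol", "milen", "adam"}

# Hypothetical case endings, with the two-letter ending tried before the
# one-letter ones and the empty ending tried last.
CASE_ENDINGS = ("om", "a", "e", "i", "o", "u", "")


def stem_name(word):
    """Return the stem of a supported name, or the lowercased word unchanged."""
    lowered = word.lower()
    for ending in CASE_ENDINGS:
        if lowered.endswith(ending):
            candidate = lowered[: len(lowered) - len(ending)]
            # Only strip the ending when what remains is a known name stem.
            if candidate in NAME_STEMS:
                return candidate
    return lowered
```

Restricting the stemmer to a closed list of stems keeps it from altering ordinary words, which is one plausible reason such a stemmer would add results without reducing relevance.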
In our Lucene schema,²⁰ every page is indexed as a separate document.²¹ Together with
the page text, we store the page number, other relevant information about the document the
page comes from (such as its document identifier or document date), and information about the
corresponding ALTO file.
Documents are imported into Lucene by a command-line script that scans all the ALTO
files in our repository and submits their content to Solr. The script uses the information about
each ALTO file stored in Lucene to check whether the file has been modified in the repository.
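The text does not say which information about the ALTO file is compared. One common option, shown here purely as an assumption, is to store a content hash with each indexed page and re-import only the files whose hash has changed. The function names are hypothetical.

```python
import hashlib


def fingerprint(alto_bytes):
    """Return a SHA-256 hex digest of an ALTO file's content."""
    return hashlib.sha256(alto_bytes).hexdigest()


def files_to_reindex(repository, indexed_fingerprints):
    """List the files that are new or changed since they were indexed.

    repository: mapping of file name -> current file content (bytes)
    indexed_fingerprints: mapping of file name -> fingerprint stored at import
    """
    return sorted(
        name
        for name, content in repository.items()
        # A missing fingerprint (new file) also counts as a mismatch.
        if indexed_fingerprints.get(name) != fingerprint(content)
    )
```

A file modification time would serve the same purpose with less computation, at the cost of false alarms when a file is rewritten without changes.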
Various file types are usually imported into Lucene through a Solr data import handler;²²
however, instead of creating a METS/ALTO handler, we found it simpler to convert ALTO files to
raw text and submit the text externally. During the conversion, the script also joins words
hyphenated at line ends.
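The conversion step can be sketched as follows, assuming that the ALTO files mark line-end hyphenation with a `HYP` element at the end of a `TextLine` and keep the words in the `CONTENT` attribute of `String` elements, as the ALTO schema allows. The function names are hypothetical and the sketch ignores everything except the text itself.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip the XML namespace, which differs between ALTO versions."""
    return element.tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return the page text as one string, joining words hyphenated at line ends."""
    words = []
    join_next = False  # True when the previous String was followed by HYP
    for line in ET.fromstring(alto_xml).iter():
        if local_name(line) != "TextLine":
            continue
        for child in line:
            name = local_name(child)
            if name == "String":
                content = child.get("CONTENT", "")
                if join_next and words:
                    # Glue the second half of a hyphenated word to the first.
                    words[-1] += content
                else:
                    words.append(content)
                join_next = False
            elif name == "HYP":
                join_next = True
    return " ".join(words)
```

The resulting string is what would be submitted to Solr as the page text, so that a search for a whole word also finds occurrences that were split across two lines in print.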
¹⁶ The Apache Software Foundation, “Package org.apache.lucene.codecs.lucene40”, http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations (accessed 23 Nov 2015).
¹⁷ We observed that a search query usually lasts around 200 milliseconds.
¹⁸ “Analyzers, Tokenizers, and Token Filters”, https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (accessed 23 Nov 2015).
¹⁹ A stemmer is a computer program that reduces words to their stems, thus enabling finding various grammatical forms of a word by searching for either one of them: Wikipedia, the free encyclopedia, “Stemming”, https://en.wikipedia.org/w/index.php?title=Stemming&oldid=685786913 (accessed 23 Nov 2015).
²⁰ “SchemaXml”, https://wiki.apache.org/solr/SchemaXml (accessed 23 Nov 2015).
²¹ A drawback of this method is that it is not possible to find phrases that span multiple pages; but these are very rare, and it is usually not possible to find them anyway since they are interrupted by page headers and footers.
²² “Data Import Request Handler”, https://wiki.apache.org/solr/DataImportHandler (accessed 23 Nov 2015).
3. Document correction
3.1. Manual correction tool
To increase the accuracy of the OCRed text, we built AltoEdit, a web tool that enables
manual correction of ALTO files. The tool has been used to correct the 1914 and 1915 volumes of
the journal “Ratni dnevnik”²³ (the war diary of the Serbian army during WWI), 984 pages of text
in total. The corrected files have been observed to return around 10% more search results, with
no reduction in relevance.
The tool works as follows. A command-line script first modifies the ALTO files so that
every relevant element without an ID gets one. A second script then imports the content of the
ALTO files into a relational database, keeping track of the IDs from the ALTO files. Correctors
work in a web interface that displays the content of an ALTO file line by line, next to the
corresponding cutout of the page image. They can edit the words, which are saved in the
database; the database preserves the history of all edits. Finally, a third command-line script
exports the corrected words from the database back to the ALTO files according to their IDs.
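The first step of this workflow, giving an ID to every relevant element that lacks one, can be sketched as below. Which elements count as relevant and what the generated IDs look like are not stated in the text, so the `RELEVANT` set, the `AE` prefix and the function name are all assumptions.

```python
import xml.etree.ElementTree as ET

# Element types treated as "relevant" here; this choice is an assumption.
RELEVANT = {"TextBlock", "TextLine", "String"}


def add_missing_ids(root, prefix="AE"):
    """Give every relevant element without an ID a new, unused one.

    Returns the number of IDs added. The "AE" prefix is hypothetical.
    """
    # Collect the IDs already present so that new ones never collide.
    used = {el.get("ID") for el in root.iter() if el.get("ID")}
    counter = 0
    added = 0
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] not in RELEVANT or el.get("ID"):
            continue
        counter += 1
        while "{}{:06d}".format(prefix, counter) in used:
            counter += 1
        new_id = "{}{:06d}".format(prefix, counter)
        el.set("ID", new_id)
        used.add(new_id)
        added += 1
    return added
```

Stable IDs are what allow the export script to write each corrected word back to exactly the element it came from, even if other parts of the file have changed in the meantime.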
Unfortunately, while using the tool we discovered that it has certain drawbacks. The
database has a relatively complex structure that mimics all the complexity of ALTO files, yet the
tool does not need any of the benefits provided by relational databases. The correctors have also
reported that working line by line is tedious and that working on whole pages or columns of text
would be preferable.
Our recommendation for future tools of this type is therefore that ALTO files be kept in a
version control system²⁴ and edited directly, and that correctors be provided with more
comfortable interfaces.
3.2. Automatic detection of rotated pages
While surveying the corpus we discovered that a relatively common error is a page
turned upside down. Since this is very inconvenient for the reader, we wanted to see whether
such pages can be identified and turned the right way up.
We did this by statistically analyzing the text of the pages in order to find pages with very
poorly OCRed text, then OCRing those pages in various orientations and statistically analyzing
the OCR output in order to find the correct orientation.
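A minimal sketch of this two-step procedure is given below, assuming the quality score is simply the share of words found in a word list (the dictionary lookup described in the next paragraph). Running the OCR engine itself is outside the sketch: the caller is assumed to supply the OCR output for each rotation. The function names are hypothetical.

```python
import re


def dictionary_hit_ratio(text, dictionary):
    """Return the share of words in text that appear in the dictionary."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(word in dictionary for word in words) / len(words)


def best_orientation(ocr_by_angle, dictionary):
    """Pick the rotation angle whose OCR output scores highest.

    ocr_by_angle: mapping of angle in degrees -> OCR text for that rotation
    """
    return max(
        ocr_by_angle,
        key=lambda angle: dictionary_hit_ratio(ocr_by_angle[angle], dictionary),
    )
```

Pages whose ratio falls below some threshold would be the candidates for re-OCR in other orientations. The threshold has to be chosen empirically, because a word with a single misrecognized character counts as a miss just like a fully jumbled one.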
The statistical analysis of the text was done by looking up every word of the text in a
spelling dictionary. While this method gave satisfactory results, we discovered that it produces
too many false positives, as it sees no difference between a word with a single unrecognized
character and the completely jumbled words that occur on a page rotated upside down. We
believe that better results might be obtained with other methods, probably a statistical n-gram
analysis.
²³ University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse?newspaper=UB_00029 (accessed 23 Nov 2015).
²⁴ A version control system keeps various versions of a document as it was changed through time, so that they can be compared, old versions restored in case of an error etc; Wikipedia, the free encyclopedia, “Version control”, https://en.wikipedia.org/w/index.php?title=Version_control&oldid=690270257 (accessed 23 Nov 2015).
An experimental tool that rotates all the coordinates in an ALTO file, so that the
elements are set into their correct positions, has also been created, although the proper solution
to this problem would be to rotate the image and re-OCR the text.
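For the upside-down case, the coordinate rotation amounts to mirroring each element's box on both axes. The sketch below assumes the usual ALTO attributes `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` and takes the page size as parameters. It is an illustration of the geometry, not the experimental tool itself, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def _fmt(value):
    """Write whole numbers without a trailing '.0'."""
    return str(int(value)) if value == int(value) else str(value)


def rotate_180(root, page_width, page_height):
    """Rotate every positioned element of an ALTO tree by 180 degrees, in place.

    An element occupying [HPOS, HPOS + WIDTH] horizontally ends up at
    page_width - HPOS - WIDTH; the vertical axis is handled the same way.
    """
    for el in root.iter():
        attrs = el.attrib
        # Skip elements that do not carry a full bounding box.
        if not {"HPOS", "VPOS", "WIDTH", "HEIGHT"} <= set(attrs):
            continue
        hpos, vpos = float(attrs["HPOS"]), float(attrs["VPOS"])
        width, height = float(attrs["WIDTH"]), float(attrs["HEIGHT"])
        el.set("HPOS", _fmt(page_width - hpos - width))
        el.set("VPOS", _fmt(page_height - vpos - height))
```

The sketch changes only the coordinates. The order of elements in the file, and the recognized text itself, stay as they were, which is one reason why rotating the image and repeating the OCR remains the proper fix.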
4. Conclusion
Everyday communication with users has shown that, after the first few months of promotion
and online availability, the in-house tools for presenting full-text materials at the University
Library have proved reliable and in line with user needs and expectations. Moreover, the number
of users seems to be increasing, driven by the positive experience of existing users and by
informal communication through social media. The availability of full-text files has been
identified as the key added value, within the framework of an effectively functioning search
interface and METS/ALTO viewer.
We conclude that METS/ALTO is a highly versatile format that can easily be used in
various ways and completely fulfills our needs for the creation, presentation and preservation of
digitized documents. The fact that the format is XML-based facilitates the creation of tools for
handling and processing it. Its versatility also facilitates inter-library cooperation in digitization
and tool creation.
Резиме (Summary)
As a participant in the CIP ICT-PSP Europeana Newspapers project, the University Library “Svetozar Marković” received
METS/ALTO files for 400,000 pages of Serbian historical newspapers. On this occasion librarians encountered this type
of file for the first time, so neither the digital tools nor the expertise needed to handle them existed at the
University Library or at any other institution in Serbia. Acquiring the necessary tools on the international market
was impossible given the Library’s financial circumstances, so the only possible solution was to develop the tools
independently. This paper presents the various digital tools that were developed, tested and implemented at the
University Library and that now make it possible to implement all elements of a newspaper digitization process based
on METS/ALTO files. Details of the search interface and the search program are presented, as well as details of the
interface that allows users to read and copy the full text of newspaper articles, which are at the same time presented
as images of newspaper pages. As part of the description of the correction tool, which improves the accuracy of OCRed
text through manual correction of errors in the text, statistics based on one year of use of the tool are presented.
Finally, a solution for automatic detection of pages scanned or printed upside down, based on analysis of the OCRed
text, is described, together with an analysis of the implementation of this solution across the whole collection of
Serbian historical newspapers. The paper presents experiences relating to user demands, objections and comments after
using the collection through the in-house interface and the materials improved with the in-house tools. General
conclusions and recommendations for further work on the digitization of newspaper collections, based on these
experiences, are given at the end of the paper.
Кључне речи (Key words)
METS, ALTO, OCR, BnLViewer, Lucene/Solr
References
1. “Analyzers, Tokenizers, and Token Filters”, https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (accessed 23 Nov 2015).
2. “BnLViewer: A newspaper in the viewer”, http://sourceforge.net/p/bnlviewer/home/screenshots/ (accessed 23 Nov 2015).
3. CCS Content Conversion Specialists, “METS / ALTO introduction”, http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140909.pdf (accessed 23 Nov 2015).
4. “Data Import Request Handler”, https://wiki.apache.org/solr/DataImportHandler (accessed 23 Nov 2015).
5. Nicomsoft, “Optical Character Recognition (OCR) – How it works”, http://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ (accessed 23 Nov 2015).
6. Library of Congress, “ALTO: Technical Metadata for Optical Character Recognition (Standards, Library of Congress)”, http://www.loc.gov/standards/alto/ (accessed 23 Nov 2015).
7. Library of Congress, “Metadata Encoding and Transmission Standard (METS) Official Web Site”, http://www.loc.gov/standards/mets/ (accessed 23 Nov 2015).
8. Maurer, Yves, Presentation on BNL viewer. Amsterdam, 16 September 2013.
9. National Library of Luxembourg, http://www.eluxemburgensia.lu (accessed 23 Nov 2015).
10. “SchemaXml”, https://wiki.apache.org/solr/SchemaXml (accessed 23 Nov 2015).
11. “SolrQuerySyntax”, https://wiki.apache.org/solr/SolrQuerySyntax (accessed 23 Nov 2015).
12. The Apache Software Foundation, “Apache Lucene - Query Parser Syntax”, http://lucene.apache.org/core/3_6_0/queryparsersyntax.html (accessed 23 Nov 2015).
13. The Apache Software Foundation, “Apache Solr”, http://lucene.apache.org/solr/ (accessed 23 Nov 2015).
14. The Apache Software Foundation, “Package org.apache.lucene.codecs.lucene40”, http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations (accessed 23 Nov 2015).
16. University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/advanced-search (accessed 23 Nov 2015).
17. University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse (accessed 23 Nov 2015).
18. University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse?newspaper=UB_00029 (accessed 23 Nov 2015).
19. University Library “Svetozar Markovic”, “Search Guide”, http://www.unilib.rs/istorijske-novine/about#search-guide (accessed 23 Nov 2015).
20. Wikipedia, the free encyclopedia, “Levenshtein distance”, https://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=688663203 (accessed 23 Nov 2015).
21. Wikipedia, the free encyclopedia, “Optical character recognition explained”, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=686827150 (accessed 23 Nov 2015).
22. Wikipedia, the free encyclopedia, “Stemming”, https://en.wikipedia.org/w/index.php?title=Stemming&oldid=685786913 (accessed 23 Nov 2015).
23. Wikipedia, the free encyclopedia, “Version control”, https://en.wikipedia.org/w/index.php?title=Version_control&oldid=690270257 (accessed 23 Nov 2015).

More Related Content

Similar to 12_N.Smolenski, M.Kostic, A.Sofronijevic

An overview of The European Library. Olaf Janssen presenting during DRH 2005,...
An overview of The European Library. Olaf Janssen presenting during DRH 2005,...An overview of The European Library. Olaf Janssen presenting during DRH 2005,...
An overview of The European Library. Olaf Janssen presenting during DRH 2005,...
Olaf Janssen
 
Module 6_Research Publication_Ethics.docx
Module 6_Research Publication_Ethics.docxModule 6_Research Publication_Ethics.docx
Module 6_Research Publication_Ethics.docx
jisskuruvilla
 
EuropeanaConnect - Enhancing User Access to European Digital Heritage
EuropeanaConnect - Enhancing User Access to European Digital HeritageEuropeanaConnect - Enhancing User Access to European Digital Heritage
EuropeanaConnect - Enhancing User Access to European Digital Heritage
Max Kaiser
 
Digital libraries, K. Stefanov
Digital libraries, K. StefanovDigital libraries, K. Stefanov
Digital libraries, K. Stefanov
Share.TEC
 
Modern repositories for storage of scientific information, K. Stefanov
Modern repositories for storage of scientific information, K. StefanovModern repositories for storage of scientific information, K. Stefanov
Modern repositories for storage of scientific information, K. Stefanov
Share.TEC
 

Similar to 12_N.Smolenski, M.Kostic, A.Sofronijevic (20)

The JISC IE: shared, global or common services?
The JISC IE: shared, global or common services?The JISC IE: shared, global or common services?
The JISC IE: shared, global or common services?
 
Open for Business Open Archives, OpenURL, RSS and the Dublin Core
Open for Business  Open Archives, OpenURL, RSS and the Dublin CoreOpen for Business  Open Archives, OpenURL, RSS and the Dublin Core
Open for Business Open Archives, OpenURL, RSS and the Dublin Core
 
The JISC Information Environment and collection description
The JISC Information Environment and collection descriptionThe JISC Information Environment and collection description
The JISC Information Environment and collection description
 
New ICT Trends and Issues of Librarianship
New ICT Trends and Issues of LibrarianshipNew ICT Trends and Issues of Librarianship
New ICT Trends and Issues of Librarianship
 
An overview of The European Library. Olaf Janssen presenting during DRH 2005,...
An overview of The European Library. Olaf Janssen presenting during DRH 2005,...An overview of The European Library. Olaf Janssen presenting during DRH 2005,...
An overview of The European Library. Olaf Janssen presenting during DRH 2005,...
 
Digital library and MLE integration - where are we now and where do we want t...
Digital library and MLE integration - where are we now and where do we want t...Digital library and MLE integration - where are we now and where do we want t...
Digital library and MLE integration - where are we now and where do we want t...
 
Rethinking_the_LSP_Jan2016a
Rethinking_the_LSP_Jan2016aRethinking_the_LSP_Jan2016a
Rethinking_the_LSP_Jan2016a
 
Metadata and Scotland’s information environment: potential benefits of Web 2.0
Metadata and Scotland’s information environment: potential benefits of Web 2.0Metadata and Scotland’s information environment: potential benefits of Web 2.0
Metadata and Scotland’s information environment: potential benefits of Web 2.0
 
Europeana Creative. EDM Endpoint. Custom Views
Europeana Creative. EDM Endpoint. Custom ViewsEuropeana Creative. EDM Endpoint. Custom Views
Europeana Creative. EDM Endpoint. Custom Views
 
Module 6_Research Publication_Ethics.docx
Module 6_Research Publication_Ethics.docxModule 6_Research Publication_Ethics.docx
Module 6_Research Publication_Ethics.docx
 
EuropeanaConnect - Enhancing User Access to European Digital Heritage
EuropeanaConnect - Enhancing User Access to European Digital HeritageEuropeanaConnect - Enhancing User Access to European Digital Heritage
EuropeanaConnect - Enhancing User Access to European Digital Heritage
 
Digital libraries, K. Stefanov
Digital libraries, K. StefanovDigital libraries, K. Stefanov
Digital libraries, K. Stefanov
 
Modern repositories for storage of scientific information, K. Stefanov
Modern repositories for storage of scientific information, K. StefanovModern repositories for storage of scientific information, K. Stefanov
Modern repositories for storage of scientific information, K. Stefanov
 
A Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And RepositoryA Service-Oriented National E-Theses Information System And Repository
A Service-Oriented National E-Theses Information System And Repository
 
TEI Conference - CVCE
TEI Conference - CVCETEI Conference - CVCE
TEI Conference - CVCE
 
"Open Source for Public Libraries Case Study IBLA Soft Library Automation Sof...
"Open Source for Public Libraries Case Study IBLA Soft Library Automation Sof..."Open Source for Public Libraries Case Study IBLA Soft Library Automation Sof...
"Open Source for Public Libraries Case Study IBLA Soft Library Automation Sof...
 
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, SwedenSem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
Sem tech in CH, Linked Data Meetup, 2014-08-21, Malmo, Sweden
 
The Ground Truth: Arabic Scientific Manuscripts Workshop
The Ground Truth: Arabic Scientific Manuscripts WorkshopThe Ground Truth: Arabic Scientific Manuscripts Workshop
The Ground Truth: Arabic Scientific Manuscripts Workshop
 
Ehis Eds Europe May2009
Ehis Eds Europe May2009Ehis Eds Europe May2009
Ehis Eds Europe May2009
 
Usability & User-Centred Design
Usability & User-Centred DesignUsability & User-Centred Design
Usability & User-Centred Design
 

12_N.Smolenski, M.Kostic, A.Sofronijevic

  • 1. Библиотека града Београда, 10-11. децембар 2015. 1 Услуга по мери корисника 21. века In-house development of digital tools for handling of METS/ALTO files at University library ''Svetozar Markovic'' Nikola Smolenski University Library ''Svetozar Markovic'', Belgrade Milena Kostic University Library ''Svetozar Markovic'', Belgrade Adam Sofronijevic University Library ''Svetozar Markovic'', Belgrade Summary As a project partner in CIP ICT-PSP Europeana Newspapers project University Library ''Svetozar Markovic'' acquired METS/ALTO files of 400.000 pages of historical Serbian newspapers. This was the first time ever that librarians in Serbia encountered this kind of files and there were no digital tools or expertise available to handle them either at the University Library or at any other institution in Serbia. Acquisition of such tools on the international market was far beyond budgeting capabilities of the University Library and in-house development was the only solution available. This paper presents various digital tools developed, tested and implemented at the University Library that nowadays enable implementation of all elements in the work flow of digitization process of newspapers based on METS/ALTO files. Search interface and the search engine behind it are depicted in some detail along the presentation interface that allows users to read, copy and paste full text articles from newspapers that are at the same time presented as images of newspaper pages. A correction tool that allows for upgrading of OCRd text precision quality by manual correction of errors in the text is presented with some statistics based on one year experience of work with this tool. Finally, a solution for automatic detection of pages that are being scanned or printed upside down that is based on analysis of OCRd text is presented with the analysis of implementation of this solution across the whole collection of Serbian historic newspapers. 
The paper also presents experiences regarding user demands, complaints and comments after using the collection through the in-house developed interface and materials enhanced with in-house built tools. Some general conclusions and recommendations for future work in the digitization of newspaper collections, based on these experiences, are also presented.

Key words

METS, ALTO, OCR, BnLViewer, Lucene/Solr

1. Introduction

The presentation of digital materials is a challenging task for libraries and other heritage institutions. There are different approaches to the evaluation of digital library interfaces, accompanied by a large body of literature presenting various stances and experiences. The University Library “Svetozar Markovic” has more than eight years of experience in presenting digitized materials, and the most valuable lesson is that users value full-text documents much more than scanned images. An important milestone in the development of the digital library of the University Library in Belgrade was therefore 2013, when the first METS/ALTO files were acquired through the Europeana Newspapers CIP ICT-PSP project. The full potential of these files can be realized only through a proper user interface, which was not available at the University Library at the time. Through an exciting and sometimes difficult process of in-house development, such an interface was created and implemented in 2015. Before presenting the features of this interface, we give a short overview of the key terms that appear in the paper and are crucial for understanding the tools and processes described.

Optical character recognition (OCR) is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. OCR makes it possible to process scanned books, screenshots and photos containing text and to obtain editable documents such as TXT, DOC or PDF files, which can be electronically searched, stored more compactly, displayed online and used in machine processes. The most advanced OCR systems can handle almost any type of image, even complex ones such as scanned magazine pages with images and columns, or photos taken with a mobile phone. The OCR process consists of several steps, each of which is a set of related algorithms that does one piece of the OCR job. Every step is equally important and must handle the given image correctly; otherwise, the whole process fails. [1][2]

METS (Metadata Encoding and Transmission Standard), established in 2001, is an XML-based open standard. Its schema is hosted [3] at the Library of Congress (LOC) and maintained by the METS Editorial Board. METS files are used to describe digital objects that should be preserved for a long time. They can embed different kinds of metadata, both descriptive and administrative, and may link to any digital objects.
ALTO (Analyzed Layout and Text Object) is also an XML-based open standard; its schema is likewise hosted [4] at the Library of Congress (LOC) and is maintained by the ALTO Board. An ALTO file usually contains the content of a single page and describes the layout of the printed page in enough detail (styles, layout and block type information) to rebuild the original page. It may also contain tags that carry further information about the content (e.g. named entities).

The benefit of METS/ALTO files is that they are widely used by libraries and content providers, i.e. they represent a standard for digitization. In addition, they secure the long-term sustainability of digital objects, which can be handled easily and exchanged between parties. They support article and chapter segmentation. PDF, EPUB, DAISY and other formats can be created from METS/ALTO files. [5]

[1] Wikipedia, the free encyclopedia, “Optical character recognition explained”, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=686827150 (accessed 23 Nov 2015).
[2] Nicomsoft, “Optical Character Recognition (OCR) – How it works”, http://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ (accessed 23 Nov 2015).
[3] Library of Congress, “Metadata Encoding and Transmission Standard (METS) Official Web Site”, http://www.loc.gov/standards/mets/ (accessed 23 Nov 2015).
[4] Library of Congress, “ALTO: Technical Metadata for Optical Character Recognition (Standards, Library of Congress)”, http://www.loc.gov/standards/alto/ (accessed 23 Nov 2015).
[5] CCS Content Conversion Specialists, “METS / ALTO introduction”, http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140909.pdf (accessed 23 Nov 2015).
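To make the ALTO structure described above more concrete, the following minimal Python sketch reads the recognized text of a page line by line. It relies only on the standard ALTO elements TextLine and String (with its CONTENT attribute); the sample document and the function names are invented for this illustration and are not part of the tools described in this paper.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, which differs between ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml):
    """Return the text of every TextLine in an ALTO document, in file order."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each recognized word is a String element; its text is in CONTENT.
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


# Invented sample page with a single text line (not taken from the collection).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1" WIDTH="2000" HEIGHT="3000"><PrintSpace>
    <TextBlock ID="B1">
      <TextLine>
        <String CONTENT="Beogradske" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="50"/>
        <SP/>
        <String CONTENT="novine" HPOS="520" VPOS="200" WIDTH="250" HEIGHT="50"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

print(alto_lines(SAMPLE))
```

Matching on local element names rather than on a fixed namespace is a design choice made here so that the same sketch works across ALTO schema versions.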
2. Document search and display

A diagram of our document search and display infrastructure is shown in Figure 1.

Figure 1: diagram of the search and display servers.

The infrastructure consists of a Windows server that stores the METS/ALTO files and hosts their viewer, BnLViewer; a web server that hosts the search interface; and a backend Lucene/Solr server used to search the documents.

2.1. BnLViewer

BnLViewer is a rich, interactive viewer for METS/ALTO files developed and maintained by the National Library of Luxembourg; screenshots can be seen at http://sourceforge.net/p/bnlviewer/home/screenshots/. [6] Here we provide a short description of what it looks like. The thumbnails of all the pages are displayed alongside the table of contents. The full page is displayed and articles can be selected with the cursor. In the article view, one article is cut out so that it can be presented as a long vertical strip even if it spans several columns (or even pages) in the original layout. The text of the article is presented as well, as it was recognized by the OCR engine (which explains the errors). When a search is performed, the terms that are found are highlighted on the image and in the OCR text. [7] The viewer is very well received and widely used by researchers, who would, in addition to its current functionality, like to have access to the full-text data as a linked data set, to be able to print individual articles, and to have a simple “microfilm-like” viewer for efficient viewing of an entire collection. [8]

[6] National Library of Luxembourg, http://www.eluxemburgensia.lu (accessed 23 Nov 2015).

2.2. Search interface

The search interface offered to users at http://www.unilib.rs/istorijske-novine/search has three sections: “Search”, “Advanced Search” and “Browse”. The backend search engine of the interface is Lucene/Solr, [9] developed by the Apache Software Foundation.

“Search” presents users with a text field where they may enter the desired keywords. The interface converts the user query into an appropriate Solr query, [10] sends it to the Solr server, and displays the results returned by Solr as a preview, with hyperlinks to the full documents in BnLViewer. For example, if a user searches for “beograd”, the corresponding Solr query is “+text:( beograd )”.

The advanced query options that are supported [11] are searching for a phrase, searching with OR, and searching for pages without a phrase. The more advanced Lucene capabilities, [12] such as searching by wildcard or by Levenshtein distance, [13] are not offered to end users, but they are available to the library staff.

“Advanced Search” [14] extends the search with the options to search within a single periodical, to search within a range of dates, and to sort the results in various ways. It can construct more complex Solr queries; for example, a search for “beograd ILI zemun” in the periodical “Beogradske opštinske novine” in the year 1900 is expanded to “+text:( beograd OR zemun ) +collection_id:00004 +date:[1900-01-01T12:00:00Z TO 1900-12-31T12:00:00Z]”.
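The query expansion described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the interface's actual code: the function name, its parameters and the operator table are hypothetical, only the localized operator ILI from the example is handled, and no escaping of special characters is attempted. The field names text, collection_id and date are taken from the example queries.

```python
# Hypothetical table of localized operators; only the one shown in the paper.
LOCALIZED_OPERATORS = {"ILI": "OR"}


def build_solr_query(user_query, collection_id=None, year=None):
    """Expand a user query into a Solr query string of the form shown above."""
    # Replace localized operators, leave ordinary search terms untouched.
    terms = [LOCALIZED_OPERATORS.get(term, term) for term in user_query.split()]
    clauses = ["+text:( " + " ".join(terms) + " )"]
    if collection_id is not None:
        # Restrict the search to a single periodical.
        clauses.append("+collection_id:" + collection_id)
    if year is not None:
        # Restrict the search to one calendar year.
        clauses.append(
            "+date:[%d-01-01T12:00:00Z TO %d-12-31T12:00:00Z]" % (year, year)
        )
    return " ".join(clauses)


print(build_solr_query("beograd"))
print(build_solr_query("beograd ILI zemun", collection_id="00004", year=1900))
```

With the inputs from the text, the two calls reproduce the simple and the advanced query quoted above.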
“Browse” [15] enables users to select a desired periodical and its publication date, which is then displayed in BnLViewer. Information about existing periodicals and dates is kept in a relational database, separately from the search index.

[7] “BnLViewer: A newspaper in the viewer”, http://sourceforge.net/p/bnlviewer/home/screenshots/ (accessed 23 Nov 2015).
[8] Maurer, Yves, Presentation on BNL viewer. Amsterdam, 16 September 2013.
[9] The Apache Software Foundation, “Apache Solr”, http://lucene.apache.org/solr/ (accessed 23 Nov 2015).
[10] “SolrQuerySyntax”, https://wiki.apache.org/solr/SolrQuerySyntax (accessed 23 Nov 2015).
[11] University Library “Svetozar Markovic”, “Search Guide”, http://www.unilib.rs/istorijske-novine/about#search-guide (accessed 23 Nov 2015).
[12] The Apache Software Foundation, “Apache Lucene - Query Parser Syntax”, http://lucene.apache.org/core/3_6_0/queryparsersyntax.html (accessed 23 Nov 2015).
[13] Levenshtein distance is the minimal number of operations necessary to transform one string of characters into another. It can be used to search for words similar to each other, since such words have a small Levenshtein distance; Wikipedia, the free encyclopedia, “Levenshtein distance”, https://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=688663203 (accessed 23 Nov 2015).
[14] University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/advanced-search (accessed 23 Nov 2015).
[15] University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse (accessed 23 Nov 2015).
The interface is available in Serbian, English and German, including localization of the query syntax.

2.3. Search engine improvements

Lucene/Solr is highly satisfactory for our purposes as regards its storage capacity [16] and speed [17] of search. However, its support for the Serbian language had to be improved so that texts in either Cyrillic or Latin script can be searched with queries typed in either script. Furthermore, as a number of Internet users in Serbia use “bald” Latin (Latin script without diacritics), we wanted to support that as well. This was done by creating a Lucene filter [18] that converts Cyrillic and Latin text to “bald” Latin (a filter does not affect the display of the searched text). The use of the filter is explained in detail at https://wiki.apache.org/solr/SerbianLanguageSupport.

A stemmer [19] for the Serbian language is currently being developed, which will further improve Lucene's search capabilities. An experimental stemmer that stems only the most common Serbian names is observed to return 100% more search results in searches for supported names, with no reduction in relevance.

In our Lucene schema, [20] every page is indexed as a separate document. [21] Together with the page text, we store the page number, other relevant information about the document the page comes from (such as its document identifier or document date), and information about the corresponding ALTO file.

Documents are imported into Lucene by a command-line script that scans all the ALTO files in our repository and submits their content to Solr. The script uses the information about each ALTO file stored in Lucene to determine whether the file has been modified in the repository.

Although various file types are usually imported into Lucene through a Solr data import handler, [22] we have found it simpler to convert ALTO files to raw text and submit them externally as such, instead of creating a METS/ALTO handler. During the conversion, the script also joins words hyphenated at line ends.

[16] The Apache Software Foundation, “Package org.apache.lucene.codecs.lucene40”, http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations (accessed 23 Nov 2015).
[17] We observed that a search query usually lasts around 200 milliseconds.
[18] “Analyzers, Tokenizers, and Token Filters”, https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (accessed 23 Nov 2015).
[19] A stemmer is a computer program that reduces words to their stems, thus making it possible to find the various grammatical forms of a word by searching for any one of them; Wikipedia, the free encyclopedia, “Stemming”, https://en.wikipedia.org/w/index.php?title=Stemming&oldid=685786913 (accessed 23 Nov 2015).
[20] “SchemaXml”, https://wiki.apache.org/solr/SchemaXml (accessed 23 Nov 2015).
[21] A drawback of this method is that it is not possible to find phrases that span multiple pages; but these are very rare, and it is usually not possible to find them anyway, since they are interrupted by page headers and footers.
[22] “Data Import Request Handler”, https://wiki.apache.org/solr/DataImportHandler (accessed 23 Nov 2015).
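Two of the text normalizations described in this section can be illustrated with a short Python sketch. It is not the Lucene filter or the import script themselves: the function names are hypothetical, the character table (in particular the choices “dj” for đ/ђ and “dz” for џ) is an assumption about one reasonable “bald” Latin mapping, and the hyphen rule simply treats every line-final hyphen as a word break, which a real script could refine using the hyphenation information in the ALTO file.

```python
# Assumed mapping of Serbian Cyrillic and diacritic Latin letters to "bald" Latin.
BALD_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "dj", "е": "e",
    "ж": "z", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "c", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "c",
    "џ": "dz", "ш": "s",
    "č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj",
}


def fold_to_bald_latin(text):
    """Lowercase the text and map every letter to its bald Latin form."""
    return "".join(BALD_LATIN.get(char, char) for char in text.lower())


def join_hyphenated_lines(lines):
    """Join text lines into one string, gluing words split by a line-end hyphen."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-"):
            # Drop the hyphen and continue the word on the next line.
            text = text[:-1] + line
        elif text:
            text += " " + line
        else:
            text = line
    return text


print(fold_to_bald_latin("Београд"), fold_to_bald_latin("Šumadija"))
print(join_hyphenated_lines(["Univerzitetska biblio-", "teka u Beogradu"]))
```

Because the same folding would be applied both to the indexed text and to the query, a query typed as “beograd”, “Beograd” or “Београд” is reduced to the same form.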
3. Document correction

3.1. Manual correction tool

In order to increase the accuracy of the OCRd text, we have made a web tool called AltoEdit that enables manual correction of ALTO files. The tool has been used to correct the years 1914 and 1915 of the journal “Ratni dnevnik” [23] (the war diary of the Serbian army during WWI); in total, 984 pages of text. The corrected files are observed to return around 10% more search results, with no reduction in relevance.

The tool works in the following way. A command-line script modifies the ALTO files so that every relevant element without an ID gets one. Another script then imports the content of the ALTO files into a relational database, keeping track of the IDs in the ALTO files. Correctors are provided with a web interface that displays the content of the ALTO files line by line, together with the corresponding cutout of the page image. The correctors can then edit the words, which are saved in the database; the database preserves the history of all edits. Finally, a third command-line script exports the corrected words from the database back to the ALTO files according to their IDs.

Unfortunately, while using the tool we have discovered that it has certain drawbacks. The database has a relatively complex structure, mimicking all the complexity of ALTO files, yet the tool does not need any of the benefits provided by relational databases. The correctors have also reported that working line by line is tedious and that working on whole pages or columns of text would be preferable.

It is thus our recommendation for future tools of this type that ALTO files be kept in a version control system [24] and edited directly, and that correctors be provided with more comfortable interfaces.

3.2. Automatic detection of rotated pages

While surveying the corpus we have discovered that a relatively common error is pages turned upside down.
Since this is very inconvenient for the reader, we wanted to see whether it is possible to identify such pages and turn them the right way up. We did this by statistically analyzing the text of the pages in order to find pages with very poorly OCRd text, then by OCR-ing those pages in various orientations and statistically analyzing the OCR output in order to find the correct orientation.

The statistical analysis of the text was done by looking up every word of the text in a spelling dictionary. While this method gave satisfactory results, we have discovered that it produces too many false positives, as it does not distinguish between a word with a single unrecognized character and a completely jumbled word of the kind that occurs on a page rotated upside down. We believe that better results might be obtained by using other methods, probably a statistical n-gram analysis.

An experimental tool that can rotate all the coordinates in an ALTO file, so that the elements can be set into their correct positions, has also been created, although the correct solution to this problem would be to rotate the image and re-OCR the text.

[23] University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse?newspaper=UB_00029 (accessed 23 Nov 2015).
[24] A version control system keeps the various versions of a document as it changes through time, so that they can be compared, old versions restored in case of an error, etc.; Wikipedia, the free encyclopedia, “Version control”, https://en.wikipedia.org/w/index.php?title=Version_control&oldid=690270257 (accessed 23 Nov 2015).

4. Conclusion

Everyday communication with users has shown that, after the first few months of promotion and online availability, the in-house developed tools for the presentation of full-text materials at the University Library have proved to be reliable and in accordance with user needs and expectations. Moreover, the number of users seems to be increasing, driven by the positive experience of existing users and by informal communication through social media. The availability of full-text files has been identified as the key added value within the framework of an effectively functioning search interface and METS/ALTO viewer.

We conclude that METS/ALTO is a highly versatile format which can easily be used in various ways and completely fulfills our needs for the creation, presentation and preservation of digitized documents. The fact that the format is XML-based facilitates the creation of tools for handling and processing it. Its versatility also facilitates inter-library cooperation in digitization and tool creation.

Резиме (Summary)

As a participant in the CIP ICT-PSP Europeana Newspapers project, the University Library “Svetozar Marković” received METS/ALTO files for 400,000 pages of Serbian historical newspapers. This was also the first time that librarians had encountered this kind of file, so neither the digital tools nor the expertise needed to handle them were available at the University Library or at any other institution in Serbia. Acquiring the necessary tools on the international market was impossible given the Library's financial circumstances, so the only possible solution was to develop the tools in-house. This paper presents the various digital tools that were developed, tested and implemented at the University Library and that now make it possible to carry out every element of a newspaper digitization process based on METS/ALTO files. It presents details of the search interface and the search engine, as well as details of the interface that lets users read and copy the full text of newspaper articles, which are at the same time presented as images of the newspaper pages. The description of the correction tool, which improves the accuracy of OCRd text through manual correction of errors in the text, is accompanied by statistics based on one year of use of the tool. Finally, one solution for the automatic detection of pages scanned or printed upside down, based on analysis of the OCRd text, is described, along with an analysis of the application of this solution across the entire collection of Serbian historical newspapers. The paper presents experiences relating to user demands, complaints and comments after using the collection through the in-house developed interface and materials improved with in-house developed tools. General conclusions and recommendations for further work in the digitization of newspaper collections, based on these experiences, are given at the end of the paper.

Кључне речи (Key words)

METS, ALTO, OCR, BnLViewer, Lucene/Solr
References

1. “Analyzers, Tokenizers, and Token Filters”, https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (accessed 23 Nov 2015).
2. “BnLViewer: A newspaper in the viewer”, http://sourceforge.net/p/bnlviewer/home/screenshots/ (accessed 23 Nov 2015).
3. CCS Content Conversion Specialists, “METS / ALTO introduction”, http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140909.pdf (accessed 23 Nov 2015).
4. “Data Import Request Handler”, https://wiki.apache.org/solr/DataImportHandler (accessed 23 Nov 2015).
5. Nicomsoft, “Optical Character Recognition (OCR) – How it works”, http://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ (accessed 23 Nov 2015).
6. Library of Congress, “ALTO: Technical Metadata for Optical Character Recognition (Standards, Library of Congress)”, http://www.loc.gov/standards/alto/ (accessed 23 Nov 2015).
7. Library of Congress, “Metadata Encoding and Transmission Standard (METS) Official Web Site”, http://www.loc.gov/standards/mets/ (accessed 23 Nov 2015).
8. Maurer, Yves, Presentation on BNL viewer. Amsterdam, 16 September 2013.
9. National Library of Luxembourg, http://www.eluxemburgensia.lu (accessed 23 Nov 2015).
10. “SchemaXml”, https://wiki.apache.org/solr/SchemaXml (accessed 23 Nov 2015).
11. “SolrQuerySyntax”, https://wiki.apache.org/solr/SolrQuerySyntax (accessed 23 Nov 2015).
12. The Apache Software Foundation, “Apache Lucene - Query Parser Syntax”, http://lucene.apache.org/core/3_6_0/queryparsersyntax.html (accessed 23 Nov 2015).
13. The Apache Software Foundation, “Apache Solr”, http://lucene.apache.org/solr/ (accessed 23 Nov 2015).
14. The Apache Software Foundation, “Package org.apache.lucene.codecs.lucene40”, http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations (accessed 23 Nov 2015).
16. University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/advanced-search (accessed 23 Nov 2015).
17. University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse (accessed 23 Nov 2015).
18. University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/browse?newspaper=UB_00029 (accessed 23 Nov 2015).
19. University Library “Svetozar Markovic”, “Search Guide”, http://www.unilib.rs/istorijske-novine/about#search-guide (accessed 23 Nov 2015).
20. Wikipedia, the free encyclopedia, “Levenshtein distance”, https://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=688663203 (accessed 23 Nov 2015).
21. Wikipedia, the free encyclopedia, “Optical character recognition explained”, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=686827150 (accessed 23 Nov 2015).
22. Wikipedia, the free encyclopedia, “Stemming”, https://en.wikipedia.org/w/index.php?title=Stemming&oldid=685786913 (accessed 23 Nov 2015).
23. Wikipedia, the free encyclopedia, “Version control”, https://en.wikipedia.org/w/index.php?title=Version_control&oldid=690270257 (accessed 23 Nov 2015).