12_N.Smolenski, M.Kostic, A.Sofronijevic

Belgrade City Library, 10-11 December 2015
Services Tailored to the 21st-Century User
In-house development of digital tools for handling METS/ALTO files at the University Library “Svetozar Markovic”
Nikola Smolenski
University Library “Svetozar Markovic”, Belgrade
Milena Kostic
Adam Sofronijevic
Summary
As a partner in the CIP ICT-PSP Europeana Newspapers project, the University Library “Svetozar Markovic” acquired METS/ALTO files for 400,000 pages of historical Serbian newspapers. This was the first time that librarians in Serbia had encountered files of this kind, and neither digital tools nor expertise for handling them were available at the University Library or at any other institution in Serbia. Acquiring such tools on the international market was far beyond the University Library's budget, so in-house development was the only available solution. This paper presents the digital tools developed, tested and implemented at the University Library that now support every step of the newspaper digitization workflow based on METS/ALTO files. The search interface and the search engine behind it are described in some detail, along with the presentation interface, which allows users to read, copy and paste the full text of newspaper articles that are simultaneously presented as images of newspaper pages. A correction tool that improves the accuracy of OCRd text through manual correction of errors is presented, with some statistics based on one year of experience with the tool. Finally, a solution for automatic detection of pages scanned or printed upside down, based on analysis of the OCRd text, is presented together with an analysis of its application across the whole collection of Serbian historical newspapers. The paper also reports on user demands, complaints and comments after using the collection through the in-house interface and the materials enhanced with in-house tools. Some general conclusions and recommendations for future work on the digitization of newspaper collections, based on these experiences, are also presented.
Key words
METS, ALTO, OCR, BnLViewer, Lucene/Solr
1. Introduction
The presentation of digital materials is a challenging task for libraries and other heritage institutions. There are different approaches to evaluating the interfaces of digital libraries, accompanied by a large body of literature presenting various stances and experiences. The University Library “Svetozar Markovic” has more than eight years of experience in presenting digitized materials, and the most valuable takeaway is that users value full-text documents much more than scanned images. An important milestone in the development of the digital library of the University Library in Belgrade was therefore 2013, when the first METS/ALTO files were acquired through the Europeana Newspapers CIP ICT-PSP project. The full potential of these files can be realized only through a proper user interface, which was not available at the University Library at the time. Through an exciting and sometimes difficult process of in-house development, such an interface was created and implemented in 2015. Before presenting its features, we give a short overview of the key terms that appear in the paper and are crucial for understanding the tools and processes described.
Optical character recognition (OCR) is a complex technology for the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text. OCR makes it possible to process scanned books, screenshots and photos containing text and to obtain editable documents such as TXT, DOC or PDF files, which can be searched electronically, stored more compactly, displayed online and used in machine processes. The most advanced OCR systems can handle almost any type of image, even complex ones such as scanned magazine pages with images and columns, or photos taken with a mobile phone. The OCR process consists of several steps, each of which is a set of related algorithms that performs one part of the job. Every step is equally important and must handle the given image correctly; otherwise, the whole process fails. [1][2]
METS (Metadata Encoding and Transmission Standard), established in 2001, is an XML-based open standard. Its schema is hosted [3] at the Library of Congress (LOC) and maintained by the METS Editorial Board. METS files describe digital objects that are to be preserved for a long time. They can embed different kinds of metadata, both descriptive and administrative, and may link to any digital object.
ALTO (Analyzed Layout and Text Object) is also an XML-based open standard, whose schema is likewise hosted [4] at the Library of Congress and maintained by the ALTO Board. An ALTO file usually contains the content of a single page and describes the layout of the printed page (styles, layout and block type information) in enough detail to rebuild the original page. It may also contain tags that carry additional information about the content (e.g. named entities).
The benefit of METS/ALTO files is that they are widely used by libraries and content providers, i.e. they represent a standard for digitization. In addition, they secure the long-term sustainability of digital objects, which can be handled easily and exchanged between parties. They support article and chapter segmentation, and PDF, EPUB, DAISY and other formats can be created from METS/ALTO files. [5]
[1] Wikipedia, the free encyclopedia, “Optical character recognition explained”, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=686827150 (accessed 23 Nov 2015).
[2] Nicomsoft, “Optical Character Recognition (OCR) – How it works”, http://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ (accessed 23 Nov 2015).
[3] Library of Congress, “Metadata Encoding and Transmission Standard (METS) Official Web Site”, http://www.loc.gov/standards/mets/ (accessed 23 Nov 2015).
[4] Library of Congress, “ALTO: Technical Metadata for Optical Character Recognition (Standards, Library of Congress)”, http://www.loc.gov/standards/alto/ (accessed 23 Nov 2015).
[5] CCS Content Conversion Specialists, “METS / ALTO introduction”, http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140909.pdf (accessed 23 Nov 2015).

2. Document search and display
A diagram of our document search and display infrastructure is shown in Figure 1.
Figure 1: diagram of the search and display servers.
The infrastructure consists of a Windows server that is the repository of METS/ALTO files and hosts
their viewer BnLViewer, a web server that hosts the search interface, and a backend Lucene/Solr server
used to search the documents.
2.1. BnLViewer
BnLViewer is a rich, interactive viewer for METS/ALTO files, developed and maintained by the National Library of Luxembourg; screenshots can be seen at http://sourceforge.net/p/bnlviewer/home/screenshots/. [6] Here we provide a short description of what it looks like. Thumbnails of all pages are displayed alongside the table of contents. The full page is displayed, and articles can be selected with the cursor. In the article view, one article is cut out so that it can be presented as a long vertical strip even if it spans several columns (or even pages) in the original layout. The text of the article is presented as well, as it was recognized by the OCR engine (which explains the errors). When a search is performed, the terms that are found are highlighted on the image and in the OCR text. [7]
BnLViewer is very well received and widely used by researchers, who would, in addition to its current functionality, like to have access to the full-text data as a linked data set, to be able to print individual articles, and to have a simple “microfilm-like” viewer for efficient viewing of an entire collection. [8]
[6] National Library of Luxembourg, http://www.eluxemburgensia.lu (accessed 23 Nov 2015).
2.2. Search interface
The search interface offered to users at http://www.unilib.rs/istorijske-novine/search has three sections: “Search”, “Advanced Search” and “Browse”. The backend search engine of the interface is Lucene/Solr, [9] developed by the Apache Software Foundation.
“Search” presents users with a text field where they can enter the desired keywords. The interface converts the user query into an appropriate Solr query, [10] sends it to the Solr server, and displays the results returned by Solr as a preview, with hyperlinks to the full documents in BnLViewer. For example, if a user searches for “beograd”, the corresponding Solr query is “+text:( beograd )”. The supported advanced query options [11] are searching for a phrase, searching with OR, and searching for pages that do not contain a phrase; the more advanced Lucene capabilities, [12] such as searching by wildcard or by Levenshtein distance, [13] are not offered to end users but are available to library staff.
“Advanced Search” [14] extends the search with the ability to search within a single periodical or within a range of dates, and to sort the results in various ways. It can construct more complex Solr queries; for example, a search for “beograd ILI zemun” (“ILI” being the Serbian word for “OR”) in the periodical “Beogradske opštinske novine” in the year 1900 is expanded to “+text:( beograd OR zemun ) +collection_id:00004 +date:[1900-01-01T12:00:00Z TO 1900-12-31T12:00:00Z]”.
“Browse” [15] enables users to select a periodical and a publication date; the corresponding issue is then displayed in BnLViewer. Information about the existing periodicals and dates is kept in a relational database, separately from the search index.
[7] “BnLViewer: A newspaper in the viewer”, http://sourceforge.net/p/bnlviewer/home/screenshots/ (accessed 23 Nov 2015).
[8] Maurer, Yves, Presentation on BNL viewer. Amsterdam, 16 September 2013.
[9] The Apache Software Foundation, “Apache Solr”, http://lucene.apache.org/solr/ (accessed 23 Nov 2015).
[10] “SolrQuerySyntax”, https://wiki.apache.org/solr/SolrQuerySyntax (accessed 23 Nov 2015).
[11] University Library “Svetozar Markovic”, “Search Guide”, http://www.unilib.rs/istorijske-novine/about#search-guide (accessed 23 Nov 2015).
[12] The Apache Software Foundation, “Apache Lucene - Query Parser Syntax”, http://lucene.apache.org/core/3_6_0/queryparsersyntax.html (accessed 23 Nov 2015).
[13] Levenshtein distance is the minimal number of operations necessary to transform one string of characters into another; it can be used to search for similar words, since these have a small Levenshtein distance. Wikipedia, the free encyclopedia, “Levenshtein distance”, https://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=688663203 (accessed 23 Nov 2015).
[14] University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/advanced-search (accessed 23 Nov 2015).
[15] http://www.unilib.rs/istorijske-novine/browse (accessed 23 Nov 2015).

The interface is available in Serbian, English and German, including localization of the
query syntax.
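The query expansion described in this section can be illustrated with a short Python sketch. It is not the library's actual code: the function name `build_solr_query` and its parameters are hypothetical, only a whole-year date range like the one in the example is handled, and escaping of special characters in user input is omitted. The field names `text`, `collection_id` and `date` and the localized keyword “ILI” are taken from the examples above.

```python
def build_solr_query(terms, collection_id=None, year=None, or_keyword="ILI"):
    """Expand a user query into a Solr query string (illustrative sketch)."""
    # Translate the localized OR keyword into Solr's OR operator.
    words = ["OR" if w == or_keyword else w for w in terms.split()]
    clauses = ["+text:( " + " ".join(words) + " )"]
    # Optionally restrict the search to one periodical.
    if collection_id is not None:
        clauses.append("+collection_id:" + collection_id)
    # Optionally restrict the search to one calendar year.
    if year is not None:
        clauses.append(
            "+date:[{0}-01-01T12:00:00Z TO {0}-12-31T12:00:00Z]".format(year)
        )
    return " ".join(clauses)


if __name__ == "__main__":
    print(build_solr_query("beograd"))
    print(build_solr_query("beograd ILI zemun", "00004", 1900))
```

Under these assumptions, the two calls at the bottom reproduce the simple and the advanced query quoted in the text.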
2.3. Search engine improvements
Lucene/Solr is highly satisfactory for our purposes as regards its storage capacity [16] and search speed. [17] However, its support for the Serbian language had to be improved so that both Cyrillic and Latin texts can be searched using either the Cyrillic or the Latin alphabet. Furthermore, as a number of Internet users in Serbia write in “bald” Latin (Latin script without diacritics), we wanted to support that as well. This was done by creating a Lucene filter [18] that converts Cyrillic and Latin text to “bald” Latin (the filter does not affect the display of the searched text). The use of the filter is explained in detail at https://wiki.apache.org/solr/SerbianLanguageSupport.
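The normalization performed by such a filter can be sketched in a few lines of Python. This is only an illustration of the idea, not the Lucene filter itself: the function name `to_bald_latin` is hypothetical, and the mapping of “ђ”/“đ” to “dj” is an assumption about how that letter is best rendered without diacritics.

```python
# Serbian Cyrillic letters mapped directly to Latin without diacritics.
CYRILLIC_TO_BALD = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "dj", "е": "e",
    "ж": "z", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "c", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "c",
    "џ": "dz", "ш": "s",
}
# Serbian Latin letters with diacritics mapped to their "bald" forms.
LATIN_TO_BALD = {"č": "c", "ć": "c", "š": "s", "ž": "z", "đ": "dj"}
BALD_TABLE = {**CYRILLIC_TO_BALD, **LATIN_TO_BALD}


def to_bald_latin(text):
    """Lowercase the text and reduce Cyrillic or Latin Serbian to bald Latin."""
    return "".join(BALD_TABLE.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    for word in ("Београд", "Čačak", "Ђорђе", "Đorđe"):
        print(word, "->", to_bald_latin(word))
```

Applying the same function to both the indexed text and the query is what lets a query typed in any of the three forms match text written in any of them.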
A stemmer [19] for the Serbian language is currently being developed, which will further improve Lucene's search capabilities. An experimental stemmer that stems only the most common Serbian names was observed to return 100% more search results in searches for supported names, with no reduction in relevance.
In our Lucene schema, [20] every page is indexed as a separate document. [21] Together with the page text, we store the page number, other relevant information about the document the page comes from (such as its document identifier or document date), and information about the corresponding ALTO file.
Documents are imported into Lucene by a command-line script that scans all the ALTO files in our repository and submits their content to Solr. The script uses the information about each ALTO file stored in Lucene to check whether the file has been modified in the repository.
Although file types are usually imported into Lucene through a Solr data import handler, [22] we found it simpler to convert ALTO files to raw text and submit them externally as such than to create a METS/ALTO handler. During the conversion, the script also joins words hyphenated at line ends.
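The conversion of an ALTO file to raw text, including the joining of hyphenated words, might look roughly like the following Python sketch. It is not the import script itself. It assumes that words are stored in the `CONTENT` attribute of `String` elements and that a line-end hyphen is marked by a `HYP` element, as the ALTO schema allows; files that encode hyphenation differently would need other handling. The function names are hypothetical, and XML namespaces are ignored so that the sketch works across ALTO versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from a tag name, if present."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml):
    """Return the words of an ALTO page as one string, joining hyphenated words."""
    root = ET.fromstring(alto_xml)
    words = []
    join_next = False  # True right after a line-end hyphen (HYP element)
    for elem in root.iter():  # document order
        kind = local_name(elem.tag)
        if kind == "HYP":
            join_next = True
        elif kind == "String":
            content = elem.get("CONTENT", "")
            if join_next and words:
                words[-1] += content  # second half of a hyphenated word
            else:
                words.append(content)
            join_next = False
    return " ".join(words)


SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
<Layout><Page><PrintSpace><TextBlock>
<TextLine><String CONTENT="stare"/><SP/><String CONTENT="no"/><HYP CONTENT="-"/></TextLine>
<TextLine><String CONTENT="vine"/><SP/><String CONTENT="Beograda"/></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""

if __name__ == "__main__":
    print(alto_to_text(SAMPLE))
```

The resulting string is what would then be submitted to Solr as the page text.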
[16] The Apache Software Foundation, “Package org.apache.lucene.codecs.lucene40”, http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations (accessed 23 Nov 2015).
[17] We observed that a search query usually lasts around 200 milliseconds.
[18] “Analyzers, Tokenizers, and Token Filters”, https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (accessed 23 Nov 2015).
[19] A stemmer is a computer program that reduces words to their stems, so that the various grammatical forms of a word can be found by searching for any one of them. Wikipedia, the free encyclopedia, “Stemming”, https://en.wikipedia.org/w/index.php?title=Stemming&oldid=685786913 (accessed 23 Nov 2015).
[20] “SchemaXml”, https://wiki.apache.org/solr/SchemaXml (accessed 23 Nov 2015).
[21] A drawback of this method is that it is not possible to find phrases that span multiple pages; but these are very rare, and it is usually not possible to find them anyway since they are interrupted by page headers and footers.
[22] “Data Import Request Handler”, https://wiki.apache.org/solr/DataImportHandler (accessed 23 Nov 2015).

3. Document correction
3.1. Manual correction tool
To increase the accuracy of the OCRd text, we built AltoEdit, a web tool that enables manual correction of ALTO files. The tool has been used to correct the 1914 and 1915 volumes of the journal “Ratni dnevnik” [23] (the war diary of the Serbian army during WWI), 984 pages of text in total. The corrected files were observed to return around 10% more search results, with no reduction in relevance.
The tool works as follows. A command-line script modifies the ALTO files so that every relevant element without an ID gets one. Another script then imports the content of the ALTO files into a relational database, keeping track of the IDs in the ALTO files. Correctors are provided with a web interface that displays the content of an ALTO file line by line, together with the corresponding cutout of the page image. The correctors can then edit the words, which are saved in the database; the database preserves the history of all edits. Finally, a third command-line script exports the corrected words from the database back to the ALTO files according to their IDs.
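The first and the last step of this workflow, assigning missing IDs and writing corrected words back by ID, can be sketched as follows. This is an illustration under stated assumptions rather than the AltoEdit code: the set of “relevant” elements, the `AE` identifier prefix and both function names are hypothetical, and the corrections are passed in as a plain dictionary instead of being read from a database.

```python
import xml.etree.ElementTree as ET

# Assumption: the elements a corrector needs to address.
RELEVANT = {"TextBlock", "TextLine", "String"}


def assign_missing_ids(root, prefix="AE"):
    """Give every relevant element without an ID a new, unused one."""
    used = {e.get("ID") for e in root.iter() if e.get("ID")}
    counter = 0
    added = 0
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] not in RELEVANT or elem.get("ID"):
            continue
        counter += 1
        while f"{prefix}{counter:06d}" in used:  # skip IDs already taken
            counter += 1
        new_id = f"{prefix}{counter:06d}"
        elem.set("ID", new_id)
        used.add(new_id)
        added += 1
    return added


def apply_corrections(root, corrections):
    """Write corrected words (a dict of ID -> text) back into CONTENT attributes."""
    changed = 0
    for elem in root.iter():
        new_text = corrections.get(elem.get("ID"))
        if new_text is not None and elem.get("CONTENT") != new_text:
            elem.set("CONTENT", new_text)
            changed += 1
    return changed


if __name__ == "__main__":
    page = ET.fromstring(
        '<alto><TextLine ID="AE000001"><String CONTENT="Beogard"/></TextLine></alto>'
    )
    print(assign_missing_ids(page))                         # number of IDs added
    print(apply_corrections(page, {"AE000002": "Beograd"}))  # number of words changed
    print(ET.tostring(page, encoding="unicode"))
```

Because the IDs are stable, the export step only needs the pairs of ID and corrected word, whatever storage sits in between.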
Unfortunately, while using the tool we discovered that it has certain drawbacks. The database has a relatively complex structure, mimicking all the complexity of ALTO files, yet the tool does not need any of the benefits that a relational database provides. The correctors also reported that working line by line is tedious and that working on whole pages or columns of text would be preferable.
Our recommendation for future tools of this type is therefore that ALTO files be kept in a version control system [24] and edited directly, and that correctors be provided with more comfortable interfaces.
3.2. Automatic detection of rotated pages
While surveying the corpus we discovered that pages turned upside down are a relatively common error. Since this is very inconvenient for the reader, we wanted to see whether it is possible to identify such pages and turn them the right way up.
We did this by statistically analyzing the text of the pages in order to find pages with very poorly OCRd text, then OCR-ing those pages in various orientations and statistically analyzing the OCR output in order to find the correct orientation.
[23] http://www.unilib.rs/istorijske-novine/browse?newspaper=UB_00029 (accessed 23 Nov 2015).
[24] A version control system keeps the successive versions of a document as it changes over time, so that they can be compared, old versions restored in case of an error, etc. Wikipedia, the free encyclopedia, “Version control”, https://en.wikipedia.org/w/index.php?title=Version_control&oldid=690270257 (accessed 23 Nov 2015).
The statistical analysis of the text was done by looking up every word of the text in a spelling dictionary. While this method gave satisfactory results, we discovered that it produces too many false positives, because it does not distinguish between a word with a single unrecognized character and the completely jumbled words that occur on a page rotated upside down. We believe that better results might be obtained by other methods, probably a statistical n-gram analysis.
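The dictionary-based analysis and the choice of orientation can be sketched in Python as follows. This is a minimal illustration, not the tool that was used: the function names are hypothetical, the dictionary is a plain set of lowercase words, and the OCR step is not shown, so the recognized text for each candidate orientation is assumed to be supplied by the caller.

```python
import re

# One or more Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def dictionary_hit_ratio(text, dictionary):
    """Return the share of words in the text that are found in the dictionary."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


def best_orientation(ocr_by_angle, dictionary):
    """Pick the rotation angle whose OCR output scores highest."""
    return max(
        ocr_by_angle,
        key=lambda angle: dictionary_hit_ratio(ocr_by_angle[angle], dictionary),
    )


if __name__ == "__main__":
    words = {"stare", "novine", "beograda"}
    print(dictionary_hit_ratio("Stare novine Beograda", words))
    print(best_orientation({0: "ajeis auiaou", 180: "stare novine"}, words))
```

The sketch also makes the weakness noted above visible: a word with one wrong character and a completely jumbled word both count as a single miss.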
An experimental tool that can rotate all the coordinates in an ALTO file, so that the elements are set into their correct positions, has also been created, although the correct solution to this problem would be to rotate the image and re-OCR the text.
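For a page turned by 180 degrees, rotating the coordinates amounts to mirroring every bounding box through the center of the page. The following Python sketch shows the arithmetic; it is not the experimental tool itself. It assumes that the `Page` element carries `WIDTH` and `HEIGHT` in the same unit as the `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes of the other elements, and it leaves the reading order of the elements unchanged. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def _fmt(value):
    """Write whole numbers without a trailing '.0'."""
    return str(int(value)) if float(value).is_integer() else str(value)


def rotate_alto_180(root):
    """Rotate all bounding boxes in an ALTO page by 180 degrees, in place."""
    page = next(e for e in root.iter() if e.tag.rsplit("}", 1)[-1] == "Page")
    page_w = float(page.get("WIDTH"))
    page_h = float(page.get("HEIGHT"))
    for elem in root.iter():
        if elem is page:
            continue
        box = [elem.get(k) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")]
        if None in box:
            continue  # element has no complete bounding box
        x, y, w, h = map(float, box)
        # The old bottom-right corner becomes the new top-left corner.
        elem.set("HPOS", _fmt(page_w - x - w))
        elem.set("VPOS", _fmt(page_h - y - h))


if __name__ == "__main__":
    doc = ET.fromstring(
        '<alto><Layout><Page WIDTH="1000" HEIGHT="2000">'
        '<String HPOS="100" VPOS="200" WIDTH="300" HEIGHT="50" CONTENT="x"/>'
        "</Page></Layout></alto>"
    )
    rotate_alto_180(doc)
    print(ET.tostring(doc, encoding="unicode"))
```

Width and height of each box are unchanged by the rotation, and applying the function twice restores the original coordinates.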
4. Conclusion
Everyday communication with users showed that, after the first few months of promotion and online availability, the in-house tools for presenting full-text materials at the University Library have proved reliable and in line with user needs and expectations. Moreover, the number of users appears to be increasing, driven by the positive experience of existing users and by informal communication through social media. The availability of full-text files has been identified as the key added value within the framework of an effective search interface and METS/ALTO viewer.
We conclude that METS/ALTO is a highly versatile format that can easily be used in various ways and that completely fulfills our needs for the creation, presentation and preservation of digitized documents. The fact that the format is XML-based facilitates the creation of tools for handling and processing it. Its versatility also facilitates inter-library cooperation in digitization and tool creation.
Résumé
As a participant in the CIP ICT-PSP Europeana Newspapers project, the University Library “Svetozar Marković” received METS/ALTO files for 400,000 pages of Serbian historical newspapers. On this occasion librarians encountered this type of file for the first time, so neither the digital tools nor the expertise needed to handle such files were available at the University Library or at any other institution in Serbia. Purchasing the necessary tools on the international market was impossible given the Library's financial circumstances, so the only possible solution was to develop the tools independently. This paper presents the various digital tools that were developed, tested and implemented at the University Library and that now make it possible to implement all elements of a newspaper digitization process based on METS/ALTO files. It presents selected details of the search interface and the search engine, as well as of the interface that lets users read and copy the full text of newspaper articles, which are at the same time presented as images of newspaper pages. The description of the correction tool, which improves the accuracy of OCRd text through manual correction of errors, includes statistics based on one year of use of the tool. Finally, the paper describes a solution for automatic detection of pages scanned or printed upside down, based on analysis of the OCRd text, together with an analysis of the application of this solution to the whole collection of Serbian historical newspapers. The paper also presents experiences relating to user requests, complaints and comments after using the collection through the independently developed interface and the materials improved with independently developed tools. General conclusions and recommendations for further work on the digitization of newspaper collections, based on these experiences, are given at the end of the paper.
Key words
METS, ALTO, OCR, BnLViewer, Lucene/Solr

References
1. “Analyzers, Tokenizers, and Token Filters”, https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters (accessed 23 Nov 2015).
2. “BnLViewer: A newspaper in the viewer”, http://sourceforge.net/p/bnlviewer/home/screenshots/ (accessed 23 Nov 2015).
3. CCS Content Conversion Specialists, “METS / ALTO introduction”, http://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140909.pdf (accessed 23 Nov 2015).
4. “Data Import Request Handler”, https://wiki.apache.org/solr/DataImportHandler (accessed 23 Nov 2015).
5. Nicomsoft, “Optical Character Recognition (OCR) – How it works”, http://www.nicomsoft.com/optical-character-recognition-ocr-how-it-works/ (accessed 23 Nov 2015).
6. Library of Congress, “ALTO: Technical Metadata for Optical Character Recognition (Standards, Library of Congress)”, http://www.loc.gov/standards/alto/ (accessed 23 Nov 2015).
7. Library of Congress, “Metadata Encoding and Transmission Standard (METS) Official Web Site”, http://www.loc.gov/standards/mets/ (accessed 23 Nov 2015).
8. Maurer, Yves, Presentation on BNL viewer. Amsterdam, 16 September 2013.
9. National Library of Luxembourg, http://www.eluxemburgensia.lu (accessed 23 Nov 2015).
10. “SchemaXml”, https://wiki.apache.org/solr/SchemaXml (accessed 23 Nov 2015).
11. “SolrQuerySyntax”, https://wiki.apache.org/solr/SolrQuerySyntax (accessed 23 Nov 2015).
12. The Apache Software Foundation, “Apache Lucene - Query Parser Syntax”, http://lucene.apache.org/core/3_6_0/queryparsersyntax.html (accessed 23 Nov 2015).
13. The Apache Software Foundation, “Apache Solr”, http://lucene.apache.org/solr/ (accessed 23 Nov 2015).
14. The Apache Software Foundation, “Package org.apache.lucene.codecs.lucene40”, http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/codecs/lucene40/package-summary.html#Limitations (accessed 23 Nov 2015).
16. University Library “Svetozar Markovic”, “Searchable Digitized Historical Newspapers”, http://www.unilib.rs/istorijske-novine/advanced-search (accessed 23 Nov 2015).
17. http://www.unilib.rs/istorijske-novine/browse (accessed 23 Nov 2015).
18. http://www.unilib.rs/istorijske-novine/browse?newspaper=UB_00029 (accessed 23 Nov 2015).
19. University Library “Svetozar Markovic”, “Search Guide”, http://www.unilib.rs/istorijske-novine/about#search-guide (accessed 23 Nov 2015).
20. Wikipedia, the free encyclopedia, “Levenshtein distance”, https://en.wikipedia.org/w/index.php?title=Levenshtein_distance&oldid=688663203 (accessed 23 Nov 2015).
21. Wikipedia, the free encyclopedia, “Optical character recognition explained”, https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&oldid=686827150 (accessed 23 Nov 2015).
22. Wikipedia, the free encyclopedia, “Stemming”, https://en.wikipedia.org/w/index.php?title=Stemming&oldid=685786913 (accessed 23 Nov 2015).
23. Wikipedia, the free encyclopedia, “Version control”, https://en.wikipedia.org/w/index.php?title=Version_control&oldid=690270257 (accessed 23 Nov 2015).
