Detecting Articles in a Digitized Finnish
Historical Newspaper Collection 1771–1929:
Early Results Using the PIVAJ Software
Kimmo Kettunen, Teemu Pääkkönen, Erno Liukkonen
The National Library of Finland, Mikkeli unit,
DH Projects
Pierrick Tranouez, Daniel Antelme, Thierry Paquet
LITIS laboratory
University of Rouen Normandy
France
Background
• The National Library of Finland (NLF) has a large
digitized collection of historical periodicals from 1771–1929,
openly available on the Web (digi.kansalliskirjasto.fi): 7.58 M pages,
mainly in Finnish and Swedish
• The data has been digitized on a page basis, and there is
no article structure in the data
• The page is not an informational unit for the user; the
article or news item is
• How to produce articles from the existing content?
DATeCH2019, May 8–10, 2019, Brussels, Belgium
An example: a page from Päivälehti from 1904: 8
columns, a relatively clear layout with some ads
An example: same issue’s ad pages
Messy data
• The layout of the pages varies considerably even within one
issue of a newspaper, and layouts change every
3–5 years within a single newspaper
• Different newspapers may also have different styles,
even though they usually follow the same principles
Article extraction out of the pages: state-of-the-art
results
• Article extraction from digitized newspaper pages with complex
layouts is not an easy task. Results of the biennial
ICDAR competition on historical newspaper layout
analysis show that current algorithms segment and label
about 80–85% of the pages correctly at best.
Article extraction out of the pages: state-of-the-art
results
• The latest round of the ICDAR competition, ICDAR 2017, was a
comparative evaluation of page segmentation and region
classification methods for documents with complex layouts.
The presented results covered seven methods: five submitted to the
competition and two state-of-the-art systems, the commercial
ABBYY FineReader® Engine 11 and the open-source Tesseract 3.04.
In the combined task of segmentation and classification of page
contents, all but one of the systems remain in
the range of 72.5 to 83% correctness.
• The best performing system, MHS 2017, joint work of Ho Chi Minh
City University of Technology (Ho Chi Minh City, Viet Nam) and
Chonnam National University (Gwangju, Republic of Korea),
achieves a clearly better performance of 90.6%.
Tools for the work
• There are several existing tools for article extraction,
both commercial products and research outputs
• After some groundwork examining the tools, we chose
PIVAJ, a machine-learning-based platform developed at the
LITIS laboratory of the University of Rouen Normandy
PIVAJ in brief
• PIVAJ has two parts: an online application and an offline
system. The offline system analyzes the digitized images of
newspapers in order to rebuild their logical
structure, from issues to sections to articles.
• The online system displays the resulting
analysis on the Web and offers additional functionality
such as transcription correction.
• We use only PIVAJ’s offline system for newspaper image
analysis and article marking.
Training data
• From a user perspective, the PIVAJ article annotation system is
machine learning software that is first taught with sample page
images that have been marked for layout structure with entities such as
title, body, vertical separator, horizontal separator, pictures, etc.
• We chose one of our most used newspapers, Uusi Suometar, to work
with
• We established a collection of 224 pages from 1869–1898: 56
issues, 4 pages each
• This data was marked with PIVAJ’s editor
• 168 pages of the data were used for training and 56 pages for
evaluation
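The 168/56 split can be sketched as below. This is a minimal illustration, not the procedure actually used: it assumes the split was made by whole issues (56 evaluation pages would then be 14 issues of 4 pages), and the issue identifiers, the random shuffling, and the seed are all hypothetical.

```python
import random

PAGES_PER_ISSUE = 4   # from the slide: 4 pages per issue
N_ISSUES = 56         # from the slide: 56 issues, 224 pages in total
N_EVAL_PAGES = 56     # from the slide: 56 pages held out for evaluation


def split_issues(issue_ids, n_eval_pages, pages_per_issue, seed=0):
    """Split whole issues into training and evaluation sets (assumed strategy)."""
    n_eval_issues = n_eval_pages // pages_per_issue
    shuffled = list(issue_ids)
    random.Random(seed).shuffle(shuffled)  # hypothetical: random assignment
    eval_issues = sorted(shuffled[:n_eval_issues])
    train_issues = sorted(shuffled[n_eval_issues:])
    return train_issues, eval_issues


# Hypothetical issue identifiers, used only for illustration.
issue_ids = [f"issue_{i:02d}" for i in range(1, N_ISSUES + 1)]
train_issues, eval_issues = split_issues(issue_ids, N_EVAL_PAGES, PAGES_PER_ISSUE)

train_pages = len(train_issues) * PAGES_PER_ISSUE
eval_pages = len(eval_issues) * PAGES_PER_ISSUE
print(train_pages, eval_pages)  # 168 56
```

Splitting by issue rather than by page keeps all pages of one issue on the same side of the split, which avoids evaluating on a layout the model has partly seen during training.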
Usage statistics for the top 20 newspapers in Digi:
Uusi Suometar (US) is the most used Finnish-language newspaper
A marked page for PIVAJ to learn
Training: effect of the number of columns
• We first trained an individual model on the issues
containing 3 columns in the training set and evaluated
the model on the issue in the development set with 3
columns. Subsequently, we repeated the same for the
remaining column counts, from 4 to 9.
• PIVAJ was able to learn and provide meaningful
predictions (measured by visual inspection) on the
development issues.
Training: effect of the number of columns
• We then trained a single model using all the 21 issues with varying
column numbers and evaluated the resulting model on all the 7
issues in the development set.
• Our hypothesis was that if PIVAJ had difficulties
handling the varying number of columns, we would observe
anomalies during training or, more importantly, in the predictions on
the development set. However, we saw no evidence of this: PIVAJ
again provided meaningful predictions and showed no undesirable
behavior on the development set (measured by visual inspection).
• Therefore, we adopted the single-model approach for the primary
experiments conducted on all 56 issues.
Evaluation of results
• We trained PIVAJ successfully, ran the experiments, and evaluated
the physical segmentation results using the layout evaluation
software developed by the PRImA research laboratory of the University of
Salford. The software can be applied after converting the ALTO XML
structure used by PIVAJ to PAGE XML.
• The PRImA software has been employed for the evaluation of the
biennial ICDAR competitions (2011/13/15/17).
• However, the specifics of the official competition evaluation are not
publicly available, and thus the competition evaluation is not
replicable. Therefore, we instead followed the evaluation presented
in Clausner et al. 2011, with three evaluation scenarios.
• https://www.primaresearch.org/tools
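The ALTO-to-PAGE conversion step mentioned above can be illustrated with a small sketch. It is not the converter used in this work: it only shows how the rectangle of an ALTO TextBlock (HPOS, VPOS, WIDTH, HEIGHT) maps to the corner-point string that PAGE XML stores in the points attribute of a region's Coords element. The sample ALTO fragment, the block IDs, and the assumption that all measurements are in pixels are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment; real PIVAJ output will contain much more.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1" WIDTH="2000" HEIGHT="3000"><PrintSpace>
    <TextBlock ID="TB1" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="50"/>
    <TextBlock ID="TB2" HPOS="100" VPOS="260" WIDTH="400" HEIGHT="900"/>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the sketch works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_blocks_to_page_points(alto_xml):
    """Map each ALTO TextBlock ID to a PAGE-style 'x,y x,y ...' points string."""
    root = ET.fromstring(alto_xml)
    regions = {}
    for elem in root.iter():
        if local_name(elem.tag) != "TextBlock":
            continue
        # Assumption: coordinates are pixel values (ALTO also allows other units).
        x = int(float(elem.get("HPOS")))
        y = int(float(elem.get("VPOS")))
        w = int(float(elem.get("WIDTH")))
        h = int(float(elem.get("HEIGHT")))
        # Corners in clockwise order starting from the top-left.
        corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
        regions[elem.get("ID")] = " ".join(f"{px},{py}" for px, py in corners)
    return regions


regions = alto_blocks_to_page_points(ALTO_SAMPLE)
print(regions["TB1"])  # 100,200 500,200 500,250 100,250
```

A full conversion would also have to carry over region types (title, body, separator, picture) so that the evaluation can score classification as well as segmentation.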
Different evaluation scenarios
• The General recognition scenario measures pure segmentation
performance, so misclassification errors are ignored
completely. Miss and partial miss errors are considered the worst
and have the highest weights. The weights for merge and split
errors are set to 50%, whereas false detection, as the least
important error type, has a weight of only 10%.
• The Text structure scenario evaluates region classification in the
context of a typical OCR system, focusing primarily on text but not
ignoring the non-text regions. Accordingly, this profile is
similar to the first, but misclassification of text is weighted
highest and all other misclassification weights are set to 10%.
• The third scenario, Indexing, is based on the OCR profile but
focuses solely on text, ignoring non-text regions.
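The weighting idea behind the General recognition scenario can be sketched as follows. This is a simplified toy score, not the PRImA evaluation formula: it assumes that "highest weight" means 1.0, that each error type is summarized as a fraction of the page area, and that the success rate is 100 times one minus the weighted error sum. The function name and the example error fractions are made up for illustration.

```python
# Weights as described for the General recognition scenario.
# Assumption: "highest weight" is taken to be 1.0.
GENERAL_RECOGNITION = {
    "miss": 1.0,
    "partial_miss": 1.0,
    "merge": 0.5,
    "split": 0.5,
    "false_detection": 0.1,
    "misclassification": 0.0,  # ignored completely in this scenario
}


def toy_success_rate(error_fractions, weights):
    """Return 100 * (1 - weighted error), clamped to the range 0..100.

    error_fractions maps an error type to the fraction of page area
    affected by it (hypothetical simplification of the real measure).
    """
    weighted_error = sum(
        weights.get(kind, 0.0) * fraction
        for kind, fraction in error_fractions.items()
    )
    return 100.0 * (1.0 - min(1.0, max(0.0, weighted_error)))


# Made-up error fractions for a single page, for illustration only.
page_errors = {
    "miss": 0.05,
    "merge": 0.20,
    "false_detection": 0.10,
    "misclassification": 0.30,
}
score = toy_success_rate(page_errors, GENERAL_RECOGNITION)
print(round(score, 1))
```

In this toy example the large misclassification share does not lower the score at all, while the much smaller miss share costs its full value, which is the behavior the scenario description implies.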
Evaluation results: on average, the three evaluation scenarios
reach success rates of 67.9, 76.1, and 92.2 over the whole evaluation set of 56
pages.
[Chart: success rate (0–100) per issue for 14 issues dated 1872/01/05 to 1898/07/03, arithmetic mean over pages. Series: General recognition, Text structure, Indexing scenario.]
Examples: a complex page successfully divided, with recognition rates
of 81.86, 89.95, and 96.54, above average
DATeCH2019, May 8–10, 2019, Brussels, Belgium
Examples: poor performance, with recognition rates of 57.4,
73.02, and 78.73, clearly below the averages
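The "arithmetic mean over pages" aggregation used in the results can be sketched with the two example pages. This is an illustration only: it assumes the three rates on each slide are listed in the order General recognition, Text structure, Indexing (the order of the chart legend), and the page labels and function name are hypothetical.

```python
from statistics import mean

# Assumed order of the three rates quoted on each example slide.
SCENARIOS = ("general_recognition", "text_structure", "indexing")

# Per-page success rates taken from the two example slides.
pages = {
    "above_average_example": (81.86, 89.95, 96.54),
    "below_average_example": (57.4, 73.02, 78.73),
}


def mean_per_scenario(page_rates):
    """Average each scenario's success rate over all given pages."""
    columns = zip(*page_rates.values())  # one column of rates per scenario
    return {name: mean(column) for name, column in zip(SCENARIOS, columns)}


averages = mean_per_scenario(pages)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

The same function applied to the four pages of one issue would give one point per series in the per-issue chart.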
Results overall are fair, if not remarkable
• If a page contains longer sections made up of short news
items of a few lines each, the individual items are not extracted well
unless they have clear starts, e.g. in bold; only the larger
section is found.
• The mean quality of the results is lowered by PIVAJ’s behavior
on advertisement-heavy pages. LITIS is currently working on
new solutions to overcome this weakness.
• On the whole, the results seem reasonable considering the
varying layouts of the different issues of Uusi Suometar over
the time span of the data.
• N.B. A quarter or more of the pages of a US issue can
consist of advertisements
Comparison with the performance of docWorks on the same
pages: docWorks finds 150 articles, PIVAJ 1013
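The two totals can be put on a per-page footing with a little arithmetic. The sketch below assumes that "the same pages" means the 56 evaluation pages; that reading is an assumption, and the variable names are hypothetical.

```python
# Totals reported on the slide.
ARTICLES_DOCWORKS = 150
ARTICLES_PIVAJ = 1013

# Assumption: the comparison covers the 56 evaluation pages.
N_PAGES = 56

docworks_per_page = ARTICLES_DOCWORKS / N_PAGES
pivaj_per_page = ARTICLES_PIVAJ / N_PAGES
ratio = ARTICLES_PIVAJ / ARTICLES_DOCWORKS

print(f"docWorks: {docworks_per_page:.1f} articles per page")
print(f"PIVAJ:    {pivaj_per_page:.1f} articles per page")
print(f"PIVAJ finds {ratio:.1f} times as many articles")
```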
[Chart: number of articles found (0–170) per issue for 14 issues dated 1872/01/05 to 1898/07/03. Series: Number of articles, docWorks; Number of articles, PIVAJ.]
Benefits for the user: clippings (so far manual and
image only)
A way to use extracted articles: automated clippings for
the user
Continuation
• A single newspaper like Uusi Suometar can consist of a
lot of pages: 86 068 pages from 1869–1918 → it takes some
time to run these through PIVAJ
• After that we shall start work on other popular
newspapers such as Aamulehti, Satakunnan Kansa, etc.
• The next development step may be a
new page model for PIVAJ for new newspapers (not
known yet).
Thank you!
Kimmo.Kettunen@helsinki.fi
The National Library of Finland, DH Projects
