SlideShare a Scribd company logo
1 of 34
Download to read offline
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Kimmo Kettunen 1
, Timo Honkela 1,2
, Krister Lindén 2
,
Pekka Kauppinen 2
, Tuula Pääkkönen 1
& Jukka Kervinen 1
Analyzing and Improving
the Quality of a Historical News Collection
using Language Technology and
Statistical Machine Learning Methods
IFLA Pre-Conference
Geneva, Switzerland,
13th of August, 2014
1
2
Presented by
Timo Honkela
in
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
HELSINKI MIKKELI
Department of
Modern Languages
Language Technology
Center for Preservation
and Digitisation
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
www.fmi.fi http://oppimateriaalit.internetix.fi
HonkeLA
 KettuNEN
KauppiNEN
PääkköNEN
 KerviNEN
Lindén
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Structure of the presentation
● Some background on the digitalization
process
● Introducing the paper content:
analysis and correction of OCR results
● Discussion on future steps:
In-depth analysis of newspaper contents
to promote research in humanities and
social sciences
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Historical newspaper collection
● The National Library of Finland has digitized a
large proportion of the historical newspapers
published in Finland between 1771 and 1910
(Bremer-Laamanen 2001, 2005).
● This collection contains approximately 1.95
million pages in Finnish and Swedish
● According to Legal Deposit law, the National
Library of Finland receives a copy of each
newspaper and magazine published in Finland.
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Digitisation of the
historical newspaper collection
● In the post-processing phase, the material is
processed so that it can be shared to the library
sector, researchers, and the wide public.
● The scanned images are enhanced and run
through background software and processes
which create METS/ALTO metadata (CCS
Docworks)
● The optical character recognition (OCR) is
conducted at the same time in order to get the text
content from the materials.
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Two channels
● Search and exploration interface (“Digi”)
– Approximate search, focusing based on time/place,
indexed contents,
index creation using morphological analysis, etc.
– Digitalkoot: enables the public to collectively mark
and collect articles (crowdsourcing)
● Corpus (FIN-CLARIN)
– Mainly used by linguists
– Includes keyword-in-context (n-gram) view
– Morphological and syntactical analysis results
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Search interface
http://digi.kansalliskirjasto.fi
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
FIN-CLARIN corpus
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
OCR Challenges
● Regardless of recent development of the OCR
software, there are still challenges with it, as
some material is very old, with
– varying paper and print quality,
– varying number of columns and layout patterns,
– different languages (mainly Finnish and Swedish
but also French, German, etc.), and
– and varying font types (fraktur and antiqua)
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
OCR Challenges
● The amount of material is such that
human efforts – even crowdsourced –
can only be a partial solution
● Fully or partially automated processes
are needed
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
A very long tail of low frequency forms...
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
zzhdysvautki Yhdyspankki
v, u, p ? u, n, ll ?
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
taioafliftiutpn tavallisuuden
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Sources of complexity
Word (lexeme)
Inflections
Typos
Recognition errors
Historical differences
“Recognized” surface word
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Inflections:
Complexity of
Finnish at the
level of word
forms
Kimmo Koskenniemi (2013):
Johdatus kieliteknologiaan,
sen merkitykseen ja
sovelluksiin
(Introduction to language
technology, its significance and
applications)
https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Typos
Not a major source of problem but they do exist
Basel
Most likely
not a stain
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Historical differences
● All the time, new names and words
are being introduced
● Even more static morphological aspects
evolve over centuries
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Net outcome
● A collection of millions of newspaper
pages gives rise to a list of hundreds
of millions of different word forms
that have been found in the process
● A large proportion of these forms
is not correct
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Detection and correction
● Improving OCR quality – not considered here
● Improving the OCR output based on linguistic
knowledge and statistical considerations
– Detecting incorrect forms
– Correcting the incorrect form
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Introduction to the basic ideas
● Detection
– Morphological analyzer
– Special dictionaries (e.g. names)
– N-grams
● Correction
– Transformation rules created through
a supervised learning scheme
– Edit distance approach using corpus statistics
– Weighted edit distance based on letter shapes
– Future: context information (problem of sparsity)
Please see
the paper for
methodological
details and
analysis results
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Similarity diagram of Fraktur letter shapes
(a self-organizing map)
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Socio-Historical Text Mining
of Newspaper Collections
Research direction
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Areas of analysis
● Named entity recognition
(people, organizations, places, events)
● Time series analysis
● Social network analysis
● Topic modeling
cf. Virginie Fortun's
presentation
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Areas of analysis
● Multidimensional sentiment analysis
● Analysis of social and
historical context
● Intercultural and
multilingual analysis
● Analysis of point of view
● Analysis of subjective
understanding
Stella Wisdom & Neil Smyth
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Earlier related results
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Learning meaning from context:
Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Multidimensional sentiment
using the PERMA model
● Seligman and his colleagues has developed the
PERMA model that addresses different aspects of
wellbeing.
● The model includes five components related to
subjective well-being:
– Positive emotion (P),
– Engagement (E),
– Relationships (R),
– Meaning (M) and
– Achievement (A) Honkela, Korhonen, Lagus & Saarinen 2014
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
PERMA profiles
of different corpora
Honkela, Korhonen, Lagus & Saarinen 2014
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Timo Honkela, Juha Raitio, Krista Lagus, Ilari
T. Nieminen, Nina Honkela, and Mika Pantzar:
Subjects on objects in contexts:
Using GICA method to quantify
epistemological subjectivity
(IJCNN 2012)
Analysis of the subjective
meaning: word 'health'
Analysis of the State of
the Union Adresses
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Socio-Historical Text Mining
of Newspaper Collections
A call for interdisciplinary international
collaboration
Libraries, researchers within journalism, corpus
linguistics, history, sociology, political science,
psychology, computer science, machine learning, etc.
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Merci!
Danke schön!
Grazie!
Multumesc!
¡Gracias!
Thank you!
Kiitos!
Tack!
謝謝!
Σας ευχαριστούμε!

More Related Content

Similar to Analyzing Historical News Collection Quality with NLP

Internationalisation and the initial teacher education curriculum’
Internationalisation and the initial teacher education curriculum’Internationalisation and the initial teacher education curriculum’
Internationalisation and the initial teacher education curriculum’Ton Koenraad
 
Mobile Learning in Modern Language Education
Mobile Learning in Modern Language EducationMobile Learning in Modern Language Education
Mobile Learning in Modern Language Educationcutrimschmid
 
Euro call2009 iwb_h_utrecht_hheidelbergprojects
Euro call2009 iwb_h_utrecht_hheidelbergprojectsEuro call2009 iwb_h_utrecht_hheidelbergprojects
Euro call2009 iwb_h_utrecht_hheidelbergprojectsTon Koenraad
 
Jhade Pilgrim - Resume & Portfolio 1
Jhade Pilgrim - Resume & Portfolio 1Jhade Pilgrim - Resume & Portfolio 1
Jhade Pilgrim - Resume & Portfolio 1Jhade Pilgrim
 
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptxDataScienceConferenc1
 
Plain language in norway’s civil service
Plain language in norway’s civil servicePlain language in norway’s civil service
Plain language in norway’s civil serviceClarity2010
 
Pdf final october 15th school improvement seminar
Pdf final october 15th school improvement seminarPdf final october 15th school improvement seminar
Pdf final october 15th school improvement seminarclairematthews
 
SLanguages2008 Vitaal
SLanguages2008   VitaalSLanguages2008   Vitaal
SLanguages2008 Vitaalslanguages
 
Ossiannilsson sequent master class barcelona2014
Ossiannilsson sequent master class barcelona2014Ossiannilsson sequent master class barcelona2014
Ossiannilsson sequent master class barcelona2014Ebba Ossiannilsson
 
Timo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of KnowledgeTimo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of KnowledgeTimo Honkela
 
20220602_QMC22_Slide.pdf
20220602_QMC22_Slide.pdf20220602_QMC22_Slide.pdf
20220602_QMC22_Slide.pdfShingo Nahatame
 
Learnwell 2012
Learnwell 2012Learnwell 2012
Learnwell 2012LANGLETE
 
Vitaal project presented at Slanguages 2008 conference
Vitaal project presented at Slanguages 2008 conferenceVitaal project presented at Slanguages 2008 conference
Vitaal project presented at Slanguages 2008 conferenceTon Koenraad
 

Similar to Analyzing Historical News Collection Quality with NLP (20)

Teacher Competences for basic literacy, by Qarin Franker
Teacher Competences for basic literacy, by Qarin FrankerTeacher Competences for basic literacy, by Qarin Franker
Teacher Competences for basic literacy, by Qarin Franker
 
Finsocinenglish2012
Finsocinenglish2012Finsocinenglish2012
Finsocinenglish2012
 
LAC UWC 0810
LAC UWC 0810LAC UWC 0810
LAC UWC 0810
 
Internationalisation and the initial teacher education curriculum’
Internationalisation and the initial teacher education curriculum’Internationalisation and the initial teacher education curriculum’
Internationalisation and the initial teacher education curriculum’
 
Besa Selimi CV
Besa Selimi CVBesa Selimi CV
Besa Selimi CV
 
Mobile Learning in Modern Language Education
Mobile Learning in Modern Language EducationMobile Learning in Modern Language Education
Mobile Learning in Modern Language Education
 
Lorraine Hirst - General CV Dec 2015
Lorraine Hirst - General CV Dec 2015Lorraine Hirst - General CV Dec 2015
Lorraine Hirst - General CV Dec 2015
 
Euro call2009 iwb_h_utrecht_hheidelbergprojects
Euro call2009 iwb_h_utrecht_hheidelbergprojectsEuro call2009 iwb_h_utrecht_hheidelbergprojects
Euro call2009 iwb_h_utrecht_hheidelbergprojects
 
General CV March 2016
General CV March 2016General CV March 2016
General CV March 2016
 
Jhade Pilgrim - Resume & Portfolio 1
Jhade Pilgrim - Resume & Portfolio 1Jhade Pilgrim - Resume & Portfolio 1
Jhade Pilgrim - Resume & Portfolio 1
 
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
[DSC Europe 23] Slobodan Markovic - NLP for Serbian.pptx
 
OlgaS.2015-CV
OlgaS.2015-CVOlgaS.2015-CV
OlgaS.2015-CV
 
Plain language in norway’s civil service
Plain language in norway’s civil servicePlain language in norway’s civil service
Plain language in norway’s civil service
 
Pdf final october 15th school improvement seminar
Pdf final october 15th school improvement seminarPdf final october 15th school improvement seminar
Pdf final october 15th school improvement seminar
 
SLanguages2008 Vitaal
SLanguages2008   VitaalSLanguages2008   Vitaal
SLanguages2008 Vitaal
 
Ossiannilsson sequent master class barcelona2014
Ossiannilsson sequent master class barcelona2014Ossiannilsson sequent master class barcelona2014
Ossiannilsson sequent master class barcelona2014
 
Timo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of KnowledgeTimo Honkela: Spaces of Knowledge
Timo Honkela: Spaces of Knowledge
 
20220602_QMC22_Slide.pdf
20220602_QMC22_Slide.pdf20220602_QMC22_Slide.pdf
20220602_QMC22_Slide.pdf
 
Learnwell 2012
Learnwell 2012Learnwell 2012
Learnwell 2012
 
Vitaal project presented at Slanguages 2008 conference
Vitaal project presented at Slanguages 2008 conferenceVitaal project presented at Slanguages 2008 conference
Vitaal project presented at Slanguages 2008 conference
 

More from Timo Honkela

Timo Honkela: Meaning negotiations as phenomenon and as languages technology...
 Timo Honkela: Meaning negotiations as phenomenon and as languages technology... Timo Honkela: Meaning negotiations as phenomenon and as languages technology...
Timo Honkela: Meaning negotiations as phenomenon and as languages technology...Timo Honkela
 
Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...
Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...
Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...Timo Honkela
 
Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...
Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...
Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...Timo Honkela
 
Timo Honkela: From early to later Wittgenstein and Artificial Intelligence
Timo Honkela: From early to later Wittgenstein and Artificial IntelligenceTimo Honkela: From early to later Wittgenstein and Artificial Intelligence
Timo Honkela: From early to later Wittgenstein and Artificial IntelligenceTimo Honkela
 
Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...
Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...
Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...Timo Honkela
 
Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...
Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...
Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...Timo Honkela
 
Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017
Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017
Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017Timo Honkela
 
Timo Honkela: Turning quantity into quality and making concepts visible using...
Timo Honkela: Turning quantity into quality and making concepts visible using...Timo Honkela: Turning quantity into quality and making concepts visible using...
Timo Honkela: Turning quantity into quality and making concepts visible using...Timo Honkela
 
Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....
Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....
Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....Timo Honkela
 
Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...
Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...
Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...Timo Honkela
 
Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...
Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...
Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...Timo Honkela
 
Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...
Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...
Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...Timo Honkela
 
Timo Honkela: Digitalisaatio tulevaisuudessa
Timo Honkela: Digitalisaatio tulevaisuudessaTimo Honkela: Digitalisaatio tulevaisuudessa
Timo Honkela: Digitalisaatio tulevaisuudessaTimo Honkela
 
Timo Honkela: Semantic and pragmatics representations of large text corpora
Timo Honkela: Semantic and pragmatics representations of large text corporaTimo Honkela: Semantic and pragmatics representations of large text corpora
Timo Honkela: Semantic and pragmatics representations of large text corporaTimo Honkela
 
Timo Honkela: Analysis of Qualitative Data using Machine Learning Methods
Timo Honkela: Analysis of Qualitative Data using Machine Learning MethodsTimo Honkela: Analysis of Qualitative Data using Machine Learning Methods
Timo Honkela: Analysis of Qualitative Data using Machine Learning MethodsTimo Honkela
 
Timo Honkela: Epistemological status of linguistic theories and models
Timo Honkela: Epistemological status of linguistic theories and modelsTimo Honkela: Epistemological status of linguistic theories and models
Timo Honkela: Epistemological status of linguistic theories and modelsTimo Honkela
 
Timo Honkela: Tieteen uudet valtatiet
Timo Honkela: Tieteen uudet valtatietTimo Honkela: Tieteen uudet valtatiet
Timo Honkela: Tieteen uudet valtatietTimo Honkela
 
Timo Honkela: An introduction to text mining
Timo Honkela: An introduction to text miningTimo Honkela: An introduction to text mining
Timo Honkela: An introduction to text miningTimo Honkela
 
Timo Honkela: Modeling evolution and dynamical systems
Timo Honkela: Modeling evolution and dynamical systemsTimo Honkela: Modeling evolution and dynamical systems
Timo Honkela: Modeling evolution and dynamical systemsTimo Honkela
 
Timo Honkela: Metaphors, analogies and conceptual blending
Timo Honkela: Metaphors, analogies and conceptual blendingTimo Honkela: Metaphors, analogies and conceptual blending
Timo Honkela: Metaphors, analogies and conceptual blendingTimo Honkela
 

More from Timo Honkela (20)

Timo Honkela: Meaning negotiations as phenomenon and as languages technology...
 Timo Honkela: Meaning negotiations as phenomenon and as languages technology... Timo Honkela: Meaning negotiations as phenomenon and as languages technology...
Timo Honkela: Meaning negotiations as phenomenon and as languages technology...
 
Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...
Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...
Timo Honkela: Meaning negotiations as phenomenon and as languages technology ...
 
Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...
Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...
Timo Honkela: Peace Machine: Using Artificial Intelligence to Promote Peacefu...
 
Timo Honkela: From early to later Wittgenstein and Artificial Intelligence
Timo Honkela: From early to later Wittgenstein and Artificial IntelligenceTimo Honkela: From early to later Wittgenstein and Artificial Intelligence
Timo Honkela: From early to later Wittgenstein and Artificial Intelligence
 
Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...
Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...
Timo Honkela: Peace Machine: Peace from a difference perspective - Dialogue o...
 
Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...
Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...
Timo Honkela: Kielellisten merkisten tilastollinen ja psykologinen luonne: Ko...
 
Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017
Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017
Timo Honkela, kutsuttu esitelmä Automaatiopäivillä 2017
 
Timo Honkela: Turning quantity into quality and making concepts visible using...
Timo Honkela: Turning quantity into quality and making concepts visible using...Timo Honkela: Turning quantity into quality and making concepts visible using...
Timo Honkela: Turning quantity into quality and making concepts visible using...
 
Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....
Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....
Timo Honkela: Tekoälyn ja koneoppimisen uhat ja mahdollisuudet, Turku, 27.10....
 
Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...
Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...
Timo Honkela: Kohonen's Self-Organizing Maps for Intelligent Systems Developm...
 
Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...
Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...
Timo Honkela: Kylmä data kohtaa inhimillisen tulkinnan, Studia Generalia -esi...
 
Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...
Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...
Honkela. Lagus & Kanner: Parallel Conceptual Spaces and Systems in Health and...
 
Timo Honkela: Digitalisaatio tulevaisuudessa
Timo Honkela: Digitalisaatio tulevaisuudessaTimo Honkela: Digitalisaatio tulevaisuudessa
Timo Honkela: Digitalisaatio tulevaisuudessa
 
Timo Honkela: Semantic and pragmatics representations of large text corpora
Timo Honkela: Semantic and pragmatics representations of large text corporaTimo Honkela: Semantic and pragmatics representations of large text corpora
Timo Honkela: Semantic and pragmatics representations of large text corpora
 
Timo Honkela: Analysis of Qualitative Data using Machine Learning Methods
Timo Honkela: Analysis of Qualitative Data using Machine Learning MethodsTimo Honkela: Analysis of Qualitative Data using Machine Learning Methods
Timo Honkela: Analysis of Qualitative Data using Machine Learning Methods
 
Timo Honkela: Epistemological status of linguistic theories and models
Timo Honkela: Epistemological status of linguistic theories and modelsTimo Honkela: Epistemological status of linguistic theories and models
Timo Honkela: Epistemological status of linguistic theories and models
 
Timo Honkela: Tieteen uudet valtatiet
Timo Honkela: Tieteen uudet valtatietTimo Honkela: Tieteen uudet valtatiet
Timo Honkela: Tieteen uudet valtatiet
 
Timo Honkela: An introduction to text mining
Timo Honkela: An introduction to text miningTimo Honkela: An introduction to text mining
Timo Honkela: An introduction to text mining
 
Timo Honkela: Modeling evolution and dynamical systems
Timo Honkela: Modeling evolution and dynamical systemsTimo Honkela: Modeling evolution and dynamical systems
Timo Honkela: Modeling evolution and dynamical systems
 
Timo Honkela: Metaphors, analogies and conceptual blending
Timo Honkela: Metaphors, analogies and conceptual blendingTimo Honkela: Metaphors, analogies and conceptual blending
Timo Honkela: Metaphors, analogies and conceptual blending
 

Recently uploaded

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Recently uploaded (20)

AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Analyzing Historical News Collection Quality with NLP

  • 1. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Kimmo Kettunen 1 , Timo Honkela 1,2 , Krister Lindén 2 , Pekka Kauppinen 2 , Tuula Pääkkönen 1 & Jukka Kervinen 1 Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods IFLA Pre-Conference Geneva, Switzerland, 13th of August, 2014 1 2 Presented by Timo Honkela in
  • 2. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 HELSINKI MIKKELI Department of Modern Languages Language Technology Center for Preservation and Digitisation
  • 3. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 www.fmi.fi http://oppimateriaalit.internetix.fi HonkeLA  KettuNEN KauppiNEN PääkköNEN  KerviNEN Lindén
  • 4. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Structure of the presentation ● Some background on the digitalization process ● Introducing the paper content: analysis and correction of OCR results ● Discussion on future steps: In-depth analysis of newspaper contents to promote research in humanities and social sciences
  • 5. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Historical newspaper collection ● The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001, 2005). ● This collection contains approximately 1.95 million pages in Finnish and Swedish ● According to Legal Deposit law, the National Library of Finland receives a copy of each newspaper and magazine published in Finland.
  • 6. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Digitisation of the historical newspaper collection ● In the post-processing phase, the material is processed so that it can be shared to the library sector, researchers, and the wide public. ● The scanned images are enhanced and run through background software and processes which create METS/ALTO metadata (CCS Docworks) ● The optical character recognition (OCR) is conducted at the same time in order to get the text content from the materials.
  • 7. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Two channels ● Search and exploration interface (“Digi”) – Approximate search, focusing based on time/place, indexed contents, index creation using morphological analysis, etc. – Digitalkoot: enables the public to collectively mark and collect articles (crowdsourcing) ● Corpus (FIN-CLARIN) – Mainly used by linguists – Includes keyword-in-context (n-gram) view – Morphological and syntactical analysis results
  • 8. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Search interface http://digi.kansalliskirjasto.fi
  • 9. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 FIN-CLARIN corpus
  • 10. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 OCR Challenges ● Regardless of recent development of the OCR software, there are still challenges with it, as some material is very old, with – varying paper and print quality, – varying number of columns and layout patterns, – different languages (mainly Finnish and Swedish but also French, German, etc.), and – and varying font types (fraktur and antiqua)
  • 11. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 OCR Challenges ● The amount of material is such that human efforts – even crowdsourced – can only be a partial solution ● Fully or partially automated processes are needed
  • 12. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 A very long tail of low frequency forms...
  • 13. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 zzhdysvautki Yhdyspankki v, u, p ? u, n, ll ?
  • 14. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 taioafliftiutpn tavallisuuden
  • 15. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Sources of complexity Word (lexeme) Inflections Typos Recognition errors Historical differences “Recognized” surface word
  • 16. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Inflections: Complexity of Finnish at the level of word forms Kimmo Koskenniemi (2013): Johdatus kieliteknologiaan, sen merkitykseen ja sovelluksiin (Introduction to language technology, its significance and applications) https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1
  • 17. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Typos Not a major source of problem but they do exist Basel Most likely not a stain
  • 18. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Historical differences ● All the time, new names and words are being introduced ● Even more static morphological aspects evolve over centuries
  • 19. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Net outcome ● A collection of millions of newspaper pages gives rise to a list of hundreds of millions of different word forms that have been found in the process ● A large proportion of these forms is not correct
  • 20. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Detection and correction ● Improving OCR quality – not considered here ● Improving the OCR output based on linguistic knowledge and statistical considerations – Detecting incorrect forms – Correcting the incorrect form
  • 21. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Introduction to the basic ideas ● Detection – Morphological analyzer – Special dictionaries (e.g. names) – N-grams ● Correction – Transformation rules created through a supervised learning scheme – Edit distance approach using corpus statistics – Weighted edit distance based on letter shapes – Future: context information (problem of sparsity) Please see the paper for methodological details and analysis results
  • 22. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
  • 23. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
  • 24. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Similarity diagram of Fraktur letter shapes (a self-organizing map)
  • 25. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Socio-Historical Text Mining of Newspaper Collections Research direction
  • 26. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Areas of analysis ● Named entity recognition (people, organizations, places, events) ● Time series analysis ● Social network analysis ● Topic modeling cf. Virginie Fortun's presentation
  • 27. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Areas of analysis ● Multidimensional sentiment analysis ● Analysis of social and historical context ● Intercultural and multilingual analysis ● Analysis of point of view ● Analysis of subjective understanding Stella Wisdom & Neil Smyth
  • 28. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Earlier related results
  • 29. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Learning meaning from context: Maps of words in Grimm fairy tales Honkela, Pulkki & Kohonen 1995
  • 30. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Multidimensional sentiment using the PERMA model ● Seligman and his colleagues has developed the PERMA model that addresses different aspects of wellbeing. ● The model includes five components related to subjective well-being: – Positive emotion (P), – Engagement (E), – Relationships (R), – Meaning (M) and – Achievement (A) Honkela, Korhonen, Lagus & Saarinen 2014
  • 31. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 PERMA profiles of different corpora Honkela, Korhonen, Lagus & Saarinen 2014
  • 32. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar: Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity (IJCNN 2012) Analysis of the subjective meaning: word 'health' Analysis of the State of the Union Adresses
  • 33. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Socio-Historical Text Mining of Newspaper Collections A call for interdisciplinary international collaboration Libraries, researchers within journalism, corpus linguistics, history, sociology, political science, psychology, computer science, machine learning, etc.
  • 34. Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014 Merci! Danke schön! Grazie! Multumesc! ¡Gracias! Thank you! Kiitos! Tack! 謝謝! Σας ευχαριστούμε!