DATA MINING HISTORICAL
NEWSPAPERS METADATA
Old News Teaches History
Jean-Philippe Moreux
Bibliothèque nationale de France,
Digitization dpt
IFLA News Media Section,
Hamburg, April 2016
A True Story about the Researchers’ Needs
• How can we help a historian working on Stock Market
quotes creation and development in French newspapers?
(1800-1870)
here
A True Story about the Researchers’ Needs
• Obviously, he had to query the digital library catalog.
catalog search
A True Story about the Researchers’ Needs
• Moreover, he needed a text retrieval functionality.
text retrieval
catalog search
→ The basics
ATrue Story about the Needs of Researchers
• But is it enough? Could we do better?
text retrieval
catalog search
+ Corpora builder
+ Predefined qualitative
& easy-to-use corpora
+ Advanced query
on document structure
and layout (to spot Stock
Market regions)
The True Story (cont’d): unhappy Ending
“Stock Market quotes in French Newspapers (1801-1870)”
PhD in Communication and Information Science (P.-C. Langlais)
• The creation of his corpus was very painful:
1. The historian had to write scripts against the digital library (DL) to extract OCR and metadata
from multiple newspaper titles.
2. Then he had to refine and structure his text corpora.
More than 100 Python scripts were needed!
Historians generally prefer to focus on research, not on writing scripts…
The True Story (cont’d):
Could we have helped him?
tools
OLR facilitates the
corpus creation task
→ Content Types classification,
Section identification
The quantitative dataset
is of a great help
“Tables” in newspapers are
predominantly used in Stock
Market quotes → instant use of
this metadata!
Tables per weekday (1838-1870)
© R (P.-C. Langlais)
Types of content
are marked up
How to Satisfy Scientists’ Needs?
Let’s try to address this question using the corpus of heritage dailies
enriched during the Europeana Newspapers project:
• Feed the DL with enriched digital documents?
• Give end-users access to quantitative metadata describing
document structure and layout?
• Give end-users an ad hoc corpus-building function?
Plan
1. The Europeana Newspapers test bed
2. Building a quantitative metadata dataset
3. Data mining and data visualization use-cases
Enriching Digital Documents
Europeana Newspapers
project (2012-2015): 11.5M
OCR’ed pages, 2M OLR’ed
pages from 14 European
libraries
What is OLR?
• Identification of structural
elements, including
separation of articles
and sections.
• Classification of types of
content (ads, offers,
obituaries…)
• The Europeana Newspapers project has enriched and aggregated
millions of heritage newspaper pages with advanced refinement
techniques like Optical Layout Recognition and Named Entity
Recognition.
UIBK
Document Analysis Techniques like OLR
Produce Quantitative Metadata
The good news is that OCR and OLR files are full of interesting
objects tagged in the XML:
• OCR (ALTO) is a source for quantitative metadata: number of words,
illustrations & tables, paper format…
• OLR (METS) is also a valuable source for high-level informational objects:
• number of articles, titles, etc.
• identification of sections (groups of articles)
• content types classification (ads, judicial review, stock market…)
Huge amount of valuable data
for historians!
• We have to count the number of objects in each page of the
collection. Straightforward with XSLT, Java, Python, Perl, etc.
• We have to package and deliver these datasets to end-users.
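The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the BnF dataset: the sample ALTO fragment is invented, and treating a ComposedBlock with TYPE="table" as a table is an assumption about how the OCR was tagged.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented ALTO fragment, for illustration only.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Bourse"/>
            <SP/>
            <String CONTENT="de"/>
            <SP/>
            <String CONTENT="Paris"/>
          </TextLine>
        </TextBlock>
        <ComposedBlock ID="CB1" TYPE="table">
          <TextBlock ID="TB2">
            <TextLine><String CONTENT="Rente"/></TextLine>
          </TextBlock>
        </ComposedBlock>
        <Illustration ID="IL1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the count works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def count_page_objects(alto_xml):
    """Count words, text blocks, illustrations and tables in one ALTO page."""
    root = ET.fromstring(alto_xml)
    counts = Counter()
    for elem in root.iter():
        name = local_name(elem.tag)
        if name == "String":
            counts["words"] += 1
        elif name == "TextBlock":
            counts["text_blocks"] += 1
        elif name == "Illustration":
            counts["illustrations"] += 1
        # Assumption: tables are tagged as ComposedBlock TYPE="table".
        elif name == "ComposedBlock" and elem.get("TYPE", "").lower() == "table":
            counts["tables"] += 1
    return dict(counts)


page_counts = count_page_objects(SAMPLE_ALTO)
print(page_counts)
```

Running the same function over every page file of a title yields one row of counts per page, which is the raw material of the derived dataset.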
How to Build such Datasets?
Europeana Newspapers
project / BnF: 880,000
OLR’ed pages from BnF
newspapers collection,
6 titles, 1814-1944
Pros:
• Give users lightweight derived datasets, not terabytes of XML files!
• It’s not rocket science.
• It’s fast (2-3 h/title with an optimized NoXML parsing script)
No Cons!
Who are the End-Users of the BnF Dataset?
• The EN-BnF dataset includes 5.5M values (150K issues, 880K pages)
• 7 metadata fields at issue level, 5 at page level
• XML, JSON or CSV formats
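Packaging per-page counts as CSV or JSON needs only the Python standard library. The sketch below uses hypothetical field names and invented values; the actual column names of the EN-BnF dataset may differ.

```python
import csv
import io
import json

# Hypothetical page-level records; field names and values are illustrative.
pages = [
    {"issue_date": "1915-03-01", "page": 1, "words": 2150, "illustrations": 0, "tables": 0},
    {"issue_date": "1915-03-01", "page": 2, "words": 3400, "illustrations": 1, "tables": 2},
    {"issue_date": "1915-03-02", "page": 1, "words": 1980, "illustrations": 0, "tables": 0},
]


def to_csv(records):
    """Serialize a list of flat dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize the same records as a JSON array."""
    return json.dumps(records, indent=2)


csv_text = to_csv(pages)
json_text = to_json(pages)
print(csv_text)
```

The CSV form opens directly in a spreadsheet, while the JSON form feeds charting libraries without conversion.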
Researchers (Digital
Humanities, History of
Press, Information
Science)
Digital Curators &
Mediators: insights
on the collections
Digitization Program
Managers: statistics on
digitized content
tools
Data visualization allows researchers to discover
meaning and information hidden in large volumes of data
• History of press/illustration:
Dataviz demonstrates the
growing importance of
illustration (blue: front page,
red: inside pages).
• History of press/activity:
Dataviz of types of content shows the impact
of the Great War on economic
activity and indicates how long the return
to pre-war activity levels took (roughly 10 years).
Discovering Knowledge through Visualization
tools
© Highcharts
Discovering Knowledge through Visualization
• History of press/page format: Digital archeology of papermaking
and printing.
• History of press/layout: Visualization of the article density per page reveals
the shift from 17th-century “gazettes” to the modern daily.
tools
© Highcharts
Other Users Might Be Interested in this
Metadata: Digitization Department
Statistical information on digitized content
for project managers.
• OCR Crowdsourcing project: What is the average
word density of these documents? How much text
correction effort
will be required?
• Image bank: What titles contain
illustrations? What is the total number
of images one can expect?
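Both questions reduce to simple aggregates over the page-level dataset. A minimal sketch, assuming rows of (title, words on page, illustrations on page); the titles and numbers below are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (title, words on page, illustrations on page).
rows = [
    ("Title A", 2000, 0),
    ("Title A", 1000, 2),
    ("Title B", 4000, 0),
    ("Title B", 5000, 0),
]


def summarize(rows):
    """Return mean words per page and total illustrations for each title."""
    words = defaultdict(list)
    images = defaultdict(int)
    for title, word_count, illustration_count in rows:
        words[title].append(word_count)
        images[title] += illustration_count
    return {
        title: {
            "mean_words_per_page": mean(words[title]),
            "illustrations": images[title],
        }
        for title in words
    }


summary = summarize(rows)
illustrated_titles = [t for t, s in summary.items() if s["illustrations"] > 0]
print(summary)
print(illustrated_titles)
```

The mean word density gives a first estimate of the correction workload, and the per-title illustration totals show which titles are worth harvesting for an image bank.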
tools
© Highcharts
Data visualization facilitates rediscovery and
reappropriation of heritage documents (by
the general public)
• Data visualization of illustration density can reveal trends or outliers,
like highly illustrated issues (illustrated supplements) or the first published
illustration in a title.
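The same facts can be pulled out of the data without a chart. The sketch below is one possible heuristic, not the method behind the slides: an issue counts as an outlier when its illustration count exceeds an arbitrary multiple of the median, and the sample counts are invented.

```python
from statistics import median

# Hypothetical (issue date, illustration count) pairs for one title.
issues = [
    ("1900-01-01", 0),
    ("1900-01-02", 0),
    ("1900-01-03", 1),
    ("1900-01-04", 0),
    ("1900-01-05", 12),
    ("1900-01-06", 1),
    ("1900-01-07", 0),
]


def first_illustrated_issue(issues):
    """Return the earliest issue date with at least one illustration."""
    for issue_date, count in sorted(issues):
        if count > 0:
            return issue_date
    return None


def outlier_issues(issues, factor=5.0, floor=1.0):
    """Flag issues whose count exceeds factor * max(median, floor).

    Both factor and floor are arbitrary assumptions to tune per title.
    """
    threshold = factor * max(median(count for _, count in issues), floor)
    return [issue_date for issue_date, count in issues if count > threshold]


first_issue = first_illustrated_issue(issues)
outliers = outlier_issues(issues)
print(first_issue, outliers)
```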
Engaging new Audiences with Dataviz
tools
© www.RetroNews.fr
Facts extracted thanks
to dataviz can then
enrich other digital
artefacts like timelines.
© https://timeline.knightlab.com
Engaging new Audiences with Dataviz
• An interactive chart of word density reveals breaks
due to changes in layout & paper format, outlier issues…
tools
Journal des débats politiques et littéraires, 1814-1944, 45,334 issues displayed
Go beyond keyword
spotting and page flip!
Some users would
like to play with those
charts!
Querying the Dataset
Those datasets can be queried with dedicated tools
(statistical environments, NoSQL or XML databases...)
• Image search solution used by the Gallica mediation service:
an XQuery HTTP API identifies “graphical” pages, that is to say pages
that are both poor in words and contain illustrations.
tools
http://localhost:8984/rest?run=findIllustratedPages.xq&toDate=1920-01-01&toPage=1
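The selection rule behind such a query is easy to state in Python. This sketch does not reproduce findIllustratedPages.xq: the 500-word threshold is an arbitrary assumption, and the to_date and to_page arguments only loosely echo the URL parameters, whose exact semantics are not given here.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    issue_date: str  # ISO date, so string comparison follows date order
    page: int
    words: int
    illustrations: int


def find_illustrated_pages(pages, max_words=500, to_date=None, to_page=None):
    """Keep pages that have few words and at least one illustration.

    Assumptions: to_date is an exclusive upper bound on the issue date and
    to_page is the highest page number considered.
    """
    selected = []
    for p in pages:
        if to_date is not None and p.issue_date >= to_date:
            continue
        if to_page is not None and p.page > to_page:
            continue
        if p.illustrations > 0 and p.words <= max_words:
            selected.append(p)
    return selected


# Invented sample pages.
pages = [
    PageStats("1910-05-01", 1, 300, 2),   # graphical front page
    PageStats("1910-05-01", 2, 3000, 0),  # text-only inside page
    PageStats("1925-05-01", 1, 200, 1),   # after the date limit
    PageStats("1915-02-03", 1, 2500, 1),  # illustrated but text-heavy
]

hits = find_illustrated_pages(pages, to_date="1920-01-01", to_page=1)
print(hits)
```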
"As a digital mediator,
searching for illustrations
in our 12M-page collection
is a nightmare…"
Querying the Dataset
• Looking for WW1 censored front pages with BaseX: XQuery queries
can be written to dig into the data and find specific types of content, e.g.
the front pages censored during the Great War, which have a slightly
lower word count than the average front page.
tools
Is it effective?
• Recall rate: 45%
• Precision rate: 68%
(Based on a ground truth established on the
Journal des Débats front pages for 1915)
→ Limits of a statistical approach when
applied to a word-based metric biased by
layout singularities. Good enough for
mediation: Gallica blog post
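Recall and precision for such a detector can be computed as follows. The sketch is a stand-in for the BaseX query, not a copy of it: the rule (flag a front page when its word count falls below 90% of the average) and the sample counts are assumptions chosen to produce one false positive and one miss.

```python
from statistics import mean


def flag_low_word_pages(word_counts, ratio=0.9):
    """Flag dates whose front-page word count is below ratio * average."""
    average = mean(word_counts.values())
    return {d for d, words in word_counts.items() if words < ratio * average}


def precision_recall(flagged, ground_truth):
    """Precision = TP / flagged pages, recall = TP / truly censored pages."""
    true_positives = len(flagged & ground_truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Invented front-page word counts and hand-made ground truth.
front_pages = {
    "1915-01-01": 5000,
    "1915-01-02": 4000,  # censored, detected
    "1915-01-03": 5200,
    "1915-01-04": 4100,  # not censored, but short: false positive
    "1915-01-05": 5100,
    "1915-01-06": 4900,  # censored, but close to average: missed
}
censored = {"1915-01-02", "1915-01-06"}

flagged = flag_low_word_pages(front_pages)
precision, recall = precision_recall(flagged, censored)
print(sorted(flagged), precision, recall)
```

A page that is short for layout reasons is indistinguishable from a censored one under this rule, which is the bias the slide points to.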
Are my Data Representative?
The quality of datasets affects the validity of the analysis and
interpretation. Data that are irregular in nature or discontinuous in time may
introduce bias. A qualitative assessment should be conducted.
Data visualization can contribute to quality control (and to informing end-users)
• A calendar display of a title’s data shows
that missing issues are rare, which suggests
that the digital collection is
representative.
• Stock Market quotes study based on the
content tagged “table”: the hypothesis can be
validated empirically by the sudden
inflections recorded in 1914 and 1939 for
all titles, since the virtual halt of trading
during the two World Wars is an established
historical fact.
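The calendar check can also be run as a script. A minimal sketch, assuming the title was published every single day (many dailies skipped some days, so real expectations would need a publication calendar); the dates are invented.

```python
from datetime import date, timedelta


def missing_issue_dates(issue_dates, start, end):
    """List the days between start and end (inclusive) with no issue."""
    present = set(issue_dates)
    missing = []
    day = start
    while day <= end:
        if day not in present:
            missing.append(day)
        day += timedelta(days=1)
    return missing


def coverage(issue_dates, start, end):
    """Share of expected days for which an issue is present."""
    expected = (end - start).days + 1
    return 1 - len(missing_issue_dates(issue_dates, start, end)) / expected


# Invented week of issues with one gap.
start, end = date(1914, 8, 1), date(1914, 8, 7)
digitized = [date(1914, 8, d) for d in (1, 2, 3, 5, 6, 7)]

gaps = missing_issue_dates(digitized, start, end)
ratio = coverage(digitized, start, end)
print(gaps, round(ratio, 3))
```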
© Google Charts API
© Highcharts
Perspectives
• Apply the same data mining process to the other Europeana
Newspapers OLR’ed datasets to produce more datasets.
Apply it to the ongoing BnF newspaper digitization program.
• Automatically build the quantitative metadata datasets.
• Experiment on other types of materials with a temporal dimension
(e.g. long-running magazines or journals, early printed books).
• @BnF: Assess the opportunity of setting up a data mining framework
targeting DH researchers (“Corpus” BnF research project, 2016-2018):
Corpora builder? API? OCR dumps? Derived datasets? Remote
processing?...
Conclusion
• Quantitative metadata are relevant for all DLs’ users: scientists,
general public, institutions’ employees.
• OLR enrichment provides a rich source of information for researchers.
Such data, possibly combined with the OCRed text, usually provide a
fertile ground for research hypotheses.
• Only basic data mining & dataviz methods and tools are needed to
use such datasets:
• Basic scripting: XSL, Python, Perl, JavaScript…
• Statistical applications: Excel, OpenOffice, R…
• Ready to use charts & timelines API: Highcharts,
Google Charts, timeline.knightlab.com, Sigmajs.org…
• Easy to use NoSQL or XML databases: BaseX, MongoDB…
Conclusion
• Quantitative metadata is sometimes enough to satisfy users. Example
of a “pure” quantitative metadata Digital Humanities project →
“The Comédie-Française Registers”
project: From 1680 until 1791, only one
theater troupe in Paris was allowed to
perform the plays of Molière, Corneille,
Racine, Voltaire, Beaumarchais, etc.
This troupe played the works of these
authors over 34,000 times and kept
detailed records of their box office
receipts for every single one of those
performances.
(Partners: Paris-Sorbonne,
Harvard University, MIT)
© http://cfregisters.org/fr (Chart: Frédéric Glorieux)
Final Thought: Advanced Search in Newspapers?
• Feeding the search engine with layout and structural metadata will
allow users to perform advanced mixed queries:
text retrieval
catalog search layout MD
structural MD
? illustrated articles
in Trial review section
from 1914 to 1916
where title contains
“caillaux” or “calmette”
? articles with table
in Le Matin
where title contains
“metal prices”
and body contains “gold”
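The two example queries combine catalog, structural, layout and text criteria. A hedged sketch of what such mixed queries look like over article records follows; the Article fields and the sample records are invented for illustration and do not describe any existing search engine.

```python
from dataclasses import dataclass


@dataclass
class Article:
    newspaper: str          # catalog metadata
    date: str               # catalog metadata, ISO format
    section: str            # structural metadata
    title: str              # text
    body: str               # text
    has_table: bool         # layout metadata
    has_illustration: bool  # layout metadata


def contains_any(text, terms):
    """Case-insensitive test that text contains at least one term."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


# Invented records.
articles = [
    Article("Le Matin", "1914-07-21", "Trial review",
            "The Caillaux trial opens", "Report from the court", False, True),
    Article("Le Matin", "1915-02-10", "Stock market",
            "Metal prices", "Gold and silver quotations", True, False),
    Article("Le Matin", "1920-01-05", "Trial review",
            "Calmette case revisited", "A look back", False, True),
    Article("Le Gaulois", "1915-02-10", "Stock market",
            "Metal prices", "Gold rises", True, False),
]

# Illustrated articles in the Trial review section, 1914 to 1916,
# whose title contains "caillaux" or "calmette".
trial_hits = [
    a for a in articles
    if a.has_illustration
    and a.section == "Trial review"
    and "1914" <= a.date[:4] <= "1916"
    and contains_any(a.title, ["caillaux", "calmette"])
]

# Articles with a table in Le Matin whose title contains "metal prices"
# and whose body contains "gold".
metal_hits = [
    a for a in articles
    if a.has_table
    and a.newspaper == "Le Matin"
    and contains_any(a.title, ["metal prices"])
    and contains_any(a.body, ["gold"])
]

print(len(trial_hits), len(metal_hits))
```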
Trove Advanced Search
http://trove.nla.gov.au
Does it Work for Books too?
• Books’ OCR also contains meaningful layout information:
tables, maps, ornaments, drop caps…
text retrieval
catalog search layout MD
structural MD
? pages illustrated with a map
in 19th-century books
where text contains “Mars”
? illustrated pages
in 19th-century books
where text contains “Mars”
maps
photos,
drawings,
diagrams…
Final Thought: Advanced Search in Newspapers?
• Adding a pinch of semantic flavor to get closer to natural language query:
text retrieval
catalog search layout MD
structural MD
I’m looking for illustrated articles on front page in Trial topic
from 1914 to 1916 which contain NE.person “Henriette Caillaux”
or “Gaston Calmette”
Named Entities
Recognition
Topic Modelling
Historical Events
Recognition
Themes
Classification
Final Thought: Advanced Search in Newspapers?
text retrieval
catalog search layout MD
structural MD
http://www.retronews.fr/
RetroNews Advanced Search
http://www.retronews.fr
Faceted
search: dates,
NE, themes,
events, topics…
Thank you for your attention!
• The dataset (CSV, XML, JSON) and charts are publicly available. Just
play with them! (no language barrier: not a single word of French inside)
http://altomator.github.io/EN-data_mining
Thanks to all
the EN partners!

 
Gregory Harris - Cycle 2 - Civics Presentation
Gregory Harris - Cycle 2 - Civics PresentationGregory Harris - Cycle 2 - Civics Presentation
Gregory Harris - Cycle 2 - Civics Presentation
gharris9
 
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
SkillCertProExams
 
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPointMẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
1990 Media
 
XP 2024 presentation: A New Look to Leadership
XP 2024 presentation: A New Look to LeadershipXP 2024 presentation: A New Look to Leadership
XP 2024 presentation: A New Look to Leadership
samililja
 
Carrer goals.pptx and their importance in real life
Carrer goals.pptx  and their importance in real lifeCarrer goals.pptx  and their importance in real life
Carrer goals.pptx and their importance in real life
artemacademy2
 
Collapsing Narratives: Exploring Non-Linearity • a micro report by Rosie Wells
Collapsing Narratives: Exploring Non-Linearity • a micro report by Rosie WellsCollapsing Narratives: Exploring Non-Linearity • a micro report by Rosie Wells
Collapsing Narratives: Exploring Non-Linearity • a micro report by Rosie Wells
Rosie Wells
 
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdfSupercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Access Innovations, Inc.
 
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
Suzanne Lagerweij
 
Gregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptxGregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptx
gharris9
 
ASONAM2023_presection_slide_track-recommendation.pdf
ASONAM2023_presection_slide_track-recommendation.pdfASONAM2023_presection_slide_track-recommendation.pdf
ASONAM2023_presection_slide_track-recommendation.pdf
ToshihiroIto4
 
Tom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issueTom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issue
amekonnen
 
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Dutch Power
 
Updated diagnosis. Cause and treatment of hypothyroidism
Updated diagnosis. Cause and treatment of hypothyroidismUpdated diagnosis. Cause and treatment of hypothyroidism
Updated diagnosis. Cause and treatment of hypothyroidism
Faculty of Medicine And Health Sciences
 

Recently uploaded (19)

Media as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern EraMedia as a Mind Controlling Strategy In Old and Modern Era
Media as a Mind Controlling Strategy In Old and Modern Era
 
Competition and Regulation in Professions and Occupations – OECD – June 2024 ...
Competition and Regulation in Professions and Occupations – OECD – June 2024 ...Competition and Regulation in Professions and Occupations – OECD – June 2024 ...
Competition and Regulation in Professions and Occupations – OECD – June 2024 ...
 
Burning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdfBurning Issue Presentation By Kenmaryon.pdf
Burning Issue Presentation By Kenmaryon.pdf
 
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
Presentatie 8. Joost van der Linde & Daniel Anderton - Eliq 28 mei 2024
 
2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf
 
Competition and Regulation in Professions and Occupations – ROBSON – June 202...
Competition and Regulation in Professions and Occupations – ROBSON – June 202...Competition and Regulation in Professions and Occupations – ROBSON – June 202...
Competition and Regulation in Professions and Occupations – ROBSON – June 202...
 
Gregory Harris - Cycle 2 - Civics Presentation
Gregory Harris - Cycle 2 - Civics PresentationGregory Harris - Cycle 2 - Civics Presentation
Gregory Harris - Cycle 2 - Civics Presentation
 
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
Mastering the Concepts Tested in the Databricks Certified Data Engineer Assoc...
 
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPointMẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
Mẫu PPT kế hoạch làm việc sáng tạo cho nửa cuối năm PowerPoint
 
XP 2024 presentation: A New Look to Leadership
XP 2024 presentation: A New Look to LeadershipXP 2024 presentation: A New Look to Leadership
XP 2024 presentation: A New Look to Leadership
 
Carrer goals.pptx and their importance in real life
Carrer goals.pptx  and their importance in real lifeCarrer goals.pptx  and their importance in real life
Carrer goals.pptx and their importance in real life
 
Collapsing Narratives: Exploring Non-Linearity • a micro report by Rosie Wells
Collapsing Narratives: Exploring Non-Linearity • a micro report by Rosie WellsCollapsing Narratives: Exploring Non-Linearity • a micro report by Rosie Wells
Collapsing Narratives: Exploring Non-Linearity • a micro report by Rosie Wells
 
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdfSupercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
Supercharge your AI - SSP Industry Breakout Session 2024-v2_1.pdf
 
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
Suzanne Lagerweij - Influence Without Power - Why Empathy is Your Best Friend...
 
Gregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptxGregory Harris' Civics Presentation.pptx
Gregory Harris' Civics Presentation.pptx
 
ASONAM2023_presection_slide_track-recommendation.pdf
ASONAM2023_presection_slide_track-recommendation.pdfASONAM2023_presection_slide_track-recommendation.pdf
ASONAM2023_presection_slide_track-recommendation.pdf
 
Tom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issueTom tresser burning issue.pptx My Burning issue
Tom tresser burning issue.pptx My Burning issue
 
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
Presentatie 4. Jochen Cremer - TU Delft 28 mei 2024
 
Updated diagnosis. Cause and treatment of hypothyroidism
Updated diagnosis. Cause and treatment of hypothyroidismUpdated diagnosis. Cause and treatment of hypothyroidism
Updated diagnosis. Cause and treatment of hypothyroidism
 

Data Mining Newspapers Metadata

  • 1. DATA MINING HISTORICAL NEWSPAPERS METADATA: Old News Teaches History. Jean-Philippe Moreux, Bibliothèque nationale de France, Digitization dpt. IFLA News Media Section, Hamburg, April 2016
  • 2. A True Story about the Researchers’ Needs • How can we help a historian working on the creation and development of Stock Market quotes in French newspapers (1800-1870)?
  • 3. A True Story about the Researchers’ Needs • Obviously, he had to query the digital library catalog. (Diagram: catalog search)
  • 4. A True Story about the Researchers’ Needs • Moreover, he needed text retrieval functionality. (Diagram: text retrieval + catalog search → the basics)
  • 5. A True Story about the Needs of Researchers • But is it enough? Could we do better? (Diagram: text retrieval, catalog search, + corpora builder, + predefined qualitative & easy-to-use corpora, + advanced query on document structure and layout, to spot Stock Market regions)
  • 6. The True Story (cont’d): Unhappy Ending • “Stock Market quotes in French Newspapers (1801-1870)”, PhD in Communication and Information Science (P.-C. Langlais) • The creation of his corpus was very painful: 1. The historian had to script the digital library (DL) to extract OCR and metadata from multiple newspaper titles. 2. Then he had to refine/structure his text corpora. More than 100 Python scripts were needed! Historians generally prefer to focus on research, not on writing scripts…
  • 7. The True Story (cont’d): Could we have helped him? • OLR facilitates the corpus creation task → content types classification, section identification (types of content are marked up). • The quantitative dataset is of great help: “tables” in newspapers are predominantly used for Stock Market quotes → instant use of this metadata! (Chart: tables per weekday, 1838-1870, © R (P-C Langlais))
  • 8. How to Satisfy Scientists’ Needs? Let’s try to address this question with regard to the heritage daily corpus enriched during the Europeana Newspapers project: • Feed the DL with enriched digital documents? • Give end-users access to quantitative metadata describing document structure and layout? • Give end-users an ad hoc corpora builder functionality? Plan: 1. The Europeana Newspapers test bed 2. Building a quantitative metadata dataset 3. Data mining and data visualization use-cases
  • 9. Enriching Digital Documents • Europeana Newspapers project (2012-2015): 11.5M OCR’ed pages, 2M OLR’ed pages from 14 European libraries. • The Europeana Newspapers project has enriched and aggregated millions of heritage newspaper pages with advanced refinement techniques like Optical Layout Recognition (OLR) and Named Entities Recognition. What is OLR? • Identification of structural elements, including separation of articles and sections. • Classification of types of content (ads, offers, obituaries…) UIBK
  • 10. Document Analysis Techniques like OLR Produce Quantitative Metadata The good news is that OCR and OLR files are full of interesting objects tagged in the XML: • OCR (ALTO) is a source of quantitative metadata: number of words, illustrations & tables, paper format… • OLR (METS) is also a valuable source of high-level informational objects: • number of articles, titles, etc. • identification of sections (groups of articles) • content types classification (ads, judicial review, stock market…) A huge amount of valuable data for historians!
  • 11. How to Build such Datasets? • We have to count the number of objects in each page of the collection. This is straightforward with XSLT, Java, Python, Perl, etc. • We have to package and deliver these datasets to end-users. Europeana Newspapers project / BnF: 880,000 OLR’ed pages from the BnF newspapers collection, 6 titles, 1814-1944. Pros: • Users get lightweight derived datasets, not TB of XML files! • It’s not rocket science. • It’s fast (2-3 h/title with an optimized NoXML parsing script). No cons!
  • 12. Who are the End-Users of the BnF Dataset? • The EN-BnF dataset includes 5.5M values (150K issues, 880K pages) • 7 metadata fields at issue level, 5 at page level • XML, JSON or CSV formats End-users: • Researchers (Digital Humanities, History of Press, Information Science) • Digital Curators & Mediators: insights on the collections • Digitization Program Managers: statistics on digitized content
  • 13. Discovering Knowledge through Visualization • Data visualization allows researchers to discover meaning and information hidden in large volumes of data. • History of press/illustration: dataviz demonstrates the growing importance of illustration (blue: front page, red: inside pages). • History of press/activity: dataviz of types of content shows the impact of the Great War on economic activity and the time needed to return to the pre-war level of activity (roughly 10 years). (Charts © Highcharts)
  • 14. Discovering Knowledge through Visualization • History of press/page format: digital archeology of papermaking and printing. • History of press/layout: visualization of the density of articles per page reveals the shift from XVIIth-century “gazettes” to the modern daily. (Charts © Highcharts)
  • 15. Other Users might be Interested in this Metadata: Digitization dpt Statistical information on digitized content for project managers. • OCR crowdsourcing project: What is the average word density of these documents? What text correction effort will be required? • Image bank: Which titles contain illustrations? What is the total number of images one can expect? (Charts © Highcharts)
  • 16. Engaging new Audiences with Dataviz • Data visualization facilitates rediscovery and reappropriation of heritage documents (by the general public). • Data visualization of illustration density can reveal trends or outliers, like highly illustrated issues (illustrated supplements) or the first published illustration in a title. © www.RetroNews.fr • Facts extracted thanks to dataviz can then enrich other digital artefacts like timelines. © https://timeline.knightlab.com
  • 17. Engaging new Audiences with Dataviz • An interactive chart of the word density reveals breaks due to changes in layout & paper format, outlier issues… (Chart: Journal des débats politiques et littéraires, 1814-1944, 45,334 issues displayed) • Go beyond keyword spotting and page flip! Some users would like to play with those charts!
  • 18. Requesting the Dataset • These datasets can be queried with dedicated tools (statistical environments, NoSQL or XML databases...). • Image search solution used by the Gallica Mediation Service: an XQuery HTTP API identifies “graphical” pages, that is to say pages that are both poor in words and include illustrations. Example: http://localhost:8984/rest?run=findIllustratedPages.xq&toDate=1920-01-01&toPage=1 "As a digital mediator, seeking for illustrations in our 12M p. collection is a nightmare…"
  • 19. Requesting the Dataset • Looking for WW1 censored front pages with BaseX: XQueries can be written to dig into the data and find specific types of content, e.g. the front pages censored during the Great War, which have a slightly smaller word count than the average front page. Is it effective? • Recall rate: 45% • Precision rate: 68% (based on a ground truth established on the Journal des Débats front pages for 1915) → Limits of a statistical approach when applied to a word-based metric biased by layout singularities. Good enough for mediation: Gallica blog post
  • 20. Are my Data Representative? The quality of datasets affects the validity of the analysis and interpretation. Data that is irregular in nature or discontinuous in time may introduce bias, so a qualitative assessment should be conducted. Data visualization can contribute to quality control (and to informing end-users): • A calendar display of a title’s data shows that missing issues are rare, which suggests that the digital collection is representative. • Stock Market quotes study based on the content tagged “table”: this hypothesis can be validated empirically by the sudden inflections recorded in 1914 and 1939 for all titles, since the virtual halt of trading during the two World Wars is an established historical fact. (Charts © Google Charts API, © Highcharts)
  • 21. Perspectives • Apply the same data mining process to the other Europeana Newspapers OLR’ed datasets to produce more datasets, and to the ongoing BnF newspapers digitization program. • Automatically build the quantitative metadata datasets. • Experiment on other types of materials with a temporal dimension (e.g. long-running magazines or journals, early printed books). • @BnF: Assess the opportunity of setting up a data mining framework targeting DH researchers (“Corpus” BnF research project, 2016-2018): Corpora builder? API? OCR dumps? Derived datasets? Remote processing?...
  • 22. Conclusion • Quantitative metadata is relevant for all DL users: scientists, the general public, institutions’ employees. • OLR enrichment provides a rich source of information for researchers. Such data, possibly combined with the OCR’ed text, usually provides fertile ground for research hypotheses. • Only basic data mining & dataviz methods and tools are needed to use such datasets: • Basic scripting: XSL, Python, Perl, JavaScript… • Statistical applications: Excel, OpenOffice, R… • Ready-to-use chart & timeline APIs: Highcharts, Google Charts, timeline.knightlab.com, Sigmajs.org… • Easy-to-use NoSQL or XML databases: BaseX, MongoDB…
  • 23. Conclusion • Quantitative metadata is sometimes enough to satisfy users. Example of a “pure” quantitative metadata Digital Humanities project → “The Comédie-Française Registers” project: From 1680 until 1791, only one theater troupe in Paris was allowed to perform the plays of Molière, Corneille, Racine, Voltaire, Beaumarchais, etc. This troupe played the works of these authors over 34,000 times and kept detailed records of their box office receipts for every single one of those performances. (Partners: Paris-Sorbonne, Harvard University, MIT) © http://cfregisters.org/fr (Chart: Frédéric Glorieux)
  • 24. Final Thought: Advanced Search in Newspapers? • Feeding the search engine with layout and structural metadata will allow users to perform advanced mixed queries (text retrieval + catalog search + layout MD + structural MD): ? illustrated articles in Trial review section from 1914 to 1916 where title contains “caillaux” or “calmette” ? articles with table in Le Matin where title contains “metal prices” and body contains “gold”
  • 25. Final Thought: Advanced Search in Newspapers? • Feeding the search engine with layout and structural metadata will allow users to perform advanced mixed queries (text retrieval + catalog search + layout MD + structural MD): ? illustrated articles in Judicial review section from 1914 to 1916 where title contains “caillaux” or “calmette” ? articles with table in Le Matin where title contains “metal prices” and body contains “gold” (Screenshot: Trove Advanced Search, http://trove.nla.gov.au)
  • 26. Is it Working for Books too? • Books’ OCR also contains meaningful layout information: tables, maps, ornaments, drop caps… (text retrieval + catalog search + layout MD + structural MD): ? pages illustrated with a map in XIXth-century books where text contains “Mars” (maps) ? illustrated pages in XIXth-century books where text contains “Mars” (photos, drawings, diagrams…)
  • 27. Final Thought: Advanced Search in Newspapers? • Adding a pinch of semantic flavor to get closer to natural language query (text retrieval + catalog search + layout MD + structural MD, plus Named Entities Recognition, Topic Modelling, Historical Events Recognition, Themes Classification): I’m looking for illustrated articles on front page in Trial topic from 1914 to 1916 which contain NE.person “Henriette Caillaux” or “Gaston Calmette”
  • 28. Final Thought: Advanced Search in Newspapers? • Adding a slice of semantic flavor to get closer to natural language query (text retrieval + catalog search + layout MD + structural MD, plus Named Entities Recognition, Topic Modelling, Historical Events Recognition, Themes Classification): I’m looking for illustrated articles on front page in Trial topic from 1914 to 1916 which contain NE.person “Henriette Caillaux” or “Gaston Calmette” (Screenshot: RetroNews Advanced Search, http://www.retronews.fr. Faceted search: dates, NE, themes, events, topics…)
  • 29. Thank you for your attention! • Dataset (CSV, XML, JSON) and charts are publicly available. Just play with it! (no language barrier: not a single word of French inside) http://altomator.github.io/EN-data_mining Thanks to all the EN partners!
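
The counting step described in slide 11 and the “graphical pages” test from slide 18 can be illustrated with a minimal Python sketch. It is not the BnF's actual script: the sample page is a hypothetical, heavily simplified ALTO file, marking tables as ComposedBlock with TYPE="table" is an assumption about how the OCR provider tags them, and the function names and the 50-word threshold are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ALTO page (real files are far larger).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="TB1"><TextLine>
      <String CONTENT="Bourse"/><String CONTENT="de"/><String CONTENT="Paris"/>
    </TextLine></TextBlock>
    <Illustration ID="IL1"/>
    <ComposedBlock ID="CB1" TYPE="table"/>
  </PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_page_objects(alto_xml):
    """Count words, illustrations and tables in one ALTO page."""
    counts = {"words": 0, "illustrations": 0, "tables": 0}
    for elem in ET.fromstring(alto_xml).iter():
        name = local_name(elem.tag)
        if name == "String":
            counts["words"] += 1
        elif name == "Illustration":
            counts["illustrations"] += 1
        # Assumption: tables are tagged as ComposedBlock TYPE="table".
        elif name == "ComposedBlock" and elem.get("TYPE", "").lower() == "table":
            counts["tables"] += 1
    return counts


def is_graphical(page, max_words=50):
    """A page is 'graphical' if it is poor in words and has illustrations.

    The max_words threshold is an arbitrary value chosen for this sketch.
    """
    return page["illustrations"] > 0 and page["words"] < max_words


page = count_page_objects(SAMPLE_ALTO)
print(page, is_graphical(page))
```

Running count_page_objects over every page of a title and writing the resulting dictionaries to CSV or JSON would give a lightweight derived dataset of the kind the slides describe; a filter like is_graphical can then be applied to that dataset without reopening the XML files.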