Towards the Extraction of Statistical Information
from Digitised Numerical Tables
The Medical Officer of Health Reports Scoping Study
Christian Clausner, Apostolos Antonacopoulos,
Christy Henshaw, Justin Hayes
University of Salford
Wellcome Collection
25/09/2019, DATeCH 2019, Brussels
The Medical Officer of Health Reports
• Wellcome Collection holds UK’s
largest collection of Medical
Officer of Health reports
• 130 years
• Over 70,000 reports
• All digitised and OCRed
https://wellcomelibrary.org/moh/
The Medical Officer of Health Reports
• Narrative textual content + tabular content
• Topics:
• Birth and death statistics
• Notifiable diseases
• General population statistics
• Causes of death
• School health
• Food inspections
• …
The Medical Officer of Health Reports
• OCRed and post-corrected data
available for Greater London
• Individual tables provided in
special format
• Statistical data difficult to
extract
Current Practices
• Standard OCR not sufficient for
extraction of numerical data
• Need accuracy for values AND
context (column / row)
• Common:
• Only indexing and providing access
to images with tables
• Manual correction and provision of
tables in dedicated formats
• Rare / very difficult or expensive:
• Full extraction and integration to
provide faceted searches / data
analysis etc.
[Figure: 1961 Census of England and Wales]
The MOH Scoping Study (2018)
• Gain understanding of tabular
data available in the reports
• Investigate ways of data
extraction
• Scope out users’ needs and
expectations
• Based on Greater London data
Identification of table topics
• Text-based analysis of table
captions and headers
• Grouping instances by text
similarity
• Using a tool that was created for
social media analysis
Topic                                        Table count (approx.)
Mortality / Cause of Death                   2530
General statistics / demographics            1900
Infectious Diseases / Notifiable Diseases    1720
Inspections / conditions                     4360
Minor ailments, dental, etc.                  710
Financial                                     470
Food                                          330
Births                                        240
Meteorological                                100
Legal                                         190
Immunisation                                   60
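The grouping of captions by text similarity can be sketched as follows. This is only an illustration, not the tool used in the study (which the slides do not name): it assumes captions are plain strings and uses a greedy single-pass grouping with `difflib.SequenceMatcher` from the Python standard library. The function names, the threshold of 0.6 and the sample captions are hypothetical.

```python
from difflib import SequenceMatcher


def normalise(caption):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(caption.lower().split())


def group_captions(captions, threshold=0.6):
    """Greedily put each caption into the first group whose representative
    (the first caption of that group) is at least `threshold` similar."""
    groups = []  # each group is a list of original caption strings
    for caption in captions:
        norm = normalise(caption)
        for group in groups:
            representative = normalise(group[0])
            if SequenceMatcher(None, norm, representative).ratio() >= threshold:
                group.append(caption)
                break
        else:
            # No existing group was similar enough: start a new one.
            groups.append([caption])
    return groups


# Hypothetical captions, invented for this sketch.
captions = [
    "Causes of death during the year",
    "Causes of Death during the Year 1905",
    "Meteorological observations",
    "Cause of death during the year",
]
groups = group_captions(captions)
for group in groups:
    print(len(group), "x", group[0])
```

A greedy pass depends on input order and on the chosen threshold; a real topic analysis would likely need a proper clustering method and manual review of the resulting groups.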
Identification of table topics
• Geographies:
• Mostly districts
• Also smaller areas (sub-districts,
wards)
• Considerable variety of
• Information content
• Physical structure
• Across many
• Locations
• Years
• Demographics
• Age
• Sex
• Births
• Deaths
• Causes of death
• Infant death
• Ailments
• Diseases
• Infectious diseases
• Notifiable diseases
• Immunisations
• Environmental
• Inspections
• Food
• Conditions
• Meteorological
• Financial
• Legal
Extraction of tabular data
• Can remaining data be extracted
in a less costly way?
• Available for experiments:
• OCR results in ALTO XML format
(Greater London)
• Ran ABBYY FineReader Engine 11
ourselves
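As a rough illustration of working with OCR results in ALTO XML: ALTO stores each recognised word as a `String` element with `CONTENT`, `HPOS` and `VPOS` attributes, so words can be read with their page positions and grouped into rows. The sketch below uses only the Python standard library; the sample XML, the function names and the 5-unit tolerance are made up for this example, and the `{*}` namespace wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Invented sample, not taken from the MOH data.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace><TextBlock ID="B1">
    <TextLine>
      <String CONTENT="Measles" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
      <String CONTENT="42" HPOS="400" VPOS="200" WIDTH="25" HEIGHT="20"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Diphtheria" HPOS="100" VPOS="230" WIDTH="110" HEIGHT="20"/>
      <String CONTENT="7" HPOS="410" VPOS="230" WIDTH="12" HEIGHT="20"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def read_words(alto_xml):
    """Return every recognised word with its top-left position."""
    root = ET.fromstring(alto_xml)
    words = []
    # "{*}String" matches String elements in any ALTO namespace version.
    for element in root.iterfind(".//{*}String"):
        words.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", 0)),
            "y": float(element.get("VPOS", 0)),
        })
    return words


def group_into_rows(words, tolerance=5.0):
    """Group words with nearly equal vertical position into rows,
    each row ordered left to right."""
    rows = []
    for word in sorted(words, key=lambda w: (w["y"], w["x"])):
        if rows and abs(word["y"] - rows[-1][0]["y"]) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w["text"] for w in sorted(row, key=lambda w: w["x"])] for row in rows]


rows = group_into_rows(read_words(SAMPLE_ALTO))
print(rows)
```

Real scans are skewed and tables have multi-line cells, so position-based row grouping like this is only a starting point.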
Extraction of tabular data
• Tests with ABBYY FineReader
• Very inconsistent results
• But column and row headers
sufficiently recognised
Extraction of tabular data
• Prototype: Flexible matching to
locate rows and columns of
interest
• Ignore other data that is less
consistent
• Order of headers usually stable
across geographies
• Variation across the years, but
doable
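One way to picture flexible matching of row and column headers is fuzzy string comparison that tolerates OCR errors. The sketch below is an assumption about how such matching could look, not the prototype's actual method: it uses `difflib.SequenceMatcher` from the Python standard library, and the function names, the 0.7 threshold and the sample table are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_header(target, headers, min_score=0.7):
    """Return the index of the header most similar to `target`,
    or None if no header reaches `min_score`."""
    best_index, best_score = None, 0.0
    for index, header in enumerate(headers):
        score = similarity(target, header)
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= min_score else None


def extract_cell(table, row_name, column_name):
    """Look up one cell. The first row of `table` holds the column headers
    and the first cell of every row holds the row header."""
    column = find_header(column_name, table[0])
    row = find_header(row_name, [r[0] for r in table])
    if column is None or row is None:
        return None  # header not found: skip rather than guess
    return table[row][column]


# Invented OCR output with typical character confusions.
ocr_table = [
    ["Disease", "Cases notifled", "Deaths"],
    ["Scarlet Fcver", "118", "3"],
    ["Diphtheria", "64", "9"],
]
value = extract_cell(ocr_table, "Scarlet Fever", "Cases notified")
missing = extract_cell(ocr_table, "Whooping Cough", "Deaths")
print(value, missing)
```

Only the rows and columns of interest are looked up; everything else in the table is ignored, which matches the idea of skipping less consistent data.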
Extraction of tabular data
• Large proportion of tabular data
could be extracted in an automated
way
• Quality assurance using row /
column totals and geographical
summations
• OCR quality good enough
• Limitations: some rare tables
• Ingestion into database for online
access…
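The quality assurance idea (row / column totals and geographical summations) can be illustrated with a small check that recomputes printed totals from the extracted cells. This is a minimal sketch under assumed conventions: each row's last cell is the printed total, cells arrive as OCR strings, and all names and figures are invented for the example.

```python
import re


def parse_count(text):
    """Convert an OCR'd cell to int; return None if it is not a clean number."""
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if re.fullmatch(r"[0-9]+", cleaned) else None


def check_row_totals(rows):
    """`rows` maps a row name to its cells, the last cell being the printed
    total. Return the names of rows that fail the check."""
    suspicious = []
    for name, cells in rows.items():
        values = [parse_count(cell) for cell in cells]
        if None in values or sum(values[:-1]) != values[-1]:
            suspicious.append(name)
    return suspicious


def check_area_total(sub_area_values, district_total):
    """Geographical summation: sub-district figures should add up to the
    figure reported for the whole district."""
    return sum(sub_area_values) == district_total


# Invented figures for illustration only.
rows = {
    "Measles": ["12", "30", "42"],
    "Diphtheria": ["4", "8", "7"],        # total misread (should be 12)
    "Whooping cough": ["1O", "5", "15"],  # letter O instead of zero
}
flagged = check_row_totals(rows)
wards_ok = check_area_total([120, 95, 210], 425)
print(flagged, wards_ok)
```

A failed check does not say which cell is wrong, only that the row or area needs review, so flagged items would still go to manual correction.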
User consultation
• Online survey and informal meeting with
researchers
• Findings
• Mixed level of awareness of MOH reports
• Current access functionality useful (search by
topic and time period)
• Wide range of audiences would be interested in
tabular statistical data
[Chart: Interest in quantitative MOH data (legible label: "Very interested")]
User consultation
• Findings
• Main interest in basic demographics, mortality
and cause of death, ailments, fertility
• Comparative analyses of large subsets of data
would be of interest (e.g. for epidemiologists)
[Chart: Priority of topics]
Conclusion
• There is interest in statistical numerical data
• Automated extraction is a viable alternative to
manual transcription (with limitations)
• Flexible detection and recognition approaches in
combination with data integration and validation
• Queryable large-scale data enables new research
• Deep insights
• Context for other (qualitative) research
Future work
• Create an index of existing transcribed MOH
tables for better accessibility.
• Create integrated data resource from London
MOH tables for online search across locations and
time.
• Indexing and data extraction across all MOH
reports based on structured OCR results.
• Testing / developing improved table recognition
algorithms (e.g. based on deep learning /
convolutional neural networks).
Questions?
In other news
The 5th International Workshop
on Historical Document Imaging
and Processing
Paper submission deadline: 01 June
primaresearch.org/hip2019
