Towards the Extraction of Statistical Information
from Digitised Numerical Tables
The Medical Officer of Health Reports Scoping Study
Christian Clausner, Apostolos Antonacopoulos,
Christy Henshaw, Justin Hayes
University of Salford
Wellcome Collection
25/09/2019, DATeCH 2019, Brussels
The Medical Officer of Health Reports
• Wellcome Collection holds the UK’s
largest collection of Medical
Officer of Health reports
• 130 years
• Over 70,000 reports
• All digitised and OCRed
https://wellcomelibrary.org/moh/
The Medical Officer of Health Reports
• Narrative textual content + tabular content
• Topics:
• Birth and death statistics
• Notifiable diseases
• General population statistics
• Causes of death
• School health
• Food inspections
• …
The Medical Officer of Health Reports
• OCRed and post-corrected data
available for Greater London
• Individual tables provided in
special format
• Statistical data difficult to
extract
Current Practices
• Standard OCR not sufficient for
extraction of numerical data
• Need accuracy for values AND
context (column / row)
• Common:
• Only indexing and providing access
to images with tables
• Manual correction and provision of
tables in dedicated formats
• Rare / very difficult or expensive:
• Full extraction and integration to
provide faceted searches / data
analysis etc.
1961 Census of England and Wales
The MOH Scoping Study (2018)
• Gain understanding of tabular
data available in the reports
• Investigate ways of data
extraction
• Scope out users’ needs and
expectations
• Based on Greater London data
Identification of table topics
• Text-based analysis of table
captions and headers
• Grouping instances by text
similarity
• Using a tool that was created for
social media analysis
Topic                                        Table count (approx.)
Mortality / Cause of Death                   2530
General statistics / demographics            1900
Infectious Diseases / Notifiable Diseases    1720
Inspections / conditions                     4360
Minor ailments, dental, etc.                 710
Financial                                    470
Food                                         330
Births                                       240
Meteorological                               100
Legal                                        190
Immunisation                                 60
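The grouping of table captions by text similarity could be sketched roughly as follows. This is only an illustration using Python's standard `difflib`; it is not the social media analysis tool used in the study, and the sample captions, the `group_by_similarity` helper and the 0.6 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalise(caption: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(caption.lower().split())


def group_by_similarity(captions, threshold=0.6):
    """Greedy grouping: a caption joins the first group whose first member
    is similar enough, otherwise it starts a new group."""
    groups = []
    for caption in captions:
        text = normalise(caption)
        for group in groups:
            representative = normalise(group[0])
            if SequenceMatcher(None, representative, text).ratio() >= threshold:
                group.append(caption)
                break
        else:
            # No existing group was close enough.
            groups.append([caption])
    return groups


# Hypothetical captions, invented for illustration only.
captions = [
    "Causes of death during the year",
    "Causes of Death during the Year 1925",
    "Notifiable diseases",
    "Cases of notifiable diseases",
    "Meteorological observations",
]
groups = group_by_similarity(captions)
for group in groups:
    print(len(group), group[0])
```

A greedy pass like this depends on the order of the captions and on the threshold; a real analysis would probably need proper clustering and manual review of the resulting topics.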
Identification of table topics
• Geographies:
• Mostly districts
• Also smaller areas (sub-districts,
wards)
• Considerable variety of
• Information content
• Physical structure
• Across many
• Locations
• Years
§ Demographics
§ Age
§ Sex
§ Births
§ Deaths
§ Causes of death
§ Infant death
§ Ailments
§ Diseases
§ Infectious diseases
§ Notifiable diseases
§ Immunisations
§ Environmental
§ Inspections
§ Food
§ Conditions
§ Meteorological
§ Financial
§ Legal
Extraction of tabular data
• Can remaining data be extracted
in a less costly way?
• Available for experiments:
• OCR results in ALTO XML format
(Greater London)
• Ran ABBYY FineReader Engine 11
ourselves
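As a rough illustration of what working with the ALTO XML OCR results can look like, the sketch below reads the recognised words and their positions from a small ALTO fragment. ALTO stores each word as a `String` element with `CONTENT`, `HPOS` and `VPOS` attributes; the sample fragment, its values and the `read_words` helper are invented for this example and are not taken from the MOH data.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO fragment (hypothetical content).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Measles" HPOS="100" VPOS="200" WIDTH="80" HEIGHT="20"/>
      <String CONTENT="42" HPOS="400" VPOS="200" WIDTH="30" HEIGHT="20"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Diphtheria" HPOS="100" VPOS="230" WIDTH="110" HEIGHT="20"/>
      <String CONTENT="7" HPOS="410" VPOS="230" WIDTH="15" HEIGHT="20"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def read_words(alto_xml: str):
    """Return every recognised word with its top-left position."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            words.append({
                "text": element.get("CONTENT", ""),
                "x": float(element.get("HPOS", 0)),
                "y": float(element.get("VPOS", 0)),
            })
    return words


words = read_words(SAMPLE_ALTO)
for word in words:
    print(word["text"], word["x"], word["y"])
```

The word coordinates are what makes it possible to assign values to rows and columns later, since plain OCR text alone loses that context.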
Extraction of tabular data
• Tests with ABBYY FineReader
• Very inconsistent results
• But column and row headers
sufficiently recognised
Extraction of tabular data
• Prototype: Flexible matching to
locate rows and columns of
interest
• Ignore other data that is less
consistent
• Order of headers usually stable
across geographies
• Variation across the years, but
doable
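One way to picture the flexible matching idea is fuzzy comparison of the wanted row and column headers against the OCR output, so that small recognition errors do not prevent a match. The sketch below is an assumption-laden illustration, not the prototype itself: the `locate` helper, the 0.7 threshold and the sample table (with deliberate OCR-style errors such as `Tota1`) are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def locate(target: str, headers, threshold=0.7):
    """Return the index of the header most similar to target,
    or None if nothing reaches the threshold."""
    best_index, best_score = None, 0.0
    for index, header in enumerate(headers):
        score = similarity(target, header)
        if score > best_score:
            best_index, best_score = index, score
    return best_index if best_score >= threshold else None


# Hypothetical OCR output of one table; "1" stands in for a misread "l".
column_headers = ["Disease", "Ma1es", "Fema1es", "Tota1"]
rows = [
    ["Scarlet Fever", "12", "9", "21"],
    ["Diphtheria.", "3", "4", "7"],
    ["Meas1es", "20", "22", "42"],
]
row_headers = [row[0] for row in rows]

column = locate("Total", column_headers)
row = locate("Measles", row_headers)
value = rows[row][column] if row is not None and column is not None else None
print(value)
```

Only the cell at the located row and column is read; everything else in the table is ignored, which matches the idea of skipping data that is less consistent.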
Extraction of tabular data
• Large proportion of tabular data
could be extracted in an automated
way
• Quality assurance using row /
column totals and geographical
summations
• OCR quality good enough
• Limitations: some rare tables
• Ingestion into database for online
access…
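The quality assurance step based on totals can be illustrated with a simple consistency check: if the extracted components of a row do not add up to the extracted total, at least one cell is probably misrecognised. This is a minimal sketch under assumed data; the `check_row_totals` helper and the sample rows are hypothetical, and the same idea would apply to column totals and to sums across geographies.

```python
def parse_count(cell: str):
    """Convert an OCR cell to an int, or None if it is not a clean number."""
    cleaned = cell.replace(",", "").strip()
    return int(cleaned) if cleaned.isdecimal() else None


def check_row_totals(rows):
    """rows: iterable of (label, component cells, total cell).
    Returns the labels of rows whose components do not sum to the total."""
    failures = []
    for label, parts, total in rows:
        values = [parse_count(part) for part in parts]
        expected = parse_count(total)
        if expected is None or None in values or sum(values) != expected:
            failures.append(label)
    return failures


# Hypothetical extracted rows: (disease, [males, females], total).
rows = [
    ("Measles", ["20", "22"], "42"),
    ("Diphtheria", ["3", "4"], "7"),
    ("Scarlet fever", ["12", "9"], "27"),    # total misread (should be 21)
    ("Whooping cough", ["1O", "5"], "15"),   # letter O instead of zero
]
failures = check_row_totals(rows)
print(failures)
```

Rows that fail the check could be flagged for manual review rather than discarded, since the error may be in a single digit.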
User consultation
• Online survey and informal meeting with
researchers
• Findings
• Mixed level of awareness of MOH reports
• Current access functionality useful (search by
topic and time period)
• Wide range of audiences would be interested in
tabular statistical data
[Chart: Interest in quantitative MOH data; visible label: "Very interested"]
User consultation
• Findings
• Main interest in basic demographics, mortality
and cause of death, ailments, fertility
• Comparative analyses of large subsets of data
would be of interest (e.g. for epidemiologists)
[Chart: Priority of topics]
Conclusion
• There is interest in statistical numerical data
• Automated extraction is a viable alternative to
manual transcription (with limitations)
• Flexible detection and recognition approaches in
combination with data integration and validation
• Queryable large-scale data enables new research
• Deep insights
• Context for other (qualitative) research
Future work
• Create an index of existing transcribed MOH
tables for better accessibility.
• Create integrated data resource from London
MOH tables for online search across locations and
time.
• Indexing and data extraction across all MOH
reports based on structured OCR results.
• Testing / developing improved table recognition
algorithms (e.g. based on deep learning /
convolutional neural networks).
Questions?
In other news
The 5th International Workshop
on Historical Document Imaging
and Processing
Paper submission deadline: 01 June
primaresearch.org/hip2019
