Hybrid methodology for information extraction
from tables in biomedical literature
Nikola Milošević, Cassie Gregson, Robert Hernandez, Goran Nenadić
Contact: nikola.milosevic@manchester.ac.uk
Literature growth
• MEDLINE contains more than 26 million citations
• Number of citations is growing exponentially
• 2100 new articles published daily in biomedicine
• Professionals are no longer able to keep up with the state of the art
Text mining
Source: https://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
Table mining
• Current text mining efforts focus on main text of the article
• Usually ignore tables and figures
• Tables contain
• Settings of the experiment (patient characteristics, arms, dosages, etc.)
• Results of the experiment
• Definition of terms and quantitative scales
• Examples (e.g. questionnaires)
• …
• Article information is incomplete without tables (and figures)
Table complexity
One-dimensional (list) table
Two-dimensional (matrix) table
Table complexity (2)
Multi-dimensional (super-row) table
Multi-dimensional (multi-table) table
Challenges
• Dense content
• Variety of layouts
• Variety of value representation formats
• Misleading visualization markup
• Lack of resources (labelled datasets)
Aim and objectives
• Create a multi-layered approach to mining information from tables
• to facilitate large-scale semi-automated extraction
• and curation of data stored in tables
Table mining methodology overview
Functional processing
• Classifies cells into functional classes:
• Header,
• super-row,
• stub,
• data
• Uses heuristics based on content and position
• Described in:
Milosevic, N., Gregson, C., Hernandez, R., Nenadic, G.
Disentangling structure of tables in scientific literature.
In Proceedings of the 21st International Conference on Applications of Natural Language to
Information Systems (NLDB 2016) (2016), Springer.
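A minimal sketch of content- and position-based cell classification, for illustration only. The slides do not list the actual heuristics, so the rules here (first row is a header, a row with only its first cell filled is a super-row, the first column is the stub) and the names `is_numeric` and `classify_cells` are assumptions, not the method from the cited paper.

```python
from typing import List


def is_numeric(text: str) -> bool:
    """Return True if the cell text parses as a number (assumed content cue)."""
    cleaned = text.replace("%", "").replace(",", "").strip()
    try:
        float(cleaned)
        return True
    except ValueError:
        return False


def classify_cells(rows: List[List[str]], header_rows: int = 1) -> List[List[str]]:
    """Label every cell as 'header', 'super-row', 'stub' or 'data'."""
    labels = []
    for r, row in enumerate(rows):
        filled = [cell for cell in row if cell.strip()]
        if r < header_rows:
            # Position cue: the top rows are treated as headers.
            labels.append(["header"] * len(row))
        elif len(filled) == 1 and row[0].strip() and not is_numeric(row[0]):
            # Content cue: only the first cell is filled and it is not a number.
            labels.append(["super-row"] * len(row))
        else:
            # Position cue: first column is the stub, the rest are data cells.
            labels.append(["stub" if c == 0 else "data" for c in range(len(row))])
    return labels


table = [
    ["Characteristic", "Drug", "Placebo"],
    ["Demographics", "", ""],
    ["Age, mean", "54.2", "53.9"],
]
cell_labels = classify_cells(table)
```

Real tables need more cues than this (spanning cells, indentation, multi-row headers), which is why the approach is heuristic rather than a fixed rule.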
Structural processing
• Determines relationships between cells
• Using cell functions and table structure, classifies the
table into one of the structural table types:
• List
• Matrix
• Super-row
• Multi-table
• Based on the type, a set of rules resolves the relationships
• Milosevic, N., Gregson, C., Hernandez, R., Nenadic, G.
Disentangling structure of tables in scientific literature.
In Proceedings of the 21st International Conference on Applications of Natural
Language to Information Systems (NLDB 2016) (2016), Springer.
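The two steps above (pick a structural type, then resolve cell relationships) could be sketched as follows. The decision rules and the names `structural_type` and `resolve_relationships` are hypothetical; the input is assumed to be a grid of cell texts plus a parallel grid of functional labels, with every row non-empty.

```python
from typing import Dict, List, Optional


def structural_type(labels: List[List[str]]) -> str:
    """Guess the structural type from the functional label of each row's first cell."""
    kinds = [row[0] for row in labels if row]
    seen_body = False
    for kind in kinds:
        if kind == "header" and seen_body:
            # A header row after body rows suggests several stacked tables.
            return "multi-table"
        if kind != "header":
            seen_body = True
    if "super-row" in kinds:
        return "super-row"
    has_data = any("data" in row for row in labels)
    if "header" in kinds and has_data:
        return "matrix"
    return "list"


def resolve_relationships(rows: List[List[str]],
                          labels: List[List[str]]) -> List[Dict[str, Optional[str]]]:
    """Attach the governing header, super-row and stub text to each data cell."""
    records = []
    header: List[str] = []
    super_row: Optional[str] = None
    for row, row_labels in zip(rows, labels):
        if row_labels[0] == "header":
            header, super_row = row, None
        elif row_labels[0] == "super-row":
            super_row = row[0]
        else:
            for c in range(1, len(row)):
                if row_labels[c] == "data":
                    records.append({
                        "header": header[c] if c < len(header) else "",
                        "super_row": super_row,
                        "stub": row[0],
                        "value": row[c],
                    })
    return records


rows = [
    ["Characteristic", "Drug", "Placebo"],
    ["Demographics", "", ""],
    ["Age, mean", "54.2", "53.9"],
]
labels = [
    ["header", "header", "header"],
    ["super-row", "super-row", "super-row"],
    ["stub", "data", "data"],
]
records = resolve_relationships(rows, labels)
```

Each record carries the full navigational path to a value, which is what later steps need in order to decide whether a cell is of interest.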
Semantic tagging
• Semantically tags terms, phrases or words
• Knowledge sources (UMLS, DBpedia, WordNet)
• Used MetaMap for tagging with UMLS
• Helps with pragmatic classification and information extraction
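To illustrate what tagging a cell produces, here is a toy dictionary lookup. It is not MetaMap and does not use UMLS: the lexicon, its tag names and the function `tag_cell` are all hypothetical stand-ins for a real knowledge source.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon; a real system would query a knowledge source instead.
LEXICON: Dict[str, str] = {
    "adverse event": "finding",
    "nausea": "symptom",
    "headache": "symptom",
    "age": "patient attribute",
}


def tag_cell(text: str, lexicon: Dict[str, str] = LEXICON) -> List[Tuple[str, str]]:
    """Return (phrase, tag) pairs for every lexicon phrase found as whole words."""
    lowered = text.lower()
    tags = []
    # Longer phrases are checked first, so they appear first in the result.
    for phrase in sorted(lexicon, key=len, reverse=True):
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            tags.append((phrase, lexicon[phrase]))
    return tags
```

Whole-word matching matters here: without the word boundaries, "Dosage" would be tagged with the entry for "age".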
Pragmatic processing
• Determines the purpose of the table
• Machine learning approach
• Naïve Bayes, Bayes Nets, SVM, Decision trees, random forests
• More specific classes -> better results
• Evidence based on 2 trials
• Settings, findings, support tables: ~80% F-score
• Baseline characteristics, Adverse events, Inclusion/Exclusion, Other: ~95% F-score
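As a rough sketch of the machine learning step, the following trains a small multinomial Naïve Bayes classifier (one of the algorithms listed above) on caption words. The training captions, the choice of caption words as features, and the function names are assumptions made for illustration; the slides do not state which features were used.

```python
import math
from collections import Counter, defaultdict
from typing import List, Tuple


def train_naive_bayes(examples: List[Tuple[str, str]]):
    """Count classes and per-class words from (caption, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text: str) -> str:
    """Return the label with the highest log-probability (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up captions, for illustration only.
training = [
    ("baseline characteristics of patients", "baseline characteristics"),
    ("demographic and baseline characteristics", "baseline characteristics"),
    ("adverse events reported during treatment", "adverse events"),
    ("most common adverse events", "adverse events"),
    ("inclusion and exclusion criteria", "inclusion/exclusion"),
    ("study eligibility criteria", "inclusion/exclusion"),
]
model = train_naive_bayes(training)
```

With classes this specific, a handful of distinctive words separates them well, which is consistent with the observation that more specific classes give better results.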
Value identification and syntactic processing
• Identifying the cell of interest:
• Looks at the navigational cells for lexical cues or for semantic types in tags
• Lexical cues in white and black lists
• Syntactic processing
• Uses a set of patterns to determine the semantics of the value
• Extracts the selected value
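The two steps above can be sketched as a cue check followed by pattern matching. The cue lists, the three value patterns and the names `is_cell_of_interest` and `parse_value` are hypothetical examples, not the lists or patterns used in the actual system.

```python
import re
from typing import Dict, Optional

# Hypothetical lexical cues for extracting patient age.
WHITE_LIST = ("age",)
BLACK_LIST = ("weight", "dosage")


def _has_cue(text: str, cues) -> bool:
    """True if any cue occurs in the text as a whole word."""
    return any(re.search(r"\b" + re.escape(cue) + r"\b", text) for cue in cues)


def is_cell_of_interest(navigation_text: str) -> bool:
    """Accept a cell if its header/stub text has a white-list cue and no black-list cue."""
    text = navigation_text.lower()
    if _has_cue(text, BLACK_LIST):
        return False
    return _has_cue(text, WHITE_LIST)


NUM = r"(\d+(?:\.\d+)?)"
# Assumed value presentation patterns, tried in order.
PATTERNS = [
    ("mean_sd", re.compile(rf"^{NUM}\s*(?:±|\+/-)\s*{NUM}$")),
    ("range", re.compile(rf"^{NUM}\s*(?:-|–|to)\s*{NUM}$")),
    ("single", re.compile(rf"^{NUM}$")),
]


def parse_value(cell: str) -> Optional[Dict[str, object]]:
    """Return the matched pattern name and its numbers, or None if nothing matches."""
    text = cell.strip()
    for name, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return {"pattern": name, "numbers": [float(g) for g in match.groups()]}
    return None
```

The pattern name is what gives the numbers their meaning: the same two numbers are a mean with a standard deviation under one pattern and the two ends of a range under another.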
Pragmatic classification results
• Pragmatic classification performs well with specific classes
• 4 classes: baseline characteristics, adverse events, inclusion/exclusion, other
• Best performance - SVM
Information extraction results
• Extracted number of patients
• New tests on extracting patient age, adverse events (using UMLS)
Patients’ age
Adverse reactions
Lessons learned
• Table mining requires multi-layered analysis
• Functional and structural analysis are crucial
• Semantics of value presentation patterns
• Semantic tagging helps
• Machine learning helps in certain steps (i.e. pragmatic analysis)
• Combination of heuristic-based and machine-learning-based steps
• Availability:
• https://github.com/nikolamilosevic86/TableAnnotator
• https://github.com/nikolamilosevic86/TableInformationExtractionScripts
Future plans
• Develop an easy-to-use methodology
• Develop UI tool (wizard) for information extraction from tables
• Improve the methodology
• Compare heuristic-based vs machine-learning-based IE
• Examine methods for unbalanced datasets
Acknowledgements
Dr Michele Filannino
Dr Azad Dehghan
Nikola Milošević
Ruth Stoney
Maksim Belousov
Dr Goran Nenadić
Robert Hernandez
Cassie Gregson
Richard Boyce
Jodi Schneider
Steven DeMarco
nikola.milosevic@manchester.ac.uk

BelBi2016 presentation: Hybrid methodology for information extraction from tables in biomedical literature
