Hybrid methodology for information extraction
from tables in biomedical literature
Nikola Milošević, Cassie Gregson, Robert Hernandez, Goran Nenadić
Contact: nikola.milosevic@manchester.ac.uk
Literature growth
• MEDLINE contains more than 26 million citations
• The number of citations is growing exponentially
• 2100 new articles published daily in biomedicine
• Professionals can no longer keep up with the state of the art
Text mining
Source: https://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining
Table mining
• Current text mining efforts focus on main text of the article
• Usually ignore tables and figures
• Tables contain
• Settings of the experiment (patient characteristics, arms, dosages, etc.)
• Results of the experiment
• Definition of terms and quantitative scales
• Examples (e.g. questionnaires)
• …
• Article information is incomplete without tables (and figures)
Table complexity
One-dimensional (list) table
Two-dimensional (matrix) table
Table complexity (2)
Multi-dimensional (super-row) table
Multi-dimensional (multi-table) table
Challenges
• Dense content
• Variety of layouts
• Variety of value representation formats
• Misleading visualization markup
• Lack of resources (labelled datasets)
Aim and objectives
• Create a multi-layered approach to mining information from tables
• to facilitate large-scale, semi-automated extraction
• to support curation of data stored in tables
Table mining methodology overview
Functional processing
• Classifies cells into functional classes
• Header,
• super-row,
• stub,
• data
• Uses heuristics based on content and position
• Described in:
Milosevic, N., Gregson, C., Hernandez, R., Nenadic, G.
Disentangling structure of tables in scientific literature.
In Proceedings of the 21st International Conference on Applications of Natural Language to
Information Systems (NLDB 2016) (2016), Springer.
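As a rough illustration of heuristics based on content and position, the sketch below labels cells of a small table. The rules, the function name `classify_cells` and the list-of-rows table representation are assumptions made for this example; they are not the rules from the NLDB 2016 paper.

```python
def classify_cells(grid):
    """Label every cell of a rectangular grid (a list of rows of strings).

    Illustrative rules only (assumed for this sketch):
      - every cell in the first row            -> "header"
      - a later row with text in column 0 and
        all other cells empty                  -> "super-row"
      - column 0 of any remaining row          -> "stub"
      - everything else                        -> "data"
    """
    labels = []
    for r, row in enumerate(grid):
        # Content cue: a row whose only text sits in the first column.
        only_first_filled = (
            len(row) > 1
            and row[0].strip() != ""
            and all(cell.strip() == "" for cell in row[1:])
        )
        row_labels = []
        for c, _cell in enumerate(row):
            if r == 0:
                row_labels.append("header")      # position cue
            elif only_first_filled:
                row_labels.append("super-row")   # content cue
            elif c == 0:
                row_labels.append("stub")        # position cue
            else:
                row_labels.append("data")
        labels.append(row_labels)
    return labels


# Made-up example table.
table = [
    ["Characteristic", "Drug A", "Placebo"],
    ["Demographics", "", ""],
    ["Age, years", "54.2", "55.1"],
    ["Male, n", "30", "28"],
]
labels = classify_cells(table)
```

Real tables need far more rules than this (multi-row headers, spanning cells, misleading markup), which is why the slide points to the paper for the full description.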
Structural processing
• Determines relationships between cells
• Using cell functions and table structure, classifies the
table into one of the structural table types:
• List
• Matrix
• Super-row
• Multi-table
• Based on the type, a set of rules resolves the relationships
• Milosevic, N., Gregson, C., Hernandez, R., Nenadic, G.
Disentangling structure of tables in scientific literature.
In Proceedings of the 21st International Conference on Applications of Natural
Language to Information Systems (NLDB 2016) (2016), Springer.
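The idea of resolving relationships from cell functions can be sketched as follows. This is a simplified illustration under stated assumptions: the functional labels are given as input, headers occupy a single first row, and multi-table detection is left out. The names `table_type` and `resolve_relationships` are hypothetical.

```python
def table_type(labels):
    """Guess a structural type from functional labels (illustrative rules only;
    the multi-table type is not handled in this sketch)."""
    flat = [label for row in labels for label in row]
    if "super-row" in flat:
        return "super-row"
    if "header" in flat and "stub" in flat:
        return "matrix"
    return "list"


def resolve_relationships(grid, labels):
    """Link each data cell to its column header, row stub and enclosing super-row."""
    has_header_row = bool(labels) and all(l == "header" for l in labels[0])
    headers = grid[0] if has_header_row else []
    current_super = None
    records = []
    for row, row_labels in zip(grid, labels):
        if "super-row" in row_labels:
            current_super = row[0]  # applies to the rows that follow
            continue
        stub = next((cell for cell, l in zip(row, row_labels) if l == "stub"), None)
        for c, (cell, label) in enumerate(zip(row, row_labels)):
            if label == "data":
                records.append({
                    "value": cell,
                    "header": headers[c] if c < len(headers) else None,
                    "stub": stub,
                    "super_row": current_super,
                })
    return records


# Made-up example: a small super-row table with its functional labels.
grid = [
    ["Characteristic", "Drug A", "Placebo"],
    ["Demographics", "", ""],
    ["Age, years", "54.2", "55.1"],
    ["Male, n", "30", "28"],
]
cell_labels = [
    ["header", "header", "header"],
    ["super-row", "super-row", "super-row"],
    ["stub", "data", "data"],
    ["stub", "data", "data"],
]
kind = table_type(cell_labels)
records = resolve_relationships(grid, cell_labels)
```

Each record carries the full navigational path of a value, so later steps can ask for, say, the "Age, years" value in the "Drug A" arm without looking at the layout again.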
Semantic tagging
• Semantically tags terms, phrases or words
• Knowledge sources (UMLS, DBPedia, WordNet)
• Used MetaMap for tagging with UMLS
• Helps with pragmatic classification and information extraction
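To make the idea of semantic tagging concrete, here is a toy dictionary-based tagger. It does not call MetaMap or UMLS; the lexicon, its type labels and the function name `tag` are invented for illustration, and a real system would query a knowledge source instead.

```python
import re

# Hypothetical mini-lexicon: phrase -> made-up semantic type label.
LEXICON = {
    "age": "demographic",
    "headache": "adverse-event",
    "nausea": "adverse-event",
    "body mass index": "measurement",
}


def tag(text, lexicon=LEXICON):
    """Return (phrase, semantic_type) pairs found in text.

    Phrases are tried longest first and matched as whole words,
    case-insensitively.
    """
    lowered = text.lower()
    found = []
    for phrase in sorted(lexicon, key=len, reverse=True):
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.append((phrase, lexicon[phrase]))
    return found


stub_tags = tag("Age (years)")
event_tags = tag("Nausea and headache")
```

Tags like these can then serve as features for pragmatic classification or as cues when looking for the cell of interest.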
Pragmatic processing
• Determines the purpose of the table
• Machine learning approach
• Naïve Bayes, Bayes Nets, SVM, Decision trees, random forests
• More specific classes -> better results
• Evidence based on 2 trials
• Settings, findings, support tables: ~80% F-score
• Baseline characteristics, Adverse events, Inclusion/Exclusion, Other: ~95% F-score
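A minimal sketch of pragmatic classification with one of the listed classifier families, Naïve Bayes, written from scratch so that it runs without extra libraries. The features (caption tokens), the class names and the tiny training set are all assumptions made for this example; the slides do not specify the features used, and they report SVM as the best performer.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training captions; real training data would be annotated tables.
captions = [
    "baseline characteristics of patients age sex weight",
    "demographic and baseline characteristics",
    "adverse events reported during treatment",
    "treatment emergent adverse events by severity",
    "inclusion and exclusion criteria",
    "eligibility criteria inclusion exclusion",
]
classes = [
    "baseline", "baseline",
    "adverse", "adverse",
    "inclusion_exclusion", "inclusion_exclusion",
]
model = NaiveBayes().fit(captions, classes)
prediction = model.predict("adverse events in the placebo group")
```

With more specific classes, distinctive vocabulary such as "adverse" or "eligibility" separates the classes cleanly, which is one plausible reading of why the finer-grained scheme scores higher.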
Value identification and syntactic processing
• Identifying the cell of interest:
• Looks at the navigational cells for lexical cues or for semantic types in tags
• Lexical cues in white and black lists
• Syntactic processing
• Uses a set of patterns to determine the semantics of the value
• Extracts the selected value
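The two steps above can be sketched as a cue-list check on navigational text followed by pattern matching on the cell value. The cue lists, the pattern set and the function names below are hypothetical examples for a "number of patients" target, not the lists or patterns used in the original work.

```python
import re

# Hypothetical cue lists for the target "number of patients".
WHITE_LIST = ("number of patients", "patients, n", "n =", "participants")
BLACK_LIST = ("age", "weight", "%")

NUM = r"\d+(?:\.\d+)?"

# Hypothetical value patterns, tried in order; each name states the
# semantics assigned to a value that matches.
PATTERNS = [
    ("mean_sd", re.compile(rf"^\s*(?P<mean>{NUM})\s*(?:±|\+/-)\s*(?P<sd>{NUM})\s*$")),
    ("range", re.compile(rf"^\s*(?P<low>{NUM})\s*(?:-|–|to)\s*(?P<high>{NUM})\s*$")),
    ("count_percent", re.compile(rf"^\s*(?P<count>\d+)\s*\(\s*(?P<percent>{NUM})\s*%?\s*\)\s*$")),
    ("number", re.compile(rf"^\s*(?P<value>{NUM})\s*$")),
]


def is_cell_of_interest(navigation_text):
    """Check header/stub text: any black-list cue rejects, any white-list cue accepts."""
    text = navigation_text.lower()
    if any(cue in text for cue in BLACK_LIST):
        return False
    return any(cue in text for cue in WHITE_LIST)


def parse_value(cell):
    """Return (pattern_name, parsed_numbers) for the first matching pattern, else None."""
    for name, pattern in PATTERNS:
        match = pattern.match(cell)
        if match:
            return name, {key: float(val) for key, val in match.groupdict().items()}
    return None


wanted = is_cell_of_interest("Number of patients")
parsed = parse_value("54.2 ± 11.3")
```

Ordering matters here: the more specific patterns come first so that a plain number is only reported when nothing richer matches.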
Pragmatic classification results
• Pragmatic classification performs well with specific classes
• 4 classes: baseline characteristics, adverse events, inclusion/exclusion, other
• Best performance - SVM
Information extraction results
• Extracted number of patients
• New tests on extracting patient age, adverse events (using UMLS)
Patients' age
Adverse reactions
Lessons learned
• Table mining requires multi-layered analysis
• Functional and structural analysis are crucial
• Semantics of value presentation patterns
• Semantic tagging helps
• Machine learning helps in certain steps (e.g. pragmatic analysis)
• Combination of heuristic-based and machine-learning-based steps
• Availability:
• https://github.com/nikolamilosevic86/TableAnnotator
• https://github.com/nikolamilosevic86/TableInformationExtractionScripts
Future plans
• Develop easy to use methodology
• Develop UI tool (wizard) for information extraction from tables
• Improve the methodology
• Compare heuristic based vs machine learning based IE
• Examine methods for unbalanced datasets
Acknowledgements
Dr Michele Filannino
Dr Azad Dehghan
Nikola Milošević
Ruth Stoney
Maksim Belousov
Dr Goran Nenadić
Robert Hernandez
Cassie Gregson
Richard Boyce
Jodi Schneider
Steven DeMarco
nikola.milosevic@manchester.ac.uk

BelBi2016 presentation: Hybrid methodology for information extraction from tables in biomedical literature
