Extracting patient data from
tables in clinical literature
Case study on extraction of BMI,
weight and number of patients
Nikola Milosevic, Cassie Gregson,
Robert Hernandez, Goran Nenadic
Clinical trial literature
• PubMed contains nearly 800 000 clinical trial
publications
• Researchers challenged with the amount of
published literature
Help from text mining?
• Text mining provides methods to process text
on a large scale
• Current text mining efforts were mainly
focused on text, rather than tables and figures
Tables in clinical documents
• A clinical trial publication contains 2.1 tables on average
• Tables often contain information about
settings and findings of experiments
Challenges for table mining
• Dense content
• Variety of layouts
• Variety of value representation formats
• Misleading visualization markup
• Lack of resources (labelled datasets)
• How to automatically make
sense of tables
Aim – a case study
• Extract information about number of patients,
patient’s BMI and weight from tables in
clinical trial literature
• A multi-layered approach to mining
information from tables
– to facilitate large-scale semi-automated extraction
– curation of data stored in tables
Methodology overview
• Rule-based methodology
– Rules created based on a manual analysis of a small
subset of tables
• Five processing layers
– Detection
– Functional
– Structural
– Syntactic
– Semantic
Methodology overview
Table model
• We model 4 main types of tables
– List
– Matrix
– Super-row
– Multi-table
• Based on table dimensionality
Table types (1)
• List table:
Table types (2)
• Matrix table
Table types (3)
• Super-row table
Table types (4)
• Multi-table
1. Functional analysis
• Classifies cells into functional classes
– Header
– Super-row
– Stub
– Data
• Uses heuristics based on content and position
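A minimal sketch of what position- and content-based cell classification could look like. The slides do not list the actual heuristics, so every rule here is an assumption made for illustration: the first row is treated as the header, the first column as the stub, and a row whose only non-empty cell is the first one as a super-row. The function name `classify_cells` is hypothetical.

```python
def classify_cells(rows):
    """Label each cell of a rectangular table (a list of lists of strings).

    Assumed rules, not the authors' actual heuristics:
    row 0 is the header, column 0 is the stub, and a row with text
    only in its first cell is a super-row.
    """
    labels = []
    for r, row in enumerate(rows):
        # Content cue: only the first cell has text, so the row groups rows below it.
        is_super_row = (
            r > 0
            and bool(row[0].strip())
            and not any(cell.strip() for cell in row[1:])
        )
        row_labels = []
        for c in range(len(row)):
            if r == 0:
                row_labels.append("header")      # position cue: top row
            elif is_super_row:
                row_labels.append("super-row")
            elif c == 0:
                row_labels.append("stub")        # position cue: first column
            else:
                row_labels.append("data")
        labels.append(row_labels)
    return labels


table = [
    ["Characteristic", "Placebo", "Drug"],
    ["Demographics", "", ""],
    ["Age", "54", "56"],
]
labels = classify_cells(table)
```

Real tables need more cues than this (multi-row headers, spanning cells, markup), which is where the misleading visualization markup mentioned earlier becomes a problem.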
2. Structural analysis
• Determines relationships between cells
• Using cell functions and table structure, classifies the
table into one of the structural table types:
– List
– Matrix
– Super-row
– Multi-table
• Based on the type, a set of rules resolves the
relationships
3.1 Extracting number of patients
• Heuristic-based approach
• Searches captions, headers, cells
• In captions 2 rules:
– n=%d
– %d Adj*(patients|participants|subjects|individuals)
– Usually total number of patients is found
• In header
– usually n=%d
– can be partial, needs adding up
• In cells
– stub contains defined word or phrase
– Can be partial, needs adding up
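The two caption rules and the header rule above can be sketched as regular expressions. This is an illustration under stated assumptions, not the authors' implementation: "Adj*" is approximated as up to three alphabetic words, matching is case-insensitive, and the names `N_EQUALS`, `COUNT_NOUN`, `counts_in_text` and `total_from_headers` are hypothetical.

```python
import re

# Rule 1: "n=%d", allowing optional spaces around "=".
N_EQUALS = re.compile(r"\bn\s*=\s*(\d+)", re.IGNORECASE)

# Rule 2: "%d Adj*(patients|participants|subjects|individuals)".
# Assumption: an adjective is any alphabetic word (hyphens allowed), at most three.
COUNT_NOUN = re.compile(
    r"\b(\d+)\s+(?:[A-Za-z-]+\s+){0,3}?"
    r"(?:patients|participants|subjects|individuals)\b",
    re.IGNORECASE,
)


def counts_in_text(text):
    """Return every count matched by rule 1, then every count matched by rule 2."""
    return [
        int(m.group(1))
        for rx in (N_EQUALS, COUNT_NOUN)
        for m in rx.finditer(text)
    ]


def total_from_headers(headers):
    """Header cells often hold per-arm counts, so partial values are added up."""
    return sum(int(m.group(1)) for h in headers for m in N_EQUALS.finditer(h))


caption_counts = counts_in_text("Baseline characteristics of 120 obese adult patients")
header_total = total_from_headers(["Placebo (n=20)", "Drug A (n = 22)", "P value"])
```

Whether a matched count is a total or a partial value still has to be decided from where it was found; the sketch only adds up header counts.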
3.2 Extracting BMI
• Based on trigger phrase (BMI, body mass
index) list and black list (change, increase)
• Trigger words in stub or header signal that a
BMI value may appear
• If a blacklisted word is in the vicinity, the
value is discarded
• Range of 14-40
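The trigger list, black list and range check can be combined in a few lines. The sketch below assumes that "vicinity" means the stub and header text attached to a data cell, and that the first in-range number in the cell is the value of interest; both are assumptions, as is the function name `extract_bmi`. The word lists and the range are the ones named on the slide.

```python
import re

TRIGGERS = ("bmi", "body mass index")   # trigger phrases from the slide
BLACKLIST = ("change", "increase")      # blacklisted words from the slide
BMI_MIN, BMI_MAX = 14.0, 40.0           # accepted range from the slide
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_bmi(stub, header, data):
    """Return the first in-range number from a data cell, or None.

    Assumption: "vicinity" is the stub and header text of the cell.
    """
    context = f"{stub} {header}".lower()
    if not any(trigger in context for trigger in TRIGGERS):
        return None                     # no trigger phrase, nothing to extract
    if any(word in context for word in BLACKLIST):
        return None                     # e.g. "Change in BMI" is not a BMI value
    for token in NUMBER.findall(data):
        value = float(token)
        if BMI_MIN <= value <= BMI_MAX:
            return value
    return None


bmi = extract_bmi("BMI (kg/m2)", "Placebo (n=20)", "27.4 ± 3.1")
```

In this example the range check also filters out the standard deviation (3.1), since it falls below 14.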
3.3 Extracting weights
• Based on trigger words and black lists
• Looking in stub and header for words from
lists and values in data cells
• Not useful to set a range
– A person can weigh 40 – 150 kg
– In lbs: 80 – 350 lbs
– A baby can weigh 1500 – 5000 g
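A small illustration of why a single range does not help for weight: using the three ranges from the slide, the same number can be plausible under more than one unit, and the union of the ranges spans 40 to 5000. The names `RANGES` and `plausible_units` are hypothetical.

```python
# Plausible weight ranges per unit, taken from the slide.
RANGES = {"kg": (40, 150), "lbs": (80, 350), "g": (1500, 5000)}


def plausible_units(value):
    """List every unit under which the value is a plausible weight."""
    return [unit for unit, (low, high) in RANGES.items() if low <= value <= high]


ambiguous = plausible_units(100)    # plausible both as kg and as lbs
newborn = plausible_units(3200)     # only plausible as grams
```

A value of 100 cannot be validated without knowing its unit, so a range check is only meaningful once the unit has been recovered from the stub, header or cell.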
Results
• Corpus contained 3573 tables in 1273 documents
• Each table on average 80 cells
• Evaluating Functional and Structural processing:
– Randomly selected 100 tables of each type and
evaluated them
• Evaluating information extraction:
– Number of patients:
• 758 contained data
• 50 random documents
– BMI and weight:
• 113 documents containing this information
Functional analysis results
Results for information extraction
• Extracting number of patients:
• Extracting weight and BMI:
Discussion
• Better-scoped values, such as BMI, can be
modelled, giving better performance
• Define exhaustive white and black lists
• Variety of presentation formats and means
• Misleading markup
• However, promising results
Summary
• Large-scale table mining to harvest population
details from clinical trials
• Classified tables based on layout
• Case study on clinical trial patient number,
BMI and weight
• Promising performance
nikola.milosevic@manchester.ac.uk
