Text Mining - as Normal as Data Mining?
Andrew Hinton, Application Specialist
IISDV 2016, Tuesday 19th April 2016, Nice
Agenda
Introduction to text mining
The challenge
Applications of specialised normalization solutions
− Maximising Source Normalization
− EASL (Extraction and Search Language)
  − Allows programmatic access to unstructured data, similar to SQL over structured data
− Numeric Normalization & Range Search
  − Capturing weights between 60 and 80kg, whether expressed in kilograms or pounds, for patient selection from EHRs
− Gene Mutation Normalization
  − Use case where gene mutations have been linked to rare disease progression
© 2016 Linguamatics Ltd
Answers to Our Questions are in Free Text
80% of information at companies is in free text
Most of the answers to our questions are there
Ever-increasing amounts of text data to examine
[Chart: PubMed Records, axis from 0 to 25,000,000]
− Different kinds of documents
  − External literature, patents, EHRs, internal reports, blogs, presentations
− Different formats
  − HTML, PDF, XML, Word, PPT, Wiki, TXT, HL7
Keyword Searching
OLED
Documents, Web Pages, Folders
All these documents contain the keyword ‘Additive’.
Read ALL of each document to find the bit relevant to you
Linguamatics in Healthcare
[Diagram labels]
− Electronic Health Record
− Enterprise Data Warehouse
− Pathology, radiology, initial assessment, discharge, check up
− Structured data
− Clinical Risk Monitor
− Patient characteristics
− Patient lists
− ClinicalTrials.gov
− Patient characteristics
− Matching Clinical Trials
− Patient Narrative
− Semantic search tags
− Semantic Enrichment
− Clinical case histories and/or genomic interpretation
− Patient characteristics
− Scientific literature
I2E Transforms Text into Actionable Insights
Turn text into structured data using sophisticated queries
− Accurate results: only retrieves relevant results
− Complete results: comprehensive and systematic
To drive analytics
[Diagram labels: Analytics, Enterprise Warehouse]
Search vs. Text Mining
Search Engine
− Filter to find most relevant documents, then read
Text Mining
− Natural Language Processing (NLP): understand meaning
− Use of ontologies and clustered results
− Efficient review, without reading every document
Sources: News Feeds, Literature, Patents, Internal Reports, Social Media
Challenges in Unstructured Data
Different word, same meaning
− cyclosporine
− ciclosporin
− Neoral
− Sandimmune
Different expression, same meaning
− Non-smoker
− Does not smoke
− Does not drink or smoke
− Denies tobacco use
Different grammar, same meaning
− 5mg/kg of cyclosporine per day
− 5mg/kg per day of cyclosporine
− cyclosporine 5mg/kg per day
Same word, different context
− Diagnosed with diabetes
− Family history of diabetes
− No family history of diabetes
NLP
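The "different word, same meaning" case above can be sketched as a simple synonym lookup. This is only a minimal illustration and not how I2E works internally: the `CONCEPTS` table, the choice of "cyclosporine" as the preferred term, and the `find_concepts` helper are all assumptions made for this example, and the second test sentence is invented.

```python
import re

# Hypothetical synonym table: every surface form from the slide maps to one
# preferred concept label (the label itself is an arbitrary choice here).
CONCEPTS = {
    "cyclosporine": "cyclosporine",
    "ciclosporin": "cyclosporine",
    "neoral": "cyclosporine",
    "sandimmune": "cyclosporine",
}


def find_concepts(text):
    """Return the sorted set of concept labels mentioned in the text."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return sorted({CONCEPTS[token] for token in tokens if token in CONCEPTS})


# Different spellings and brand names resolve to the same concept.
print(find_concepts("5mg/kg of cyclosporine per day"))
print(find_concepts("Switched from Sandimmune to Neoral"))  # invented sentence
print(find_concepts("Denies tobacco use"))
```

A dictionary lookup like this handles synonyms only. The other cases on the slide (negation, family history, word order) need linguistic context, which is where NLP comes in.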
Linguistic Processing Using NLP
Interprets meaning of the text
Groups words into meaningful units
Search for different forms of words
We find that p42mapk phosphorylates c-Myb on serine and threonine .
Purified recombinant p42 MAPK was found to phosphorylate Wee1 .
− sentences
− morphology: different forms
− noun groups: match entities
− verb groups: match actions
From Words to Meaning
“Among them, nimesulide, a selective COX2 inhibitor, …”
− nimesulide inhibits COX2 (Entrez Gene ID: 5743)
− Identifying entities and relations
− Linguistics to establish relationships
Text Mining - as Normal as Data Mining?
CHALLENGE
How can we capture information from free text as conveniently as accessing a database?
One of the essential differences is the lack of normalization of terms and concepts in free text.
SOLUTION
NLP-based text mining provides the capability to look through unstructured text, normalizing:
• Keywords to concepts
• Numerical data
• Range search
• Gene mutations
• Content source
BENEFIT
A set of structured facts, relationships or assertions, from different data sources, that can be used for decision support.
Providing tabular or visual analytics to fill data warehouses and support better patient care.
Sources: Literature, Patents, Reports, Clinical Trials
Examples of Normalization
Content Source Normalization
I2E: A Fully Federated Text Mining Platform
Federated Architecture
− Content Server 1, Content Server 2, Content Server 3, Content Server 4
− Merge into a single set of results
Normalizing Data from Different Sources
Single query
Differently structured data sources on different servers
− Journal articles (PubMed Central) on local Enterprise Server
− MEDLINE on remote cloud server
Single set of results
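The "single query, single set of results" idea can be sketched as merging per-server hit counts for the same extracted assertion. This is an illustrative assumption about what merging could look like, not a description of I2E's federation logic. The `merge_results` helper and all of the data below are invented placeholders.

```python
from collections import Counter


def merge_results(*result_sets):
    """Sum hit counts per assertion across servers, most frequent first."""
    merged = Counter()
    for results in result_sets:
        merged.update(results)  # adds counts for keys seen on several servers
    return merged.most_common()


# Invented placeholder results: (subject, relation, object) -> hit count.
local_pmc = {
    ("drug A", "treat", "disease X"): 4,
    ("drug B", "treat", "disease X"): 1,
}
remote_medline = {
    ("drug A", "treat", "disease X"): 7,
    ("drug C", "treat", "disease Y"): 2,
}

merged = merge_results(local_pmc, remote_medline)
for assertion, hits in merged:
    print(hits, assertion)
```

Merging by assertion only works if both servers express entities in the same normalized form, which is the point of normalizing across content sources.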
Using EASL
EASL: Extraction And Search Language
Representing a Query in EASL
EASL Example
query:
  document:
    - phrase:
        - class: {snid: nci.C1909, pt: Pharmacologic Substance}
        - treat
        - class: {snid: nlm.C04.588.180, pt: Breast Neoplasms}
output:
  outputSettings: {documentsPerAssertion: -1, hitsPerDocPerAssertion: 10,
    outputOrdering: frequency, resultType: standard}
Benefits of EASL
Automation
− Richer language for WSAPI applications
− Can build a completely new query vs. adapting smart query parameters
− Allows on-the-fly query production
Re-use
− Save, share and compare components of queries, e.g.
  − Save out Alternatives
  − Load complex expressions in smart query parameters
Audit
− Human readable language for documenting the text mining strategy
− Using open mark-up language (YAML)
Conversion
− Enable scripts to convert from other query languages e.g. advanced search
Different interfaces
− Enables 3rd party applications to create I2E queries
− Developers can produce innovative specialized interfaces e.g. advanced search plus terminologies
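As a rough illustration of on-the-fly query production, the sketch below builds a query structure in Python and serializes it. The nesting of `query`, `document`, `phrase` and `output` is an assumption inferred from the EASL example slide, whose indentation was lost. `build_easl_query` is a hypothetical helper, not part of any Linguamatics API. The output is JSON, which YAML 1.2 parsers generally accept; whether I2E accepts this form has not been verified.

```python
import json


def build_easl_query(subject_class, verb, object_class, hits_per_doc=10):
    """Assemble an EASL-like structure (assumed layout, for illustration)."""
    return {
        "query": {
            "document": [
                {"phrase": [
                    {"class": subject_class},
                    verb,
                    {"class": object_class},
                ]}
            ]
        },
        "output": {
            "outputSettings": {
                "documentsPerAssertion": -1,
                "hitsPerDocPerAssertion": hits_per_doc,
                "outputOrdering": "frequency",
                "resultType": "standard",
            }
        },
    }


# Class identifiers copied from the EASL example slide.
drug = {"snid": "nci.C1909", "pt": "Pharmacologic Substance"}
disease = {"snid": "nlm.C04.588.180", "pt": "Breast Neoplasms"}

query = build_easl_query(drug, "treat", disease)
print(json.dumps(query, indent=2))
```

Because the query is plain data, a script can swap in a different class or verb and generate a new query without hand-editing text.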
EASL: Enhancing the Value of Federated Search
Federated Architecture
− Content Server 1, Content Server 2, Content Server 3, Content Server 4
− Merge into a single set of results
translate2easl
− Espacenet query → espacenet2easl → EASL (keywords + index terms)
− PubMed query → pubmed2easl → EASL (keywords + index terms)
− Refine query → EASL (terminologies, linguistics …)
− Content sources: Clinical Trials, OMIM, FDA Drug Labels, Patents, NIH Grants, MEDLINE
Range Search and Normalization
What Do We Want to Find?
Patients
− below 60 years old
− weight ≥ 80kg
− not having chemotherapy after 2010
− with a mutation C677T
Challenge: Variety Within the Text
Below 60 years old
− aged 58
− 35 years old
− 42-year-old
− 39 y/o
Weight ≥ 80kg
− 267 pounds
− 280 lbs
− 80.4kg
− 82 kilograms
After 2010
− January 21, 2011
− October of 2012
− 08/21/11
− 2012-05-04
Mutation C677T
− C677T
− 677C>T
− 677C/T
− 677C->T
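Taking the weight criterion above as an example, unit normalization can be sketched as converting every mention to kilograms before comparing against the threshold. This is a minimal regex-based illustration, not the I2E implementation. `weight_in_kg` is a hypothetical helper, and the "150 lbs" mention is an invented counter-example added to show a value below the cut-off.

```python
import re

LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound

# Unit spellings from the slide, mapped to a conversion factor into kg.
UNIT_TO_KG = {
    "kg": 1.0, "kilogram": 1.0, "kilograms": 1.0,
    "lb": LB_TO_KG, "lbs": LB_TO_KG, "pound": LB_TO_KG, "pounds": LB_TO_KG,
}

WEIGHT_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kilograms?|kg|pounds?|lbs?)\b", re.IGNORECASE
)


def weight_in_kg(text):
    """Return the first weight mention in the text as kilograms, or None."""
    match = WEIGHT_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2).lower()
    return value * UNIT_TO_KG[unit]


mentions = ["267 pounds", "280 lbs", "80.4kg", "82 kilograms", "150 lbs"]

at_least_80kg = []
for mention in mentions:
    kilograms = weight_in_kg(mention)
    if kilograms is not None and kilograms >= 80:
        at_least_80kg.append(mention)

print(at_least_80kg)
```

Once every mention is in the same unit, "weight ≥ 80kg" becomes an ordinary numeric comparison, as it would be in a database column.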
Normalizing Gene Mutations
Different types of mutation description,
including:
− positional e.g. +869(T>C)
− rsID e.g. rs100
Transform different syntax e.g.
− 1166A/C -> A1166C
− Asn to Ser substitution at codon 127 -> N127S
− +1196C/T -> C1196T
− g.655C/A>G -> C655G, A655G
− M567V/A -> M567V, M567A
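Some of the syntax transformations listed above can be sketched with a few regular expressions. This is a simplified illustration covering only part of the slide's examples, and it is not the I2E mutation ontology. `normalize_mutation` and the three pattern names are hypothetical. Forms such as rsIDs and `g.655C/A>G` are deliberately left unhandled and return an empty list.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# e.g. 677C>T, 677C/T, 677C->T, +1196C/T
POSITIONAL = re.compile(r"^\+?(\d+)([ACGT])(?:->|>|/)([ACGT])$")
# e.g. C677T, or M567V/A with alternative substitutions
COMPACT = re.compile(r"^([A-Z])(\d+)([A-Z](?:/[A-Z])*)$")
# e.g. "Asn to Ser substitution at codon 127"
WORDED = re.compile(r"^([A-Za-z]{3}) to ([A-Za-z]{3}) substitution at codon (\d+)$")


def normalize_mutation(text):
    """Return a list of normalized forms such as ['C677T']; [] if unknown."""
    text = text.strip()
    match = POSITIONAL.match(text)
    if match:
        position, ref, alt = match.groups()
        return [f"{ref}{position}{alt}"]
    match = COMPACT.match(text)
    if match:
        ref, position, alts = match.groups()
        return [f"{ref}{position}{alt}" for alt in alts.split("/")]
    match = WORDED.match(text)
    if match:
        ref = THREE_TO_ONE.get(match.group(1).capitalize())
        alt = THREE_TO_ONE.get(match.group(2).capitalize())
        if ref and alt:
            return [f"{ref}{match.group(3)}{alt}"]
    return []


for raw in ["677C>T", "677C/T", "677C->T", "1166A/C", "M567V/A",
            "Asn to Ser substitution at codon 127"]:
    print(raw, "->", normalize_mutation(raw))
```

With all variants mapped to one representation, a search for C677T can match any of the spellings found in the text.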
Mutation Normalization Examples
Range Search
Allows search for values within a range
− in fixed fields e.g. publication date
− within free text e.g. dosages
Can directly ask for e.g.
− patients with diabetes under 60 with BMI under 30
Can find intervals within the text and match them when searching for a number or an overlapping range
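Matching an interval found in the text against a queried number or range comes down to an overlap test on closed intervals. The sketch below shows that test on its own. The `Interval` type, the helper names and the example dose values are assumptions for illustration and do not come from the slides.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    low: float
    high: float


def point(value):
    """Treat a single number as a zero-width interval."""
    return Interval(value, value)


def overlaps(a, b):
    """True if two closed intervals share at least one value."""
    return a.low <= b.high and b.low <= a.high


# Invented example: the text states a dose range of 50 to 60 (units omitted).
in_text = Interval(50, 60)

print(overlaps(in_text, Interval(55, 70)))  # query range overlapping the text
print(overlaps(in_text, point(52)))         # query for a single number
print(overlaps(in_text, Interval(61, 70)))  # query range outside the text
```

Treating a single number as a zero-width interval lets one comparison serve both "search for a number" and "search for an overlapping range".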
Range Search with Normalization
Range Search (Age, Date)
− Patients aged < 60yrs
− Date before 2010
Normalizing:
− Report Date, Age, Weight & BMI
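Date normalization for a range search can be sketched by trying a list of known formats and comparing the parsed dates. This is a minimal illustration using the four date spellings from the earlier "Variety Within the Text" slide. It assumes English month names (the default C locale) and US month-first ordering for `08/21/11`. `parse_date` is a hypothetical helper, and `2009-03-15` is an invented counter-example.

```python
from datetime import date, datetime

# Candidate formats, tried in order. "%B of %Y" carries no day, so strptime
# fills in the first of the month.
FORMATS = ["%B %d, %Y", "%B of %Y", "%m/%d/%y", "%Y-%m-%d"]


def parse_date(text):
    """Return a date for the first matching format, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


mentions = ["January 21, 2011", "October of 2012", "08/21/11",
            "2012-05-04", "2009-03-15"]

after_2010 = []
for mention in mentions:
    parsed = parse_date(mention)
    if parsed is not None and parsed > date(2010, 12, 31):
        after_2010.append(mention)

print(after_2010)
```

Ambiguous forms such as `08/21/11` need an explicit convention, so a real system would have to decide between month-first and day-first per source.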
Normalization Benefits
Ability to compare measurements with different units e.g. kg vs. lbs
Ability to perform range search for numerics, measurements, dates
Standardized representations to link to structured data e.g. mutation databases
Better clustering of results e.g. drug lab codes
Real World Example: Mutation Normalization
Mucopolysaccharidosis II: Hunter Syndrome
Rare X-linked recessive disorder
Deficiency of the lysosomal enzyme iduronate-2-sulfatase
Leads to progressive accumulation of glycosaminoglycans throughout the body
Signs & symptoms:
− Bone deformities with joint stiffness; frequent respiratory infections; cardiomyopathy; hepatosplenomegaly; neurocognitive impairment; reduced lifespan
− Some symptoms partially improved with enzyme replacement therapy
Spectrum of clinical severity (mild to severe); main difference is progressive development of neurodegeneration in the severe form
TEXT ANALYTICS FOR RARE DISEASES
GENOTYPE-PHENOTYPE ASSOCIATION IN HUNTER SYNDROME
CHALLENGE
• Scarcity of knowledge of natural history of disease
• Sparse data, needs high recall across full text papers
• Mutation patterns very variable
• Structured databases lack broad phenotypic association data
SOLUTION
• Developed workflow with Linguamatics I2E
• Abstracts identified in MEDLINE using broad vocabularies
• Full text PDFs processed for text analytics
• I2E mutation ontology and bespoke severity vocabularies enabled extraction of genotype-phenotype associations
BENEFIT
• Extraction of patient mutations matched or bettered genetic databases
• Increased understanding of IDS mutational spectrum for provider diagnostics and patient awareness
• Enabled rational approach to immune response classification
Shire Use Case
In Summary
Better Normalization of
− Numbers, dates, drug codes, TNM cancer stage
− Subsequent range search
− Gene mutations
In combination with a human readable open query language, EASL
− Maximises the ease and flexibility of asking complex questions simultaneously across different content sources
Ultimately agile NLP text mining provides
− High quality, structured, clustered & normalized results in the format you need
− Improved speed to insight for faster decision making
