Using text mining to inform genetic variant interpretation

Karin Verspoor
Department of Computing and Information Systems
karin.verspoor@unimelb.edu.au

So you’re a medical doctor …
• With a very sick patient
• You can’t work out what’s going on
• You suspect a rare disease
• You order a DNA analysis
(whole exome or genome)
• And find a genetic mutation
What does it mean?

Clinical interpretation of variants
[Figure: Sample Data Flow, from patient sample to patient clinical report. Lanes: Wet Lab, Bioinformatics, Clinical Informatics. Boxes: Histology, DNA Extract, PCR, Sequencing, Alignment, Variant Calling, Annotation, DB Load, Variant Normalisation, Filtering, Known Variant? (Yes/No), Curation, Publish Report, Report Editing and Signoff, Document Assembly. Data stores: Peter Mac Mutation DB, External and Locus Specific DBs. Each step is marked as manual or automatic.]
Image courtesy Kenneth Doig, Peter Mac. “PipeCleaner for your NGS Pipeline” HISA Big Data 2013.

What’s a mutation?
• Genomic variation: alteration in a sequence
– hereditary (germ-line) mutations
– acquired (somatic) mutations
• Examples of variation
– SNP (single nucleotide polymorphism)
– Protein mutation
– insertions, deletions, duplications, inversions, . . .
• Types of variations
– DNA variations that have no adverse effects on our cells and
occur frequently in the population are called polymorphisms
– DNA variations that do affect the function of the protein
made from a gene and occur less often are called mutations

The Challenge: Interpreting variants
§ Identifying variation is becoming easier,
interpreting it remains difficult
• Which changes are due to normal individual variation?
• Which are associated with a phenotype of interest?

Interpreting variation through context
• Analysis of functional significance of variants
– Predicted impact of mutations
– Conservation analysis
– Allele frequencies from large genomic databases
• Existing knowledge captured in structured sources
– UniProt site-specific protein annotations
– The Cancer Genome Atlas genomic characterisation data
– Disease-specific variant databases, e.g. COSMIC and
InSiGHT
• Techniques for annotating variants
– Data aggregation from multiple sources
– Data integration and inference to reveal shared pathways

Exponential knowledge growth
• ~1550 peer-reviewed
gene-related databases in
NAR online Mol Bio
collection
• Over 25 million PubMed
entries (> 2,000/day)
• Breakdown of disciplinary
boundaries makes more of
it relevant to each of us

Why biomedical text mining?
[Figure: Exponential growth in the size of PubMed. Publications per year (0 to 1,200,000) by year, 1914 to 2014.]

Structured resources are not enough:
Literature is the primary repository of knowledge
[Figure: number of Swiss-Prot proteins (0 to 120,000) over time, 1/02 to 1/07. Series: proteins missing a FUNCTION comment; proteins gaining a FUNCTION comment.]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al ISMB 2007
“Our entire understanding of biology and medicine
is really contained in the published literature. And
since people write in natural language, if you can’t
get computers to turn that information into
databases and computable information, you’re
falling behind.”
-- Russ Altman, MD PhD, Stanford University

Recovery of variants from the
literature using text mining
Study:
Jimeno Yepes A, Verspoor K. (2014) Literature mining of genetic variants for curation: Quantifying
the importance of supplementary material. Database: The Journal of Biological Databases and
Curation, bau003. doi:10.1093/database/bau003 [PMID:24520105]

Study: Recall of curated variants through the
application of text mining
• Given a curated resource of genetic variants,
• with explicit links to the source literature for
each variant,
• and a mutation extraction tool with
demonstrated good performance on intrinsic
evaluation
… how many variants can text mining recover?

InSiGHT
Gene:
Variant:
p.Lys286Gln
Lit. Reference:
Takahashi et al 2007

Motivations
• Assess real-world applicability of text mining
tools for supporting analysis of genetic
variants
• Speed up curation of mutation databases

Two databases
• InSiGHT, Human Variome Project
– MLH1, MSH2, MSH6 and PMS2 linked to
Lynch syndrome (germline mutations)
• COSMIC, Sanger Institute
– Somatic mutations linked to cancer
Database  PMIDs associated with mutations  Total mutation count  Average mutations per article  Std dev
InSiGHT   809                              7022                  8.68                           18.55
COSMIC    7898                             198864                25.18                          521.18

Literature mutation extraction
• Many tools exist to perform mutation annotation
– MutationMiner, MutationFinder, EMU, tmVar, SETH,
...
• Research shows that they have high precision
and recall on MEDLINE abstracts (> 90% F1)
• There are also tools to do named entity
extraction of genes, diseases, body parts …
Jimeno Yepes A, Verspoor K. (2014) Mutation extraction tools can be combined for robust
recognition of genetic variants in the literature. F1000Research 2014, 3:18. doi:
10.12688/f1000research.3-18.v2 [PMID:25285203]

How to extract mutations from text?
• Essentially a named entity recognition task.
• Early attempts focused on SNPs and protein mutations (amino
acid residues).
• e.g., MutationFinder¹ patterns (simplified):
(?P<wt_res>AminoAcid)(?P<pos>[1-9][0-9]*)(?P<mut_res>AminoAcid)
Gly17Ser
Ser97Pro
• where AminoAcid is:
(CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|
TRP|VAL|GLU|TYR)|(GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|
ISOLEUCINE|LYSINE|ALANINE|GLYCINE|ASPARTATE|METHIONINE|
THREONINE|HISTIDINE|ASPARTIC ACID|ARGININE|ASPARAGINE|
TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|GLUTAMATE|
TYROSINE)
¹ http://mutationfinder.sourceforge.net/
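The simplified pattern above can be tried out in a few lines of Python. This is a minimal sketch only, not MutationFinder's actual rule set: it covers the three-letter residue codes and leaves out the full residue names, and the case-insensitive flag and the function name `find_point_mutations` are assumptions made here so that mixed-case mentions such as Gly17Ser match.

```python
import re

# Three-letter amino acid codes, copied from the simplified pattern above.
THREE_LETTER = (
    "CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|"
    "TRP|VAL|GLU|TYR"
)

# wild-type residue, 1-based position, mutant residue.
# re.IGNORECASE is an assumption of this sketch.
POINT_MUTATION = re.compile(
    rf"\b(?P<wt_res>{THREE_LETTER})(?P<pos>[1-9][0-9]*)(?P<mut_res>{THREE_LETTER})\b",
    re.IGNORECASE,
)


def find_point_mutations(text):
    """Return (wild-type, position, mutant) tuples found in text."""
    return [
        (
            m.group("wt_res").capitalize(),
            int(m.group("pos")),
            m.group("mut_res").capitalize(),
        )
        for m in POINT_MUTATION.finditer(text)
    ]


print(find_point_mutations("We observed Gly17Ser and Ser97Pro in the cohort."))
```

A pattern like this only catches compact mentions; mutations described in running prose need additional patterns.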

Human Genome Variation Society
nomenclature (excerpt)

Extraction of mutations from text
• Pattern-based approach to identifying genetic variants
– dbSNP identifiers and standard HGVS nomenclature
(e.g. SETH https://rockt.github.io/SETH)
– natural language expressions of mutations
o This missense mutation converts a highly conserved glycine
(Gly17 of neurophysin) to a valine residue.
o Killer of prune (Kpn) is a mutation in the awd gene which
substitutes Ser for Pro at position 97 and causes dominant
lethality in individuals that do not have a functional prune gene.
o … where cysteines at positions 6, 42, 48, 90 and 393 were
replaced by serine.
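To illustrate how one natural language expression can be handled, here is a minimal Python sketch for the "residues at positions … were replaced by residue" phrasing in the third example. It is a single hand-written pattern, not the rule set of any of the tools named in these slides; the tiny residue dictionary, the pattern and the function name `expand_replacements` are all assumptions for illustration.

```python
import re

# Small illustrative dictionary; a real system would cover all 20 residues.
NAME_TO_CODE = {"cysteine": "Cys", "serine": "Ser", "glycine": "Gly", "valine": "Val"}
NAMES = "|".join(NAME_TO_CODE)

# "<residue>(s) at position(s) 6, 42 and 393 was/were replaced by <residue>"
REPLACED_BY = re.compile(
    rf"\b(?P<wt>{NAMES})s? at positions? "
    r"(?P<positions>\d+(?:(?:,| and|, and) \d+)*) "
    rf"(?:was|were) replaced by (?P<mut>{NAMES})\b",
    re.IGNORECASE,
)


def expand_replacements(text):
    """Expand one prose mention into one compact mutation per position."""
    mutations = []
    for m in REPLACED_BY.finditer(text):
        wt = NAME_TO_CODE[m.group("wt").lower()]
        mut = NAME_TO_CODE[m.group("mut").lower()]
        for pos in re.findall(r"\d+", m.group("positions")):
            mutations.append(f"{wt}{pos}{mut}")
    return mutations


sentence = "where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine."
print(expand_replacements(sentence))
```

Each phrasing needs its own pattern, which is one reason prose mentions are harder to recover than mentions in standard nomenclature.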

Extractor of Mutations (Kann Lab)

Study text sources
• PubMed
– 22M citations; title and abstract
• PubMed Central
– full text
– 512k available from PMC-Open Access
• Publisher site crawling
– Availability depends on license
– HTML pages can be noisy
• C676T –>Arg226Stop vs. C676TâArg226Stop

Extraction with EMU over our data
• EMU: extracts mutations from text and
links them to co-occurring genes
• Normalize all mutation mentions to
HGVS format
– Format used in COSMIC and InSiGHT
• Match {gene, HGVS variant, PMID}
to curated data
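The matching step can be sketched as set intersection over {gene, HGVS variant, PMID} triples. This is a minimal illustration of how recall and the gene-agnostic variant ("NG") could be computed, assuming exact string matching after normalisation; the function name `recall` and the placeholder records are hypothetical and are not data from the study.

```python
def recall(curated, extracted, ignore_gene=False):
    """Fraction of curated records that also appear among the extracted ones.

    Each record is a (gene, hgvs_variant, pmid) triple. With ignore_gene=True
    the gene is dropped before matching (the "NG" setting).
    """
    def key(record):
        gene, variant, pmid = record
        return (variant, pmid) if ignore_gene else (gene, variant, pmid)

    curated_keys = {key(r) for r in curated}
    extracted_keys = {key(r) for r in extracted}
    if not curated_keys:
        return 0.0
    return len(curated_keys & extracted_keys) / len(curated_keys)


# Hypothetical records for illustration only.
curated = {("MLH1", "p.Lys618Ala", "PMID-A"), ("MSH2", "p.Gly322Asp", "PMID-B")}
extracted = {("MLH1", "p.Lys618Ala", "PMID-A"), ("MSH6", "p.Gly322Asp", "PMID-B")}

print(recall(curated, extracted))                    # gene must match
print(recall(curated, extracted, ignore_gene=True))  # gene ignored
```

Ignoring the gene separates errors in gene linking from errors in recognising the variant itself.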

Results
Abstracts and Full Text
NG = No Gene (ignoring gene in match)
Common/Cmn = PMIDs in common between database and corpus subset
(recall with respect to articles for which mutation entity recogniser had at
least one positive extraction)
Set               Common articles  Matched mutations  Recall  Recall NG  Mutations common  Recall common  Recall common NG
COSMIC Abs        2200             1884               0.0095  0.0122     12,940            0.1456         0.1875
COSMIC FT         2071             3656               0.0184  0.0215     104,756           0.0349         0.0408
COSMIC Abs + FT   3738             4754               0.0239  0.0289     114,279           0.0416         0.0503
InSiGHT Abs       195              230                0.0328  0.045      1233              0.1865         0.2562
InSiGHT FT        150              404                0.0575  0.0612     1626              0.2484         0.2644
InSiGHT Abs + FT  295              588                0.0837  0.0961     2657              0.2213         0.254

High Throughput vs non-High Throughput
Set            Common articles  Matched mutations  Recall  Recall NG  Recall common  Recall common NG
HT abstract    1650             1357               0.0072  0.0096     0.1209         0.1608
HT full text   1545             2719               0.0145  0.0172     0.027          0.0319
HT Abs + FT    2608             3501               0.0187  0.0231     0.032          0.0395
NHT abstract   550              530                0.0461  0.0543     0.3055         0.3597
NHT full text  526              937                0.0815  0.0915     0.235          0.2639
NHT Abs + FT   841              1259               0.109   0.1243     0.2538         0.2895
Group       PMIDs  Count    Average mutation  SD      Mutation recall
COSMIC      7898   198,864  25.18             521.27  100.00%
COSMIC-HT   6266   187,367  29.9              584.82  94.22%
COSMIC-NHT  1632   11,497   7.04              38.05   5.78%

Considering tables and Supplementary
material
• Subset from COSMIC and InSiGHT available as
PubMed Central Open Access articles
• Supplementary material: MS Word, PDF, MS Excel,
PPT, images, …
                    InSiGHT                        COSMIC
Set                 Articles  Matched  Recall (%)  Articles  Matched  Recall (%)
Abstracts           13        1        0.4         563       140      0.41
XML Full Text (FT)  9         20       7.94        487       694      2.05
PDF FT (PDFFT)      4         7        2.78        76        23       0.07
Tables              8         18       7.14        394       466      1.38
FT+PDFFT+Tables     13        44       17.46       563       929      2.75
Supp. Mat.          1         88       34.92       138       17015    50.59
All                 13        115      45.63       563       17896    52.92

Recall still only 50%:
Where are the rest?
• Expressed in semi-structured data sources
– do not necessarily follow standard nomenclature
– data spread unpredictably across columns (Wong et al.
2009)
• Different reference position in text than database
– curator correction or normalized to different build
• Nomenclature variation
– c.482_483delGA vs c.482_483del2
• Linguistic expression of mutations
– deletion of exon 3
– C>T mutation at nucleotide 2131
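The nomenclature variation case can be reduced by mapping equivalent spellings to one canonical string before matching. The sketch below handles only simple deletions and assumes, for illustration, that the canonical form drops both the deleted bases and the deleted length; the regular expression and the function name `canonical_deletion` are hypothetical and are not part of any tool named in these slides.

```python
import re

# Matches e.g. c.482_483delGA, c.482_483del2, c.76del; captures the part up to "del".
_DELETION = re.compile(r"^(?P<prefix>[cg]\.\d+(?:_\d+)?del)(?:[ACGT]+|\d+)?$")


def canonical_deletion(variant):
    """Map equivalent deletion spellings to one string; leave other variants unchanged."""
    m = _DELETION.match(variant.strip())
    return m.group("prefix") if m else variant


for v in ("c.482_483delGA", "c.482_483del2", "c.2131C>T"):
    print(v, "->", canonical_deletion(v))
```

Normalising both the extracted mention and the curated record with the same function lets the two deletion spellings match. Differences in reference position or genome build cannot be resolved by string rewriting alone.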

Information in tables (spreadsheets, etc.)
is expressed differently than in narrative text

Gene listed in column heading
Non-standard nomenclature
“Del exon 7”

Text mining over semi-structured data?
• Access ?
• Variability (!)
– File formats
– How connected to the main text?
• Semantics (?!)
– How to make sense of the data?
– How to map to standardized nomenclature?
… processing supplementary material will require new
strategies. Some technical solutions. Some research.

Extraction of gene-disease-mutation relations
Study:
Verspoor KM, Heo GH, Kang KY, Song M. (in press) Extraction of fine-grained semantic relations for
the Human Variome. BMC Medical Informatics and Decision Making.

Variant interpretation using literature
• Evidence of prior significance of variants
• Evidence of established connection of the variant
to specific patient cohorts
• Use alone or in combination with other evidence
• We aim to extract the relations that connect genes,
diseases and mutations
• Specific Objective of the work:
relation extraction over the
Variome Corpus

gene-mutation-disease-phenotype relations
• Variome Annotation Schema
– a schema defining entities and relations of interest
to curation of genetic variants
• Variome Corpus
– A corpus of full text articles annotated according to
the Variome Annotation Schema
– To be used as training and evaluation data for text
mining tools for extracting genetic variation
information from the published literature
http://www.opennicta.com.au/home/health/variome

The Variome Corpus
10 full-text publications related to colorectal cancer
Entities                   Relations
Gene                       Gene-has-Mutation
Mutation                   Cohort/Patient-has-Mutation
Disease                    Mutation-relatedto-Disease
Body part                  Disease-relatedto-Gene
Cohort/Patient             Disease-relatedto-BodyPart
Size                       Mutation-has-Size
Age                        Cohort/Patient-has-Age
Gender                     Cohort/Patient-has-Gender
Ethnicity or Geo Location  Cohort/Patient-has-EthnicityLoc
Characteristic             Cohort/Patient-has-Disease
                           Cohort-has-size
Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP.
(2013) Annotating the Biomedical Literature for the Human Variome. Database: The Journal of
Biological Databases and Curation, bat019.
§ 43k words
§ Double-annotated
§ IAA varies
§ .88-.92 F for entities
§ Relations much lower; reconciled manually

The Variome Corpus annotation

Extraction of mutation relations from text
• Recognise genetic variants
• Named entity recognition for gene names
– Supervised learning for recognizing characteristics and contexts
– Combined with dictionaries to support normalisation
• Associating variants to genes
– Simple co-occurrence
– Combined with sequence verification
– Machine learning for relation classification (PKDE4J)
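The simplest of these strategies, co-occurrence, can be sketched in a few lines: any gene and variant mentioned in the same sentence are paired. This is only a minimal illustration; the naive sentence splitter, the small gene dictionary (the four genes named earlier for InSiGHT), the variant pattern and the function name `cooccurring_pairs` are assumptions, and the example sentence is invented.

```python
import re
from itertools import product

# Small illustrative gene dictionary.
GENES = {"MLH1", "MSH2", "MSH6", "PMS2"}

# Very rough pattern for HGVS-style protein or coding variants (p.… / c.…).
VARIANT = re.compile(r"\b[pc]\.[A-Za-z0-9_>*]+")


def cooccurring_pairs(text):
    """Pair every gene with every variant mentioned in the same sentence."""
    pairs = set()
    # Naive split: sentence-final punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = {tok for tok in re.findall(r"[A-Za-z0-9]+", sentence) if tok in GENES}
        variants = set(VARIANT.findall(sentence))
        pairs.update(product(genes, variants))
    return pairs


text = "The MLH1 variant p.Lys618Ala was found in 17 patients. MSH2 was wild type."
print(cooccurring_pairs(text))
```

Co-occurrence over-generates when a sentence mentions several genes, which is where sequence verification or a trained relation classifier can help.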

Information Extraction, Structuring text
From:
A subset of colorectal tumour DNA samples from 17
patients carrying the p.Lys618Ala variant …
To:
T60 body-part 1307 1317 colorectal
T7 disease 1318 1324 tumour
R17_m relatedTo Arg1:T60 Arg2:T7
(colorectal relatedTo tumour)
T61_merge size 1342 1344 17
T24 cohort-patient 1345 1353 patients
R46_2 has Arg1:T24 Arg2:T7
(patients has tumour)
T62 mutation 1367 1378 p.Lys618Ala
R18_m has Arg1:T24 Arg2:T61_merge
(patients has 17) = (patient group size 17)
R19_m has Arg1:T24 Arg2:T62
(patients has p.Lys618Ala)
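Standoff annotations like those above are straightforward to load into data structures. The following is a minimal sketch that assumes whitespace-separated fields exactly as printed on this slide (identifier, type, start offset, end offset, text for entities; identifier, type and two `ArgN:ID` fields for relations); the function name `parse_standoff` is hypothetical.

```python
def parse_standoff(lines):
    """Parse entity (T…) and relation (R…) lines into simple Python structures."""
    entities, relations = {}, []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("T"):
            ident, etype, start, end = parts[:4]
            entities[ident] = {
                "type": etype,
                "start": int(start),
                "end": int(end),
                "text": " ".join(parts[4:]),
            }
        elif parts[0].startswith("R"):
            rtype = parts[1]
            arg1 = parts[2].split(":", 1)[1]
            arg2 = parts[3].split(":", 1)[1]
            relations.append((rtype, arg1, arg2))
    return entities, relations


lines = [
    "T24 cohort-patient 1345 1353 patients",
    "T62 mutation 1367 1378 p.Lys618Ala",
    "R19_m has Arg1:T24 Arg2:T62",
]
entities, relations = parse_standoff(lines)
for rtype, a1, a2 in relations:
    print(entities[a1]["text"], rtype, entities[a2]["text"])
```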

PKDE4J: Yonsei University IE system
• PKDE4J
– Extensible, flexible text mining system for public knowledge
discovery
– Entity and relation extraction from unstructured text
– Extension of Stanford CoreNLP (Manning et al., 2014)
– http://informatics.yonsei.ac.kr/pkde4j
• Differentiation of PKDE4J
– Configurable system
• Dictionary based entity extraction
• Extensible system
• Supports a wide range of relation extraction tasks through an
extensible rule engine based on dependency parsing
– Accurate performance
• PKDE4J outperforms many other competing algorithms for
both entity and relation extraction

PKDE4J: Yonsei University IE system
• PKDE4J’s two major pipelines
– Entity Extraction: identifies target entities using
dictionaries, extending Stanford CoreNLP
– Relation Extraction: extracts relationships among entities
using dependency-tree-based rules

PKDE4J – Named Entity Recognition
• Extension of Stanford CoreNLP
• Three major submodules
Pre-Processing, Dictionary loading, Entity annotation
• Flexible configuration (number and format of dictionaries)
• Trie data structure
• Abbreviation resolution
• Tokenization: Stanford PTBTokenizer
• Sentence splitting, POS tagging, Lemmatization: Stanford CoreNLP
• String normalization: Special characters processing
• N-gram matching: Apache Lucene ShingleWrapper
• Approximate string matching: Soft-TFIDF
• Regex NER (Rule-based): Stanford CoreNLP
• Candidate entities filtering: POS filtering, Stopword removal
• Labeling: B/I/O format, Entity type
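Dictionary lookup with a trie and B/I/O labeling can be illustrated with a small Python sketch. This is not PKDE4J code (PKDE4J extends Stanford CoreNLP); it is a simplified, exact-match illustration with a hypothetical two-entry dictionary, and the names `build_trie` and `tag` are assumptions.

```python
_LABEL = object()  # sentinel key marking the end of a dictionary entry


def build_trie(entries):
    """Build a token-level trie from {phrase: entity_type}."""
    root = {}
    for phrase, label in entries.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[_LABEL] = label
    return root


def tag(tokens, trie):
    """Label tokens in B/I/O format using the longest dictionary match."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        node, j, best_end, best_label = trie, i, None, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if _LABEL in node:
                best_end, best_label = j, node[_LABEL]
        if best_end is None:
            i += 1
            continue
        tags[i] = f"B-{best_label}"
        for k in range(i + 1, best_end):
            tags[k] = f"I-{best_label}"
        i = best_end
    return tags


trie = build_trie({"colorectal cancer": "disease", "MLH1": "gene"})
tokens = "Mutations in MLH1 cause colorectal cancer".split()
print(list(zip(tokens, tag(tokens, trie))))
```

A trie keeps lookup cost proportional to the length of the match rather than the size of the dictionary, which matters when several large dictionaries are loaded.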

PKDE4J – Relation Extraction
• Based on rules over the dependency parse (grammatical structure)
• To extract a relation
Step 1: Identify the verbs in a sentence
Category  Number of verbs  Type        Verb examples
Positive  68               Increase    Lead, Contribute, Rise
                           Transmit    Shift, Move, Migrate
                           Substitute  Supplement, Alter
Negative  54               Decrease    Decline, Diffuse, Down-regulate
                           Remove      Deplete, Abrogate, Disassociate
Neutral   111              Contain     Possess, Constitute, Include
                           Modify      Methylate, Modulate, Normalize
                           Method      Bleach, Centrifuge, Spin
                           Report      Evaluate, Analyze, Examine
Plain     165              Plain       Return, Switch, Balance
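Step 1 amounts to a lexicon lookup over the verbs of a sentence. The sketch below uses only a handful of the example verbs from the table above and assumes the sentence has already been lemmatised and POS-tagged with Penn Treebank tags; the lexicon subset, the input format and the function name `classify_verbs` are assumptions for illustration, not PKDE4J's implementation.

```python
# A few example verbs from the table above: lemma -> (category, type).
VERB_LEXICON = {
    "lead": ("Positive", "Increase"),
    "alter": ("Positive", "Substitute"),
    "decline": ("Negative", "Decrease"),
    "abrogate": ("Negative", "Remove"),
    "include": ("Neutral", "Contain"),
    "methylate": ("Neutral", "Modify"),
    "examine": ("Neutral", "Report"),
}


def classify_verbs(tagged_lemmas):
    """Return (lemma, category, type) for each known verb.

    tagged_lemmas is a list of (lemma, pos_tag) pairs; verbs are assumed to
    carry Penn Treebank tags starting with "VB".
    """
    return [
        (lemma, *VERB_LEXICON[lemma])
        for lemma, pos in tagged_lemmas
        if pos.startswith("VB") and lemma in VERB_LEXICON
    ]


sentence = [("the", "DT"), ("mutation", "NN"), ("abrogate", "VBZ"), ("binding", "NN")]
print(classify_verbs(sentence))
```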

PKDE4J - RE
Step 2: Check structure of sentence
• Syntactic rules based on deep parsing
– A dependency tree encodes grammatical relations between words
in a sentence.
– The tree denotes syntactic dependencies between two entities.
– Need to spot the portion of the parse tree that is useful and
pertinent to the location of entities in a sentence.
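The "useful portion" of the parse tree is commonly taken to be the path that connects the two entities. The following minimal sketch finds that path in a dependency tree represented as a word-to-head map; the tiny parse is hand-written for illustration (not parser output), words are assumed to be unique within the sentence, and the function names are hypothetical.

```python
def path_to_root(node, heads):
    """Follow head links from node up to the root (whose head is None)."""
    path = [node]
    while heads[node] is not None:
        node = heads[node]
        path.append(node)
    return path


def dependency_path(a, b, heads):
    """Path from a to b through their lowest common ancestor."""
    up_a = path_to_root(a, heads)
    up_b = path_to_root(b, heads)
    ancestors_a = set(up_a)
    for i, node in enumerate(up_b):
        if node in ancestors_a:
            # Climb from a to the common ancestor, then descend to b.
            return up_a[: up_a.index(node) + 1] + up_b[:i][::-1]
    return []  # not reached for a well-formed tree


# Hand-written parse of "patients carry the variant": word -> head word.
heads = {"patients": "carry", "carry": None, "the": "variant", "variant": "carry"}
print(dependency_path("patients", "variant", heads))
```

Whether a verb lies on this path, and how long the path is, are the kinds of cues that rules over the dependency tree can test.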

PKDE4J - RE
• Rule Extraction
– Use Strategy design pattern
– Capture predefined rules (17 strategies)
①Verb in dependency path
②No verb in dependency path
③Detect nominalization
④Weak nominalization
⑤Negation
⑥Tense (active / passive)
⑦Contain clause
⑧Clause distance
⑨Negation clause
⑩Number intervening entities
⑪Entities in between
⑫Surface distance
⑬Entity counts
⑭Same head
⑮Entity order
⑯Full tree path
⑰Path length
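The Strategy design pattern here means each rule is an interchangeable unit behind a common interface, so rules can be added or removed without touching the engine. The Python sketch below illustrates the idea with three of the listed strategies in deliberately simplified, surface-level form; it is not PKDE4J's Java implementation, and the `Candidate` class, the negation word list and all function names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    """A candidate relation: a tokenised sentence plus two entity positions."""
    tokens: List[str]
    e1: int  # token index of the first entity
    e2: int  # token index of the second entity


def surface_distance(c: Candidate) -> int:
    """Number of tokens between the two entities."""
    return abs(c.e1 - c.e2) - 1


def negation(c: Candidate) -> bool:
    """True if a negation word occurs between the entities (simplified)."""
    lo, hi = sorted((c.e1, c.e2))
    return any(t.lower() in {"not", "no", "never"} for t in c.tokens[lo + 1 : hi])


def entity_order(c: Candidate) -> str:
    return "e1_first" if c.e1 < c.e2 else "e2_first"


# Each strategy shares one signature, so the engine can treat them uniformly.
STRATEGIES: Dict[str, Callable[[Candidate], object]] = {
    "surface_distance": surface_distance,
    "negation": negation,
    "entity_order": entity_order,
}


def apply_strategies(c: Candidate, strategies=STRATEGIES) -> Dict[str, object]:
    return {name: strategy(c) for name, strategy in strategies.items()}


candidate = Candidate("MLH1 is not associated with cancer".split(), e1=0, e2=5)
print(apply_strategies(candidate))
```

Adding an eighteenth rule would mean registering one more function in the dictionary, which is the extensibility the pattern is chosen for.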

Evaluation:
PKDE4J over Variome Corpus
• Experimental set-up
– Data split
– Features?
– 10-fold cross-validation
• Focus on relations:
Used gold standard entities
• Baseline co-occurrence system

Results of the evaluation
Relation Extraction results for relations with at least 100
examples in the corpus.

Observations
• By applying text mining we can transform the
literature from an unstructured, difficult to use
resource, to a structured resource.
• We can build systems that can recognise core
biological entities in the published literature.
• With this, the information is more accessible
– Formalised and normalised in a database
– Directly query-able
• and can be used to facilitate more computation:
– Information retrieval in terms of entities
– Predictive modeling and hypothesis generation

Conclusions
• Variants are relatively easy to recognise in the
literature, when the recommended
nomenclature is followed (so please use it!).
• The relations between variants and other
entities are harder to extract, but still we can do
a reasonable job.
• There is a lot of information in ancillary
files associated with the literature (with some
challenges for automated systems).
The literature can be effectively mined to identify
variant-related information to assist biocuration
and clinical interpretation of variants.
