Using text mining to inform
genetic variant interpretation
Karin Verspoor
Department of Computing and Information Systems
karin.verspoor@unimelb.edu.au
So you’re a medical doctor …
• With a very sick patient
• You can’t work out what’s going on
• You suspect a rare disease
• You order a DNA analysis
(whole exome or genome)
• And find a genetic mutation
What does it mean?
Clinical interpretation of variants
Sample Data Flow
[Figure: sample data flow from patient sample to patient clinical report, across three lanes (Wet Lab, Bioinformatics, Clinical Informatics), with each step marked as a manual or an automatic step.]
Steps shown: Patient Sample, Histology, DNA Extract, PCR, Sequencing, Alignment, Variant Calling, Annotation, DB Load, Filtering, Known Variant? (Yes/No), Curation, Variant Normalisation, Publish Report, Document Assembly, Report Editing and Signoff, Patient Clinical Report
Data sources shown: Peter Mac Mutation DB; External and Locus Specific DBs
Image courtesy Kenneth Doig, Peter Mac. “PipeCleaner for your NGS Pipeline” HISA Big Data 2013.
What’s a mutation?
• Genomic variation: alteration in a sequence
– hereditary (germ-line) mutations
– acquired (somatic) mutations
• Examples of variation
– SNP (single nucleotide polymorphism)
– Protein mutation
– insertions, deletions, duplications, inversions, . . .
• Types of variations
– DNA variations that have no adverse effects on our cells and
occur frequently in the population are called polymorphisms
– DNA variations that do affect the function of the protein
made from a gene and occur less often are called mutations
The Challenge: Interpreting variants
§ Identifying variation is becoming easier,
interpreting it remains difficult
• Which changes are due to normal individual variation?
• Which are associated with a phenotype of interest?
Interpreting variation through context
• Analysis of functional significance of variants
– Predicted impact of mutations
– Conservation analysis
– Allele frequencies from large genomic databases
• Existing knowledge captured in structured sources
– UniProt site-specific protein annotations
– The Cancer Genome Atlas genomic characterisation data
– Disease-specific variant databases, e.g. COSMIC and
InSiGHT
• Techniques for annotating variants
– Data aggregation from multiple sources
– Data integration and inference to reveal shared pathways
Exponential knowledge growth
• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection
• Over 25 million PubMed entries (> 2,000/day)
• Breakdown of disciplinary boundaries makes more of it relevant to each of us
Why biomedical text mining?
[Figure: Exponential growth in size of PubMed. Publications per year (0 to 1,200,000) plotted against year (1914 to 2014).]
Structured resources are not enough:
Literature is the primary repository of knowledge
[Figure: number of Swiss-Prot proteins (0 to 120,000) from 1/02 to 1/07. Series: proteins missing a FUNCTION comment; proteins gaining a FUNCTION comment.]
“Manual curation is not sufficient for annotation of genomic databases”
Baumgartner et al ISMB 2007
“Our entire understanding of biology and medicine
is really contained in the published literature. And
since people write in natural language, if you can’t
get computers to turn that information into
databases and computable information, you’re
falling behind.”
-- Russ Altman, MD PhD, Stanford University
Recovery of variants from the
literature using text mining
Study:
Jimeno Yepes A, Verspoor K. (2014) Literature mining of genetic variants for curation: Quantifying
the importance of supplementary material. Database: The Journal of Biological Databases and
Curation, bau003. doi:10.1093/database/bau003 [PMID:24520105]
Study: Recall of curated variants through the
application of text mining
• Given a curated resource of genetic variants,
• with explicit links to the source literature for
each variant,
• and a mutation extraction tool with
demonstrated good performance on intrinsic
evaluation
… how many variants can text mining recover?
InSiGHT
Gene:
Variant:
p.Lys286Gln
Lit. Reference:
Takahashi et al 2007
Motivations
• Assess real-world applicability of text mining
tools for supporting analysis of genetic
variants
• Speed up curation of mutation databases
Two databases
• InSiGHT, Human Variome Project
– MLH1, MSH2, MSH6 and PMS2 linked to
Lynch syndrome (germline mutations)
• COSMIC, Sanger Institute
– Somatic mutations linked to cancer
Database   PMIDs associated to mutations   Total mutation count   Average mutations per article   Std Dev
InSiGHT    809                             7022                   8.68                            18.55
COSMIC     7898                            198864                 25.18                           521.18
Literature mutation extraction
• Many tools exist to perform mutation annotation
– MutationMiner, MutationFinder, EMU, tmVar, SETH,
...
• Research shows that they have high precision
and recall on MEDLINE abstracts (> 90% F1)
• There are also tools to do named entity
extraction of genes, diseases, body parts …
Jimeno Yepes A, Verspoor K. (2014) Mutation extraction tools can be combined for robust
recognition of genetic variants in the literature. F1000Research 2014, 3:18. doi:
10.12688/f1000research.3-18.v2 [PMID:25285203]
How to extract mutations from text?
• Essentially a named entity recognition task.
• Early attempts focused on SNPs and protein mutations (amino
acid residues).
• e.g., MutationFinder1 patterns (simplified):
(?P<wt_res>AminoAcid)(?P<pos>[1-9][0-9]*)(?P<mut_res>AminoAcid)
Gly17Ser
Ser97Pro
• where AminoAcid is:
(CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|
TRP|VAL|GLU|TYR)|(GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|
ISOLEUCINE|LYSINE|ALANINE|GLYCINE|ASPARTATE|METHIONINE|
THREONINE|HISTIDINE|ASPARTIC ACID|ARGININE|ASPARAGINE|
TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|GLUTAMATE|
TYROSINE)
1http://mutationfinder.sourceforge.net/
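The simplified pattern above can be tried out directly. The following is a minimal sketch in Python, not MutationFinder's actual code: it uses only the three-letter residue codes from the slide, and the word boundaries, case-insensitive matching and the function name `find_point_mutations` are assumptions made for illustration.

```python
import re

# Three-letter amino acid codes, as listed on the slide (simplified).
THREE_LETTER = (
    "CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|"
    "TRP|VAL|GLU|TYR"
)

# Wild-type residue, 1-based position, mutant residue, e.g. Gly17Ser.
# Word boundaries and IGNORECASE are assumptions of this sketch.
POINT_MUTATION = re.compile(
    rf"\b(?P<wt_res>{THREE_LETTER})(?P<pos>[1-9][0-9]*)(?P<mut_res>{THREE_LETTER})\b",
    re.IGNORECASE,
)


def find_point_mutations(text):
    """Return (wild-type, position, mutant) tuples found in the text."""
    return [
        (
            m.group("wt_res").capitalize(),
            int(m.group("pos")),
            m.group("mut_res").capitalize(),
        )
        for m in POINT_MUTATION.finditer(text)
    ]


print(find_point_mutations("Both Gly17Ser and Ser97Pro were reported."))
# [('Gly', 17, 'Ser'), ('Ser', 97, 'Pro')]
```

A pattern this strict misses anything outside the wild-type/position/mutant shape, such as stop codons written as Arg226Stop or mutations described in running prose.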
Human Genome Variation Society
nomenclature (excerpt)
• Pattern-based approach to identifying genetic
variants
– dbSNP identifiers and standard HGVS nomenclature
(e.g. SETH https://rockt.github.io/SETH)
– natural language expressions of mutations
o This missense mutation converts a highly conserved glycine
(Gly17 of neurophysin) to a valine residue.
o Killer of prune (Kpn) is a mutation in the awd gene which
substitutes Ser for Pro at position 97 and causes dominant
lethality in individuals that do not have a functional prune gene.
o … where cysteines at positions 6, 42, 48, 90 and 393 were
replaced by serine.
Extraction of mutations from text
Extractor of Mutations (Kann Lab)
Study text sources
• PubMed
– 22M citations; title and abstract
• PubMed Central
– full text
– 512k available from PMC-Open Access
• Publisher site crawling
– Availability depends on license
– HTML pages can be noisy
• C676T –>Arg226Stop vs. C676TâArg226Stop
Extraction with EMU over our data
• EMU: Extract mutation from text and
link the mutations to co-occurring genes
• Normalize all mutation mentions to
HGVS format
– Format used in COSMIC and InSiGHT
• Match {gene, HGVS variant, PMID}
to curated data
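The matching step can be pictured as set intersection over {gene, HGVS variant, PMID} triples. The sketch below is an illustration under assumptions, not the study's code: the function name `recall`, the `use_gene` flag and the toy records (with placeholder PMIDs) are all made up for the example. Dropping the gene from the key shows how much of the shortfall comes from gene linking rather than from mutation recognition.

```python
def recall(curated, extracted, use_gene=True):
    """Fraction of curated records that were recovered by extraction.

    Each record is a (gene, hgvs_variant, pmid) tuple. With use_gene=False
    the gene is dropped from both sides before comparing.
    """
    def key(record):
        gene, variant, pmid = record
        return (gene, variant, pmid) if use_gene else (variant, pmid)

    curated_keys = {key(r) for r in curated}
    extracted_keys = {key(r) for r in extracted}
    if not curated_keys:
        return 0.0
    return len(curated_keys & extracted_keys) / len(curated_keys)


# Toy records for illustration only; the PMIDs are placeholders.
curated = [
    ("MLH1", "p.Lys618Ala", "PMID-A"),
    ("MSH2", "p.Gly322Asp", "PMID-A"),
    ("MLH1", "c.482_483del", "PMID-B"),
    ("MSH6", "p.Ser144Ile", "PMID-C"),
]
extracted = [
    ("MLH1", "p.Lys618Ala", "PMID-A"),   # exact match
    ("MSH6", "p.Gly322Asp", "PMID-A"),   # right variant, wrong gene
]

print(recall(curated, extracted))                  # 0.25
print(recall(curated, extracted, use_gene=False))  # 0.5
```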
Results
Abstracts and Full Text
NG = No Gene (ignoring gene in match)
Common/Cmn = PMIDs in common between database and corpus subset
(recall with respect to articles for which mutation entity recogniser had at
least one positive extraction)
Set                Cmn art   Match mutation   Recall   Recall NG   Mutations common   Recall common   Recall CmnNG
COSMIC Abs         2200      1884             0.0095   0.0122      12,940             0.1456          0.1875
COSMIC FT          2071      3656             0.0184   0.0215      104,756            0.0349          0.0408
COSMIC Abs + FT    3738      4754             0.0239   0.0289      114,279            0.0416          0.0503
InSiGHT Abs        195       230              0.0328   0.045       1233               0.1865          0.2562
InSiGHT FT         150       404              0.0575   0.0612      1626               0.2484          0.2644
InSiGHT Abs + FT   295       588              0.0837   0.0961      2657               0.2213          0.254
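As a reading aid for the table above, the two recall columns differ only in their denominator. The short check below recomputes the abstract rows from the counts given in the slides. The function names are illustrative, and the interpretation (Recall divides by the database's total mutation count, Recall common by the mutations in the common articles) is inferred from the fact that the numbers agree.

```python
def overall_recall(matched, total_mutations):
    """Matched mutations over all curated mutations in the database."""
    return round(matched / total_mutations, 4)


def common_recall(matched, mutations_common):
    """Matched mutations over curated mutations in the common articles only."""
    return round(matched / mutations_common, 4)


# Totals from the database summary: COSMIC 198864, InSiGHT 7022.
print(overall_recall(1884, 198864), common_recall(1884, 12940))  # COSMIC Abs
print(overall_recall(230, 7022), common_recall(230, 1233))       # InSiGHT Abs
```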
High Throughput vs non-High Throughput
Set            Cmn art   Match mutation   Recall   Recall NG   Recall common   Recall CmnNG
HT abstract    1650      1357             0.0072   0.0096      0.1209          0.1608
HT full text   1545      2719             0.0145   0.0172      0.027           0.0319
HT Abs + FT    2608      3501             0.0187   0.0231      0.032           0.0395
NHT abstract   550       530              0.0461   0.0543      0.3055          0.3597
NHT full text  526       937              0.0815   0.0915      0.235           0.2639
NHT Abs + FT   841       1259             0.109    0.1243      0.2538          0.2895
Group        PMIDs   Count     Average mutation   SD       Mutation recall
COSMIC       7898    198,864   25.18              521.27   100.00%
COSMIC-HT    6266    187,367   29.9               584.82   94.22%
COSMIC-NHT   1632    11,497    7.04               38.05    5.78%
Considering tables and Supplementary
material
• Subset from COSMIC and InSiGHT available as
PubMed Central Open Access articles
• Supplementary material: MS Word, PDF, MS Excel,
PPT, images, …
                     InSiGHT                            COSMIC
Set                  Articles   Matched   Recall (%)    Articles   Matched   Recall (%)
Abstracts            13         1         0.4           563        140       0.41
XML Full Text (FT)   9          20        7.94          487        694       2.05
PDF FT (PDFFT)       4          7         2.78          76         23        0.07
Tables               8          18        7.14          394        466       1.38
FT+PDFFT+Tables      13         44        17.46         563        929       2.75
Supp. Mat.           1          88        34.92         138        17015     50.59
All                  13         115       45.63         563        17896     52.92
Recall still only 50%:
Where are the rest?
• Expressed in semi-structured data sources
– do not necessarily follow standard nomenclature
– data spread unpredictably across columns (Wong et al.
2009)
• Different reference position in text than database
– curator correction or normalized to different build
• Nomenclature variation
– c.482_483delGA vs c.482_483del2
• Linguistic expression of mutations
– deletion of exon 3
– C>T mutation at nucleotide 2131
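The nomenclature-variation case can be handled with a small normalisation rule before matching. This is a minimal sketch, assuming that both spellings should be reduced to the bare `del` form. The regular expression and the name `normalise_deletion` are illustrative, and the rule only covers simple coding-DNA deletions of the shape shown above.

```python
import re

# c.<start>[_<end>]del, optionally followed by the deleted bases or a count.
DELETION = re.compile(r"^(c\.\d+(?:_\d+)?del)(?:[ACGT]+|\d+)?$")


def normalise_deletion(variant):
    """Strip a trailing base string or count from a simple c. deletion."""
    variant = variant.strip()
    match = DELETION.match(variant)
    return match.group(1) if match else variant


print(normalise_deletion("c.482_483delGA"))  # c.482_483del
print(normalise_deletion("c.482_483del2"))   # c.482_483del
```

Linguistic expressions such as "deletion of exon 3" are out of reach for a rule like this and need separate handling.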
Information in tables (spreadsheets, etc.)
is expressed differently than in narrative text
Gene listed in column heading
Non-standard nomenclature
“Del exon 7”
Text mining over semi-structured data?
• Access ?
• Variability (!)
– File formats
– How connected to the main text?
• Semantics (?!)
– How to make sense of the data?
– How to map to standardized nomenclature?
… processing supplementary material will require new
strategies. Some technical solutions. Some research.
Extraction of gene-disease-
mutation relations
Study:
Verspoor KM, Heo GH, Kang KY, Song M. (in press) Extraction of fine-grained semantic relations for
the Human Variome. BMC Medical Informatics and Decision Making.
Variant interpretation using literature
• Evidence of prior significance of variants
• Evidence of established connection of the variant
to specific patient cohorts
• Use alone or in combination with other evidence
• We aim to extract the relations that connect genes,
diseases and mutations
• Specific Objective of the work:
relation extraction over the
Variome Corpus
gene-mutation-disease-phenotype relations
• Variome Annotation Schema
– a schema defining entities and relations of interest
to curation of genetic variants
• Variome Corpus
– A corpus of full text articles annotated according to
the Variome Annotation Schema
– To be used as training and evaluation data for text
mining tools for extracting genetic variation
information from the published literature
http://www.opennicta.com.au/home/health/variome
The Variome Corpus
10 full-text publications related to colorectal cancer
Entities                    Relations
Gene                        Gene-has-Mutation
Mutation                    Cohort/Patient-has-Mutation
Disease                     Mutation-relatedto-Disease
Body part                   Disease-relatedto-Gene
Cohort/Patient              Disease-relatedto-BodyPart
Size                        Mutation-has-Size
Age                         Cohort/Patient-has-Age
Gender                      Cohort/Patient-has-Gender
Ethnicity or Geo Location   Cohort/Patient-has-EthnicityLoc
Characteristic              Cohort/Patient-has-Disease
                            Cohort-has-size
Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP.
(2013) Annotating the Biomedical Literature for the Human Variome. Database: The Journal of
Biological Databases and Curation, bat019.
§ 43k words
§ Double-annotated
§ IAA varies
§ .88-.92 F for entities
§ Relations much lower; reconciled manually
The Variome Corpus annotation
• Recognise genetic variants
• Named entity recognition for gene names
– Supervised learning for recognizing characteristics and contexts
– Combined with dictionaries to support normalisation
• Associating variants to genes
– Simple co-occurrence
– Combined with sequence verification
– Machine learning for relation classification (PKDE4J)
Extraction of mutation relations from text
Information Extraction, Structuring text
From:
A subset of colorectal tumour DNA samples from 17
patients carrying the p.Lys618Ala variant …
To:
T60 body-part 1307 1317 colorectal
T7 disease 1318 1324 tumour
R17_m relatedTo Arg1:T60 Arg2:T7
(colorectal relatedTo tumour)
T61_merge size 1342 1344 17
T24 cohort-patient 1345 1353 patients
R46_2 has Arg1:T24 Arg2:T7
(patients has tumour)
T62 mutation 1367 1378 p.Lys618Ala
R18_m has Arg1:T24 Arg2:T61_merge
(patients has 17) = (patient group size 17)
R19_m has Arg1:T24 Arg2:T62
(patients has p.Lys618Ala)
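The structured output above is easy to load back into objects. The sketch below assumes the whitespace-separated layout exactly as printed on the slide (entity lines starting with T, relation lines starting with R, parenthesised glosses ignored). The class and function names are illustrative and not part of any annotation tool's API.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str
    arg2: str


def parse_standoff(lines):
    """Parse entity (T...) and relation (R...) lines; skip everything else."""
    entities, relations = {}, []
    for line in lines:
        line = line.strip()
        if line.startswith("T"):
            ident, etype, start, end, text = line.split(None, 4)
            entities[ident] = Entity(ident, etype, int(start), int(end), text)
        elif line.startswith("R"):
            ident, rtype, arg1, arg2 = line.split()
            relations.append(
                Relation(ident, rtype, arg1.split(":", 1)[1], arg2.split(":", 1)[1])
            )
    return entities, relations


def readable(entities, relations):
    """Render each relation using the text of its two arguments."""
    return [
        f"{entities[r.arg1].text} {r.type} {entities[r.arg2].text}"
        for r in relations
    ]


sample = [
    "T60 body-part 1307 1317 colorectal",
    "T7 disease 1318 1324 tumour",
    "R17_m relatedTo Arg1:T60 Arg2:T7",
    "(colorectal relatedTo tumour)",
]
ents, rels = parse_standoff(sample)
print(readable(ents, rels))  # ['colorectal relatedTo tumour']
```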
PKDE4J: Yonsei University IE system
• PKDE4J
– Extensible, flexible text mining system for public knowledge
discovery
– Entity and relation extraction from the unstructured text data
– Extension of Stanford CoreNLP (Manning et al., 2014)
– http://informatics.yonsei.ac.kr/pkde4j
• Differentiation of PKDE4J
– Configurable system
• Dictionary based entity extraction
• Extensible system
• Wide range of relation extraction tasks developing an
extensible rule engine based on dependency parsing
– Accurate performance
• PKDE4J outperforms many other competing algorithms for
both entity and relation extraction
PKDE4J: Yonsei University IE system
• PKDE4J’s two major pipelines
– Entity Extraction: Target entities based on
dictionaries by extending Stanford CoreNLP
– Relation Extraction: relationships among entities
based on dependency tree based rules
PKDE4J – Named Entity Recognition
• Extension of Stanford CoreNLP
• Three major submodules
Pre-Processing Dictionary loading Entity annotation
• Flexible configuration (number and format of dictionaries)
• Trie data structure
• Abbreviation resolution
• Tokenization: Stanford PTBTokenizer
• Sentence splitting, POS tagging, Lemmatization: Stanford CoreNLP
• String normalization: Special characters processing
• N-gram matching: Apache Lucene ShingleWrapper
• Approximate string matching: Soft-TFIDF
• Regex NER (Rule-based): Stanford CoreNLP
• Candidate entities filtering: POS filtering, Stopword removal
• Labeling: B/I/O format, Entity type
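Two of the ingredients listed above, a trie-backed dictionary and B/I/O labelling, can be illustrated in a few lines. This is a generic Python sketch of longest-match dictionary tagging, not PKDE4J's Java implementation. It does exact, case-insensitive token matching only and leaves out the approximate matching, filtering and abbreviation steps. The function names and the toy dictionary are assumptions.

```python
def build_trie(dictionary):
    """Build a token-level trie from {term: entity_type}."""
    root = {}
    for term, etype in dictionary.items():
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node["$type"] = etype  # marks the end of a complete term
    return root


def tag_bio(tokens, trie):
    """Label tokens with B-/I-/O tags using the longest dictionary match."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if "$type" in node:
                best = (j, node["$type"])  # remember the longest match so far
        if best:
            end, etype = best
            labels[i] = f"B-{etype}"
            for k in range(i + 1, end):
                labels[k] = f"I-{etype}"
            i = end
        else:
            i += 1
    return labels


trie = build_trie({"colorectal cancer": "Disease", "MLH1": "Gene"})
tokens = "Mutations in MLH1 cause colorectal cancer".split()
print(list(zip(tokens, tag_bio(tokens, trie))))
```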
PKDE4J – Relation Extraction
• Based on dependency parse (grammatical structure)
based rules
• To extract a relation
Step 1: Identify the verbs in a sentence
Category   Number of Verbs   Type         Verb Example
Positive   68                Increase     Lead, Contribute, Rise
                             Transmit     Shift, Move, Migrate
                             Substitute   Supplement, Alter
Negative   54                Decrease     Decline, Diffuse, Down-regulate
                             Remove       Deplete, Abrogate, Disassociate
Neutral    111               Contain      Possess, Constitute, Include
                             Modify       Methylate, Modulate, Normalize
                             Method       Bleach, Centrifuge, Spin
                             Report       Evaluate, Analyze, Examine
Plain      165               Plain        Return, Switch, Balance
PKDE4J - RE
Step 2: Check structure of sentence
• Syntactic rules based on deep parsing
– Dependency tree encodes grammatical relations between words
in a sentence.
– The tree denotes syntactic dependencies between two entities.
– Need to spot the portion of parse tree that is useful, pertinent to
location of entities in a sentence.
PKDE4J - RE
• Rule Extraction
– Use Strategy design pattern
– Capture predefined rules (17 strategies)
① Verb in dependency path
② No verb in dependency path
③ Detect nominalization
④ Weak nominalization
⑤ Negation
⑥ Tense (active / passive)
⑦ Contain clause
⑧ Clause distance
⑨ Negation clause
⑩ Number intervening entities
⑪ Entities in between
⑫ Surface distance
⑬ Entity counts
⑭ Same head
⑮ Entity order
⑯ Full tree path
⑰ Path length
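The Strategy design pattern mentioned above amounts to keeping each rule as an interchangeable callable and applying the whole set to every candidate entity pair. The sketch below illustrates that idea in Python with three surface-level strategies that need no parser. It is not PKDE4J's rule engine: the `Candidate` structure, the function names and the example sentence tokens are assumptions, and the dependency-based strategies are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    tokens: List[str]
    entity_spans: List[Tuple[int, int]]  # (start, end) token offsets, end exclusive
    first: int   # index into entity_spans of the first argument
    second: int  # index into entity_spans of the second argument


def _ordered_pair(c):
    """Return the two argument spans in left-to-right order."""
    return sorted([c.entity_spans[c.first], c.entity_spans[c.second]])


def surface_distance(c):
    """Number of tokens between the two entities."""
    left, right = _ordered_pair(c)
    return max(0, right[0] - left[1])


def entities_in_between(c):
    """Number of other entities lying between the two arguments."""
    left, right = _ordered_pair(c)
    return sum(1 for s in c.entity_spans if s[0] >= left[1] and s[1] <= right[0])


def entity_order(c):
    """Whether the first argument precedes the second in the sentence."""
    return c.entity_spans[c.first][0] < c.entity_spans[c.second][0]


# Each strategy is interchangeable; adding a rule means adding an entry here.
STRATEGIES = {
    "surface_distance": surface_distance,
    "entities_in_between": entities_in_between,
    "entity_order": entity_order,
}


def extract_features(candidate, strategies=STRATEGIES):
    return {name: rule(candidate) for name, rule in strategies.items()}


tokens = "17 patients carrying the p.Lys618Ala variant".split()
spans = [(0, 1), (1, 2), (4, 5)]  # size, cohort-patient, mutation
print(extract_features(Candidate(tokens, spans, first=1, second=2)))
```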
Evaluation:
PKDE4J over Variome Corpus
• Experimental set-up
– Data split
– Features?
– 10-fold cross-validation
• Focus on relations:
Used gold standard entities
• Baseline co-occurrence system
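The slides do not spell out how the baseline works. A common formulation of a co-occurrence baseline, shown here purely as an assumption, proposes a relation for every pair of entities in the same sentence whose types are compatible with the annotation schema. The `SCHEMA` subset and the function name below are illustrative.

```python
from itertools import combinations

# (type of arg 1, type of arg 2) -> relation label; a subset of the schema.
SCHEMA = {
    ("Gene", "Mutation"): "Gene-has-Mutation",
    ("Cohort/Patient", "Mutation"): "Cohort/Patient-has-Mutation",
    ("Mutation", "Disease"): "Mutation-relatedto-Disease",
    ("Disease", "Gene"): "Disease-relatedto-Gene",
}


def cooccurrence_relations(sentence_entities):
    """Propose a relation for every schema-compatible pair in one sentence.

    sentence_entities is a list of (text, entity_type) tuples.
    """
    found = []
    for (t1, ty1), (t2, ty2) in combinations(sentence_entities, 2):
        if (ty1, ty2) in SCHEMA:
            found.append((t1, SCHEMA[(ty1, ty2)], t2))
        elif (ty2, ty1) in SCHEMA:
            found.append((t2, SCHEMA[(ty2, ty1)], t1))
    return found


entities = [
    ("MLH1", "Gene"),
    ("p.Lys618Ala", "Mutation"),
    ("patients", "Cohort/Patient"),
]
for relation in cooccurrence_relations(entities):
    print(relation)
```

A baseline of this kind tends towards high recall and low precision, which is what makes it a useful reference point for a rule-based or learned classifier.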
Results of the evaluation
Relation Extraction results for relations with at least 100
examples in the corpus.
Observations
• By applying text mining we can transform the
literature from an unstructured, difficult to use
resource, to a structured resource.
• We can build systems that can recognise core
biological entities in the published literature.
• With this, the information is more accessible
– Formalised and normalised in a database
– Directly query-able
• and can be used to facilitate more computation:
– Information retrieval in terms of entities
– Predictive modeling and hypothesis generation
Conclusions
• Variants are relatively easy to recognise in the
literature, when the recommended
nomenclature is followed (so please use it!).
• The relations between variants and other
entities are harder to extract, but still we can do
a reasonable job.
• There is lots of information that is in ancillary
files associated to the literature (with some
challenges for automated systems).
The literature can be effectively mined to identify
variant-related information to assist biocuration
and clinical interpretation of variants.
© Copyright The University of Melbourne 2016

RitabrataSarkar3
 

Recently uploaded (20)

原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
Thornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdfThornton ESPP slides UK WW Network 4_6_24.pdf
Thornton ESPP slides UK WW Network 4_6_24.pdf
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 
Basics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
Cytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptxCytokines and their role in immune regulation.pptx
Cytokines and their role in immune regulation.pptx
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Deep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless ReproducibilityDeep Software Variability and Frictionless Reproducibility
Deep Software Variability and Frictionless Reproducibility
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 

Using text mining to inform genetic variant interpretation

  • 1. Using text mining to inform genetic variant interpretation Karin Verspoor Department of Computing and Information Systems karin.verspoor@unimelb.edu.au
  • 2. So you’re a medical doctor … • With a very sick patient • You can’t work out what’s going on • You suspect a rare disease • You order a DNA analysis (whole exome or genome) • And find a genetic mutation What does it mean?
  • 3. Clinical interpretation of variants [Figure: Sample Data Flow, from patient sample to patient clinical report, across Wet Lab, Bioinformatics and Clinical Informatics. Steps: Histology, DNA Extract, PCR, Sequencing, Alignment, Variant Calling, Annotation, DB Load, Filtering, Known Variant? (Yes/No), Curation, Variant Normalisation, Document Assembly, Report Editing and Signoff, Publish Report. Data sources: Peter Mac Mutation DB; External and Locus Specific DBs. Each step is marked as manual or automatic.] Image courtesy Kenneth Doig, Peter Mac. “PipeCleaner for your NGS Pipeline” HISA Big Data 2013.
  • 4. What’s a mutation? • Genomic variation: alteration in a sequence – hereditary (germ-line) mutations – acquired (somatic) mutations • Examples of variation – SNP (single nucleotide polymorphism) – Protein mutation – insertions, deletions, duplications, inversions, . . . • Types of variations – DNA variations that have no adverse effects on our cells and occur frequently in the population are called polymorphisms – DNA variations that do affect the function of the protein made from a gene and occur less often are called mutations
  • 5. The Challenge: Interpreting variants § Identifying variation is becoming easier, interpreting it remains difficult • Which changes are due to normal individual variation? • Which are associated with a phenotype of interest?
  • 6. Interpreting variation through context • Analysis of functional significance of variants – Predicted impact of mutations – Conservation analysis – Allele frequencies from large genomic databases • Existing knowledge captured in structured sources – UniProt site-specific protein annotations – The Cancer Genome Atlas genomic characterisation data – Disease-specific variant databases, e.g. COSMIC and InSiGHT • Techniques for annotating variants – Data aggregation from multiple sources – Data integration and inference to reveal shared pathways
  • 7. Exponential knowledge growth • ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection • Over 25 million PubMed entries (> 2,000/day) • Breakdown of disciplinary boundaries makes more of it relevant to each of us
  • 9. Structured resources are not enough: Literature is the primary repository of knowledge [Chart: number of Swiss-Prot proteins (axis 0 to 120,000) from 1/02 to 1/07, showing proteins missing a FUNCTION comment and proteins gaining a FUNCTION comment.] “Manual curation is not sufficient for annotation of genomic databases” Baumgartner et al ISMB 2007 “Our entire understanding of biology and medicine is really contained in the published literature. And since people write in natural language, if you can’t get computers to turn that information into databases and computable information, you’re falling behind.” -- Russ Altman, MD PhD, Stanford University
  • 10. Recovery of variants from the literature using text mining Study: Jimeno Yepes A, Verspoor K. (2014) Literature mining of genetic variants for curation: Quantifying the importance of supplementary material. Database: The Journal of Biological Databases and Curation, bau003. doi:10.1093/database/bau003 [PMID:24520105]
  • 11. Study: Recall of curated variants through the application of text mining • Given a curated resource of genetic variants, • with explicit links to the source literature for each variant, • and a mutation extraction tool with demonstrated good performance on intrinsic evaluation … how many variants can text mining recover?
  • 13. Motivations • Assess real-world applicability of text mining tools for supporting analysis of genetic variants • Speed up curation of mutation databases
  • 14. Two databases • InSiGHT, Human Variome Project – MLH1, MSH2, MSH6 and PMS2 linked to Lynch syndrome (germline mutations) • COSMIC, Sanger Institute – Somatic mutations linked to cancer
      Database   PMIDs associated to mutations   Total mutation count   Average mutations per article   Std dev
      InSiGHT    809                             7022                   8.68                            18.55
      COSMIC     7898                            198864                 25.18                           521.18
  • 15. Literature mutation extraction • Many tools exist to perform mutation annotation – MutationMiner, MutationFinder, EMU, tmVar, SETH, ... • Research shows that they have high precision and recall on MEDLINE abstracts (> 90% F1) • There are also tools to do named entity extraction of genes, diseases, body parts … Jimeno Yepes A, Verspoor K. (2014) Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Research 2014, 3:18. doi: 10.12688/f1000research.3-18.v2 [PMID:25285203]
  • 16. How to extract mutations from text? • Essentially a named entity recognition task. • Early attempts focused on SNPs and protein mutations (amino acid residues). • e.g., MutationFinder [1] patterns (simplified):
      (?P<wt_res>AminoAcid)(?P<pos>[1-9][0-9]*)(?P<mut_res>AminoAcid)
      Examples: Gly17Ser, Ser97Pro
      where AminoAcid is:
      (CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|TRP|VAL|GLU|TYR)|(GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|ISOLEUCINE|LYSINE|ALANINE|GLYCINE|ASPARTATE|METHIONINE|THREONINE|HISTIDINE|ASPARTIC ACID|ARGININE|ASPARAGINE|TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|GLUTAMATE|TYROSINE)
      [1] http://mutationfinder.sourceforge.net/
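The simplified pattern on this slide can be sketched in Python as follows. This is an illustration only, not MutationFinder's actual rule set: it uses just the three-letter codes listed on the slide, and the names `AMINO_ACID`, `MUTATION_RE` and `find_mutations` are hypothetical.

```python
import re

# Three-letter amino acid codes from the slide (matched case-insensitively).
AMINO_ACID = (
    "CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|"
    "TRP|VAL|GLU|TYR"
)

# wild-type residue, 1-based position, mutant residue, e.g. Gly17Ser
MUTATION_RE = re.compile(
    rf"\b(?P<wt_res>{AMINO_ACID})(?P<pos>[1-9][0-9]*)(?P<mut_res>{AMINO_ACID})\b",
    re.IGNORECASE,
)


def find_mutations(text):
    """Return (wild-type residue, position, mutant residue) tuples found in text."""
    return [
        (
            m.group("wt_res").capitalize(),
            int(m.group("pos")),
            m.group("mut_res").capitalize(),
        )
        for m in MUTATION_RE.finditer(text)
    ]


if __name__ == "__main__":
    print(find_mutations("We observed Gly17Ser and Ser97Pro substitutions."))
```

A single surface pattern like this only covers one way of writing a protein substitution; natural-language mentions and DNA-level HGVS expressions need additional patterns.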
  • 17. Human Genome Variation Society nomenclature (excerpt)
  • 18. Extraction of mutations from text • Pattern-based approach to identifying genetic variants – dbSNP identifiers and standard HGVS nomenclature (e.g. SETH https://rockt.github.io/SETH) – natural language expressions of mutations o This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue. o Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene. o … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine.
  • 20. Study text sources • PubMed – 22M citations; title and abstract • PubMed Central – full text – 512k available from PMC-Open Access • Publisher site crawling – Availability depends on license – HTML pages can be noisy • C676T –>Arg226Stop vs. C676TâArg226Stop
  • 21. Extraction with EMU over our data • EMU: Extract mutation from text and link the mutations to co-occurring genes • Normalize all mutation mentions to HGVS format – Format used in COSMIC and InSiGHT • Match {gene, HGVS variant, PMID} to curated data
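The matching step can be sketched as a set comparison over {gene, HGVS variant, PMID} triples. This is a minimal sketch, not the study's evaluation code: the function name `recall`, the `ignore_gene` flag (mirroring the idea of ignoring the gene in the match) and all toy records below are invented for illustration.

```python
def recall(extracted, curated, ignore_gene=False):
    """Fraction of curated (gene, hgvs, pmid) triples that were also extracted."""

    def key(triple):
        gene, hgvs, pmid = triple
        # With ignore_gene, a variant counts as found even if linked to another gene.
        return (hgvs, pmid) if ignore_gene else (gene.upper(), hgvs, pmid)

    curated_keys = {key(t) for t in curated}
    if not curated_keys:
        return 0.0
    extracted_keys = {key(t) for t in extracted}
    return len(curated_keys & extracted_keys) / len(curated_keys)


# Toy records, invented for illustration only (not real database entries).
curated = [
    ("GENE1", "c.100A>G", "PMID-A"),
    ("GENE2", "c.200C>T", "PMID-A"),
    ("GENE3", "c.300del", "PMID-B"),
    ("GENE4", "c.400G>A", "PMID-B"),
]
extracted = [
    ("GENE1", "c.100A>G", "PMID-A"),  # exact match
    ("GENE1", "c.200C>T", "PMID-A"),  # right variant, linked to the wrong gene
]

if __name__ == "__main__":
    print(recall(extracted, curated))                    # strict matching
    print(recall(extracted, curated, ignore_gene=True))  # gene ignored
```

Comparing the two figures separates errors in variant recognition from errors in linking a variant to its gene.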
  • 22. Results: Abstracts and Full Text
      NG = No Gene (ignoring gene in match)
      Common/Cmn = PMIDs in common between database and corpus subset (recall with respect to articles for which mutation entity recogniser had at least one positive extraction)
      Set                Cmn art   Match mutation   Recall   Recall NG   Mutations common   Recall common   Recall CmnNG
      COSMIC Abs         2200      1884             0.0095   0.0122      12,940             0.1456          0.1875
      COSMIC FT          2071      3656             0.0184   0.0215      104,756            0.0349          0.0408
      COSMIC Abs + FT    3738      4754             0.0239   0.0289      114,279            0.0416          0.0503
      InSiGHT Abs        195       230              0.0328   0.045       1233               0.1865          0.2562
      InSiGHT FT         150       404              0.0575   0.0612      1626               0.2484          0.2644
      InSiGHT Abs + FT   295       588              0.0837   0.0961      2657               0.2213          0.254
  • 23. High Throughput vs non-High Throughput
      Set             Cmn art   Match mutation   Recall   Recall NG   Recall common   Recall CmnNG
      HT abstract     1650      1357             0.0072   0.0096      0.1209          0.1608
      HT full text    1545      2719             0.0145   0.0172      0.027           0.0319
      HT Abs + FT     2608      3501             0.0187   0.0231      0.032           0.0395
      NHT abstract    550       530              0.0461   0.0543      0.3055          0.3597
      NHT full text   526       937              0.0815   0.0915      0.235           0.2639
      NHT Abs + FT    841       1259             0.109    0.1243      0.2538          0.2895

      Group        PMIDs   Count     Average mutation   SD       Mutation recall
      COSMIC       7898    198,864   25.18              521.27   100.00%
      COSMIC-HT    6266    187,367   29.9               584.82   94.22%
      COSMIC-NHT   1632    11,497    7.04               38.05    5.78%
  • 24. Considering tables and Supplementary material • Subset from COSMIC and InSiGHT available as PubMed Central Open Access articles • Supplementary material: MS Word, PDF, MS Excel, PPT, images, …
                           InSiGHT                          COSMIC
      Set                  Articles   Matched   Recall (%)  Articles   Matched   Recall (%)
      Abstracts            13         1         0.4         563        140       0.41
      XML Full Text (FT)   9          20        7.94        487        694       2.05
      PDF FT (PDFFT)       4          7         2.78        76         23        0.07
      Tables               8          18        7.14        394        466       1.38
      FT+PDFFT+Tables      13         44        17.46       563        929       2.75
      Supp. Mat.           1          88        34.92       138        17015     50.59
      All                  13         115       45.63       563        17896     52.92
  • 25. Recall still only 50%: Where are the rest? • Expressed in semi-structured data sources – do not necessarily follow standard nomenclature – data spread unpredictably across columns (Wong et al. 2009) • Different reference position in text than database – curator correction or normalized to different build • Nomenclature variation – c.482_483delGA vs c.482_483del2 • Linguistic expression of mutations – deletion of exon 3 – C>T mutation at nucleotide 2131
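The nomenclature-variation case on this slide (c.482_483delGA vs c.482_483del2) can be handled by reducing both spellings to a shared form before comparison. This is a minimal sketch under stated assumptions: it covers only simple coding-DNA deletions, and the names `_DEL_RE` and `normalise_deletion` are hypothetical rather than taken from any of the tools discussed.

```python
import re

# A coding-DNA deletion: position or range, "del", then optionally either the
# deleted bases (e.g. GA) or the number of deleted bases (e.g. 2).
_DEL_RE = re.compile(r"^(?P<prefix>c\.\d+(?:_\d+)?)del(?P<detail>[ACGT]+|\d+)?$")


def normalise_deletion(variant):
    """Reduce a simple deletion to 'c.<range>del'; return other input unchanged."""
    m = _DEL_RE.match(variant.strip())
    return f"{m.group('prefix')}del" if m else variant


if __name__ == "__main__":
    print(normalise_deletion("c.482_483delGA"))
    print(normalise_deletion("c.482_483del2"))
```

Normalisation of this kind addresses spelling differences only; differences in reference position or genome build need sequence-level checks.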
  • 26. Information in tables (spreadsheets, etc.) is expressed differently than in narrative text
  • 27. [Figure callouts] Gene listed in column heading; non-standard nomenclature (“Del exon 7”)
  • 28. Text mining over semi-structured data? • Access ? • Variability (!) – File formats – How connected to the main text? • Semantics (?!) – How to make sense of the data? – How to map to standardized nomenclature? … processing supplementary material will require new strategies. Some technical solutions. Some research.
  • 29. Extraction of gene-disease-mutation relations Study: Verspoor KM, Heo GH, Kang KY, Song M. (in press) Extraction of fine-grained semantic relations for the Human Variome. BMC Medical Informatics and Decision Making.
  • 30. Variant interpretation using literature • Evidence of prior significance of variants • Evidence of established connection of the variant to specific patient cohorts • Use alone or in combination with other evidence • We aim to extract the relations that connect genes, diseases and mutations • Specific Objective of the work: relation extraction over the Variome Corpus
  • 31. gene-mutation-disease-phenotype relations • Variome Annotation Schema – a schema defining entities and relations of interest to curation of genetic variants • Variome Corpus – A corpus of full text articles annotated according to the Variome Annotation Schema – To be used as training and evaluation data for text mining tools for extracting genetic variation information from the published literature http://www.opennicta.com.au/home/health/variome
  • 32. The Variome Corpus: 10 full-text publications related to colorectal cancer
      Entities: Gene; Mutation; Disease; Body part; Cohort/Patient; Size; Age; Gender; Ethnicity or Geo Location; Characteristic
      Relations: Gene-has-Mutation; Cohort/Patient-has-Mutation; Mutation-relatedto-Disease; Disease-relatedto-Gene; Disease-relatedto-BodyPart; Mutation-has-Size; Cohort/Patient-has-Age; Cohort/Patient-has-Gender; Cohort/Patient-has-EthnicityLoc; Cohort/Patient-has-Disease; Cohort-has-size
      § 43k words § Double-annotated § IAA varies § .88-.92 F for entities § Relations much lower; reconciled manually
      Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP. (2013) Annotating the Biomedical Literature for the Human Variome. Database: The Journal of Biological Databases and Curation, bat019.
  • 33. The Variome Corpus annotation
  • 34. Extraction of mutation relations from text • Recognise genetic variants • Named entity recognition for gene names – Supervised learning for recognizing characteristics and contexts – Combined with dictionaries to support normalisation • Associating variants to genes – Simple co-occurrence – Combined with sequence verification – Machine learning for relation classification (PKDE4J)
  • 35. Information Extraction, Structuring text
      From: A subset of colorectal tumour DNA samples from 17 patients carrying the p.Lys618Ala variant …
      To:
      T60 body-part 1307 1317 colorectal
      T7 disease 1318 1324 tumour
      R17_m relatedTo Arg1:T60 Arg2:T7 (colorectal relatedTo tumour)
      T61_merge size 1342 1344 17
      T24 cohort-patient 1345 1353 patients
      R46_2 has Arg1:T24 Arg2:T7 (patients has tumour)
      T62 mutation 1367 1378 p.Lys618Ala
      R18_m has Arg1:T24 Arg2:T61_merge (patients has 17) = (patient group size 17)
      R19_m has Arg1:T24 Arg2:T62 (patients has p.Lys618Ala)
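The standoff records on this slide are straightforward to load into a queryable structure. The sketch below assumes whitespace-separated fields exactly as they appear on the slide (T-lines: identifier, type, start offset, end offset, covered text; R-lines: identifier, relation, Arg1, Arg2); the function names `parse_standoff` and `readable` are hypothetical.

```python
# Annotation lines copied from the slide, without the parenthetical glosses.
SLIDE_LINES = [
    "T60 body-part 1307 1317 colorectal",
    "T7 disease 1318 1324 tumour",
    "R17_m relatedTo Arg1:T60 Arg2:T7",
    "T61_merge size 1342 1344 17",
    "T24 cohort-patient 1345 1353 patients",
    "R46_2 has Arg1:T24 Arg2:T7",
    "T62 mutation 1367 1378 p.Lys618Ala",
    "R18_m has Arg1:T24 Arg2:T61_merge",
    "R19_m has Arg1:T24 Arg2:T62",
]


def parse_standoff(lines):
    """Split standoff lines into an entity dict and a list of relations."""
    entities, relations = {}, []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        ident = parts[0]
        if ident.startswith("T"):
            # T<id> <type> <start> <end> <covered text ...>
            entities[ident] = {
                "type": parts[1],
                "start": int(parts[2]),
                "end": int(parts[3]),
                "text": " ".join(parts[4:]),
            }
        elif ident.startswith("R"):
            # R<id> <relation> Arg1:<id> Arg2:<id>; anything after is ignored
            args = dict(p.split(":", 1) for p in parts[2:4])
            relations.append((parts[1], args["Arg1"], args["Arg2"]))
    return entities, relations


def readable(entities, relations):
    """Render each relation as '<arg1 text> <relation> <arg2 text>'."""
    return [f"{entities[a]['text']} {rel} {entities[b]['text']}" for rel, a, b in relations]


if __name__ == "__main__":
    ents, rels = parse_standoff(SLIDE_LINES)
    for row in readable(ents, rels):
        print(row)
```

Once loaded this way, the relations can be written to a database table and queried by entity type.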
  • 36. PKDE4J: Yonsei University IE system • PKDE4J – Extensible, flexible text mining system for public knowledge discovery – Entity and relation extraction from the unstructured text data – Extension of Stanford CoreNLP (Manning et al., 2014) – http://informatics.yonsei.ac.kr/pkde4j • Differentiation of PKDE4J – Configurable system • Dictionary based entity extraction • Extensible system • Wide range of relation extraction tasks developing an extensible rule engine based on dependency parsing – Accurate performance • PKDE4J outperforms many other competing algorithms for both entity and relation extraction
  • 37. PKDE4J: Yonsei University IE system • PKDE4J’s two major pipelines – Entity Extraction: identifies target entities from dictionaries by extending Stanford CoreNLP – Relation Extraction: identifies relationships among entities using dependency-tree-based rules
  • 38. PKDE4J – Named Entity Recognition
  • 39. PKDE4J – Named Entity Recognition • Extension of Stanford CoreNLP • Three major submodules: Pre-Processing, Dictionary loading, Entity annotation • Flexible configuration (number and format of dictionaries) • Trie data structure • Abbreviation resolution • Tokenization: Stanford PTBTokenizer • Sentence splitting, POS tagging, Lemmatization: Stanford CoreNLP • String normalization: Special characters processing • N-gram matching: Apache Lucene ShingleWrapper • Approximate string matching: Soft-TFIDF • Regex NER (Rule-based): Stanford CoreNLP • Candidate entities filtering: POS filtering, Stopword removal • Labeling: B/I/O format, Entity type
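Two of the ingredients listed on this slide, a trie over dictionary terms and B/I/O labeling, can be illustrated with a small sketch. This is not PKDE4J's implementation (which extends Stanford CoreNLP in Java); it assumes pre-tokenized input, exact case-insensitive matching and a greedy longest match, and the names `build_trie`, `tag_bio` and the toy dictionary are hypothetical.

```python
END = None  # sentinel key marking a complete term; never collides with a str token


def build_trie(dictionary):
    """dictionary maps an entity type to a list of (possibly multi-word) terms."""
    trie = {}
    for entity_type, terms in dictionary.items():
        for term in terms:
            node = trie
            for token in term.lower().split():
                node = node.setdefault(token, {})
            node[END] = entity_type
    return trie


def tag_bio(tokens, trie):
    """Label tokens with B-/I-/O tags using greedy longest match against the trie."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                best = (j, node[END])  # remember the longest complete term so far
        if best:
            end, entity_type = best
            labels[i] = f"B-{entity_type}"
            for k in range(i + 1, end):
                labels[k] = f"I-{entity_type}"
            i = end
        else:
            i += 1
    return labels


if __name__ == "__main__":
    trie = build_trie({"Disease": ["colorectal cancer", "Lynch syndrome"], "Gene": ["MLH1"]})
    tokens = "MLH1 mutations cause Lynch syndrome".split()
    print(list(zip(tokens, tag_bio(tokens, trie))))
```

A real system would add the other steps named on the slide, such as string normalization, approximate matching and candidate filtering, before and after this lookup.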
  • 40. PKDE4J – Relation Extraction
  • 41. PKDE4J – Relation Extraction • Uses rules based on the dependency parse (grammatical structure) • To extract a relation, Step 1: Identify the verbs in a sentence
      Category   Number of verbs   Type         Verb examples
      Positive   68                Increase     Lead, Contribute, Rise
                                   Transmit     Shift, Move, Migrate
                                   Substitute   Supplement, Alter
      Negative   54                Decrease     Decline, Diffuse, Down-regulate
                                   Remove       Deplete, Abrogate, Disassociate
      Neutral    111               Contain      Possess, Constitute, Include
                                   Modify       Methylate, Modulate, Normalize
                                   Method       Bleach, Centrifuge, Spin
                                   Report       Evaluate, Analyze, Examine
      Plain      165               Plain        Return, Switch, Balance
  • 42. PKDE4J - RE Step 2: Check structure of sentence • Syntactic rules based on deep parsing – Dependency tree encodes grammatical relations between words in a sentence. – The tree denotes syntactic dependencies between two entities. – Need to spot the portion of the parse tree that is useful, i.e. pertinent to the location of entities in a sentence.
  • 43. PKDE4J - RE • Rule Extraction – Use Strategy design pattern – Capture predefined rules (17 strategies) ① Verb in dependency path ② No verb in dependency path ③ Detect nominalization ④ Weak nominalization ⑤ Negation ⑥ Tense (active / passive) ⑦ Contain clause ⑧ Clause distance ⑨ Negation clause ⑩ Number intervening entities ⑪ Entities in between ⑫ Surface distance ⑬ Entity counts ⑭ Same head ⑮ Entity order ⑯ Full tree path ⑰ Path length
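The Strategy design pattern mentioned on this slide can be sketched as a list of interchangeable rule objects applied to the dependency path between two entities. This is a loose illustration, not PKDE4J's Java code: the path is hand-written rather than produced by a parser, only three of the seventeen strategies are shown, and the logic inside each one (including the negation word list) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A dependency path between two entities, given here as (word, POS tag) pairs.
Path = List[Tuple[str, str]]


@dataclass(frozen=True)
class Strategy:
    """One named rule; every strategy shares the same call interface."""
    name: str
    apply: Callable[[Path], object]


def verb_in_path(path: Path) -> bool:
    # Penn Treebank verb tags all start with "VB".
    return any(pos.startswith("VB") for _, pos in path)


def negation(path: Path) -> bool:
    # Assumed cue list, for illustration only.
    return any(word.lower() in {"not", "no", "never"} for word, _ in path)


def path_length(path: Path) -> int:
    return len(path)


STRATEGIES = [
    Strategy("verb_in_path", verb_in_path),
    Strategy("negation", negation),
    Strategy("path_length", path_length),
]


def extract_features(path: Path, strategies=STRATEGIES) -> Dict[str, object]:
    """Run every strategy over the path and collect the results by name."""
    return {s.name: s.apply(path) for s in strategies}


if __name__ == "__main__":
    # Hand-written path for "... substitutes Ser for Pro ..."
    path = [("substitutes", "VBZ"), ("Ser", "NN"), ("for", "IN"), ("Pro", "NN")]
    print(extract_features(path))
```

Because each rule sits behind the same interface, adding or removing a strategy changes only the `STRATEGIES` list, which is the extensibility the pattern is meant to provide.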
  • 44. Evaluation: PKDE4J over Variome Corpus • Experimental set-up – Data split – Features? – 10-fold cross-validation • Focus on relations: Used gold standard entities • Baseline co-occurrence system
  • 45. Results of the evaluation: relation extraction results for relations with at least 100 examples in the corpus.
  • 46. Observations • By applying text mining we can transform the literature from an unstructured, difficult to use resource, to a structured resource. • We can build systems that can recognise core biological entities in the published literature. • With this, the information is more accessible – Formalised and normalised in a database – Directly query-able • and can be used to facilitate more computation: – Information retrieval in terms of entities – Predictive modeling and hypothesis generation
  • 47. Conclusions • Variants are relatively easy to recognise in the literature, when the recommended nomenclature is followed (so please use it!). • The relations between variants and other entities are harder to extract, but still we can do a reasonable job. • There is lots of information that is in ancillary files associated to the literature (with some challenges for automated systems). The literature can be effectively mined to identify variant-related information to assist biocuration and clinical interpretation of variants.
  • 48. © Copyright The University of Melbourne 2016