Ambiguity and Variability of Database and Software Names in Bioinformatics
SMBM 2012
Geraint Duck¹, Robert Stevens¹, David Robertson² and Goran Nenadic¹
¹School of Computer Science, ²Faculty of Life Sciences
The University of Manchester, Manchester, UK

Named Entity Recognition (NER)
• Variety of NER uses
– Species
– Gene/protein names
– Chemical names
• Variety of NER accuracy
– 95% F-score species (LINNAEUS)
– 73% F-score (strict) gene name (ABNER)
– Over 70% F-score chemical names (OSCAR3)
• Draw parallels for database and software NER

Example
PMC1660556; M. Watson

Challenges - Ambiguity
• leg
• white
• cab
• C. elegans
– 41 NCBI taxonomy species
• HIV
– Human immunodeficiency virus
– Human immunovirus
• analysis
• Network
• graph
• DIP
– distal interphalangeal
– Database of Interacting Proteins

Challenges - Variability
• NF-kappaB
• NF-kappa B
• NF-kappa-B
• NF-κB
• Case variants
• Spelling variants
• ClustalW
• Clustal W
• Clustal-W
• CLUSTAL W
• ClustalX (GUI)?
• Now: Clustal Omega
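
The case, spacing and hyphenation variants listed above can be collapsed onto a single lookup key. The following is a minimal sketch rather than the method used in the study; the `normalise` function and the `GREEK` mapping are hypothetical names introduced here for illustration, and only the one Greek letter shown on the slide is handled.

```python
import re
import unicodedata

# Hypothetical mapping: only the Greek letter that appears on the slide.
GREEK = {"κ": "kappa"}

def normalise(name: str) -> str:
    """Collapse case, spacing and hyphenation variants onto one key."""
    name = unicodedata.normalize("NFKC", name)
    for letter, spelled in GREEK.items():
        name = name.replace(letter, spelled)
    # Drop whitespace and hyphen-like characters, then fold case.
    return re.sub(r"[\s\-\u2010\u2011]+", "", name).casefold()

variants = ["NF-kappaB", "NF-kappa B", "NF-kappa-B", "NF-κB"]
keys = {normalise(v) for v in variants}  # all four share one key
```

Note that this kind of key deliberately keeps ClustalW and ClustalX apart, since they differ in a letter and not only in case or punctuation; whether they should count as the same resource is a separate annotation decision.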

Preliminary
• Annotation guidelines
– Database, software, package, ontology names
– Not file formats, algorithms, tasks, methods, database identifiers, programming languages, operating systems, etc.
• Gold standard corpus
– 25 from BMC Bioinformatics and PLoS Computational Biology; 5 from Genome Biology
• Dictionary of resource names
– 4,879 unique entries from 10 online resources

Preliminary
• Inter-annotator agreement
– F-score: 86%
• 30 documents
– 1319 total mentions
– 224 unique mentions

Measure      Databases     Software      Combined
Precision    0.79 (0.66)   0.99 (0.96)   0.93 (0.87)
Recall       0.67 (0.56)   0.84 (0.82)   0.80 (0.74)
F-measure    0.73 (0.61)   0.91 (0.88)   0.86 (0.80)

Total Number of Documents                    30
Total Database and Software Mentions         1319
Total Unique Resource Mentions               224
Percentage of Database Mentions              36%
Percentage of Unique DB Mentions             26%
Average Mentions per Document                44
Average Unique Mentions per Document         8.2
Max Mentions in a Single Document            227
Max Unique Mentions in a Document            33
Resources with only a Single Mention         117

Ambiguity and Variability
• Compared names to
– Acronym Dictionary: 1,933
– English Dictionary: 86,308
• Ambiguity in corpus:
– ≈ 2% (case-sensitive)
– ≈ 12% (case-insensitive)
• Ambiguity in names dictionary:
– ≈ 0.1% (case-sensitive)
– ≈ 0.5% (case-insensitive)
• 224 unique names
– 45 were variants
  • 15 acronyms
  • Orthographics
  • Spellings
– 179 different resources
  • 79% one variant
  • 17% two variants
  • 4% three variants
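
The ambiguity figures above depend on whether names are compared to the reference word lists with or without regard to case. A minimal sketch of that comparison follows; the `ambiguity_rate` function and the toy word list are assumptions made for illustration and are not the acronym or English dictionaries used in the study.

```python
def ambiguity_rate(names, vocabulary, case_sensitive=True):
    """Fraction of resource names that also occur in a general vocabulary."""
    if case_sensitive:
        hits = [n for n in names if n in vocabulary]
    else:
        folded = {w.casefold() for w in vocabulary}
        hits = [n for n in names if n.casefold() in folded]
    return len(hits) / len(names) if names else 0.0

# Toy data only: names taken from the slides, plus a tiny stand-in word list.
names = ["Network", "DIP", "ClustalW", "affy"]
english = {"network", "dip", "graph", "analysis"}

strict_rate = ambiguity_rate(names, english, case_sensitive=True)
loose_rate = ambiguity_rate(names, english, case_sensitive=False)
```

On this toy input the case-insensitive rate is higher than the case-sensitive one, which is the same direction of effect as the slide reports (≈ 2% versus ≈ 12% in the corpus).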

Name Composition
• Majority are single nouns
– includes acronyms
• 6% lowercase common nouns
– affy, bioconductor
• A few contained numbers
– S4, t2prhd
• A few misclassified as verbs
– …each query protein is first BLASTed with…
– …held near their equilibrium values using SHAKE.
– …graphical representations were achieved using dot v1.10…

POS pattern             Share
NNP                     68.0%
NNP NNP                 8.8%
NN                      5.7%
NNP NNP NNP             5.3%
NNP CD                  3.1%
NNP CD . CD             1.8%
NNP NNP NNP NNP NNP     1.3%
NNP LS                  0.9%
NNP NNP NNP NNP         0.9%
Other Patterns          4.4%

Name Composition
• Longest Names (most tokens)
– Corpus: 5 (Gene Expression Profile Analysis Suite)
– Dictionary: 12 (Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences)
• Evaluated (stemmed) token frequencies within the dictionary
– Long-tail curve
– 87% used only once
– High frequency words suggest common heads and bioinformatics related terms

Figure: Token Frequency within Dictionary Database and Software Names. Axes: Token Frequency; Top … Tokens (Words). Labelled tokens: Genome, Protein, Gene, Sequence, Human.
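
The token-frequency analysis described above can be approximated in a few lines. This is a sketch under stated assumptions: lowercasing stands in for the stemming step mentioned on the slide, whitespace splitting stands in for tokenisation, and the four names in `toy_dictionary` are an illustrative list, not the 4,879-entry dictionary.

```python
from collections import Counter

def token_frequencies(names):
    """Count how often each (lowercased) token occurs across all names."""
    return Counter(tok.casefold() for name in names for tok in name.split())

def singleton_share(freqs):
    """Fraction of distinct tokens that occur exactly once (the long tail)."""
    return sum(1 for c in freqs.values() if c == 1) / len(freqs)

# Illustrative input only.
toy_dictionary = [
    "Gene Expression Profile Analysis Suite",
    "Gene Ontology",
    "Protein Data Bank",
    "Database of Interacting Proteins",
]

freqs = token_frequencies(toy_dictionary)
share = singleton_share(freqs)
top = freqs.most_common(1)  # most frequent token and its count
```

Without stemming, "Protein" and "Proteins" are counted as different tokens here, which is one reason the slide's stemmed analysis would give somewhat different counts.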

Dictionary Matching
• F-score under 55%
– Low precision
  • GO (GO:0007089)
  • cycle
  • genomes
– Low recall: dictionary is not comprehensive
  • i Linker
  • xPedPhase
• 95% of mentions could be matched…

Dictionary matches            55.3%
Heads and Hearst patterns     9.7%
Title appearances             0.6%
References and URLs           1.9%
Version information           1.2%
Noun/Verb associations        20.3%
Comparisons                   5.8%
Remaining                     5.2%

Matching   TP    FP    FN    P     R     F
Lenient    729   633   590   54%   55%   54%
Strict     695   667   624   51%   53%   52%
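
The precision, recall and F-score values in the table follow from the TP, FP and FN counts using the standard definitions. The short sketch below recomputes them; the function name `prf` is introduced here for illustration.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the table above.
lenient = prf(729, 633, 590)
strict = prf(695, 667, 624)

# Rounded to whole percentages for comparison with the table.
lenient_pct = tuple(round(100 * x) for x in lenient)
strict_pct = tuple(round(100 * x) for x in strict)
```

In both rows TP + FN equals 1319, the total number of mentions in the corpus, so recall is measured against every annotated mention.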

Potential Clues
• Heads
– the stochastic simulator Dizzy allows ...
– The MethMarker software was ...
– ... system, PSPE, specifically to ...
– tools: CLUSTALW, ..., and MUSCLE.
– ... programs such as Simlink, ..., and SimPed.
• Titles
– CoXpress: differential co-expression in gene expression data
– TABASCO: A single molecule, base-pair resolved gene expression simulator
– SimHap GUI: An intuitive graphical user interface for genetic association analysis
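
Head-noun and Hearst-style clues such as those above lend themselves to simple regular expressions. The sketch below is an assumption-laden illustration, not the patterns used in the study: the `HEADS` word list, both compiled patterns and the `candidates` function are hypothetical, and the test sentences are assembled from names on the slide.

```python
import re

# Hypothetical list of head nouns; the slides do not give a definitive set.
HEADS = r"(?:software|tool|tools|program|programs|system|simulator|database|package)"

# "<Name> software": a capitalised token directly before a head noun.
NAME_THEN_HEAD = re.compile(rf"\b([A-Z][\w-]*)\s+{HEADS}\b")
# "programs such as A, B, and C": a Hearst-style enumeration after a head noun.
SUCH_AS = re.compile(rf"\b{HEADS}\s+such\s+as\s+([^.;]+)")
# Separators inside an enumeration: ",", ", and", or " and ".
SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")

def candidates(sentence: str) -> list[str]:
    """Return candidate resource names suggested by head/Hearst clues."""
    found = [m.group(1) for m in NAME_THEN_HEAD.finditer(sentence)]
    for m in SUCH_AS.finditer(sentence):
        found += [p.strip() for p in SEPARATOR.split(m.group(1)) if p.strip()]
    return found

a = candidates("The MethMarker software was tested.")
b = candidates("We compared tools such as CLUSTALW, MUSCLE, and Dizzy.")
```

Patterns like these only propose candidates; as the dictionary-matching results suggest, they would need to be combined with other evidence to keep precision acceptable.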

Potential Clues
• References
– Galaxy [18] and EpiGRAPH [19]
– The learning metrics principle [14,15] (FP)
• Versions
– using dot v1.10 and Graphviz 1.13(v16).
– CLUSTAL W version 1.83
– Dynalign 4.5, and LocARNA 0.99
• Comparisons
– xPedPhase did better than i Linker
– Cofolga2 with this cutoff PSVM gives a better false positive rate compared to RNAz
– Foldalign was much slower than Cofolga2 except for
– Like Moleculizer, Tabasco dynamically generates
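
Version information is one of the more mechanical clues listed above. As a rough sketch, a single regular expression can pick out a name followed by a dotted version number; the `VERSIONED` pattern and `versioned_names` function are hypothetical, and the pattern would also fire on strings such as "Table 1.2", so it is a candidate generator at best.

```python
import re

# A word (optionally followed by one capital letter, as in "CLUSTAL W"),
# then an optional "v" or "version", then a dotted version number.
VERSIONED = re.compile(
    r"\b([A-Za-z][\w-]*(?:\s[A-Z])?)\s+(?:version\s+|v)?(\d+(?:\.\d+)+)"
)

def versioned_names(text: str) -> list[tuple[str, str]]:
    """Return (name, version) pairs for tokens followed by a version number."""
    return [(m.group(1), m.group(2)) for m in VERSIONED.finditer(text)]

# Example phrases copied from the slide.
first = versioned_names("using dot v1.10 and Graphviz 1.13(v16).")
second = versioned_names("CLUSTAL W version 1.83")
third = versioned_names("Dynalign 4.5, and LocARNA 0.99")
```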

Potential Clues
• the SimHap GUI installation.
• implemented within PedPhase
• Our motivations for creating Tabasco
• MethMarker therefore provides
• A typical screenshot of MethMarker
• MethMarker’s user interface reflects
• Tested effect on precision
• Ran regular expression
• Percentage of sentences with resource name and that matched regex:
– ran|run(ning|s)?: 48%
– RAM: 50%
– Website: 77%
• … so are plausible clues.
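
One possible reading of the percentages above is: of the sentences that match a clue regex, what share also contains a resource name. The sketch below computes that quantity on toy data. The reading itself, the `clue_share` function, the example sentences and the added word boundaries (`\b`) around the slide's `ran|run(ning|s)?` pattern are all assumptions made for illustration.

```python
import re

def clue_share(sentences, pattern: str) -> float:
    """Share of regex-matching sentences that also contain a resource name.

    `sentences` is a list of (text, has_resource_name) pairs.
    """
    rx = re.compile(pattern)
    matched = [has_name for text, has_name in sentences if rx.search(text)]
    return sum(matched) / len(matched) if matched else 0.0

# Toy, hand-labelled sentences; not taken from the corpus.
toy = [
    ("We ran BLAST on each query protein.", True),
    ("The simulation was run twice.", False),
    ("Results are shown below.", False),
]

# Word boundaries are added so that words like "random" do not match.
share = clue_share(toy, r"\b(?:ran|run(?:ning|s)?)\b")
```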

Scope
• Database
• Software
• Method
• Approach
• Algorithm
• Task
• Programming Language
• Records/Identifiers
• File Formats
• Authors mix vocabulary
• Fuzzy distinction
• R language, R software
– Distinction?
• Microsoft Excel
– Lots of statistics
• Student's t-test
– Lots of statistics tools

Summary
• Annotation guidelines
• Annotated gold corpus
• Evaluated resource name mentions
– Composition
– Ambiguity
– Variability
• Dictionary match: < 55%
• Provide potential clues for capture
• Acknowledgments
– BBSRC
– Dan Jamieson (IAA)
• http://sourceforge.net/projects/bionerds/
• Thank-you!
• Questions?

