SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Geraint Duck¹, Robert Stevens¹, David Robertson² and Goran Nenadic¹
¹School of Computer Science, ²Faculty of Life Sciences
The University of Manchester, Manchester, UK

Named Entity Recognition (NER)
• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Variety of NER accuracy
  – 95% F-score, species (LINNAEUS)
  – 73% F-score (strict), gene names (ABNER)
  – Over 70% F-score, chemical names (OSCAR3)
• Draw parallels for database and software NER

Example
PMC1660556; M. Watson

Challenges - Ambiguity
• leg
• white
• cab
• C. elegans
  – 41 NCBI taxonomy species
• HIV
  – Human immunodeficiency virus
  – Human immunovirus
• analysis
• Network
• graph
• DIP
  – distal interphalangeal
  – Database of Interacting Proteins

Challenges - Variability
• NF-kappaB
• NF-kappa B
• NF-kappa-B
• NF-κB
• Case variants
• Spelling variants
• ClustalW
• Clustal W
• Clustal-W
• CLUSTAL W
• ClustalX (GUI)?
• Now: Clustal Omega
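
The surface variants listed on this slide can be collapsed to a shared key with a small normalisation step. The sketch below is only an illustration, not the method used in this work: the function name `normalise_name`, the `GREEK` mapping, and the rules themselves (case folding, spelling out Greek letters, dropping spaces and hyphens) are all assumptions.

```python
import re
import unicodedata

# Hypothetical mapping; only the Greek letter seen on the slide is included.
GREEK = {"κ": "kappa"}

def normalise_name(name: str) -> str:
    """Collapse case, spacing, hyphen, and Greek-letter variants to one key."""
    text = unicodedata.normalize("NFKC", name)
    for letter, spelled in GREEK.items():
        text = text.replace(letter, spelled)
    # Drop whitespace and hyphen/dash-like characters, then fold case.
    text = re.sub(r"[\s\-\u2010\u2011\u2012\u2013]+", "", text)
    return text.casefold()

nfkb_variants = ["NF-kappaB", "NF-kappa B", "NF-kappa-B", "NF-κB"]
clustal_variants = ["ClustalW", "Clustal W", "Clustal-W", "CLUSTAL W"]

nfkb_keys = {normalise_name(v) for v in nfkb_variants}
clustal_keys = {normalise_name(v) for v in clustal_variants}
print(nfkb_keys, clustal_keys)
```

Under these rules ClustalX keeps its own key, so whether it counts as a variant of ClustalW or as a separate resource remains an annotation decision rather than something the normaliser settles.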

Preliminary
• Annotation guidelines
  – Database, software, package, ontology names
  – Not file formats, algorithms, tasks, methods, database identifiers, programming languages, operating systems, etc.
• Gold standard corpus
  – 25 from BMC Bioinformatics and PLoS Computational Biology; 5 from Genome Biology
• Dictionary of resource names
  – 4,879 unique entries from 10 online resources

Preliminary
• Inter-annotator agreement
  – F-score: 86%
• 30 documents
  – 1319 total mentions
  – 224 unique mentions

             Databases     Software      Combined
Precision    0.79 (0.66)   0.99 (0.96)   0.93 (0.87)
Recall       0.67 (0.56)   0.84 (0.82)   0.80 (0.74)
F-measure    0.73 (0.61)   0.91 (0.88)   0.86 (0.80)

Total Number of Documents                  30
Total Database and Software Mentions       1319
Total Unique Resource Mentions             224
Percentage of Database Mentions            36%
Percentage of Unique DB Mentions           26%
Average Mentions per Document              44
Average Unique Mentions per Document       8.2
Max Mentions in a Single Document          227
Max Unique Mentions in a Document          33
Resources with only a Single Mention       117

Ambiguity and Variability
• Compared names to
  – Acronym Dictionary: 1,933
  – English Dictionary: 86,308
• Ambiguity in corpus:
  – ≈ 2% (case-sensitive)
  – ≈ 12% (case-insensitive)
• Ambiguity in names dictionary:
  – ≈ 0.1% (case-sensitive)
  – ≈ 0.5% (case-insensitive)
• 224 unique names
  – 45 were variants
    • 15 acronyms
    • Orthographics
    • Spellings
  – 179 different resources
    • 79% one variant
    • 17% two variants
    • 4% three variants
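
The ambiguity figures on this slide compare resource names against general-purpose word lists, with and without case sensitivity. A minimal sketch of that kind of comparison follows; `ambiguity_rate` is a hypothetical helper, and the two short lists are toy data for illustration, not the dictionaries used in the study.

```python
def ambiguity_rate(names, vocabulary, case_sensitive=True):
    """Fraction of names that also occur in a general-purpose vocabulary."""
    if case_sensitive:
        vocabulary = set(vocabulary)
    else:
        names = [n.lower() for n in names]
        vocabulary = {w.lower() for w in vocabulary}
    hits = sum(1 for n in names if n in vocabulary)
    return hits / len(names) if names else 0.0

# Toy data for illustration only.
resource_names = ["DIP", "Network", "ClustalW", "SHAKE", "Galaxy"]
english_words = ["dip", "network", "shake", "galaxy", "analysis"]

strict = ambiguity_rate(resource_names, english_words, case_sensitive=True)
loose = ambiguity_rate(resource_names, english_words, case_sensitive=False)
print(f"case-sensitive: {strict:.0%}, case-insensitive: {loose:.0%}")
```

The toy lists exaggerate the effect, but they show why ignoring case raises the measured ambiguity: capitalisation alone separates many resource names from common words.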

Name Composition
• Majority are single nouns
  – includes acronyms
• 6% lowercase common nouns
  – affy, bioconductor
• A few contained numbers
  – S4, t2prhd
• A few misclassified as verbs
  – …each query protein is first BLASTed with…
  – …held near their equilibrium values using SHAKE.
  – …graphical representations were achieved using dot v1.10…

POS pattern               Share
NNP                       68.0%
NNP NNP                   8.8%
NN                        5.7%
NNP NNP NNP               5.3%
NNP CD                    3.1%
NNP CD . CD               1.8%
NNP NNP NNP NNP NNP       1.3%
NNP LS                    0.9%
NNP NNP NNP NNP           0.9%
Other Patterns            4.4%
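
A distribution of part-of-speech patterns like the one on this slide can be tallied once each name has been tagged. The sketch below assumes the tagging has already been done; `pattern_distribution` is a hypothetical helper, and the hand-tagged input is toy data rather than output from the tagger used in this work.

```python
from collections import Counter

def pattern_distribution(tagged_names):
    """Return the share of each POS-tag sequence across tagged names."""
    counts = Counter(" ".join(tag for _, tag in name) for name in tagged_names)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.most_common()}

# Hand-tagged toy input as (token, tag) pairs; a real run would use a POS tagger.
tagged = [
    [("BLAST", "NNP")],
    [("Galaxy", "NNP")],
    [("Clustal", "NNP"), ("W", "NNP")],
    [("bioconductor", "NN")],
]
dist = pattern_distribution(tagged)
for pattern, share in dist.items():
    print(f"{pattern:10s} {share:.1%}")
```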

Name Composition
• Longest Names (most tokens)
  – Corpus: 5 tokens (Gene Expression Profile Analysis Suite)
  – Dictionary: 12 tokens (Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences)
• Evaluated (stemmed) token frequencies within the dictionary
  – Long-tail curve
  – 87% used only once
  – High-frequency words suggest common heads and bioinformatics-related terms
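
Token frequencies of this kind can be counted directly over the dictionary entries. The sketch below is a simplification under stated assumptions: it lowercases tokens instead of stemming them, `token_frequencies` is a hypothetical helper, and the third name in the list is a made-up entry added so that some tokens repeat.

```python
from collections import Counter

def token_frequencies(names):
    """Count lowercased tokens across a list of resource names."""
    counts = Counter()
    for name in names:
        counts.update(token.lower() for token in name.split())
    return counts

# The first two names come from the slide; the third is hypothetical.
names = [
    "Gene Expression Profile Analysis Suite",
    "Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences",
    "Protein Analysis Toolkit",
]
freq = token_frequencies(names)
once = sum(1 for n in freq.values() if n == 1)
print(freq.most_common(2))
print(f"{once / len(freq):.0%} of distinct tokens occur once")
```

Even on three names the long-tail shape is visible: a couple of tokens repeat while most appear a single time.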

[Figure: Token Frequency within Dictionary Database and Software Names. Y-axis: token frequency; x-axis: top tokens (words). Labelled tokens: genome, protein, gene, sequence, human.]

Dictionary Matching
• F-score under 55%
  – Low precision
    • GO (GO:0007089)
    • cycle
    • genomes
  – Low recall: dictionary is not comprehensive
    • i Linker
    • xPedPhase
• 95% of mentions could be matched…

Dictionary matches           55.3%
Heads and Hearst patterns    9.7%
Title appearances            0.6%
References and URLs          1.9%
Version information          1.2%
Noun/Verb associations       20.3%
Comparisons                  5.8%
Remaining                    5.2%

           TP    FP    FN    P     R     F
Lenient    729   633   590   54%   55%   54%
Strict     695   667   624   51%   53%   52%
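
The precision, recall and F-score values on this slide follow from the TP/FP/FN counts by the standard definitions. The short check below recomputes them; `prf` is a hypothetical helper name, and the counts are the ones reported on the slide.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from raw match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the dictionary-matching slide.
results = {
    "Lenient": prf(729, 633, 590),
    "Strict": prf(695, 667, 624),
}
for label, (p, r, f) in results.items():
    print(f"{label}: P={p:.0%} R={r:.0%} F={f:.0%}")
```

In both rows TP + FN equals 1319, which agrees with the total number of mentions in the gold standard corpus.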

Potential Clues
• Heads
  – the stochastic simulator Dizzy allows ...
  – The MethMarker software was ...
  – ... system, PSPE, specifically to ...
  – tools: CLUSTALW, ..., and MUSCLE.
  – ... programs such as Simlink, ..., and SimPed.
• Titles
  – CoXpress: differential co-expression in gene expression data
  – TABASCO: A single molecule, base-pair resolved gene expression simulator
  – SimHap GUI: An intuitive graphical user interface for genetic association analysis
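
Head-noun and Hearst-style clues like those on this slide could be approximated with regular expressions. The sketch below is one possible reading, not the patterns used by the authors: the `HEADS` list, both regexes and the helper `hearst_candidates` are assumptions, and the first test sentence is adapted from the slide with the elided items left out.

```python
import re

# Hypothetical head nouns; the slides do not give a full list.
HEADS = r"(?:software|tool|tools|program|programs|simulator|system|database)"

# "<head> such as A, B, and C" (Hearst-style enumeration).
SUCH_AS = re.compile(rf"\b{HEADS} such as ([^.]+)")
# "<Name> <head>", for example "MethMarker software".
NAME_HEAD = re.compile(rf"\b([A-Z]\w+) {HEADS}\b")

def hearst_candidates(sentence: str):
    """Return candidate names listed after '<head> such as'."""
    match = SUCH_AS.search(sentence)
    if not match:
        return []
    items = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1))
    return [item.strip() for item in items if item.strip()]

listed = hearst_candidates("We used programs such as Simlink, and SimPed.")
headed = NAME_HEAD.findall("The MethMarker software was used.")
print(listed, headed)
```

Patterns this simple only propose candidates; each hit would still need filtering before being treated as a resource name.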

Potential Clues
• References
  – Galaxy [18] and EpiGRAPH [19]
  – The learning metrics principle [14,15] (FP)
• Versions
  – using dot v1.10 and Graphviz 1.13(v16).
  – CLUSTAL W version 1.83
  – Dynalign 4.5, and LocARNA 0.99
• Comparisons
  – xPedPhase did better than i Linker
  – Cofolga2 with this cutoff PSVM gives a better false positive rate compared to RNAz
  – Foldalign was much slower than Cofolga2 except for
  – Like Moleculizer, Tabasco dynamically generates
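
Version and citation clues of this kind lend themselves to simple patterns. The sketch below is an assumption-laden illustration, not the authors' implementation: `VERSION` and `CITED` are hypothetical regexes, and `VERSION` captures only the single token before the number, so a multi-token name such as CLUSTAL W would be truncated.

```python
import re

# Single token followed by an optional "version "/"v" and a dotted number.
VERSION = re.compile(r"\b(\w+) (?:version |v)?(\d+(?:\.\d+)+)")
# Capitalised token directly followed by a numeric citation such as "[18]".
CITED = re.compile(r"\b([A-Z]\w+) \[\d+(?:,\s*\d+)*\]")

versions = VERSION.findall("using dot v1.10 and Graphviz 1.13(v16).")
more_versions = VERSION.findall("Dynalign 4.5, and LocARNA 0.99")
cited = CITED.findall("Galaxy [18] and EpiGRAPH [19]")
not_cited = CITED.findall("The learning metrics principle [14,15]")

print(versions, more_versions, cited, not_cited)
```

Requiring a capitalised token before the citation happens to skip the "learning metrics principle" sentence here, but that is a property of this example and should not be read as a general fix for false positives.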

Potential Clues
• the SimHap GUI installation.
• implemented within PedPhase
• Our motivations for creating Tabasco
• MethMarker therefore provides
• A typical screenshot of MethMarker
• MethMarker's user interface reflects
• Tested effect on precision
• Ran regular expression
• Percentage of sentences with resource name and that matched regex:
  – ran|run(ning|s)?: 48%
  – RAM: 50%
  – Website: 77%
• … so are plausible clues.
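
The percentages on this slide can be read as a precision-style check on each clue. The sketch below takes one interpretation, which is an assumption: of the sentences matching a regex, what share also contains a resource name. `clue_precision` is a hypothetical helper, the sentences are toy data, and word boundaries were added to the slide's `ran|run(ning|s)?` pattern so that it does not fire inside words such as "random".

```python
import re

def clue_precision(sentences, resource_names, pattern):
    """Share of pattern-matching sentences that also mention a resource name."""
    regex = re.compile(pattern, re.IGNORECASE)
    matched = [s for s in sentences if regex.search(s)]
    if not matched:
        return 0.0
    with_name = [s for s in matched if any(n in s for n in resource_names)]
    return len(with_name) / len(matched)

# Toy sentences for illustration only.
sentences = [
    "We ran MethMarker on the test set.",
    "The analysis was run twice.",
    "Tabasco runs on a desktop machine.",
    "Results are shown in Figure 2.",
]
names = ["MethMarker", "Tabasco"]
share = clue_precision(sentences, names, r"\b(?:ran|run(?:ning|s)?)\b")
print(f"{share:.0%} of matching sentences mention a resource")
```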

Scope
• Database
• Software
• Method
• Approach
• Algorithm
• Task
• Programming Language
• Records/Identifiers
• File Formats
• Authors mix vocabulary
• Fuzzy distinction
• R language, R software
  – Distinction?
• Microsoft Excel
  – Lots of statistics
• Student's t-test
  – Lots of statistics tools

Summary
• Annotation guidelines
• Annotated gold corpus
• Evaluated resource name mentions
  – Composition
  – Ambiguity
  – Variability
• Dictionary match: < 55%
• Provide potential clues for capture
• Acknowledgments
  – BBSRC
  – Dan Jamieson (IAA)
• http://sourceforge.net/projects/bionerds/
• Thank you!
• Questions?
