SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Presentation given at SMBM 2012, to present our paper at the conference:
http://dx.doi.org/10.5167/uzh-64476


  1. 1. Ambiguity and Variability of Database and Software Names in Bioinformatics SMBM 2012 Geraint Duck1, Robert Stevens1, David Robertson2 and Goran Nenadic1 1School of Computer Science, 2Faculty of Life Sciences, The University of Manchester, Manchester, UK
  2. 2. Named Entity Recognition (NER) • Variety of NER uses – Species – Gene/protein names – Chemical names • Variety of NER accuracy – 95% F-score species (LINNAEUS) – 73% F-score (strict) gene name (ABNER) – Over 70% F-score chemical names (OSCAR3) • Draw parallels for database and software NER 2
  3. 3. Example PMC1660556; M. Watson 3
  4. 4. Challenges - Ambiguity • leg • white • cab • C. elegans – 41 NCBI taxonomy species • HIV – Human immunodeficiency virus – Human immunovirus • analysis • Network • graph • DIP – distal interphalangeal – Database of Interacting Proteins 4
  5. 5. Challenges - Variability • NF-kappaB • NF-kappa B • NF-kappa-B • NF-κB • Case variants • Spelling variants • ClustalW • Clustal W • Clustal-W • CLUSTAL W • ClustalX (GUI)? • Now: Clustal Omega 5
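A minimal sketch (not from the paper) of how such surface variants can be collapsed onto one canonical key by case-folding and stripping hyphens and whitespace; the variant lists come from the slide, the normalisation rule itself is an assumption.

    import re

    def normalise(name: str) -> str:
        """Collapse case, hyphen and whitespace variation into one key.
        Illustrative assumption, not the paper's method."""
        return re.sub(r"[\s\-]+", "", name).lower()

    variants = ["NF-kappaB", "NF-kappa B", "NF-kappa-B",
                "ClustalW", "Clustal W", "Clustal-W", "CLUSTAL W"]
    for v in variants:
        print(f"{v!r:>15} -> {normalise(v)}")
    # NF-κB would additionally need Greek-letter expansion (κ -> kappa) to merge.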
  6. 6. Preliminary • Annotation guidelines – Database, software, package, ontology names – Not file formats, algorithms, tasks, methods, database identifiers, programming languages, operating systems, etc. • Gold standard corpus – 25 from BMC Bioinformatics and PLoS Computational Biology; 5 from Genome Biology • Dictionary of resource names – 4,879 unique entries from 10 online resources 6
  7. 7. Preliminary • Inter-annotator agreement – F-score: 86% • 30 documents – 1319 total mentions – 224 unique mentions
     Agreement by class:
       Databases: Precision 0.79 (0.66), Recall 0.67 (0.56), F-measure 0.73 (0.61)
       Software: Precision 0.99 (0.96), Recall 0.84 (0.82), F-measure 0.91 (0.88)
       Combined: Precision 0.93 (0.87), Recall 0.80 (0.74), F-measure 0.86 (0.80)
     Corpus statistics:
       Total number of documents: 30
       Total database and software mentions: 1319
       Total unique resource mentions: 224
       Percentage of database mentions: 36%
       Percentage of unique DB mentions: 26%
       Average mentions per document: 44
       Average unique mentions per document: 8.2
       Max mentions in a single document: 227
       Max unique mentions in a document: 33
       Resources with only a single mention: 117
     7
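If it helps to see how such span-level agreement figures are produced, here is a minimal sketch (toy spans and offsets are assumptions, not the authors' code) scoring exact-match agreement between two annotators as an F-measure:

    def span_f_measure(spans_a, spans_b):
        """Exact-span agreement between two annotators, reported as F-measure.
        Annotator A is treated as the reference; exact matching makes F symmetric."""
        a, b = set(spans_a), set(spans_b)
        tp = len(a & b)
        precision = tp / len(b) if b else 0.0
        recall = tp / len(a) if a else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Toy (start, end, label) spans with hypothetical offsets.
    ann_a = {(10, 17, "Software"), (42, 45, "Database"), (90, 96, "Software")}
    ann_b = {(10, 17, "Software"), (42, 45, "Database"), (120, 128, "Software")}
    print(round(span_f_measure(ann_a, ann_b), 2))  # 0.67 on this toy example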
  8. 8. Ambiguity and Variability • Compared names to – Acronym Dictionary: 1,933 – English Dictionary: 86,308 • Ambiguity in corpus: – ≈ 2% (case-sensitive) – ≈ 12% (case-insensitive) • Ambiguity in names dictionary: – ≈ 0.1% (case-sensitive) – ≈ 0.5% (case-insensitive) • 224 unique names – 45 were variants • 15 acronyms • Orthographics • Spellings – 179 different resources • 79% one variant • 17% two variants • 4% three variants 8
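One way to read the case-sensitive vs case-insensitive gap: lowercasing makes many more resource names collide with ordinary English words. A small sketch of measuring that overlap (the word list and names here are toy data; the study compared against an 86,308-entry English dictionary and 1,933 acronyms):

    def ambiguity_rate(resource_names, word_list, case_sensitive=True):
        """Fraction of resource names that also appear in a general word list."""
        if not case_sensitive:
            word_list = {w.lower() for w in word_list}
            resource_names = [n.lower() for n in resource_names]
        return sum(n in word_list for n in resource_names) / len(resource_names)

    english = {"graph", "network", "analysis", "white", "cab", "leg"}  # toy list
    names = ["Graph", "BLAST", "Cytoscape", "analysis", "R"]           # toy names
    print(ambiguity_rate(names, english, case_sensitive=True))   # 0.2
    print(ambiguity_rate(names, english, case_sensitive=False))  # 0.4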
  9. 9. Name Composition • Majority are single nouns – includes acronyms • 6% lowercase common nouns – affy, bioconductor • A few contained numbers – S4, t2prhd • A few misclassified as verbs – …each query protein is first BLASTed with… – …held near their equilibrium values using SHAKE. – …graphical representations were achieved using dot v1.10…
     Part-of-speech patterns of names:
       NNP: 68.0% | NNP NNP: 8.8% | NN: 5.7% | NNP NNP NNP: 5.3% | NNP CD: 3.1%
       NNP CD . CD: 1.8% | NNP NNP NNP NNP NNP: 1.3% | NNP LS: 0.9% | NNP NNP NNP NNP: 0.9%
       Other patterns: 4.4%
     9
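The patterns are Penn Treebank tags (NNP proper noun, NN common noun, CD cardinal number, LS list item marker). The slide does not say which tagger produced them; as a self-contained illustration, a crude shape-based stand-in already reproduces the common patterns:

    import re

    def crude_tag(token: str) -> str:
        """Very rough Penn-Treebank-style shape tags; illustration only."""
        if re.fullmatch(r"\d+(\.\d+)*", token):
            return "CD"    # pure number token
        if token[0].isupper():
            return "NNP"   # capitalised token treated as a proper noun
        return "NN"        # lowercase token treated as a common noun

    for name in ["BLAST", "Gene Expression Profile Analysis Suite",
                 "bioconductor", "Graphviz 1.13"]:
        print(name, "->", " ".join(crude_tag(t) for t in name.split()))
    # BLAST -> NNP
    # Gene Expression Profile Analysis Suite -> NNP NNP NNP NNP NNP
    # bioconductor -> NN
    # Graphviz 1.13 -> NNP CD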
  10. 10. Name Composition • Longest Names (most tokens) – Corpus: 5 – Gene Expression Profile Analysis Suite – Dictionary: 12 – Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences • Evaluated (stemmed) token frequencies within the dictionary – Long-tail curve – 87% used only once – High frequency words suggest common heads and bioinformatics related terms 10
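A minimal sketch of such a stemmed token-frequency count (the slide does not name the stemmer; Porter stemming via NLTK is assumed here, and the dictionary slice is hypothetical):

    from collections import Counter
    from nltk.stem import PorterStemmer  # assumption: Porter stemming

    stemmer = PorterStemmer()
    dictionary = ["Gene Expression Profile Analysis Suite",   # hypothetical slice
                  "Gene Ontology", "Sequence Ontology", "Protein Data Bank"]

    counts = Counter(stemmer.stem(tok.lower())
                     for name in dictionary
                     for tok in name.split())
    print(counts.most_common(3))  # top stems; 'gene' appears twice in this toy slice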
  11. 11. [Figure: token frequency within the dictionary of database and software names, showing the long-tail distribution described on the previous slide] 11
  12. 12. Dictionary Matching • F-score under 55% – Low precision • GO (GO:0007089) • cycle • genomes – Low recall, dictionary not comprehensive • i Linker • xPedPhase • 95% of mentions could be matched…
     Breakdown of how mentions could be matched:
       Dictionary matches: 55.3% | Heads and Hearst patterns: 9.7% | Title appearances: 0.6%
       References and URLs: 1.9% | Version information: 1.2% | Noun/verb associations: 20.3%
       Comparisons: 5.8% | Remaining: 5.2%
     Dictionary matching results:
       Lenient: TP 729, FP 633, FN 590, P 54%, R 55%, F 54%
       Strict: TP 695, FP 667, FN 624, P 51%, R 53%, F 52%
     12
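For reference, the precision, recall and F figures follow from the standard definitions over those counts; a quick check in Python:

    def prf(tp, fp, fn):
        """Standard precision, recall and F1 from match counts."""
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        return p, r, 2 * p * r / (p + r)

    for label, counts in {"Lenient": (729, 633, 590), "Strict": (695, 667, 624)}.items():
        p, r, f = prf(*counts)
        print(f"{label}: P={p:.0%} R={r:.0%} F={f:.0%}")
    # Lenient: P=54% R=55% F=54%
    # Strict: P=51% R=53% F=52%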
  13. 13. Potential Clues • Heads – the stochastic simulator Dizzy allows ... – The MethMarker software was ... – ... system, PSPE, specifically to ... – tools: CLUSTALW, ..., and MUSCLE. – ... programs such as Simlink, ..., and SimPed. • Titles – CoXpress: differential co-expression in gene expression data – TABASCO: A single molecule, base-pair resolved gene expression simulator – SimHap GUI: An intuitive graphical user interface for genetic association analysis 13
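A rough sketch of the head-noun cue (the specific regex is an illustrative assumption, not the paper's implementation): a class word such as "software", "simulator" or "tool" adjacent to a capitalised candidate name.

    import re

    HEAD = r"(?:software|tools?|simulator|database|programs?|system)"
    NAME = r"[A-Z][\w\-]*"
    # Either "the <modifier> <head> <Name>" or "The <Name> <head>".
    head_cue = re.compile(rf"(?:the\s+\w+\s+{HEAD}\s+(?P<after>{NAME})"
                          rf"|The\s+(?P<before>{NAME})\s+{HEAD})")

    for s in ["the stochastic simulator Dizzy allows flexible modelling.",
              "The MethMarker software was used for primer design."]:
        m = head_cue.search(s)
        if m:
            print(m.group("after") or m.group("before"))
    # Dizzy
    # MethMarker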
  14. 14. Potential Clues • References – Galaxy [18] and EpiGRAPH [19] – The learning metrics principle [14,15] • Versions – using dot v1.10 and Graphviz 1.13(v16). – CLUSTAL W version 1.83 – Dynalign 4.5, and LocARNA 0.99 • Comparisons – xPedPhase did better than i Linker – Cofogla2 with this cutoff PSVM gives a better false positive rate compared to RNAz – Foldalign was much slower than Cofolga2 except for – Like Moleculizer, Tabasco dynamically generates 14
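Version mentions are similarly easy to sketch as a surface cue (an illustrative regex, not the authors' code); note that lowercase names such as "dot" slip past this naive capitalisation requirement, which is part of why simple cues only get so far:

    import re

    version_cue = re.compile(
        r"(?P<name>[A-Z][\w\-]*(?:\s+[A-Z][\w\-]*)*)\s+(?:v|version\s+)?(?P<ver>\d+(?:\.\d+)+)")

    for text in ["using dot v1.10 and Graphviz 1.13",
                 "CLUSTAL W version 1.83",
                 "Dynalign 4.5, and LocARNA 0.99"]:
        for m in version_cue.finditer(text):
            print(m.group("name"), "->", m.group("ver"))
    # Graphviz -> 1.13   (misses the lowercase "dot v1.10")
    # CLUSTAL W -> 1.83
    # Dynalign -> 4.5
    # LocARNA -> 0.99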
  15. 15. Potential Clues • the SimHap GUI installation. • implemented within PedPhase • Our motivations for creating Tabasco • MethMarker therefore provides • A typical screenshot of MethMarker • MethMarker's user interface reflects • Tested effect on precision • Ran regular expression • Percentage of sentences with resource name and that matched regex: – ran|run(ning|s)? • 48% – RAM • 50% – Website • 77% • … so are plausible clues. 15
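A minimal sketch of that measurement (toy sentences and a toy name list; the percentages on the slide come from the annotated corpus): of the sentences containing a known resource name, what fraction also match a cue regex such as ran|run(ning|s)?

    import re

    def cue_coverage(sentences, resource_names, cue):
        """Among sentences mentioning a known resource name, the fraction
        that also match the cue regex."""
        cue_re = re.compile(cue, re.IGNORECASE)
        with_name = [s for s in sentences if any(n in s for n in resource_names)]
        return sum(bool(cue_re.search(s)) for s in with_name) / len(with_name)

    sentences = ["We ran BLAST against the nr database.",
                 "MethMarker therefore provides a workflow.",
                 "ClustalW was run with default parameters."]
    print(cue_coverage(sentences, {"BLAST", "MethMarker", "ClustalW"},
                       r"\bran\b|\brun(ning|s)?\b"))  # ≈ 0.67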
  16. 16. Scope • Database • Software • Method • Approach • Algorithm • Task • Programming Language • Records/Identifiers • File Formats • Authors mix vocabulary • Fuzzy distinction • R language, R software – Distinction? • Microsoft Excel – Lots of statistics • Student's t-test – Lots of statistics tools 16
  17. 17. Summary • Annotation guidelines • Annotated gold corpus • Evaluated resource name mentions – Composition – Ambiguity – Variability • Dictionary match: < 55% • Provide potential clues for capture • Acknowledgments – BBSRC – Dan Jamieson – IAA • http://sourceforge.net/projects/bionerds/ • Thank you! • Questions? 17
