Extracting patterns of database and software usage from the bioinformatics literature
Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson and Robert Stevens
The University of Manchester, UK
http://www.cs.man.ac.uk/~duckg/
http://bionerds.sourceforge.net/networks/

Introduction
• Methods are fundamental to science
  – Judgement
  – Replication
  – Extension
• Methods in bioinformatics:
  – In silico: Data and tools
  – Workflows
    • Objective representation
    • Sharing and reuse

Bioinformatics
• Resource-focused domain: “Resourceome”
  – Our research suggests:
    • Around 200,000 unique resources in the literature
    • Over 4 million mentions
    • … and still growing!
• Resource/method search and selection…
  – Best-practice
  – Common-practice
• What are the main patterns in bioinformatics resources, and associated methods?

Approach
• Use bioinformatics literature (to answer this question)
• Extract database and software mentions
• Combine resources to form pairs
• Combine pairs to form patterns
  – Common-practice
  – Method?
[Figure: example resources BLAST, Modeller, PROCHECK, ClustalW, PHYLIP]

Document Collection
• PubMed Central open-access full-text articles
• Bioinformatics[MeSH]
• 22,376 articles
• 67 journals
• 3 journals were > 50% of total documents
[Figure: y-axis "Number of Documents" (0 to 4,500), x-axis "Year" (1998 to 2014)]

bioNerDS
• bioNerDS
  – Bioinformatics named entity recogniser for databases and software
  – Full-text; mention level
  – Rule-based
  – F-score 63-91%
  – Previously compared resource usage in:
    • Genome Biology
    • BMC Bioinformatics
• Networks filter:
  – 702,937 total mentions
  – 167,697 document-level mentions
  – 31,053 unique names
  – 93% single mention
• Duck et al. (2013) BMC Bioinformatics
http://bionerds.sourceforge.net/

bioNerDS
Genome Biology
• “Biological” focus
  – GenBank
  – Ensembl
  – GEO
  – GO
BMC Bioinformatics
• “Resource” focus
  – R
  – PDB
  – PubMed
http://bionerds.sourceforge.net/

Mention Filtering
• Filter out resources mentioned in fewer than 2 documents
  – Removed 25% of mentions
  – Removes less likely names
• Generic resources
  – R
  – Bioconductor
• Categorise to database/software
  – Removed some ‘unknown’ resources
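
The document-frequency filter on this slide can be sketched as below. This is a minimal illustration, not the bioNerDS implementation: the input format (a list of `(doc_id, name)` tuples), the function name and the example mentions (including the made-up `MyLabTool`) are all assumptions.

```python
from collections import defaultdict

def filter_by_document_frequency(mentions, min_docs=2):
    """Keep mentions whose resource name occurs in at least min_docs distinct documents.

    mentions: list of (doc_id, name) tuples (hypothetical input format).
    """
    docs_per_name = defaultdict(set)
    for doc_id, name in mentions:
        docs_per_name[name].add(doc_id)
    return [(doc_id, name) for doc_id, name in mentions
            if len(docs_per_name[name]) >= min_docs]

# Toy data: "MyLabTool" is an invented name that appears in one document only.
mentions = [
    ("doc1", "BLAST"), ("doc1", "BLAST"), ("doc1", "MyLabTool"),
    ("doc2", "BLAST"), ("doc2", "ClustalW"), ("doc3", "ClustalW"),
]
kept = filter_by_document_frequency(mentions)
```

Counting distinct documents rather than raw mentions means a name repeated many times in one article is still dropped, which matches the "minimum of 2 documents" wording.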

Methods Sections
• Removed resources not in the methods section
  – Method or non-method
• Regular-expression-based title detection
  – Tested on 100 articles
  – Precision: 97%; Recall: 79%
• Resulting in:
  – 69,466 database mentions (1,711 unique)
  – 65,451 software mentions (3,289 unique)
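
A rough sketch of regular-expression-based section title detection, together with the precision/recall calculation used to evaluate it. The pattern below is hypothetical: the slide does not give the expression used in the study, so the set of accepted titles here is an assumption chosen for illustration.

```python
import re

# Hypothetical pattern; the actual expression used in the study is not given.
METHODS_TITLE = re.compile(
    r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?"            # optional section number, e.g. "2." or "2.1"
    r"(?:materials?\s+and\s+methods?"          # "Materials and Methods"
    r"|methods?(?:\s+and\s+materials?)?"       # "Methods", "Methods and Materials"
    r"|methodology"
    r"|experimental\s+procedures?)\s*$",
    re.IGNORECASE,
)

def is_methods_title(title):
    """Return True if a section title looks like a methods section heading."""
    return METHODS_TITLE.match(title) is not None

def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall from evaluation counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```

A strict, anchored pattern like this favours precision over recall: unusual headings are missed rather than misclassified, which is the same trade-off the reported figures (97% precision, 79% recall) suggest.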

Extracting Pairs
• Co-occurrence within text
• Two sets of pairs:
  – Software-only pairs
  – Database and software pairs (any combination of)
• This provided us with:
  – 22,880 software pairs (13,965 unique)
  – 54,562 database/software pairs (29,066 unique)
• Removed pairs found only within a single document
  – 53% of the software pairs
  – 46% of the database/software pairs
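
One way to picture the pairing step is the sketch below. The slide only says pairs come from co-occurrence within text; treating *adjacent* mentions, in order of appearance, as a directed pair is an assumption made here for illustration, as are the function names and the toy documents.

```python
from collections import defaultdict

def extract_directed_pairs(doc_mentions):
    """Build directed pairs from adjacent mentions (an assumed pairing rule).

    doc_mentions: dict mapping doc_id -> list of resource names in order of appearance.
    Returns a dict mapping (first, second) -> list of doc_ids, one entry per occurrence.
    """
    pair_docs = defaultdict(list)
    for doc_id, names in doc_mentions.items():
        for first, second in zip(names, names[1:]):
            if first != second:  # skip a resource followed by itself
                pair_docs[(first, second)].append(doc_id)
    return dict(pair_docs)

def drop_single_document_pairs(pair_docs):
    """Remove pairs that were only ever seen within a single document."""
    return {pair: docs for pair, docs in pair_docs.items() if len(set(docs)) > 1}

# Toy documents (hypothetical mention sequences).
doc_mentions = {
    "doc1": ["BLAST", "ClustalW", "PHYLIP"],
    "doc2": ["BLAST", "ClustalW", "MEGA"],
    "doc3": ["Phred", "Phrap"],
}
pairs = extract_directed_pairs(doc_mentions)
kept_pairs = drop_single_document_pairs(pairs)
```

In this toy input only BLAST followed by ClustalW survives, because it is the only pair seen in more than one document.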

Common Pairs
• With sufficient data, the most common order of a pairing is the correct one…
• Binomial test (null hypothesis: each order is equally likely)
• Two confidence thresholds:
  – 95%
    • 2,518 software pairs (145 unique)
    • 7,001 database/software pairs (297 unique)
  – 99%
    • 1,450 software pairs (55 unique)
    • 3,383 database/software pairs (95 unique)
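
The ordering test can be sketched with an exact binomial tail probability. The slide does not say whether the test was one-sided or two-sided; the one-sided form below is an assumption, and the function names and example counts are hypothetical.

```python
from math import comb

def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def preferred_order(a, b, count_ab, count_ba, confidence=0.95):
    """Return the dominant ordering of a pair, or None if it is not significant.

    Null hypothesis: "a then b" and "b then a" are equally likely (p = 0.5).
    One-sided test on the more frequent order (an assumption, see lead-in).
    """
    n = count_ab + count_ba
    k = max(count_ab, count_ba)
    if count_ab == count_ba or binomial_tail(k, n) > 1 - confidence:
        return None
    return (a, b) if count_ab > count_ba else (b, a)

# Hypothetical counts: 9 occurrences in one order, 1 in the other.
at_95 = preferred_order("BLAST", "ClustalW", 9, 1, confidence=0.95)
at_99 = preferred_order("BLAST", "ClustalW", 9, 1, confidence=0.99)
```

With 9 against 1, the tail probability is 11/1024 (about 0.011), so the order passes the 95% threshold but not the 99% one. This is why the stricter threshold on the slide keeps far fewer unique pairs.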

Most Common Pairs

Software only pairs
Directed Pair            Count   %
BLAST → ClustalW         205     14.1
BLAST → PSI-BLAST        103     7.1
Phred → Phrap            89      6.1
ClustalW → MEGA          77      5.3
Cluster → Tree View      75      5.2
Phrap → Consed           51      3.5
Modeller → PROCHECK      44      3.0
BLAST → ClustalX         43      3.0
ClustalW → PHYLIP        41      2.8
BLAST → MUSCLE           40      2.8

Software and database pairs
Directed Pair            Count   %
GO → KEGG                350     10.3
BLAST → GO               195     5.8
BLAST → ClustalW         150     4.4
GEO → GO                 129     3.8
Phred → Phrap            89      2.6
BLAST → PSI-BLAST        87      2.6
PDB → Modeller           85      2.5
Swiss-Prot → TrEMBL      82      2.4
Ensembl → BioMart        82      2.4
ClustalW → MEGA          77      2.3

Resource Patterns
Databases
• Data sources
• GO is an exception
  – Major sink
  – Data annotation
• Numerous ‘same’ links
  – Enumeration in text?
Software
• Data sinks
• Represents the primary in silico pipeline(s)
• Again, sequence alignment is central
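
Whether a resource behaves as a source or a sink can be read off a directed pair list by comparing outgoing and incoming counts. The sketch below is an illustration only: the function name is hypothetical, and the input is a handful of rows from the "Software and database pairs" table, so the balances it produces cover only those rows and not the full network.

```python
from collections import Counter

def degree_balance(directed_pairs):
    """Sum outgoing and incoming pair counts per resource.

    directed_pairs: dict mapping (source, target) -> count.
    Returns a dict mapping name -> (outgoing, incoming).
    """
    outgoing, incoming = Counter(), Counter()
    for (source, target), count in directed_pairs.items():
        outgoing[source] += count
        incoming[target] += count
    names = set(outgoing) | set(incoming)
    return {name: (outgoing[name], incoming[name]) for name in names}

# Five rows taken from the "Software and database pairs" table.
top_pairs = {
    ("GO", "KEGG"): 350,
    ("BLAST", "GO"): 195,
    ("GEO", "GO"): 129,
    ("PDB", "Modeller"): 85,
    ("Ensembl", "BioMart"): 82,
}
balance = degree_balance(top_pairs)
```

In these rows PDB, GEO and Ensembl only have outgoing links, in line with databases acting as data sources, while GO is the only database that also receives links (from BLAST and GEO).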

Patterns through Time
[Figure panels: 2004 to 2006; 2007 to 2009]

Patterns through Time
[Figure panel: 2010 to 2012]

Phylogenetics Patterns
• Case-study…
• Eales et al. (2008) BMC Bioinformatics, 9, 359
  – Mapped phylogenetics methods into 4 steps:
    • Sequence Alignment
    • Tree Inference
    • Statistical Testing
    • Tree Visualisation
  – Using the same corpus selection, we built a network…
    • PubMed search for “phylogen*” in titles or abstracts

Phylogenetics Patterns
[Figure]

Phylogenetics Patterns
• Our automated extraction can recreate these steps
  – Given some ambiguous resources
• Encouraging…
  – Viable in silico pattern extraction
  – “Common practice”
• Next step: Apply this to other (sub-)domains

Conclusion
• Can extract patterns of resource usage
  – Can we describe the method through these?
• High-level overview of common-practice
  – With lower thresholds, can access resources specific (but “common”) to different subdomains
  – Not best-practice…
• Workflows?
  – Requires increased granularity
  – Could help inform their creation

Thank-you
• Acknowledgements
  – Co-authors
  – Manchester IT Services
    • Computational facilities
  – Funding:
  – Travel:

ECCB 2014: Extracting patterns of database and software usage from the bioinformatics literature
