Similarity Search in Large Datasets using Gene Ontology
COMPUTATIONAL INFORMATICS
Heiko Müller, David Rozado, Mat Cook, Ashfaqur Rahman

Similarity search in large sets of genes using semantic similarity of Gene Ontology annotations (Heiko Müller)


Over the past years, more than 30 different semantic similarity measures for GO annotations have been proposed. In the first part of the presentation I will give an overview of the strengths and weaknesses of these methods. In the second part I will talk about our efforts to develop algorithms that allow time-efficient semantic similarity search in large sets of genes (e.g. UniProt).

Published in: Technology, Education


  1. Similarity Search in Large Datasets using Gene Ontology. COMPUTATIONAL INFORMATICS. Heiko Müller, David Rozado, Mat Cook, Ashfaqur Rahman
  2. Two example gene sets, labelled N. perurans and N. pemaquidensis on the slide.
     First set:
     Gene01: ACGGTAGGCTAGACTAGATATTAACG
     Gene02: CCTGAGTACCTGGACTAGATAC
     Gene03: GATGCGGTTACGTACGATCCATGGA
     Gene04: CATTTATTATATATACGCGCGCGA
     Gene05: TTTCGATAGGGGATATATTAACGCCG
     Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC
     Gene07: GATAGACTCGCGCCGATATATAG
     Gene08: ATATATTTCCTAGATCGAGAGATAC
     Gene09: GATAGGTTAATTAATTTCCTATAT
     Gene10: TGGATTGGATAGCGCGATAGATC
     Gene11: AAAAGTCGATAAGGCTAGAGCTAG
     Gene12: GGATATAGATATATCTAGATATC
     Gene13: CGATATAGCCCTCTAGAGATACTTT
     Gene14: GATACCCGCGATATATCAT
     Gene15: TAGATCCCCGAGATAGAGACT
     Gene16: CACCATAGAAGACTGATCGAGATAG
     Second set:
     Gene01: GGCTAGACTAGATATTAACGACGGTA
     Gene02: AGTACCTGGACTAGCCTGTAC
     Gene03: GATGCGGTTACGCCATTACGAT
     Gene04: GATATATATATATACGCGCGCGA
     Gene05: CATTTATGGGATATATTAACGCCG
     Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC
     Gene07: GATAGACTCGCGCCGATATATAG
     Gene08: TCCTAGATCAGATCGAGAGATAC
     Gene09: GATAGGTTAATTAATTTCCTATAT
     Gene10: GCGATCCTATGGATAGCAGATC
     Gene11: AAAAGTCGATAAGGCTAGAGCTAG
     Gene12: GGATATAGATATATCTAGATATC
     Gene13: CGATATAGCCAGAAGTCGAACTTT
     Gene14: GATACCCGCGCTCTATATATCAT
     Gene15: TAGATCCCCGAGATAGAGACT
     Gene16: CACCATAGAAGACTGATCGAGATAG
     Compare sets of genes and gene products to discover:
     1. Similarities between them.
     2. The most dissimilar genes in each dataset.
  3. Outline: Semantic Similarity Search; Algorithms for Comparing Large Datasets; Results. (Slide footer: Semantic Similarity Search in Large Datasets | Heiko Müller)
  4. Gene Ontology (GO). [Figure: example from the Molecular Function ontology]
  5. GO Annotations. GOA(g1) = {GO:0055100, GO:0070122}. True Path Rule: "[...] the pathway from a child term all the way up to its top-level parent(s) must always be true".
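One practical consequence of the True Path Rule is that a gene annotated with a term is implicitly annotated with every ancestor of that term. The following is a minimal sketch of that expansion. The ontology in `PARENTS` is a hypothetical toy graph with made-up term names, not real GO structure, and `closure` is an illustrative helper name.

```python
from typing import Dict, Set

# Hypothetical toy ontology (assumption): child term -> direct parents.
# These names and edges are illustrative and are not taken from GO.
PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:binding": {"T:root"},
    "T:catalytic": {"T:root"},
    "T:hormone_binding": {"T:binding"},
    "T:peptidase": {"T:catalytic"},
    "T:specific": {"T:hormone_binding", "T:peptidase"},
}


def closure(terms: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    seen: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # already expanded via another path in the DAG
        seen.add(term)
        stack.extend(parents.get(term, ()))
    return seen


# A gene annotated with one term inherits every term on the path to the root.
expanded = closure({"T:hormone_binding"}, PARENTS)
```

Working on the expanded sets means that two genes annotated with different but related terms still share their common ancestors.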
  6. Semantic Similarity. GOA(g1) = {GO:0055100, GO:0070122}, GOA(g2) = {GO:0030332, GO:0070012}.
     • Annotations provide an objective representation for comparing genes on functional aspects.
     • A semantic similarity measure quantifies relationships between (sets of) GO terms.
     sim(g1, g2) = ?
  7. Term Specificity. Quantify the semantics, or information content (ic), of GO terms. [Figure: term pairs labelled "less similar" and "more similar"]
     Corpus-based: $ic(t) = -\log(P(t))$
     Structure-based: $ic(t) = depth(t) \cdot \left(1 - \frac{\log(desc(t) + 1)}{\log(total\_terms)}\right)$
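The two information-content definitions on this slide can be written as small functions. This is a sketch under stated assumptions: the slide does not give the logarithm base, so `base` is a parameter defaulting to 2, and the function and parameter names are illustrative. The structure-based variant is a ratio of logarithms, so it does not depend on the base.

```python
import math


def ic_corpus(term_annotations: int, total_annotations: int, base: float = 2.0) -> float:
    """Corpus-based IC: -log P(t), with P(t) estimated as a relative frequency.

    The logarithm base is an assumption; the slide does not state it.
    """
    p = term_annotations / total_annotations
    return -math.log(p, base)


def ic_structure(depth: int, n_descendants: int, total_terms: int) -> float:
    """Structure-based IC: depth(t) * (1 - log(desc(t) + 1) / log(total_terms))."""
    return depth * (1.0 - math.log(n_descendants + 1) / math.log(total_terms))


# A rarely used term is more specific than a frequently used one.
rare = ic_corpus(1, 16)
common = ic_corpus(8, 16)

# A deep leaf term (no descendants) gets an IC equal to its depth.
leaf = ic_structure(depth=3, n_descendants=0, total_terms=1000)
```

Neither function guards against a term with zero annotations or an ontology with a single term; a real implementation would need to handle those cases.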
  8. Group-wise Semantic Similarity. GOA(g1) = {GO:0055100, GO:0070122}, GOA(g2) = {GO:0030332, GO:0070012}.
  9. Group-wise Similarity (X. Chen et al., Gene, 509 (2012)).
     $sim(g_1, g_2) = \frac{1}{2} \left( \frac{IC(g_1 \cap g_2)}{IC(g_1)} + \frac{IC(g_1 \cap g_2)}{IC(g_2)} \right)$
     IC(g1) = 10.6609, IC(g2) = 9.7925, IC(g1 ∩ g2) = 2.7925, sim(g1, g2) = 0.2736
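The slide's numbers can be checked directly against the group-wise formula. The sketch below plugs in the three IC values from the slide; the function name `groupwise_sim` is illustrative.

```python
def groupwise_sim(ic_g1: float, ic_g2: float, ic_common: float) -> float:
    """Average of the shared information content relative to each gene."""
    return 0.5 * (ic_common / ic_g1 + ic_common / ic_g2)


# Values from the slide: IC(g1), IC(g2), IC(g1 ∩ g2).
sim = groupwise_sim(10.6609, 9.7925, 2.7925)
print(round(sim, 4))  # 0.2736
```

The measure is 1 only when both genes carry exactly the same information, and it shrinks as either gene has information that the other lacks.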
  10. Outline: Semantic Similarity Search; Algorithms for Comparing Large Datasets; Results.
  11. Gene Identifier Sets. [Figure: GO terms labelled with gene identifier sets such as 1-5, 2-5, 3-4 and 3,5]
      D1:
      1 = g11: GO:0003824, GO:0005488
      2 = g12: GO:0016787, GO:0042562
      3 = g13: GO:0008233, GO:0031406
      4 = g14: GO:0005515, GO:0016787
      5 = g15: GO:0055100, GO:0070122
      D2:
      1 = g21: GO:0003824, GO:0005488
      2 = g22: GO:0016829, GO:0042562
      3 = g23: GO:0043168, GO:0008233
      4 = g24: GO:0055100, GO:0070012
      5 = g25: GO:0004325, GO:0043177
  12. Exhaustive Search. GIDS = gene identifier set; "-" marks an empty set.
      Term        IC      GIDS-D1  GIDS-D2
      GO:0070012  4.0000  -        4
      GO:0070122  3.5212  5        -
      GO:0055100  3.0000  5        4
      GO:0004325  2.7734  -        5
      GO:0031406  2.2228  3        -
      GO:0043177  1.6616  3        5
      GO:0008233  1.6305  3,5      3-4
      GO:0043168  1.3777  3        3
      GO:0042562  1.3472  2,5      2,4
      GO:0036094  0.8873  3        5
      GO:0043167  0.8624  3        3
      GO:0016829  0.6347  -        2,5
      GO:0005515  0.5123  4-5      4
      GO:0016787  0.4144  2-5      3-4
      GO:0005488  0.1898  1-5      1-5
      GO:0003824  0.0455  1-5      1-5
      [Figure: accumulators IC-D1 and IC-D2 (one entry per gene 1-5) and the pairwise matrix IC-D12; values shown on the slide: 4, 3.52, 3, 7, 6.52]
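One plausible reading of this slide is a single pass over the term table that accumulates IC per gene (IC-D1, IC-D2) and per gene pair (IC-D12), followed by the group-wise similarity for every pair. The sketch below follows that reading using the table as transcribed from the slide. The function names, the string encoding of identifier sets, and the exact accumulation scheme are assumptions rather than the authors' implementation.

```python
from collections import defaultdict
from itertools import product
from typing import List

# (term, IC, gene ids in D1, gene ids in D2), transcribed from the slide.
TABLE = [
    ("GO:0070012", 4.0000, "", "4"),
    ("GO:0070122", 3.5212, "5", ""),
    ("GO:0055100", 3.0000, "5", "4"),
    ("GO:0004325", 2.7734, "", "5"),
    ("GO:0031406", 2.2228, "3", ""),
    ("GO:0043177", 1.6616, "3", "5"),
    ("GO:0008233", 1.6305, "3,5", "3-4"),
    ("GO:0043168", 1.3777, "3", "3"),
    ("GO:0042562", 1.3472, "2,5", "2,4"),
    ("GO:0036094", 0.8873, "3", "5"),
    ("GO:0043167", 0.8624, "3", "3"),
    ("GO:0016829", 0.6347, "", "2,5"),
    ("GO:0005515", 0.5123, "4-5", "4"),
    ("GO:0016787", 0.4144, "2-5", "3-4"),
    ("GO:0005488", 0.1898, "1-5", "1-5"),
    ("GO:0003824", 0.0455, "1-5", "1-5"),
]


def parse_ids(spec: str) -> List[int]:
    """Expand a set such as '2-5' or '3,5' into a list of gene identifiers."""
    ids: List[int] = []
    for part in filter(None, spec.split(",")):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def exhaustive(table):
    """Accumulate IC per gene and per gene pair, then score every pair."""
    ic_d1 = defaultdict(float)   # IC of each gene in D1
    ic_d2 = defaultdict(float)   # IC of each gene in D2
    ic_d12 = defaultdict(float)  # shared IC of each (D1 gene, D2 gene) pair
    for _term, ic, spec1, spec2 in table:
        genes1, genes2 = parse_ids(spec1), parse_ids(spec2)
        for g in genes1:
            ic_d1[g] += ic
        for g in genes2:
            ic_d2[g] += ic
        for pair in product(genes1, genes2):
            ic_d12[pair] += ic
    sims = {
        (a, b): 0.5 * (common / ic_d1[a] + common / ic_d2[b])
        for (a, b), common in ic_d12.items()
    }
    return ic_d1, ic_d2, ic_d12, sims


ic_d1, ic_d2, ic_d12, sims = exhaustive(TABLE)
```

With this table, `ic_d1[5]` sums to 10.6609, which matches IC(g1) for GOA(g1) = {GO:0055100, GO:0070122} on the group-wise similarity slide. The pairwise accumulation is what makes the approach exhaustive: a term shared by many genes in both datasets touches every combination of them.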
  13. Similarity-based Ranking.
      $simrank(g_1, g_2) = IC(g_1 \cap g_2) \cdot sim(g_1, g_2)$
      sim(g1,g2) = 1, simrank(g1,g2) = 0.2353
      sim(g3,g4) = 0.82, simrank(g3,g4) = 14.0304
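The slide's example shows why similarity alone is a poor ranking key: a pair of identical but uninformative genes scores sim = 1, while a pair sharing far more information scores lower. Weighting by the shared IC reverses that order. A minimal sketch follows, with illustrative function names. The first call uses two genes annotated only with GO:0005488 and GO:0003824 (0.1898 + 0.0455 = 0.2353), and the second uses IC values derived from the example term table for gene 5 in D1 and gene 4 in D2.

```python
def groupwise_sim(ic_g1: float, ic_g2: float, ic_common: float) -> float:
    """Group-wise similarity from the IC of each gene and their shared IC."""
    return 0.5 * (ic_common / ic_g1 + ic_common / ic_g2)


def simrank(ic_g1: float, ic_g2: float, ic_common: float) -> float:
    """Similarity weighted by the amount of shared information."""
    return ic_common * groupwise_sim(ic_g1, ic_g2, ic_common)


# Identical genes with little information: sim = 1, but the rank stays low.
low = simrank(0.2353, 0.2353, 0.2353)

# Informative but only partially overlapping genes rank much higher.
high = simrank(10.6609, 11.1397, 7.1397)
```

Because sim is at most 1, simrank can never exceed the shared IC, which in turn can never exceed the IC of either gene.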
  14. Top-k Search (uses the same term table as slide 12).
      IC-D1 (genes 1-5): 0.24, 2, 9.29, 1.16, 10.7
      IC-D2 (genes 1-5): 0.24, 2.22, 4.52, 11.1, 6.19
      Top-5 list after each step (D1 gene, D2 gene, score):
      Step 1: 5,4 4.68; 5,3 0.82; 5,2 0.68; 5,1 0.12; 5,5 0.01
      Step 2: 5,4 4.68; 3,3 3.36; 3,5 1.04; 5,3 0.82; 5,2 0.68
      Step 3: 5,4 4.68; 3,3 3.36; 2,2 1.19; 2,4 1.18; 3,5 1.04
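The intermediate lists on this slide are consistent with processing the D1 genes in descending order of IC (gene 5, then 3, then 2) while keeping the best five pairs seen so far. The sketch below reproduces that behaviour. The stopping rule is an assumption and not necessarily the authors' pruning criterion: since simrank is bounded by the shared IC, which is bounded by IC(g1), the scan stops once the next gene's IC cannot beat the current k-th best score. All function names are illustrative.

```python
import heapq
from collections import defaultdict
from typing import List

# (term, IC, gene ids in D1, gene ids in D2), transcribed from the slide.
TABLE = [
    ("GO:0070012", 4.0000, "", "4"), ("GO:0070122", 3.5212, "5", ""),
    ("GO:0055100", 3.0000, "5", "4"), ("GO:0004325", 2.7734, "", "5"),
    ("GO:0031406", 2.2228, "3", ""), ("GO:0043177", 1.6616, "3", "5"),
    ("GO:0008233", 1.6305, "3,5", "3-4"), ("GO:0043168", 1.3777, "3", "3"),
    ("GO:0042562", 1.3472, "2,5", "2,4"), ("GO:0036094", 0.8873, "3", "5"),
    ("GO:0043167", 0.8624, "3", "3"), ("GO:0016829", 0.6347, "", "2,5"),
    ("GO:0005515", 0.5123, "4-5", "4"), ("GO:0016787", 0.4144, "2-5", "3-4"),
    ("GO:0005488", 0.1898, "1-5", "1-5"), ("GO:0003824", 0.0455, "1-5", "1-5"),
]


def parse_ids(spec: str) -> List[int]:
    """Expand a set such as '2-5' or '3,5' into a list of gene identifiers."""
    ids: List[int] = []
    for part in filter(None, spec.split(",")):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def top_k(table, k: int):
    """Return the k best (score, (d1_gene, d2_gene)) pairs and the number of D1 genes scanned."""
    ic = {}
    terms1, terms2 = defaultdict(set), defaultdict(set)
    for term, value, spec1, spec2 in table:
        ic[term] = value
        for g in parse_ids(spec1):
            terms1[g].add(term)
        for g in parse_ids(spec2):
            terms2[g].add(term)

    def total(terms) -> float:
        return sum(ic[t] for t in terms)

    ic1 = {g: total(ts) for g, ts in terms1.items()}
    ic2 = {g: total(ts) for g, ts in terms2.items()}

    heap = []  # min-heap holding the current top-k (score, pair) entries
    scanned = 0
    for g1 in sorted(ic1, key=ic1.get, reverse=True):
        # Assumed bound: simrank <= IC(g1 ∩ g2) <= IC(g1), so no later gene can improve the list.
        if len(heap) == k and ic1[g1] <= heap[0][0]:
            break
        scanned += 1
        for g2 in terms2:
            common = total(terms1[g1] & terms2[g2])
            score = common * 0.5 * (common / ic1[g1] + common / ic2[g2])
            entry = (score, (g1, g2))
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True), scanned


best, scanned = top_k(TABLE, 5)
```

On this example the final list matches Step 3 on the slide, and the scan stops after four of the five D1 genes. The bound is loose, so the saving here is small; on a large dataset, where many genes have a low IC, skipping the tail of the sorted list matters far more.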
  15. Outline: Semantic Similarity Search; Algorithms for Comparing Large Datasets; Results.
  16. Results. Runtime on MF (438,406 entries with GO annotations), UniProt Swiss-Prot (Rel. 2014_02):
      Method      Runtime
      Baseline    > 2 days
      Exhaustive  ~ 45 min
      Top 10,000  2.5 - 4.5 min
      Top 1,000   1 - 3.5 min
      Top 100     15 sec - 2.5 min
  17. Results (cont.). How does it compare to sequence similarity search?
      • Compare Top 10,000 matches against results from 'BLASTing' Swiss-Prot against itself ($e = 10^{-4}$).
      [Figure: number of similar pairs in the Top 10,000 that are not included in BLAST results; axis from 0 to 8000; labels MF-ALL, MF-CUR, CORPUS, STRUCTURE]
  18. Thank you. Heiko Müller, e heiko.mueller@csiro.au, t +61 3 6232 5575, COMPUTATIONAL INFORMATICS