A search engine for phylogenetic tree databases - D. Fernández-Baca

  1. A search engine for phylogenetic tree databases. David Fernández-Baca. Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU) and J. Gordon Burleigh (NESCent)
  2. PhyloFinder: http://pilin.cs.iastate.edu/phylofinder/
  3. Outline
      • Introduction
      • PhyloFinder queries
      • Implementation
      • Future directions
      • Acknowledgements
  5. Issues in Phylogenetic Databases
      • Taxonomic consistency
        • Species may appear in multiple trees by different but synonymous names.
        • Homonyms
        • Misspellings
      • Querying capability
        • Storage/representation
        • Exploiting classification trees (e.g., NCBI)
      • Clustering capabilities
        • Distance measures
      • Aggregation (synthesis) capabilities
        • Supertrees
      • Visualization
  6. Classification trees and phylogenies
  7. Exploiting taxonomic classifications
      • The leaves in phylogenetic trees may represent different taxonomic levels.
      • A classification tree allows us to locate trees that contain a taxon, as well as its descendants or ancestors.
        • E.g., a “Pinaceae” query would identify trees that contain “Pinus thunbergii” or “Abies alba”.
  8. TreeBASE (Piel, Donoghue, & Sanderson, 1996)
  9. TreeBASE capabilities
      • Search by
        • taxon
        • author
        • citation
        • study accession number
        • matrix accession number
        • structure (topology)
      • Tree surfing
  10. TreeBASE limitations
      • Taxonomic name consistency
      • Querying
        • Few options
        • Does not exploit classification
          • Can’t identify ancestors/descendants
      • Visualization
      • Clustering and aggregation (supertrees)
  11. PhyloFinder
      • A search engine for tree databases
        • Not a database
      • Allows powerful phylogenetic queries
        • Handles synonymous taxonomic names (via TBMap)
        • Handles misspellings
        • Exploits taxonomic classification
        • Offers precise options for identifying different types of subtrees and metrics for identifying similar trees
        • Provides a visualization tool with links to GenBank and TBMap
      • Fast
        • Efficient storage and filtering
  12. PhyloFinder Design
      • Uses simple but powerful techniques
        • Inverted index for filtering
        • Nested-set representation of trees
          • Least common ancestor queries directly on database
        • Off-the-shelf spell-checking technology
      • Can be used with any phylogenetic database
        • E.g., PhyLoTA browser
        • However, set-up is not (yet) automatic
  13. Outline
      • Introduction
      • Queries
      • Storage and querying
      • Acknowledgements
  14. PhyloFinder Queries
      • Taxonomic queries involve a single taxon or set of taxa.
      • Phylogenetic queries take as input a phylogenetic tree
        • Locate trees that match it in some specified way.
  15. Taxonomic Queries
      • Contains: Given a list of taxa, return all trees that contain all or any of these names.
        • Similar to Boolean “AND” and “OR” searches.
        • Automatically searches for synonymous taxa
      • Related: Given a taxon, find all trees involving it or any of its descendants in the NCBI taxonomy.
        • E.g., if the query taxon is “birds”, identify all trees that contain bird taxa.
      • Pathlength: Given a pair of taxa, return all trees containing them, along with the distance between them in each tree.
  16. Taxonomic Queries: Contains
  17. Phylogenetic Queries
      • Tree mining: Given a query tree Q, find the database trees that exhibit Q in some way. Options:
        • Return the trees that have Q as an embedded subtree.
        • Return the trees that refine Q.
      • Similarity: Given a query tree Q and a specified similarity measure, return the trees in the database ranked by decreasing similarity to Q.
        • Requires an overlap of at least 3 taxa.
  18. Phylogenetic queries: Notation 1
      • T(A) is the minimal subtree of T that contains the leaves in A.
      [Figure: example tree with leaves a, b, c, d, e, f, g]
  21. Phylogenetic queries: Notation 2
      • T|A is obtained from T(A) by suppressing all internal nodes that have only one child.
      [Figure: example tree with leaves a, b, c, d, e, f, g]
  22. Phylogenetic queries
      • Let Q be a query tree with leaf set A.
      • Q is an embedded subtree of T if and only if it is identical to T|A.
      • Q is refined by T (T refines Q) if T|A is a refinement of Q.
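The definitions of T|A, embedded subtree, and refinement can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not PhyloFinder's implementation: trees are nested tuples with string leaves, the helper names and the example tree are hypothetical, and refinement is tested with the standard cluster-containment criterion (every cluster of Q must also be a cluster of T|A).

```python
def leaves(tree):
    """Leaf labels of a tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Return T|A: prune leaves outside `keep`, then suppress one-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def clusters(tree):
    """Set of leaf sets found below the internal nodes of `tree`."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    if tree is not None:
        walk(tree)
    return found


def is_embedded(q, t):
    """Q is an embedded subtree of T iff Q is identical to T|A."""
    a = leaves(q)
    return a <= leaves(t) and clusters(q) == clusters(restrict(t, a))


def is_refined_by(q, t):
    """Q is refined by T iff T|A is a refinement of Q."""
    a = leaves(q)
    return a <= leaves(t) and clusters(q) <= clusters(restrict(t, a))


# Hypothetical database tree on the leaves a..g (shape chosen for illustration).
T = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))
Q_RESOLVED = (("a", "c"), ("d", "f"))   # identical to T|{a,c,d,f}
Q_STAR = ("a", "c", "d", "f")           # unresolved; T|{a,c,d,f} refines it
Q_CONFLICT = (("a", "d"), ("c", "f"))   # groups a with d, which T does not
```

Comparing cluster sets rather than the tuples themselves makes the test independent of the order in which children are written.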
  23. Phylogenetic queries
  24. Phylogenetic queries: Embedded
  25. Phylogenetic queries: Refined by
  26. Phylogenetic queries: Refined by. Q embedded in T ⇒ Q refined by T
  27. Phylogenetic queries: Embedded
  28. Similarity queries
      • Return trees ranked by a similarity score
        • Score is a percentage between 0 and 100% reflecting how similar query tree is to candidate tree.
      • PhyloFinder’s similarity measures:
        • Robinson-Foulds (RF) similarity
        • Least common ancestor (LCA) similarity
      • Score takes degree of taxon overlap into account.
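The slides do not give the formulas behind these scores. As a rough illustration only, the sketch below computes one common normalization of a Robinson-Foulds similarity on rooted trees: both trees are restricted to their shared taxa, and the score is the share of non-root clusters on which they agree. The normalization, the function names, the nested-tuple tree encoding, and the choice to ignore the degree of taxon overlap are all assumptions, so this should not be read as PhyloFinder's scoring.

```python
def leaves(tree):
    """Leaf labels of a tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune leaves outside `keep` and suppress one-child internal nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def clusters(tree):
    """Set of leaf sets found below the internal nodes of `tree`."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found


def rf_similarity(t1, t2):
    """Percentage of agreeing non-root clusters on the shared taxa.

    Returns None when fewer than 3 taxa are shared (the overlap required
    for similarity queries).
    """
    shared = leaves(t1) & leaves(t2)
    if len(shared) < 3:
        return None
    root = {frozenset(shared)}  # present in every tree, so uninformative
    c1 = clusters(restrict(t1, shared)) - root
    c2 = clusters(restrict(t2, shared)) - root
    total = len(c1) + len(c2)
    if total == 0:
        return 100.0  # both restrictions are star trees
    return 100.0 * (1 - len(c1 ^ c2) / total)
```

A score that also rewards a larger overlap, as the last bullet describes, would need an additional weighting term that this sketch leaves out.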
  30. System architecture
  31. Least Common Ancestors (LCAs)
      [Figure: example tree with leaves a, b, c, d, e, f, g]
  32. Least Common Ancestors (LCAs)
      [Figure: example tree with leaves a, b, c, d, e, f, g, showing LCA(b, e)]
  33. Storage: Nested intervals
      • Ancestor/descendant relationship is easy to determine
        • The between predicate defines subtrees
      • LCAs are easily computed
        • Find common ancestor with largest Node_ID
      [Figure: example tree with leaves a, b, d, c, e, f; each node is labeled (Node_ID, RMD_ID): (1,10), (2,9), (3,5), (10,10), (4,4), (5,5), (6,6), (7,9), (8,8), (9,9)]
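The nested-interval idea can be tried out on the node labels from slide 33. The sketch below assumes that Node_ID is a preorder number and that RMD_ID is the ID of a node's rightmost descendant, which is consistent with the numbers on the slide; the function names are hypothetical, and a plain Python list stands in for the database table.

```python
# (Node_ID, RMD_ID) pairs copied from the example tree on slide 33.
NODES = [(1, 10), (2, 9), (3, 5), (4, 4), (5, 5),
         (6, 6), (7, 9), (8, 8), (9, 9), (10, 10)]


def is_ancestor_or_self(u, v):
    """The "between" predicate: v's ID lies inside u's interval."""
    return u[0] <= v[0] <= u[1]


def subtree(u, nodes=NODES):
    """All nodes in the subtree rooted at u, in Node_ID order."""
    return [n for n in nodes if is_ancestor_or_self(u, n)]


def lca(x, y, nodes=NODES):
    """Least common ancestor: the common ancestor with the largest Node_ID."""
    common = [n for n in nodes
              if is_ancestor_or_self(n, x) and is_ancestor_or_self(n, y)]
    return max(common, key=lambda n: n[0])
```

In a relational database the same two conditions would typically become a `BETWEEN` filter plus a `MAX(Node_ID)` aggregate, which is why no tree needs to be loaded into memory to answer an LCA query.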
  34. Storage: Inverted index
      • For each taxon, store a list of all trees that contain it.
      • Easy to find trees containing any or all elements in a list of taxa
        • Used as a filter
      [Figure: example postings lists for the taxa Cornus, Spigelia, and Hedera: 1 2 3 5 8 13 21 34; 2 4 8 16 32 64 128; 13 16]
  35. Building the inverted index
      • Input trees: 1: (((man,pan),gorilla),pongo), 2: (((human,coprinus),cryptomonas),zea_mays), . . . , N: (((dogs,homo_sapiens),pig),lambs)
      • Convert trees into lists of taxa: [man pan gorilla pongo], [human coprinus . . .], . . .
      • Synonymy preprocessing: Replace names by TBMap name clusters: [tc1 tc2 tc3 tc4], [tc1 tc5 . . .], . . .
      • Build index consisting of (i) dictionary (mapping of taxon names to name clusters) and (ii) postings (lists of tree IDs), e.g., tc1: 1 2 3 4; tc2: 1
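The three steps on slide 35 can be sketched as follows. Everything beyond the slide's example trees is an assumption made for illustration: the `NAME_CLUSTERS` dictionary is a hypothetical stand-in for TBMap (including the mapping of `homo_sapiens` to `tc1`), the tree labeled N on the slide is given the ID 3, names without a cluster are indexed under their own spelling, and the simple regular expression only handles Newick strings without branch lengths or quoted labels.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the TBMap dictionary (taxon name -> name cluster).
NAME_CLUSTERS = {
    "man": "tc1", "human": "tc1", "homo_sapiens": "tc1",
    "pan": "tc2", "gorilla": "tc3", "pongo": "tc4", "coprinus": "tc5",
}

# Example trees from the slide; tree "N" is stored under ID 3 here.
TREES = {
    1: "(((man,pan),gorilla),pongo)",
    2: "(((human, coprinus),cryptomonas),zea_mays)",
    3: "(((dogs,homo_sapiens),pig),lambs)",
}


def taxa_of(newick):
    """Step 1: convert a tree into its list of taxon names."""
    return re.findall(r"[^(),;\s]+", newick)


def build_postings(trees, name_clusters):
    """Steps 2 and 3: replace names by clusters, then build postings lists."""
    postings = defaultdict(list)
    for tree_id in sorted(trees):
        # A set avoids listing a tree twice when two synonyms occur in it.
        keys = {name_clusters.get(name, name) for name in taxa_of(trees[tree_id])}
        for key in keys:
            postings[key].append(tree_id)
    return dict(postings)


INDEX = build_postings(TREES, NAME_CLUSTERS)
```

Because trees are visited in increasing ID order, every postings list comes out sorted, which keeps later union and intersection operations simple.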
  36. Schema
  37. Query Processing: Outline
      [Figure: query Q → consult inverted index → candidate trees → compare against Q using LCA queries → results]
  38. Implementing Phylogenetic Queries
      • Idea: Use LCA queries to compare ancestor-descendant relationships in Q with those in T.
      • If M(x) and M(y) have the same relationship in T as x and y have in Q, then Q can be embedded in T.
      • Advantage: Database trees need not be read into main memory.
  39. Implementing Taxonomic Queries
      • Use Boolean (union/intersection) operations on the inverted index
      • Example: Querying for “birds”
        • Find all bird species in the database trees using the NCBI taxonomy tree.
        • Use the inverted index to retrieve the tree ID list for each bird species.
        • Return the union of these lists.
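A minimal sketch of these Boolean operations, using Python sets as postings lists. The tree IDs are the example lists from the inverted-index slide, and assigning them to Cornus, Spigelia, and Hedera in that order is an assumption about the original figure. The `CHILDREN` table is a made-up fragment standing in for the NCBI taxonomy (the name `clade_A` is hypothetical), and the function names are illustrative.

```python
# Postings lists (taxon -> IDs of the trees that contain it).
POSTINGS = {
    "Cornus": {1, 2, 3, 5, 8, 13, 21, 34},
    "Spigelia": {2, 4, 8, 16, 32, 64, 128},
    "Hedera": {13, 16},
}

# Made-up classification fragment: parent taxon -> child taxa.
CHILDREN = {"clade_A": ["Cornus", "Hedera"]}


def contains(taxa, mode="all"):
    """Contains query: intersection for "all", union for "any"."""
    lists = [POSTINGS.get(taxon, set()) for taxon in taxa]
    if not lists:
        return set()
    if mode == "all":
        return set.intersection(*lists)
    return set.union(*lists)


def descendants(taxon):
    """The taxon itself plus everything below it in the classification."""
    found = [taxon]
    for child in CHILDREN.get(taxon, []):
        found.extend(descendants(child))
    return found


def related(taxon):
    """Related query: union of the postings of the taxon and its descendants."""
    return contains(descendants(taxon), mode="any")
```

A query for “birds” would follow the `related` path: expand the taxon through the classification, fetch one postings list per species, and return their union.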
  40. Tree visualization
  41. Tree visualization
      • Other tree visualization tools are available:
        • Hillis, Heath, & St. John 2005. Syst. Biol. 54: 471-482.
        • Sanderson 2006. Bioinformatics 22: 1004-1006.
        • Zmasek & Eddy 2001. Bioinformatics 17: 383-384.
      • We developed our own to
        • avoid plug-ins, and
        • easily highlight query results and provide outlinks to GenBank and TBMap.
  42. Spelling
      • Suggestions come from TreeBASE and NCBI
      • Uses GNU Aspell
        • Modified to handle special characters ('-', '&', '.') and compound words.
  44. Under construction
      • Unrooted trees
      • Supertree methods
        • MRP, MRF, MMC
      • Desktop version
      • Automatic update
      • Suggestions?
  46. Thanks to
      • Rod Page for TBMap
      • Bill Piel for TreeBASE data
      • Mike Sanderson
      • Oliver Eulenstein
      • National Science Foundation (grant EF-0334832)
