Working with Trees in the Phyloinformatic Age. WH Piel

  1. Working with Trees in the Phyloinformatic Age. William H. Piel, Yale Peabody Museum; Hilmar Lapp, NESCent, Duke University
  2. Dealing with the Growth of Phyloinformatics
      - Trees: Too Many
        - Search, organize, triage, summarize, synthesize
          - Review existing methods
          - Describe queries for BioSQL phylo extension
          - Making generic queries
      - Trees: Too Big
        - Visualizing and manipulating large trees
          - Demo PhyloWidget
  3. Searching Stored Trees
      - Path Enumerations
      - Nested Sets
      - Adjacency Lists
      - Transitive Closure
  4. Dewey system:
      [Figure: tree ((A,B),((C,D),E)) labeled with Dewey paths. Root = 0; internal nodes = 0.1, 0.2, 0.2.1; A = 0.1.1, B = 0.1.2, C = 0.2.1.1, D = 0.2.1.2, E = 0.2.2]
  5. Find clade for: Z = (<C S +D s )

      Find common pattern starting from left:

          SELECT * FROM nodes WHERE (path LIKE '0.2.1%');

      | Label | Path    |
      |-------|---------|
      | Root  | 0       |
      | NULL  | 0.1     |
      | A     | 0.1.1   |
      | B     | 0.1.2   |
      | NULL  | 0.2     |
      | NULL  | 0.2.1   |
      | C     | 0.2.1.1 |
      | D     | 0.2.1.2 |
      | E     | 0.2.2   |
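The prefix search on slide 5 can be tried end to end with Python's built-in `sqlite3`. This is a minimal sketch, not the authors' implementation: the `nodes(path, label)` layout follows the slide, while `build_db`, `common_path`, and `clade_for` are hypothetical helper names. One deliberate difference is that the sketch matches `path = prefix OR path LIKE 'prefix.%'`, an assumption made so that a prefix such as `0.2.1` cannot also match a sibling such as `0.2.10`.

```python
import sqlite3

# (path, label) rows from the slide; internal nodes carry no label.
ROWS = [
    ("0", "Root"), ("0.1", None), ("0.1.1", "A"), ("0.1.2", "B"),
    ("0.2", None), ("0.2.1", None), ("0.2.1.1", "C"), ("0.2.1.2", "D"),
    ("0.2.2", "E"),
]


def build_db():
    """Create an in-memory database holding the path-enumerated tree."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (path TEXT PRIMARY KEY, label TEXT)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?)", ROWS)
    return conn


def common_path(paths):
    """Longest prefix shared by all paths, compared component by component."""
    shared = []
    for parts in zip(*(p.split(".") for p in paths)):
        if len(set(parts)) != 1:
            break
        shared.append(parts[0])
    return ".".join(shared)


def clade_for(conn, labels):
    """Return (clade path, labels in the clade) for the smallest clade holding labels."""
    marks = ", ".join("?" for _ in labels)
    paths = [row[0] for row in conn.execute(
        f"SELECT path FROM nodes WHERE label IN ({marks})", labels)]
    prefix = common_path(paths)
    rows = conn.execute(
        "SELECT label FROM nodes WHERE (path = ? OR path LIKE ?) "
        "AND label IS NOT NULL ORDER BY path",
        (prefix, prefix + ".%"))
    return prefix, [row[0] for row in rows]


if __name__ == "__main__":
    print(clade_for(build_db(), ["C", "D"]))
```

Calling `clade_for(conn, ["C", "D"])` first finds the shared prefix of the two leaf paths and then selects every labeled row under it.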
  6. ATreeGrep and Crimson
      - ATreeGrep
        - Uses special suffix indexing to optimize speed
        - Shasha, D., J. T. L. Wang, H. Shan and K. Zhang. 2002. ATreeGrep: Approximate Searching in Unordered Trees. Proceedings of the 14th SSDBM, Edinburgh, Scotland, pp. 89-98.
      - Crimson
        - Uses nested subtrees to avoid long strings
        - Zheng, Y., S. Fisher, S. Cohen, S. Guo, J. Kim, and S. B. Davidson. 2006. Crimson: A Data Management System to Support Evaluating Phylogenetic Tree Reconstruction Algorithms. 32nd International Conference on Very Large Data Bases, ACM, pp. 1231-1234.
  7. Searching Stored Trees
      - Path Enumerations
      - Nested Sets
      - Adjacency Lists
      - Metrics
      - Transitive Closure
  8. Depth-first traversal scoring each node with a left and right ID
      [Figure: tree with leaves A, B, C, D, E; each node carries a left ID and a right ID, numbered 1 to 18 in traversal order]
  9. Minimum Spanning Clade of Node 5

          SELECT *
          FROM nodes
          INNER JOIN nodes AS include
            ON (nodes.left_id BETWEEN include.left_id AND include.right_id)
          WHERE include.node_id = 5;

      | Label | Left | Right |
      |-------|------|-------|
      |       | 1    | 18    |
      |       | 2    | 7     |
      | A     | 3    | 4     |
      | B     | 5    | 6     |
      |       | 8    | 17    |
      |       | 9    | 14    |
      | C     | 10   | 11    |
      | D     | 12   | 13    |
      | E     | 15   | 16    |
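The left/right numbering and the `BETWEEN` join can be reproduced with a short depth-first traversal. The sketch below is illustrative only: the nested-tuple encoding of the tree, the helper names, and the choice to assign `node_id` in preorder (which makes node 5 the C, D, E clade, as in the adjacency-list slide) are assumptions, not something the slides state.

```python
import sqlite3

# The five-taxon example tree, written as nested tuples (an assumed encoding).
TREE = (("A", "B"), (("C", "D"), "E"))


def number_nodes(tree):
    """Return (node_id, label, left_id, right_id) rows from a depth-first walk."""
    rows = []
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter            # assigned on the way down
        if not isinstance(node, str):
            for child in node:
                visit(child)
        counter += 1              # right ID assigned on the way back up
        label = node if isinstance(node, str) else None
        rows.append((label, left, counter))

    visit(tree)
    rows.sort(key=lambda row: row[1])  # left-ID order equals preorder
    return [(i + 1, label, left, right)
            for i, (label, left, right) in enumerate(rows)]


def build_db(tree):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
                 "label TEXT, left_id INTEGER, right_id INTEGER)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?, ?)", number_nodes(tree))
    return conn


def clade_labels(conn, node_id):
    """Leaf labels inside the clade rooted at node_id, via the slide's join."""
    rows = conn.execute(
        "SELECT nodes.label FROM nodes INNER JOIN nodes AS include "
        "ON (nodes.left_id BETWEEN include.left_id AND include.right_id) "
        "WHERE include.node_id = ? AND nodes.label IS NOT NULL "
        "ORDER BY nodes.left_id", (node_id,))
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(clade_labels(build_db(TREE), 5))
```

The appeal of this encoding is that a whole subtree is retrieved with a single range comparison and no recursion.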
  10. PhyloFinder
      - PhyloFinder
        - Duhong Chen et al.
        - http://pilin.cs.iastate.edu/phylofinder/
      - Mackey, A. 2002. Relational Modeling of Biological Data: Trees and Graphs. Bioinformatics Technology Conference. http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html
  11. Searching Stored Trees
      - Path Enumerations
      - Nested Sets
      - Adjacency Lists
      - Metrics
      - Transitive Closure
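  12. [Figure: tree with leaves A, B, C, D, E and node IDs 1 to 9, shown next to its adjacency-list table. Rows as (label, node ID, parent ID): (-, 1, -), (-, 2, 1), (A, 3, 2), (B, 4, 2), (-, 5, 1), (-, 6, 5), (C, 7, 6), (D, 8, 6), (E, 9, 5)]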
  13. SQL query to find the parent node of node 'D':

          SELECT *
          FROM nodes AS parent
          INNER JOIN nodes AS child
            ON (child.parent_id = parent.node_id)
          WHERE child.node_label = 'D';

      … but this requires an external procedure to navigate the tree.

      | node_label | node_id | parent_id |
      |------------|---------|-----------|
      | -          | 1       | -         |
      | -          | 2       | 1         |
      | A          | 3       | 2         |
      | B          | 4       | 2         |
      | -          | 5       | 1         |
      | -          | 6       | 5         |
      | C          | 7       | 6         |
      | D          | 8       | 6         |
      | E          | 9       | 5         |
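To make the "external procedure" concrete, here is one way it might look: client code that issues the slide's parent query repeatedly to climb from a leaf to the root. This is a sketch under stated assumptions; `parent_of`, `ancestors`, and `mrca` are hypothetical helper names, and the table contents are the rows shown on the slide.

```python
import sqlite3

# (node_id, parent_id, node_label) rows from the adjacency-list slide.
ROWS = [
    (1, None, None), (2, 1, None), (3, 2, "A"), (4, 2, "B"), (5, 1, None),
    (6, 5, None), (7, 6, "C"), (8, 6, "D"), (9, 5, "E"),
]


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
                 "parent_id INTEGER, node_label TEXT)")
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", ROWS)
    return conn


def parent_of(conn, node_id):
    """One step up the tree, using the same self-join as the slide."""
    row = conn.execute(
        "SELECT parent.node_id FROM nodes AS parent "
        "INNER JOIN nodes AS child ON (child.parent_id = parent.node_id) "
        "WHERE child.node_id = ?", (node_id,)).fetchone()
    return row[0] if row else None


def ancestors(conn, label):
    """Node IDs from the leaf's parent up to the root: one query per level."""
    node_id = conn.execute(
        "SELECT node_id FROM nodes WHERE node_label = ?", (label,)).fetchone()[0]
    path = []
    node_id = parent_of(conn, node_id)
    while node_id is not None:
        path.append(node_id)
        node_id = parent_of(conn, node_id)
    return path


def mrca(conn, label_a, label_b):
    """First ancestor of label_b that is also an ancestor of label_a."""
    seen = set(ancestors(conn, label_a))
    for node_id in ancestors(conn, label_b):
        if node_id in seen:
            return node_id
    return None


if __name__ == "__main__":
    conn = build_db()
    print(ancestors(conn, "D"), mrca(conn, "C", "E"))
```

The loop costs one round trip per tree level, which is the overhead that the nested-set and transitive-closure representations are meant to avoid.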
  14. Searching Stored Trees
      - Path Enumerations
      - Nested Sets
      - Adjacency Lists
      - Metrics
      - Transitive Closure
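  15. Searching trees by distance metrics: USim distance

      Wang, J. T. L., H. Shan, D. Shasha and W. H. Piel. 2005. Fast Structural Search in Phylogenetic Databases. Evolutionary Bioinformatics Online, 1: 37-46.

      [Figure: two four-taxon trees on A, B, C, D, each with its distance matrix]

      |   | A | B | C | D |
      |---|---|---|---|---|
      | A | 0 | 1 | 2 | 3 |
      | B | 1 | 0 | 2 | 3 |
      | C | 1 | 1 | 0 | 2 |
      | D | 1 | 1 | 1 | 0 |

      |   | A | B | C | D |
      |---|---|---|---|---|
      | A | 0 | 1 | 2 | 2 |
      | B | 1 | 0 | 2 | 2 |
      | C | 2 | 2 | 0 | 1 |
      | D | 2 | 2 | 1 | 0 |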
  16. Searching Stored Trees
      - Path Enumerations
      - Nested Sets
      - Adjacency Lists
      - Transitive Closure
  17. Transitive Closure
      - Finding paths between vertices on a graph
      - DB2 and Oracle have special functions:

            FROM Edge
            START WITH (child_id = A AND tree_id = T)
            CONNECT BY (PRIOR parent_id = child_id)
                   AND (PRIOR tree_id = tree_id)

      - Nakhleh, L., D. Miranker, F. Barbancon, W. H. Piel, and M. Donoghue. 2003. Requirements of phylogenetic databases. Third IEEE Symposium on Bioinformatics and Bioengineering, pp. 141-148.
      - Paths can be precomputed and stored: BioSQL
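Where `CONNECT BY` is not available, a recursive common table expression can walk the same edges. The sketch below uses SQLite's `WITH RECURSIVE` purely as a stand-in for the vendor syntax on the slide; the `edge(child_id, parent_id, tree_id)` table and the sample rows (the five-taxon tree with node IDs 1 to 9) are assumptions for illustration.

```python
import sqlite3

# (child_id, parent_id, tree_id) rows for the five-taxon example tree.
EDGES = [
    (2, 1, 1), (3, 2, 1), (4, 2, 1), (5, 1, 1),
    (6, 5, 1), (7, 6, 1), (8, 6, 1), (9, 5, 1),
]

ANCESTOR_SQL = """
WITH RECURSIVE walk(node_id, depth) AS (
    -- start with the parent of the requested child
    SELECT parent_id, 1 FROM edge WHERE child_id = ? AND tree_id = ?
    UNION ALL
    -- then keep following child -> parent edges within the same tree
    SELECT e.parent_id, w.depth + 1
    FROM edge AS e JOIN walk AS w ON e.child_id = w.node_id
    WHERE e.tree_id = ?
)
SELECT node_id, depth FROM walk ORDER BY depth
"""


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (child_id INTEGER, parent_id INTEGER, "
                 "tree_id INTEGER)")
    conn.executemany("INSERT INTO edge VALUES (?, ?, ?)", EDGES)
    return conn


def ancestors(conn, child_id, tree_id):
    """All (ancestor_id, distance) pairs above child_id, nearest first."""
    return conn.execute(ANCESTOR_SQL, (child_id, tree_id, tree_id)).fetchall()


if __name__ == "__main__":
    print(ancestors(build_db(), 8, 1))
```

Unlike the client-side loop over an adjacency list, the whole climb happens inside one statement.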
  18. Dealing with the Growth of Phyloinformatics
      - Trees: Too Many
        - Search, organize, triage, summarize, synthesize
          - Review existing methods
          - Describe queries for BioSQL phylo extension
          - Making generic queries
      - Trees: Too Big
        - Visualizing and manipulating large trees
          - Demo PhyloWidget
  19. BioSQL: http://www.biosql.org/
      - Schema for persistent storage of sequences and features, tightly integrated with BioPerl (+ BioPython, BioJava, and BioRuby)
      - phylodb extension designed at NESCent Hackathon
      - Perl command-line interface by Jamie Estill, GSoC
  20. Index of all paths from ancestors to descendants

          CREATE TABLE node_path (
            child_node_id  integer,
            parent_node_id integer,
            distance       integer);

      [Figure: example tree with leaves A, B, C and nodes 1 to 5, shown with its node_path rows]
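One way to populate `node_path` is to climb from every node to the root and record each (descendant, ancestor, distance) triple. The following is a rough sketch rather than the BioSQL loader: `node_paths` and `store` are hypothetical helpers, and the node numbering for the three-taxon tree (1 = root, 2 = parent of A and B, 3 = A, 4 = B, 5 = C) is an assumption.

```python
import sqlite3

# child -> parent map for the tree ((A,B),C); the numbering is assumed.
PARENT_OF = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}


def node_paths(parent_of):
    """Return sorted (child_node_id, parent_node_id, distance) rows."""
    rows = []
    for child in parent_of:
        ancestor, distance = parent_of[child], 1
        while ancestor is not None:
            rows.append((child, ancestor, distance))
            ancestor = parent_of[ancestor]
            distance += 1
    return sorted(rows)


def store(rows):
    """Load the precomputed paths into the table from the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node_path (child_node_id integer, "
                 "parent_node_id integer, distance integer)")
    conn.executemany("INSERT INTO node_path VALUES (?, ?, ?)", rows)
    return conn


if __name__ == "__main__":
    for row in node_paths(PARENT_OF):
        print(row)
```

The table grows with the sum of node depths rather than the number of nodes, which is the storage cost paid for answering ancestor questions with plain joins.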
  21. Find all paths where A and B share a common parent_node_id

          SELECT pA.parent_node_id
          FROM   node_path pA, node_path pB, nodes nA, nodes nB
          WHERE  pA.parent_node_id = pB.parent_node_id
          AND    pA.child_node_id = nA.node_id
          AND    nA.node_label = 'A'
          AND    pB.child_node_id = nB.node_id
          AND    nB.node_label = 'B';
  22. … of those paths, select one that has the shortest path

          SELECT pA.parent_node_id
          FROM   node_path pA, node_path pB, nodes nA, nodes nB
          WHERE  pA.parent_node_id = pB.parent_node_id
          AND    pA.child_node_id = nA.node_id
          AND    nA.node_label = 'A'
          AND    pB.child_node_id = nB.node_id
          AND    nB.node_label = 'B'
          ORDER BY pA.distance
          LIMIT 1;
  23. … of those paths, select one that has the longest path

          SELECT pA.parent_node_id
          FROM   node_path pA, node_path pB, nodes nA, nodes nB
          WHERE  pA.parent_node_id = pB.parent_node_id
          AND    pA.child_node_id = nA.node_id
          AND    nA.node_label = 'A'
          AND    pB.child_node_id = nB.node_id
          AND    nB.node_label = 'B'
          ORDER BY pA.distance DESC
          LIMIT 1;
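The shortest-path and longest-path variants of the shared-ancestor query can be checked against a small database. This sketch loads the five-taxon tree (node IDs 1 to 9) into SQLite and runs the slide's query with the labels as parameters; the loader and the `shared_ancestor` wrapper are hypothetical, and it assumes every label occurs only once in the database.

```python
import sqlite3

# (node_id, parent_id, node_label) for the five-taxon example tree.
NODES = [
    (1, None, None), (2, 1, None), (3, 2, "A"), (4, 2, "B"), (5, 1, None),
    (6, 5, None), (7, 6, "C"), (8, 6, "D"), (9, 5, "E"),
]

QUERY = """
SELECT pA.parent_node_id
FROM   node_path pA, node_path pB, nodes nA, nodes nB
WHERE  pA.parent_node_id = pB.parent_node_id
AND    pA.child_node_id = nA.node_id AND nA.node_label = ?
AND    pB.child_node_id = nB.node_id AND nB.node_label = ?
ORDER BY pA.distance {direction}
LIMIT 1
"""


def build_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, node_label TEXT)")
    conn.execute("CREATE TABLE node_path (child_node_id INTEGER, "
                 "parent_node_id INTEGER, distance INTEGER)")
    parent_of = {node_id: parent for node_id, parent, _ in NODES}
    for node_id, _, label in NODES:
        conn.execute("INSERT INTO nodes VALUES (?, ?)", (node_id, label))
        ancestor, distance = parent_of[node_id], 1
        while ancestor is not None:  # one node_path row per ancestor
            conn.execute("INSERT INTO node_path VALUES (?, ?, ?)",
                         (node_id, ancestor, distance))
            ancestor, distance = parent_of[ancestor], distance + 1
    return conn


def shared_ancestor(conn, label_a, label_b, nearest=True):
    """Nearest (shortest path) or farthest (longest path) shared ancestor."""
    sql = QUERY.format(direction="ASC" if nearest else "DESC")
    row = conn.execute(sql, (label_a, label_b)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_db()
    print(shared_ancestor(conn, "C", "E"), shared_ancestor(conn, "C", "E", nearest=False))
```

Sorting ascending returns the most recent common ancestor, while sorting descending returns the root of the tree that holds both labels.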
  24. Find the maximum spanning clade (i.e. the subtree) for each tree that includes A and B but not C:
      - Get all ancestors shared by A and B
      - Exclude those that are also ancestors to C
      - Return an adjacency list for each subtree

          SELECT e.parent_id AS parent, e.child_id AS child,
                 ch.node_label, pt.tree_id
          FROM   node_path p, edges e, nodes pt, nodes ch
          WHERE  e.child_id = p.child_node_id
          AND    pt.node_id = e.parent_id
          AND    ch.node_id = e.child_id
          AND    p.parent_node_id IN (
                   SELECT pA.parent_node_id
                   FROM   node_path pA, node_path pB, nodes nA, nodes nB
                   WHERE  pA.parent_node_id = pB.parent_node_id
                   AND    pA.child_node_id = nA.node_id
                   AND    nA.node_label = 'A'
                   AND    pB.child_node_id = nB.node_id
                   AND    nB.node_label = 'B')
          AND NOT EXISTS (
                   SELECT 1
                   FROM   node_path np, nodes n
                   WHERE  np.child_node_id = n.node_id
                   AND    n.node_label = 'C'
                   AND    np.parent_node_id = p.parent_node_id);
  25. Find trees that contain a clade that includes A and B but not C:
      - Get all ancestors shared by A and B
      - Exclude those that are also ancestors to C
      - List the set of trees with these ancestors

          SELECT DISTINCT t.tree_id, t.name
          FROM   node_path p, nodes ch, trees t
          WHERE  ch.node_id = p.child_node_id
          AND    ch.tree_id = t.tree_id
          AND    p.parent_node_id IN (
                   SELECT pA.parent_node_id
                   FROM   node_path pA, node_path pB, nodes nA, nodes nB
                   WHERE  pA.parent_node_id = pB.parent_node_id
                   AND    pA.child_node_id = nA.node_id
                   AND    nA.node_label = 'A'
                   AND    pB.child_node_id = nB.node_id
                   AND    nB.node_label = 'B')
          AND NOT EXISTS (
                   SELECT 1
                   FROM   node_path np, nodes n
                   WHERE  np.child_node_id = n.node_id
                   AND    n.node_label = 'C'
                   AND    np.parent_node_id = p.parent_node_id);
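To see the clade filter separate trees, the query from slide 25 can be run against two toy trees that disagree about A, B, and C. Everything outside the SQL is assumed for illustration: the tree names, the node numbering, and the `load` and `trees_with_clade` helpers. The SQL is the slide's, with the three labels turned into parameters and an `ORDER BY` added for stable output.

```python
import sqlite3

# tree_id -> (name, child -> parent map, node_id -> label); toy data only.
TREES = {
    1: ("tree1", {1: None, 2: 1, 3: 2, 4: 2, 5: 1}, {3: "A", 4: "B", 5: "C"}),       # ((A,B),C)
    2: ("tree2", {11: None, 12: 11, 13: 12, 14: 12, 15: 11}, {13: "A", 14: "C", 15: "B"}),  # ((A,C),B)
}

QUERY = """
SELECT DISTINCT t.tree_id, t.name
FROM   node_path p, nodes ch, trees t
WHERE  ch.node_id = p.child_node_id
AND    ch.tree_id = t.tree_id
AND    p.parent_node_id IN (
         SELECT pA.parent_node_id
         FROM   node_path pA, node_path pB, nodes nA, nodes nB
         WHERE  pA.parent_node_id = pB.parent_node_id
         AND    pA.child_node_id = nA.node_id AND nA.node_label = ?
         AND    pB.child_node_id = nB.node_id AND nB.node_label = ?)
AND NOT EXISTS (
         SELECT 1 FROM node_path np, nodes n
         WHERE  np.child_node_id = n.node_id
         AND    n.node_label = ?
         AND    np.parent_node_id = p.parent_node_id)
ORDER BY t.tree_id
"""


def load():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trees (tree_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, "
                 "tree_id INTEGER, node_label TEXT)")
    conn.execute("CREATE TABLE node_path (child_node_id INTEGER, "
                 "parent_node_id INTEGER, distance INTEGER)")
    for tree_id, (name, parent_of, labels) in TREES.items():
        conn.execute("INSERT INTO trees VALUES (?, ?)", (tree_id, name))
        for node_id in parent_of:
            conn.execute("INSERT INTO nodes VALUES (?, ?, ?)",
                         (node_id, tree_id, labels.get(node_id)))
            ancestor, distance = parent_of[node_id], 1
            while ancestor is not None:
                conn.execute("INSERT INTO node_path VALUES (?, ?, ?)",
                             (node_id, ancestor, distance))
                ancestor, distance = parent_of[ancestor], distance + 1
    return conn


def trees_with_clade(conn, label_a, label_b, excluded):
    """Trees holding a clade with label_a and label_b but without excluded."""
    return conn.execute(QUERY, (label_a, label_b, excluded)).fetchall()


if __name__ == "__main__":
    print(trees_with_clade(load(), "A", "B", "C"))
```

The query relies on node IDs being unique across trees, so that equal `parent_node_id` values always refer to the same node in the same tree.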
  26. Find trees that contain a clade that includes (A, B, C) but not D or E:
      - Get all ancestors of A, B, C from all trees that have A, B, C
      - Exclude those that are also ancestors to D, E
      - But make sure that the tree still contains D, E

          SELECT qry.tree_id, MIN(qry.name) AS "tree_name"
          FROM (
            SELECT DISTINCT ON (n.node_id) n.node_id, t.tree_id, t.name
            FROM trees t, nodes n,
                 (SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id
                  FROM nodes inN, node_path inP
                  WHERE inN.node_label IN ('A','B','C')
                  AND inP.child_node_id = inN.node_id
                  GROUP BY inN.tree_id, inP.parent_node_id
                  HAVING COUNT(inP.child_node_id) = 3  -- number of ingroups that share node
                  ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
            WHERE n.node_id IN (lca.parent_node_id)
            AND t.tree_id = n.tree_id
            AND NOT EXISTS (
                  SELECT 1 FROM nodes outN, node_path outP
                  WHERE outN.node_label IN ('D','E')
                  AND outP.child_node_id = outN.node_id
                  AND outP.parent_node_id = lca.parent_node_id)
            AND EXISTS (
                  SELECT c.tree_id FROM trees c, nodes q
                  WHERE q.node_label IN ('D','E')
                  AND q.tree_id = c.tree_id
                  AND c.tree_id = t.tree_id
                  GROUP BY c.tree_id
                  HAVING COUNT(c.tree_id) = 2)  -- number of non-ingroups that must be in tree
          ) AS qry
          GROUP BY (qry.tree_id)
          HAVING COUNT(qry.node_id) = 1;  -- number of clades that each tree must satisfy
  27. Here's a faster, cleaner version:

          SELECT t.tree_id, t.name
          FROM trees t
          INNER JOIN (
            SELECT DISTINCT ON (inN.tree_id) inP.parent_node_id, inN.tree_id
            FROM nodes inN, node_path inP
            WHERE inN.node_label IN ('A','B','C')
            AND inP.child_node_id = inN.node_id
            GROUP BY inN.tree_id, inP.parent_node_id
            HAVING COUNT(inP.child_node_id) = 3
            ORDER BY inN.tree_id, inP.parent_node_id DESC) AS lca
          USING (tree_id)
          WHERE NOT EXISTS (
            SELECT 1 FROM nodes outN, node_path outP
            WHERE outN.node_label IN ('D','E')
            AND outP.child_node_id = outN.node_id
            AND outP.parent_node_id = lca.parent_node_id)
          AND EXISTS (
            SELECT c.tree_id FROM trees c, nodes q
            WHERE q.node_label IN ('D','E')
            AND q.tree_id = c.tree_id
            AND c.tree_id = t.tree_id
            GROUP BY c.tree_id
            HAVING COUNT(c.tree_id) = 2);
  28. Matching a whole tree means querying for all clades:
      - (A, B) but not C, D, E
      - (C, D) but not A, B, E
      - (C, D, E) but not A, B

      [Figure: tree with leaves A, B, C, D, E and node IDs 1 to 9]
  29. Dealing with the Growth of Phyloinformatics
      - Trees: Too Many
        - Search, organize, triage, summarize, synthesize
          - Review existing methods
          - Describe queries for BioSQL phylo extension
          - Making generic queries
      - Trees: Too Big
        - Visualizing and manipulating large trees
          - Demo PhyloWidget
  30. Mining trees for interesting, general relationship questions:
      - (((Sus_scrofa, Hippopotamus),Balaenoptera),Equus_caballus)
      - vs
      - ((Sus_scrofa, (Hippopotamus,Balaenoptera)),Equus_caballus)

      [Figure: two trees. Leaves of the first: Sus scrofa, Hippopotamus, Balaenoptera, Equus caballus, Felis catus. Leaves of the second: Balaenoptera, Hippopotamus, Sus scrofa, Equus caballus, Felis catus]
  31. Even with perfectly resolved OTUs, you will still fail to hit relevant trees:

      [Figure: two trees. Leaves of the first: Sus scrofa, Hippopotamus, Balaenoptera, Equus caballus, Felis catus. Leaves of the second: Sus celebensis, Hippopotamus, Balaenoptera, Equus asinus, Felis catus]
  32. Step 1: for each clade in all trees in the database, run a stem query on a classification tree (e.g. NCBI)

      Stem queries:
      - Node 2: (>A, B - C, D, E)
      - Node 3: (>A - B, C, D, E)
      - Node 4: (>B - A, C, D, E)
      - Node 5: (>C, D, E - A, B)
      - Node 6: (>C, D - A, B, E)
      - Node 7: (>C - A, B, D, E)
      - Node 8: (>D - A, B, C, E)
      - Node 9: (>E - A, B, C, D)

      Step 2: label each node with an NCBI taxon ID (if there is a match)

      Step 3: do the same for the query tree

      [Figure: tree with leaves A, B, C, D, E and node IDs 1 to 9]
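The list of stem queries on slide 32 can be generated mechanically from the tree. In this sketch the nested-tuple encoding, the preorder node numbering (which happens to reproduce the slide's node IDs), and the helper names `leaves` and `stem_queries` are all assumptions; the output strings copy the slide's `(>ingroup - outgroup)` notation.

```python
# The five-taxon example tree as nested tuples (an assumed encoding).
TREE = (("A", "B"), (("C", "D"), "E"))


def leaves(node):
    """Leaf labels under node, left to right."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def stem_queries(tree):
    """Map preorder node ID -> stem query string, skipping the root."""
    all_leaves = leaves(tree)
    queries = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        node_id = counter
        ingroup = leaves(node)
        outgroup = [leaf for leaf in all_leaves if leaf not in ingroup]
        if outgroup:  # the root has nothing to exclude, so it gets no query
            queries[node_id] = "(>{} - {})".format(
                ", ".join(ingroup), ", ".join(outgroup))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return queries


if __name__ == "__main__":
    for node_id, query in stem_queries(TREE).items():
        print(f"Node {node_id}: {query}")
```

Each query pairs the leaves below a node with every other leaf in the tree, so a tree with n nodes yields n - 1 queries.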
  33. Rename nodes according to their deepest stem query…

      [Figure: two trees with internal nodes labeled Hominoidea and Cercopithecoidea. Leaves of the first: Gorilla gorilla, Homo sapiens, Pan troglodytes, Macaca sinica, Macaca nigra. Leaves of the second: Gorilla, Homo, Pan, Macaca sinica, Macaca nigra, Pongo pygmaeus, Macaca irus]
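As a rough illustration of the renaming step, a stem query can be resolved against a classification by taking the most inclusive taxon that contains every ingroup member and no outgroup member. The classification below is a hand-made toy, not NCBI data: the `PARENT` map, the `Catarrhini` root, and the omission of genus-level ranks are assumptions, as are the helper names.

```python
# Toy classification (child -> parent); a stand-in for a real taxonomy.
PARENT = {
    "Catarrhini": None,
    "Hominoidea": "Catarrhini",
    "Cercopithecoidea": "Catarrhini",
    "Gorilla gorilla": "Hominoidea",
    "Homo sapiens": "Hominoidea",
    "Pan troglodytes": "Hominoidea",
    "Pongo pygmaeus": "Hominoidea",
    "Macaca sinica": "Cercopithecoidea",
    "Macaca nigra": "Cercopithecoidea",
    "Macaca irus": "Cercopithecoidea",
}


def lineage(taxon):
    """The taxon followed by its ancestors, up to the root."""
    chain = []
    while taxon is not None:
        chain.append(taxon)
        taxon = PARENT[taxon]
    return chain


def stem_query(ingroup, outgroup):
    """Most inclusive taxon with all ingroup members and no outgroup member."""
    excluded = set()
    for taxon in outgroup:
        excluded.update(lineage(taxon))
    # Taxa on the first member's lineage that also contain every other member.
    shared = [taxon for taxon in lineage(ingroup[0])
              if all(taxon in lineage(other) for other in ingroup[1:])]
    allowed = [taxon for taxon in shared if taxon not in excluded]
    return allowed[-1] if allowed else None


if __name__ == "__main__":
    print(stem_query(["Gorilla gorilla", "Homo sapiens", "Pan troglodytes"],
                     ["Macaca sinica", "Macaca nigra"]))
```

Because the answer depends only on which higher taxon separates ingroup from outgroup, two trees sampling different species of the same groups would receive the same node labels, which is what makes them comparable.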
  34. Dealing with the Growth of Phyloinformatics
      - Trees: Too Many
        - Search, organize, triage, summarize, synthesize
          - Review existing methods
          - Describe queries for BioSQL phylo extension
          - Making generic queries
      - Trees: Too Big
        - Visualizing and manipulating large trees
          - Demo PhyloWidget
  35. PhyloWidget
      - Greg Jordan
        - Google Summer of Code student
        - Nick Goldman's group, EBI
      - Java Applet
        - Uses the Processing graphics library
      - Originally a graphical phylogenetic query and display tool for TreeBASE, BioSQL, etc.
      - Can be used for:
        - Manipulating, visualizing large trees
        - Building supertrees through pruning & grafting
  36. Thanks
