
# Approximating the subtree distance between phylogenies. C Semple


1. Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies. Charles Semple, Biomathematics Research Centre, Department of Mathematics and Statistics, University of Canterbury, New Zealand. Joint work with Magnus Bordewich and Catherine McCartin. e-Science Institute, Edinburgh, 2007.
2. Subtree Prune and Regraft (SPR). Example. [Figure: rooted trees S, T1 and T2 on leaves a, b, c, d with root r; each of T1 and T2 is 1 SPR operation away from S.]
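A single SPR operation can be sketched in a few lines. This is only an illustration under stated assumptions: trees are written as nested 2-tuples of leaf names, the root leaf r is left implicit, and the helper names `prune`, `regraft` and `spr` are hypothetical rather than taken from the talk.

```python
def prune(tree, sub):
    """Remove the subtree `sub` and suppress the degree-2 vertex left behind."""
    if isinstance(tree, str):
        return tree
    left, right = tree  # assumes a binary tree
    if left == sub:
        return right
    if right == sub:
        return left
    return (prune(left, sub), prune(right, sub))


def regraft(tree, target, sub):
    """Attach `sub` to the edge entering the subtree `target`."""
    if tree == target:
        return (sub, tree)
    if isinstance(tree, str):
        return tree
    return tuple(regraft(child, target, sub) for child in tree)


def spr(tree, sub, target):
    """One SPR operation: prune `sub`, then regraft it above `target`."""
    return regraft(prune(tree, sub), target, sub)


# Hypothetical example tree on leaves a, b, c, d.
S = ((("a", "b"), "c"), "d")
T1 = spr(S, "a", ("b", "c"))  # move leaf a next to the pair (b, c)
```

Here `target` has to be a subtree of the pruned tree; choosing the whole pruned tree as `target` regrafts `sub` above the root.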
3. Applications of SPR. SPR is used (I) as a search tool for selecting the best tree in reconstruction algorithms; (II) to quantify the dissimilarity between two phylogenetic trees; (III) to provide a lower bound on the number of reticulation events in the case of non-tree-like evolution. For II and III, one wants the minimum number of SPR operations needed to transform one phylogeny into another. This number is the SPR distance between two phylogenies S and T.
4. The Mathematical Problem: MINIMUM SPR. Instance: Two rooted binary phylogenetic trees S and T. Goal: Find a minimum-length sequence of single SPR operations that transforms S into T. Measure: The length of the sequence. Notation: Use $d_{SPR}(S,T)$ to denote this minimum length. Theorem (Bordewich, S 2004). MINIMUM SPR is NP-hard. The overriding goal is to find (with no restrictions) the exact solution or a heuristic solution with a guarantee of closeness.
5. Algorithms for NP-Hard Problems. Fixed-parameter algorithms are a practical way to find optimal solutions if the parameter measuring the hardness of the problem is small. Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.
6. Agreement Forests. A forest of T is a disjoint collection of phylogenetic subtrees whose union of leaf sets is $X \cup \{r\}$. [Figure: example tree S on leaves a, b, c, d, e, f with root r, and a forest F1.]
8. Agreement Forests. An agreement forest for S and T is a forest of both S and T. [Figure: example trees S and T on leaves a, b, c, d, e, f with root r, and two agreement forests F1 and F2.]
11. Theorem (Bordewich, S, 2004). Let S and T be two binary phylogenetic trees. Then $d_{SPR}(S,T)$ = (size of a maximum-agreement forest) − 1. From a maximum-agreement forest for S and T, it is fast to construct a sequence of SPR operations that transforms S into T.
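For very small trees the theorem can be checked by brute force. The sketch below is not the algorithm from the talk: it assumes the usual partition formulation of an agreement forest (each block induces the same rooted subtree in S and T, and the subtrees spanned by different blocks are vertex-disjoint), writes trees as nested tuples, and attaches the root leaf `"r"` itself. All function names are hypothetical.

```python
def clusters(tree):
    """Leaf sets below every vertex of a nested-tuple tree (root cluster last)."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found, leaves = [], frozenset()
    for child in tree:
        sub = clusters(child)
        found.extend(sub)
        leaves |= sub[-1]
    found.append(leaves)
    return found


def restriction(cls, block):
    """Cluster set of the tree restricted to `block` (fixes its rooted shape)."""
    return {c & block for c in cls if c & block}


def spanned(cls, block):
    """Vertices (named by their clusters) of the minimal subtree joining `block`."""
    inner = {c for c in cls if c & block and not block <= c}
    top = min((c for c in cls if block <= c), key=len)
    return inner | {top}


def partitions(items):
    """Yield every set partition of `items` as a list of frozensets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        yield part + [frozenset([first])]


def is_agreement_forest(s_cls, t_cls, blocks):
    """Check vertex-disjointness in both trees, then equality of restrictions."""
    for cls in (s_cls, t_cls):
        used = set()
        for block in blocks:
            verts = spanned(cls, block)
            if used & verts:
                return False
            used |= verts
    return all(restriction(s_cls, b) == restriction(t_cls, b) for b in blocks)


def spr_distance(s, t, root="r"):
    """(Fewest blocks in an agreement forest) - 1; exponential, tiny trees only."""
    s_cls, t_cls = clusters((s, root)), clusters((t, root))
    if s_cls[-1] != t_cls[-1]:
        raise ValueError("trees must have the same leaf set")
    leaves = sorted(s_cls[-1])
    return min(len(p) for p in partitions(leaves)
               if is_agreement_forest(s_cls, t_cls, p)) - 1


S = ((("a", "b"), "c"), "d")
T = (("a", ("b", "c")), "d")
print(spr_distance(S, T))
```

The search enumerates all set partitions of the leaf set, so it is only meant to make the definition concrete on a handful of leaves.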
12. Reducing the Size of the Instance. Subtree Reduction; Chain Reduction.
13. Fixed-Parameter Algorithms. The underlying idea is to find an algorithm whose running time separates the size of the problem instance from the parameter of interest. One way to obtain such an algorithm is to reduce the size of the problem instance while preserving the optimal value (kernelizing the problem). Are the subtree and chain reductions enough to kernelize the problem?
14. Fixed-Parameter Algorithms. Lemma. If $n'$ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then $n' < 28\,d_{SPR}(S,T)$. Corollary (Bordewich, S 2004). MINIMUM SPR is fixed-parameter tractable: (1) repeatedly apply the subtree and chain rules; (2) exhaustively find a maximum-agreement forest by deleting edges in S and comparing with T. The running time is $O((56k)^k + p(n))$ compared with $O((2n)^k)$, where $k = d_{SPR}(S,T)$ and $p(n)$ is the polynomial bound for reducing the trees using the subtree and chain reductions.
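The constant 56 can be read off from the lemma, assuming the exhaustive edge-deletion search on the reduced trees costs $O((2n')^k)$, the same form as the bound quoted for the unreduced trees:

$$
n' < 28k \quad\Longrightarrow\quad (2n')^k < (2 \cdot 28k)^k = (56k)^k .
$$

The exponential part then depends on $k$ alone, and the instance size $n$ enters only through the additive polynomial $p(n)$.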
15. Fixed-Parameter Algorithms. A second way to obtain a fixed-parameter algorithm is the method of bounded search trees. Preliminaries: A rooted triple is a binary tree with three leaves, written ab|c. A tree S displays ab|c if the last common ancestor of a and b is a descendant of the last common ancestor of a and c.
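The display condition can be tested with an equivalent cluster check: S displays ab|c exactly when some vertex of S has a and b below it but not c. A minimal sketch, assuming trees are written as nested tuples of leaf names; the function names are hypothetical.

```python
from itertools import combinations


def clusters(tree):
    """Leaf sets below every vertex of a nested-tuple tree (root cluster last)."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found, leaves = [], frozenset()
    for child in tree:
        sub = clusters(child)
        found.extend(sub)
        leaves |= sub[-1]
    found.append(leaves)
    return found


def displays(tree, a, b, c):
    """True if `tree` displays the rooted triple ab|c."""
    return any(a in cl and b in cl and c not in cl for cl in clusters(tree))


def triples(tree):
    """All rooted triples displayed by `tree`, as (a, b, c) meaning ab|c."""
    leaves = sorted(clusters(tree)[-1])
    shown = set()
    for x, y, z in combinations(leaves, 3):
        for p, q, r in ((x, y, z), (x, z, y), (y, z, x)):
            if displays(tree, p, q, r):
                shown.add((p, q, r))
    return shown


S = ((("a", "b"), "c"), "d")
T = (("a", ("b", "c")), "d")
conflicts = triples(S) - triples(T)  # triples of S that T does not display
```

For these two example trees the only triple displayed by S but not by T is ab|c.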
16. Fixed-Parameter Algorithms. Observation. If F is an agreement forest for S and T, then F can be obtained from S by deleting $|F|-1$ edges. Recall. If F is maximum, then $|F|-1 = d_{SPR}(S,T)$. [Figure: example trees S and T on leaves a, b, c, d, e, f with root r.]
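18. Fixed-Parameter Algorithms. (1) Find a minimal triple xy|z in S that is not in T. (2) Branch into 4 computational paths $e_x$, $e_y$, $e_z$, $e_r$, and repeat for each path until no such triple exists. (3) For each path, find a pair of overlapping components $t_s$ and $t_t$ and branch into 2 paths $e_s$, $e_t$. Repeat until no such components exist. [Figure: animation on the example trees S and T showing the edges $e_a$, $e_b$, $e_d$, $e_r$, then $e_s$, $e_t$ at vertex $v_{st}$, being selected step by step.]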
36. Fixed-Parameter Algorithms. Key Lemma. $d_{SPR}(S,T) \le k$ if and only if one of the computational paths of length at most k results in an agreement forest for S and T. Theorem (Bordewich, McCartin, S 2007). The bounded search tree algorithm has running time $O(4^k n^4)$. Combining the methods of kernelization and bounded search gives an FPT algorithm with running time $O(4^k k^4 + p(n))$.
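A rough accounting of where the two bounds come from, assuming each step along a computational path deletes one edge, so that paths have length at most $k$ and branch at most 4 ways, and taking the polynomial factor $n^4$ from the theorem as given:

$$
\#\text{paths} \le 4^k
\quad\Longrightarrow\quad
O(4^k n^4),
\qquad
n' < 28k
\quad\Longrightarrow\quad
O\big(4^k (28k)^4 + p(n)\big) = O\big(4^k k^4 + p(n)\big).
$$

The second bound runs the search on the kernelized trees, whose leaf sets have size $n' < 28k$ by the earlier lemma.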
37. Approximation Algorithms. Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution. An r-approximation algorithm A for MINIMUM SPR means that the size of an agreement forest (minus 1) for S and T returned by A is at most r times $d_{SPR}(S,T)$. Example. If r = 3, then the size of any agreement forest (minus 1) for S and T returned by A is at most 3 times $d_{SPR}(S,T)$. Based on a pairwise comparison, Bonet, St. John, Mahindru and Amenta (2006) establish a 5-approximation algorithm ($O(n)$) for MINIMUM SPR.
38. Approximation Algorithms. (1) Find a minimal triple xy|z in S that is not in T. (2) Delete each of $e_x$, $e_y$, $e_z$, $e_r$, and repeat until no such triple exists. (3) Find a pair of overlapping components $t_s$ and $t_t$ and delete $e_s$, $e_t$. Repeat until no such components exist. Theorem (Bordewich, McCartin, S 2007). This gives a 4-approximation algorithm for MINIMUM SPR ($O(n^5)$). Deleting only $e_x$, $e_z$, $e_r$ gives a 3-approximation algorithm. [Figure: example trees S and T with the edges $e_a$, $e_b$, $e_d$, $e_r$, $e_s$, $e_t$ marked.]
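39. [Figure: scale of approximation ratios r, running from r = 1 (exact algorithms) to "no r, unless P = NP". MINIMUM SPR is marked at 1 + 1/2112 (B, S 07), 3 (B, M, S 07) and 5 (B, S, M, A 06). Reference problems on the scale: maximum independent set in planar graphs (PTAS); travelling salesman problem and maximum clique (no r, unless P = NP).]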
40. Modelling Hybridization Events with SPR Operations. Reticulation processes cause species to be a composite of DNA regions derived from different ancestors. Processes include horizontal gene transfer, hybridization, and recombination. "… molecular phylogeneticists will have failed to find the 'true tree', not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot be properly represented as a tree." Ford Doolittle, 1999 (Dalhousie University)
41. Modelling Hybridization Events with SPR Operations. A single SPR operation models a single hybridization event (Maddison 1997). [Figure: example trees S and T on leaves a, b, c, d with root r, and a hybridization network H.]
44. Modelling Hybridization Events with SPR Operations. A single SPR operation models a single hybridization event (Maddison 1997). [Figure: example trees S and T, the hybridization network H, and the forest F obtained from H by deleting hybrid edges.]
45. A Fundamental Problem for Biologists. Given an initial set of data that correctly represents the tree-like evolution of different parts of various species' genomes, what is the smallest number of reticulation events required that simultaneously explains the evolution of the species? This smallest number provides a lower bound on the number of such events, and indicates the extent to which hybridization has affected the evolutionary history of the species under consideration. Since the 1930s, botanists have asked the question: How significant has the effect of hybridization been on the New Zealand flora?
46. Trees and Hybridization Networks. H explains T if T can be obtained from a rooted subtree of H by suppressing degree-2 vertices. [Figure: example trees S and T on leaves a, b, c, d, and hybridization networks H1 and H2.]
49. The Mathematical Problem: MINIMUM HYBRIDIZATION. Instance: Two rooted binary phylogenetic trees S and T. Goal: Find a hybridization network H that explains S and T, and minimizes the number of hybridization vertices. Measure: The number of hybridization vertices in H. Notation: Use $h(S,T)$ to denote this minimum number.
50. Example: Arbitrary SPR operations are not sufficient. [Figure: forest F1 on leaves a, b, c, d, e, f with root r.]
51. Use a sequence of SPR operations that avoids creating directed cycles to build a hybridization network that explains S and T. If one minimizes the length of such an (acyclic) sequence, does the resulting network minimize the number of hybridization events needed to explain S and T? Yes, and such a sequence corresponds to an acyclic-agreement forest.
52. Theorem (Baroni, Grünewald, Moulton, S, 2005). Let S and T be two binary phylogenetic trees. Then $h(S,T)$ = (size of a maximum-acyclic agreement forest) − 1. From a maximum-acyclic agreement forest for S and T, it is fast to construct a hybridization network that realizes $h(S,T)$. Theorem (Bordewich, S, 2007). MINIMUM HYBRIDIZATION is NP-hard.
53. Reducing the Size of the Instance. Subtree Reduction; Chain Reduction.
54. Fixed-Parameter Algorithms. Are the subtree and chain reductions enough to kernelize the problem? Lemma. If $n'$ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then $n' < 14\,h(S,T)$. Corollary (Bordewich, S 2007). MINIMUM HYBRIDIZATION is fixed-parameter tractable. The running time is $O((28k)^k + p(n))$ compared with $O((2n)^k)$, where $k = h(S,T)$ and $p(n)$ is the polynomial bound for reducing the trees using the subtree and chain reductions.
55. Reducing the Size of the Instance. Cluster Reduction (Baroni 2004).
56. A Grass (Poaceae) Dataset (Grass Phylogeny Working Group, Düsseldorf). Ellstrand, Whitkus, Rieseberg 1996 (Distribution of spontaneous plant hybrids). For each sequence, fastDNAml was used to reconstruct a phylogenetic tree (H. Schmidt).
57. [Figure: trees for the chloroplast (phyB) sequences and the nuclear (ITS) sequences.]
63. [Figure: trees for the chloroplast (phyB) sequences and the nuclear (ITS) sequences, with parts annotated h=3, h=1, h=4 and h=0.]
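64. Results for each pairwise combination (Bordewich, Linz, St John, S, 2007). Running times are on a 2000 MHz CPU with 2GB RAM.

    | Pairwise combination | # overlapping taxa | h(S,T) | Running time |
    |---|---|---|---|
    | ndhF, phyB | 40 | 14 | 11h |
    | ndhF, rbcL | 36 | 13 | 11.8h |
    | ndhF, rpoC2 | 34 | 12 | 26.3h |
    | ndhF, waxy | 19 | 9 | 320s |
    | ndhF, ITS | 46 | at least 15 | |
    | phyB, rbcL | 21 | 4 | 1s |
    | phyB, rpoC2 | 21 | 7 | 180s |
    | phyB, waxy | 14 | 3 | 1s |
    | phyB, ITS | 30 | 8 | 19s |
    | rbcL, rpoC2 | 26 | 13 | 29.5h |
    | rbcL, waxy | 12 | 7 | 230s |
    | rbcL, ITS | 29 | at least 9 | |
    | rpoC2, waxy | 10 | 1 | 1s |
    | rpoC2, ITS | 31 | at least 10 | |
    | waxy, ITS | 15 | 8 | 620s |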
65. Computing $d_{SPR}(S,T)$ and $h(S,T)$. For $d_{SPR}(S,T)$: (1) FPT using kernelization and bounded search tree methods ($O(4^k k^4 + p(n))$); (2) no cluster-based reduction; (3) 3-approximation algorithm. For $h(S,T)$: (1) FPT using kernelization only ($O((28k)^k + p(n))$), and it is unknown if a bounded search tree method exists; (2) cluster-based reduction; (3) unknown if there even is an approximation algorithm.
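66. Reducing the Size of the Instance. Subtree Reduction; Chain Reduction. [Figure: trees S and T sharing a chain $a_1, a_2, \ldots, a_n$, and the reduced trees S' and T' in which the chain is replaced by $x_1, x_2, x_3$.]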