Constraints and Global Optimization for Gene Prediction Overlap Resolution

546 views

Published on

We apply constraints and global optimization to the problem of restricting overlapping of gene predictions for prokaryotic genomes. We investigate existing heuristic methods and show how they may be expressed using Constraint Handling Rules. Furthermore, we integrate existing methods in a global optimization procedure expressed as probabilistic model in the PRISM language. This approach yields an optimal (highest scoring) subset of predictions that satisfy the constraints. Experimental results indicate accuracy comparable to the heuristic approaches.

Published in: Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
546
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
4
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Constraints and Global Optimization for Gene Prediction Overlap Resolution

  1. 1. Constraints and Global Optimization for Gene Prediction Overlap Resolution Christian Theil Have Research group PLIS: Programming, Logic and Intelligent Systems Department of Communication, Business and Information Technologies Roskilde University, P.O.Box 260, DK-4000 Roskilde, Denmark WCB 2011 in Perugia, September 12, 2011Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  2. 2. Outline1 Motivation and background Gene finding2 Overlap resolution Local heuristics Global optimization3 Implementation4 Evaluation5 Summary and future directions Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  3. 3. Motivation and backgroundMotivation In the LoSt project we explore the use of Probabilistic Logic programming (with PRISM) for biological sequence analysis. In particular, we have been concerned with applying these techniques to gene finding for prokaryotes. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  4. 4. Motivation and background Gene findingGene findingA traditional view, Considered as a sequence classification task. Performed without much context. Ignores constraints between predicted genes and their placement in the genome. Problem with overlapping genes. Over-prediction of overlapping genes and shadow genes. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  5. 5. Motivation and background Gene findingTwo step approachA divide and conquer two step approach to gene finding, 1 Gene prediction: A gene finder supplies a set of candidate predictions p1 . . . pn , called the initial set. 2 Pruning: The initial set is pruned according to certain rules or constraints. We call the pruned set the final set. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  6. 6. Overlap resolutionPruning step as a Constraint Satisfaction ProblemDefinitionA Constraint Satisfaction Problem is a triplet X , D, C . X is a set of nvariables, X = x1 , . . . , xn , with domains D = D(x1 ), . . . , D(xn ). Theconstraints C impose restrictions on possible assignments for sets ofvariables. A solution is an assignment of a value v ∈ D(xi ) to eachvariable xi ∈ X , consistent with C .We introduce variables X = xi . . . xn corresponding to each predictionp1 . . . pn in the initial set. All variables have boolean domains,∀xi ∈ X , D(xi ) = {true, false} and xi = true ⇔ pi ∈ final set. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  7. 7. Overlap resolution Local heuristicsLocal heuristic techniques Local heuristic pruning rules to post-process a set of gene predictions. Deletes inconsistent predictions based on various criteria. Pruning decisions based on the context of only a subset of the predictions.CSP formulationPruning xi just means D(xi ) = D(xi )true ⇔ D(xi ) = false. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  8. 8. Overlap resolution Local heuristicsLocal heuristic techniques as Constraint Handling RulesPruning is conveniently expressed as CHR simplification rules. 1 Constraint store starts out as the initial set. 2 Simplification rules prune predictions from the constraint store, until no more rules apply. 3 Final constraint store represents the final set. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  9. 9. Overlap resolution Local heuristicsGenemark frame-by-frame M.Shmatkov, Anton and A.Melikyan, Arik and L.Chernousko, Felix and Borodovsky, Mark Finding prokaryotic genes by the frame-by-frame algorithm: targeting gene starts and overlapping genes Bioinfomatics, 1999Frame-by-frameprediction(Left1,Right1), prediction(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | prediction(Left1,Right1). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  10. 10. Overlap resolution Local heuristicsECOPARSE (1) Krogh, Anders and Mian, I. Saira and Haussler, David A hidden Markov model that finds genes in E.coli DNA. Nucl. Acids Res, 22, 1994.Rule 1pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=> overlap_len((Left1,Right1),(Left2,Right2),OverlapLen), length_ratio((Left1,Right1),(Left2,Right2),Ratio), length(Left1,Right1,Length1), length(Left2,Right2,Length2), OverlapLen > 15, Score1 > Score2 ((Length1 > 400, Length2 > 400) ; Ratio > 0.5), | pred(Left1,Right1,Score1). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  11. 11. Overlap resolution Local heuristicsECOPARSE (2)Rule 2pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=> overlap_len((Left1,Right1),(Left2,Right2),OverlapLen), length_ratio((Left1,Right1),(Left2,Right2),Ratio), length(Left1,Right1,Length1), length(Left2,Right2,Length2), OverlapLen > 15, Ratio =< 0.5, Length1 =< Length2 | pred(Left1,Right1,Score1). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  12. 12. Overlap resolution Local heuristicsECOPARSE (3) - Ambiguity!We have two predictions P1 and P2 that overlap each end of a thirdprediction P3 by more than 15 bases. If P1 and P3 is are consideredbefore P2 and P3 then P3 will be removed by the first rule. ConsequentlyP2 does not overlap and is kept. If they are considered in opposite order,however, then P2 will be removed by the second rule and subsequently P3is removed by the first rule. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  13. 13. Overlap resolution Local heuristicsLocal heuristics and confluence the final set will be same ⇔ program is confluent, i.e. it is not sensitive to the order of execution. With the simple Genemark rule it does not matter in which order the predictions are processed. With slightly more complicated heuristics (such as ECOPARSE), programs tend not to be confluent. Conclusion: Execution strategy is important! Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  14. 14. Overlap resolution Global optimizationPruning step as a Constraint Optimization Problem (1) CSP may have multiple solutions! We want the “best” solution, meaning many correct predictions, few incorrect. Set of real genes are not known in advance! Instead, optimize the probability that the predictions are correct (confidence scores from gene finder). Extends the problem as a constraint optimization problem.DefinitionA Constraint Optimization Problem (COP) is a CSP where each solution isassociated with a cost and the goal is to find a solution with minimal cost. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  15. 15. Overlap resolution Global optimizationCOP formulationWe would like the final set to reflect the relative confidence scores in thepredictions assigned by the gene finder and at the same time be consistentwith the overlap constraints.COP formulationLet the scores of p1 . . . pn be s1 . . . sn and si ∈ R+ .Maximize n si , subject to C. i=1 Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  16. 16. Overlap resolution Global optimizationVariable orderingAssume an ordering on the variables, Initial set predictions p1 . . . pn are sorted by the position of their left-most base, such that ∀pi , pj , i < j ⇒ left-most(pi ) ≤ left-most(pj ). The variables x1 . . . xn of the CSP/COP are given the same ordering. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  17. 17. Overlap resolution Global optimizationCOP implementation with Markov Chain (1)We propose to use a (constrained) Markov chain for the COP. The Markov chain has a begin state, an end state and two states for each variable xi corresponding to its boolean domain D(xi ). The state corresponding to D(xi ) = true is denoted αi and the state corresponding to D(xi ) = false is denoted βi . In this model, a path from the begin state to the end state corresponds to a potential solution of the CSP. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  18. 18. Overlap resolution Global optimizationFrom confidence scores to transition probabilities P(α1 |begin) = σ1 P(β1 |begin) = 1 − σ1 P(end|αn ) = P(end|βn ) = 1. P(αi |αi−1 ) = P(αi |βi−1 ) = σi P(βi |αi−1 ) = P(βi |βi−1 ) = 1 − σi (0.5 − λ) × (si − min(s1 . . . sn )) σi = 0.5 + λ + max(s1 . . . sn ) − min(s1 . . . sn ) Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  19. 19. Overlap resolution Global optimizationConstraining the Markov chain Constraints are defined on states that are not allowed to occur together in a path. CHR rules similar to those of the local heuristics. Instead of removing predictions they define conditions for inconsistency (inconsistency rules).Inconsistency rulesInconsistency rules match predictions corresponding to α and β states inthe head of the rule. The guard of the rule ensures that the additionalcriteria for rule application are met and the implication of the rules isalways failure.Such rules are necessarily confluent! Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  20. 20. Overlap resolution Global optimizationExample: Glimmer 3 Delcher, Arthur L. and Bratke, Kirsten A. and Powers, Edwin C. and Salzberg, Steven L. Identifying bacterial genes and endosymbiont DNA with Glimmer Bioinformatics, 23, 2007.Glimmers inconsistency rulealpha(Left1,Right1), alpha(Left2,Right2) <=> overlap_length((Left1,Right1),(Left2,Right2),OverlapLen), OverlapLen > 110 | fail. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  21. 21. Overlap resolution Global optimizationInconsistency rules, Genemark and ECOGENEGenemark inconsistency rulesalpha(Left1,Right1), alpha(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | fail.beta(Left1,Right1), alpha(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | fail.ECOGENE inconsistency rulesalpha(Left1,Right1), alpha(Left2,Right2) <=> ... | fail.beta(Left1,Right1), alpha(Left2,Right2) <=> ... | fail. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  22. 22. ImplementationImplementation in PRISMImplemented as a constrained markov Chain in PRISM.PRISM Markov chainvalues(choose(_), [alpha,beta]).markov_chain(P,_) :- last_prediction_id(LPID), P > LPID.markov_chain(P,Store1) :- last_prediction_id(N), P =< N, msw(choose(P), AlphaBeta), X =.. [ AlphaBeta, P ], check_constraints(X,Store1,Store2), NextPrediction is P + 1, markov_chain(NextPrediction,Store2). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  23. 23. ImplementationConstraint checking and relevant recent states Maintain a list of recent states (mi ) sorted by the right-most position of the corresponding predictions. Constraints are only checked for predictions corresponding to elements of mi .In step i, we construct mi as the maximal prefix of xi + mi−1 , such thatxj ∈ mi ⇐⇒ right-most(pj ) ≥ left-most(pi ). If the constraints propagatefail, then the PRISM derivation fails and the (partial) path it representsis pruned from the solution space.The most probable consistent path is found using PRISMs genericadaptation of the Viterbi algorithm for PRISM programs. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  24. 24. EvaluationAre the constraints realistic assumptions?DefinitionWe consider constraints safe when the constraints prune only falsepositives.Neither of the examined constraints are safe with respect the ”known”genes in E.coli, (NC 000913). Genemark constraint removes three completely overlapped known genes. Four overlaps longer than 110 bases potentially removed Glimmer 3 constraint. 93 overlaps longer than 15 bases potentially removed by ECOGENE constraints. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  25. 25. EvaluationExperimental results Prediction on E.coli. using simplistic codon frequency based gene finder. Pruning using our global optimization approach (with all inconsistency rules) versus local heuristic rules1 . Method #predictions Sensitivity Specificity Time (seconds) initial set 10799 0.7625 0.2926 na Genemark rules 5823 0.7558 0.5379 1.4 ECOGENE rules 4981 0.7148 0.5947 1.7global optimization 5222 0.7201 0.5714 75 Sensitivity = fraction of known reference genes predicted. Specificity = fraction of predicted genes that are correct. 1 Note that the results for the ECOGENE heuristic may vary depending on executionstrategy - in case of above results, predictions with lower left position are consideredfirst. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  26. 26. Summary and future directionsSummaryWe presented A unified view overlap resolution using Constraint Handling Rules. a novel way to post-process gene prediction results based on constrained global optimization that provides an optimality guarantee wrt. to gene finder confidence scores and it has similar accuracy to the heuristic methods. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  27. 27. Summary and future directionsFuture directions Better modeling transition probabilities using assumed gene-finder false positive rates. Soft constraints - since constraints are unsafe. Use technique as a combiner: Include more gene finder predictions in initial set. Include and experiment with other constraints. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  28. 28. Summary and future directionsQuestions Questions? Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution

×