SlideShare a Scribd company logo
1 of 28
Download to read offline
Constraints and Global Optimization
            for Gene Prediction Overlap Resolution

                            Christian Theil Have

         Research group PLIS: Programming, Logic and Intelligent Systems
       Department of Communication, Business and Information Technologies
          Roskilde University, P.O.Box 260, DK-4000 Roskilde, Denmark


                 WCB 2011 in Perugia, September 12, 2011




Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Outline


1   Motivation and background
     Gene finding

2   Overlap resolution
      Local heuristics
      Global optimization

3   Implementation

4   Evaluation

5   Summary and future directions



     Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Motivation and background


Motivation




   In the LoSt project we explore the use of Probabilistic Logic
   programming (with PRISM) for biological sequence analysis.
   In particular, we have been concerned with applying these techniques
   to gene finding for prokaryotes.




   Christian Theil Have        Constraints and Global Optimization for Gene Prediction Overlap Resolution
Motivation and background    Gene finding


Gene finding



A traditional view,
    Considered as a sequence classification task.
    Performed without much context.
    Ignores constraints between predicted genes and their placement in
    the genome.
    Problem with overlapping genes.
    Over-prediction of overlapping genes and shadow genes.




     Christian Theil Have        Constraints and Global Optimization for Gene Prediction Overlap Resolution
Motivation and background    Gene finding


Two step approach




A divide and conquer two step approach to gene finding,
  1   Gene prediction: A gene finder supplies a set of candidate predictions
      p1 . . . pn , called the initial set.
  2   Pruning: The initial set is pruned according to certain rules or
      constraints. We call the pruned set the final set.




      Christian Theil Have        Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution


Pruning step as a Constraint Satisfaction Problem



Definition
A Constraint Satisfaction Problem is a triplet X , D, C . X is a set of n
variables, X = x1 , . . . , xn , with domains D = D(x1 ), . . . , D(xn ). The
constraints C impose restrictions on possible assignments for sets of
variables. A solution is an assignment of a value v ∈ D(xi ) to each
variable xi ∈ X , consistent with C .
We introduce variables X = xi . . . xn corresponding to each prediction
p1 . . . pn in the initial set. All variables have boolean domains,
∀xi ∈ X , D(xi ) = {true, false} and xi = true ⇔ pi ∈ final set.




     Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Local heuristics


Local heuristic techniques



    Local heuristic pruning rules to post-process a set of gene predictions.
    Deletes inconsistent predictions based on various criteria.
    Pruning decisions based on the context of only a subset of the
    predictions.

CSP formulation
Pruning xi just means D(xi ) = D(xi )true ⇔ D(xi ) = false.




    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Local heuristics


Local heuristic techniques as Constraint Handling Rules




Pruning is conveniently expressed as CHR simplification rules.
  1   Constraint store starts out as the initial set.
  2   Simplification rules prune predictions from the constraint store, until
      no more rules apply.
  3   Final constraint store represents the final set.




      Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Local heuristics


Genemark frame-by-frame


   M.Shmatkov, Anton and A.Melikyan, Arik and L.Chernousko, Felix
   and Borodovsky, Mark
   Finding prokaryotic genes by the frame-by-frame algorithm: targeting
   gene starts and overlapping genes
   Bioinfomatics, 1999
Frame-by-frame
prediction(Left1,Right1), prediction(Left2,Right2) <=>
         Left1 =< Left2,
         Right1 >= Right2
         |
         prediction(Left1,Right1).



   Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Local heuristics


ECOPARSE (1)

   Krogh, Anders and Mian, I. Saira and Haussler, David
   A hidden Markov model that finds genes in E.coli DNA.
   Nucl. Acids Res, 22, 1994.
Rule 1
pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=>
     overlap_len((Left1,Right1),(Left2,Right2),OverlapLen),
     length_ratio((Left1,Right1),(Left2,Right2),Ratio),
     length(Left1,Right1,Length1),
     length(Left2,Right2,Length2),
     OverlapLen > 15, Score1 > Score2
     ((Length1 > 400, Length2 > 400) ; Ratio > 0.5),
     |
     pred(Left1,Right1,Score1).

   Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Local heuristics


ECOPARSE (2)



Rule 2
pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=>
     overlap_len((Left1,Right1),(Left2,Right2),OverlapLen),
     length_ratio((Left1,Right1),(Left2,Right2),Ratio),
     length(Left1,Right1,Length1),
     length(Left2,Right2,Length2),
     OverlapLen > 15, Ratio =< 0.5, Length1 =< Length2
     |
     pred(Left1,Right1,Score1).




   Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Local heuristics


ECOPARSE (3) - Ambiguity!




We have two predictions P1 and P2 that overlap each end of a third
prediction P3 by more than 15 bases. If P1 and P3 is are considered
before P2 and P3 then P3 will be removed by the first rule. Consequently
P2 does not overlap and is kept. If they are considered in opposite order,
however, then P2 will be removed by the second rule and subsequently P3
is removed by the first rule.



    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Local heuristics


Local heuristics and confluence



   the final set will be same ⇔ program is confluent, i.e. it is not
   sensitive to the order of execution.
   With the simple Genemark rule it does not matter in which order the
   predictions are processed.
   With slightly more complicated heuristics (such as ECOPARSE),
   programs tend not to be confluent.
   Conclusion: Execution strategy is important!




    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


Pruning step as a Constraint Optimization Problem (1)


    CSP may have multiple solutions!
    We want the “best” solution, meaning many correct predictions, few
    incorrect.
    Set of real genes are not known in advance!
    Instead, optimize the probability that the predictions are correct
    (confidence scores from gene finder).
    Extends the problem as a constraint optimization problem.

Definition
A Constraint Optimization Problem (COP) is a CSP where each solution is
associated with a cost and the goal is to find a solution with minimal cost.



    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


COP formulation



We would like the final set to reflect the relative confidence scores in the
predictions assigned by the gene finder and at the same time be consistent
with the overlap constraints.
COP formulation
Let the scores of p1 . . . pn be s1 . . . sn and si ∈ R+ .
Maximize n si , subject to C.
             i=1




     Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


Variable ordering




Assume an ordering on the variables,
    Initial set predictions p1 . . . pn are sorted by the position of their
    left-most base, such that
    ∀pi , pj , i < j ⇒ left-most(pi ) ≤ left-most(pj ).
    The variables x1 . . . xn of the CSP/COP are given the same ordering.




    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


COP implementation with Markov Chain (1)
We propose to use a (constrained) Markov chain for the COP.
    The Markov chain has a begin state, an end state and two states for
    each variable xi corresponding to its boolean domain D(xi ).
    The state corresponding to D(xi ) = true is denoted αi and the state
    corresponding to D(xi ) = false is denoted βi .
    In this model, a path from the begin state to the end state
    corresponds to a potential solution of the CSP.




    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


From confidence scores to transition probabilities
    P(α1 |begin) = σ1
    P(β1 |begin) = 1 − σ1
    P(end|αn ) = P(end|βn ) = 1.
    P(αi |αi−1 ) = P(αi |βi−1 ) = σi
    P(βi |αi−1 ) = P(βi |βi−1 ) = 1 − σi

                                    (0.5 − λ) × (si − min(s1 . . . sn ))
                 σi = 0.5 + λ +
                                     max(s1 . . . sn ) − min(s1 . . . sn )




    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


Constraining the Markov chain

    Constraints are defined on states that are not allowed to occur
    together in a path.
    CHR rules similar to those of the local heuristics.
    Instead of removing predictions they define conditions for
    inconsistency (inconsistency rules).

Inconsistency rules
Inconsistency rules match predictions corresponding to α and β states in
the head of the rule. The guard of the rule ensures that the additional
criteria for rule application are met and the implication of the rules is
always failure.
Such rules are necessarily confluent!

    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


Example: Glimmer 3


   Delcher, Arthur L. and Bratke, Kirsten A. and Powers, Edwin C. and
   Salzberg, Steven L.
   Identifying bacterial genes and endosymbiont DNA with Glimmer
   Bioinformatics, 23, 2007.
Glimmers inconsistency rule
alpha(Left1,Right1), alpha(Left2,Right2) <=>
    overlap_length((Left1,Right1),(Left2,Right2),OverlapLen),
    OverlapLen > 110
    |
    fail.




   Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Overlap resolution   Global optimization


Inconsistency rules, Genemark and ECOGENE



Genemark inconsistency rules
alpha(Left1,Right1), alpha(Left2,Right2) <=>
    Left1 =< Left2, Right1 >= Right2 | fail.
beta(Left1,Right1), alpha(Left2,Right2) <=>
    Left1 =< Left2, Right1 >= Right2 | fail.

ECOGENE inconsistency rules
alpha(Left1,Right1), alpha(Left2,Right2) <=> ... | fail.
beta(Left1,Right1), alpha(Left2,Right2) <=> ... | fail.




   Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Implementation


Implementation in PRISM

Implemented as a constrained markov Chain in PRISM.
PRISM Markov chain
values(choose(_), [alpha,beta]).

markov_chain(P,_) :-
    last_prediction_id(LPID), P > LPID.

markov_chain(P,Store1) :-
    last_prediction_id(N), P =< N,
    msw(choose(P), AlphaBeta),
    X =.. [ AlphaBeta, P ],
    check_constraints(X,Store1,Store2),
    NextPrediction is P + 1,
    markov_chain(NextPrediction,Store2).

    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Implementation


Constraint checking and relevant recent states


    Maintain a list of recent states (mi ) sorted by the right-most position
    of the corresponding predictions.
    Constraints are only checked for predictions corresponding to
    elements of mi .
In step i, we construct mi as the maximal prefix of xi + mi−1 , such that
xj ∈ mi ⇐⇒ right-most(pj ) ≥ left-most(pi ). If the constraints propagate
fail, then the PRISM derivation fails and the (partial) path it represents
is pruned from the solution space.
The most probable consistent path is found using PRISMs generic
adaptation of the Viterbi algorithm for PRISM programs.



    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Evaluation


Are the constraints realistic assumptions?


Definition
We consider constraints safe when the constraints prune only false
positives.
Neither of the examined constraints are safe with respect the ”known”
genes in E.coli, (NC 000913).
    Genemark constraint removes three completely overlapped known
    genes.
    Four overlaps longer than 110 bases potentially removed Glimmer 3
    constraint.
    93 overlaps longer than 15 bases potentially removed by ECOGENE
    constraints.


    Christian Theil Have   Constraints and Global Optimization for Gene Prediction Overlap Resolution
Evaluation


Experimental results
     Prediction on E.coli. using simplistic codon frequency based gene
     finder.
     Pruning using our global optimization approach (with all
     inconsistency rules) versus local heuristic rules1 .

      Method                #predictions            Sensitivity     Specificity          Time (seconds)
     initial set              10799                  0.7625          0.2926                   na
  Genemark rules               5823                  0.7558          0.5379                  1.4
 ECOGENE rules                 4981                  0.7148          0.5947                  1.7
global optimization            5222                  0.7201          0.5714                   75


     Sensitivity = fraction of known reference genes predicted.
     Specificity = fraction of predicted genes that are correct.
    1
      Note that the results for the ECOGENE heuristic may vary depending on execution
strategy - in case of above results, predictions with lower left position are considered
first.
     Christian Theil Have    Constraints and Global Optimization for Gene Prediction Overlap Resolution
Summary and future directions


Summary



We presented
    A unified view overlap resolution using Constraint Handling Rules.
    a novel way to post-process gene prediction results based on
    constrained global optimization
    that provides an optimality guarantee wrt. to gene finder confidence
    scores
    and it has similar accuracy to the heuristic methods.




    Christian Theil Have           Constraints and Global Optimization for Gene Prediction Overlap Resolution
Summary and future directions


Future directions




    Better modeling transition probabilities using assumed gene-finder
    false positive rates.
    Soft constraints - since constraints are unsafe.
    Use technique as a combiner: Include more gene finder predictions in
    initial set.
    Include and experiment with other constraints.




    Christian Theil Have           Constraints and Global Optimization for Gene Prediction Overlap Resolution
Summary and future directions


Questions




                                                Questions?




   Christian Theil Have           Constraints and Global Optimization for Gene Prediction Overlap Resolution

More Related Content

Similar to Constraints and Global Optimization for Gene Prediction Overlap Resolution

Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
 
Risk Classification with an Adaptive Naive Bayes Kernel Machine Model
Risk Classification with an Adaptive Naive Bayes Kernel Machine ModelRisk Classification with an Adaptive Naive Bayes Kernel Machine Model
Risk Classification with an Adaptive Naive Bayes Kernel Machine ModelJessica Minnier
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Ana Luísa Pinho
 
On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1VitAnhNguyn94
 
Exact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeBigMine
 
Particle swarm optimization
Particle swarm optimizationParticle swarm optimization
Particle swarm optimizationMahesh Tibrewal
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural NetworksNatan Katz
 
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeSean Davis
 
Predicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learningPredicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learningPatricia Francis-Lyon
 
Thesis seminar
Thesis seminarThesis seminar
Thesis seminargvesom
 
Lecture 2 fuzzy inference system
Lecture 2  fuzzy inference systemLecture 2  fuzzy inference system
Lecture 2 fuzzy inference systemParveenMalik18
 
PSA pattern to predict CRPC
PSA pattern to predict CRPCPSA pattern to predict CRPC
PSA pattern to predict CRPCYejin Kim
 
Machine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm FundamentalMachine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm FundamentalAnggi Andriyadi
 
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian StatisticsFoundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian StatisticsAndres Lopez-Sepulcre
 

Similar to Constraints and Global Optimization for Gene Prediction Overlap Resolution (20)

Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
GA.pptx
GA.pptxGA.pptx
GA.pptx
 
Genetic Algorithm
Genetic AlgorithmGenetic Algorithm
Genetic Algorithm
 
Risk Classification with an Adaptive Naive Bayes Kernel Machine Model
Risk Classification with an Adaptive Naive Bayes Kernel Machine ModelRisk Classification with an Adaptive Naive Bayes Kernel Machine Model
Risk Classification with an Adaptive Naive Bayes Kernel Machine Model
 
Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...Functional specialization in human cognition: a large-scale neuroimaging init...
Functional specialization in human cognition: a large-scale neuroimaging init...
 
On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1
 
Ga
GaGa
Ga
 
Exact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping YeExact Data Reduction for Big Data by Jieping Ye
Exact Data Reduction for Big Data by Jieping Ye
 
Particle swarm optimization
Particle swarm optimizationParticle swarm optimization
Particle swarm optimization
 
Bayesian Neural Networks
Bayesian Neural NetworksBayesian Neural Networks
Bayesian Neural Networks
 
RNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the TranscriptomeRNA-seq: A High-resolution View of the Transcriptome
RNA-seq: A High-resolution View of the Transcriptome
 
Predicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learningPredicting phenotype from genotype with machine learning
Predicting phenotype from genotype with machine learning
 
Cec09
Cec09Cec09
Cec09
 
MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...
MUMS: Bayesian, Fiducial, and Frequentist Conference - Objective Bayesian Ana...
 
Thesis seminar
Thesis seminarThesis seminar
Thesis seminar
 
Lecture 2 fuzzy inference system
Lecture 2  fuzzy inference systemLecture 2  fuzzy inference system
Lecture 2 fuzzy inference system
 
PSA pattern to predict CRPC
PSA pattern to predict CRPCPSA pattern to predict CRPC
PSA pattern to predict CRPC
 
Machine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm FundamentalMachine Learning - Genetic Algorithm Fundamental
Machine Learning - Genetic Algorithm Fundamental
 
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian StatisticsFoundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
Foundations of Statistics in Ecology and Evolution. 8. Bayesian Statistics
 
Healthcare
HealthcareHealthcare
Healthcare
 

More from Christian Have

Efficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence AnalysisEfficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence AnalysisChristian Have
 
Efficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence AnalysisEfficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence AnalysisChristian Have
 
Efficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program TransformationEfficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program TransformationChristian Have
 
Nagios præsentation (på dansk)
Nagios præsentation (på dansk)Nagios præsentation (på dansk)
Nagios præsentation (på dansk)Christian Have
 
Stochastic Definite Clause Grammars
Stochastic Definite Clause GrammarsStochastic Definite Clause Grammars
Stochastic Definite Clause GrammarsChristian Have
 
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...Christian Have
 
Inference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISMInference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISMChristian Have
 

More from Christian Have (7)

Efficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence AnalysisEfficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence Analysis
 
Efficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence AnalysisEfficient Probabilistic Logic Programming for Biological Sequence Analysis
Efficient Probabilistic Logic Programming for Biological Sequence Analysis
 
Efficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program TransformationEfficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program Transformation
 
Nagios præsentation (på dansk)
Nagios præsentation (på dansk)Nagios præsentation (på dansk)
Nagios præsentation (på dansk)
 
Stochastic Definite Clause Grammars
Stochastic Definite Clause GrammarsStochastic Definite Clause Grammars
Stochastic Definite Clause Grammars
 
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
 
Inference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISMInference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISM
 

Recently uploaded

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Constraints and Global Optimization for Gene Prediction Overlap Resolution

  • 1. Constraints and Global Optimization for Gene Prediction Overlap Resolution Christian Theil Have Research group PLIS: Programming, Logic and Intelligent Systems Department of Communication, Business and Information Technologies Roskilde University, P.O.Box 260, DK-4000 Roskilde, Denmark WCB 2011 in Perugia, September 12, 2011 Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 2. Outline 1 Motivation and background Gene finding 2 Overlap resolution Local heuristics Global optimization 3 Implementation 4 Evaluation 5 Summary and future directions Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 3. Motivation and background Motivation In the LoSt project we explore the use of Probabilistic Logic programming (with PRISM) for biological sequence analysis. In particular, we have been concerned with applying these techniques to gene finding for prokaryotes. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 4. Motivation and background Gene finding Gene finding A traditional view, Considered as a sequence classification task. Performed without much context. Ignores constraints between predicted genes and their placement in the genome. Problem with overlapping genes. Over-prediction of overlapping genes and shadow genes. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 5. Motivation and background Gene finding Two step approach A divide and conquer two step approach to gene finding, 1 Gene prediction: A gene finder supplies a set of candidate predictions p1 . . . pn , called the initial set. 2 Pruning: The initial set is pruned according to certain rules or constraints. We call the pruned set the final set. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 6. Overlap resolution Pruning step as a Constraint Satisfaction Problem Definition A Constraint Satisfaction Problem is a triplet X , D, C . X is a set of n variables, X = x1 , . . . , xn , with domains D = D(x1 ), . . . , D(xn ). The constraints C impose restrictions on possible assignments for sets of variables. A solution is an assignment of a value v ∈ D(xi ) to each variable xi ∈ X , consistent with C . We introduce variables X = xi . . . xn corresponding to each prediction p1 . . . pn in the initial set. All variables have boolean domains, ∀xi ∈ X , D(xi ) = {true, false} and xi = true ⇔ pi ∈ final set. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 7. Overlap resolution Local heuristics Local heuristic techniques Local heuristic pruning rules to post-process a set of gene predictions. Deletes inconsistent predictions based on various criteria. Pruning decisions based on the context of only a subset of the predictions. CSP formulation Pruning xi just means D(xi ) = D(xi )true ⇔ D(xi ) = false. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 8. Overlap resolution Local heuristics Local heuristic techniques as Constraint Handling Rules Pruning is conveniently expressed as CHR simplification rules. 1 Constraint store starts out as the initial set. 2 Simplification rules prune predictions from the constraint store, until no more rules apply. 3 Final constraint store represents the final set. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 9. Overlap resolution Local heuristics Genemark frame-by-frame M.Shmatkov, Anton and A.Melikyan, Arik and L.Chernousko, Felix and Borodovsky, Mark Finding prokaryotic genes by the frame-by-frame algorithm: targeting gene starts and overlapping genes Bioinfomatics, 1999 Frame-by-frame prediction(Left1,Right1), prediction(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | prediction(Left1,Right1). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 10. Overlap resolution Local heuristics ECOPARSE (1) Krogh, Anders and Mian, I. Saira and Haussler, David A hidden Markov model that finds genes in E.coli DNA. Nucl. Acids Res, 22, 1994. Rule 1 pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=> overlap_len((Left1,Right1),(Left2,Right2),OverlapLen), length_ratio((Left1,Right1),(Left2,Right2),Ratio), length(Left1,Right1,Length1), length(Left2,Right2,Length2), OverlapLen > 15, Score1 > Score2 ((Length1 > 400, Length2 > 400) ; Ratio > 0.5), | pred(Left1,Right1,Score1). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 11. Overlap resolution Local heuristics ECOPARSE (2) Rule 2 pred(Left1,Right1,Score1), pred(Left2,Right2,Score2) <=> overlap_len((Left1,Right1),(Left2,Right2),OverlapLen), length_ratio((Left1,Right1),(Left2,Right2),Ratio), length(Left1,Right1,Length1), length(Left2,Right2,Length2), OverlapLen > 15, Ratio =< 0.5, Length1 =< Length2 | pred(Left1,Right1,Score1). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 12. Overlap resolution Local heuristics ECOPARSE (3) - Ambiguity! We have two predictions P1 and P2 that overlap each end of a third prediction P3 by more than 15 bases. If P1 and P3 is are considered before P2 and P3 then P3 will be removed by the first rule. Consequently P2 does not overlap and is kept. If they are considered in opposite order, however, then P2 will be removed by the second rule and subsequently P3 is removed by the first rule. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 13. Overlap resolution Local heuristics Local heuristics and confluence the final set will be same ⇔ program is confluent, i.e. it is not sensitive to the order of execution. With the simple Genemark rule it does not matter in which order the predictions are processed. With slightly more complicated heuristics (such as ECOPARSE), programs tend not to be confluent. Conclusion: Execution strategy is important! Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 14. Overlap resolution Global optimization Pruning step as a Constraint Optimization Problem (1) CSP may have multiple solutions! We want the “best” solution, meaning many correct predictions, few incorrect. Set of real genes are not known in advance! Instead, optimize the probability that the predictions are correct (confidence scores from gene finder). Extends the problem as a constraint optimization problem. Definition A Constraint Optimization Problem (COP) is a CSP where each solution is associated with a cost and the goal is to find a solution with minimal cost. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 15. Overlap resolution Global optimization COP formulation We would like the final set to reflect the relative confidence scores in the predictions assigned by the gene finder and at the same time be consistent with the overlap constraints. COP formulation Let the scores of p1 . . . pn be s1 . . . sn and si ∈ R+ . Maximize n si , subject to C. i=1 Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 16. Overlap resolution Global optimization Variable ordering Assume an ordering on the variables, Initial set predictions p1 . . . pn are sorted by the position of their left-most base, such that ∀pi , pj , i < j ⇒ left-most(pi ) ≤ left-most(pj ). The variables x1 . . . xn of the CSP/COP are given the same ordering. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 17. Overlap resolution Global optimization COP implementation with Markov Chain (1) We propose to use a (constrained) Markov chain for the COP. The Markov chain has a begin state, an end state and two states for each variable xi corresponding to its boolean domain D(xi ). The state corresponding to D(xi ) = true is denoted αi and the state corresponding to D(xi ) = false is denoted βi . In this model, a path from the begin state to the end state corresponds to a potential solution of the CSP. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 18. Overlap resolution Global optimization From confidence scores to transition probabilities P(α1 |begin) = σ1 P(β1 |begin) = 1 − σ1 P(end|αn ) = P(end|βn ) = 1. P(αi |αi−1 ) = P(αi |βi−1 ) = σi P(βi |αi−1 ) = P(βi |βi−1 ) = 1 − σi (0.5 − λ) × (si − min(s1 . . . sn )) σi = 0.5 + λ + max(s1 . . . sn ) − min(s1 . . . sn ) Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 19. Overlap resolution Global optimization Constraining the Markov chain Constraints are defined on states that are not allowed to occur together in a path. CHR rules similar to those of the local heuristics. Instead of removing predictions they define conditions for inconsistency (inconsistency rules). Inconsistency rules Inconsistency rules match predictions corresponding to α and β states in the head of the rule. The guard of the rule ensures that the additional criteria for rule application are met and the implication of the rules is always failure. Such rules are necessarily confluent! Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 20. Overlap resolution Global optimization Example: Glimmer 3 Delcher, Arthur L. and Bratke, Kirsten A. and Powers, Edwin C. and Salzberg, Steven L. Identifying bacterial genes and endosymbiont DNA with Glimmer Bioinformatics, 23, 2007. Glimmers inconsistency rule alpha(Left1,Right1), alpha(Left2,Right2) <=> overlap_length((Left1,Right1),(Left2,Right2),OverlapLen), OverlapLen > 110 | fail. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 21. Overlap resolution Global optimization Inconsistency rules, Genemark and ECOGENE Genemark inconsistency rules alpha(Left1,Right1), alpha(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | fail. beta(Left1,Right1), alpha(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | fail. ECOGENE inconsistency rules alpha(Left1,Right1), alpha(Left2,Right2) <=> ... | fail. beta(Left1,Right1), alpha(Left2,Right2) <=> ... | fail. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 22. Implementation Implementation in PRISM Implemented as a constrained markov Chain in PRISM. PRISM Markov chain values(choose(_), [alpha,beta]). markov_chain(P,_) :- last_prediction_id(LPID), P > LPID. markov_chain(P,Store1) :- last_prediction_id(N), P =< N, msw(choose(P), AlphaBeta), X =.. [ AlphaBeta, P ], check_constraints(X,Store1,Store2), NextPrediction is P + 1, markov_chain(NextPrediction,Store2). Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 23. Implementation Constraint checking and relevant recent states Maintain a list of recent states (mi ) sorted by the right-most position of the corresponding predictions. Constraints are only checked for predictions corresponding to elements of mi . In step i, we construct mi as the maximal prefix of xi + mi−1 , such that xj ∈ mi ⇐⇒ right-most(pj ) ≥ left-most(pi ). If the constraints propagate fail, then the PRISM derivation fails and the (partial) path it represents is pruned from the solution space. The most probable consistent path is found using PRISMs generic adaptation of the Viterbi algorithm for PRISM programs. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 24. Evaluation Are the constraints realistic assumptions? Definition We consider constraints safe when the constraints prune only false positives. Neither of the examined constraints are safe with respect the ”known” genes in E.coli, (NC 000913). Genemark constraint removes three completely overlapped known genes. Four overlaps longer than 110 bases potentially removed Glimmer 3 constraint. 93 overlaps longer than 15 bases potentially removed by ECOGENE constraints. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 25. Evaluation Experimental results Prediction on E.coli. using simplistic codon frequency based gene finder. Pruning using our global optimization approach (with all inconsistency rules) versus local heuristic rules1 . Method #predictions Sensitivity Specificity Time (seconds) initial set 10799 0.7625 0.2926 na Genemark rules 5823 0.7558 0.5379 1.4 ECOGENE rules 4981 0.7148 0.5947 1.7 global optimization 5222 0.7201 0.5714 75 Sensitivity = fraction of known reference genes predicted. Specificity = fraction of predicted genes that are correct. 1 Note that the results for the ECOGENE heuristic may vary depending on execution strategy - in case of above results, predictions with lower left position are considered first. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 26. Summary and future directions Summary We presented A unified view overlap resolution using Constraint Handling Rules. a novel way to post-process gene prediction results based on constrained global optimization that provides an optimality guarantee wrt. to gene finder confidence scores and it has similar accuracy to the heuristic methods. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 27. Summary and future directions Future directions Better modeling transition probabilities using assumed gene-finder false positive rates. Soft constraints - since constraints are unsafe. Use technique as a combiner: Include more gene finder predictions in initial set. Include and experiment with other constraints. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution
  • 28. Summary and future directions Questions Questions? Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution