ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Constraints for Biological Sequence Analysis

  1. Logic-Statistic Models with Constraints for Biological Sequence Analysis
     Christian Theil Have, <cth@ruc.dk>
     Programming, Logic and Intelligent Systems, plis.ruc.dk
     CBIT, Roskilde University, Denmark
  2. Motivation and outline
     ● Short motivation and introduction to biological sequence analysis
     ● Different ways of integrating constraints with probabilistic models
     ● Combining models with constraints
  3. Biological sequence analysis
     The basic problems:
     ● Alignment of biological sequences
     ● Phylogeny
     ● Gene prediction
     ● RNA secondary structure prediction
     ● Protein structure prediction
     ● Protein function prediction
  4. Biological sequence analysis
     The basic problems:
     ● Alignment of biological sequences
     ● Phylogeny
     ➔ Gene prediction
     ● RNA secondary structure prediction
     ● Protein structure prediction
     ● Protein function prediction
     We focus on gene prediction for now...
  5. Biological sequence analysis
     Gene prediction: Predict genes and non-genes in a DNA sequence
     ● DNA is composed of nucleotides: A, T, G, C
     AATATAGGCATAGCGCACAGACAGATAAAAATTACA
     GAGTACACAACATCCATGAAACGCATTAGCACCACC
     ATTACCACCACCATCACCATTACCACAGGTAACGGT
     GCGGGCTGAAATATAGGCATAGCGCACAGACAGATA
  6. Biological sequence analysis
     Gene prediction: Predict genes and non-genes in a DNA sequence
     ● DNA is composed of nucleotides: A, T, G, C
     ● Genes are sequences of triplets of nucleotides, called codons
     AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
     GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
     ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
     GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  7. Biological sequence analysis
     Gene prediction: Predict genes and non-genes in a DNA sequence
     ● DNA is composed of nucleotides: A, T, G, C
     ● Genes are sequences of triplets of nucleotides, called codons
     ● Genes can occur on both strands, in three different frames
     AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
     GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
     ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
     GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  8. Biological sequence analysis
     Gene prediction: Predict genes and non-genes in a DNA sequence
     ● DNA is composed of nucleotides: A, T, G, C
     ● Genes are sequences of triplets of nucleotides, called codons
     ● Genes can occur on both strands, in three different frames
     ● Specific start codons signal a possible beginning of a gene
     AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
     GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
     ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
     GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  9. Biological sequence analysis
     Gene prediction: Predict genes and non-genes in a DNA sequence
     ● DNA is composed of nucleotides: A, T, G, C
     ● Genes are sequences of triplets of nucleotides, called codons
     ● Genes can occur on both strands, in three different frames
     ● Specific start codons signal a possible beginning of a gene
     ● Specific stop codons definitively signal the end of a gene
     AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
     GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
     ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
     GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  10. Biological sequence analysis
      Gene prediction: Predict genes and non-genes in a DNA sequence
      ● DNA is composed of nucleotides: A, T, G, C
      ● Genes are sequences of triplets of nucleotides, called codons
      ● Genes can occur on both strands, in three different frames
      ● Specific start codons signal a possible beginning of a gene
      ● Specific stop codons definitively signal the end of a gene
      AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
      GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
      ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
      GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
      ● There are three possible genes in this sample in this frame (on this strand).
  14. Biological sequence analysis
      Gene prediction: Predict genes and non-genes in a DNA sequence
      ● DNA is composed of nucleotides: A, T, G, C
      ● Genes are sequences of triplets of nucleotides, called codons
      ● Genes can occur on both strands, in three different frames
      ● Specific start codons signal a possible beginning of a gene
      ● Specific stop codons definitively signal the end of a gene
      AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
      GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
      ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
      GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
      ● There are three possible genes in this sample in this frame (on this strand).
      ● In general, the number of possible gene compositions of a DNA sequence is exponential in its length.
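The codon rules above can be sketched as a scan over one reading frame of one strand. This is a minimal illustration, not the method used in the talk: the function name `find_orfs`, the default start codon set (`ATG` only) and the toy sequence are assumptions; the stop codons are those of the standard genetic code. Every in-frame start codon seen before the next stop codon is reported as a candidate, which is one reason the number of candidate annotations grows quickly.

```python
from typing import FrozenSet, List, Tuple

STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})  # standard genetic code


def find_orfs(
    dna: str,
    frame: int = 0,
    start_codons: FrozenSet[str] = frozenset({"ATG"}),  # assumption: ATG only
) -> List[Tuple[int, int]]:
    """Return (start, end) nucleotide offsets of candidate genes in one frame.

    `end` is exclusive and includes the stop codon.
    """
    orfs: List[Tuple[int, int]] = []
    open_starts: List[int] = []  # start codons seen since the last stop codon
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if codon in start_codons:
            open_starts.append(pos)  # a possible beginning of a gene
        elif codon in STOP_CODONS:
            # A stop codon definitively ends every candidate opened so far.
            orfs.extend((start, pos + 3) for start in open_starts)
            open_starts = []
    return orfs


# Toy sequence (hypothetical): CCC ATG AAA ATG CCC TGA CCC
toy = "CCCATGAAAATGCCCTGACCC"
print(find_orfs(toy, frame=0))  # -> [(3, 18), (9, 18)]
print(find_orfs(toy, frame=1))  # -> []
```

Scanning the reverse complement in the same way would cover the other strand.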
  15. Biological sequence analysis, tools of the trade
      ● Statistical models (in order of expressive power)
        ● Hidden Markov Models
        ● Probabilistic Context Free Grammars
        ● Probabilistic Context Sensitive Grammars
        ● Stochastic Definite Clause Grammars
      ● All these can be modeled in PRISM
        ● Probabilistic extension of Prolog
      ● Problems:
        ● Computational complexity of inference
        ● Extremely large sequences
        ● Use of more expressive models infeasible
      ● Essential: Enforce the right independence assumptions
        ● Limit the number of conditional probabilities
  16. Gene-finding with Hidden Markov Models
      ● Hidden Markov Models (HMMs) are commonly used for gene prediction
      ● A Hidden Markov Model is a quadruple <S, A, T, E>:
        ● S is a set of states
        ● A is a set of emission symbols
        ● T is a set of transition probabilities
        ● E is a set of emission probabilities
      ● An observation is a sequence of emissions
      ● Transition and emission probabilities can be derived from sample observations through parameter estimation
      ● Decoding finds the most probable sequence of states corresponding to an observation
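As a rough illustration of parameter estimation, the sketch below derives T and E by relative-frequency counting. It assumes that the state sequence is known for each sample, which is the simplest case; with unlabeled observations an iterative method such as Baum-Welch would be needed instead. The function name `estimate_parameters`, the state labels `N` (non-coding) and `C` (coding), and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_parameters(samples):
    """Estimate T and E from (state sequence, observation) pairs by counting."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for states, observation in samples:
        for state, symbol in zip(states, observation):
            emit_counts[state][symbol] += 1
        for prev, nxt in zip(states, states[1:]):
            trans_counts[prev][nxt] += 1

    def normalize(counts):
        # Turn each row of counts into a probability distribution.
        return {
            key: {x: c / sum(row.values()) for x, c in row.items()}
            for key, row in counts.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


# Hypothetical labeled samples: one state label per nucleotide.
samples = [("NNCC", "ATGG"), ("NCCN", "AGGT")]
T, E = estimate_parameters(samples)
print(T["N"])  # N -> N once, N -> C twice
print(E["C"])  # C always emits G in this toy data
```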
  17. Gene-finding with Hidden Markov Models
      Example: Toy HMM for gene-finding.
  18. Decoding: The Viterbi algorithm
      Finding the most probable path for a given sequence:
        argmax P(state sequence | observation)
      Method:
      ● Incrementally keep track of the most probable path to a given state
      ● Dynamic programming (tabling in Prolog/PRISM)
      [Figure: trellis with axes "States" and "Time steps (observation)"]
      Time complexity: O(|states|^2 * |observation|) in the worst case (fully connected model), i.e. linear in the length of the observation
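A minimal Python sketch of the dynamic programme described above, working in log space to avoid underflow on long sequences. The two-state model (`N` for non-coding, `C` for coding), its probabilities and the explicit start distribution `START_P` are assumptions made for illustration; they are not the toy HMM from the slides, and the talk itself relies on tabling in PRISM rather than on code like this.

```python
import math

# Hypothetical toy model: N = non-coding, C = coding.
STATES = ["N", "C"]
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observation, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log probability."""
    # best[t][s]: log probability of the best path ending in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s][observation[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(observation)):
        best.append({})
        back.append({})
        for s in states:
            # The inner max visits every predecessor state.
            prev = max(states, key=lambda r: best[t - 1][r] + log(trans_p[r][s]))
            best[t][s] = (
                best[t - 1][prev] + log(trans_p[prev][s]) + log(emit_p[s][observation[t]])
            )
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observation) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1], best[-1][last]


path, score = viterbi("AATTGGCCGG", STATES, START_P, TRANS_P, EMIT_P)
print("".join(path))  # -> NNNNCCCCCC
```

Only the best score per state and one back-pointer per cell are stored, so memory grows linearly with the observation length.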
  19. Predicting is decoding
      Decoding of an HMM may be considered as an optimization problem:
      ● We have a set of variables T0 .. Tn, one for each time step
      ● A set of constraints, C, on these variables: a state S is in the domain of Ti iff there is a state in the domain of Ti-1 from which there is a transition to S, and S can emit the symbol observed at step i
      ● Goal: Maximize P(state sequence | observation), subject to C
      [Figure: trellis over the variables T0, T1, T2, T3, ..., Tn with axes "States" and "Time steps (observation)"]
      ➔ Accomplished with the Viterbi algorithm in O(|states|^2 * |observation|) worst-case time, using dynamic programming
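The domain constraint C can be sketched as a forward pass that prunes the domain of each variable Ti. This is an illustrative reading of the slide, not code from the talk: the function name `propagate_domains`, the three-state codon-level model and the choice to let T0 contain every state that can emit the first symbol are assumptions.

```python
def propagate_domains(observation, states, transitions, emissions):
    """Compute the domain of each variable T0 .. Tn.

    transitions[r] is the set of states reachable from r;
    emissions[s] is the set of symbols that state s can emit.
    """
    # Assumption: any state that can emit the first symbol may start a path.
    domains = [{s for s in states if observation[0] in emissions[s]}]
    for symbol in observation[1:]:
        previous = domains[-1]
        domains.append({
            s for s in states
            if symbol in emissions[s]
            and any(s in transitions[r] for r in previous)
        })
    return domains


# Hypothetical codon-level model: S = start, C = coding, E = end.
states = ["S", "C", "E"]
transitions = {"S": {"C"}, "C": {"C", "E"}, "E": set()}
emissions = {"S": {"ATG"}, "C": {"AAA", "CCC"}, "E": {"TGA"}}

print(propagate_domains(["ATG", "AAA", "TGA"], states, transitions, emissions))
# -> [{'S'}, {'C'}, {'E'}]
print(propagate_domains(["ATG", "TGA"], states, transitions, emissions))
# -> [{'S'}, set()]  (an empty domain means no valid state sequence exists)
```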
  20. Constraints as model structure
      ● The structure of the HMM consists of
        ● states
        ● allowed transitions between these states
        ● possible emissions from these states
      ● The structure of the HMM defines a regular language
      ● Can model (only) regular languages, but...
      ● Not all regular languages can be modeled equally compactly
      ● Some regular languages require an exponential number of states
      Consider a fully connected automaton with only N states and the constraint all-different: no state visited more than once
  21. Side-constraints
      Side-constraints:
      ● Constraints which are not embedded in the model.
      ● Delimit the allowed derivations.
      [Figure labels: "Statistical Model", "Side-Constraints"]
  22. Side-constraints
      Side-constraints:
      ● Constraints which are not embedded in the model.
      ● Delimit the allowed derivations.
      [Figure labels: "Statistical Model", "Side-Constraints"]
      Advantages
      ✔ Convenient method of expression
      ✔ Can express non-regular languages
      ✔ Does not affect the number of states
  23. Side-constraints
      Side-constraints:
      ● Constraints which are not embedded in the model.
      ● Delimit the allowed derivations.
      [Figure labels: "Statistical Model", "Side-Constraints"]
      Advantages
      ✔ Convenient method of expression
      ✔ Can express non-regular languages
      ✔ Does not affect the number of states
      Problems
      ✗ Models with constraints can fail
        ✗ Probability mass disappears
      ✗ Complicates model inference
        ✗ ERF & Baum-Welch derive wrong distributions
      ✗ Decoding must adhere to constraints
        ✗ Constraint solving techniques needed
        ✗ NP-complete in the general case
  24. Side-constraints
      Side-constraints:
      ● Constraints which are not embedded in the model.
      ● Delimit the allowed derivations.
      [Figure labels: "Statistical Model", "Side-Constraints"]
      Advantages
      ✔ Convenient method of expression
      ✔ Can express non-regular languages
      ✔ Does not affect the number of states
      Problems
      ✗ Models with constraints can fail
        ✗ Probability mass disappears
      ✗ Complicates model inference
        ✗ ERF & Baum-Welch derive wrong distributions
      ✗ Decoding must adhere to constraints
        ✗ Constraint solving techniques needed
        ✗ NP-complete in the general case
      Possible solutions
      Parameter learning:
      ● Training with fgEM / failure-adjusted maximization
        ● Requires failure estimates
      ● Apply soft constraints, which do not fail
      Inference:
      ● Incremental constraint solving
      ● Local constraints
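The "probability mass disappears" problem can be made concrete with a brute-force sketch: enumerate every state path of a tiny Markov chain and add up the probability of the paths that satisfy a side-constraint. The function name `constrained_mass`, the two-state chain and the all-different constraint are assumptions chosen for illustration, and full enumeration is only feasible for very small models.

```python
from itertools import product


def constrained_mass(states, start_p, trans_p, length, constraint):
    """Total probability of the state paths of a given length that satisfy the constraint."""
    total = 0.0
    for path in product(states, repeat=length):
        p = start_p[path[0]]
        for a, b in zip(path, path[1:]):
            p *= trans_p[a][b]
        if constraint(path):
            total += p
    return total


# Hypothetical uniform two-state chain.
states = ["a", "b"]
start_p = {"a": 0.5, "b": 0.5}
trans_p = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.5, "b": 0.5}}


def all_different(path):
    return len(set(path)) == len(path)


mass = constrained_mass(states, start_p, trans_p, 2, all_different)
print(mass)        # -> 0.5
print(1.0 - mass)  # -> 0.5, the probability that a derivation fails
```

The surviving paths no longer sum to one, so estimates that ignore the failed derivations are biased; the failure probability `1 - mass` is the kind of quantity a failure-adjusted training scheme has to account for.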
  25. Example: Fixing known genes
      [Figure: DNA sequence with a region marked "known gene", annotated with the state sequence S C C C C C C C E N N N N]
      ● Difficult/expensive to express in the model structure
        ● The HMM needs to do position counting => many states required!
      ● Easy to model with side-constraints
        ● Local constraint: Affects only a sequential set of variables of limited size
        ● Decoding possible in time linear in the sequence length
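One way to picture such a local side-constraint is a Viterbi pass in which the domain of a fixed time step is reduced to the single known state. The sketch below is an assumption about how this could be done, not the implementation from the talk; the function name `constrained_viterbi`, the `fixed` argument and the two-state model (`N` non-coding, `C` coding) are hypothetical. The model structure is untouched, and the cost per time step does not grow.

```python
import math

# Hypothetical toy model: N = non-coding, C = coding.
STATES = ["N", "C"]
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def constrained_viterbi(observation, states, start_p, trans_p, emit_p, fixed):
    """Most probable path in which step t uses state fixed[t] wherever t is in fixed."""

    def domain(t):
        return [fixed[t]] if t in fixed else list(states)

    best = {s: log(start_p[s]) + log(emit_p[s][observation[0]]) for s in domain(0)}
    back = []
    for t in range(1, len(observation)):
        new_best, pointers = {}, {}
        for s in domain(t):
            # Only states that survived in the previous domain are candidates.
            prev = max(best, key=lambda r: best[r] + log(trans_p[r][s]))
            new_best[s] = best[prev] + log(trans_p[prev][s]) + log(emit_p[s][observation[t]])
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


free = constrained_viterbi("AAAAAA", STATES, START_P, TRANS_P, EMIT_P, fixed={})
known = constrained_viterbi("AAAAAA", STATES, START_P, TRANS_P, EMIT_P, fixed={2: "C", 3: "C"})
print("".join(free))   # -> NNNNNN
print("".join(known))  # -> NNCCNN
```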
  26. Combining models
      Combine the predictions of several models to form more accurate predictions.
      Obvious approaches:
      ● Union
        ● Many false positives
        ● Conflicts
      ● Intersection/majority voting
        ● Lowest common denominator
        ● Throws away the most interesting predictions
      [Figure: overlapping sets "A Genes" and "B Genes", produced by "Gene predictor A" and "Gene predictor B"]
  27. Combining models with constraints
      Combine the predictions of several models to form more accurate predictions.
      Obvious approaches:
      ● Union
        ● Many false positives
        ● Conflicts
      ● Intersection
        ● Lowest common denominator
        ● Throws away the most interesting predictions
      [Figure: overlapping sets "A Genes" and "B Genes", produced by "Gene predictor A" and "Gene predictor B"]
      We need to know the strengths of individual models to define better constraints...
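The obvious approaches can be sketched as a vote count over predicted genes, where union and intersection are the two extreme thresholds and majority voting sits in between. The function name `combine` and the representation of a gene as a `(start, end)` pair are assumptions made for this illustration.

```python
from collections import Counter


def combine(predictions, min_votes):
    """Keep the genes predicted by at least `min_votes` of the predictors."""
    votes = Counter(gene for genes in predictions for gene in set(genes))
    return {gene for gene, count in votes.items() if count >= min_votes}


# Hypothetical predictions, each gene given as a (start, end) pair.
predictor_a = {(3, 18), (40, 70)}
predictor_b = {(3, 18), (90, 120)}

union = combine([predictor_a, predictor_b], min_votes=1)
intersection = combine([predictor_a, predictor_b], min_votes=2)
print(sorted(union))         # -> [(3, 18), (40, 70), (90, 120)]
print(sorted(intersection))  # -> [(3, 18)]
```

The threshold treats every predictor as equally reliable, which is exactly the limitation noted above: without knowing the strengths of the individual models, neither extreme is satisfactory.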
  28. Combining models with constraints
      Issues to consider:
      ● Ability to combine both black-box and white-box models
      ● The nature of the combination constraints
        ● Uncertainty
        ● Lack of knowledge: what are the right constraints?
        ● Induction
      Some possible ways to represent combination constraints being considered:
      ● Hard constraints
        ● Inability to handle uncertainty
      ● Factorial Hidden Markov Models
        ● Probability distribution defines how much to listen to each model
        ● Throws away information: What model contributed what?
        ● Expensive to train
      ● Bayesian networks
        ● Model probabilistic constraints
        ● We can model sequences with Dynamic Bayesian Networks
      ● Soft constraints
        ● Possibly a good complement to probabilistic inference
      ● Co-training
        ● Use the models to train each other
  29. Outlook
      ● Formulating biosequence problems in terms of constraints
      ● Integrating these constraints in probabilistic models
      ● Tradeoffs between constraint representations
        ● Finding the right balance...
      ● Combining models with constraints
      ● Inference and parameter estimation in mixed models
