ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Constraints for Biological Sequence Analysis

Transcript

  • 1. Logic-Statistic Models with Constraints for Biological Sequence Analysis. Christian Theil Have <cth@ruc.dk>, Programming, Logic and Intelligent Systems (plis.ruc.dk), CBIT, Roskilde University, Denmark
  • 2. Motivation and outline
    ● Short motivation and introduction to biological sequence analysis
    ● Different ways of integrating constraints with probabilistic models
    ● Combining models with constraints
  • 3. Biological sequence analysis
    The basic problems:
    ● Alignment of biological sequences
    ● Phylogeny
    ● Gene prediction
    ● RNA secondary structure prediction
    ● Protein structure prediction
    ● Protein function prediction
  • 4. Biological sequence analysis
    The basic problems:
    ● Alignment of biological sequences
    ● Phylogeny
    ➔ Gene prediction
    ● RNA secondary structure prediction
    ● Protein structure prediction
    ● Protein function prediction
    We focus on gene prediction for now...
  • 5. Biological sequence analysis
    Gene prediction: Predict genes and non-genes in a DNA sequence
    ● DNA is composed of nucleotides: A, T, G, C
    AATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGAAATATAGGCATAGCGCACAGACAGATA
  • 6. Biological sequence analysis
    Gene prediction: Predict genes and non-genes in a DNA sequence
    ● DNA is composed of nucleotides: A, T, G, C
    ● Genes are sequences of triplets of nucleotides, called codons
    AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
    GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
    ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
    GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 7. Biological sequence analysis
    Gene prediction: Predict genes and non-genes in a DNA sequence
    ● DNA is composed of nucleotides: A, T, G, C
    ● Genes are sequences of triplets of nucleotides, called codons
    ● Genes can occur in both strands in three different frames
    AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
    GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
    ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
    GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 8. Biological sequence analysis
    Gene prediction: Predict genes and non-genes in a DNA sequence
    ● DNA is composed of nucleotides: A, T, G, C
    ● Genes are sequences of triplets of nucleotides, called codons
    ● Genes can occur in both strands in three different frames
    ● Specific start codons signal a possible beginning of a gene
    AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
    GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
    ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
    GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 9. Biological sequence analysis
    Gene prediction: Predict genes and non-genes in a DNA sequence
    ● DNA is composed of nucleotides: A, T, G, C
    ● Genes are sequences of triplets of nucleotides, called codons
    ● Genes can occur in both strands in three different frames
    ● Specific start codons signal a possible beginning of a gene
    ● Specific stop codons definitively signal the end of a gene
    AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
    GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
    ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
    GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 10. Biological sequence analysis
    Gene prediction: Predict genes and non-genes in a DNA sequence
    ● DNA is composed of nucleotides: A, T, G, C
    ● Genes are sequences of triplets of nucleotides, called codons
    ● Genes can occur in both strands in three different frames
    ● Specific start codons signal a possible beginning of a gene
    ● Specific stop codons definitively signal the end of a gene
    AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
    GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
    ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
    GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
    ● There are three possible genes in this sample in this frame (on this strand).
  • 14. Biological sequence analysis
    Gene prediction: Predict genes and non-genes in a DNA sequence
    ● DNA is composed of nucleotides: A, T, G, C
    ● Genes are sequences of triplets of nucleotides, called codons
    ● Genes can occur in both strands in three different frames
    ● Specific start codons signal a possible beginning of a gene
    ● Specific stop codons definitively signal the end of a gene
    AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
    GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
    ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
    GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
    ● There are three possible genes in this sample in this frame (on this strand).
    ● In general, DNA sequences have an exponential number of different gene compositions.
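The start/stop rule on these slides can be sketched as a scan over one frame of one strand. This is only an illustration, not the author's method: the function name `find_orfs` is hypothetical, and the set of start codons is an assumption (only `ATG` by default). The slide highlighting that marked the three candidate genes is not preserved in the transcript, so the sketch does not try to reproduce that count.

```python
from typing import Iterable, List, Tuple

STOP_CODONS = ("TAA", "TAG", "TGA")  # the standard stop codons

# The sample from the slides, grouped into codons in the displayed frame.
SAMPLE = "".join("""
AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
""".split())


def find_orfs(dna: str, frame: int = 0,
              starts: Iterable[str] = ("ATG",),   # assumption: ATG only
              stops: Iterable[str] = STOP_CODONS) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of candidate genes in one frame of one strand.

    Each candidate begins at a start codon and ends just after the first
    in-frame stop codon that follows it; `end` is exclusive.
    """
    starts, stops = set(starts), set(stops)
    open_starts: List[int] = []   # start codons seen since the last stop
    orfs: List[Tuple[int, int]] = []
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if codon in stops:
            # A stop codon closes every candidate opened since the last stop.
            orfs.extend((s, pos + 3) for s in open_starts)
            open_starts = []
        elif codon in starts:
            open_starts.append(pos)
    return orfs


print(find_orfs(SAMPLE))
```

Passing a larger `starts` set yields several candidates that share one stop codon, which is one way overlapping gene hypotheses arise. Scanning the other two frames, and the reverse complement for the other strand, follows the same pattern.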
  • 15. Biological sequence analysis, tools of the trade
    ● Statistical models (in order of expressive power)
      ● Hidden Markov Models
      ● Probabilistic Context Free Grammars
      ● Probabilistic Context Sensitive Grammars
      ● Stochastic Definite Clause Grammars
      ● All these can be modeled in PRISM
        ● Probabilistic extension of Prolog
    ● Problems:
      ● Computational complexity of inference
      ● Extremely large sequences
      ● Use of more expressive models infeasible
      ● Essential: Enforce the right independence assumptions
        ● Limit the number of conditional probabilities
  • 16. Gene-finding with Hidden Markov Models
    Hidden Markov Models (HMMs) are commonly used for gene prediction
    A Hidden Markov Model is a quadruple <S, A, T, E>
      S is a set of states
      A is a set of emission symbols
      T is a set of transition probabilities
      E is a set of emission probabilities
    An observation is a sequence of emissions
    Transition and emission probabilities can be derived from sample observations through parameter estimation
    Decoding finds the most probable sequence of states corresponding to an observation
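As a rough illustration of the quadruple and of parameter estimation, the sketch below stores <S, A, T, E> in a small data class and estimates T and E by relative-frequency counting. It assumes the training samples come with known state paths, which the slide does not state; with unlabeled observations an iterative method such as Baum-Welch would be needed instead. The names `HMM`, `estimate` and the training data are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class HMM:
    states: List[str]                      # S
    symbols: List[str]                     # A
    trans: Dict[str, Dict[str, float]]     # T[s][s2] = P(s2 | s)
    emit: Dict[str, Dict[str, float]]      # E[s][a]  = P(a | s)


def normalize(counts: Counter) -> Dict[str, float]:
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def estimate(samples: Sequence[Tuple[str, str]]) -> HMM:
    """Relative-frequency estimation from (state path, observation) pairs."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for path, obs in samples:
        for state, symbol in zip(path, obs):
            emit_counts[state][symbol] += 1
        for prev, nxt in zip(path, path[1:]):
            trans_counts[prev][nxt] += 1
    states = sorted(emit_counts)
    symbols = sorted({a for c in emit_counts.values() for a in c})
    return HMM(states, symbols,
               {s: normalize(c) for s, c in trans_counts.items()},
               {s: normalize(c) for s, c in emit_counts.items()})


# Hypothetical labeled data: 'N' = non-coding, 'C' = coding.
model = estimate([("NNCCCN", "ATGGCA"), ("NCCCNN", "TGCGAT")])
print(model.trans)
print(model.emit)
```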
  • 17. Gene-finding with Hidden Markov Models
    Example: Toy HMM for gene-finding.
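  • 18. Decoding: The Viterbi algorithm
    Finding the most probable path for a given sequence: argmax P(state sequence | observation)
    Method:
      Incrementally keep track of the most probable path to a given state
      Dynamic programming (tabling in Prolog/PRISM)
    [Figure: trellis with states on one axis and time steps (observation) on the other]
    Time complexity: O(|states| * |observation|) when each state has a bounded number of incoming transitions; O(|states|^2 * |observation|) for a fully connected model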
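A minimal sketch of Viterbi decoding with an explicit table, assuming a non-empty observation and an initial state distribution `init` (the quadruple on the earlier slide does not list one). The toy parameters are invented for illustration and are not the toy HMM from the slides. The function returns the joint probability of the path and the observation, which has the same argmax as P(state sequence | observation).

```python
from typing import Dict, List, Tuple

Table = Dict[str, Dict[str, float]]


def viterbi(obs: str, states: List[str], init: Dict[str, float],
            trans: Table, emit: Table) -> Tuple[List[str], float]:
    """Return the most probable state path for `obs` and its joint probability."""
    # best[t][s] = probability of the best path that ends in state s at step t
    best = [{s: init.get(s, 0.0) * emit[s].get(obs[0], 0.0) for s in states}]
    back: List[Dict[str, str]] = []
    for symbol in obs[1:]:
        column, pointers = {}, {}
        for s in states:
            # Best predecessor of s; this inner scan over all states makes the
            # dense case quadratic in the number of states.
            prev = max(states, key=lambda r: best[-1][r] * trans[r].get(s, 0.0))
            column[s] = best[-1][prev] * trans[prev].get(s, 0.0) * emit[s].get(symbol, 0.0)
            pointers[s] = prev
        best.append(column)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy model: 'N' = non-coding, 'C' = coding.
STATES = ["N", "C"]
INIT = {"N": 1.0}
TRANS = {"N": {"N": 0.7, "C": 0.3}, "C": {"C": 0.7, "N": 0.3}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

path, prob = viterbi("AAGGG", STATES, INIT, TRANS, EMIT)
print(path, prob)
```

For long sequences a practical implementation would work with log-probabilities to avoid numerical underflow.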
  • 19. Predicting is decoding
    Decoding of an HMM may be considered as an optimization problem:
    ● We have a set of variables T0 .. Tn, one for each time step
    ● A set of constraints, C, on these variables: a state S is in the domain of Ti iff there is a state in the domain of Ti-1 from which there is a transition to S, and S can emit the symbol observed at step i
    ● Goal: Optimize P(state sequence | observation), subject to C
    [Figure: variables T0, T1, T2, T3, ..., Tn over the time steps (observation), each ranging over the states]
    ➔ Accomplished with the Viterbi algorithm in O(|states| * |observation|) using DP
  • 20. Constraints as model structure
    ● The structure of the HMM consists of
      ● states
      ● allowed transitions between these states
      ● possible emissions from these states
    ● The structure of the HMM defines a regular language
    ● Can model (only) regular languages, but...
    ● Not all regular languages can be modeled equally compactly
    ● Some regular languages require an exponential number of states
    Consider a fully connected automaton with only N states and the constraint all-different: no state is visited more than once
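To get a feel for the blow-up, the sketch below enumerates the states of one straightforward encoding of all-different into model structure: each expanded state remembers the current base state together with the set of base states already visited. This particular encoding is an assumption made for illustration and is not claimed to be the smallest possible; it has N * 2^(N-1) states.

```python
from typing import FrozenSet, Set, Tuple

State = Tuple[int, FrozenSet[int]]  # (current base state, visited base states)


def all_different_states(n: int) -> Set[State]:
    """Reachable (current, visited) pairs when no base state may be revisited."""
    frontier = [(s, frozenset([s])) for s in range(n)]
    seen: Set[State] = set(frontier)
    while frontier:
        current, visited = frontier.pop()
        for nxt in range(n):
            if nxt not in visited:            # the all-different constraint
                state = (nxt, visited | {nxt})
                if state not in seen:
                    seen.add(state)
                    frontier.append(state)
    return seen


for n in range(1, 7):
    print(n, len(all_different_states(n)))
```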
  • 21. Side-constraints
    Side-constraints:
    ● Constraints which are not embedded in the model.
    ● Delimit the allowed derivations.
    [Figure: statistical model with side-constraints]
  • 22. Side-constraints
    Side-constraints:
    ● Constraints which are not embedded in the model.
    ● Delimit the allowed derivations.
    [Figure: statistical model with side-constraints]
    Advantages
    ✔ Convenient method of expression
    ✔ Can express non-regular languages
    ✔ Does not affect the number of states
  • 23. Side-constraints
    Side-constraints:
    ● Constraints which are not embedded in the model.
    ● Delimit the allowed derivations.
    [Figure: statistical model with side-constraints]
    Advantages
    ✔ Convenient method of expression
    ✔ Can express non-regular languages
    ✔ Does not affect the number of states
    Problems
    ✗ Models with constraints can fail
    ✗ Probability mass disappears
    ✗ Complicates model inference
    ✗ ERF & Baum-Welch derive wrong distributions
    ✗ Decoding must adhere to constraints
    ✗ Constraint solving techniques needed
    ✗ NP-complete in the general case
  • 24. Side-constraints
    Side-constraints:
    ● Constraints which are not embedded in the model.
    ● Delimit the allowed derivations.
    [Figure: statistical model with side-constraints]
    Advantages
    ✔ Convenient method of expression
    ✔ Can express non-regular languages
    ✔ Does not affect the number of states
    Problems
    ✗ Models with constraints can fail
    ✗ Probability mass disappears
    ✗ Complicates model inference
    ✗ ERF & Baum-Welch derive wrong distributions
    ✗ Decoding must adhere to constraints
    ✗ Constraint solving techniques needed
    ✗ NP-complete in the general case
    Possible solutions
    Parameter learning:
    ● Training with fgEM / failure-adjusted maximization
      ● Requires failure estimates
    ● Apply soft constraints, which do not fail
    Inference:
    ● Incremental constraint-solving
    ● Local constraints
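The "probability mass disappears" problem can be made concrete on a tiny example. The sketch below enumerates all state paths of a hypothetical three-state Markov chain (emissions are left out for brevity) and sums the probability of the paths that satisfy an all-different side-constraint. The remainder is the failure probability, the kind of quantity a failure-adjusted learning scheme has to account for. This is a brute-force illustration only, not fgEM.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Path = Tuple[str, ...]


def path_probability(path: Path, init: Dict[str, float],
                     trans: Dict[str, Dict[str, float]]) -> float:
    """Probability of a state path under a first-order Markov chain."""
    p = init[path[0]]
    for prev, nxt in zip(path, path[1:]):
        p *= trans[prev][nxt]
    return p


def surviving_mass(states: Sequence[str], init: Dict[str, float],
                   trans: Dict[str, Dict[str, float]], length: int,
                   constraint: Callable[[Path], bool]) -> float:
    """Total probability of the paths of the given length that satisfy the constraint."""
    return sum(path_probability(path, init, trans)
               for path in product(states, repeat=length) if constraint(path))


def all_different(path: Path) -> bool:
    """Side-constraint: no state is visited more than once."""
    return len(set(path)) == len(path)


# Hypothetical toy chain: three states, uniform start and transitions.
STATES = ["a", "b", "c"]
INIT = {s: 1 / 3 for s in STATES}
TRANS = {s: {r: 1 / 3 for r in STATES} for s in STATES}

mass = surviving_mass(STATES, INIT, TRANS, 3, all_different)
print(f"surviving mass = {mass:.4f}, failure probability = {1 - mass:.4f}")
```

Dividing each surviving path's probability by `mass` gives a proper distribution over the allowed derivations. Parameters estimated as if no mass were lost would not correspond to that distribution.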
  • 25. Example: Fixing known genes
    [Figure: a DNA sequence with a known gene marked, labeled with the state sequence S C C C C C C C E N N N N]
    ● Difficult/expensive to model with model structure
      ● HMM needs to do position counting => many states required!
    ● Easy to model with side-constraints
      ● Local constraint: affects only a limited-size sequential set of variables
      ● Decoding possible in linear time complexity
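One way to picture a local side-constraint is to restrict, per position, which states the decoder may use. The sketch below is an assumption about how this could look, not the author's implementation: `allowed[t]` is a hypothetical per-step domain, the two-state model is invented, and the function returns `None` when the constraints rule out every path. The constraint check costs constant time per table cell, so decoding stays linear in the length of the observation.

```python
from typing import Dict, List, Optional, Set

Table = Dict[str, Dict[str, float]]


def constrained_viterbi(obs: str, states: List[str], init: Dict[str, float],
                        trans: Table, emit: Table,
                        allowed: List[Optional[Set[str]]]) -> Optional[List[str]]:
    """Most probable state path in which step t uses a state from allowed[t].

    allowed[t] is None where the step is unconstrained. Returns None when no
    path satisfies the constraints (the constrained model fails).
    """
    def score(t: int, s: str, incoming: float) -> float:
        if allowed[t] is not None and s not in allowed[t]:
            return 0.0                      # pruned by the local constraint
        return incoming * emit[s].get(obs[t], 0.0)

    best = [{s: score(0, s, init.get(s, 0.0)) for s in states}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(obs)):
        # Best predecessor for every state at step t.
        pointers = {s: max(states, key=lambda r: best[-1][r] * trans[r].get(s, 0.0))
                    for s in states}
        best.append({s: score(t, s, best[-1][pointers[s]] * trans[pointers[s]].get(s, 0.0))
                     for s in states})
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    if best[-1][last] == 0.0:
        return None
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: 'N' = non-coding, 'C' = coding.
STATES = ["N", "C"]
INIT = {"N": 1.0}
TRANS = {"N": {"N": 0.7, "C": 0.3}, "C": {"C": 0.7, "N": 0.3}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

known = [None, None, {"C"}, {"C"}, None, None]   # steps 2-3 are a known gene
print(constrained_viterbi("AAAAAA", STATES, INIT, TRANS, EMIT, known))
```

Without the constraint this toy model labels every position as non-coding; fixing the known gene forces the coding state at steps 2 and 3 without adding any states to the model.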
  • 26. Combining models
    Combine the predictions of several models to form more accurate predictions.
    Obvious approaches:
    ● Union
      ● Many false positives
      ● Conflicts
    ● Intersection/majority voting
      ● Lowest common denominator
      ● Throws away the most interesting predictions
    [Figure: gene predictor A and gene predictor B with overlapping sets "A genes" and "B genes"]
  • 27. Combining models with constraints
    Combine the predictions of several models to form more accurate predictions.
    Obvious approaches:
    ● Union
      ● Many false positives
      ● Conflicts
    ● Intersection
      ● Lowest common denominator
      ● Throws away the most interesting predictions
    [Figure: gene predictor A and gene predictor B with overlapping sets "A genes" and "B genes"]
    We need to know the strengths of individual models to define better constraints...
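The obvious combination schemes can be sketched with plain sets. Representing a predicted gene as a `(start, end)` interval and the two prediction sets below are assumptions made for illustration. The `conflicts` helper shows why the union is awkward: it can contain overlapping predictions that disagree about where a gene starts.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Gene = Tuple[int, int]  # (start, end) offsets, end exclusive


def union(predictions: Sequence[Set[Gene]]) -> Set[Gene]:
    """Every gene predicted by at least one predictor."""
    return set().union(*predictions)


def majority(predictions: Sequence[Set[Gene]]) -> Set[Gene]:
    """Genes predicted by more than half of the predictors."""
    votes = Counter(g for p in predictions for g in p)
    return {g for g, n in votes.items() if 2 * n > len(predictions)}


def conflicts(genes: Set[Gene]) -> List[Tuple[Gene, Gene]]:
    """Pairs of distinct predictions whose intervals overlap."""
    return [(a, b) for a, b in combinations(sorted(genes), 2)
            if a[0] < b[1] and b[0] < a[1]]


# Hypothetical output of two gene predictors.
a = {(10, 40), (60, 90), (120, 150)}
b = {(10, 40), (70, 90), (200, 230)}

print(sorted(union([a, b])))
print(sorted(majority([a, b])))      # with two predictors: the intersection
print(conflicts(union([a, b])))
```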
  • 28. Combining models with constraints
    Issues to consider:
    ● Ability to combine both blackbox and whitebox models
    ● The nature of the combination constraints
      ● Uncertainty
      ● Lack of knowledge: what are the right constraints?
      ● Induction
    Some possible ways to represent combination constraints being considered:
    ● Hard constraints
      ● Inability to handle uncertainty
    ● Factorial Hidden Markov Models
      ● Probability distribution defines how much to listen to each model
      ● Throws away information: What model contributed what?
      ● Expensive to train
    ● Bayesian networks
      ● Model probabilistic constraints
      ● We can model sequences with Dynamic Bayesian Networks
    ● Soft constraints
      ● Possibly good complement to probabilistic inference
    ● Co-training
      ● Use the models to train each other
  • 29. Outlook
    ● Formulating biosequence problems in terms of constraints
    ● Integrating these constraints in probabilistic models
    ● Tradeoffs between constraint representations
      ● Finding the right balance...
    ● Combining models with constraints
    ● Inference and parameter estimation in mixed models