ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Constraints for Biological Sequence Analysis

Logic-Statistic Models with Constraints
for Biological Sequence Analysis
Christian Theil Have, <cth@ruc.dk>

Programming, Logic and Intelligent Systems  plis.ruc.dk  CBIT  Roskilde University  Denmark

Motivation and outline
● Short motivation and introduction to biological sequence analysis
● Different ways of integrating constraints with probabilistic models

● Combining models with constraints

Biological sequence analysis
The basic problems:
● Alignment of biological sequences
● Phylogeny
● Gene prediction
● RNA secondary structure prediction
● Protein structure prediction
● Protein function prediction

We focus on gene prediction for now...

Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C

AATATAGGCATAGCGCACAGACAGATAAAAATTACA
GAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGT
GCGGGCTGAAATATAGGCATAGCGCACAGACAGATA


● Genes are sequences of triplets of nucleotides, called codons

AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA
GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC
ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT
GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA



● Genes can occur in both strands in three different frames





● Specific start codons signal the possible beginning of a gene






● Specific stop codons definitively signal the end of a gene







● There are three possible genes in this sample in this frame (on this strand).






● In general, a DNA sequence admits an exponential number of different gene compositions.
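The frame and strand structure described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch stops at splitting a sequence into codons, because the slides do not list which start and stop codons they use.

```python
# Hypothetical helper names; only the nucleotide alphabet and the
# frame/strand structure are taken from the slides.
COMPLEMENT = str.maketrans("ATGC", "TACG")

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own direction."""
    return dna.translate(COMPLEMENT)[::-1]

def codons(dna: str, frame: int) -> list[str]:
    """Split dna into complete triplets, starting at offset frame (0, 1 or 2)."""
    return [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]

def six_frames(dna: str) -> dict[tuple[str, int], list[str]]:
    """Codon lists for all three frames on both strands."""
    strands = {"+": dna, "-": reverse_complement(dna)}
    return {(name, f): codons(seq, f) for name, seq in strands.items() for f in range(3)}

# First 18 nucleotides of the sample sequence from the slides.
frames = six_frames("AATATAGGCATAGCGCAC")
print(frames[("+", 0)])
```

A gene finder has to consider all six codon lists, which is one reason the number of candidate gene compositions grows so quickly.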

Biological sequence analysis,
tools of the trade
● Statistical models (in order of expressive power)
  ● Hidden Markov Models
  ● Probabilistic Context Free Grammars
  ● Probabilistic Context Sensitive Grammars
  ● Stochastic Definite Clause Grammars
● All these can be modeled in PRISM
  ● Probabilistic extension of Prolog
● Problems:
  ● Computational complexity of inference
  ● Extremely large sequences
  ● Use of more expressive models infeasible
● Essential: Enforce the right independence assumptions
  ● Limit the number of conditional probabilities

Gene-finding with Hidden Markov Models
Hidden Markov Models (HMMs) are commonly used for gene prediction
A Hidden Markov Model is a quadruple <S, A, T, E>
S is a set of states
A is a set of emission symbols
T is a set of transition probabilities
E is a set of emission probabilities
An observation is a sequence of emissions
Transition and emission probabilities can be derived from sample observations through parameter estimation
Decoding finds the most probable sequence of states corresponding to an observation

Gene-finding with Hidden Markov Models
Example: Toy HMM for gene-finding.

Decoding: The Viterbi algorithm
Finding the most probable path for a given sequence:

argmax P(state sequence | observation)

Method:
Incrementally keep track of the most probable path to a given state
Dynamic programming (tabling in Prolog/PRISM)

[Figure: dynamic programming table, states by time steps (observation)]

Time complexity: O(|states| * |observation|) when each state has a bounded number of predecessors, O(|states|^2 * |observation|) for a fully connected model
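The method above can be sketched as follows. This is a minimal illustration rather than the PRISM implementation referred to on the slide: it uses an explicit table instead of tabling, works with log-probabilities, and all model parameters (two states N and C, and their probabilities) are invented for the example.

```python
import math

def log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, states, start, trans, emit):
    """Return (log-probability, path) of the most probable state sequence."""
    # score[t][s]: log-probability of the best path that ends in state s at step t
    score = [{s: log(start[s]) + log(emit[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # The max over all predecessors is the extra |states| factor per table cell.
            prev = max(states, key=lambda r: score[t - 1][r] + log(trans[r][s]))
            score[t][s] = score[t - 1][prev] + log(trans[prev][s]) + log(emit[s][obs[t]])
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path

# Invented toy parameters: N = non-coding, C = coding.
states = ["N", "C"]
start = {"N": 0.5, "C": 0.5}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
logp, path = viterbi("AATGGCGC", states, start, trans, emit)
print("".join(path))
```

With these invented parameters the decoder labels the A/T-rich prefix as N and the G/C-rich suffix as C.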

Predicting is decoding
Decoding of an HMM may be considered as an optimization problem:
● We have a set of variables T0 .. Tn, one for each time step
● A set of constraints, C, on these variables:
  A state S is in the domain of Ti iff there is a state in the domain of Ti-1 from which there is a transition to S, and S has an emission corresponding to the emission in the observation at step i
● Goal: Optimize P(state sequence | observation), subject to C

[Figure: trellis over the variables T0, T1, T2, T3, ..., Tn, states by time steps (observation)]
➔ Accomplished with the Viterbi algorithm in O(|states| * |observation|) using DP
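The constraint C can be read as a forward filtering of the domains of T0 .. Tn. The sketch below is one possible reading, not code from the presentation: the function name and the toy model are hypothetical, and the domain of T0 is assumed to contain every state that can emit the first symbol.

```python
def domains(obs, states, trans, emit):
    """Domain of each variable T_i under constraint C (allowed states per step)."""
    # Assumption: T0 may take any state that can emit the first symbol.
    doms = [{s for s in states if emit[s].get(obs[0], 0) > 0}]
    for symbol in obs[1:]:
        prev = doms[-1]
        doms.append({
            s for s in states
            if emit[s].get(symbol, 0) > 0                 # s can emit the observed symbol
            and any(trans[r].get(s, 0) > 0 for r in prev)  # s is reachable from the previous domain
        })
    return doms

# Hypothetical two-state model with structural zeros: there is no way back from b to a.
states = ["a", "b"]
trans = {"a": {"a": 0.5, "b": 0.5}, "b": {"b": 1.0}}
emit = {"a": {"x": 1.0}, "b": {"y": 1.0}}

print(domains("xxyy", states, trans, emit))
print(domains("xyx", states, trans, emit))  # the last domain is empty: no derivation exists
```

An empty domain means that the observation has no derivation in the model, which is the situation that later slides describe as failure.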

Constraints as model structure
● The structure of the HMM consists of
  ● states
  ● allowed transitions between these states
  ● possible emissions from these states
● The structure of the HMM defines a regular language
● Can model (only) regular languages, but...
  ● Not all regular languages can be modeled equally compactly
  ● Some regular languages require an exponential number of states

Consider a fully-connected automaton with only N states:
All-different: No state visited more than once
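To get a feeling for the blow-up, one can count the configurations of the straightforward construction in which every state of the new automaton remembers the current state together with the set of states visited so far. The sketch below does this by exhaustive search; it illustrates the size of that particular construction and is not a proof of a lower bound. The function name is hypothetical.

```python
def all_different_states(n: int) -> int:
    """Count reachable (current state, visited set) pairs for n original states."""
    start = [(s, frozenset([s])) for s in range(n)]
    seen = set(start)
    frontier = list(start)
    while frontier:
        current, visited = frontier.pop()
        for nxt in range(n):          # fully connected: every state is a successor
            if nxt not in visited:    # all-different: never revisit a state
                config = (nxt, visited | {nxt})
                if config not in seen:
                    seen.add(config)
                    frontier.append(config)
    return len(seen)

for n in range(1, 6):
    # The count matches the closed form n * 2^(n-1).
    print(n, all_different_states(n), n * 2 ** (n - 1))
```

Encoding the constraint in the model structure in this way therefore multiplies the N original states by a factor that is exponential in N.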

Side-constraints
Side-constraints:
● Constraints which are not embedded in the model.
● Delimit the allowed derivations.

[Figure: the statistical model together with its side-constraints]

Side-constraints
Advantages
✔ Convenient method of expression
✔ Can express non-regular languages
✔ Does not affect the number of states

Problems
✗ Models with constraints can fail
✗ Probability mass disappears
✗ Complicates model inference
✗ ERF & Baum-Welch derive wrong distributions
✗ Decoding must adhere to constraints
✗ Constraint solving techniques needed
✗ NP-complete in the general case

Possible solutions
Parameter learning:
● Training with fgEM / failure-adjusted maximization
● Requires failure estimates
● Apply soft constraints, which do not fail
Inference:
● Incremental constraint solving
● Local constraints
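The disappearing probability mass can be made concrete on a very small example. The sketch below enumerates every state sequence of a Markov chain by brute force, keeps those that satisfy a side-constraint, and renormalizes. It is only an illustration of why failure has to be accounted for, not an implementation of fgEM or failure-adjusted maximization; the chain, its uniform parameters, and the function names are invented.

```python
from itertools import product

def path_probability(path, start, trans):
    """Probability of a state sequence in a first-order Markov chain."""
    p = start[path[0]]
    for a, b in zip(path, path[1:]):
        p *= trans[a][b]
    return p

def constrained_distribution(states, start, trans, length, constraint):
    """Return (surviving mass, renormalized distribution over accepted paths)."""
    accepted = {
        path: path_probability(path, start, trans)
        for path in product(states, repeat=length)
        if constraint(path)
    }
    mass = sum(accepted.values())
    if mass == 0:
        raise ValueError("the constraint rejects every derivation")
    return mass, {path: p / mass for path, p in accepted.items()}

def all_different(path):
    """Side-constraint: no state is visited more than once."""
    return len(set(path)) == len(path)

# Invented uniform chain over three states.
states = ["a", "b", "c"]
start = {s: 1 / 3 for s in states}
trans = {r: {s: 1 / 3 for s in states} for r in states}

mass, dist = constrained_distribution(states, start, trans, 3, all_different)
print(mass)      # probability that a derivation does not fail
print(1 - mass)  # probability mass lost to failure
```

Estimating parameters from the accepted derivations alone, without correcting for the lost mass, is what leads to the wrong distributions mentioned under Problems.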

Example: Fixing known genes
[Figure: a known gene marked on the DNA, with the state sequences S C C C C C C C E and N N N N]

● Difficult/expensive to model with model structure
  ● HMM needs to do position counting => many states required!
● Easy to model with side-constraints
  ● Local constraint: Affects only a limited-size sequential set of variables
  ● Decoding possible in linear time complexity
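A local constraint of this kind can be sketched as a restriction on the set of states allowed at given positions during Viterbi decoding. The code below is a hedged illustration, not the author's implementation: the parameter name fixed, the two-state model (N = non-coding, C = coding), and its probabilities are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def log(p):
    return math.log(p) if p > 0 else NEG_INF

def constrained_viterbi(obs, states, start, trans, emit, fixed=None):
    """Viterbi decoding; fixed maps a position to the set of states allowed there."""
    fixed = fixed or {}

    def local(t, s):
        # Emission score, or -infinity if the side-constraint forbids s at position t.
        allowed = t not in fixed or s in fixed[t]
        return log(emit[s][obs[t]]) if allowed else NEG_INF

    score = [{s: log(start[s]) + local(0, s) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[t - 1][r] + log(trans[r][s]))
            score[t][s] = score[t - 1][prev] + log(trans[prev][s]) + local(t, s)
            back[t][s] = prev
    last = max(states, key=lambda s: score[-1][s])
    if score[-1][last] == NEG_INF:
        return None  # the constraints rule out every path
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Invented toy parameters.
states = ["N", "C"]
start = {"N": 0.5, "C": 0.5}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
known_gene = {0: {"C"}, 1: {"C"}, 2: {"C"}}  # positions 0..2 are known to be coding
print("".join(constrained_viterbi("AATGGCGC", states, start, trans, emit)))
print("".join(constrained_viterbi("AATGGCGC", states, start, trans, emit, known_gene)))
```

Each position is checked against the constraint in constant time, so decoding stays linear in the length of the observation and no extra states are needed for position counting.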

Combining models
Combine the predictions of several models to form more accurate predictions.

Obvious approaches:
● Union
  ● Many false positives
  ● Conflicts
● Intersection/majority voting
  ● Lowest common denominator
  ● Throws away the most interesting predictions

[Figure: overlap between the genes predicted by gene predictor A and gene predictor B]
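The obvious approaches can be sketched with sets of predicted genes, each gene represented as a (start, end) interval. All coordinates and function names below are hypothetical and serve only to show how union produces conflicts while intersection and majority voting discard predictions.

```python
from collections import Counter

def union(predictions):
    """Every gene predicted by at least one model."""
    return set().union(*predictions)

def intersection(predictions):
    """Only genes predicted by all models."""
    return set.intersection(*map(set, predictions))

def majority(predictions):
    """Genes predicted by more than half of the models."""
    votes = Counter(gene for p in predictions for gene in set(p))
    return {gene for gene, n in votes.items() if n > len(predictions) / 2}

def conflicts(genes):
    """Pairs of distinct predicted genes whose intervals overlap."""
    ordered = sorted(genes)
    return [(a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:] if b[0] <= a[1]]

# Hypothetical predictions from three gene predictors.
a = {(10, 100), (150, 300), (400, 520)}
b = {(10, 100), (160, 300), (600, 700)}
c = {(10, 100), (150, 300)}

print(len(union([a, b, c])), conflicts(union([a, b, c])))
print(intersection([a, b, c]))
print(majority([a, b, c]))
```

In this invented example the union contains two overlapping versions of the same gene, while the intersection keeps only the one gene on which every predictor agrees exactly.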

Combining models with constraints
We need to know the strengths of individual models to define better constraints...

Combining models with constraints
Issues to consider:
● Ability to combine both black-box and white-box models
● The nature of the combination constraints
  ● Uncertainty
  ● Lack of knowledge: what are the right constraints?
  ● Induction

Some possible ways to represent combination constraints being considered:
● Hard constraints
  ● Inability to handle uncertainty
● Factorial Hidden Markov Models
  ● Probability distribution defines how much to listen to each model
  ● Throws away information: What model contributed what?
  ● Expensive to train
● Bayesian networks
  ● Model probabilistic constraints
  ● We can model sequences with Dynamic Bayesian Networks
● Soft constraints
  ● Possibly a good complement to probabilistic inference
● Co-training
  ● Use the models to train each other

Outlook
● Formulating biosequence problems in terms of constraints
● Integrating these constraints in probabilistic models
  ● Tradeoffs between constraint representations
  ● Finding the right balance...
● Combining models with constraints
● Inference and parameter estimation in mixed models
