INTRODUCTION Background A peek into my work Conclusions
Efficient
Probabilistic Logic Programming
for
Biological Sequence Analysis
Christian Theil Have
Research group PLIS: Programming, Logic and Intelligent Systems
Department of Communication, Business and Information Technologies
Roskilde University
INTRODUCTION Background A peek into my work Conclusions
OUTLINE
INTRODUCTION
Domain
Research questions
Background
Gene finding
Probabilistic Logic Programming
A peek into my work
Overview of papers
The trouble with tabling of structured data
Constrained HMMs
Applications: Genome models
Conclusions
INTRODUCTION Background A peek into my work Conclusions
INTRODUCTION
INTRODUCTION Background A peek into my work Conclusions
BIOLOGICAL SEQUENCE ANALYSIS
Subfield of bioinformatics
Analyze biological sequences (DNA, RNA, proteins)
to understand features, functions and evolutionary relationships
INTRODUCTION Background A peek into my work Conclusions
PROBABILISTIC LOGIC PROGRAMMING
Declarative programming paradigm
Ability to express common and complex models used in
biological sequence analysis
Concise expression of complex models
Separation between logic and control
Generic inference algorithms
Code as data: Transformations
INTRODUCTION Background A peek into my work Conclusions
MODELS FOR BIOLOGICAL SEQUENCE ANALYSIS
Reflect relationships between features of sequence data
Embody constraints – assumptions about data
Infer information from data
Reasoning about uncertainty → probabilities
INTRODUCTION Background A peek into my work Conclusions
THE LOST PROJECT
. . . seeks to improve ease of modeling, accuracy and reliability of
sequence analysis by using logic-statistical models that are yet largely
untested in bioinformatics . . .
Key focus areas:
The PRISM system
Prokaryotic gene finding
My Ph.D. project is part of the LoSt project and shares these
focus areas.
INTRODUCTION Background A peek into my work Conclusions
RESEARCH QUESTIONS
1. To what extent is it possible to use probabilistic logic
programming for biological sequence analysis?
2. How can constraints relevant to the domain of biological
sequence analysis be combined with probabilistic logic
programming?
3. What are the limitations with regard to efficiency and how can
these be dealt with?
I believe that these are the central questions that need to be
addressed in order to construct useful tools for
biological sequence analysis using probabilistic logic
programming.
INTRODUCTION Background A peek into my work Conclusions
RELATIONS BETWEEN RESEARCH QUESTIONS
1. To what extent is it possible to use
probabilistic logic programming for
biological sequence analysis?
2. How can constraints relevant to the
domain of biological sequence analysis
be combined with probabilistic logic
programming?
3. What are the limitations with regard to
efficiency and how can these be dealt
with?
INTRODUCTION Background A peek into my work Conclusions
APPROACH
To build and evaluate
Applications
Abstractions
Optimizations
for biological sequence analysis using probabilistic logic
programming.
INTRODUCTION Background A peek into my work Conclusions
APPROACH
Applications
Deal with relevant biological sequence analysis problems
Potentially contribute new knowledge to biology or
bioinformatics
Direct substantiation with regard to research question 1
Abstractions
Ease modeling
Language for incorporating constraints from the domain
A higher level of declarativity:
focus on the problem rather than implementation (model)
details
Optimizations
Deal with limitations of probabilistic logic programming
that may hinder its use in biological sequence analysis.
Efficient inference is a precondition for practical use.
INTRODUCTION Background A peek into my work Conclusions
BACKGROUND
Prokaryotic gene finding
Probabilistic logic programming
INTRODUCTION Background A peek into my work Conclusions
PROKARYOTIC GENE FINDING
Identify regions of DNA which encode proteins:
A (prokaryotic) gene is a consecutive stretch of DNA which
is transcribed as part of an RNA,
is translated to a complete protein,
has a length which is a multiple of three (codons),
starts with a "start" codon, and
ends with a "stop" codon.
INTRODUCTION Background A peek into my work Conclusions
GENES AND OPEN READING FRAMES
The identification of prokaryotic genes may be decomposed
into two distinct problems:
1. Identification of ORFs which contain protein coding genes.
2. Identification of the correct start codon within an ORF.
ORF      ::= start not-stop* stop
start    ::= TTG | CTG | ATT | ATC | ATA | ATG | GTG
stop     ::= TAA | TAG | TGA
not-stop ::= AAA | ... | TTT   // all codons except those in stop
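As an illustration (not taken from the thesis), this grammar can be written directly as a Prolog DCG over a list of codon atoms; the predicate names below are made up for the example:

% Illustrative sketch: ORF grammar as a DCG over codons such as [atg,aaa,ccc,taa].
orf --> start_codon, not_stops, stop_codon.

not_stops --> [].
not_stops --> [C], { \+ stop(C) }, not_stops.

start_codon --> [C], { member(C, [ttg,ctg,att,atc,ata,atg,gtg]) }.
stop_codon  --> [C], { stop(C) }.

stop(taa).  stop(tag).  stop(tga).

For example, phrase(orf, [atg,aaa,ccc,taa]) succeeds, while phrase(orf, [atg,aaa,taa,ccc]) fails because the last codon is not a stop codon.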
INTRODUCTION Background A peek into my work Conclusions
SIGNALS FOR PROKARYOTIC GENE FINDING
Open reading frames
Length
Nucleotide sequence composition
Conservation (sequence similarity in other organisms)
Local context
Promoters
Ribosomal binding site
Termination signal
[Figure: signal layout relative to the transcription start site (+1): promoter boxes at -35 and -10,
Shine-Dalgarno (SD) site at about +10, gene start at about +15-20, and a terminator downstream.]
INTRODUCTION Background A peek into my work Conclusions
READING FRAMES AND OVERLAPPING GENES
RNA can be transcribed from either strand
Genes may start in different “reading frames”
Genes can overlap
in the same and in different reading frames
on opposite strands
INTRODUCTION Background A peek into my work Conclusions
PROBABILISTIC LOGIC PROGRAMMING
Logic programming and Prolog
Probabilistic logic programming and PRISM
INTRODUCTION Background A peek into my work Conclusions
LOGIC PROGRAMMING AND PROLOG
A Prolog program consists of a finite sequence of rules,
B :- A1, ..., An.
These rules define implications, i.e.,
B if A1 and ... and An
INTRODUCTION Background A peek into my work Conclusions
TERMS, LITERALS AND VARIABLES
Literals can consist of (possibly) structured terms, which may
include variables.
number(0).                    % a fact; 0 is a constant
number(s(X)) :- number(X).    % s(X) is a structured term; X is a variable
INTRODUCTION Background A peek into my work Conclusions
RESOLUTION
Problems are stated as theorems (goals) to be proved, e.g.,
number(X)
To prove a consequent, we recursively need to prove the
antecedents by using rules where these appear as consequents,
number(0).
number(s(X)) :-
number(X).
Solutions
number(X) → X = 0
number(X) → X = s(0)
number(X) → X = s(s(0))
. . .
Derivation
number(X) →
number(s(X)) →
number(s(s(X))) →
number(s(s(0)))
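For example, posed as a query to a Prolog system, the goal enumerates these solutions one by one on backtracking:

?- number(X).
X = 0 ;
X = s(0) ;
X = s(s(0)) ;
...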
INTRODUCTION Background A peek into my work Conclusions
DERIVATION TREES AND EXPLANATION GRAPHS
Consider the following
program which adds natural
numbers:
add(0+0,0).
add(A+s(B),s(C)) :-
add(A+B,C).
add(s(A)+B,s(C)) :-
add(A+B,C).
And suppose we call the goal,
add(s(s(0))+s(s(0)),R)
We now have two alternative
applicable clauses,
resulting in either
add(s(0)+s(s(0)),s(R))
or
add(s(s(0))+s(0),s(R))
INTRODUCTION Background A peek into my work Conclusions
DERIVATION TREE
s(s(0))+s(s(0))
  s(0)+s(s(0))
    0+s(s(0))
      0+s(0)
        0+0
    s(0)+s(0)
      0+s(0)
        0+0
      s(0)+0
        0+0
  s(s(0))+s(0)
    s(0)+s(0)
      0+s(0)
        0+0
      s(0)+0
        0+0
    s(s(0))+0
      s(0)+0
        0+0
Exponential!
INTRODUCTION Background A peek into my work Conclusions
EXPLANATION GRAPH
Polynomial: O(n · m), but would be O(n + m) if arguments were ordered by size.
INTRODUCTION Background A peek into my work Conclusions
TABLING
Idea
The system maintains a table of calls and their answers:
when a new call is entered, check if it is stored in the table;
if so, reuse the previously found solutions.
Consequences:
Explanation graph representation.
Significant speed-up of program execution.
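For illustration, in tabled Prolog systems such as B-Prolog (on which PRISM is built), tabling is enabled with a table declaration; the fib/2 program below is a standard textbook example, not taken from the thesis:

% Illustrative example of a table declaration (not from the thesis).
:- table fib/2.        % memoize answers to fib/2 calls

fib(0, 1).
fib(1, 1).
fib(N, F) :-
    N > 1,
    N1 is N - 1, N2 is N - 2,
    fib(N1, F1), fib(N2, F2),
    F is F1 + F2.

With the declaration each distinct call is evaluated only once, so fib(N,F) runs in linear time; without it the naive program takes exponential time.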
INTRODUCTION Background A peek into my work Conclusions
PROBABILISTIC LOGIC PROGRAMMING
Probabilistic logic programming is a form of logic
programming which deals with uncertainty.
A logic program induces a set of possible worlds, i.e., the set of
derivable consequents and their alternative proofs.
Probabilistic logic programming extends logic programming by
assigning probabilities to each of these possible worlds and
extends logical inference into probabilistic inference, making it
possible to, e.g.,
derive the probability of a goal
infer the most probable derivation of a goal
infer the affinities (represented by probabilities) for
different possible worlds from data
INTRODUCTION Background A peek into my work Conclusions
PRISM
PRogramming In Statistical Modelling is a framework for
probabilistic logic programming
Developed by collaboration partners of the LoSt project:
Yoshitaka Kameya, Taisuke Sato, and Neng-Fa Zhou.
An extension of Prolog with random variables, called MSWs
Provides efficient generalized inference algorithms
(Viterbi, EM, etc) using tabling
PRISM program = probabilistic model
INTRODUCTION Background A peek into my work Conclusions
HIDDEN MARKOV MODEL EXAMPLE
Postcard
Greetings from wherever, where I am having
a great time. Here is what I have been doing:
The first two days, I stayed at the hotel
reading a good book. Then, on the third day I
decided to go shopping. The next three days I
did nothing but lie on the beach. On my last
day, I went shopping for some gifts to bring
home and wrote you this postcard.
Sincerely, Some friend of yours
Observation sequence
INTRODUCTION Background A peek into my work Conclusions
HIDDEN MARKOV MODEL run
Definition
A run of an HMM is a pair consisting of a sequence of states
s(0) s(1) . . . s(n), called a path, and a corresponding sequence of
emissions e(1) . . . e(n), called an observation, such that
s(0) = s0;
for all i, 0 ≤ i ≤ n−1: p(s(i), s(i+1)) > 0
(probability to transit from s(i) to s(i+1));
for all i, 0 < i ≤ n: p(s(i), e(i)) > 0
(probability to emit e(i) from s(i)).
Definition
The probability of such a run is defined as
∏ i=1..n p(s(i−1), s(i)) · p(s(i), e(i))
INTRODUCTION Background A peek into my work Conclusions
DECODING WITH HIDDEN MARKOV MODELS
Infer the hidden path given the observation sequence:
argmax_path P(path | observation)
(figure source: Wikipedia)
The Viterbi algorithm can be seen as keeping track of, for each prefix
of an observed emission sequence, the most probable (partial) path
leading to each possible state, and extending those step by step into
longer paths, eventually covering the entire emission sequence.
INTRODUCTION Background A peek into my work Conclusions
EXAMPLE HMM IN PRISM
values/2 declares the outcomes of random variables.
msw/2 simulates a random variable, stochastically selecting one of the outcomes.
The model in Prolog specifies the relation between the variables.

values(trans(_), [sunny,rainy]).
values(emit(_), [shop,beach,read]).

hmm(L) :- run_length(T), hmm(T,start,L).

hmm(0,_,[]).
hmm(T,State,[Emit|EmitRest]) :-
    T > 0,
    msw(trans(State),NextState),
    msw(emit(NextState),Emit),
    T1 is T-1,
    hmm(T1,NextState,EmitRest).

run_length(7).
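Once such a program is loaded into PRISM, its generic built-ins can be used for inference and learning. A rough sketch of typical queries (the exact predicate names and output formats may differ slightly between PRISM versions):

?- sample(hmm(L)).
   % sample an observation sequence of length 7 from the model
?- prob(hmm([read,read,shop,beach,beach,beach,shop]), P).
   % probability of the postcard observation sequence
?- viterbif(hmm([read,read,shop,beach,beach,beach,shop])).
   % most probable explanation, i.e. the hidden weather sequence
?- learn([hmm([read,read,shop,beach,beach,beach,shop])]).
   % EM estimation of the msw parameters from observed data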
INTRODUCTION Background A peek into my work Conclusions
A PEEK INTO MY WORK
Overview of papers
A few selected cases:
An abstraction: Constrained HMMs (also an optimization)
An optimization: Regarding tabling of structured data
A couple of applications: Genome models
Using constrained probabilistic models for gene finding with
overlapping genes
Gene finding with a probabilistic model for the genome-wide
sequence of gene reading frames
INTRODUCTION Background A peek into my work Conclusions
PAPERS 1
1. Henning Christiansen, Christian Theil Have, Ole Torp Lassen
and Matthieu Petit
Taming the Zoo of Discrete HMM Subspecies & some of their Relatives
Frontiers in Artificial Intelligence and Applications, 2011
2. Henning Christiansen, Christian Theil Have, Ole Torp Lassen
and Matthieu Petit
Inference with constrained hidden Markov models in PRISM
Theory and Practice of Logic Programming, 2010
3. Christian Theil Have
Constraints and Global Optimization for Gene Prediction Overlap
Resolution
Workshop on Constraint Based Methods for Bioinformatics, 2011
INTRODUCTION Background A peek into my work Conclusions
PAPERS 2
4. Henning Christiansen, Christian Theil Have, Ole Torp Lassen
and Matthieu Petit
The Viterbi Algorithm expressed in Constraint Handling Rules
7th International Workshop on Constraint Handling Rules, 2010
5. Christian Theil Have and Henning Christiansen
Modeling Repeats in DNA Using Probabilistic Extended Regular
Expressions
Frontiers in Artificial Intelligence and Applications, 2011
6. Henning Christiansen, Christian Theil Have, Ole Torp Lassen
and Matthieu Petit
Bayesian Annotation Networks for Complex Sequence Analysis
Technical Communications of the 27th International Conference
on Logic Programming (ICLP’11)
INTRODUCTION Background A peek into my work Conclusions
PAPERS 3
7. Henning Christiansen, Christian Theil Have, Ole Torp Lassen
and Matthieu Petit
A declarative pipeline language for big data analysis
Presented at LOPSTR, 2012
8. Christian Theil Have and Henning Christiansen
Efficient Tabling of Structured Data Using Indexing and Program
Transformation
Practical Aspects of Declarative Languages, 2012
9. Neng-Fa Zhou and Christian Theil Have
Efficient tabling of structured data with enhanced hash-consing
Theory and Practice of Logic Programming, 2012
INTRODUCTION Background A peek into my work Conclusions
PAPERS 4
10. Christian Theil Have and Søren Mørk
A Probabilistic Genome-Wide Gene Reading Frame Sequence Model
Submitted to PLOS One, 2012
11. Christian Theil Have, Sine Zambach and Henning
Christiansen
Effects of using Coding Potential, Sequence Conservation and mRNA
Structure Conservation for Predicting Pyrrolysine Containing Genes
Submitted to BMC Bioinformatics, 2012
INTRODUCTION Background A peek into my work Conclusions
THE TROUBLE WITH TABLING OF STRUCTURED DATA
INTRODUCTION Background A peek into my work Conclusions
THE TROUBLE WITH TABLING OF STRUCTURED DATA
An innocent looking
predicate: last/2
last([X],X).
last([_|L],X) :-
last(L,X).
Traverses a list to
find the last element.
Time/space
complexity: O(n).
If we table last/2:
n + (n−1) + (n−2) + . . . + 1
≈ O(n²)!
call:
last([1,2,3,4,5],X)
last([1,2,3,4,5],X)
last([1,2,3,4],X)
last([1,2,3],X)
last([1,2],X)
last([1],X)
call table
last([1,2,3,4,5],X).
last([1,2,3,4],X).
last([1,2,3],X).
last([1,2],X).
last([1],X).
INTRODUCTION Background A peek into my work Conclusions
A WORKAROUND IMPLEMENTED IN PROLOG
We describe a workaround giving O(1) time and space
complexity for table lookups for programs with arbitrarily
large ground structured data as input arguments.
A term is represented as a set of facts.
A subterm is referenced by a unique integer serving as an
abstract pointer.
Matching related to tabling is done solely by comparison
of such pointers.
INTRODUCTION Background A peek into my work Conclusions
AN ABSTRACT DATA TYPE
The representation is given by the following predicates which
all together can be understood as an abstract datatype.
store_term( +ground-term, pointer)
The ground-term is any ground term, and the
pointer returned is a unique reference (an integer)
for that term.
retrieve_term( +pointer, ?functor, ?arg-pointers-list)
Returns the functor and a list of pointers to
representations of the substructures of the term
represented by pointer.
full_retrieve_term( +pointer, ?ground-term)
Returns the term represented by pointer.
INTRODUCTION Background A peek into my work Conclusions
ADT EXAMPLE
Example
The following call converts the term f(a,g(b)) into its
internal representation and returns a pointer value in the
variable P.
store_term(f(a,g(b)),P).
After this, the following sequence of calls will succeed.
retrieve_term(P,f,[P1,P2]),
retrieve_term(P1,a,[]),
retrieve_term(P2,g,[P21]),
retrieve_term(P21,b,[]),
full_retrieve_term(P,f(a,g(b))).
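A minimal sketch of how such an ADT could be realized in plain Prolog (illustrative only; among other things, the actual implementation additionally ensures that identical subterms receive identical pointers, i.e. hash-consing, so that table lookups reduce to integer comparison):

% Illustrative sketch, not the thesis implementation.
:- dynamic stored_term/3, term_counter/1.
term_counter(0).

% store_term(+GroundTerm, -Ptr): assign a fresh integer pointer and
% record the functor together with pointers to the argument subterms.
store_term(Term, Ptr) :-
    Term =.. [Functor|Args],
    store_args(Args, ArgPtrs),
    retract(term_counter(Ptr)),
    Next is Ptr + 1,
    assertz(term_counter(Next)),
    assertz(stored_term(Ptr, Functor, ArgPtrs)).

store_args([], []).
store_args([A|As], [P|Ps]) :- store_term(A, P), store_args(As, Ps).

% retrieve_term(+Ptr, ?Functor, ?ArgPtrs): one level of the stored term.
retrieve_term(Ptr, Functor, ArgPtrs) :- stored_term(Ptr, Functor, ArgPtrs).

% full_retrieve_term(+Ptr, ?Term): rebuild the complete ground term.
full_retrieve_term(Ptr, Term) :-
    stored_term(Ptr, Functor, ArgPtrs),
    retrieve_args(ArgPtrs, Args),
    Term =.. [Functor|Args].

retrieve_args([], []).
retrieve_args([P|Ps], [A|As]) :- full_retrieve_term(P, A), retrieve_args(Ps, As).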
INTRODUCTION Background A peek into my work Conclusions
AN AUTOMATIC PROGRAM TRANSFORMATION
We introduce an automatic transformation from a tabled
program to an efficient version using our approach.
Structured terms are moved from the head of clauses to calls in
the body to retrieve_term/3 and full_retrieve_term/2.
INTRODUCTION Background A peek into my work Conclusions
TRANSFORMED HIDDEN MARKOV MODEL
We only need to consider the recursive predicate, hmm/2.
original version
hmm(_,[]).
hmm(S,[Ob|Y]) :-
msw(out(S),Ob),
msw(tr(S),Next),
hmm(Next,Y).
transformed version
hmm(S,ObsPtr):-
retrieve_term(ObsPtr,[]).
hmm(S,ObsPtr) :-
retrieve_term(ObsPtr,[Ob,Y]),
msw(out(S),Ob),
msw(tr(S),Next),
hmm(Next,Y).
INTRODUCTION Background A peek into my work Conclusions
BENCHMARKING RESULTS
[Figure: running time (seconds) as a function of sequence length (0-5000):
(a) with indexed lookup and (b) without indexed lookup.]
INTRODUCTION Background A peek into my work Conclusions
THE NEXT STEP
Integration at the Prolog engine implementation level.
Neng-Fa Zhou and Christian Theil Have
Efficient tabling of structured data with enhanced hash-consing
Theory and Practice of Logic Programming, 2012
Full sharing between tables (call and answer)
Sharing with structured data in call stack
INTRODUCTION Background A peek into my work Conclusions
CONSTRAINED HMMS
Definition
A constrained HMM (CHMM)
is an HMM
extended with a set of constraints C, each of which is a
mapping from HMM runs into {true, false}.
A run (path, observation) of a CHMM is a run of the
corresponding HMM for which C(path, observation) is true.
INTRODUCTION Background A peek into my work Conclusions
CONSTRAINED HMMS
Why extend an HMM with side-constraints?
To create better, more specific models with fewer states
Convenient to express prior knowledge in terms of
constraints
No need to change underlying HMM
Sometimes it is not possible or feasible to express such
constraints as HMM structure (e.g. all_different)
→ infeasibly huge state and parameter space
fewer paths to consider for any given sequence
→ decreased running time
INTRODUCTION Background A peek into my work Conclusions
PAIR HMMS FOR SEQUENCE ALIGNMENT
A pair HMM is a special kind of HMM that emits two
sequences
The match state emits a pair (xi, yj) of symbols
The insert state emits one symbol xi, from sequence x
The delete state emits one symbol yj, from sequence y
A run of this model produces an alignment of x and y
INTRODUCTION Background A peek into my work Conclusions
ALIGNMENT WITH A CONSTRAINED PAIR HMM
Consider adding constraints to the pair HMM introduced
earlier.
For instance,
In a biological context, we may want to
only consider alignments with a limited
number of insertions and deletions given
the assumption that the two sequences
are closely related.
C = {cardinality_atmost(Nd, [S1, . . . , Sn], delete),
cardinality_atmost(Ni, [S1, . . . , Sn], insert)} .
The constraint cardinality_atmost(N, L, X) is satisfied whenever
L is a list of elements, out of which at most N are equal to X.
INTRODUCTION Background A peek into my work Conclusions
ALIGNMENT WITH CONSTRAINTS
INTRODUCTION Background A peek into my work Conclusions
ADDING CONSTRAINT CHECKING TO THE HMM
HMM with constraint checking
hmm(T,State,[Emit|EmitRest],StoreIn) :-
T > 0,
msw(trans(State),NxtState),
msw(emit(NxtState),Emit),
check_constraints([NxtState,Emit],StoreIn,StoreOut),
T1 is T-1,
hmm(T1,NxtState,EmitRest,StoreOut).
Call to check_constraints/3 after each distinct
sequence of msw applications
Side-constraints: The constraints are assumed to be
declared elsewhere and not interleaved with the model
specification
Extra Store argument in the probabilistic predicate
INTRODUCTION Background A peek into my work Conclusions
CHECKING THE CONSTRAINTS
The goal check_constraints/3 calls constraint checkers for
all constraints declared on the model.
For instance, with our example pair HMM constraint,
C = {cardinality_atmost(Nd, [S1, . . . , Sn], delete),
cardinality_atmost(Ni, [S1, . . . , Sn], insert)} .
We have the following incremental constraint checker
implementation
A cardinality_atmost constraint checker
init_constraint_store(cardinality_atmost(_,_), 0).
% the constrained symbol U occurs: count it and check the bound
check_sat(cardinality_atmost(U,Max), U, In, Out) :-
    Out is In + 1, Out =< Max.
% any other symbol leaves the store unchanged
check_sat(cardinality_atmost(X,_), U, S, S) :- X \= U.
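To illustrate how such checkers could be tied together, here is a hypothetical sketch of check_constraints/3 (the store layout, a list of Constraint-State pairs, and the helper names are made up for the example and are not the paper's actual code):

% Hypothetical glue code, for illustration only.
% check_constraints(+Update, +StoreIn, -StoreOut)
% fails as soon as any declared constraint is violated.
check_constraints(_, [], []).
check_constraints(Update, [C-S0|Rest0], [C-S1|Rest1]) :-
    check_constraint(Update, C, S0, S1),
    check_constraints(Update, Rest0, Rest1).

% thread each symbol of the update (e.g. [NxtState,Emit]) through
% the incremental checker for a single constraint.
check_constraint([], _, S, S).
check_constraint([U|Us], C, S0, S2) :-
    check_sat(C, U, S0, S1),
    check_constraint(Us, C, S1, S2).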
INTRODUCTION Background A peek into my work Conclusions
A LIBRARY OF GLOBAL CONSTRAINTS FOR HIDDEN
MARKOV MODELS
Our implementation contains a few well-known global
constraints adapted to Hidden Markov Models.
Global constraints
cardinality lock_to_sequence
all_different lock_to_set
In addition, the implementation provides operators which may
be used to apply constraints to a limited set of variables.
Constraint operators
state_specific
emission_specific
forall_subseq (sliding window operator)
for_range (time step range operator)
INTRODUCTION Background A peek into my work Conclusions
TABLING ISSUES
Problem: The extra Store argument makes PRISM table
multiple goals (for different constraint stores) when it should
only store one.
hmm(T,State,[Emit|EmitRest],Store)
To get rid of the extra argument, check_constraints
dynamically maintains it as a stack using assert/retract:
check_constraints(Update) :-
get_store(StoreBefore),
check_constraints(Update,StoreBefore,StoreAfter),
forward_store(StoreAfter).
get_store(S) :- store(S), !.
forward_store(S) :- asserta(store(S)) ; retract(store(S)).
INTRODUCTION Background A peek into my work Conclusions
IMPACT OF USING A SEPARATE CONSTRAINT STORE
STACK
INTRODUCTION Background A peek into my work Conclusions
DISCUSSION AND LIMITATIONS
B-Prolog has later added an nt tabling mode which avoids
tabling of arguments, but it is implemented at the level of the
Prolog system.
However,
Avoiding tabling does not work for all types of constraints
For some constraints, it only works under certain
assumptions about the model
No interaction between constraints in this implementation
For tabled constraints we need
canonical representation
Pruning of non-essential parts of the constraint store
INTRODUCTION Background A peek into my work Conclusions
GENOME MODELS
Gene finding in a genomic context
What are the constraints between adjacent genes in the
genome?
Extent of (possible) overlap
Modeled as hard constraints
Gene reading frames, e.g., due to leading strand bias,
operons etc.
Modeled as (probabilistic) soft constraints
INTRODUCTION Background A peek into my work Conclusions
AN APPLICATION OF CONSTRAINED MARKOV
MODELS
We wish to incorporate overlapping gene constraints into gene
finding.
Divide and conquer two step approach to gene finding:
1. Gene prediction: A gene finder supplies a set of candidate
predictions p1 . . . pn, called the initial set.
2. Pruning: The initial set is pruned according to certain rules
or constraints. We call the pruned set the final set.
INTRODUCTION Background A peek into my work Conclusions
PRUNING STEP AS A CONSTRAINT OPTIMIZATION
PROBLEM
CSP formulation
We introduce variables X = x1 . . . xn corresponding to the
predictions p1 . . . pn in the initial set. All variables have boolean
domains, ∀xi ∈ X, D(xi) = {true, false}, and
xi = true ⇔ pi ∈ final set.
Multiple solutions
We want the “best” solution
Optimize for prediction confidence scores
Constraint Optimization Problem (COP)
COP formulation
Let the scores of p1 . . . pn be s1 . . . sn with si ∈ R+.
Maximize Σ i=1..n si (summed over the selected predictions,
i.e. those with xi = true), subject to C.
INTRODUCTION Background A peek into my work Conclusions
VARIABLE ORDERING
Assume an ordering on the variables,
Initial set predictions p1 . . . pn are sorted by the position of
their left-most base, such that
∀pi, pj, i < j ⇒ left-most(pi) ≤ left-most(pj).
The variables x1 . . . xn of the CSP/COP are given the same
ordering.
INTRODUCTION Background A peek into my work Conclusions
COP IMPLEMENTATION WITH MARKOV CHAIN (1)
We propose to use a (constrained) Markov chain for the COP.
The Markov chain has a begin state, an end state and two
states for each variable xi corresponding to its boolean
domain D(xi).
The state corresponding to D(xi) = true is denoted αi and
the state corresponding to D(xi) = false is denoted βi.
In this model, a path from the begin state to the end state
corresponds to a potential solution of the CSP.
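A rough PRISM-style sketch of this chain (illustrative only: the predicate names are made up, the probabilities σi would be attached to the switches, e.g. with set_sw/2, and the CHR-based consistency checking described on the following slides is omitted):

% Illustrative sketch of the selection chain, not the paper's actual code.
values(select(_), [alpha, beta]).      % D(xi) = {true, false}

% solution(+N, -Choices): one path through the chain for predictions 1..N
solution(N, Choices) :- chain(1, N, Choices).

chain(I, N, []) :- I > N.
chain(I, N, [Choice|Rest]) :-
    I =< N,
    msw(select(I), Choice),            % alpha keeps p_i, beta discards it
    I1 is I + 1,
    chain(I1, N, Rest).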
INTRODUCTION Background A peek into my work Conclusions
FROM CONFIDENCE SCORES TO TRANSITION
PROBABILITIES
P(α1 | begin) = σ1
P(β1 | begin) = 1 − σ1
P(end | αn) = P(end | βn) = 1
P(αi | αi−1) = P(αi | βi−1) = σi
P(βi | αi−1) = P(βi | βi−1) = 1 − σi
σi = 0.5 + λ + (0.5 − λ) × (si − min(s1 . . . sn)) / (max(s1 . . . sn) − min(s1 . . . sn))
INTRODUCTION Background A peek into my work Conclusions
ENCODING CONSTRAINTS WITH CONSTRAINT
HANDLING RULES
Constraints: alpha/2 and beta/2 ≈ visited states.
Example: Genemark inconsistency rules
alpha(Left1,Right1), alpha(Left2,Right2) <=>
Left1 =< Left2, Right1 >= Right2 | fail.
beta(Left1,Right1), alpha(Left2,Right2) <=>
Left1 =< Left2, Right1 >= Right2 | fail.
The most probable consistent path is found using PRISM's
generic adaptation of the Viterbi algorithm
Each step adds either an alpha or a beta (active) constraint
Incremental Pruning: For each step we only apply
constraints which may be transitively involved in rules
with the active constraint
INTRODUCTION Background A peek into my work Conclusions
EXPERIMENTAL RESULTS
Prediction on E. coli using a simplistic codon-frequency-based
gene finder.
Pruning using our global optimization approach (with all
inconsistency rules) versus local heuristic rules².
Method #predictions Sensitivity Specificity Time (seconds)
initial set 10799 0.7625 0.2926 na
Genemark rules 5823 0.7558 0.5379 1.4
ECOGENE rules 4981 0.7148 0.5947 1.7
global optimization 5222 0.7201 0.5714 75
Sensitivity = fraction of known reference genes predicted.
Specificity = fraction of predicted genes that are correct.
² Note that the results for the ECOGENE heuristic may vary depending on
execution strategy; in the results above, predictions with lower left
position are considered first.
INTRODUCTION Background A peek into my work Conclusions
A MODEL FOR THE GENOME-WIDE SEQUENCE OF
READING FRAMES
We wish to incorporate gene reading frame constraints into
gene finding.
Divide and conquer two step approach to gene finding (again):
1. Gene prediction: A gene finder supplies a set of candidate
predictions p1 . . . pn, called the initial set.
2. Pruning: The initial set is pruned according to gene finder
confidence scores and the probabilities of adjacent gene
reading frames. We call the pruned set the final set.
INTRODUCTION Background A peek into my work Conclusions
METHODOLOGY
Gene predictions are
sorted by stop codon
position.
Gene finder scores are
discretized into symbolic
values.
A type of Hidden Markov
Model which we call a
delete-HMM:
A state for each of the
six possible reading
frames and
one delete state
[Figure: delete-HMM with states F1-F6, one per reading frame, and a delete state]
INTRODUCTION Background A peek into my work Conclusions
MODEL
Emission: a finite set of
symbols δ1 . . . δn
corresponding to ranges of
prediction scores
Frame state transitions:
relative frequency of
"observed" adjacent gene
reading frame pairs
Transition to delete:
P(δi | state = delete) = FPδi / FP
(tunable)
INTRODUCTION Background A peek into my work Conclusions
RESULTS
[Figure: ROC curves, sensitivity (TPR) vs. 1−specificity (FPR), across score
thresholds, for frameseq trained on Escherichia, Salmonella, Legionella,
Bacillus and Thermoplasma.]
INTRODUCTION Background A peek into my work Conclusions
CONCLUSIONS
1. To what extent is it possible to use
probabilistic logic programming for
biological sequence analysis?
2. How can constraints relevant to the
domain of biological sequence analysis
be combined with probabilistic logic
programming?
3. What are the limitations with regard to
efficiency and how can these be dealt
with?
INTRODUCTION Background A peek into my work Conclusions
TO WHAT EXTENT IS IT POSSIBLE TO USE
PROBABILISTIC LOGIC PROGRAMMING FOR
BIOLOGICAL SEQUENCE ANALYSIS?
Commonly used models for biological analysis can be
implemented with probabilistic logic programming
Probabilistic logic programming is a powerful tool for
experimenting with new kinds of models
Efficiency is an issue, but with tabling optimizations it is
efficient enough for many interesting problems
Not merely a powerful abstraction, but
a valuable and practical tool for biological sequence
analysis
INTRODUCTION Background A peek into my work Conclusions
HOW CAN CONSTRAINTS RELEVANT TO THE DOMAIN
OF BIOLOGICAL SEQUENCE ANALYSIS BE COMBINED
WITH PROBABILISTIC LOGIC PROGRAMMING?
Probabilistic logic programming is a suitable language for
building higher level abstractions
A variety of models from biological sequence analysis can
be expressed, e.g., ZOO paper
Constrained Hidden Markov Models
Probabilistic Regular Expressions
(Probabilistic) soft constraints and hard constraints
Side-constraints or as part of model
Constraints affect the search space
INTRODUCTION Background A peek into my work Conclusions
WHAT ARE THE LIMITATIONS WITH REGARD TO
EFFICIENCY AND HOW CAN THESE BE DEALT WITH?
Tabling issues
Discriminating arguments (Christiansen and Gallagher)
Tabling of structured data
Tabling of constraint stores
Constraints.
Can be useful for reducing the search space
Can make search space exponential
Problem decomposition with Bayesian Annotation
Networks
Approximation
Feasibility of inference with complex models
Automatic parallelization – BANpipe
More Related Content

What's hot

20051128.doc
20051128.doc20051128.doc
20051128.docbutest
 
Non-parametric Subject Prediction
Non-parametric Subject PredictionNon-parametric Subject Prediction
Non-parametric Subject PredictionShenghui Wang
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationElaheh Barati
 
Browsing-oriented Semantic Faceted Search
Browsing-oriented Semantic Faceted SearchBrowsing-oriented Semantic Faceted Search
Browsing-oriented Semantic Faceted SearchWagner Andreas
 
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...Rommel Carvalho
 
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary StudyOn the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study Andre Freitas
 
Centroid-based Text Summarization through Compositionality of Word Embeddings
Centroid-based Text Summarization through Compositionality of Word EmbeddingsCentroid-based Text Summarization through Compositionality of Word Embeddings
Centroid-based Text Summarization through Compositionality of Word EmbeddingsGaetano Rossiello, PhD
 
AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...
AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...
AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...ijaia
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
 
Ch 9-1.Machine Learning: Symbol-based
Ch 9-1.Machine Learning: Symbol-basedCh 9-1.Machine Learning: Symbol-based
Ch 9-1.Machine Learning: Symbol-basedbutest
 
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...csandit
 
Which Rationality For Pragmatics6
Which Rationality For Pragmatics6Which Rationality For Pragmatics6
Which Rationality For Pragmatics6Louis de Saussure
 
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxes
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxesOwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxes
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxesRokan Uddin Faruqui
 

What's hot (18)

[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
[IJET-V1I6P17] Authors : Mrs.R.Kalpana, Mrs.P.Padmapriya
 
20051128.doc
20051128.doc20051128.doc
20051128.doc
 
Non-parametric Subject Prediction
Non-parametric Subject PredictionNon-parametric Subject Prediction
Non-parametric Subject Prediction
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text Summarization
 
Browsing-oriented Semantic Faceted Search
Browsing-oriented Semantic Faceted SearchBrowsing-oriented Semantic Faceted Search
Browsing-oriented Semantic Faceted Search
 
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
PrOntoLearn: Unsupervised Lexico-Semantic Ontology Generation using Probabili...
 
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary StudyOn the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
On the Semantic Mapping of Schema-agnostic Queries: A Preliminary Study
 
Centroid-based Text Summarization through Compositionality of Word Embeddings
Centroid-based Text Summarization through Compositionality of Word EmbeddingsCentroid-based Text Summarization through Compositionality of Word Embeddings
Centroid-based Text Summarization through Compositionality of Word Embeddings
 
AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...
AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...
AN IMPLEMENTATION, EMPIRICAL EVALUATION AND PROPOSED IMPROVEMENT FOR BIDIRECT...
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
Ch 9-1.Machine Learning: Symbol-based
Ch 9-1.Machine Learning: Symbol-basedCh 9-1.Machine Learning: Symbol-based
Ch 9-1.Machine Learning: Symbol-based
 
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
 
ComparativeMotifFinding
ComparativeMotifFindingComparativeMotifFinding
ComparativeMotifFinding
 
Which Rationality For Pragmatics6
Which Rationality For Pragmatics6Which Rationality For Pragmatics6
Which Rationality For Pragmatics6
 
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxes
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxesOwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxes
OwlOntDB: A Scalable Reasoning System for OWL 2 RL Ontologies with Large ABoxes
 
4 full
4 full4 full
4 full
 
International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)International Journal of Engineering Inventions (IJEI)
International Journal of Engineering Inventions (IJEI)
 
Thesis_Rehan_Aziz
Thesis_Rehan_AzizThesis_Rehan_Aziz
Thesis_Rehan_Aziz
 

Similar to Efficient Probabilistic Logic Programming for Biological Sequence Analysis

Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Miningbutest
 
A scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materializationA scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materializationRokan Uddin Faruqui
 
Directed versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysDirected versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysRoy Clariana
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2Viral Gupta
 
Probabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor AnalysisProbabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor AnalysisBayesia USA
 
The Logical Model Designer - Binding Information Models to Terminology
The Logical Model Designer - Binding Information Models to TerminologyThe Logical Model Designer - Binding Information Models to Terminology
The Logical Model Designer - Binding Information Models to TerminologySnow Owl
 
Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...butest
 
A report of the work done in this project is available here
A report of the work done in this project is available hereA report of the work done in this project is available here
A report of the work done in this project is available herebutest
 
download
downloaddownload
downloadbutest
 
download
downloaddownload
downloadbutest
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research ObjectsDavid De Roure
 
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...Facultad de Informática UCM
 
NLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inNLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inKumari Naveen
 
lect36-tasks.ppt
lect36-tasks.pptlect36-tasks.ppt
lect36-tasks.pptHaHa501620
 
The Role Of Ontology In Modern Expert Systems Dallas 2008
The Role Of Ontology In Modern Expert Systems   Dallas   2008The Role Of Ontology In Modern Expert Systems   Dallas   2008
The Role Of Ontology In Modern Expert Systems Dallas 2008Jason Morris
 
Sound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software TestingSound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software TestingJaguaraci Silva
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Reviewchangedaeoh
 

Similar to Efficient Probabilistic Logic Programming for Biological Sequence Analysis (20)

Presentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data MiningPresentation on Machine Learning and Data Mining
Presentation on Machine Learning and Data Mining
 
A scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materializationA scalable ontology reasoner via incremental materialization
A scalable ontology reasoner via incremental materialization
 
Directed versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysDirected versus undirected network analysis of student essays
Directed versus undirected network analysis of student essays
 
SNLI_presentation_2
SNLI_presentation_2SNLI_presentation_2
SNLI_presentation_2
 
How to write a paper
How to write a paperHow to write a paper
How to write a paper
 
Probabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor AnalysisProbabilistic Latent Factor Induction and
 Statistical Factor Analysis
Probabilistic Latent Factor Induction and
 Statistical Factor Analysis
 
The Logical Model Designer - Binding Information Models to Terminology
The Logical Model Designer - Binding Information Models to TerminologyThe Logical Model Designer - Binding Information Models to Terminology
The Logical Model Designer - Binding Information Models to Terminology
 
Exposé Ontology
Exposé OntologyExposé Ontology
Exposé Ontology
 
Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...Lecture 9 slides: Machine learning for Protein Structure ...
Lecture 9 slides: Machine learning for Protein Structure ...
 
A report of the work done in this project is available here
A report of the work done in this project is available hereA report of the work done in this project is available here
A report of the work done in this project is available here
 
download
downloaddownload
download
 
download
downloaddownload
download
 
Towards Computational Research Objects
Towards Computational Research ObjectsTowards Computational Research Objects
Towards Computational Research Objects
 
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
Like Alice in Wonderland: Unraveling Reasoning and Cognition Using Analogies ...
 
NLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful inNLP Tasks and Applications.ppt useful in
NLP Tasks and Applications.ppt useful in
 
lect36-tasks.ppt
lect36-tasks.pptlect36-tasks.ppt
lect36-tasks.ppt
 
The Role Of Ontology In Modern Expert Systems Dallas 2008
The Role Of Ontology In Modern Expert Systems   Dallas   2008The Role Of Ontology In Modern Expert Systems   Dallas   2008
The Role Of Ontology In Modern Expert Systems Dallas 2008
 
Sound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software TestingSound Empirical Evidence in Software Testing
Sound Empirical Evidence in Software Testing
 
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 ReviewNatural Language Generation / Stanford cs224n 2019w lecture 15 Review
Natural Language Generation / Stanford cs224n 2019w lecture 15 Review
 
Paul Groth
Paul GrothPaul Groth
Paul Groth
 

More from Christian Have

Efficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program TransformationEfficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program TransformationChristian Have
 
Constraints and Global Optimization for Gene Prediction Overlap Resolution
Constraints and Global Optimization for Gene Prediction Overlap ResolutionConstraints and Global Optimization for Gene Prediction Overlap Resolution
Constraints and Global Optimization for Gene Prediction Overlap ResolutionChristian Have
 
Nagios præsentation (på dansk)
Nagios præsentation (på dansk)Nagios præsentation (på dansk)
Nagios præsentation (på dansk)Christian Have
 
Stochastic Definite Clause Grammars
Stochastic Definite Clause GrammarsStochastic Definite Clause Grammars
Stochastic Definite Clause GrammarsChristian Have
 
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...Christian Have
 
Inference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISMInference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISMChristian Have
 

More from Christian Have (6)

Efficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program TransformationEfficient Tabling of Structured Data Using Indexing and Program Transformation
Efficient Tabling of Structured Data Using Indexing and Program Transformation
 
Constraints and Global Optimization for Gene Prediction Overlap Resolution
Constraints and Global Optimization for Gene Prediction Overlap ResolutionConstraints and Global Optimization for Gene Prediction Overlap Resolution
Constraints and Global Optimization for Gene Prediction Overlap Resolution
 
Nagios præsentation (på dansk)
Nagios præsentation (på dansk)Nagios præsentation (på dansk)
Nagios præsentation (på dansk)
 
Stochastic Definite Clause Grammars
Stochastic Definite Clause GrammarsStochastic Definite Clause Grammars
Stochastic Definite Clause Grammars
 
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Const...
 
Inference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISMInference with Constrained Hidden Markov Models in PRISM
Inference with Constrained Hidden Markov Models in PRISM
 

Efficient Probabilistic Logic Programming for Biological Sequence Analysis

  • 1. INTRODUCTION Background A peek into my work Conclusions Efficient Probabilistic Logic Programming for Biological Sequence Analysis Christian Theil Have Research group PLIS: Programming, Logic and Intelligent Systems Department of Communication, Business and Information Technologies Roskilde University
  • 2. INTRODUCTION Background A peek into my work Conclusions OUTLINE INTRODUCTION Domain Research questions Background Gene finding Probabilistic Logic Programming A peek into my work Overview of papers The trouble with tabling of structured data Constrained HMMs Applications: Genome models Conclusions
  • 3. INTRODUCTION Background A peek into my work Conclusions INTRODUCTION
  • 4. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics
  • 5. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences
  • 6. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences DNA
  • 7. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences DNA RNA
  • 8. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences DNA RNA Proteins
  • 9. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences DNA RNA Proteins to understand
  • 10. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences DNA RNA Proteins to understand Features
  • 11. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences DNA RNA Proteins to understand Features Functions
  • 12. INTRODUCTION Background A peek into my work Conclusions BIOLOGICAL SEQUENCE ANALYSIS Subfield of bioinformatics Analyze biological sequences DNA RNA Proteins to understand Features Functions Evolutionary relationships
  • 13. INTRODUCTION Background A peek into my work Conclusions PROBABILISTIC LOGIC PROGRAMMING Declarative programming paradigm Ability to express common and complex models used in biological sequence analysis Concise expression of complex models Separation between logic and control Generic inference algorithms Code as data: Transformations
  • 14. INTRODUCTION Background A peek into my work Conclusions MODELS FOR BIOLOGICAL SEQUENCE ANALYSIS Reflect relationships between features of sequence data Embody constraints – assumptions about data Infer information from data Reasoning about uncertainty → probabilities
  • 15. INTRODUCTION Background A peek into my work Conclusions THE LOST PROJECT . . . seeks to improve ease of modeling, accuracy and reliability of sequence analysis by using logic-statistical models that are yet largely untested in bioinformatics . . . Key focus areas: The PRISM system Prokaryotic gene finding My Ph.D. project is part of the LoSt project and share these focus areas.
  • 16. INTRODUCTION Background A peek into my work Conclusions RESEARCH QUESTIONS 1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis? 2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming? 3. What are the limitations with regard to efficiency and how can these be dealt with? I believe that these are the central questions that need be addressed in order to be able to construct useful tools for biological sequence analysis using probabilistic logic programming.
  • 17. INTRODUCTION Background A peek into my work Conclusions RELATIONS BETWEEN RESEARCH QUESTIONS 1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis? 2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming? 3. What are the limitations with regard to efficiency and how can these be dealt with?
  • 18. INTRODUCTION Background A peek into my work Conclusions APPROACH To build and evaluate Applications Abstractions Optimizations for biological sequence analysis using probabilistic logic programming.
  • 19. INTRODUCTION Background A peek into my work Conclusions APPROACH Applications Deal with relevant biological sequence analysis problems Potentially to contribute new knowledge to biology or bioinformatics Direct substantiation with regard to research question 1 Abstractions Ease modeling Language for incorporating constraints from the domain A higher level of declarativity; Focus on problem rather than implementation (model) details Optimizations Deal with limitations of probabilistic logic programming that may hinder its use in biological sequence analysis. Efficient inference is a precondition for practical use.
  • 20. INTRODUCTION Background A peek into my work Conclusions BACKGROUND Prokaryotic gene finding Probabilistic logic programming
  • 21. INTRODUCTION Background A peek into my work Conclusions PROKARYOTIC GENE FINDING Identify regions of DNA which encode proteins: A (prokaryotic) gene is a consecutive stretch of DNA which: is transcribed as part of an RNA, is translated to a complete protein, has a length which is a multiple of three (codons), starts with a “start” codon, and whose last codon is a “stop” codon
  • 22. INTRODUCTION Background A peek into my work Conclusions GENES AND OPEN READING FRAMES The identification of prokaryotic genes may be decomposed into two distinct problems: 1. Identification of ORFs which contain protein coding genes. 2. Identification of the correct start codon within an ORF. ORF ::= start not-stop * stop start ::= TTG | CTG | ATT | ATC | ATA | ATG | GTG stop ::= TAA | TAG | TGA not-stop ::= AAA | ... | TTT //all codons except those in stop
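This grammar maps almost one-to-one onto a logic program. The following is a minimal DCG sketch of it in Prolog (an illustration, not code from the thesis; nucleotides are lower-case atoms and member/2 is assumed available from library(lists)):

    orf --> start, not_stops, stop.

    not_stops --> [].
    not_stops --> not_stop, not_stops.

    start --> codon(C), { member(C, [[t,t,g],[c,t,g],[a,t,t],[a,t,c],[a,t,a],[a,t,g],[g,t,g]]) }.
    stop  --> codon(C), { member(C, [[t,a,a],[t,a,g],[t,g,a]]) }.
    not_stop --> codon(C), { \+ member(C, [[t,a,a],[t,a,g],[t,g,a]]) }.

    codon([X,Y,Z]) --> [X,Y,Z].

    % ?- phrase(orf, [a,t,g, g,c,a, c,g,t, t,a,a]).   succeeds
    % ?- phrase(orf, [a,t,g, t,a,a, c,g,t, t,a,a]).   fails: a stop codon cannot occur inside the ORF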
  • 23. INTRODUCTION Background A peek into my work Conclusions SIGNALS FOR PROKARYOTIC GENE FINDING Open reading frames Length Nucleotide sequence composition Conservation (sequence similarity in other organisms) Local context Promoters Ribosomal binding site Termination signal (diagram: promoter region with the -35 and -10 boxes, transcription start site at +1, Shine-Dalgarno site at ≈ +10, gene start at ≈ +15-20, and terminator)
  • 24. INTRODUCTION Background A peek into my work Conclusions READING FRAMES AND OVERLAPPING GENES RNA can be transcribed from either strand Genes may start in different “reading frames” Genes can overlap in the same and in different reading frames on opposite strands
  • 25. INTRODUCTION Background A peek into my work Conclusions PROBABILISTIC LOGIC PROGRAMMING Logic programming and Prolog Probabilistic logic programming and PRISM
  • 26. INTRODUCTION Background A peek into my work Conclusions LOGIC PROGRAMMING AND PROLOG A Prolog program consists of a finite sequence of rules, B:-A1, . . . , An. These rules define implications, i.e., B if A1 and . . . and An
  • 27. INTRODUCTION Background A peek into my work Conclusions TERMS, LITERALS AND VARIABLES Literals can consist of (possibly) structured terms, that may include variables. number(0). number(s(X)) :- number(X).
  • 28. INTRODUCTION Background A peek into my work Conclusions TERMS, LITERALS AND VARIABLES Literals can consist of (possibly) structured terms, that may include variables. fact number(0). number(s(X)) :- number(X).
  • 29. INTRODUCTION Background A peek into my work Conclusions TERMS, LITERALS AND VARIABLES Literals can consist of (possibly) structured terms, that may include variables. constant number(0). number(s(X)) :- number(X).
  • 30. INTRODUCTION Background A peek into my work Conclusions TERMS, LITERALS AND VARIABLES Literals can consist of (possibly) structured terms, that may include variables. number(0). number(s(X)) :- number(X). term
  • 31. INTRODUCTION Background A peek into my work Conclusions TERMS, LITERALS AND VARIABLES Literals can consist of (possibly) structured terms, that may include variables. number(0). number(s(X)) :- number(X). variables
  • 32. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation
  • 33. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(0)
  • 34. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) →
  • 35. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) →
  • 36. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) → number(s(0))
  • 37. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) →
  • 38. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) →
  • 39. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) → number(s(s(X))) →
  • 40. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) → number(s(s(X))) →
  • 41. INTRODUCTION Background A peek into my work Conclusions RESOLUTION Problems are stated as theorems (goals) to be proved, e.g., number(X) To prove a consequent, we recursively need to prove the antecedents by using rules where these appear as consequents, number(0). number(s(X)) :- number(X). Solutions number(X) → X = 0 number(X) → X = s(0) number(X) → X = s(s(0)) . . . Derivation number(X) → number(s(X)) → number(s(s(X))) → number(s(s(0)))
  • 42. INTRODUCTION Background A peek into my work Conclusions DERIVATION TREES AND EXPLANATION GRAPHS Consider the following program which adds natural numbers: add(0+0,0). add(A+s(B),s(C)) :- add(A+B,C). add(s(A)+B,s(C)) :- add(A+B,C). And suppose we call the goal, add(s(s(0))+s(s(0)),R) We now have two alternative applicable clauses, alternatives Resulting in either, add(s(0)+s(s(0)),s(R)) or add(s(s(0))+s(0),s(R))
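Querying the program directly illustrates the point: every derivation computes the same sum, but along a different path (the query below is only an illustration of the add/2 program above):

    ?- add(s(s(0))+s(s(0)), R).
    R = s(s(s(s(0)))) .

    % On backtracking the same answer is found again and again, once per
    % distinct derivation, because at every step either the left or the
    % right argument can be decremented. The number of derivations grows
    % exponentially with the size of the arguments.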
  • 43. INTRODUCTION Background A peek into my work Conclusions DERIVATION TREE (diagram: the derivation tree for add(s(s(0))+s(s(0)),R), with root s(s(0))+s(s(0)) and subgoals such as s(0)+s(0), 0+s(0) and 0+0 repeated in several branches)
  • 44. INTRODUCTION Background A peek into my work Conclusions DERIVATION TREE (the same derivation tree) Exponential!
  • 45. INTRODUCTION Background A peek into my work Conclusions EXPLANATION GRAPH Polynomial: O(n ∗ m), but would be O(n + m) if arguments were ordered by size
  • 46. INTRODUCTION Background A peek into my work Conclusions TABLING Idea The system maintains a table of calls and their answers. when a new call is entered, check if it is stored in the table if so, use previously found solution Consequence: Explanation graph representation. Significant speed-up of program execution.
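As a concrete, standard illustration of table declarations in B-Prolog/XSB-style systems (an example of mine, not taken from the thesis): without the table directive the naive Fibonacci predicate below makes an exponential number of calls, while with it each subgoal fib(I,F) is computed once and every later occurrence is answered from the table, giving linear time.

    :- table fib/2.

    fib(0, 1).
    fib(1, 1).
    fib(N, F) :-
        N > 1,
        N1 is N - 1, N2 is N - 2,
        fib(N1, F1), fib(N2, F2),
        F is F1 + F2.

    % ?- fib(30, F).
    % F = 1346269.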
  • 47. INTRODUCTION Background A peek into my work Conclusions PROBABILISTIC LOGIC PROGRAMMING Probabilistic logic programming is a form of logic programming which deals with uncertainty. A logic program induces a set of possible worlds, i.e., the set of derivable consequents and their alternative proofs. Probabilistic logic programming extends logic programming by assigning probabilities to each of these possible worlds and extends logical inference into probabilistic inference, e.g., to derive the probability of a goal, infer the most probable derivation of a goal, or infer the affinities (represented by probabilities) for different possible worlds from data
  • 48. INTRODUCTION Background A peek into my work Conclusions PRISM PRogramming In Statistical Modelling is a framework for probabilistic logic programming Developed by collaboration partners of the Lost project: Yoshitaka Kameya, Taisuke Sato, and Neng-Fa Zhou. An extension of Prolog with random variables, called MSWs Provides efficient generalized inference algorithms (Viterbi, EM, etc) using tabling PRISM program = probabilistic model
  • 49. INTRODUCTION Background A peek into my work Conclusions HIDDEN MARKOV MODEL EXAMPLE Postcard Greetings from wherever, where I am having a great time. Here is what I have been doing: The first two days, I stayed at the hotel reading a good book. Then, on the third day I decided to go shopping. The next three days I did nothing but lie on the beach. On my last day, I went shopping for some gifts to bring home and wrote you this postcard. Sincerely, Some friend of yours Observation sequence
  • 50. INTRODUCTION Background A peek into my work Conclusions HIDDEN MARKOV MODEL run Definition A run of an HMM is a pair consisting of a sequence of states s(0) s(1) . . . s(n), called a path, and a corresponding sequence of emissions e(1) . . . e(n), called an observation, such that s(0) = s0; ∀i, 0 ≤ i ≤ n − 1, p(s(i); s(i+1)) > 0 (probability to transit from s(i) to s(i+1)); ∀i, 0 < i ≤ n, p(s(i); e(i)) > 0 (probability to emit e(i) from s(i)). Definition The probability of such a run is defined as ∏i=1..n p(s(i−1); s(i)) · p(s(i); e(i))
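As a small worked instance (with made-up probabilities): for a run with path s0 s(1) s(2) and observation e(1) e(2), the product above is p(s0; s(1)) · p(s(1); e(1)) · p(s(1); s(2)) · p(s(2); e(2)); with assumed values 0.3, 0.1, 0.6 and 0.4 this gives 0.3 · 0.1 · 0.6 · 0.4 = 0.0072.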
  • 51. INTRODUCTION Background A peek into my work Conclusions DECODING WITH HIDDEN MARKOV MODELS Infer the hidden path given the observation sequence. argmax_path P(path|observation) source: wikipedia The Viterbi algorithm can be seen as keeping track of, for each prefix of an observed emission sequence, the most probable (partial) path leading to each possible state, and extending those step by step into longer paths, eventually covering the entire emission sequence.
  • 52. INTRODUCTION Background A peek into my work Conclusions EXAMPLE HMM IN PRISM values/2 declares the outcomes of random variables msw/2 simulates a random variable, stochastically selecting one of the outcomes Model in Prolog Specifies relation between variables Example HMM in PRISM values(trans(_), [sunny,rainy]). values(emit(_), [shop,beach,read]). hmm(L):- run_length(T),hmm(T,start,L). hmm(0,_,[]). hmm(T,State,[Emit|EmitRest]) :- T > 0, msw(trans(State),NextState), msw(emit(NextState),Emit), T1 is T-1, hmm(T1,NextState,EmitRest). run_length(7).
  • 53. INTRODUCTION Background A peek into my work Conclusions EXAMPLE HMM IN PRISM values/2 declares the outcomes of random variables msw/2 simulates a random variable, stochastically selecting one of the outcomes Model in Prolog Specifies relation between variables Example HMM in PRISM values(trans(_), [sunny,rainy]). values(emit(_), [shop,beach,read]). hmm(L):- run_length(T),hmm(T,start,L). hmm(0,_,[]). hmm(T,State,[Emit|EmitRest]) :- T > 0, msw(trans(State),NextState), msw(emit(NextState),Emit), T1 is T-1, hmm(T1,NextState,EmitRest). run_length(7).
  • 54. INTRODUCTION Background A peek into my work Conclusions EXAMPLE HMM IN PRISM values/2 declares the outcomes of random variables msw/2 simulates a random variable, stochastically selecting one of the outcomes Model in Prolog Specifies relation between variables Example HMM in PRISM values(trans(_), [sunny,rainy]). values(emit(_), [shop,beach,read]). hmm(L):- run_length(T),hmm(T,start,L). hmm(0,_,[]). hmm(T,State,[Emit|EmitRest]) :- T > 0, msw(trans(State),NextState), msw(emit(NextState),Emit), T1 is T-1, hmm(T1,NextState,EmitRest). run_length(7).
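Given such a program, PRISM's generic inference machinery applies directly. The queries below sketch typical usage, assuming PRISM's prob/2, viterbif/1 and learn/1 built-ins; consult the PRISM manual for the exact call forms.

    % Probability of a particular week of observations:
    ?- prob(hmm([read,read,shop,beach,beach,beach,shop]), P).

    % Most probable explanation (Viterbi) for the same observation,
    % i.e. the most likely hidden weather sequence:
    ?- viterbif(hmm([read,read,shop,beach,beach,beach,shop])).

    % EM learning of the msw parameters from a list of observed goals:
    ?- learn([hmm([read,read,shop,beach,beach,beach,shop]),
              hmm([shop,shop,read,read,beach,beach,shop])]).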
  • 55. INTRODUCTION Background A peek into my work Conclusions A PEEK INTO MY WORK Overview of papers A few selected cases: An abstraction: Constrained HMMs (also an optimization) An optimization: Regarding tabling of structured data A couple of applications: Genome models Using constrained probabilistic models for gene finding with overlapping genes Gene finding with a probabilistic model of the genome-wide sequence of gene reading frames
  • 56. INTRODUCTION Background A peek into my work Conclusions PAPERS 1 1. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit Taming the Zoo of Discrete HMM Subspecies & some of their Relatives Frontiers in Artificial Intelligence and Applications, 2011 2. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit Inference with constrained hidden Markov models in PRISM Theory and Practice of Logic Programming, 2010 3. Christian Theil Have Constraints and Global Optimization for Gene Prediction Overlap Resolution Workshop on Constraint Based Methods for Bioinformatics, 2011
  • 57. INTRODUCTION Background A peek into my work Conclusions PAPERS 2 4. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit The Viterbi Algorithm expressed in Constraint Handling Rules 7th International Workshop on Constraint Handling Rules, 2010 5. Christian Theil Have and Henning Christiansen Modeling Repeats in DNA Using Probabilistic Extended Regular Expressions Frontiers in Artificial Intelligence and Applications, 2011 6. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit Bayesian Annotation Networks for Complex Sequence Analysis Technical Communications of the 27th International Conference on Logic Programming (ICLP’11)
  • 58. INTRODUCTION Background A peek into my work Conclusions PAPERS 3 7. Henning Christiansen, Christian Theil Have, Ole Torp Lassen and Matthieu Petit A declarative pipeline language for big data analysis Presented at LOPSTR, 2012 8. Christian Theil Have and Henning Christiansen Efficient Tabling of Structured Data Using Indexing and Program Transformation Practical Aspects of Declarative Languages, 2012 9. Neng-Fa Zhou and Christian Theil Have Efficient tabling of structured data with enhanced hash-consing Theory and Practice of Logic Programming, 2012
  • 59. INTRODUCTION Background A peek into my work Conclusions PAPERS 4 10. Christian Theil Have and Søren Mørk A Probabilistic Genome-Wide Gene Reading Frame Sequence Model Submitted to PLOS One, 2012 11. Christian Theil Have, Sine Zambach and Henning Christiansen Effects of using Coding Potential, Sequence Conservation and mRNA Structure Conservation for Predicting Pyrrolysine Containing Genes Submitted to BMC Bioinformatics, 2012
  • 60. INTRODUCTION Background A peek into my work Conclusions THE TROUBLE WITH TABLING OF STRUCTURED DATA
  • 61. INTRODUCTION Background A peek into my work Conclusions THE TROUBLE WITH TABLING OF STRUCTURED DATA An innocent looking predicate: last/2 last([X],X). last([_|L],X) :- last(L,X). Traverses a list to find the last element. Time/space complexity: O(n). If we table last/2: n + (n − 1) + (n − 2) + . . . + 1 ≈ O(n²)! For the call last([1,2,3,4,5],X) the table ends up holding a variant for every suffix: last([1,2,3,4,5],X), last([1,2,3,4],X), last([1,2,3],X), last([1,2],X), last([1],X).
  • 62. INTRODUCTION Background A peek into my work Conclusions A WORKAROUND IMPLEMENTED IN PROLOG We describe a workaround giving O(1) time and space complexity for table lookups for programs with arbitrarily large ground structured data as input arguments. A term is represented as a set of facts. A subterm is referenced by a unique integer serving as an abstract pointer. Matching related to tabling is done solely by comparison of such pointers.
  • 63. INTRODUCTION Background A peek into my work Conclusions AN ABSTRACT DATA TYPE The representation is given by the following predicates which all together can be understood as an abstract datatype. store_term( +ground-term, pointer) The ground-term is any ground term, and the pointer returned is a unique reference (an integer) for that term. retrieve_term( +pointer, ?functor, ?arg-pointers-list) Returns the functor and a list of pointers to representations of the substructures of the term represented by pointer. full_retrieve_term( +pointer, ?ground-term) Returns the term represented by pointer.
  • 64. INTRODUCTION Background A peek into my work Conclusions ADT EXAMPLE Example The following call converts the term f(a,g(b)) into its internal representation and returns a pointer value in the variable P. store_term(f(a,g(b)),P). After this, the following sequence of calls will succeed. retrieve_term(P,f,[P1,P2]), retrieve_term(P1,a,[]), retrieve_term(P2,g,[P21]), retrieve_term(P21,b,[]), full_retrieve_term(P,f(a,g(b))).
  • 65. INTRODUCTION Background A peek into my work Conclusions AN AUTOMATIC PROGRAM TRANSFORMATION We introduce an automatic transformation from a tabled program to an efficient version using our approach. Structured terms are moved from the head of clauses to calls in the body to retrieve_term/3 and full_retrieve_term/2.
  • 66. INTRODUCTION Background A peek into my work Conclusions TRANSFORMED HIDDEN MARKOV MODEL We only need to consider the recursive predicate, hmm/2. original version hmm(_,[]). hmm(S,[Ob|Y]) :- msw(out(S),Ob), msw(tr(S),Next), hmm(Next,Y). transformed version hmm(_,ObsPtr) :- retrieve_term(ObsPtr,[],[]). hmm(S,ObsPtr) :- retrieve_term(ObsPtr,'.',[ObPtr,Y]), retrieve_term(ObPtr,Ob,[]), msw(out(S),Ob), msw(tr(S),Next), hmm(Next,Y).
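A minimal usage sketch, assuming that the observation list is first converted with store_term/2 and that s0 is the name of the initial state (the state and symbol names are only illustrative):

    ?- store_term([shop,beach,read], ObsPtr),
       prob(hmm(s0, ObsPtr), P).

    % All tabled calls to hmm/2 now carry the integer pointer ObsPtr (and
    % pointers to its suffixes) instead of copies of the list, so table
    % lookups compare integers rather than traversing structured terms.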
  • 67. INTRODUCTION Background A peek into my work Conclusions BENCHMARKING RESULTS ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 4000 5000 020406080100120140 b) Running time without indexed lookup sequence length Runningtime(seconds) ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 1000 2000 3000 4000 5000 0.000.020.040.060.08 a) Running time with indexed lookup sequence length Runningtime(seconds)
  • 68. INTRODUCTION Background A peek into my work Conclusions THE NEXT STEP Integration at the Prolog engine implementation level. Neng-Fa Zhou and Christian Theil Have Efficient tabling of structured data with enhanced hash-consing Theory and Practice of Logic Programming, 2012 Full sharing between tables (call and answer) Sharing with structured data in call stack
  • 69. INTRODUCTION Background A peek into my work Conclusions CONSTRAINED HMMS Definition A constrained HMM (CHMM) is an HMM extended with a set of constraints C, each of which is a mapping from HMM runs into {true, false}. A run ⟨path, observation⟩ of a CHMM is a run of the corresponding HMM for which C(path, observation) is true.
  • 70. INTRODUCTION Background A peek into my work Conclusions CONSTRAINED HMMS Why extend an HMM with side-constraints? To create better, more specific models with fewer states Convenient to express prior knowledge in terms of constraints No need to change underlying HMM Sometimes it is not possible or feasible to express such constraints as HMM structure (e.g. all_different) → infeasibly huge state and parameter space Fewer paths to consider for any given sequence → decreased running time
  • 71. INTRODUCTION Background A peek into my work Conclusions PAIR HMMS FOR SEQUENCE ALIGNMENT A pair HMM is a special kind of HMM that emits two sequences
  • 72. INTRODUCTION Background A peek into my work Conclusions PAIR HMMS FOR SEQUENCE ALIGNMENT A pair HMM is a special kind of HMM that emits two sequences The match state emits a pair (xi, yj) of symbols
  • 73. INTRODUCTION Background A peek into my work Conclusions PAIR HMMS FOR SEQUENCE ALIGNMENT A pair HMM is a special kind of HMM that emits two sequences The match state emits a pair (xi, yj) of symbols The insert state emits one symbol xi from sequence x
  • 74. INTRODUCTION Background A peek into my work Conclusions PAIR HMMS FOR SEQUENCE ALIGNMENT A pair HMM is a special kind of HMM that emits two sequences The match state emits a pair (xi, yj) of symbols The insert state emits one symbol xi from sequence x The delete state emits one symbol yj from sequence y
  • 75. INTRODUCTION Background A peek into my work Conclusions PAIR HMMS FOR SEQUENCE ALIGNMENT A pair HMM is a special kind of HMM that emits two sequences The match state emits a pair (xi, yj) of symbols The insert state emits one symbol xi from sequence x The delete state emits one symbol yj from sequence y A run of this model produces an alignment of x and y
  • 76. INTRODUCTION Background A peek into my work Conclusions ALIGNMENT WITH A CONSTRAINED PAIR HMM Consider adding constraints to the pair HMM introduced earlier. For instance: In a biological context, we may want to only consider alignments with a limited number of insertions and deletions given the assumption that the two sequences are closely related. C = {cardinality_atmost(Nd, [S1, . . . , Sn], delete), cardinality_atmost(Ni, [S1, . . . , Sn], insert)}. The constraint cardinality_atmost(N, L, X) is satisfied whenever L is a list of elements, out of which at most N are equal to X.
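To make the semantics concrete before the incremental checker shown below, here is a straightforward, non-incremental Prolog reading of cardinality_atmost/3 over a complete sequence (an illustrative sketch, not the thesis implementation; member/2 and length/2 are standard list predicates):

    % cardinality_atmost(+N, +List, +X): at most N elements of List are equal to X.
    cardinality_atmost(N, List, X) :-
        findall(E, (member(E, List), E == X), Es),
        length(Es, Count),
        Count =< N.

    % ?- cardinality_atmost(2, [match,insert,match,insert,match], insert).  succeeds
    % ?- cardinality_atmost(1, [match,insert,match,insert,match], insert).  fails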
  • 77. INTRODUCTION Background A peek into my work Conclusions ALIGNMENT WITH CONSTRAINTS
  • 78. INTRODUCTION Background A peek into my work Conclusions ADDING CONSTRAINT CHECKING TO THE HMM HMM with constraint checking hmm(T,State,[Emit|EmitRest],StoreIn) :- T > 0, msw(trans(State),NxtState), msw(emit(NxtState),Emit), check_constraints([NxtState,Emit],StoreIn,StoreOut), T1 is T-1, hmm(T1,NxtState,EmitRest,StoreOut). Call to check_constraints/3 after each distinct sequence of msw applications Side-constraints: The constraints are assumed to be declared elsewhere and not interleaved with model specification Extra Store argument in the probabilistic predicate
  • 79. INTRODUCTION Background A peek into my work Conclusions CHECKING THE CONSTRAINTS The goal check_constraints/3 calls constraint checkers for all constraints declared on the model. For instance, with our example pair HMM constraint, C = {cardinality_atmost(Nd, [S1, . . . , Sn], delete), cardinality_atmost(Ni, [S1, . . . , Sn], insert)}. We have the following incremental constraint checker implementation A cardinality_atmost constraint checker init_constraint_store(cardinality_atmost(_,_), 0). check_sat(cardinality_atmost(U,Max), U, In, Out) :- Out is In + 1, Out =< Max. check_sat(cardinality_atmost(X,_),U,S,S) :- X \= U.
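Folding the incremental checker over a path shows how the constraint store (here just a counter) evolves; the driver below is only an illustration, since in the model check_constraints/3 is called step by step as shown above:

    run_checker(_Constraint, [], Store, Store).
    run_checker(Constraint, [U|Us], In, Out) :-
        check_sat(Constraint, U, In, Mid),
        run_checker(Constraint, Us, Mid, Out).

    % ?- init_constraint_store(cardinality_atmost(delete,2), S0),
    %    run_checker(cardinality_atmost(delete,2),
    %                [match,delete,match,delete,match], S0, S).
    % S0 = 0, S = 2.    % a third delete in the path would make the check fail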
  • 80. INTRODUCTION Background A peek into my work Conclusions A LIBRARY OF GLOBAL CONSTRAINTS FOR HIDDEN MARKOV MODELS Our implementation contains a few well-known global constraints adapted to Hidden Markov Models. Global constraints cardinality lock_to_sequence all_different lock_to_set In addition, the implementation provides operators which may be used to apply constraints to a limited set of variables. Constraint operators state_specific emission_specific forall_subseq (sliding window operator) for_range (time step range operator)
  • 81. INTRODUCTION Background A peek into my work Conclusions TABLING ISSUES Problem: The extra Store argument makes PRISM table multiple goals (for different constraint stores) when it should only store one. hmm(T,State,[Emit|EmitRest],Store) To get rid of the extra argument, check_constraints dynamically maintains it as a stack using assert/retract: check_constraints(Update) :- get_store(StoreBefore), check_constraints(Update,StoreBefore,StoreAfter), forward_store(StoreAfter). get_store(S) :- store(S), !. forward_store(S) :- asserta(store(S)) ; retract(store(S)).
  • 82. INTRODUCTION Background A peek into my work Conclusions IMPACT OF USING A SEPARATE CONSTRAINT STORE STACK
  • 83. INTRODUCTION Background A peek into my work Conclusions DISCUSSION AND LIMITATIONS B-Prolog has since added an nt tabling mode which avoids tabling of selected arguments, implemented at the level of the Prolog system. However: Avoiding tabling does not work for all types of constraints For some constraints, it only works under certain assumptions about the model No interaction between constraints in this implementation For tabled constraints we need a canonical representation Pruning of non-essential parts of the constraint store
  • 84. INTRODUCTION Background A peek into my work Conclusions GENOME MODELS Gene finding in a genomic context What are the constraints between adjacent genes in the genome? Extent of (possible) overlap Modeled as hard constraints Gene reading frames, e.g., due to leading strand bias, operons, etc. Modeled as (probabilistic) soft constraints
  • 85. INTRODUCTION Background A peek into my work Conclusions AN APPLICATION OF CONSTRAINED MARKOV MODELS We wish to incorporate overlapping gene constraints into gene finding. Divide and conquer two step approach to gene finding: 1. Gene prediction: A gene finder supplies a set of candidate predictions p1 . . . pn, called the initial set. 2. Pruning: The initial set is pruned according to certain rules or constraints. We call the pruned set the final set.
  • 86. INTRODUCTION Background A peek into my work Conclusions PRUNING STEP AS A CONSTRAINT OPTIMIZATION PROBLEM CSP formulation We introduce variables X = x1 . . . xn corresponding to each prediction p1 . . . pn in the initial set. All variables have boolean domains, ∀xi ∈ X, D(xi) = {true, false} and xi = true ⇔ pi ∈ final set. Multiple solutions We want the “best” solution Optimize for prediction confidence scores Constraint Optimization Problem (COP) COP formulation Let the scores of p1 . . . pn be s1 . . . sn and si ∈ R+. Maximize the total score of the final set, ∑{ si | xi = true }, subject to C.
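For comparison, such a COP can also be written down directly with a finite-domain constraint solver. The sketch below uses SWI-Prolog's library(clpfd) with 0/1 variables, integer-scaled scores and an assumed list of index pairs of mutually inconsistent predictions; it is not the Markov-chain encoding used in the thesis, which is described on the following slides.

    :- use_module(library(clpfd)).

    % Scores: integer-scaled confidence scores s1..sn.
    % Pairs:  I-J index pairs of predictions that may not both be selected.
    prune(Scores, Pairs, Selection, Total) :-
        length(Scores, N),
        length(Selection, N),
        Selection ins 0..1,
        maplist(exclusive(Selection), Pairs),
        scalar_product(Scores, Selection, #=, Total),
        labeling([max(Total)], Selection).

    exclusive(Selection, I-J) :-
        nth1(I, Selection, Xi),
        nth1(J, Selection, Xj),
        Xi + Xj #=< 1.

    % ?- prune([10,7,5,3], [1-2, 2-3], Sel, Total).
    % Sel = [1,0,1,1], Total = 18.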
  • 87. INTRODUCTION Background A peek into my work Conclusions VARIABLE ORDERING Assume an ordering on the variables, Initial set predictions p1 . . . pn are sorted by the position of their left-most base, such that ∀pi, pj, i < j ⇒ left-most(pi) ≤ left-most(pj). The variables x1 . . . xn of the CSP/COP are given the same ordering.
  • 88. INTRODUCTION Background A peek into my work Conclusions COP IMPLEMENTATION WITH MARKOV CHAIN (1) We propose to use a (constrained) Markov chain for the COP. The Markov chain has a begin state, an end state and two states for each variable xi corresponding to its boolean domain D(xi). The state corresponding to D(xi) = true is denoted αi and the state corresponding to D(xi) = false is denoted βi. In this model, a path from the begin state to the end state corresponds to a potential solution of the CSP.
  • 89. INTRODUCTION Background A peek into my work Conclusions FROM CONFIDENCE SCORES TO TRANSITION PROBABILITIES P(α1|begin) = σ1 P(β1|begin) = 1 − σ1 P(end|αn) = P(end|βn) = 1. P(αi|αi−1) = P(αi|βi−1) = σi P(βi|αi−1) = P(βi|βi−1) = 1 − σi σi = 0.5 + λ + (0.5 − λ) × (si − min(s1 . . . sn)) / (max(s1 . . . sn) − min(s1 . . . sn))
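A small helper predicate makes the score-to-probability mapping tangible; this is an illustrative sketch following the formula above, assuming min_list/2 and max_list/2 from library(lists) and a score list with at least two distinct values:

    % sigma(+Score, +Scores, +Lambda, -Sigma)
    % Scales a raw prediction score linearly into a transition probability
    % between 0.5 + Lambda (worst score) and 1.0 (best score).
    sigma(S, Scores, Lambda, Sigma) :-
        min_list(Scores, Min),
        max_list(Scores, Max),
        Sigma is 0.5 + Lambda + (0.5 - Lambda) * (S - Min) / (Max - Min).

    % ?- sigma(7.0, [2.0, 7.0, 12.0], 0.05, Sigma).
    % Sigma = 0.775.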
  • 90. INTRODUCTION Background A peek into my work Conclusions ENCODING CONSTRAINTS WITH CONSTRAINT HANDLING RULES Constraints: alpha/2 and beta/2 ≈ visited states. Example: Genemark inconsistency rules alpha(Left1,Right1), alpha(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | fail. beta(Left1,Right1), alpha(Left2,Right2) <=> Left1 =< Left2, Right1 >= Right2 | fail. The most probable consistent path is found using PRISM's generic adaptation of the Viterbi algorithm Each step adds either an alpha or a beta (active) constraint Incremental Pruning: For each step we only apply constraints which may be transitively involved in rules with the active constraint
  • 91. INTRODUCTION Background A peek into my work Conclusions EXPERIMENTAL RESULTS Prediction on E. coli using a simplistic codon-frequency-based gene finder. Pruning using our global optimization approach (with all inconsistency rules) versus local heuristic rules².
    Method                #predictions  Sensitivity  Specificity  Time (seconds)
    initial set           10799         0.7625       0.2926       n/a
    Genemark rules        5823          0.7558       0.5379       1.4
    ECOGENE rules         4981          0.7148       0.5947       1.7
    global optimization   5222          0.7201       0.5714       75
    Sensitivity = fraction of known reference genes predicted. Specificity = fraction of predicted genes that are correct.
    ² Note that the results for the ECOGENE heuristic may vary depending on execution strategy: in the case of the above results, predictions with lower left position are considered first.
  • 92. INTRODUCTION Background A peek into my work Conclusions A MODEL FOR THE GENOME-WIDE SEQUENCE OF READING FRAMES We wish to incorporate gene reading frame constraints into gene finding. Divide and conquer two step approach to gene finding (again): 1. Gene prediction: A gene finder supplies a set of candidate predictions p1 . . . pn, called the initial set. 2. Pruning: The initial set is pruned according to gene finder confidence scores and the probabilities of adjacent gene reading frames. We call the pruned set the final set.
  • 93. INTRODUCTION Background A peek into my work Conclusions METHODOLOGY Gene predictions are sorted by stop codon position. Gene finder scores are discretized into symbolic values. A type of Hidden Markov Model which we call a delete-HMM: a state for each of the six possible reading frames and one delete state (state diagram: F1–F6 and delete)
  • 94. INTRODUCTION Background A peek into my work Conclusions MODEL Emission: finite set of symbols δ1 . . . δn corresponding to ranges of prediction scores Frame state transitions: relative frequency of "observed" adjacent gene reading frame pairs Transition to delete: P(δi | state = delete) = FPδi / FP (tunable) (state diagram: F1–F6 and delete)
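In PRISM terms, the state and emission spaces of such a delete-HMM could be declared roughly as below; the state and symbol names are illustrative, and the actual frameseq implementation may differ:

    % Six reading-frame states plus a delete state.
    values(trans(_), [f1, f2, f3, f4, f5, f6, delete]).

    % Discretized prediction-score symbols (the number of score ranges
    % is a modelling choice).
    values(emit(_), [delta1, delta2, delta3, delta4]).

Decoding such a model with Viterbi and discarding the predictions that end up in the delete state would then realise the pruning step.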
  • 95. INTRODUCTION Background A peek into my work Conclusions RESULTS (ROC curves: Sensitivity (TPR) against 1 − Specificity (FPR) as the threshold is varied, for frameseq trained on Escherichia, Salmonella, Legionella, Bacillus and Thermoplasma)
  • 96. INTRODUCTION Background A peek into my work Conclusions CONCLUSIONS 1. To what extent is it possible to use probabilistic logic programming for biological sequence analysis? 2. How can constraints relevant to the domain of biological sequence analysis be combined with probabilistic logic programming? 3. What are the limitations with regard to efficiency and how can these be dealt with?
  • 97. INTRODUCTION Background A peek into my work Conclusions TO WHAT EXTENT IS IT POSSIBLE TO USE PROBABILISTIC LOGIC PROGRAMMING FOR BIOLOGICAL SEQUENCE ANALYSIS? Commonly used models for biological analysis can be implemented with probabilistic logic programming Probabilistic logic programming is a powerful tool for experimenting with new kinds of models Efficiency is an issue, but with tabling optimizations it is efficient enough for many interesting problems Not merely a powerful abstraction, but a valuable and practical tool for biological sequence analysis
  • 98. INTRODUCTION Background A peek into my work Conclusions HOW CAN CONSTRAINTS RELEVANT TO THE DOMAIN OF BIOLOGICAL SEQUENCE ANALYSIS BE COMBINED WITH PROBABILISTIC LOGIC PROGRAMMING? Probabilistic logic programming is a suitable language for building higher level abstractions A variety of models from biological sequence analysis can be expressed, e.g., ZOO paper Constrained Hidden Markov Models Probabilistic Regular Expressions (Probabilistic) soft constraints and hard constraints Side-constraints or as part of model Constraints affect the search space
  • 99. INTRODUCTION Background A peek into my work Conclusions WHAT ARE THE LIMITATIONS WITH REGARD TO EFFICIENCY AND HOW CAN THESE BE DEALT WITH? Tabling issues Discriminating arguments (Christiansen and Gallagher) Tabling of structured data Tabling of constraint stores Constraints. Can be useful for reducing the search space Can make search space exponential Problem decomposition with Bayesian Annotation Networks Approximation Feasibility of inference with complex models Automatic parallelization – BANpipe