Logic-Statistic Models with Constraints
for Biological Sequence Analysis
           Christian Theil Have, <cth@ruc.dk>




  Programming, Logic and Intelligent Systems  plis.ruc.dk  CBIT  Roskilde University  Denmark
Motivation and outline
● Short motivation and introduction to biological sequence analysis
● Different ways of integrating constraints with probabilistic models

● Combining models with constraints
Biological sequence analysis
The basic problems:
● Alignment of biological sequences
● Phylogeny
● Gene prediction
● RNA secondary structure prediction
● Protein structure prediction
● Protein function prediction
We focus on gene prediction for now...
Biological sequence analysis
  Gene prediction: Predict genes and non-genes in a DNA sequence
● DNA is composed of nucleotides: A, T, G, C
● Genes are sequences of triplets of nucleotides, called codons
● Genes can occur in both strands in three different frames
● Specific start codons signal a possible beginning of a gene
● Specific stop codons definitively signal the end of a gene

AAT   ATA   GGC   ATA   GCG   CAC   AGA   CAG   ATA   AAA   ATT   ACA
GAG   TAC   ACA   ACA   TCC   ATG   AAA   CGC   ATT   AGC   ACC   ACC
ATT   ACC   ACC   ACC   ATC   ACC   ATT   ACC   ACA   GGT   AAC   GGT
GCG   GGC   TGA   AAT   ATA   GGC   ATA   GCG   CAC   AGA   CAG   ATA

● There are three possible genes in this sample in this frame (on this strand).
● In general, a DNA sequence has an exponential number of different gene compositions.
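The start and stop rules above can be sketched as a scan over one reading frame. This is only an illustration, not the author's gene finder: the name `candidate_genes`, the default start codon set {ATG} and the stop codon set {TAA, TAG, TGA} are assumptions, since the slides do not say which codons count as starts. The number of candidates found depends directly on that choice.

```python
# Assumed stop codons (the standard genetic code); not stated on the slides.
STOP_CODONS = frozenset({"TAA", "TAG", "TGA"})

# The sample sequence from the slide, without the codon spacing.
SAMPLE = (
    "AATATAGGCATAGCGCACAGACAGATAAAAATTACA"
    "GAGTACACAACATCCATGAAACGCATTAGCACCACC"
    "ATTACCACCACCATCACCATTACCACAGGTAACGGT"
    "GCGGGCTGAAATATAGGCATAGCGCACAGACAGATA"
)


def candidate_genes(seq, frame=0, start_codons=frozenset({"ATG"})):
    """Return (start, end) nucleotide offsets of candidate genes in one frame.

    Every start codon is paired with the next in-frame stop codon;
    `end` is exclusive and includes the stop codon itself.
    """
    genes = []
    open_starts = []  # start codons seen since the last stop codon
    for pos in range(frame, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if codon in start_codons:
            open_starts.append(pos)
        elif codon in STOP_CODONS:
            genes.extend((start, pos + 3) for start in open_starts)
            open_starts = []
    return genes


if __name__ == "__main__":
    for frame in range(3):
        print(frame, candidate_genes(SAMPLE, frame))
```

Scanning the other strand would require the reverse complement of the sequence, which this sketch leaves out.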
Biological sequence analysis, tools of the trade
● Statistical models (in order of expressive power)
    ● Hidden Markov Models
    ● Probabilistic Context Free Grammars
    ● Probabilistic Context Sensitive Grammars
         ● Stochastic Definite Clause Grammars
    ● All these can be modeled in PRISM
         ● Probabilistic extension of Prolog
● Problems:
    ● Computational complexity of inference
    ● Extremely large sequences
    ● Use of more expressive models infeasible
    ● Essential: Enforce the right independence assumptions
         ● Limit the number of conditional probabilities
Gene-finding with Hidden Markov Models
● Hidden Markov Models (HMMs) are commonly used for gene prediction
● A Hidden Markov Model is a quadruple <S, A, T, E>
    ● S is a set of states
    ● A is a set of emission symbols
    ● T is a set of transition probabilities
    ● E is a set of emission probabilities
● An observation is a sequence of emissions
● Transition and emission probabilities can be derived from sample observations through parameter estimation
● Decoding finds the most probable sequence of states corresponding to an observation
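As an illustration of parameter estimation, the sketch below derives T and E by relative frequency counting. It assumes the sample observations come with annotated state paths, which the slides do not state; without annotations an iterative method such as Baum-Welch would be needed instead. The function name `estimate` and the state labels N and C are hypothetical.

```python
from collections import Counter, defaultdict


def estimate(samples):
    """Estimate transition and emission probabilities by counting.

    `samples` is an iterable of (observation, state_path) pairs of equal
    length, e.g. ("AAGG", "NNCC"). Returns (T, E) as nested dicts.
    """
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for obs, path in samples:
        for symbol, state in zip(obs, path):
            emit_counts[state][symbol] += 1
        for prev_state, next_state in zip(path, path[1:]):
            trans_counts[prev_state][next_state] += 1

    def normalize(table):
        # Turn each row of counts into a probability distribution.
        return {
            key: {x: n / sum(row.values()) for x, n in row.items()}
            for key, row in table.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


if __name__ == "__main__":
    T, E = estimate([("AAGG", "NNCC")])
    print(T)
    print(E)
```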
Gene-finding with Hidden Markov Models
Example: Toy HMM for gene-finding.
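Decoding: The Viterbi algorithm
Finding the most probable path for a given sequence:

    argmax P(state sequence | observation)

Method:
● Incrementally keep track of the most probable path to a given state
● Dynamic programming (tabling in Prolog/PRISM)

[Figure: trellis with time steps (observation) on one axis and states on the other]

Time complexity: O(|states|² * |observation|) for a fully connected HMM; O(|states| * |observation|) when each state has a bounded number of predecessors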
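The method above can be sketched in a few lines. This is a generic textbook Viterbi in log space, not the PRISM implementation the slides refer to; the two-state toy model in the usage part (states N and C and all its probabilities) is invented for illustration only.

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to -infinity instead of failing."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observation `obs`."""
    # best[t][s]: log-probability of the best path that ends in s at time t
    best = [{s: _log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: best[t - 1][r] + _log(trans_p[r][s]))
            best[t][s] = (best[t - 1][prev] + _log(trans_p[prev][s])
                          + _log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical toy model: N = non-coding, C = coding (GC-rich).
    states = ("N", "C")
    start_p = {"N": 0.9, "C": 0.1}
    trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
    emit_p = {
        "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    }
    print("".join(viterbi("AATTGCGCGCGCTTAA", states, start_p, trans_p, emit_p)))
```

The inner `max` over all predecessor states is what makes the fully connected case quadratic in the number of states.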
Predicting is decoding
Decoding of an HMM may be considered as an optimization problem:
● We have a set of variables T0 .. Tn, one for each time step
● A set of constraints, C, on these variables:
    A state S is in the domain of Ti iff there is a state in the domain of Ti-1 from which there is a transition to S, and S has an emission corresponding to the emission at step i of the observation
● Goal: Maximize P(state sequence | observation), subject to C

[Figure: trellis with the variables T0, T1, T2, T3, ..., Tn along the time steps (observation) and the states on the other axis]

➔ Accomplished with the Viterbi algorithm using DP, in O(|states|² * |observation|) for a fully connected HMM, or O(|states| * |observation|) when each state has a bounded number of predecessors
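The domain constraint above can be sketched as a forward pass that prunes the domain of each Ti. The representation is an assumption made for this example: transitions are a set of (from, to) pairs, emissions map each state to the set of symbols it can emit, and any state that emits the first symbol is allowed at T0 (the slides do not fix an initial state).

```python
def prune_domains(obs, states, transitions, emissions):
    """Return the list of domains [D0, ..., Dn] for the variables T0 .. Tn.

    A state s is in Di iff it can emit obs[i] and, for i > 0, some state
    in D(i-1) has a transition to s.
    """
    domains = [{s for s in states if obs[0] in emissions[s]}]
    for symbol in obs[1:]:
        previous = domains[-1]
        domains.append({
            s for s in states
            if symbol in emissions[s]
            and any((r, s) in transitions for r in previous)
        })
    return domains


if __name__ == "__main__":
    # Hypothetical three-state structure: S -> C, C -> C, C -> E.
    states = {"S", "C", "E"}
    transitions = {("S", "C"), ("C", "C"), ("C", "E")}
    emissions = {"S": {"a"}, "C": {"a", "b"}, "E": {"b"}}
    print(prune_domains("aab", states, transitions, emissions))
```

An empty domain at any step means that no state sequence of the model can produce the observation.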
Constraints as model structure
● The structure of the HMM consists of
    ● states
    ● allowed transitions between these states
    ● possible emissions from these states
● The structure of the HMM defines a regular language
● Can model (only) regular languages, but...
● Not all regular languages can be modeled equally compactly
● Some regular languages require an exponential number of states

Consider a fully connected automaton with only N states, under the constraint:
    All-different: No state visited more than once
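One way to see the blow-up, assuming the all-different constraint is compiled into the model structure by letting every expanded state remember which original states have already been visited: each expanded state is a pair (current state, set of visited states), which gives N * 2^(N-1) pairs before any unreachable ones are removed. The helper name `expanded_states` is hypothetical.

```python
from itertools import combinations


def expanded_states(states):
    """Enumerate (current, visited) pairs; `visited` always contains `current`."""
    result = []
    for current in states:
        others = [s for s in states if s != current]
        for size in range(len(others) + 1):
            for subset in combinations(others, size):
                result.append((current, frozenset(subset) | {current}))
    return result


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, len(expanded_states(range(n))))
```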
Side-constraints
[Figure: Statistical Model combined with Side-Constraints]
Side-constraints:
● Constraints which are not embedded in the model.
● Delimit the allowed derivations.

Advantages
✔ Convenient method of expression
✔ Can express non-regular languages
✔ Does not affect the number of states

Problems
✗ Models with constraints can fail
    ✗ Probability mass disappears
✗ Complicates model inference
    ✗ ERF & Baum-Welch derive wrong distributions
    ✗ Decoding must adhere to constraints
✗ Constraint solving techniques needed
    ✗ NP-complete in the general case

Possible solutions
Parameter learning:
    ● Training with fgEM / failure-adjusted maximization
        ● Requires failure estimates
    ● Apply soft constraints, which do not fail
Inference:
    ● Incremental constraint solving
    ● Local constraints
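The "probability mass disappears" problem can be made concrete by brute force on a tiny model. The sketch below is not fgEM or failure-adjusted maximization; it only computes the mass that survives a side-constraint, whose complement is the failure probability such methods need to estimate. It ignores emissions and uses a plain Markov chain over states; all names and numbers are hypothetical.

```python
from itertools import product


def path_probability(path, start_p, trans_p):
    """Probability of a state path under a first-order Markov chain."""
    p = start_p[path[0]]
    for a, b in zip(path, path[1:]):
        p *= trans_p[a][b]
    return p


def surviving_mass(states, start_p, trans_p, length, constraint):
    """Total probability of the paths of the given length that satisfy `constraint`."""
    return sum(
        path_probability(path, start_p, trans_p)
        for path in product(states, repeat=length)
        if constraint(path)
    )


def all_different(path):
    """Side-constraint: no state is visited more than once."""
    return len(set(path)) == len(path)


if __name__ == "__main__":
    states = ("A", "B", "C")
    start_p = {s: 1 / 3 for s in states}
    trans_p = {r: {s: 1 / 3 for s in states} for r in states}
    mass = surviving_mass(states, start_p, trans_p, 3, all_different)
    print("surviving:", mass, "failure:", 1 - mass)
```

Enumerating all paths is exponential in the length, so this only works as a demonstration on very small models.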
Example: Fixing known genes
[Figure: a DNA sequence with a known gene marked; below it, a trellis from state S to state E with C and N states at each position, where the N states are absent at the positions of the known gene]

● Difficult/expensive to model with model structure
    ● HMM needs to do position counting => many states required!
● Easy to model with side-constraints
    ● Local constraint: Affects only a limited-size sequential set of variables
    ● Decoding possible in linear time
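Fixing a known gene amounts to shrinking the domain of the affected time steps to a single state, which Viterbi decoding can respect without any extra states. The sketch below is one possible reading of this idea, not the author's implementation; the `fixed` mapping from position to required state, the state names N and C, and all probabilities are assumptions made for the example.

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to -infinity instead of failing."""
    return math.log(p) if p > 0 else float("-inf")


def constrained_viterbi(obs, states, start_p, trans_p, emit_p, fixed):
    """Viterbi decoding where fixed[t], if present, is the only state allowed at t."""
    def allowed(t):
        return [fixed[t]] if t in fixed else list(states)

    best = [{s: _log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in allowed(0)}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in allowed(t):
            # Only states that survived at t - 1 are candidate predecessors.
            prev = max(best[t - 1],
                       key=lambda r: best[t - 1][r] + _log(trans_p[r][s]))
            best[t][s] = (best[t - 1][prev] + _log(trans_p[prev][s])
                          + _log(emit_p[s][obs[t]]))
            back[t][s] = prev
    path = [max(best[-1], key=best[-1].get)]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


if __name__ == "__main__":
    states = ("N", "C")
    start_p = {"N": 0.5, "C": 0.5}
    trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
    emit_p = {
        "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
        "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    }
    known = {t: "C" for t in range(4, 8)}  # hypothetical known gene positions
    print("".join(constrained_viterbi("AATTAATTAATT", states, start_p,
                                      trans_p, emit_p, known)))
```

The constraint only removes entries from the dynamic programming table, so the running time stays linear in the length of the observation.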
Combining models
Combine the predictions of several models to form more accurate predictions.

[Figure: the sets "A Genes" and "B Genes" predicted by gene predictor A and gene predictor B]

Obvious approaches:
● Union
    ● Many false positives
    ● Conflicts
● Intersection/majority voting
    ● Lowest common denominator
    ● Throws away the most interesting predictions


We need to know the strengths of individual models to define better constraints...
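The two obvious approaches, union and intersection, and the conflicts that the union produces, can be sketched with plain sets. Predictions are represented here as (start, end) pairs with an exclusive end; this representation and the coordinates in the usage part are hypothetical.

```python
def overlaps(g, h):
    """True if the half-open intervals g and h share at least one position."""
    return g[0] < h[1] and h[0] < g[1]


def combine(genes_a, genes_b):
    """Return the union and the intersection of two sets of predicted genes."""
    a, b = set(genes_a), set(genes_b)
    return a | b, a & b


def conflicts(genes):
    """Pairs of distinct predictions that overlap each other."""
    ordered = sorted(genes)
    return [
        (g, h)
        for i, g in enumerate(ordered)
        for h in ordered[i + 1:]
        if overlaps(g, h)
    ]


if __name__ == "__main__":
    predictor_a = {(0, 90), (200, 290)}
    predictor_b = {(0, 90), (250, 400)}
    union, intersection = combine(predictor_a, predictor_b)
    print("union:", sorted(union))
    print("intersection:", sorted(intersection))
    print("conflicts in union:", conflicts(union))
```

In this toy case the intersection keeps only the prediction both models agree on, while the union keeps everything, including two predictions that cannot both be right.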
Combining models with constraints
Issues to consider:
     ● Ability to combine both blackbox and whitebox models
     ● The nature of the combination constraints
           ● Uncertainty
     ● Lack of knowledge: what are the right constraints?
           ● Induction

Some possible ways to represent combination constraints being considered:
     ● Hard constraints
           ● Inability to handle uncertainty
     ● Factorial Hidden Markov Models
           ● Probability distribution defines how much to listen to each model
           ● Throws away information: What model contributed what?
           ● Expensive to train
     ● Bayesian networks
           ● Model probabilistic constraints
           ● We can model sequences with Dynamic Bayesian Networks
     ● Soft constraints
           ● Possibly good complement to probabilistic inference
     ● Co-training
           ● Use the models to train each other
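The difference between hard and soft constraints can be sketched as two ways of scoring a candidate prediction. This is only one common reading of "soft constraint" (a weighted penalty per violated constraint), not a formulation taken from the slides; the candidate names, log-probabilities and weights are invented for the example.

```python
import math


def hard_score(log_prob, violations):
    """A hard constraint rejects any candidate that violates something."""
    return -math.inf if violations else log_prob


def soft_score(log_prob, violations, weights):
    """A soft constraint subtracts a penalty for every violated constraint."""
    return log_prob - sum(weights[v] for v in violations)


if __name__ == "__main__":
    # (name, log-probability under the model, violated constraints)
    candidates = [("p1", -2.0, []), ("p2", -1.0, ["overlap"])]
    weights = {"overlap": 0.5}
    print(max(candidates, key=lambda c: hard_score(c[1], c[2]))[0])
    print(max(candidates, key=lambda c: soft_score(c[1], c[2], weights))[0])
```

With a small penalty the more probable candidate still wins despite its violation; raising the weight moves the behavior toward that of a hard constraint.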
Outlook
● Formulating biosequence problems in terms of constraints
● Integrating these constraints in probabilistic models

● Tradeoffs between constraint representations

    ● Finding the right balance...

● Combining models with constraints

● Inference and parameter estimation in mixed models

 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
Recommendation System using RAG Architecture
Recommendation System using RAG ArchitectureRecommendation System using RAG Architecture
Recommendation System using RAG Architecture
 

ICLP 2009 doctoral consortium presentation; Logic-Statistic Models with Constraints for Biological Sequence Analysis

  • 1. Logic-Statistic Models with Constraints for Biological Sequence Analysis Christian Theil Have, <cth@ruc.dk> Programming, Logic and Intelligent Systems  plis.ruc.dk  CBIT  Roskilde University  Denmark
  • 2. Motivation and outline ● Short motivation and introduction to biological sequence analysis ● Different ways of integrating constraints with probabilistic models ● Combining models with constraints
  • 3. Biological sequence analysis The basic problems: Alignment of biological sequences Phylogeny Gene prediction ● RNA secondary structure prediction ● Protein structure prediction ● Protein function prediction
  • 4. Biological sequence analysis The basic problems: Alignment of biological sequences Phylogeny ➔ Gene prediction ● RNA secondary structure prediction ● Protein structure prediction ● Protein function prediction We focus on gene prediction for now...
  • 5. Biological sequence analysis Gene prediction: Predict genes and non-genes in a DNA sequence ● DNA is composed of nucleotides: A, T, G, C AATATAGGCATAGCGCACAGACAGATAAAAATTACA GAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGT GCGGGCTGAAATATAGGCATAGCGCACAGACAGATA
  • 6. Biological sequence analysis Gene prediction: Predict genes and non-genes in a DNA sequence ● DNA is composed of nucleotides: A, T, G, C ● Genes are sequences of triplets of nucleotides, called codons AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 7. Biological sequence analysis Gene prediction: Predict genes and non-genes in a DNA sequence ● DNA is composed of nucleotides: A, T, G, C ● Genes are sequences of triplets of nucleotides, called codons ● Genes can occur on both strands, in three different frames AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 8. Biological sequence analysis Gene prediction: Predict genes and non-genes in a DNA sequence ● DNA is composed of nucleotides: A, T, G, C ● Genes are sequences of triplets of nucleotides, called codons ● Genes can occur on both strands, in three different frames ● Specific start codons signal a possible beginning of a gene AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 9. Biological sequence analysis Gene prediction: Predict genes and non-genes in a DNA sequence ● DNA is composed of nucleotides: A, T, G, C ● Genes are sequences of triplets of nucleotides, called codons ● Genes can occur on both strands, in three different frames ● Specific start codons signal a possible beginning of a gene ● Specific stop codons definitively signal the end of a gene AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA
  • 10. Biological sequence analysis Gene prediction: Predict genes and non-genes in a DNA sequence ● DNA is composed of nucleotides: A, T, G, C ● Genes are sequences of triplets of nucleotides, called codons ● Genes can occur on both strands, in three different frames ● Specific start codons signal a possible beginning of a gene ● Specific stop codons definitively signal the end of a gene AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA ● There are three possible genes in this sample in this frame (on this strand).
  • 14. Biological sequence analysis Gene prediction: Predict genes and non-genes in a DNA sequence ● DNA is composed of nucleotides: A, T, G, C ● Genes are sequences of triplets of nucleotides, called codons ● Genes can occur on both strands, in three different frames ● Specific start codons signal a possible beginning of a gene ● Specific stop codons definitively signal the end of a gene AAT ATA GGC ATA GCG CAC AGA CAG ATA AAA ATT ACA GAG TAC ACA ACA TCC ATG AAA CGC ATT AGC ACC ACC ATT ACC ACC ACC ATC ACC ATT ACC ACA GGT AAC GGT GCG GGC TGA AAT ATA GGC ATA GCG CAC AGA CAG ATA ● There are three possible genes in this sample in this frame (on this strand). ● In general, a DNA sequence admits an exponential number of different gene compositions.
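The start/stop rules on this slide can be sketched as a scan over one reading frame. This is a minimal Python sketch, not the method used in the talk: the function name `candidate_genes`, the toy sequence, and the choice of ATG as the only start codon are assumptions for illustration (TAA, TAG and TGA are the standard stop codons).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG"}  # assumption: other start codons could be added here


def candidate_genes(dna, frame=0, starts=START_CODONS, stops=STOP_CODONS):
    """Return (start, end) nucleotide offsets of candidate genes in one frame.

    Every start codon seen since the previous stop codon opens a candidate;
    the next stop codon in the same frame closes all of them.
    """
    candidates = []
    open_starts = []  # start offsets seen since the last stop codon
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if codon in stops:
            candidates.extend((s, pos + 3) for s in open_starts)
            open_starts = []
        elif codon in starts:
            open_starts.append(pos)
    return candidates


toy = "ATGAAAATGCCCTGAGGG"  # hypothetical toy sequence, not the slide's sample
print(candidate_genes(toy, frame=0))  # [(0, 15), (6, 15)]
print(candidate_genes(toy, frame=1))  # []
```

Each stop codon can close several competing candidates, and the same scan has to be repeated for the other frames and the other strand, which is one way to see why the number of consistent gene compositions grows so quickly.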
  • 15. Biological sequence analysis, tools of the trade ● Statistical models (in order of expressive power) ● Hidden Markov Models ● Probabilistic Context-Free Grammars ● Probabilistic Context-Sensitive Grammars ● Stochastic Definite Clause Grammars ● All of these can be modeled in PRISM ● A probabilistic extension of Prolog ● Problems: ● Computational complexity of inference ● Extremely long sequences ● Use of more expressive models is infeasible ● Essential: Enforce the right independence assumptions ● Limit the number of conditional probabilities
  • 16. Gene-finding with Hidden Markov Models Hidden Markov Models (HMMs) are commonly used for gene prediction. A Hidden Markov Model is a quadruple <S, A, T, E>: S is a set of states; A is a set of emission symbols; T is a set of transition probabilities; E is a set of emission probabilities. An observation is a sequence of emissions. Transition and emission probabilities can be derived from sample observations through parameter estimation. Decoding finds the most probable sequence of states corresponding to an observation.
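As an illustration of parameter estimation, the following Python sketch derives T and E by relative-frequency counting. It assumes fully labelled training data (each observation comes with its state sequence); when the states are hidden, an EM procedure such as Baum-Welch would be needed instead. The function name `estimate_hmm` and the state labels N (non-coding) and C (coding) are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_hmm(labelled):
    """Estimate transition and emission probabilities by counting.

    labelled: iterable of (states, symbols) pairs of equal length.
    Returns (T, E) as nested dicts: T[a][b] = P(b | a), E[s][x] = P(x | s).
    """
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for states, symbols in labelled:
        for s, x in zip(states, symbols):
            emit[s][x] += 1
        for a, b in zip(states, states[1:]):
            trans[a][b] += 1

    def normalise(table):
        # Turn each row of counts into a probability distribution.
        return {row: {k: c / sum(cnt.values()) for k, c in cnt.items()}
                for row, cnt in table.items()}

    return normalise(trans), normalise(emit)


T, E = estimate_hmm([("NNCC", "ATGC")])
print(T)  # {'N': {'N': 0.5, 'C': 0.5}, 'C': {'C': 1.0}}
print(E)  # {'N': {'A': 0.5, 'T': 0.5}, 'C': {'G': 0.5, 'C': 0.5}}
```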
  • 17. Gene-finding with Hidden Markov Models Example: Toy HMM for gene-finding.
  • 18. Decoding: The Viterbi algorithm Finding the most probable path for a given sequence: argmax P(state sequence | observation). Method: Incrementally keep track of the most probable path to a given state. Dynamic programming (tabling in Prolog/PRISM). [Figure: trellis of states against time steps (observation)] Time complexity O(|states| * |observation|) when each state has a bounded number of incoming transitions; O(|states|^2 * |observation|) for a fully connected model
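A compact Python sketch of the dynamic program follows. It is an illustration under stated assumptions, not the PRISM implementation: the slide's quadruple has no initial distribution, so the `start` argument is an added assumption, and the two-state toy model (N for non-coding, C for coding) is hypothetical. Scores are kept in log space to avoid underflow on long sequences.

```python
import math


def lg(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for obs and its log joint score."""
    best = {s: lg(start.get(s, 0)) + lg(emit[s].get(obs[0], 0)) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for x in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + lg(trans[r].get(s, 0)))
            best[s] = prev[p] + lg(trans[p].get(s, 0)) + lg(emit[s].get(x, 0))
            ptr[s] = p
        back.append(ptr)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):  # follow the back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1], best[last]


states = ["N", "C"]
start = {"N": 1.0}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"C": 0.9, "N": 0.1}}
emit = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
path, score = viterbi("AAGGGG", states, start, trans, emit)
print("".join(path))  # NNCCCC
```

The sketch maximizes the joint probability of path and observation, which has the same argmax as P(state sequence | observation) because the observation is fixed.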
  • 19. Predicting is decoding Decoding of an HMM may be considered as an optimization problem: ● We have a set of variables T0 .. Tn, one for each time step ● A set of constraints, C, on these variables: A state S is in the domain of Ti iff there is a state in the domain of Ti-1 from which there is a transition to S, and S can emit the symbol observed at time step i ● Goal: Maximize P(state sequence | observation), subject to C [Figure: trellis of states against time steps (observation), with variables T0 T1 T2 T3 ... Tn] ➔ Accomplished with the Viterbi algorithm in O(|states| * |observation|) using DP
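The domain constraint C can be sketched as a forward pass that computes the domain of each Ti from the domain of Ti-1. This is an illustrative Python sketch: the function name `propagate_domains`, the base case for T0 (any state that can emit the first symbol), and the three-state codon-level toy model are assumptions.

```python
def propagate_domains(obs, states, trans, emit):
    """Return the list of domains for T0..Tn implied by the constraint C."""
    # Assumed base case: T0 may be any state able to emit the first symbol.
    domains = [{s for s in states if emit[s].get(obs[0], 0) > 0}]
    for x in obs[1:]:
        prev = domains[-1]
        domains.append({
            s for s in states
            if emit[s].get(x, 0) > 0                        # S can emit x
            and any(trans[r].get(s, 0) > 0 for r in prev)   # reachable from Ti-1
        })
    return domains


# Hypothetical toy model over codons: S = start, C = coding, E = end.
toy_states = ["S", "C", "E"]
toy_trans = {"S": {"C": 1.0}, "C": {"C": 0.8, "E": 0.2}, "E": {}}
toy_emit = {"S": {"ATG": 1.0},
            "C": {"AAA": 0.5, "CCC": 0.5},
            "E": {"TGA": 1.0}}
print(propagate_domains(["ATG", "AAA", "CCC", "TGA"],
                        toy_states, toy_trans, toy_emit))
# [{'S'}, {'C'}, {'C'}, {'E'}]
```

An empty domain at any step means that no state sequence can explain the observation; Viterbi then optimizes only over the assignments that remain.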
  • 20. Constraints as model structure ● The structure of the HMM consists of ● states ● allowed transitions between these states ● possible emissions from these states ● The structure of the HMM defines a regular language ● Can model (only) regular languages, but... ● Not all regular languages can be modeled equally compactly ● Some regular languages require an exponential number of states Consider a fully-connected automaton with only N states: All-different: No state visited more than once
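To make the all-different example concrete, the Python sketch below enumerates the states of one straightforward encoding, in which each expanded state remembers the current state together with the set of states already visited. The function name `all_different_states` is hypothetical, and the construction is only one possible encoding, shown to illustrate the growth rather than to prove a lower bound.

```python
from itertools import combinations


def all_different_states(states):
    """Enumerate (current, visited) pairs; visited always contains current."""
    expanded = []
    for current in states:
        others = [s for s in states if s != current]
        for k in range(len(others) + 1):
            for seen in combinations(others, k):
                expanded.append((current, frozenset(seen) | {current}))
    return expanded


for n in (3, 10, 20):
    names = [f"q{i}" for i in range(n)]
    if n <= 10:
        print(n, len(all_different_states(names)))  # equals n * 2**(n - 1)
    else:
        print(n, n * 2 ** (n - 1))  # too many to enumerate comfortably
```

With this encoding, N original states expand to N * 2^(N-1) states, so even a modest N makes embedding the constraint in the model structure impractical.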
  • 21. Side-constraints Side-constraints: ● Constraints which are not embedded in the model. ● Delimit allowed derivations. [Figure: Statistical Model, Side-Constraints]
  • 22. Side-constraints Side-constraints: ● Constraints which are not embedded in the model. ● Delimit allowed derivations. [Figure: Statistical Model, Side-Constraints] Advantages ✔ Convenient method of expression ✔ Can express non-regular languages ✔ Does not affect the number of states
  • 23. Side-constraints Side-constraints: ● Constraints which are not embedded in the model. ● Delimit allowed derivations. [Figure: Statistical Model, Side-Constraints] Advantages ✔ Convenient method of expression ✔ Can express non-regular languages ✔ Does not affect the number of states Problems ✗ Models with constraints can fail ✗ Probability mass disappears ✗ Complicates model inference ✗ ERF & Baum-Welch derive wrong distributions ✗ Decoding must adhere to constraints ✗ Constraint solving techniques needed ✗ NP-complete in the general case
  • 24. Side-constraints Side-constraints: ● Constraints which are not embedded in the model. ● Delimit allowed derivations. [Figure: Statistical Model, Side-Constraints] Advantages ✔ Convenient method of expression ✔ Can express non-regular languages ✔ Does not affect the number of states Problems ✗ Models with constraints can fail ✗ Probability mass disappears ✗ Complicates model inference ✗ ERF & Baum-Welch derive wrong distributions ✗ Decoding must adhere to constraints ✗ Constraint solving techniques needed ✗ NP-complete in the general case Possible solutions Parameter learning: ● Training with fgEM / failure-adjusted maximization ● Requires failure estimates ● Apply soft constraints, which do not fail Inference: ● Incremental constraint-solving ● Local constraints
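The disappearing probability mass can be written out with elementary conditional probability. The notation is an assumption introduced here: $x$ ranges over derivations of the unconstrained model, $P_\theta$ is its distribution, and $[C(x)]$ is 1 when $x$ satisfies the side-constraints and 0 otherwise.

```latex
P_\theta(\text{success}) = \sum_{x} P_\theta(x)\,[C(x)] \le 1,
\qquad
P_\theta(\text{fail}) = 1 - P_\theta(\text{success}),
\qquad
P_\theta(x \mid C) = \frac{P_\theta(x)\,[C(x)]}{1 - P_\theta(\text{fail})}.
```

An estimator that ignores failed derivations in effect fits $P_\theta(x)$ instead of $P_\theta(x \mid C)$, which is why a failure-adjusted procedure needs an estimate of $P_\theta(\text{fail})$.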
  • 25. Example: Fixing known genes [Figure: DNA with a known gene, labeled with the states S C C C C C C C E N N N N] ● Difficult/expensive to express in the model structure ● HMM needs to do position counting => many states required! ● Easy to express with side-constraints ● Local constraint: Affects only a limited-size, sequential set of variables ● Decoding possible in linear time complexity
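One way to realize such a local constraint is to pin the variables of the known gene inside the Viterbi recursion, so that only the pinned state keeps a finite score at those positions. The Python sketch below illustrates that idea under assumptions: the `fixed` mapping (position to required state), the `start` distribution, and the two-state toy model are hypothetical, and returning `None` models a constraint that makes the model fail.

```python
import math


def lg(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def constrained_viterbi(obs, states, start, trans, emit, fixed):
    """Most probable path in which position t has state fixed[t] when given."""
    NEG = float("-inf")

    def allowed(t, s):
        return fixed.get(t, s) == s

    score = {s: lg(start.get(s, 0)) + lg(emit[s].get(obs[0], 0))
             if allowed(0, s) else NEG for s in states}
    back = []
    for t in range(1, len(obs)):
        prev, score, ptr = score, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + lg(trans[r].get(s, 0)))
            ptr[s] = p
            step = lg(trans[p].get(s, 0)) + lg(emit[s].get(obs[t], 0))
            score[s] = prev[p] + step if allowed(t, s) else NEG
        back.append(ptr)
    last = max(states, key=lambda s: score[s])
    if score[last] == NEG:
        return None  # no path satisfies the constraints
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


states = ["N", "C"]
start = {"N": 1.0}
trans = {"N": {"N": 0.9, "C": 0.1}, "C": {"C": 0.9, "N": 0.1}}
emit = {"N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
        "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}
print(constrained_viterbi("AAAAAA", states, start, trans, emit, {}))
print(constrained_viterbi("AAAAAA", states, start, trans, emit, {2: "C", 3: "C"}))
```

The amount of work per position is the same as in unconstrained Viterbi, so decoding stays linear in the length of the observation, and the model itself needs no extra states for position counting.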
  • 26. Combining models Combine the predictions of several models to form more accurate predictions. Obvious approaches: ● Union ● Many false positives ● Conflicts ● Intersection/majority voting ● Lowest common denominator ● Throws away the most interesting predictions [Figure: A genes and B genes, predicted by gene predictor A and gene predictor B]
  • 27. Combining models with constraints Combine the predictions of several models to form more accurate predictions. Obvious approaches: ● Union ● Many false positives ● Conflicts ● Intersection ● Lowest common denominator ● Throws away the most interesting predictions [Figure: A genes and B genes, predicted by gene predictor A and gene predictor B] We need to know the strengths of individual models to define better constraints...
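The two obvious approaches, and the conflicts that union produces, can be sketched on predictions represented as (start, end) intervals. The representation, the function names `overlaps` and `combine`, and the example coordinates are assumptions for illustration; real gene predictors would also report strand and frame.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def combine(pred_a, pred_b):
    """Combine two sets of predicted genes by union and by intersection."""
    a, b = set(pred_a), set(pred_b)
    # A conflict is a pair of different predictions covering common positions.
    conflicts = {(x, y) for x in a for y in b if x != y and overlaps(x, y)}
    return {"union": a | b, "intersection": a & b, "conflicts": conflicts}


genes_a = [(0, 90), (120, 300)]              # hypothetical output of predictor A
genes_b = [(0, 90), (150, 300), (400, 520)]  # hypothetical output of predictor B
result = combine(genes_a, genes_b)
print(sorted(result["union"]))         # [(0, 90), (120, 300), (150, 300), (400, 520)]
print(sorted(result["intersection"]))  # [(0, 90)]
print(sorted(result["conflicts"]))     # [((120, 300), (150, 300))]
```

In this toy case the union keeps two incompatible versions of the same gene, while the intersection drops both of them along with the prediction only B made, which is the trade-off the slide describes.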
  • 28. Combining models with constraints Issues to consider: ● Ability to combine both black-box and white-box models ● The nature of the combination constraints ● Uncertainty ● Lack of knowledge: what are the right constraints? ● Induction Some possible ways to represent combination constraints being considered: ● Hard constraints ● Inability to handle uncertainty ● Factorial Hidden Markov Models ● Probability distribution defines how much to listen to each model ● Throws away information: Which model contributed what? ● Expensive to train ● Bayesian networks ● Model probabilistic constraints ● We can model sequences with Dynamic Bayesian Networks ● Soft constraints ● Possibly a good complement to probabilistic inference ● Co-training ● Use the models to train each other
  • 29. Outlook ● Formulating biosequence problems in terms of constraints ● Integrating these constraints in probabilistic models ● Tradeoffs between constraint representations ● Finding the right balance... ● Combining models with constraints ● Inference and parameter estimation in mixed models