Detection of genetic motifs
 

Detection of Genetic Motifs. Oral Biology

© All Rights Reserved

    Detection of genetic motifs: Presentation Transcript

    • Detection of Genetic Motifs. Outline: promoters (biology); information theory; random projections; composite motif detection
    • Motifs and promoters
    • DNA sequence. [Figure: a DNA sequence with genes separated by junk DNA; a gene with UTR-5, exons e1 to e5, introns and UTR-3; promoter modules with TFBS, TATA box, INR, TSS and DSE, grouped into distal promoter, proximal promoter and core promoter.] INR = Initiator Region; DSE = DownStream region; TSS = Transcription Start Site.
    • TFBS (Transcription Factor Binding Site): short strings (12 to 20 nucleotides long), spread over up to 5 kb before the TSS, to which proteins bind. The structure of the string selects the protein that will bind, on the basis of Van der Waals interactions. Example of a TFBS: ACCGATTATCA.
    • Assembly of the promoter protein complex of transcription, 1st stage. [Figure: transcription factors (TF), TFIID and TBP; on the DNA, the transcription factor binding sites (TFBS), the TATA box, INR and TSS.]
    • Assembly of the promoter protein complex of transcription, 2nd stage. [Figure: TFIID and TBP on the TATA box of the core promoter, with INR and TSS on the DNA.]
    • Assembly of the promoter protein complex of transcription. [Figure: DNA looping brings the distal promoter/enhancer (TF3, TF4, TF5) and the proximal promoter (TF1, TF2) to the core promoter, where TBP, TFIIA, TFIIB, TFIID, TFIIE, TFIIH and RNA Poly II assemble around the TATA box, INR and TSS.]
    • Information based motif detection
    • The set of all TFBS (for a certain class of genes, organism or other). [Figure: the set is divided into known and unknown TFBSs; TFBSs with the same colour are correlated.]
    • Example. [Figure: two proteins of the promoter complex bound to the DNA; letters shown: A T G C T C A T C C T G.]
    • Entropy Given a probability distribution, we want a function representing the quantity of information stored in the distribution. We define the entropy (H) as: H = −∑ p (i ) log( p (i )) i or H = − ∫ p ( x) log( p ( x))dx For the sake of simplicity, we will use from now on the discrete definition.
    • Observed entropy The real distribution is usually unknown, but we can replace it by the observed distribution f(x). The resulting entropy is: H ( x) = −∑ f ( x) log( f ( x)) x For a multi dimensional probability distribution it is: H ( x, y ) = −∑ f ( x, y ) log( f ( x, y )) x, y = ∑ f ( x)∑ f ( y | x) log( f ( x, y )) x, y y
    • Mutual Information. $I(x,y) = \sum_{x,y} f(x,y) \log \frac{f(x,y)}{f(x) f(y)} = \sum_{x,y} f(x,y) [\log f(x,y) - \log f(x) - \log f(y)] = H(x) + H(y) - H(x,y)$. X and Y are strings of equal length; S = {A, C, G, T}; x and y belong to S. f(x,y) is the relative joint frequency of x,y in X and Y; f(x) is the relative frequency of x in X; f(y) is the relative frequency of y in Y.
    • Information divergence Given two distributions P and Q p( x) D( P, Q) = ∑ p ( x) log( Not for exam x q( x) ) = ∑ p ( x) log( p ( x)) − ∑ p ( x) log(q ( x)) x x
    • Example of calculation
      X = ACATTTACCATAGACAACTA
      Y = ACTTTTACGATGGAAACCTG
      Joint counts f(x,y) (rows: x, columns: y), with marginal counts f(x) and f(y):
               y=A  y=C  y=G  y=T | f(x)
        x=A     5    1    2    1  |  9
        x=C     1    3    1    0  |  5
        x=G     0    0    1    0  |  1
        x=T     0    0    0    5  |  5
        f(y)    6    4    4    6  | 20
      Divide by 20 to obtain relative frequencies.
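The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name `mutual_information` and the choice of base-2 logarithms are assumptions, since the slides do not fix the base of the logarithm.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) in bits, from the relative frequencies of aligned letters."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    n = len(x)
    joint = Counter(zip(x, y))      # counts of aligned pairs (x_i, y_i)
    fx, fy = Counter(x), Counter(y)  # marginal counts
    return sum((c / n) * log2((c / n) / ((fx[a] / n) * (fy[b] / n)))
               for (a, b), c in joint.items())

X = "ACATTTACCATAGACAACTA"
Y = "ACTTTTACGATGGAAACCTG"
joint_counts = Counter(zip(X, Y))   # reproduces the table of joint counts
print(joint_counts[("A", "A")], joint_counts[("T", "T")])
print(mutual_information(X, Y))
```

Only pairs that actually occur enter the sum, so zero cells of the table never reach the logarithm.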
    • Algorithm for finding new TFBS
      1) select a true TFBS (for example ACATTTACCATAGACAACT), from a data bank such as IUPAC or TRANSFAC, as a probe;
      2) shift the probe over a non-coding zone;
      3) evaluate step by step the mutual information I(P,S), where P is the probe and S is the current adjacent string on the sequence;
      4) select the positions (and the corresponding adjacent strings) for which I(P,S) > threshold;
      5) the strings starting from these positions are candidate TFBS, which need to be validated in vitro.
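Steps 2 to 4 can be sketched as a sliding-window scan. The names `scan` and `threshold` are illustrative, the logarithm base (2) is an assumption, and the threshold still has to be calibrated by hand.

```python
from collections import Counter
from math import log2

def mutual_information(p, s):
    """I(P;S) in bits for two aligned strings of equal length."""
    n = len(p)
    joint, fp, fs = Counter(zip(p, s)), Counter(p), Counter(s)
    return sum((c / n) * log2(c * n / (fp[a] * fs[b]))
               for (a, b), c in joint.items())

def scan(probe, sequence, threshold):
    """Yield (position, window, I) for every window with I(P,S) > threshold."""
    w = len(probe)
    for pos in range(len(sequence) - w + 1):
        window = sequence[pos:pos + w]
        score = mutual_information(probe, window)
        if score > threshold:
            yield pos, window, score
```

A window that is the letter-by-letter complement of the probe scores exactly as high as an identical copy, because the letters of the two strings are perfectly correlated; this is what makes the measure more permissive than a Hamming distance.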
    • Example. A test sequence built from variants of the probe, each followed by an unlabelled 50-letter string (apparently random filler):
      probe:           CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
      the same string: CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
        (unlabelled)   TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC
      1 error:         CACTGTGCGTCTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
        (unlabelled)   TACTATATAATTGATCGTGTTTTGGCCCGCTACTCATGAAGAGCCGTTCG
      2 errors:        CACTGTGCGTCTGTCATTCGTCATCCACCGTTGTTAGCACAGGGGTCGAT
        (unlabelled)   TAAGGGTATCCAAGTCTGAATACCCCCTGTATTACACTCTCGCTGTCAGT
      5 errors:        CACTGTGCGTCTGTCATTCGTCATCCACCATTGTTAGCATAGGGGTCGAC
        (unlabelled)   CATTATCGAGGACAGTGATTTGTGGAATGCTTGGCCTTAATACGTCTCTA
      C <--> G:        GAGTCTCGCAGTCTGATTGATGATGGAGGCTTCTTACGAGACCCCTGCAT
        (unlabelled)   TCAAAGTCAATTTACAGATTGGCGCCTCATGTAATAACGTTGGCATACTA
      C <-- G:         GAGTGTGGGAGTGTGATTGATGATGGAGGGTTGTTAGGAGAGGGGTGGAT
        (unlabelled)   CTTAAGATAACGGACACTTGATTGAGATACGCTCGACGCTATGTCCGGCT
      some C <-> G:    CAGTGTCCGACTGTCATTGATCATCCACGCTTGTTACCAGACGCGTCGAT
        (unlabelled)   ACTCGACATAAGGTTACAGCATGTGGAGTAATGCGGTCGCTAACTACGGG
      complementary:   GTGACACGCTGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA
        (unlabelled)   GCGTGGCGAGCTTAATCCCTGCTGCTCTGAGCAAGGAGGGCGTGTAGAAA
      compl + 1 error: GTGACACGCGGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA
        (unlabelled)   CAAGGTGACAGAGTATTGAGTGAATCTACAATGTTCGCAGTGCTTTGTCG
      compl + 2 errors: GTGACACGCTGACAGTAAGAAGTAGGTGGCAACAATCGTGTCCCCAGCTA
        (unlabelled)   GCGGTCGCCAATCGTCAAGGAAATGATAGGTCTGATTGGCGTGGCTTAAG
      compl + 5 errors: GTGACACGCTGACAGTAAGAAGTAGGTGGAAACAATCGTCTCCCCAGCTG
        (unlabelled)   GGCGCTAACGAATACTTCAAGGCCCGAAGGATTGGTGTTGATACTAGCCG
      1 letter more:   CACTGTGCGACTGTCATTCATCATCACACCGTTGTTAGCACAGGGGTCGAT
        (unlabelled)   CGTGACCAGATGTCCTTACTCTGAATGTTATGGTATTAAGTGAGGTAGTG
      2 letters more:  CACTGTGCGACTGTCATTCATCATCCACACCGTTGTTAGCACAGGGGTCGAT
        (unlabelled)   GCCCATGAACATACATTCATGACTGTTCAAGCGCACTGGACCACTCGTTC
      3 letters more:  CACTGTGCGACTGTCATTCATCATCCATCACCGTTGTTAGCACAGGGGTCGAT
    • Detected values for I(P,S). [Figure: values of I(P,S), on a vertical axis graduated 0.2, 0.4, 0.6, 0.8, 1, for each variant: the same string; 1 error; 2 errors; 5 errors; C and G exchanged; C becomes G; 4 C become G and 5 G become C; complementary; complementary + 1 error; complementary + 2 errors; complementary + 5 errors; 1 letter more; 2 letters more; 3 letters more.]
    • Conclusions: Use Mutual information as a tool to capture strings that are correlated to a true TFBS used as a probe. validate in vitro the candidates so obtained This is more flexible than the use of Hamming or Levenshtein distance, since correlated strings could be very distant one anotherDrawbacks:1. the method need a precise calibration of the threshold2. Does not include gaps
    • Random Projection Approach to Motif Finding
    • daf-19 Binding Sites in C. elegans (located between positions -150 and -1)
      che-2     GTTGTCATGGTGAC
      daf-19    GTTTCCATGGAAAC
      osm-1     GCTACCATGGCAAC
      osm-6     GTTACCATAGTAAC
      F02D8.3   GTTTCCATGGTAAC
    • The (l,d) Planted Motif Problem. Generate a random consensus sequence C of length l. Generate 20 instances, each differing from C by d random mutations. Plant one instance at a random position in each of N = 20 random sequences of length n = 600. Can you find the planted instances?
    • Planted Motifs. AGTTATCGCGGCACAGGCTCCTTCTTTATAGCCATGATAGCATCAACCTAACCCTAGATATGGGATTTTTGGGATATATCGCCCCTACACTGGATGACTGGATATACATGAACACGGTGGGAAAACCCTGAC Each instance differs from ACAGGATCA by 2 mutations; the remaining sequence is random.
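A small generator for this kind of test problem might look as follows. The function name and the default values l = 15 and d = 4 are assumptions for illustration (the slides only fix N = 20 and n = 600), and mutations are placed at d distinct positions.

```python
import random

def plant_motifs(l=15, d=4, N=20, n=600, seed=0):
    """Return (consensus, sequences, positions) for an (l,d) planted motif problem."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    consensus = "".join(rng.choice(alphabet) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        instance = list(consensus)
        for i in rng.sample(range(l), d):      # d distinct mutated positions
            instance[i] = rng.choice([c for c in alphabet if c != instance[i]])
        background = [rng.choice(alphabet) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = instance  # overwrite, length stays n
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions
```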
    • Random Projection Algorithm Buhler and Tompa (2001) Guiding principle: Some instances of a motif agree on a subset of positions. Use information from multiple motif instances to construct model.x(1) ...ccATCCGACca...x(2) ...ttATGAGGCtc... ATGCGTC =Mx(5) ...ctATAAGTCgc...x(8) ...tcATGTGACac... (7,2) motif
    • k-Projections Choose k positions in string of length l. Concatenate nucleotides at chosen k positions to form k-tuple. In l-dimensional Hamming space, projection onto k dimensional subspace. l = 15 k=7 P ATGGCATTCAGATTC TGCTGAT P = (2, 4, 5, 7, 11, 12, 13)
    • Random Projection Algorithm Choose a projection by Input sequence x(i): selecting k positions …TCAATGCACCTAT... uniformly at random. For each l-tuple in input sequences, hash into bucket based on letters at k selected positions. Recover motif from bucket containing TGCACCT multiple l-tuples. Bucket TGCT
    • Example  l = 7 (motif size) , k = 4 (projection size)  Choose projection (1,2,5,7)Input Sequence ...TAGACATCCGACTTGCCTTACTAC... Buckets ATCCGAC GCCTTAC ATGC GCTC
    • Hashing and Buckets Hash function h(x) obtained from k positions of projection. Buckets are labeled by values of h(x). Enriched buckets: contain at least s l-tuples, for some parameter s. ATGC GCTC CATC ATTC
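    • Frequency Matrix Model From Bucket. Bucket ATGC contains ATCCGAC, ATGAGGC, ATAAGTC and ATGTGAC, which give the frequency matrix W (rows: nucleotides, columns: positions 1 to 7):
        A   1    0    .25  .5   0    .5   0
        C   0    0    .25  .25  0    0    1
        G   0    0    .5   0    1    .25  0
        T   0    1    0    .25  0    .25  0
      The EM algorithm turns the frequency matrix W into a refined matrix W*.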
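The matrix W can be recomputed from the four l-mers of the bucket. The function name `frequency_matrix` is illustrative; no pseudocounts are added, which is an assumption that matches the zeros shown in the slide.

```python
def frequency_matrix(lmers, alphabet="ACGT"):
    """Column-wise relative frequencies of each letter over the given l-mers."""
    n = len(lmers)
    length = len(lmers[0])
    return {a: [sum(x[j] == a for x in lmers) / n for j in range(length)]
            for a in alphabet}

W = frequency_matrix(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
for letter, row in W.items():
    print(letter, row)
```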
    • Motif Refinement How do we recover the motif from the sequences in the enriched buckets? k nucleotides are known from hash value of bucket. Use information in other l-k positions as starting point for local refinement scheme, e.g. EM or Gibbs sampler ATCCGAC Local refinement algorithm ATGAGGC ATGCGTC ATAAGTC Candidate motif ATGTGAC ATGC
    • Expectation Maximization (EM). S = {x(1), …, x(N)}: set of input sequences. Given: W = an initial probabilistic motif model; P0 = background probability distribution. Find the value Wmax that maximizes the likelihood ratio $\Pr(S \mid W_{max}) / \Pr(S \mid P_0)$. EM is a local optimization scheme and requires a starting value W.
    • EM Motif Refinement For each bucket h containing more than s sequences, form weight matrix Wh Use EM algorithm with starting point Wh to obtain refined weight matrix model Wh* For each input sequence x(i), return l tuple y(i) which maximizes likelihood ratio: Pr(y(i) | Wh* )/ Pr(y(i) | P0). T = {y(1), y(2), …, y(N)} C(T ) = consensus string
    • What Is the Best Motif? Compute a score S for each motif: generate W, an initial PSSM, from the returned l-mers {y(1), y(2), …, y(N)}, then compute $\text{Score} = \sum_i \log \frac{P(y(i) \mid W)}{P(y(i) \mid P_0)}$. Return the motif with maximal score.
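The score can be sketched as follows, under two assumptions that the slides do not state: positions of the PSSM are treated as independent (so P(y | W) is a product over positions), and the background P0 defaults to a uniform distribution. The name `llr_score` is illustrative.

```python
from math import log

def llr_score(lmers, background=None, alphabet="ACGT"):
    """Sum over the l-mers of log(P(y | W) / P(y | P0)), W built from the l-mers."""
    if background is None:
        background = {a: 1 / len(alphabet) for a in alphabet}  # assumed uniform P0
    n, length = len(lmers), len(lmers[0])
    # W[j][a]: relative frequency of letter a at position j
    W = [{a: sum(y[j] == a for y in lmers) / n for a in alphabet}
         for j in range(length)]
    score = 0.0
    for y in lmers:
        for j, a in enumerate(y):
            score += log(W[j][a] / background[a])  # W[j][a] > 0: a occurs in column j
    return score
```

A set of identical l-mers reaches the maximum, N × l × log 4 under a uniform background, while a set in which every column is evenly mixed scores 0.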
    • Iterations Single iteration.  Choose a random k-projection.  Hash each l-mer x in input sequence into bucket labelled by h(x).  From each bucket B with at least s sequences, form weight matrix model, and perform EM/Gibbs sampler refinement.  Candidate motif is the best one found from refinement of all enriched buckets. Multiple iterations.  Repeat process for multiple projections.
    • Parameter Selection Projection size k Choose k small so several motif instances hash to same bucket. (k < l - d) Choose k large to avoid contamination by spurious l-mers. E > (N (n - l + 1))/ 4k Bucket threshold s: (s = 3, s = 4)
    • How Many Iterations? Planted bucket : bucket with hash value h(M), where M is motif. Choose m = number of iterations, such that Pr(planted bucket contains ≥ s sequences in at least one of m iterations) ≥ 0.95. Probability is readily computable since iterations form a sequence of independent trials.
    • Composite motifs detection. Outline: question; monad detection; MITRA
    • monad patterns Short contiguous strings Appear surprisingly many times( in a statistically significant way) S= AGTCTTGCTAGTCCGTAATATCCGGATAGAATAATGATC AGTC AGTC GTAGCATCGTACGTAGCTATCGATCTGAAGCTAGCAGC AAGATGTACTAGAGTCACGTAGCTAGTCATCTATACGAG AGTC AGTC TCGATGTAGTAGCTATCGATCGTAGCTAGAGTCCGTAGC TC AGTC AGCTAGTATCGTAGTGAGCAACATGAGTCCAGTGCATA AGTC GTCAGCTCATGAGTCGCATAGTC GTC AGTC P = AGTC
    • Introduction However, many of the actual regulatory signals are composite patterns.  Groups of monad patterns  Occur relatively near each other An example of a composite pattern is a dyad signal.
    • Composite Pattern S=ACGTAAATCACGTTGACTAGCTAGCACGAG CTAGCATAATCACACTTTGACGAGTCGACTGC ATGCATTGACGCAGTGCATTGCTAGCATGGG TAATCAAACGTTGGCTAGCTAGCATGCATCTG AGCATGCTAGCTACGTACTAGCGCGATAGTC TACTACAAATCACCCATTGCGAGCTACGTAG CTAGCTAGCTAGCTAGCTAGTGATGCATGCTA GAATCCGATCTTGCGATCGAT CP = AATCxxxxTTG
    • Introduction A possible approach is to find each part of the pattern separately and reconstruct the composite pattern. However, they often fail to output composite regulatory patterns consisting of weak monad parts.
    • Introduction A better approach would be to detect both parts of a composite pattern at the same time. Two steps in the proposed algorithm:  Preprocessing the sample creates a set of ‘virtual’ monads.  Apply an exhaustive monad discovery algorithm to the ’virtual’ monad problem. By preprocessing, original problem can be transformed into a larger monad discovery problem.
    • Monad Pattern Discovery Canonical pattern lmer 3mer: A C A  A continuous string of length l (l, d)-neighbourhood of an lmer P  all possible lmers with up to d mismatches as compared to P  The number of such lmers is : d l  i ∑  i 3   i =0   (l,d)-k patterns  Given a sequence S, find all lmers that occur with up to d mismatches at least k times in the sample  A variant : the sample is split into several sequence, to find all lmers, d mismatches, in at least k sequences
    • Pattern Driven Approach (PDA) (Pevzner, 2000). Examine all 4^l patterns of fixed length l in lexical order, compare each pattern to every l-mer in the sample, and return all (l,d)-k patterns. Waterman et al. (1984) and Galas et al. (1985) bypass the excessive time requirement of this approach: most of the 4^l patterns are not worth examining, since neither these patterns nor their neighbours appear in the sample. SDA was therefore designed to explore only the l-mers appearing in the sample and their neighbours.
    • Sample Driven Approach(SDA) First initializes a table of size 4l  Each table entry corresponds to a pattern SDA generate the (l, d)-neighbourhood of lmer  Incremented by a certain amount  After all lmers processed, SDA return all pattern whose table entries have scores exceed the threshold AAAAA 3 4l AAAAC 1 AAACC 2 … ..
    • Sample Driven Approach(SDA) Faster but requires a large 4l table still  not practical for long pattern in mid 1980  Not mainstream and no tool  (Today gigabytes of RAM memory available thus l increased without a memory-efficient algorithm)
    • SDA Iterations First, explore all neighbour of the first lmer from the sample. Second, explore all neighbour of the second lmer If an lmer P belongs to the neighbour of the lmers appearing at positions i1 ,…ik in the sample  info about P collected at iteration i1 ,…ik . So the Waterman approach update info about P k times  memory slot for P is occupied during the course time even if P is not “interesting” lmer Most of lmers explored are not interesting—waste memory slot
    • To improve SDA Better solution:  Collect info about all P at the same time  to remove the need to keep the info in memory  but require a new approach to navigate the space of all lmers MITRA runs faster than PDA and SDA, and uses only a fraction of the memory of the SDA
    • Pattern-finding vs. profile-based. Is the profile-based approach more biologically relevant for finding motifs in biological samples? This opinion is probably the reason why the Waterman algorithm has not been popular in the last decade. Sagot and colleagues were the first to rebut this opinion, and developed an efficient version of Waterman's algorithm.
    • Pattern-based vs. profile-based. Similarities: pattern-based methods also generate a profile, and every profile of length l corresponds to a pattern of length l formed by the most frequent nucleotide in every position. Pattern-driven methods are at least as good as profile-based ones, and even better on simulated samples with implanted patterns (though the profile-implantation model is somewhat limited). Today there is little evidence that profile-based methods perform any better on either biological or simulated samples.
    • MITRA: Mismatch Tree Algorithm
    • Mismatch Tree Algorithm (MITRA). MITRA uses a mismatch tree data structure to split the space of all possible patterns into disjoint subspaces that start with a given prefix, reducing pattern discovery to smaller sub-problems. MITRA also takes advantage of pairwise similarity between instances.
    • Splitting Pattern Space A pattern is called weak if it has less than k ( l ,d )-neighbours in the sample. A subspace is called weak if all patterns in this subspace are weak.
    • Splitting pattern space  A pattern is called weak if it has less than k ( l ,d )-neighbours in the sample.Sequence = AGTATCAGTTP= GTC Not weakl = 3 …. sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }d =1 ; k =2( l ,d )-neighbours in the sample = { GTA, ATC, GTT }
    • Splitting pattern space  A pattern is called weak if it has less than k ( l ,d )-neighbours in the sample.Sequence = AGTATCAGTTP= CAG weakl = 3 …. sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }d =1 ; k =2( l ,d )-neighbours in the sample = { CAG }
    • Splitting pattern space  A subspace is called weak if all patterns in this subspace are weak.Sequence = AGTATCAGTT•subspaceA = { AAA, AAT, AAC, AAG ………..AGG }•subspaceT = { TAA, TAT, TAC, TAG ………..TGG }•subspaceC = { CAA, CAT, CAC, CAG ………..CGG }•subspaceGG = { GGA, GGT, GGC, GGG}
    • Question Input:  S, l, d, k Output:  All l mers that occur with up to d mismatches at least k times in the sample.
    • Solution Naïve :  Test all l mer in the space  If occur with up to d mismatches at least k times in the sample than output this l mer.space = { AAA, AAT, AAC, AAG ………..AGG TAA, TAT, TAC, TAG ………..TGG CAA, CAT, CAC, CAG ………..CGG GAA, GAT, GAC, GAG ………..GGG }sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
    • Splitting pattern space if we are looking for patterns of length l we would first split the space of all l mers into 4 disjoint subspaces.  Subspace of all l mers starting with A,  Subspace of all l mers starting with T,  Subspace of all l mers starting with C,  Subspace of all l mers starting with G,
    • Splitting pattern space if we are looking for patterns of length l we would first split the space of all l mers into 4 disjoint subspaces.Space: A* T* C* G* SubspaceA
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.Space: A* Can’t rule out
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.Space: AA* AC* AT* AG* Can rule out
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.Space: Can’t rule out
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.  If we can rule out this subspace contains such a pattern  we stop searching in this subspace;  release the memory slot;  If we can’t rule out this subspace contains such a pattern  we split this subspace again on the next symbol;  and repeat;
    • Mismatch tree data structure A mismatch tree is a rooted tree where each internal node has 4 branches labeled with a symbol in {A,C,T,G} The maximum depth of the tree is l. Each node in the mismatch tree corresponds to the subspace of patterns P with a fixed prefix. Each node contains pointers to all l mers instances from the sample that are within d mismatches from a pattern p.
    • Mismatch tree data structure MITRA start with examining the root node of the mismatch tree that corresponds to the space of all patterns.  When examining a node, MITRA tries to prove that it corresponds to a weak subspace.  If (we can’t prove it)  we expand the node’s children and examine each of them.  Whenever we reach a node corresponding to a weak subspace, we backtrack. The intuition is that many of the nodes correspond to weak subspaces and can be rule out. This allows us to avoid searching much of the pattern space.
    • Mismatch tree data structure If we reach depth l and the number of instances is not less than k.  the l mer corresponding to the path from the root to the leaf .  the pointers from this node correspond to the instances of this pattern.
    • Example Consider a very simple example of finding the pattern of length 4 with up to 1 mismatch and at least 2 times in the sample S = Not for exam “AGTATCAGTT”. The substrings (4mers) in S are { AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT }
    • [Figure: mismatch tree for the example, expanded along several branches. Each node lists the 4-mers of the sample that are still within d = 1 mismatch of the node's prefix, with their mismatch counts and the number k of remaining instances. The root (all counts 0) and node A (counts 0 1 1 0 1 1 0) keep all 7 instances; node AA (counts 1 2 1 1 2 1 1) keeps k = 5; deeper nodes keep k = 3, 1 or 0.]
    • [Figure: the branch through node AG (counts 0 2 2 1 2 2 0, k = 3) and node AGT (k = 2), whose four leaves each keep k = 2 instances. Output from this branch: AGTA, AGTC, AGTG, AGTT.]
    • Overall complexity. Space = O(l^2 × |S|). Time = O(4^l × |S|), since the number of nodes is O(4^l) and the number of comparisons in each node is O(|S|). [Figure: the tree has depth l; each node on the current path stores O(|S|) instances, each annotated with O(l) information.]
    • Take a Closer Look In mismatch tree algorithm, we can not start ruling out a node until traverse to depth . d +1 0 1 1 0 1 1 0 0 0 0 0 0 0 0 A G T A T C A A G T A T C A G T A T C A G T A T C A G T A G T A T C A G T A T C A G T A T C A G T T k =7 A T C A G T T 1 2 1 1 2 1 1 A A G T A T C A G T A T C A G k=5 T A T C A G T A T C A G T T
    • MITRA Graph Information about pairwise similarities between instances of the pattern can significantly the sample-driven approach. speed up The graph that is constructed to model this pairwise similarity is called MITRA-Graph
    • MITRA Graph Given a pattern P and sample S we can construct a graph G(P, S) where each vertex is an lmer in the sample and there is an edge connecting two lmers if P is within d mismatches from both lmers. S = TAACA P = TAC AAC (d=1) TAA ACA (d=1) (d=3)
    • MITRA Graph For an (l,d) – k pattern P the corresponding graph contains a clique of size k. S = TAACA P = AAA AAC (d=1) TAA ACA (d=1) (d=3)
    • MITRA Graph Given a set of patterns P and a sample S, define a graph G(P , S) whose edge set is a union of edge sets of graphs G(P, S) for P∈P . Each vertex of G(P , S) is an lmer in the sample and there is an edge connecting two lmers if there is a pattern P∈P that is within d mismatches from both lmers. If for a subspace of patterns we can rule out an existence of a clique of size k, then the subspace has no (l,d)-k
    • The WINNOWER Algorithm The WINNOWER algorithm by Pevzner and Sze (2000) constructs the following graph: Each lmer in the sample is a vertex, and an edge connects two vertices if the corresponding lmers have less than d mismatches. Instances of a (l,d)-k pattern form a clique of size k in this graph.
    • The WINNOWER Algorithm (con’t) Since clique are difficult to find, WINNOWER takes the approach of trying to remove edges that do not corresponding to a clique. k=4
    • Improvements by MITRA-Graph. 1. Construct a graph at each node in the mismatch tree. [Figure: node A of the example tree, with mismatch counts 0 1 1 0 1 1 0.]
    • Improvements by MITRA-Graph. 2. Remove edges which are not part of a clique.
    • Improvements by MITRA-Graph. 3. If no potential clique remains, rule out the subspace corresponding to the node and backtrack.
    • Improvements by MITRA-Graph. 4. If we cannot rule out a clique, split the subspace of patterns and examine the child nodes.
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER. At each node of the tree, we remove edges by computing the degree of each vertex. If the degree of a vertex is less than k-1, we can remove all edges incident to it since we know it is not part of a clique. We repeat this procedure until we cannot remove any more edges. If the number of edges remaining is less than the minimum number of edges in a clique, we can rule out the existence of a clique and backtrack.
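The pruning rule can be sketched as a fixed-point loop over an edge set. A clique of size k has k(k-1)/2 edges, which gives the final test. The function names are illustrative, and edges are plain pairs of vertex identifiers.

```python
def prune(edges, k):
    """Repeatedly drop edges incident to a vertex of degree below k - 1."""
    edges = set(edges)
    while True:
        degree = {}
        for u, v in edges:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        kept = {(u, v) for u, v in edges
                if degree[u] >= k - 1 and degree[v] >= k - 1}
        if kept == edges:               # nothing more to remove
            return edges
        edges = kept

def may_contain_clique(edges, k):
    """False when the pruned graph has too few edges to hold a clique of size k."""
    return len(prune(edges, k)) >= k * (k - 1) // 2

triangle_with_tail = {(0, 1), (0, 2), (1, 2), (2, 3)}
print(prune(triangle_with_tail, 3))     # the tail edge (2, 3) is removed
```

Passing this test does not prove that a clique exists; it only means the subspace cannot be ruled out yet.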
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER. The problem with this approach is how to efficiently construct the graph at each node in the mismatch tree. Instead of constructing the graph from scratch, we construct it based on the graph at the parent node. Consider an edge connecting two l-mers: the first l-mer matches the prefix of the pattern subspace with d1 mismatches, and the second l-mer matches it with d2 mismatches.
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER. Denote by m the number of mismatches between the tails of the first and the second l-mers. The edge between these two l-mers exists in the pattern subspace if and only if d1 <= d, d2 <= d and d1 + d2 + m <= 2d. [Figure: the prefix of the pattern subspace aligned with the first l-mer and the second l-mer.]
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER (cont'd). In the root node, since d1 = d2 = 0, an edge exists only if m <= 2d, which gives the same graph as WINNOWER. As we move down the tree, the condition becomes much stronger than in WINNOWER. We can compute the edges of a node based on the edges of the node's parent by keeping track of the quantities d1, d2 and m for each edge.
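The edge condition is easy to state as a function of the prefix and the two l-mers. This sketch recomputes d1, d2 and m from scratch instead of updating them incrementally from the parent node; the name `edge_exists` is illustrative.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def edge_exists(prefix, lmer1, lmer2, d):
    """Edge test in the subspace of patterns starting with prefix."""
    p = len(prefix)
    d1 = hamming(prefix, lmer1[:p])     # mismatches of the first l-mer with the prefix
    d2 = hamming(prefix, lmer2[:p])     # mismatches of the second l-mer with the prefix
    m = hamming(lmer1[p:], lmer2[p:])   # mismatches between the two tails
    return d1 <= d and d2 <= d and d1 + d2 + m <= 2 * d

print(edge_exists("", "AGTA", "ATCA", 1))    # True: at the root only m <= 2d matters
print(edge_exists("AG", "AGTA", "AGTT", 1))  # True
print(edge_exists("GG", "AGTA", "AGTT", 1))  # False: d1 + d2 + m = 3 > 2
```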
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER. To summarize, the MITRA-Graph algorithm works as follows:
        • We first compute the set of edges at the root node by performing pairwise comparisons between all l-mers (at the root d1 = d2 = 0).
        • We traverse the tree in depth-first order, passing on the valid edges and keeping track of the quantities d1, d2 and m for each of them.
        • At each node, we prune the graph by eliminating any edges incident to vertices that have degree less than k-1.
        • If there are fewer than the minimum number of edges for a clique, we backtrack.
        • If we reach a leaf of the tree (depth l), then we output the corresponding pattern.
    • Discovering dyad signals
    • DISCOVERING DYNAD SIGNALS For dyad signals, we are interested in discovering two monads that occur a certain length apart  We use the notation (l1-(s1,s2)-l2,d)-k pattern to denote a dyad signal l1 s l2 l1 s l2 l1 s l2
    • DISCOVERING DYNAD SIGNALS The MITRA-Dyad algorithm casts the dyad discovery problem into a monad discovery problem by preprocessing the input and creating a “virtual” sample to solve the (l1+l2,d)-k monad pattern discovery problem in this sample  For each l1mer in the sample and for each s in [s1,s2], we create an l1+l2 mer which is the l1mer concatenated with the l2 mer upstream s nucleotides of the l1mer.
    • DISCOVERING DYAD SIGNALS. The number of elements in the "virtual" sample will be approximately (s2-s1+1) times larger. An (l1+l2,d)-k pattern in the "virtual" sample will correspond to an (l1-(s1,s2)-l2,d)-k pattern in the original sample, and we can easily map the solution from the monad problem to the dyad one. An important feature of MITRA-Dyad is its ability to search for long patterns.
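The preprocessing step can be sketched as follows. The function name `virtual_sample` is illustrative, and one orientation is assumed for the sketch: the l2-mer is taken s nucleotides after the end of the l1-mer in reading order; the opposite orientation would only change the index arithmetic.

```python
def virtual_sample(sequence, l1, l2, s1, s2):
    """Build the 'virtual' (l1+l2)-mers: each l1-mer joined with an l2-mer s letters away."""
    virtual = []
    for i in range(len(sequence) - l1 + 1):
        first = sequence[i:i + l1]
        for s in range(s1, s2 + 1):
            j = i + l1 + s              # assumed: the second part follows the gap
            if j + l2 <= len(sequence):
                virtual.append(first + sequence[j:j + l2])
    return virtual

v = virtual_sample("CCAATCGGGGTTGCC", l1=4, l2=3, s1=3, s2=5)
print("AATCTTG" in v)  # True: AATC, a gap of 4 letters, then TTG
```

Any monad discovery routine can then be run on the virtual l-mers with l = l1 + l2.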
    • DISCOVERING DYNAD SIGNALS If the range s1-s2+1 of acceptable distances between monad parts in a composite pattern is large, the MITRA-Dyad algorithm becomes inefficient  A simple approach to detect these patterns is to generate a long ranked list of candidate monad patterns using MITRA.  Then check each occurrence of each pair from the list to see if they occur within the acceptable distance.