Transcription
 

RNA synthesis

    Presentation Transcript

    • Detection of Genetic Motifs (Bioinfo talk 11). Outline: Promoters (biology); Information theory; Random projections; Composed motif detection
    • Motifs and promoters
    • DNA sequence (figure): genes separated by junk DNA; each gene runs from the 5' UTR through exons e1 to e5, separated by introns, to the 3' UTR. Promoter (figure): distal promoter, proximal promoter, and core promoter; promoter modules contain TFBSs, and the core promoter carries the TATA box, INR, TSS, and DSE. INR = Initiator Region; DSE = DownStream region; TSS = Transcription Start Site
    • TFBS (Transcription Factor Binding Site): short strings (12 to 20 nucleotides long), spread over up to 5 kb before the TSS, to which a protein is going to bind. The string structure selects the protein that will bind, on the basis of Van der Waals interactions. Example of a transcription factor binding site: ACCGATTATCA
    • Assembly of the promoter protein complex of transcription, 1st stage (figure labels: transcription factors (TF), TFIID, TBP, TSS, TATA box, INR, DNA, transcription factor binding sites (TFBS))
    • Assembly of the promoter protein complex of transcription, 2nd stage (figure labels: TFIID, TBP, TATA, TSS, INR, core promoter, DNA)
    • Assembly of the promoter protein complex of transcription, DNA looping (figure labels: distal promoter/enhancer, proximal promoter, core promoter, TF1 to TF5, TBP, TATA, TFIIA, TFIIB, TFIID, TFIIE, TFIIH, RNA Poly II, INR, TSS, DNA)
    • Information based motif detection
    • The set of all TFBS (for a certain class of genes, organism, or other) splits into known and unknown TFBSs. TFBSs with the same colour are correlated.
    • Example (figure labels: Protein of the Promoter complex, twice; sequence A T G C T C A T C C T G)
    • Entropy Given a probability distribution, we want a function representing the quantity of information stored in the distribution. We define the entropy (H) as: H = −∑ p (i ) log( p (i )) i or H = − ∫ p ( x) log( p ( x))dx For the sake of simplicity, we will use from now on the discrete definition.
    • Observed entropy The real distribution is usually unknown, but we can replace it by the observed distribution f(x). The resulting entropy is: H ( x) = −∑ f ( x) log( f ( x)) x For a multi dimensional probability distribution it is: H ( x, y ) = −∑ f ( x, y ) log( f ( x, y )) x, y = ∑ f ( x)∑ f ( y | x) log( f ( x, y )) x, y y
    • Mutual Information f ( x, y )I ( x. y ) = −∑ f ( x, y ) log( ) x, y f ( x) f ( y )= −∑ f ( x, y )[log( f ( x, y )) − log( f ( x)) − log( f ( y ))] x, y= H ( x, y ) − H ( x ) − H ( y ) X and Y are strings of equal length, S={A, C, G, T}, x and y belong to S f(x,y) is the relative joint frequency of x,y in X and Y f(x) is the relative frequency of x in X f(y) is the relative frequency of y in Y
    • Information divergence Given two distributions P and Q p( x) D( P, Q) = ∑ p ( x) log( Not for exam x q( x) ) = ∑ p ( x) log( p ( x)) − ∑ p ( x) log(q ( x)) x x
    • Example of calculation
      X = ACATTTACCATAGACAACTA
      Y = ACTTTTACGATGGAAACCTG
      Joint counts of (x, y), rows x from X, columns y from Y:
               y=A  y=C  y=G  y=T | f(x)
        x=A     5    1    2    1  |  9
        x=C     1    3    1    0  |  5
        x=G     0    0    1    0  |  1
        x=T     0    0    0    5  |  5
        f(y)    6    4    4    6  | 20
      Divide by 20 to obtain relative frequencies.
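The calculation on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the speaker's code: the function names are illustrative, the logarithm base (2, giving bits) is an assumption since the slides do not state one, and mutual information is taken in its standard non-negative form I(X,Y) = H(X) + H(Y) - H(X,Y).

```python
from collections import Counter
from math import log2

def entropy(freqs):
    """Shannon entropy (in bits) of a dict mapping outcomes to relative frequencies."""
    return -sum(p * log2(p) for p in freqs.values() if p > 0)

def mutual_information(x, y):
    """I(X, Y) = H(X) + H(Y) - H(X, Y), estimated from position-wise letter pairs."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    n = len(x)
    fx = {a: c / n for a, c in Counter(x).items()}
    fy = {b: c / n for b, c in Counter(y).items()}
    fxy = {ab: c / n for ab, c in Counter(zip(x, y)).items()}
    return entropy(fx) + entropy(fy) - entropy(fxy)

X = "ACATTTACCATAGACAACTA"
Y = "ACTTTTACGATGGAAACCTG"

# Joint counts of aligned letter pairs, i.e. the table on the slide before dividing by 20.
joint_counts = Counter(zip(X, Y))
for a in "ACGT":
    print(a, [joint_counts[(a, b)] for b in "ACGT"])

print("I(X, Y) in bits:", mutual_information(X, Y))
```

Comparing a string with itself gives I(X,X) = H(X), which is a quick sanity check on the implementation.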
    • Algorithm for finding new TFBS
      1) Select a true TFBS (for example ACATTTACCATAGACAACT), from a data bank such as IUPAC or TRANSFAC, as a probe.
      2) Shift the probe over a non-coding zone.
      3) Evaluate step by step the mutual information I(P,S), where P is the probe and S is the current adjacent string on the sequence.
      4) Select the positions (and the corresponding adjacent strings) for which I(P,S) > threshold.
      5) The strings starting from these positions are candidate TFBSs, which need to be validated in vitro.
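Steps 2 to 4 amount to a sliding-window scan. The sketch below is one possible reading, under stated assumptions: `scan` and `mutual_information` are hypothetical names, the logarithm is base 2, and the threshold value is arbitrary (the slides give none).

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Sum over letter pairs of f(x,y) * log2(f(x,y) / (f(x) f(y))), from raw counts."""
    n = len(x)
    fx, fy, fxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (fx[a] * fy[b])) for (a, b), c in fxy.items())

def scan(probe, sequence, threshold):
    """Return (position, window, score) for each window whose score exceeds threshold."""
    width = len(probe)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = mutual_information(probe, window)
        if score > threshold:
            hits.append((start, window, score))
    return hits

# Probe and one filler string taken from the example slide; the threshold is an assumption.
probe = "CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT"
filler = "TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC"
hits = scan(probe, filler + probe + filler, threshold=1.0)
for start, window, score in hits:
    print(start, window, round(score, 3))
```

Because the score depends only on how letters co-occur position by position, a window in which letters are consistently substituted (for example the complementary string) scores as high as an exact copy, which is the flexibility the conclusions slide refers to.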
    • Example: variants of the probe planted in a sequence, each followed by random filler
      probe: CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT
      • the same string: CACTGTGCGACTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT, then filler TTCGGAACCGGCCTTAAGACGGTGAAGGCGCTACTCATTTAATTGTGTTC
      • 1 error: CACTGTGCGTCTGTCATTCATCATCCACCGTTGTTAGCACAGGGGTCGAT, then filler TACTATATAATTGATCGTGTTTTGGCCCGCTACTCATGAAGAGCCGTTCG
      • 2 errors: CACTGTGCGTCTGTCATTCGTCATCCACCGTTGTTAGCACAGGGGTCGAT, then filler TAAGGGTATCCAAGTCTGAATACCCCCTGTATTACACTCTCGCTGTCAGT
      • 5 errors: CACTGTGCGTCTGTCATTCGTCATCCACCATTGTTAGCATAGGGGTCGAC, then filler CATTATCGAGGACAGTGATTTGTGGAATGCTTGGCCTTAATACGTCTCTA
      • C <-> G: GAGTCTCGCAGTCTGATTGATGATGGAGGCTTCTTACGAGACCCCTGCAT, then filler TCAAAGTCAATTTACAGATTGGCGCCTCATGTAATAACGTTGGCATACTA
      • C <-- G: GAGTGTGGGAGTGTGATTGATGATGGAGGGTTGTTAGGAGAGGGGTGGAT, then filler CTTAAGATAACGGACACTTGATTGAGATACGCTCGACGCTATGTCCGGCT
      • some C <-> G: CAGTGTCCGACTGTCATTGATCATCCACGCTTGTTACCAGACGCGTCGAT, then filler ACTCGACATAAGGTTACAGCATGTGGAGTAATGCGGTCGCTAACTACGGG
      • complementary: GTGACACGCTGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA, then filler GCGTGGCGAGCTTAATCCCTGCTGCTCTGAGCAAGGAGGGCGTGTAGAAA
      • complementary + 1 error: GTGACACGCGGACAGTAAGTAGTAGGTGGCAACAATCGTGTCCCCAGCTA, then filler CAAGGTGACAGAGTATTGAGTGAATCTACAATGTTCGCAGTGCTTTGTCG
      • complementary + 2 errors: GTGACACGCTGACAGTAAGAAGTAGGTGGCAACAATCGTGTCCCCAGCTA, then filler GCGGTCGCCAATCGTCAAGGAAATGATAGGTCTGATTGGCGTGGCTTAAG
      • complementary + 5 errors: GTGACACGCTGACAGTAAGAAGTAGGTGGAAACAATCGTCTCCCCAGCTG, then filler GGCGCTAACGAATACTTCAAGGCCCGAAGGATTGGTGTTGATACTAGCCG
      • 1 letter more: CACTGTGCGACTGTCATTCATCATCACACCGTTGTTAGCACAGGGGTCGAT, then filler CGTGACCAGATGTCCTTACTCTGAATGTTATGGTATTAAGTGAGGTAGTG
      • 2 letters more: CACTGTGCGACTGTCATTCATCATCCACACCGTTGTTAGCACAGGGGTCGAT, then filler GCCCATGAACATACATTCATGACTGTTCAAGCGCACTGGACCACTCGTTC
      • 3 letters more: CACTGTGCGACTGTCATTCATCATCCATCACCGTTGTTAGCACAGGGGTCGAT
    • Detected values for I(P,S) (bar chart, vertical axis from 0.2 to 1; bars for: the same string; C and G exchanged; 4 C become G and 5 G become C; complementary; 1 error; complementary + 1 error; 2 errors; complementary + 2 errors; C becomes G; complementary + 5 errors; 5 errors; 1 letter more; 2 letters more; 3 letters more)
    • Conclusions: Use mutual information as a tool to capture strings that are correlated to a true TFBS used as a probe, then validate the candidates so obtained in vitro. This is more flexible than the use of Hamming or Levenshtein distance, since correlated strings can be very distant from one another. Drawbacks: 1. The method needs a precise calibration of the threshold. 2. It does not include gaps.
    • Random Projection Approach to Motif Finding
    • daf-19 Binding Sites in C. elegans (figure): sites GTTGTCATGGTGAC, GTTTCCATGGAAAC, GCTACCATGGCAAC, GTTACCATAGTAAC, GTTTCCATGGTAAC; genes che-2, daf-19, osm-1, osm-6, F02D8.3; position axis from -150 to -1
    • The (l,d) Planted Motif Problem: Generate a random consensus sequence C of length l. Generate 20 instances, each differing from C by d random mutations. Plant one instance at a random position in each of N = 20 random sequences of length n = 600. Can you find the planted instances?
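A problem instance of this kind can be generated as follows. This is a sketch under assumptions: the default values l = 15 and d = 4 are illustrative (the slide fixes only N = 20 and n = 600), each instance gets exactly d substitutions at distinct positions, and all names are hypothetical.

```python
import random

ALPHABET = "ACGT"

def mutate(consensus, d, rng):
    """Return a copy of consensus with exactly d distinct positions changed to another base."""
    letters = list(consensus)
    for pos in rng.sample(range(len(letters)), d):
        letters[pos] = rng.choice([b for b in ALPHABET if b != letters[pos]])
    return "".join(letters)

def planted_motif_instance(l=15, d=4, N=20, n=600, seed=0):
    """Build N random sequences of length n, each hiding one mutated copy of a random consensus."""
    rng = random.Random(seed)
    consensus = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(N):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = mutate(consensus, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return consensus, sequences, positions

consensus, sequences, positions = planted_motif_instance()
print(consensus)
print(positions[:5])
```

Returning the planted positions alongside the sequences makes it possible to score a motif finder against the known answer.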
    • Planted Motifs (figure): AGTTATCGCGGCACAGGCTCCTTCTTTATAGCCATGATAGCATCAACCTAACCCTAGATATGGGATTTTTGGGATATATCGCCCCTACACTGGATGACTGGATATACATGAACACGGTGGGAAAACCCTGAC. Each instance differs from ACAGGATCA by 2 mutations; the remaining sequence is random.
    • Random Projection Algorithm (Buhler and Tompa, 2001). Guiding principle: some instances of a motif agree on a subset of positions. Use information from multiple motif instances to construct a model. Example: (7,2) motif M = ATGCGTC with instances x(1) ...ccATCCGACca..., x(2) ...ttATGAGGCtc..., x(5) ...ctATAAGTCgc..., x(8) ...tcATGTGACac...
    • k-Projections Choose k positions in string of length l. Concatenate nucleotides at chosen k positions to form k-tuple. In l-dimensional Hamming space, projection onto k dimensional subspace. l = 15 k=7 P ATGGCATTCAGATTC TGCTGAT P = (2, 4, 5, 7, 11, 12, 13)
    • Random Projection Algorithm Choose a projection by Input sequence x(i): selecting k positions …TCAATGCACCTAT... uniformly at random. For each l-tuple in input sequences, hash into bucket based on letters at k selected positions. Recover motif from bucket containing TGCACCT multiple l-tuples. Bucket TGCT
    • Example  l = 7 (motif size) , k = 4 (projection size)  Choose projection (1,2,5,7)Input Sequence ...TAGACATCCGACTTGCCTTACTAC... Buckets ATCCGAC GCCTTAC ATGC GCTC
    • Hashing and Buckets Hash function h(x) obtained from k positions of projection. Buckets are labeled by values of h(x). Enriched buckets: contain at least s l-tuples, for some parameter s. ATGC GCTC CATC ATTC
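Hashing every l-mer into its bucket can be sketched with a dictionary keyed by the projected letters. The names below are hypothetical, positions are 1-based as on the slides, and each bucket records where the l-mer came from so that instances can be traced back.

```python
from collections import defaultdict

def hash_lmers(sequences, l, positions):
    """Map each bucket label h(x) to the list of (sequence index, start, l-mer) hashed there."""
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            label = "".join(lmer[p - 1] for p in positions)
            buckets[label].append((seq_index, start, lmer))
    return buckets

def enriched(buckets, s):
    """Keep only the buckets holding at least s l-mers."""
    return {label: items for label, items in buckets.items() if len(items) >= s}

# The slide's example: l = 7, projection (1, 2, 5, 7).
buckets = hash_lmers(["TAGACATCCGACTTGCCTTACTAC"], 7, (1, 2, 5, 7))
print(buckets["ATGC"])
print(buckets["GCTC"])
```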
    • Frequency Matrix Model From Bucket: bucket ATGC holds ATCCGAC, ATGAGGC, ATAAGTC, ATGTGAC. Frequency matrix W (rows: nucleotides, columns: positions 1 to 7):
        A   1    0    .25  .5   0    .5   0
        C   0    0    .25  .25  0    0    1
        G   0    0    .5   0    1    .25  0
        T   0    1    0    .25  0    .25  0
      The EM algorithm turns the frequency matrix W into the refined matrix W*.
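The matrix W is just the per-position relative frequency of each nucleotide among the l-mers in the bucket. A minimal sketch (the function name is illustrative; no pseudocounts are applied, matching the raw values on the slide):

```python
def frequency_matrix(lmers, alphabet="ACGT"):
    """Return {base: [relative frequency of base at each position]} for equal-length l-mers."""
    length = len(lmers[0])
    return {
        base: [sum(m[i] == base for m in lmers) / len(lmers) for i in range(length)]
        for base in alphabet
    }

W = frequency_matrix(["ATCCGAC", "ATGAGGC", "ATAAGTC", "ATGTGAC"])
for base, row in W.items():
    print(base, row)
```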
    • Motif Refinement How do we recover the motif from the sequences in the enriched buckets? k nucleotides are known from hash value of bucket. Use information in other l-k positions as starting point for local refinement scheme, e.g. EM or Gibbs sampler ATCCGAC Local refinement algorithm ATGAGGC ATGCGTC ATAAGTC Candidate motif ATGTGAC ATGC
    • Expectation Maximization (EM) S = { x(1), …, x(N)} : set of input sequences Given:  W = An initial probabilistic motif model  P0 = background probability distribution. Find value Wmax that maximizes likelihood ratio: Pr( S | Wmax ) Pr( S | P0 ) EM is local optimization scheme. Requires starting value W
    • EM Motif Refinement For each bucket h containing more than s sequences, form weight matrix Wh Use EM algorithm with starting point Wh to obtain refined weight matrix model Wh* For each input sequence x(i), return l tuple y(i) which maximizes likelihood ratio: Pr(y(i) | Wh* )/ Pr(y(i) | P0). T = {y(1), y(2), …, y(N)} C(T ) = consensus string
    • What Is the Best Motif? Compute score S for each motif:  Generate W, an initial PSSM from the returned l-mers {y(1), y(2), …, y(N)} P( y (i ) | W ) Score = ∑ log i P( y (i ) | P0 ) Return motif with maximal score
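The score can be sketched as a sum of per-position log ratios. Two points here are assumptions rather than anything stated on the slide: a pseudocount of 1 is added to every cell of the PSSM so that no probability is zero, and the background P0 defaults to a uniform distribution. Function names are illustrative.

```python
from math import log

def pssm_from_lmers(lmers, pseudocount=1.0, alphabet="ACGT"):
    """Position-specific probabilities with a pseudocount added to every cell (an assumption)."""
    n = len(lmers)
    return [
        {b: (sum(m[i] == b for m in lmers) + pseudocount) / (n + pseudocount * len(alphabet))
         for b in alphabet}
        for i in range(len(lmers[0]))
    ]

def log_likelihood_ratio(lmers, background=None):
    """Sum over l-mers y(i) of log P(y(i) | W) / P(y(i) | P0), with W built from the l-mers."""
    background = background or {b: 0.25 for b in "ACGT"}  # uniform P0 is an assumption
    W = pssm_from_lmers(lmers)
    return sum(
        log(W[i][base] / background[base])
        for lmer in lmers
        for i, base in enumerate(lmer)
    )

conserved = log_likelihood_ratio(["ATGC", "ATGC", "ATGC", "ATGC"])
scrambled = log_likelihood_ratio(["ATGC", "CGTA", "GCAT", "TACG"])
print(conserved, scrambled)
```

A set of identical l-mers scores well above zero, while a set in which every column is uniform scores zero, so the score rewards conservation relative to the background.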
    • Iterations Single iteration.  Choose a random k-projection.  Hash each l-mer x in input sequence into bucket labelled by h(x).  From each bucket B with at least s sequences, form weight matrix model, and perform EM/Gibbs sampler refinement.  Candidate motif is the best one found from refinement of all enriched buckets. Multiple iterations.  Repeat process for multiple projections.
    • Parameter Selection Projection size k Choose k small so several motif instances hash to same bucket. (k < l - d) Choose k large to avoid contamination by spurious l-mers. E > (N (n - l + 1))/ 4k Bucket threshold s: (s = 3, s = 4)
    • How Many Iterations? Planted bucket : bucket with hash value h(M), where M is motif. Choose m = number of iterations, such that Pr(planted bucket contains ≥ s sequences in at least one of m iterations) ≥ 0.95. Probability is readily computable since iterations form a sequence of independent trials.
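The number of iterations can be computed along these lines. The sketch makes assumptions the slide does not spell out: each of the t planted instances has exactly d mutated positions, so it lands in the planted bucket when none of the k projected positions is mutated, with probability C(l-d, k) / C(l, k); the instances are treated as independent; and the parameter values in the usage lines are illustrative.

```python
from math import ceil, comb, log

def miss_probability(l, d, k, t, s):
    """P(fewer than s of the t planted instances land in the planted bucket) in one iteration."""
    p_hat = comb(l - d, k) / comb(l, k)
    return sum(comb(t, i) * p_hat**i * (1 - p_hat)**(t - i) for i in range(s))

def iterations_needed(l, d, k, t, s, confidence=0.95):
    """Smallest m with 1 - miss^m >= confidence, treating iterations as independent trials."""
    miss = miss_probability(l, d, k, t, s)
    return ceil(log(1 - confidence) / log(miss))

miss = miss_probability(15, 4, 7, 20, 4)
m = iterations_needed(15, 4, 7, 20, 4)
print(miss, m)
```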
    • Composite motifs detection Question Monad detection Mitra
    • monad patterns Short contiguous strings Appear surprisingly many times( in a statistically significant way) S= AGTCTTGCTAGTCCGTAATATCCGGATAGAATAATGATC AGTC AGTC GTAGCATCGTACGTAGCTATCGATCTGAAGCTAGCAGC AAGATGTACTAGAGTCACGTAGCTAGTCATCTATACGAG AGTC AGTC TCGATGTAGTAGCTATCGATCGTAGCTAGAGTCCGTAGC TC AGTC AGCTAGTATCGTAGTGAGCAACATGAGTCCAGTGCATA AGTC GTCAGCTCATGAGTCGCATAGTC GTC AGTC P = AGTC
    • Introduction However, many of the actual regulatory signals are composite patterns.  Groups of monad patterns  Occur relatively near each other An example of a composite pattern is a dyad signal.
    • Composite Pattern S=ACGTAAATCACGTTGACTAGCTAGCACGAG CTAGCATAATCACACTTTGACGAGTCGACTGC ATGCATTGACGCAGTGCATTGCTAGCATGGG TAATCAAACGTTGGCTAGCTAGCATGCATCTG AGCATGCTAGCTACGTACTAGCGCGATAGTC TACTACAAATCACCCATTGCGAGCTACGTAG CTAGCTAGCTAGCTAGCTAGTGATGCATGCTA GAATCCGATCTTGCGATCGAT CP = AATCxxxxTTG
    • Introduction A possible approach is to find each part of the pattern separately and reconstruct the composite pattern. However, they often fail to output composite regulatory patterns consisting of weak monad parts.
    • Introduction A better approach would be to detect both parts of a composite pattern at the same time. Two steps in the proposed algorithm:  Preprocessing the sample creates a set of ‘virtual’ monads.  Apply an exhaustive monad discovery algorithm to the ’virtual’ monad problem. By preprocessing, original problem can be transformed into a larger monad discovery problem.
    • Monad Pattern Discovery. Canonical pattern (l-mer): a continuous string of length l, e.g. the 3-mer ACA. (l,d)-neighbourhood of an l-mer P: all possible l-mers with up to d mismatches as compared to P; the number of such l-mers is $\sum_{i=0}^{d} \binom{l}{i} 3^i$. (l,d)-k patterns: given a sequence S, find all l-mers that occur with up to d mismatches at least k times in the sample. A variant: the sample is split into several sequences, and the task is to find all l-mers that occur with up to d mismatches in at least k sequences.
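The neighbourhood-size formula can be checked against a brute-force enumeration for small l. This is only a sanity-check sketch with illustrative names; the enumeration visits all 4^l strings and is not meant for realistic pattern lengths.

```python
from itertools import product
from math import comb

def neighbourhood_size(l, d):
    """Number of l-mers within d mismatches of a given l-mer: sum of C(l, i) * 3^i."""
    return sum(comb(l, i) * 3**i for i in range(d + 1))

def neighbourhood(pattern, d, alphabet="ACGT"):
    """Brute-force (l,d)-neighbourhood; feasible only for small l."""
    return [
        "".join(q) for q in product(alphabet, repeat=len(pattern))
        if sum(a != b for a, b in zip(pattern, q)) <= d
    ]

print(neighbourhood_size(3, 1), len(neighbourhood("ACA", 1)))
```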
    • Pattern Driven Approach (PDA) (Pevzner, 2000): examines all $4^l$ patterns of fixed length l in lexical order, compares each pattern to every l-mer in the sample, and returns all (l,d)-k patterns. Waterman et al. (1984) and Galas et al. (1985) bypass the excessive time requirement: most of the $4^l$ patterns are not worth examining, since neither these patterns nor their neighbours appear in the sample. SDA was therefore designed to explore only the l-mers appearing in the sample and their neighbours.
    • Sample Driven Approach(SDA) First initializes a table of size 4l  Each table entry corresponds to a pattern SDA generate the (l, d)-neighbourhood of lmer  Incremented by a certain amount  After all lmers processed, SDA return all pattern whose table entries have scores exceed the threshold AAAAA 3 4l AAAAC 1 AAACC 2 … ..
    • Sample Driven Approach(SDA) Faster but requires a large 4l table still  not practical for long pattern in mid 1980  Not mainstream and no tool  (Today gigabytes of RAM memory available thus l increased without a memory-efficient algorithm)
    • SDA Iterations First, explore all neighbour of the first lmer from the sample. Second, explore all neighbour of the second lmer If an lmer P belongs to the neighbour of the lmers appearing at positions i1 ,…ik in the sample  info about P collected at iteration i1 ,…ik . So the Waterman approach update info about P k times  memory slot for P is occupied during the course time even if P is not “interesting” lmer Most of lmers explored are not interesting—waste memory slot
    • To improve SDA Better solution:  Collect info about all P at the same time  to remove the need to keep the info in memory  but require a new approach to navigate the space of all lmers MITRA runs faster than PDA and SDA, and uses only a fraction of the memory of the SDA
    • Pattern-finding vs. profile-based Profile-based is more biologically relevant for finding motifs in biological samples?  Probably the reason Waterman algorithm not popular in the last decade Sagot and colleagues were the first to rebut this opinion  Develop an efficient version of Waterman’s
    • Pattern-based vs. profile-based Similarities  Pattern-based generate the profile  Every profile of length l corresponds to a pattern of length l formed by the most frequent nucleotides in every position.  Pattern-driven at least as good as profile-based Even better on simulated samples with implanted patterns  Though profile-implantation model is somehow limited  Today little evidence profile-based perform any better on either biological or simulated samples
    • MitraMismatch Tree Algorithm
    • Mismatch Tree Algorithm (MITRA) MITRA uses a mismatch tree data structure to split the space of all possible patterns into disjoint subspaces that start with a given prefix.  For reducing the pattern discovery into smaller sub-problems.  MITRA also takes advantage of pair-wise similarity between instances.
    • Splitting Pattern Space A pattern is called weak if it has less than k ( l ,d )-neighbours in the sample. A subspace is called weak if all patterns in this subspace are weak.
    • Splitting pattern space: A pattern is called weak if it has less than k (l,d)-neighbours in the sample. Example: sequence = AGTATCAGTT, l = 3, sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }, d = 1, k = 2. For P = GTC, the (l,d)-neighbours in the sample are { GTA, ATC, GTT }, so P is not weak.
    • Splitting pattern space: A pattern is called weak if it has less than k (l,d)-neighbours in the sample. Example: sequence = AGTATCAGTT, l = 3, sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }, d = 1, k = 2. For P = CAG, the (l,d)-neighbours in the sample are { CAG }, so P is weak.
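The two examples can be checked directly. A minimal sketch with illustrative function names:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def lmers(sequence, l):
    """All overlapping substrings of length l, in order of appearance."""
    return [sequence[i:i + l] for i in range(len(sequence) - l + 1)]

def neighbours_in_sample(pattern, sample, d):
    """The l-mers of the sample that lie within d mismatches of pattern."""
    return [m for m in sample if hamming(pattern, m) <= d]

def is_weak(pattern, sample, d, k):
    """A pattern is weak if it has fewer than k (l,d)-neighbours in the sample."""
    return len(neighbours_in_sample(pattern, sample, d)) < k

sample = lmers("AGTATCAGTT", 3)
for pattern in ("GTC", "CAG"):
    print(pattern, neighbours_in_sample(pattern, sample, 1), is_weak(pattern, sample, 1, 2))
```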
    • Splitting pattern space: A subspace is called weak if all patterns in this subspace are weak. Sequence = AGTATCAGTT. subspaceA = { AAA, AAT, AAC, AAG, …, AGG }; subspaceT = { TAA, TAT, TAC, TAG, …, TGG }; subspaceC = { CAA, CAT, CAC, CAG, …, CGG }; subspaceGG = { GGA, GGT, GGC, GGG }
    • Question Input:  S, l, d, k Output:  All l mers that occur with up to d mismatches at least k times in the sample.
    • Solution Naïve :  Test all l mer in the space  If occur with up to d mismatches at least k times in the sample than output this l mer.space = { AAA, AAT, AAC, AAG ………..AGG TAA, TAT, TAC, TAG ………..TGG CAA, CAT, CAC, CAG ………..CGG GAA, GAT, GAC, GAG ………..GGG }sample = { AGT, GTA, TAT, ATC, TCA, CAG, AGT, GTT }
    • Splitting pattern space if we are looking for patterns of length l we would first split the space of all l mers into 4 disjoint subspaces.  Subspace of all l mers starting with A,  Subspace of all l mers starting with T,  Subspace of all l mers starting with C,  Subspace of all l mers starting with G,
    • Splitting pattern space if we are looking for patterns of length l we would first split the space of all l mers into 4 disjoint subspaces.Space: A* T* C* G* SubspaceA
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.Space: A* Can’t rule out
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.Space: AA* AC* AT* AG* Can rule out
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.Space: Can’t rule out
    • Splitting pattern space we further determine whether the subspace contains a ( l ,d )-k pattern.  If we can rule out this subspace contains such a pattern  we stop searching in this subspace;  release the memory slot;  If we can’t rule out this subspace contains such a pattern  we split this subspace again on the next symbol;  and repeat;
    • Mismatch tree data structure A mismatch tree is a rooted tree where each internal node has 4 branches labeled with a symbol in {A,C,T,G} The maximum depth of the tree is l. Each node in the mismatch tree corresponds to the subspace of patterns P with a fixed prefix. Each node contains pointers to all l mers instances from the sample that are within d mismatches from a pattern p.
    • Mismatch tree data structure MITRA start with examining the root node of the mismatch tree that corresponds to the space of all patterns.  When examining a node, MITRA tries to prove that it corresponds to a weak subspace.  If (we can’t prove it)  we expand the node’s children and examine each of them.  Whenever we reach a node corresponding to a weak subspace, we backtrack. The intuition is that many of the nodes correspond to weak subspaces and can be rule out. This allows us to avoid searching much of the pattern space.
    • Mismatch tree data structure If we reach depth l and the number of instances is not less than k.  the l mer corresponding to the path from the root to the leaf .  the pointers from this node correspond to the instances of this pattern.
    • Example Consider a very simple example of finding the pattern of length 4 with up to 1 mismatch and at least 2 times in the sample S = Not for exam “AGTATCAGTT”. The substrings (4mers) in S are { AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT }
    • Mismatch tree for the example, branch A then A (figure). Each node lists, for the instances AGTA, GTAT, TATC, ATCA, TCAG, CAGT, AGTT, the number of mismatches against the node's prefix, and keeps those with at most d = 1. Root: 0 0 0 0 0 0 0 (7 instances). Node A: 0 1 1 0 1 1 0 (7 instances). Node AA: 1 2 1 1 2 1 1 (5 instances). Below AA, node AAT keeps 3 instances and the other children keep 0 or 1; every leaf under AAT keeps at most 1 instance, so this branch outputs no pattern.
    • Mismatch tree for the example, branch A then G (figure). Node A: 0 1 1 0 1 1 0 (7 instances). Node AG: 0 2 2 1 2 2 0 (3 instances: AGTA, ATCA, AGTT). Node AGT: 0 2 0 (2 instances: AGTA, AGTT). Leaves AGTA, AGTC, AGTG, AGTT each keep 2 instances. Output: AGTA, AGTC, AGTG, AGTT
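The traversal shown in these figures can be sketched as a depth-first recursion over prefixes. This is a simplified reading, not the published MITRA implementation: the tree is never stored, only the current path is, and each node carries the list of instances still within d mismatches of its prefix. Names are illustrative.

```python
def mitra(sequence, l, d, k, alphabet="ACGT"):
    """Depth-first search over pattern prefixes with pruning of weak subspaces."""
    instances = [sequence[i:i + l] for i in range(len(sequence) - l + 1)]
    found = []

    def visit(prefix, alive):
        # alive holds (instance index, mismatches against prefix so far).
        if len(alive) < k:
            return  # fewer than k instances remain: weak subspace, backtrack
        depth = len(prefix)
        if depth == l:
            found.append(prefix)
            return
        for base in alphabet:
            child = []
            for idx, mismatches in alive:
                mismatches += instances[idx][depth] != base
                if mismatches <= d:
                    child.append((idx, mismatches))
            visit(prefix + base, child)

    visit("", [(i, 0) for i in range(len(instances))])
    return found

result = mitra("AGTATCAGTT", 4, 1, 2)
print(result)
```

Run on the example, the search over all branches reports the four patterns under AGT shown in the figure; it should also report AGCA and ATTA, each of which lies one mismatch from both AGTA and ATCA, in branches the figures do not show.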
    • Overall complexity (figure: the mismatch tree with its cost annotations). Space = $O(l^2 \times |S|)$. Time = $O(4^l \times |S|)$, from number of nodes = $O(4^l)$ and number of comparisons in each node = $O(|S|)$.
    • Take a Closer Look: In the mismatch tree algorithm, we cannot start ruling out a node until we traverse to depth d + 1. In the example (d = 1), the root and node A (mismatch counts 0 1 1 0 1 1 0) both keep all 7 instances; node AA (counts 1 2 1 1 2 1 1) is the first to drop instances, keeping 5.
    • MITRA Graph: Information about pairwise similarities between instances of the pattern can significantly speed up the sample-driven approach. The graph that is constructed to model this pairwise similarity is called the MITRA-Graph.
    • MITRA Graph: Given a pattern P and sample S, we can construct a graph G(P, S) where each vertex is an l-mer in the sample and there is an edge connecting two l-mers if P is within d mismatches from both l-mers. Example (figure): S = TAACA, P = TAC; vertices AAC (d=1), TAA (d=1), ACA (d=3).
    • MITRA Graph For an (l,d) – k pattern P the corresponding graph contains a clique of size k. S = TAACA P = AAA AAC (d=1) TAA ACA (d=1) (d=3)
    • MITRA Graph Given a set of patterns P and a sample S, define a graph G(P , S) whose edge set is a union of edge sets of graphs G(P, S) for P∈P . Each vertex of G(P , S) is an lmer in the sample and there is an edge connecting two lmers if there is a pattern P∈P that is within d mismatches from both lmers. If for a subspace of patterns we can rule out an existence of a clique of size k, then the subspace has no (l,d)-k
    • The WINNOWER Algorithm The WINNOWER algorithm by Pevzner and Sze (2000) constructs the following graph: Each lmer in the sample is a vertex, and an edge connects two vertices if the corresponding lmers have less than d mismatches. Instances of a (l,d)-k pattern form a clique of size k in this graph.
    • The WINNOWER Algorithm (con’t) Since clique are difficult to find, WINNOWER takes the approach of trying to remove edges that do not corresponding to a clique. k=4
    • Improvements by MITRA-Graph1. Construct a graph at each node in the mismatch tree. 0 1 1 0 1 1 0 A G T A T C A A G T A T C A G T A T C A G T A T C A G T T
    • Improvements by MITRA-Graph2. Remove edges which are not part of a clique. A
    • Improvements by MITRA-Graph3. If no potential clique remains, rule out the subspace corresponding to the node and backtrack. A A
    • Improvements by MITRA-Graph4. If we cannot rule out a clique, split the subspace of patterns and examine the child nodes A
    • MISMATCH TREE ALGORITHM — Improvements over WINNOWER At each node of the tree, we remove edges by computing the degree of each vertex. If the degree of the vertex is less than k-1, we can remove all edges incident to it since we know it is not part of a clique. We repeat this procedure until we cannot remove any more edges. If the number of edges remaining is less than the minimum number of edges in a clique, we can rule out the existence of a clique and backtrack.
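The degree-based pruning described on this slide can be sketched as a fixed-point loop. The graph representation (a collection of vertex pairs without self-loops) and the function names are assumptions; the minimum number of edges in a clique of size k is k(k-1)/2.

```python
def prune_by_degree(edges, k):
    """Repeatedly drop all edges incident to a vertex of degree below k - 1."""
    edges = {frozenset(e) for e in edges}
    while True:
        degree = {}
        for edge in edges:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        low = {v for v, deg in degree.items() if deg < k - 1}
        if not low:
            return edges
        edges = {edge for edge in edges if not (edge & low)}

def may_contain_clique(edges, k):
    """False means a clique of size k is ruled out; True means it cannot be ruled out."""
    return len(prune_by_degree(edges, k)) >= k * (k - 1) // 2

triangle_with_tail = [(1, 2), (2, 3), (1, 3), (3, 4)]
path = [(1, 2), (2, 3), (3, 4)]
print(may_contain_clique(triangle_with_tail, 3), may_contain_clique(path, 3))
```

The test is one-sided: surviving the pruning does not prove that a clique exists, it only means the subspace cannot yet be discarded.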
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER. The problem with this approach is how to efficiently construct the graph at each node in the mismatch tree. Instead of constructing the graph from scratch, we construct it based on the graph at the parent node. Consider an edge connecting two l-mers: the first l-mer matches the prefix of the pattern subspace with d1 mismatches, and the second l-mer matches it with d2 mismatches.
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER. Denote the number of mismatches between the tails of the first and the second l-mers as m. The edge between these two l-mers exists in the pattern subspace if and only if d1 <= d, d2 <= d and d1 + d2 + m <= 2d. (Figure: the prefix of the pattern subspace aligned with the first l-mer and the second l-mer.)
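The edge condition translates directly into code. In this sketch the "tail" of an l-mer is taken to be everything after the first len(prefix) letters, which is one reading of the slide; the quantities are recomputed from scratch here for clarity, whereas the slides describe updating d1, d2, and m incrementally while descending the tree.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def edge_in_subspace(lmer1, lmer2, prefix, d):
    """Test d1 <= d, d2 <= d and d1 + d2 + m <= 2d for the subspace of patterns with this prefix."""
    depth = len(prefix)
    d1 = hamming(lmer1[:depth], prefix)
    d2 = hamming(lmer2[:depth], prefix)
    m = hamming(lmer1[depth:], lmer2[depth:])  # mismatches between the two tails
    return d1 <= d and d2 <= d and d1 + d2 + m <= 2 * d

# At the root (empty prefix) the test reduces to m <= 2d.
for prefix in ("", "AG", "AGT", "C"):
    print(repr(prefix), edge_in_subspace("AGTA", "ATCA", prefix, 1))
```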
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER (cont'd). In the root node, since d1 = d2 = 0, an edge exists only if m <= 2d, which gives the same graph as WINNOWER. Moving down the tree, the condition becomes much stronger than WINNOWER's. We can compute the edges of a node based on the edges of the node's parent by keeping track of the quantities d1, d2, and m for each edge.
    • MISMATCH TREE ALGORITHM: Improvements over WINNOWER. To summarize, the MITRA-Graph algorithm works as follows. We first compute the set of edges at the root node by performing pairwise comparisons between all l-mers, since d1 = d2 = 0 there. We traverse the tree in depth-first order, passing on the valid edges and keeping track of the quantities d1, d2, and m for each of them. At each node, we prune the graph by eliminating any edges incident to vertices that have degree less than k-1. If there are fewer than the minimum number of edges for a clique, we backtrack. If we reach a leaf of the tree (depth l), then we output the corresponding pattern.
    • Discovering dyad signals
    • DISCOVERING DYAD SIGNALS: For dyad signals, we are interested in discovering two monads that occur a certain length apart. We use the notation (l1-(s1,s2)-l2,d)-k pattern to denote a dyad signal: a monad of length l1, a gap of length s with s1 <= s <= s2, and a monad of length l2.
    • DISCOVERING DYAD SIGNALS: The MITRA-Dyad algorithm casts the dyad discovery problem into a monad discovery problem by preprocessing the input and creating a “virtual” sample, then solving the (l1+l2,d)-k monad pattern discovery problem in this sample. For each l1-mer in the sample and for each s in [s1,s2], we create an (l1+l2)-mer which is the l1-mer concatenated with the l2-mer upstream s nucleotides of the l1-mer.
    • DISCOVERING DYAD SIGNALS: The number of elements in the “virtual” sample will be approximately (s2-s1+1) times larger. An (l1+l2,d)-k pattern in the “virtual” sample will correspond to an (l1-(s1,s2)-l2,d)-k pattern in the original sample, and we can easily map the solution from the monad problem to the dyad one. An important feature of MITRA-Dyad is its ability to search for long patterns.
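Building the “virtual” sample can be sketched as below. One choice here is an assumption: the l2-mer is taken to start s nucleotides after the end of the l1-mer, following the l1, s, l2 layout of the notation; the names and the test sequence are illustrative. Each virtual element keeps its origin (start, s) so that a monad solution can be mapped back to a dyad.

```python
def virtual_sample(sequence, l1, l2, s1, s2):
    """Return (start, s, l1-mer + l2-mer) for every l1-mer and every gap s in [s1, s2]."""
    virtual = []
    for start in range(len(sequence) - l1 + 1):
        first = sequence[start:start + l1]
        for s in range(s1, s2 + 1):
            second_start = start + l1 + s  # assumed placement of the l2-mer
            if second_start + l2 <= len(sequence):
                second = sequence[second_start:second_start + l2]
                virtual.append((start, s, first + second))
    return virtual

# AATC, a gap of 4 letters, then TTG, as in the composite pattern AATCxxxxTTG.
elements = virtual_sample("GGAATCACGTTTGCC", l1=4, l2=3, s1=3, s2=5)
print([e for e in elements if e[2] == "AATCTTG"])
```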
    • DISCOVERING DYAD SIGNALS: If the range s2-s1+1 of acceptable distances between monad parts in a composite pattern is large, the MITRA-Dyad algorithm becomes inefficient. A simple approach to detect these patterns is to generate a long ranked list of candidate monad patterns using MITRA, then check each occurrence of each pair from the list to see if they occur within the acceptable distance.