Parallel Random Projection for Motif Discovery on GPUs
Presentation Transcript

  • Finding Planted (l, d)-Motifs in Parallel using Random Projection on GPUs
    Jhoirene Barasi Clemente
    Algorithms and Complexity Laboratory, Department of Computer Science, University of the Philippines-Diliman
    jbclemente@up.edu.ph
    March 31, 2012
    J.B. Clemente (ACLab, DCS, UPD) CUDA-FMURP March 31, 2012 1 / 88
  • Overview
      Introduction
      Definitions and Notations
      Finding Motifs using Random Projection (FMURP)
      Parallel Implementations of CUDA-FMURP
      Results and Analysis
      Conclusion
  • Introduction
    In this work, we solve the Planted (l, d)-Motif Problem using Random Projection (FMURP). The study focuses on the parallelization of FMURP: we present three versions of the parallel algorithm, discuss the correctness of the parallelization, and implement two of the versions on GPUs. Theoretical and actual performance analyses are also presented.
  • Introduction
    A DNA motif is a nucleic acid sequence pattern that has some biological significance, such as being a DNA binding site for a regulatory protein, i.e., a transcription factor [Das,2007].
  • Introduction: DNA Sequences as Strings [figure]
  • Introduction
    The pattern is fairly short (5 to 20 base pairs (bp) long) and is known to recur in different genes or several times within a gene [Rombauts,1999].
  • Introduction: Notations
    A set of t sequences S.
    Example 1 (Sequences $S = \{S_0, S_1, \ldots, S_{t-1}\}$)
      S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
      S1 : T T T G A G G G T G C C C A A T A A A T G C C A C T C C A A A G C G G A C A A A
      S2 : G G A T G C A A C T G A T G C C G T T T G A C G A C C T A A A T C A A C G G C C
      S3 : A A G G A T G C A A C T C C A G G A G C G C C T T T G C T G G T T C T A C C T G
      S4 : A A T T T T C T A A A A A G A T T A T A A T G T C G G T C C A T G C A A C T T C
      S5 : C T G C T G T A C A A C T G A G A T C A T G C T G C A T G C A A C T T T C A A C
      S6 : T A C A T G A T C T T T T G A T G C A A C G T G G A T G A G G G A A T G A T G C
    The set of sequences $S = \{S_0, S_1, S_2, S_3, S_4, S_5, S_6\}$ is defined over $\Sigma_{DNA} = \{A, C, T, G\}$, where each sequence $S_i$ in S has length $n_i = 40$ for all $i \in \{0, \ldots, t-1\}$.
  • Introduction: Notations
    An l-mer is a string of length l defined over $\Sigma_{DNA}$. To denote an l-mer in S, we use $S_{i,j}$, where $i \in \{0, 1, \ldots, t-1\}$ is the sequence number and $j \in \{0, 1, \ldots, n-l\}$ is the starting position in $S_i$.
    Example 2 ($S_{i,j}$ in S)
    For instance, the 8-mer $S_{0,7}$ is ATGGAACT.
      S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
  • Introduction: Notations
    Let $s = (a_0, a_1, \ldots, a_{t-1})$ be the vector of starting positions in S, where $a_i \in \{0, 1, \ldots, n-l\}$. Let A(s) denote the alignment formed by the l-mers in the set $\{S_{0,a_0}, S_{1,a_1}, \ldots, S_{t-1,a_{t-1}}\}$.
  • Introduction: Notations
    Example 3 (Alignment matrix A(s))
    Suppose we have a starting position vector s = (7, 18, 2, 4, 30, 26, 14).
    A(s):
      S_{0,7}  : A T G G A A C T
      S_{1,18} : A T G C C A C T
      S_{2,2}  : A T G C A A C T
      S_{3,4}  : A T G C A A C T
      S_{4,30} : A T G C A A C T
      S_{5,26} : A T G C A A C T
      S_{6,14} : A T G C A A C G
    The l-mers are taken from the sequences of Example 1:
      S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
      S1 : T T T G A G G G T G C C C A A T A A A T G C C A C T C C A A A G C G G A C A A A
      S2 : G G A T G C A A C T G A T G C C G T T T G A C G A C C T A A A T C A A C G G C C
      S3 : A A G G A T G C A A C T C C A G G A G C G C C T T T G C T G G T T C T A C C T G
      S4 : A A T T T T C T A A A A A G A T T A T A A T G T C G G T C C A T G C A A C T T C
      S5 : C T G C T G T A C A A C T G A G A T C A T G C T G C A T G C A A C T T T C A A C
      S6 : T A C A T G A T C T T T T G A T G C A A C G T G G A T G A G G G A A T G A T G C
  • Introduction: Notations
    A profile matrix P(s) with dimension $|\Sigma_{DNA}| \times l$ is derived from the frequency of each letter in each column of A(s).
    Example 4 (Profile Matrix P(s))
    A(s):
      S_{0,7}  : A T G G A A C T
      S_{1,18} : A T G C C A C T
      S_{2,2}  : A T G C A A C T
      S_{3,4}  : A T G C A A C T
      S_{4,30} : A T G C A A C T
      S_{5,26} : A T G C A A C T
      S_{6,14} : A T G C A A C G
    P(s):
      A: 7 0 0 0 6 7 0 0
      T: 0 7 0 0 0 0 0 6
      C: 0 0 0 6 1 0 7 0
      G: 0 0 7 1 0 0 0 1
  • Introduction: Notations
    From P(s), we define $M_{P(s)}(j)$, where $0 \le j \le l-1$, to be the maximum entry in the jth column of the profile matrix.
    Example 5 ($M_{P(s)}(j)$)
    P(s):
      A: 7 0 0 0 6 7 0 0
      T: 0 7 0 0 0 0 0 6
      C: 0 0 0 6 1 0 7 0
      G: 0 0 7 1 0 0 0 1
    Column maxima $M_{P(s)}(j)$ for j = 0, ..., 7: 7 7 7 6 6 7 7 6
  • Introduction: Notations
    A consensus string is an l-mer in which the ith element is the nucleotide base that attains $M_{P(s)}(i)$ in column i.
    Example 6 (Consensus String)
    P(s):
      A: 7 0 0 0 6 7 0 0
      T: 0 7 0 0 0 0 0 6
      C: 0 0 0 6 1 0 7 0
      G: 0 0 7 1 0 0 0 1
    Consensus String: A T G C A A C T
  • Introduction: Notations
    We define Score(s, S) to be
    $$\mathrm{Score}(s, S) = \sum_{i=0}^{l-1} M_{P(s)}(i) \qquad (1)$$
    Example 7 (Consensus Score)
    P(s):
      A: 7 0 0 0 6 7 0 0
      T: 0 7 0 0 0 0 0 6
      C: 0 0 0 6 1 0 7 0
      G: 0 0 7 1 0 0 0 1
    Score(s, S) = 7 + 7 + 7 + 6 + 6 + 7 + 7 + 6 = 53
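The notation above (alignment A(s), profile P(s), consensus string, and Score) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, and the data are the sequences of Example 1 with the start vector of Example 3.

```python
ALPHABET = "ATCG"  # row order of the profile matrix in the examples

def alignment(seqs, starts, l):
    """Return the l-mers S_{i,a_i} selected by the start vector s."""
    return [seq[a:a + l] for seq, a in zip(seqs, starts)]

def profile(rows):
    """Count each letter per column; one list of l counts per letter."""
    counts = {c: [0] * len(rows[0]) for c in ALPHABET}
    for row in rows:
        for j, c in enumerate(row):
            counts[c][j] += 1
    return counts

def consensus_and_score(prof):
    """Pick the most frequent letter per column and sum the column maxima."""
    l = len(prof[ALPHABET[0]])
    letters, score = [], 0
    for j in range(l):
        best = max(ALPHABET, key=lambda c: prof[c][j])
        letters.append(best)
        score += prof[best][j]
    return "".join(letters), score

S = [
    "CGGGGCTATGGAACTGGGTCGTCACATTCCCCTTTCGATA",
    "TTTGAGGGTGCCCAATAAATGCCACTCCAAAGCGGACAAA",
    "GGATGCAACTGATGCCGTTTGACGACCTAAATCAACGGCC",
    "AAGGATGCAACTCCAGGAGCGCCTTTGCTGGTTCTACCTG",
    "AATTTTCTAAAAAGATTATAATGTCGGTCCATGCAACTTC",
    "CTGCTGTACAACTGAGATCATGCTGCATGCAACTTTCAAC",
    "TACATGATCTTTTGATGCAACGTGGATGAGGGAATGATGC",
]
s = (7, 18, 2, 4, 30, 26, 14)

A = alignment(S, s, 8)
P = profile(A)
cons, score = consensus_and_score(P)
print(cons, score)
```

For this input the script prints `ATGCAACT 53`, which agrees with Examples 6 and 7.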
  • Introduction: Motif Finding Problem
    Definition 8 (Motif Finding Problem [Pevzner,2004])
    INPUT: A motif length l, and a set of t sequences $S = \{S_0, S_1, S_2, \ldots, S_{t-1}\}$, where each $S_i$ is of length $n_i$
    OUTPUT: An array of starting positions $s = (a_0, a_1, \ldots, a_{t-1})$ maximizing the consensus Score(s, S)
  • Motif Finding Problem: Naive MFP Solver [Pevzner,2004]
    Input: DNA (sequences), motif length l
    Output: Starting position vector s and the consensus string corresponding to s
      1 For each possible starting position vector in S, i.e., $s \in \{(0, 0, \ldots, 0), \ldots, (n-l, n-l, \ldots, n-l)\}$:
          1.1 Get the alignment A(s).
          1.2 Compute P(s).
          1.3 Evaluate Score(s, S).
      2 From the s with the maximum Score, get the consensus string.
      3 Output the consensus string.
    Step 1 iterates $(n-l+1)^t$ times, because each $a_i$ in $s = (a_0, a_1, \ldots, a_{t-1})$ ranges independently over $\{0, \ldots, n-l\}$.
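The exhaustive search above can be sketched as follows. This is only an illustration under stated assumptions: all sequences share one length n, the helper names `score_of` and `naive_mfp` are hypothetical, and the three toy sequences (with the 4-mer TGCA placed at positions 4, 0, and 2) are made up for the example.

```python
from itertools import product

def score_of(seqs, starts, l):
    """Consensus score and consensus string of the alignment A(s)."""
    rows = [seq[a:a + l] for seq, a in zip(seqs, starts)]
    total, letters = 0, []
    for column in zip(*rows):
        best = max("ACGT", key=column.count)  # most frequent letter in the column
        letters.append(best)
        total += column.count(best)
    return total, "".join(letters)

def naive_mfp(seqs, l):
    """Try every start vector s; return (best score, s, consensus) and the number tried."""
    n = len(seqs[0])  # assumption: every sequence has the same length n
    best = (-1, None, None)
    tried = 0
    for starts in product(range(n - l + 1), repeat=len(seqs)):
        tried += 1
        sc, cons = score_of(seqs, starts, l)
        if sc > best[0]:
            best = (sc, starts, cons)
    return best, tried

toy = ["AAAATGCA", "TGCACCCC", "GGTGCAGG"]
best, tried = naive_mfp(toy, 4)
print(best, tried)
```

With n = 8, l = 4, and t = 3 the loop visits (8 − 4 + 1)^3 = 125 start vectors, which is why this approach does not scale to realistic values of n and t.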
  • Introduction: Planted (l, d)-Motif Finding Problem
    Definition 9 (Challenge Problem [Pevzner,2000])
    INPUT: Motif length l = 15, expected mismatches d = 4, and 20 DNA sequences, each with $n_i = 600$ nucleotide bases
    OUTPUT: A consensus string M from an alignment A(s), where each l-mer $S_{i,a_i}$ in A(s) has $d_E(M, S_{i,a_i}) = 4$, for all $i \in \{0, \ldots, t-1\}$.
  • Planted (l, d)-Motif Finding Problem: Why challenging?
    Suppose we have A(s):
      S_{0,a_0} : A C T T G G G G C A A G A G G
      S_{1,a_1} : G G A C G G G G C A G A C T G
      S_{2,a_2} : A C T T G C T A A A G A C T G
      S_{3,a_3} : A C T G C G G G C A C A G T G
      S_{4,a_4} : A C C T G G G T C G T A C T G
    P(s):
      A: 4 0 1 0 0 0 0 1 1 4 1 4 1 0 0
      C: 0 4 1 1 1 1 0 0 4 0 1 0 3 0 0
      T: 0 0 3 3 0 0 1 1 0 0 1 0 0 4 0
      G: 1 1 0 1 4 4 4 3 0 1 2 1 1 1 5
    Consensus string: A C T T G G G G C A G A C T G
    Each l-mer differs from the consensus in d = 4 positions, but two planted instances can differ from each other in up to 2d positions: $d_E(S_{0,a_0}, S_{1,a_1}) = 2d = 8$.
    Score(s, S) = 4 + 4 + 3 + 3 + 4 + 4 + 4 + 3 + 4 + 4 + 2 + 4 + 3 + 4 + 5 = 55
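The distances in this example can be checked with a short script. It is a sketch that assumes $d_E$ counts mismatched positions between two strings of equal length (the Hamming distance); the slides do not spell out the definition, and the function name `hamming` is hypothetical.

```python
def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(u, v))

M = "ACTTGGGGCAGACTG"  # consensus string of the alignment above
rows = [
    "ACTTGGGGCAAGAGG",
    "GGACGGGGCAGACTG",
    "ACTTGCTAAAGACTG",
    "ACTGCGGGCACAGTG",
    "ACCTGGGTCGTACTG",
]

to_consensus = [hamming(M, r) for r in rows]
first_pair = hamming(rows[0], rows[1])
print(to_consensus, first_pair)
```

Every row is at distance 4 from the consensus, while the first two rows are at distance 8 from each other, so pairwise comparison of l-mers alone gives a weak signal.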
  • Introduction: Planted (l, d)-Motif Finding Problem
    Definition 10 (Planted (l, d)-Motif Finding Problem [Tompa,2001])
    INPUT: Motif length l, expected number of mismatches d, and a set of t sequences $S = \{S_0, S_1, S_2, \ldots, S_{t-1}\}$, where each $S_i$ is of length $n_i$
    OUTPUT: A consensus string M from an alignment A(s), where each l-mer $S_{i,a_i}$ in A(s) has $d_E(M, S_{i,a_i}) = d$, for all $i \in \{0, \ldots, t-1\}$.
  • Planted (l, d)-Motif Finding Problem: Solutions for Planted (l, d)-Motif Finding
      SP-STAR [Pevzner,2000]
      Winnower [Pevzner,2000]
      Random Projection [Tompa,2001]
      Aggregation [Mohammed,2004]
      GibbsDST [Shida,2006]
  • Finding Motifs using Random Projection (FMURP)
    INPUT: Set of sequences S, motif length l, expected mismatches d, projection dimension k, and bucket threshold δ
    OUTPUT: Motif
      1 Projection
          1.1 Get all l-mers $S_{i,j}$ in S.
          1.2 Get the projection $h_I(S_{i,j})$ for each $S_{i,j}$ in S.
          1.3 Hash each $S_{i,j}$ to the bucket with identifier $h_I(S_{i,j})$.
          1.4 Get the enriched buckets.
      2 Refine each enriched bucket using EM.
      3 Refine each enriched bucket using SP-STARσ.
      4 Maximize the score to output the best motif.
  • Finding Motifs using Random Projection (FMURP)
    Definition 11 (Random Projection)
    Given an l-mer $S_{i,j}$, a projection dimension k, and a set $I \subset L = \{0, \ldots, l-1\}$ with $|I| = k$, whose elements are chosen at random from L and sorted in increasing order, the k-dimensional projection of $S_{i,j}$ is
    $$h_I(S_{i,j}) = S_{i,j}(I_0), S_{i,j}(I_1), \ldots, S_{i,j}(I_{k-1}),$$
    where $h_I(S_{i,j})$ is a k-mer and $I_i$ denotes the ith element of I.
  • Finding Motifs using Random Projection (FMURP): Example
    Example 12
    Given a set of DNA sequences S, pattern length l = 4, projection dimension k = 2, and bucket threshold δ = 3.
      S0 : C G G T C A G G
      S1 : T T C G A C A T
      S2 : A C G A T G A A
    Figure: Set of t = 3 sequences, each with n = 8
    Let I = {0, 1}.
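The projection step of Example 12 can be traced with the sketch below. It is a plain sequential illustration with hypothetical function names; the refinement steps (EM and SP-STARσ) are not modeled.

```python
from collections import defaultdict

def project(lmer, I):
    """k-dimensional projection h_I: keep only the positions listed in I."""
    return "".join(lmer[p] for p in I)

def enriched_buckets(seqs, l, I, delta):
    """Hash every l-mer S_{i,j} by its projection; keep buckets with >= delta members."""
    buckets = defaultdict(list)
    for i, seq in enumerate(seqs):
        for j in range(len(seq) - l + 1):
            buckets[project(seq[j:j + l], I)].append((i, j))
    return {key: members for key, members in buckets.items() if len(members) >= delta}

S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
enriched = enriched_buckets(S, l=4, I=(0, 1), delta=3)
print(enriched)
```

For this input only the bucket with identifier CG reaches the threshold δ = 3; it holds the l-mers $S_{0,0}$ = CGGT, $S_{1,2}$ = CGAC, and $S_{2,1}$ = CGAT.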
  • Finding Motifs using Random Projection (FMURP): Projection [figures]
  • Parallel Motif Finding using Random Projection: How do we parallelize FMURP?
    Sequential:
      1 Projection
          1.1 Get all l-mers $S_{i,j}$ in S.
          1.2 Get the projection $h_I(S_{i,j})$ for each $S_{i,j}$ in S.
          1.3 Hash each $S_{i,j}$ to the bucket with identifier $h_I(S_{i,j})$.
          1.4 Get the enriched buckets.
      2 Refine each enriched bucket using EM.
      3 Refine each enriched bucket using SP-STARσ.
      4 Maximize the score to output the best motif.
    In parallel:
      1 Projection
          1.1 Get all l-mers $S_{i,j}$ in S in parallel.
          1.2 Get the projection $h_I(S_{i,j})$ for each $S_{i,j}$ in S in parallel.
          1.3 Hash each $S_{i,j}$ to the bucket with identifier $h_I(S_{i,j})$ in parallel.
          1.4 Get the enriched buckets in parallel.
      2 Refine each enriched bucket using EM in parallel.
      3 Refine each enriched bucket using SP-STARσ in parallel.
      4 Maximize the score to output the best motif.
  • Parallel Motif Finding using Random Projection: Parallel Algorithms for Motif Finding
      CUDA-MEME
      CUDA-Gibbs Sampling
  • Parallel Motif Finding using Random Projection: CUDA [figure]
  • Parallel Motif Finding using Random Projection (CUDA-FMURP v1): Computing Framework
    Figure: Flowchart showing the processes done in the CPU and GPU
  • CUDA-FMURP v1
    Figure: Thread ID is denoted by an ordered pair (i, j), 0 ≤ i ≤ w and 0 ≤ j ≤ v, where v is the maximum number of threads per block and w is the number of allocated thread blocks in the grid. The algorithm uses a total of $x = t \cdot (n-l+1)$ threads that are linearly arranged in the GPU.
  • CUDA-FMURP v1
    INPUT: Set of sequences S, motif length l, expected mismatches d, projection dimension k, and bucket threshold δ
    OUTPUT: Motif
      1 In CPU, generate k random positions for projection. Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, l-1\}\}$ with $|I| = k$.
      2 In GPU, for each thread tid in $\{0, \ldots, x-1\}$:
          2.1 Get $h_I(S_{i,j})$ for each $S_{i,j}$ in S.
          2.2 Convert each k-mer $h_I(S_{i,j})$ to its corresponding integer representation $k^*_{i,j}$.
          2.3 Perform a linear search over all $k^*_{i,j}$s to determine which l-mers are 'hashed' to the same bucket. The tids of the matched $k^*_{i,j}$s are noted instead of the actual l-mers.
      3 In CPU, identify the set of enriched buckets, and prune duplicates in preparation for EM refinement.
      4 In GPU, for each tid in $\{0, \ldots, e-1\}$:
          4.1 Perform EM refinement for each enriched bucket.
          4.2 Perform SP-STARσ for each enriched bucket.
          4.3 Maximize the σ score to output the best motif.
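Steps 2.3 and 3 of the listing above can be simulated on the host to see what each thread records. The sketch below is a sequential Python stand-in, not GPU code: the loop over `tid` plays the role of the x threads, the projection strings stand in for the integer keys $k^*_{i,j}$, and the function name `v1_enriched` is hypothetical. The data are those of Example 12 with I = {0, 1}.

```python
def v1_enriched(keys, delta):
    """keys[tid] is the projection key owned by thread tid."""
    x = len(keys)
    # Step 2.3: every "thread" scans all x keys and notes the tids that match its own key.
    matches = [tuple(q for q in range(x) if keys[q] == keys[tid]) for tid in range(x)]
    # Step 3: keep groups with at least delta members and prune duplicate groups.
    return sorted({group for group in matches if len(group) >= delta})

S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
l, I = 4, (0, 1)
# Keys listed in tid order, with tid = i * (n - l + 1) + j.
keys = ["".join(seq[j + p] for p in I) for seq in S for j in range(len(seq) - l + 1)]
enriched_tids = v1_enriched(keys, delta=3)
print(enriched_tids)
```

Threads 0, 7, and 11 all record the same group, so the pruning in step 3 leaves a single enriched bucket. Run sequentially, the scan costs on the order of x² comparisons; with one thread per l-mer, each thread performs only its own linear scan of x keys.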
  • CUDA-FMURP v1: Integer Conversion
    Step 2.2 represents each $h_I(S_{i,j})$ by its corresponding integer representation $k^*_{i,j}$. Given a unique k-mer from projection, a corresponding integer is computed using the following mapping. Let us define
    $$f : \Sigma_{DNA} \to \{0, 1, 2, 3\}, \quad A \mapsto 0, \; C \mapsto 1, \; G \mapsto 2, \; T \mapsto 3,$$
    where each symbol in the DNA alphabet is mapped to a unique integer. For a string v of length k,
    $$f^* : \Sigma_{DNA}^{+} \to \mathbb{Z}^{+} \cup \{0\}, \quad v \mapsto \sum_{i=0}^{k-1} f(v_i)\, 4^i,$$
    where $v_i$ denotes the symbol at the ith position starting from the least significant digit, and the integer representation is only defined on the non-negative integers.
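A small sketch of the conversion follows. One point is an assumption rather than something the slides state: we take the least significant digit $v_0$ to be the rightmost symbol of the k-mer, as in ordinary positional notation. The names `F` and `f_star` are hypothetical.

```python
F = {"A": 0, "C": 1, "G": 2, "T": 3}  # the mapping f

def f_star(v):
    """Base-4 value of the k-mer v; v_0 is assumed to be the rightmost symbol."""
    return sum(F[c] * 4 ** i for i, c in enumerate(reversed(v)))

codes = {kmer: f_star(kmer) for kmer in ("AA", "AC", "CA", "CG", "TT")}
print(codes)
```

For example, CG maps to 1 · 4 + 2 = 6, and the largest 2-mer TT maps to 3 · 4 + 3 = 15, so every 2-mer fits in the range 0 to 15.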
  • CUDA-Projection v1: Example
    Given a set of DNA sequences, pattern length l = 4, projection dimension k = 2, and bucket threshold δ = 3, projection in parallel is shown as follows. [figure]
  • CUDA-Projection v1: Integer Conversion Example [figure]
  • CUDA-Projection: Parallel Integer Conversion Example [figure]
  • CUDA-Projection: Getting enriched buckets [figures]
  • CUDA-EM [figure]
  • CUDA-SP-STARσ [figure]
  • CUDA-FMURP v1: Correctness
    The uniqueness of the representation we defined using $f^*$ follows from the results below.
    Let $\Sigma_k = \{0, 1, 2, \ldots, k-1\}$, and let $C_k$ be the regular language
    $$C_k = \{\epsilon\} \cup (\Sigma_k - \{0\})\,\Sigma_k^{*}.$$
    Theorem 4.1 (Fundamental Theorem of base-k Representation [Allouche,2003])
    Let k ≥ 2 be an integer. Then every non-negative integer has a unique representation of the form
    $$N = \sum_{i=0}^{t} a_i k^i,$$
    where $a_t \neq 0$ and $0 \le a_i < k$ for $0 \le i \le t$.
  • CUDA-FMURP v1: Correctness
    In the case of our representation $f^*$, we have k = 4 and $a_i = f(v_i)$, where $v_i \in \Sigma_{DNA}$. Note that the mapping f is one-to-one and onto by definition. Thus we have the following:
    Proposition 4.1
    $f^*$ provides a unique representation of $h_I(S_{i,j})$, for each i, j, and element of I.
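Proposition 4.1 can be spot-checked exhaustively for small projection dimensions. The sketch below is an empirical check, not a proof; it assumes the rightmost symbol is the least significant digit, and the names `f_star` and `is_bijection` are hypothetical.

```python
from itertools import product

F = {"A": 0, "C": 1, "G": 2, "T": 3}

def f_star(v):
    """Base-4 value of the k-mer v (rightmost symbol least significant, by assumption)."""
    return sum(F[c] * 4 ** i for i, c in enumerate(reversed(v)))

def is_bijection(k):
    """True if f* maps the 4**k possible k-mers onto 0 .. 4**k - 1 without repeats."""
    codes = [f_star("".join(p)) for p in product("ACGT", repeat=k)]
    return sorted(codes) == list(range(4 ** k))

checks = [is_bijection(k) for k in range(1, 6)]
print(checks)
```

Because the k-mers compared within one run all have the same length k, leading A symbols (digit 0) cannot make two different k-mers collide.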
  • CUDA-FMURP v1: Correctness
    We have to show that the set of enriched buckets EB obtained in FMURP is equivalent to the set of enriched buckets $\overline{EB}$ obtained in CUDA-FMURP v1.
    $$EB = \{B \mid |B| \ge \delta\}$$
    Two elements $S_{i,j}$ and $S_{i',j'}$ belong to the same bucket B if they satisfy the relation R defined below.
    Definition 13 (Relation R)
    $$S_{i,j}, S_{i',j'} \in B \Leftrightarrow (S_{i,j}, S_{i',j'}) \in R$$
    $$(S_{i,j}, S_{i',j'}) \in R \Leftrightarrow h_I(S_{i,j}) = h_I(S_{i',j'})$$
    Proposition 4.2
    R is an equivalence relation.
  • CUDA-FMURP v1: Correctness
    In CUDA-FMURP v1, the set of enriched buckets is defined as
    $$\overline{EB} = \{\bar{B} \mid |\bar{B}| \ge \delta\},$$
    where $\bar{B}$ is a bucket in CUDA-FMURP, and two elements p and q belong to the same bucket $\bar{B}$ if they satisfy the relation $\bar{R}$ defined below.
    Definition 14 (Relation $\bar{R}$)
    $$p, q \in \bar{B} \Leftrightarrow (p, q) \in \bar{R}$$
    $$(p, q) \in \bar{R} \Leftrightarrow k^*_{i,j} = k^*_{\bar{i},\bar{j}}$$
    where $i = \lfloor p/(n-l+1) \rfloor$, $j = p \bmod (n-l+1)$, $\bar{i} = \lfloor q/(n-l+1) \rfloor$, and $\bar{j} = q \bmod (n-l+1)$.
    Lemma 15
    Relations R and $\bar{R}$ are equivalent.
  • CUDA-FMURP v1: Correctness
    Note that the elements of B are l-mers $S_{i,j}$, while the elements of $\bar{B}$ are integers $p \in \{0, \ldots, x-1\}$. Using the equations
    $$tid = i \times (n-l+1) + j \qquad (2)$$
    $$i = \left\lfloor \frac{tid}{n-l+1} \right\rfloor \qquad (3)$$
    $$j = tid \bmod (n-l+1) \qquad (4)$$
    we can retrieve the l-mer $S_{i,j}$ corresponding to tid and vice versa. The theorem below follows from the fact that R and $\bar{R}$ are equivalent.
    Theorem 4.2
    The sets of enriched buckets EB and $\overline{EB}$ are equivalent.
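Equations (2) to (4) describe a row-major index mapping, which the following sketch exercises for n = 8 and l = 4 (the sizes of Example 12). The function names are hypothetical.

```python
def to_tid(i, j, n, l):
    """Equation (2): linear thread index of the l-mer S_{i,j}."""
    return i * (n - l + 1) + j

def from_tid(tid, n, l):
    """Equations (3) and (4): recover (i, j) from the thread index."""
    return divmod(tid, n - l + 1)

n, l, t = 8, 4, 3
pairs = [(i, j) for i in range(t) for j in range(n - l + 1)]
round_trip = all(from_tid(to_tid(i, j, n, l), n, l) == (i, j) for i, j in pairs)
print(to_tid(1, 2, n, l), from_tid(11, n, l), round_trip)
```

The mapping is a bijection between the pairs (i, j) and the integers 0 to x − 1, with x = t · (n − l + 1) = 15 here, which is what lets thread indices stand in for l-mers in the buckets.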
  • Parallel Motif Finding using Random Projection (CUDA-FMURP v2)
    INPUT: Set of sequences S, motif length l, expected mismatches d, projection dimension k, and bucket threshold δ
    OUTPUT: Motif
      1 In CPU, generate k random positions for projection. Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, l-1\}\}$ with $|I| = k$.
      2 In GPU, for each thread tid in $\{0, \ldots, x-1\}$:
          2.1 Get $h_I(S_{i,j})$ for all $S_{i,j}$ in S, where $i \in \{0, \ldots, t-1\}$ and $j \in \{0, \ldots, n-l\}$.
          2.2 Convert each k-mer $h_I(S_{i,j})$ to its corresponding integer representation $k^*_{i,j}$.
      3 In CPU, hash the list of $k^*_{i,j}$s.
      4 In CPU, identify the set of enriched buckets.
      5 In GPU, for each tid in $\{0, \ldots, e-1\}$:
          5.1 Perform EM refinement for each enriched bucket.
          5.2 Perform SP-STARσ for each enriched bucket.
          5.3 Maximize the σ score to output the best motif.
  • CUDA-FMURP v2: Hash Table in CPU [figures]
  • CUDA-FMURP v2: Hash Table in CPU
    To avoid collisions between two items with different $k^*_{i,j}$s, linear probing is implemented.
    Suppose we hash item p with key $k^*_{i',j'}$ and find that position $h(k^*_{i',j'})$ is not empty, i.e., $\exists\, k^*_{i,j}$ such that $h(k^*_{i,j}) = h(k^*_{i',j'})$ and $k^*_{i,j} \neq k^*_{i',j'}$.
    We then have to look for an empty position in the table where we can place item p. We explore the positions
    $$h'(k^*_{i',j'}, i) = (h(k^*_{i',j'}) + i) \bmod x$$
    for i from 0 to (m − 1), until an empty hash table position is found.
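The probing scheme can be sketched as a small host-side table. Several details here are assumptions, since the slides do not give them: the base hash h is taken to be the key modulo the table size, items with equal keys share one slot (so a slot acts as a bucket), and the class name `ProbingTable` is hypothetical.

```python
class ProbingTable:
    """Open-addressing table: one slot per distinct key, linear probing on collision."""

    def __init__(self, size):
        self.size = size
        self.keys = [None] * size
        self.items = [[] for _ in range(size)]

    def _slot(self, key):
        home = key % self.size  # assumed base hash h(key)
        for i in range(self.size):
            pos = (home + i) % self.size  # probe sequence (h(key) + i) mod size
            if self.keys[pos] is None or self.keys[pos] == key:
                return pos
        raise RuntimeError("hash table is full")

    def insert(self, key, item):
        pos = self._slot(key)
        self.keys[pos] = key
        self.items[pos].append(item)
        return pos

    def enriched(self, delta):
        """Buckets (slots) that hold at least delta items."""
        return {k: v for k, v in zip(self.keys, self.items)
                if k is not None and len(v) >= delta}

table = ProbingTable(4)
positions = [table.insert(1, "a"), table.insert(5, "b"),
             table.insert(1, "c"), table.insert(9, "d")]
print(positions, table.enriched(2))
```

Keys 1, 5, and 9 all hash to slot 1 in a table of size 4, so 5 and 9 are pushed to slots 2 and 3, while the second item with key 1 joins the existing bucket in slot 1.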
  • Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
    INPUT: Set of sequences S, motif length l, expected mismatches d, projection dimension k, and bucket threshold δ
    OUTPUT: Motif
      1 In CPU, generate k random positions for projection. Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, l-1\}\}$ with $|I| = k$.
      2 In GPU, for each thread tid in $\{0, \ldots, t-1\}$:
          2.1 Get $h_I(S_{tid,j})$ for all $S_{tid,j}$ in S, where $j \in \{0, \ldots, n-l\}$.
          2.2 Convert each k-mer $h_I(S_{tid,j})$ to its corresponding integer representation $k^*_{tid,j}$.
      3 In CPU, hash the list of $k^*_{i,j}$s.
      4 In CPU, identify the set of enriched buckets.
      5 In GPU, for each tid in $\{0, \ldots, e-1\}$:
          5.1 Perform EM refinement for each enriched bucket.
          5.2 Perform SP-STARσ for each enriched bucket.
          5.3 Maximize the σ score to output the best motif.
  • Parallel Motif Finding using Random Projection (CUDA-FMURP v3): CUDA-Projection v3 [figure]
  • Parallel Motif Finding using Random Projection (CUDA-FMURP v3): Integer Conversion [figure]
  • Result and Analysis: Running Time and Space Complexity

    Algorithm        Time          Space              Number of Processors
    FMURP            O(x log x)    O(x)               1
    SEQ-FMURP        O(x^2)        O(e(n − l + 1))    1
    CUDA-FMURP v1    O(x)          O(e(n − l + 1))    x
    CUDA-FMURP v2    O(x)          O(e(n − l + 1))    x
    CUDA-FMURP v3    O(x)          O(e(n − l + 1))    t

    Table: Total running time and space complexity of the three parallel algorithms for CUDA-FMURP in comparison with the two sequential implementations.
  • Result and Analysis: Speedup and Efficiency
    FMURP: $O(x \log x)$
    Speedup is the ratio of the sequential running time to the parallel running time:
    $$S_P = \frac{\text{Sequential}}{\text{Parallel}}$$
    The speedups $S_{P_1}$, $S_{P_2}$, and $S_{P_3}$ for CUDA-FMURP versions 1 to 3, respectively, compare as follows:
    $$S_{P_1} = S_{P_2} = S_{P_3} = \frac{O(x \log x)}{O(x)} = O(\log x)$$
  • Result and Analysis: Speedup and Efficiency
    Processor efficiency is computed from the speedup $S_P$ and the number of processors used, $\hat{p}$:
    $$E_P = \frac{1}{\hat{p}} \cdot S_P$$
    The efficiencies $E_{P_1}$, $E_{P_2}$, and $E_{P_3}$ for CUDA-FMURP versions 1 to 3, respectively, compare as follows:
    $$E_{P_1} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x} \tag{5}$$
    $$E_{P_2} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x} \tag{6}$$
    $$E_{P_3} = \frac{1}{t} \cdot O(\log x) = \frac{\log x}{t} \tag{7}$$
    $$E_{P_1} = E_{P_2} < E_{P_3}$$
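To get a feel for the size of these quantities, the formulas can be evaluated numerically. This is only a rough sketch: it assumes that $x$ counts all l-mers in the input, $x = t(n - l + 1)$, which these slides do not state, and it drops the constants hidden by the O-notation and uses base-2 logarithms.

```python
import math

def speedup(x):
    """Asymptotic speedup over FMURP: (x log x) / x = log x, constants ignored."""
    return (x * math.log2(x)) / x

def efficiency(x, processors):
    """E_P = S_P / p for p processors."""
    return speedup(x) / processors

t, n, l = 20, 600, 10
x = t * (n - l + 1)          # assumption: x counts all l-mers in the input
e_v1 = efficiency(x, x)      # versions 1 and 2 use x processors
e_v3 = efficiency(x, t)      # version 3 uses t processors
```

Because $t$ is much smaller than $x$ for inputs of this shape, the same speedup divided by fewer processors gives version 3 the higher efficiency.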
  • Result and Analysis: Dataset

    t     n      l     d    Instances generated
    20    600    10    2    100
    20    600    11    2    100
    20    600    12    3    100
    20    600    13    3    100
    20    600    14    4    100
    20    600    15    4    100
    20    600    16    5    100
    20    600    17    5    100
    20    600    18    6    100
    20    600    19    6    100

    Table: Summary of the generated dataset used to determine the accuracy of CUDA-FMURP. For each generated instance, the search model OOPS is assumed, that is, each sequence contains exactly one occurrence of the planted motif.
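An instance of this kind could be generated along the following lines. This is a hypothetical generator, not the one used for the dataset: it assumes uniformly random background sequences and that each planted occurrence differs from the motif in exactly $d$ positions, and the function name and return values are made up for the example.

```python
import random

def plant_instance(t, n, l, d, rng):
    """Return (motif, sequences, positions) for one OOPS instance."""
    alphabet = "ACGT"
    motif = "".join(rng.choice(alphabet) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        seq = [rng.choice(alphabet) for _ in range(n)]
        occurrence = list(motif)
        for p in rng.sample(range(l), d):          # mutate exactly d positions
            occurrence[p] = rng.choice([c for c in alphabet if c != motif[p]])
        start = rng.randrange(n - l + 1)           # one occurrence per sequence
        seq[start:start + l] = occurrence
        sequences.append("".join(seq))
        positions.append(start)
    return motif, sequences, positions

motif, sequences, positions = plant_instance(20, 600, 10, 2, random.Random(0))
```

Keeping the planted positions alongside the sequences makes it possible to check afterwards whether a reported motif matches the planted one.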
  • Result and Analysis: Accuracy

    t     n      l     d    FMURP    FMURP*    SEQ-FMURP    CUDA-FMURP    m
    20    600    10    2    13       100       98           98            72
    20    600    11    2    99       100       100          100           16
    20    600    12    3    3        96        83           83            259
    20    600    13    3    81       100       100          100           62
    20    600    14    4    1        86        79           79            645
    20    600    15    4    49       100       100          100           172
    20    600    16    5    0        77        53           53            1292
    20    600    17    5    19       98        98           98            378
    20    600    18    6    0        82        38           38            2217
    20    600    19    6    9        98        94           94            711

    Table: Number of correctly identified planted motifs over 100 random input instances. For each instance, parameters k = 7 and s = 4 are used. The column labelled FMURP* is based on the results presented in [Tompa,2001] using the dataset they generated.
  • Result and Analysis: Machine Setups

    System specifications       Values
    Host processors (procs)     Core(TM) i7-2600 CPU 3.40GHz
    Total number of cores       4 × 2 (hyperthreaded) = 8
    Max host RAM                8GB
    Device/s (GPU/s)            1 × NVIDIA GeForce GTX 580
    Compute capability          2.0
    CUDA Cores/GPU              16 (multiprocs) × 32 (cores/proc) = 512
    GPU clock rate              1.54 GHz
    Memory clock rate           2004 MHz
    Max device global memory    1535MB
    Operating system            64-bit Ubuntu 10.0.4
    CUDA version                4.1

    System specifications       Values
    Host processors (procs)     2 × Intel Quad-core 2.26GHz
    Total number of cores       8
    Max host RAM                12GB
    Device/s (GPU/s)            2 × NVIDIA GT120
    Compute capability          1.1
    CUDA Cores/GPU              4 (multiprocs) × 8 (cores/proc) = 32
    GPU clock rate              1.40 GHz
    Memory clock rate           500 MHz
    Max device global memory    512MB
    Operating system            Mac OS X 10.6.8
    CUDA version                3.2
  • Result and Analysis, Actual Speedup: Actual speed of CUDA-Projection v3 with respect to CUDA-Projection v1 [figure]
  • Result and Analysis, Actual Speedup: Actual speed of CUDA-FMURP v1 and CUDA-Projection v3 [figure]
  • Result and Analysis, Actual Speedup: Actual Speed Result: Setup 1 [figure]
  • Result and Analysis, Actual Speedup: Memory Requirement [figure]
  • Result and Analysis, Actual Speedup: Actual speed comparison and speedup of CUDA-FMURP v1 with respect to SEQ-FMURP and FMURP using Setup 2 [figure]
  • Conclusion
    In this work, we presented three versions of parallel algorithms for FMURP.

    Algorithm        Processors    SP wrt FMURP    SP wrt SEQ-FMURP    Efficiency
    CUDA-FMURP v1    x             O(log x)        O(x)                (log x)/x
    CUDA-FMURP v2    x             O(log x)        O(x)                (log x)/x
    CUDA-FMURP v3    t             O(log x)        O(x)                (log x)/t

    We implemented CUDA-FMURP v1 and CUDA-FMURP v2 and achieved maximum actual speedups of 6.8 and 6.6, respectively, with respect to SEQ-FMURP.
  • References
    J.P. Allouche and J. Shallit, “Automatic Sequences: Theory, Applications, and Generalizations”, Cambridge University Press, Chapter 3: Numeration Systems, pp. 70-73, 2003.
    P. Pevzner and S. H. Sze, “Combinatorial Approaches to Finding Subtle Signals in DNA Sequences”, Proceedings of the 8th Int. Conf. Intelligent Systems for Molecular Biology (ISMB), pp. 269-78, 2000.
    J. Buhler and M. Tompa, “Finding Motifs Using Random Projections”, RECOMB ’01: Proceedings of the Fifth Annual International Conference on Computational Biology, 2001.
    D. Kirk and W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach”, 1st ed., MA, USA: Morgan Kaufmann, 2010.
    M. Harris, “Mapping computational concepts to GPUs”, ACM SIGGRAPH 2005 Courses, NY, USA, 2005.
    N. Jones and P. Pevzner, “An Introduction to Bioinformatics Algorithms”, Massachusetts Institute of Technology Press, 2004.
  • Extra Slides: Finding Motifs using Random Projection (FMURP)
    INPUT: Set of sequences $S$, motif length $l$, expected mismatches $d$, projection dimension $k$, and bucket threshold $\delta$
    OUTPUT: Motif
    1 Projection
        1 Generate $k$ random positions for projection. Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l - 1)\}\}$ with $|I| = k$.
        2 For each $S_{i,j}$ in $S$,
            1 Get $h_I(S_{i,j})$ from all $S_{i,j}$ in $S$, where $i \in \{0, \ldots, (t - 1)\}$ and $j \in \{0, \ldots, (n - l)\}$.
            2 Sort the $S_{i,j}$s with respect to $h_I(S_{i,j})$.
            3 Perform a linear search over all $h_I(S_{i,j})$s to determine which l-mers are ‘hashed’ into the same bucket.
    2 Refine each enriched bucket using Expectation Maximization (EM).
    3 Refine each enriched bucket using SP-STARσ.
    4 Maximize the score to output the best motif.
  • Extra Slides, Projection: Example
    Given a set of DNA sequences $S$, pattern length $l = 4$, projection dimension $k = 2$, and bucket threshold $\delta = 3$.

    S0 : C G G T C A G G
    S1 : T T C G A C A T
    S2 : A C G A T G A A

    Figure: Set of $t = 3$ sequences, each with $n = 8$

    We generate the set of $k$ random positions used in the actual projection. Suppose we have the set $I = \{0, 1\}$.
    For all $S_{i,j}$ in $S$, we get $h_I(S_{i,j})$ using the random positions in $I$ generated in step 1.
    To hash the $S_{i,j}$s to their corresponding buckets using $h_I(S_{i,j})$, the list defined above is sorted lexicographically by $h_I(S_{i,j})$, together with the corresponding $S_{i,j}$s. The sorted list is obtained.
  • Extra Slides, Projection: Example

    Label    S_{i,j}    h_I(S_{i,j})    |    Label    Sorted S_{i,j}    Sorted h_I(S_{i,j})
    S0,0     CGGT       CG              |    S2,0     ACGA              AC
    S0,1     GGTC       GG              |    S1,4     ACAT              AC
    S0,2     GTCA       GT              |    S2,3     ATGA              AT
    S0,3     TCAG       TC              |    S0,4     CAGG              CA
    S0,4     CAGG       CA              |    S0,0     CGGT              CG
    S1,0     TTCG       TT              |    S2,1     CGAT              CG
    S1,1     TCGA       TC              |    S1,2     CGAC              CG
    S1,2     CGAC       CG              |    S1,3     GACA              GA
    S1,3     GACA       GA              |    S2,2     GATG              GA
    S1,4     ACAT       AC              |    S0,1     GGTC              GG
    S2,0     ACGA       AC              |    S0,2     GTCA              GT
    S2,1     CGAT       CG              |    S1,1     TCGA              TC
    S2,2     GATG       GA              |    S0,3     TCAG              TC
    S2,3     ATGA       AT              |    S2,4     TGAA              TG
    S2,4     TGAA       TG              |    S1,0     TTCG              TT

    Figure: Illustration showing the set of $h_I(S_{i,j})$s computed from step 2. The sorted list is shown in the right-hand columns.
  • Extra Slides, Projection: Example
    To get the list of buckets, we perform a linear search over the $h_I(S_{i,j})$s to collect the $S_{i,j}$s with equal $h_I(S_{i,j})$.

    h_I(S_{i,j})    Count    S_{i,j}
    AC              2        {ACGA, ACAT}
    AT              1        {ATGA}
    CA              1        {CAGG}
    CG              3        {CGGT, CGAT, CGAC}
    GA              2        {GACA, GATG}
    GG              1        {GGTC}
    GT              1        {GTCA}
    TC              2        {TCGA, TCAG}
    TG              1        {TGAA}
    TT              1        {TTCG}

    Figure: Buckets obtained from Projection
  • Extra Slides, Projection: Example
    From the set of buckets obtained, we identify those that contain at least $\delta$ hashed l-mers and consider them enriched. With $\delta = 3$, only bucket CG, with count 3 and l-mers {CGGT, CGAT, CGAC}, is enriched.
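The projection example can be reproduced with a short Python sketch. It groups l-mers with a dictionary instead of the sort-and-scan procedure described in the slides, which yields the same buckets; the function name and signature are made up for the illustration.

```python
from collections import defaultdict

def enriched_buckets(sequences, l, positions, delta):
    """Group l-mers by projection and keep buckets with at least delta l-mers."""
    buckets = defaultdict(list)
    for seq in sequences:
        for j in range(len(seq) - l + 1):
            lmer = seq[j:j + l]
            key = "".join(lmer[p] for p in positions)   # h_I(S_{i,j})
            buckets[key].append(lmer)
    return {key: lmers for key, lmers in buckets.items() if len(lmers) >= delta}

S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
result = enriched_buckets(S, l=4, positions=[0, 1], delta=3)
```

For the three example sequences, `result` holds a single bucket, keyed by CG.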
  • Extra Slides: Expectation Maximization (EM)
    INPUT: Motif model $\theta_0$ from one enriched bucket, maximum number of iterations, and threshold for convergence $\delta_{EM}$
    OUTPUT: Motif model $\theta_y$
    1 For $j$ in $\{1, \ldots, y\}$ or until convergence
        1 (E-step) For all l-mers in each sequence $S_i$, compute $E(S_{i,a_i} \mid \theta_j)$ given the current motif model.
        2 (M-step) For all $S_i$ in $S$, get the starting positions $s$ such that for each $a_i \in s$, $E(S_{i,a_i} \mid \theta_j)$ is maximum over all $a_i$ in $\{0, \ldots, (n - l)\}$.
        3 (Test for Convergence) Compute $L(\theta_j)$. Compare the previous likelihood $L(\theta_{j-1})$ to the current $L(\theta_j)$. If the difference satisfies the threshold $\delta_{EM}$, stop iterating.
        4 (Update step) For the alignment made by the starting position vector $s$ identified in the M-step, get the motif model $\theta_{j+1}$.
  • Extra Slides, Expectation Maximization (EM): Example
    From the set of enriched buckets obtained from Projection, EM performs the following operations.
    From $E_B$, get the alignment made by the hashed l-mers.

    C G G T
    C G A C
    C G A T

    From this alignment, a profile matrix is computed.

    A: 0 0 2 0
    C: 3 0 0 1
    G: 0 3 1 0
    T: 0 0 0 2
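The profile matrix above is a table of per-column symbol counts. A minimal Python sketch, with a hypothetical function name, that rebuilds it from the three aligned l-mers:

```python
def profile_counts(alignment):
    """Count how often each symbol occurs in each column of the alignment."""
    length = len(alignment[0])
    counts = {symbol: [0] * length for symbol in "ACGT"}
    for lmer in alignment:
        for column, symbol in enumerate(lmer):
            counts[symbol][column] += 1
    return counts

counts = profile_counts(["CGGT", "CGAC", "CGAT"])
```

Each column of the result sums to the number of aligned l-mers, here 3, which is the value used when the counts are later normalized into frequencies.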
  • Extra Slides, Expectation Maximization (EM): Example
    Normalize the profile matrix obtained.

    A: 0.00 0.00 0.33 0.00
    C: 1.00 0.00 0.00 0.33
    G: 0.00 1.00 0.66 0.00
    T: 0.00 0.00 0.00 0.66

    To avoid zero values for $\Pr(S_{i,j} \mid \theta)$, [Tompa,2001] performed Laplace correction. For each row corresponding to a symbol, say $a$, the probability $p_a$ that symbol $a$ appears in the sequence is added to every entry of that row. Since all symbols in $\Sigma_{DNA}$ have a uniform frequency distribution, 0.25 is added to each row.

    A: 0.25 0.25 0.58 0.25
    C: 1.25 0.25 0.25 0.58
    G: 0.25 1.25 0.91 0.25
    T: 0.25 0.25 0.25 0.91
  • Extra Slides, Expectation Maximization (EM): Example
    Normalize the matrix obtained and let the resulting matrix be the initial motif model $\theta_0$.

    A: 0.125 0.125 0.290 0.125
    C: 0.625 0.125 0.125 0.290
    G: 0.125 0.625 0.455 0.125
    T: 0.125 0.125 0.125 0.455

    For each $S_i$ in $S$, get the $j \in \{0, \ldots, (n - l)\}$ for which $E(S_{i,j} \mid \theta_0)$ is maximum. For instance, let us identify the l-mer in sequence $S_0$ with maximum expectation $E(S_{0,j} \mid \theta_0)$.

    E(S0,0 | θ0) = E(CGGT | θ0) = ((0.625)(0.625)(0.455)(0.455)) / (0.25^4) = 20.703
    E(S0,1 | θ0) = E(GGTC | θ0) = ((0.125)(0.625)(0.125)(0.290)) / (0.25^4) = 0.725
    E(S0,2 | θ0) = E(GTCA | θ0) = ((0.125)(0.125)(0.125)(0.125)) / (0.25^4) = 0.063
    E(S0,3 | θ0) = E(TCAG | θ0) = ((0.125)(0.125)(0.290)(0.125)) / (0.25^4) = 0.145
    E(S0,4 | θ0) = E(CAGG | θ0) = ((0.625)(0.125)(0.455)(0.125)) / (0.25^4) = 1.138

    Of all $S_{0,j}$s in $S_0$, l-mer $S_{0,0}$ obtains the highest expectation.
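The expectation used in this step is the product, over the columns of the motif model, of the model probability of each symbol divided by the background probability 0.25. A minimal Python sketch that takes $\theta_0$ exactly as printed above; the function names are hypothetical.

```python
THETA0 = {
    "A": [0.125, 0.125, 0.290, 0.125],
    "C": [0.625, 0.125, 0.125, 0.290],
    "G": [0.125, 0.625, 0.455, 0.125],
    "T": [0.125, 0.125, 0.125, 0.455],
}

def expectation(lmer, theta, background=0.25):
    """Likelihood ratio of lmer under the motif model versus the background."""
    ratio = 1.0
    for column, symbol in enumerate(lmer):
        ratio *= theta[symbol][column] / background
    return ratio

def best_start(seq, l, theta):
    """Return the start j maximizing the expectation, and that maximum."""
    scores = [expectation(seq[j:j + l], theta) for j in range(len(seq) - l + 1)]
    j = max(range(len(scores)), key=scores.__getitem__)
    return j, scores[j]

j, score = best_start("CGGTCAGG", 4, THETA0)
```

For $S_0$ the selected start is $j = 0$, the l-mer CGGT. Running `best_start` once per sequence gives the starting position vector used in the M-step.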
  • Extra Slides, Expectation Maximization (EM): Example
    The set of l-mers with the highest expectation in each sequence defines another alignment, as in Step 1. From this set of l-mers, we can obtain the next motif model $\theta_1$.

    S0,0 : C G G T : 20.70
    S1,2 : C G A C : 8.41
    S2,1 : C G A T : 13.20

    We compute the likelihood of a motif model $\theta_y$ using the best expectations.

    L(θ) = 20.70 + 8.41 + 13.20 = 42.31

    Update the motif model $\theta_0$ to get $\theta_1$, using the set of l-mers from each sequence that maximize the expectation.
    Stop iterating if $L(\theta_y) - L(\theta_{y-1}) \leq \delta_{EM}$.
    The output of EM in this example is the consensus string CGAT.
  • Extra Slides: SP-STARσ
    INPUT: Consensus string $M$ from $\theta_y$ and expected mismatches $d$
    OUTPUT: Refined consensus string $M^*$
    1 For $j$ in $\{1, \ldots, y\}$ or until convergence
        1 Compute $S_b$, the set containing, from each sequence, the l-mer with the least edit distance from $M$: $S_b = \{S_{i,j} \mid d_E(M, S_{i,j}) \text{ is minimum } \forall S_{i,j} \text{ in } S_i\}$
        2 Compute the score $\sigma(S_b)$, which is the number of l-mers in $S_b$ such that $d_E(M, S_{i,j}) \leq d$.
        3 Compute the consensus string $M'$ from the alignment made by $S_b$.
        4 Compute $S'_b$ from $M'$.
        5 Compute $\sigma(S'_b)$.
        6 If $\sigma(S'_b) > \sigma(S_b)$, continue iterating using $M = M'$; otherwise $M^* = M$.
  • Extra Slides, SP-STARσ: Example
    Using $M$ = CGAT and expected mismatches $d = 1$.
    Compute $S_b$. For $S_0$, the $S_{0,j}$ is identified as follows.

    d_E(M, S0,0) = d_E(CGAT, CGGT) = 1
    d_E(M, S0,1) = d_E(CGAT, GGTC) = 3
    d_E(M, S0,2) = d_E(CGAT, GTCA) = 4
    d_E(M, S0,3) = d_E(CGAT, TCAG) = 3
    d_E(M, S0,4) = d_E(CGAT, CAGG) = 3

    The set $S_b$ contains
    S_b = {S0,0, S1,2, S2,1}
    S_b = {CGGT, CGAC, CGAT}
  • Extra Slides, SP-STARσ: Example
    The score for $S_b$ is $\sigma(S_b) = 3$ because the least edit distances in the three sequences are 1, 1, and 0. That is, all 3 sequences satisfy $d_E(M, S_{i,j}) \leq 1$.
    The consensus string from $S_b$ is $M'$ = CGAT.
    $S'_b$ from $M'$ is the same as $S_b$.
    S'_b = {S0,0, S1,2, S2,1}
    S'_b = {CGGT, CGAC, CGAT}
    Since $\sigma(S'_b) = \sigma(S_b)$, $M^* = M$ = CGAT.
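One SP-STARσ iteration on this example can be sketched in Python. The sketch assumes that $d_E$ counts mismatching positions between two equal-length strings, which matches the distances listed in the example; ties are broken by taking the first candidate, and all function names are hypothetical.

```python
from collections import Counter

def distance(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def closest_lmers(sequences, motif):
    """S_b: from each sequence, the l-mer closest to motif (first one on ties)."""
    l = len(motif)
    return [min((seq[j:j + l] for j in range(len(seq) - l + 1)),
                key=lambda lmer: distance(motif, lmer))
            for seq in sequences]

def sigma(lmers, motif, d):
    """Score: how many chosen l-mers lie within d mismatches of motif."""
    return sum(distance(motif, lmer) <= d for lmer in lmers)

def consensus(lmers):
    """Most frequent symbol per column (ties broken by first occurrence)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*lmers))

S = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
Sb = closest_lmers(S, "CGAT")
score = sigma(Sb, "CGAT", d=1)
refined = consensus(Sb)
```

The consensus of $S_b$ equals the input motif here, so a second pass would select the same l-mers and the score would not improve, which is the stopping condition of the algorithm.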