This document discusses parallelizing the Finding Motifs using Random Projection (FMURP) algorithm, which solves the Planted (l, d)-Motif Problem, on GPUs. It presents three versions of the parallel FMURP algorithm, discusses the correctness of the parallelization, and implements two of the versions on GPUs with theoretical and experimental performance analyses.
Parallel Random Projection for Motif Discovery on GPUs
1. Finding Planted (l, d)-Motifs in Parallel
using Random Projection on GPUs
Jhoirene Barasi Clemente
Algorithms and Complexity Laboratory
Department of Computer Science
University of the Philippines-Diliman
jbclemente@up.edu.ph
March 31, 2012
J.B. Clemente (ACLab, DCS, UPD) CUDA-FMURP March 31, 2012 1 / 88
2. Overview
Introduction
Definitions and Notations
Finding Motifs using Random Projection (FMURP)
Parallel Implementations of CUDA-FMURP
Results and Analysis
Conclusion
3. Introduction
This work solves the Planted (l, d)-Motif Problem with the Finding Motifs using Random Projection (FMURP) algorithm.
The focus of the study is the parallelization of FMURP: we present three versions of the parallel algorithm and discuss the correctness of the parallelization.
We implement two of these parallel algorithms on GPUs and present both theoretical and measured performance analyses.
4. Introduction
A DNA motif is a nucleic acid sequence pattern that has some biological significance, such as being a DNA binding site for a regulatory protein, i.e., a transcription factor [Das, 2007].
5. Introduction
DNA Sequences as Strings
6. Introduction
The pattern is fairly short (5 to 20 base pairs (bp) long) and is known to recur in different genes or several times within a gene [Rombauts, 1999].
7. Introduction Notations
Notations
Let S be a set of t sequences.
Example 1 (Sequences S = {S0, S1, ..., S(t−1)})
S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
S1 : T T T G A G G G T G C C C A A T A A A T G C C A C T C C A A A G C G G A C A A A
S2 : G G A T G C A A C T G A T G C C G T T T G A C G A C C T A A A T C A A C G G C C
S3 : A A G G A T G C A A C T C C A G G A G C G C C T T T G C T G G T T C T A C C T G
S4 : A A T T T T C T A A A A A G A T T A T A A T G T C G G T C C A T G C A A C T T C
S5 : C T G C T G T A C A A C T G A G A T C A T G C T G C A T G C A A C T T T C A A C
S6 : T A C A T G A T C T T T T G A T G C A A C G T G G A T G A G G G A A T G A T G C
Set of sequences S = {S0 , S1 , S2 , S3 , S4 , S5 , S6 }
defined over ΣDNA = {A, C, T, G},
where each sequence Si in S has length ni = 40 for all i ∈ {0, . . . , (t − 1)}
8. Introduction Notations
Notations
An l-mer is a string of length l defined over ΣDNA .
To denote an l-mer in S, we use
Si,j , where i ∈ {0, 1, . . . , (t − 1)} is the sequence number
and j ∈ {0, 1, . . . , (n − l)} is the starting position in Si .
Example 2 (Si,j in S)
For instance, an 8-mer S0,7 is
ATGGAACT
S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
9. Introduction Notations
Notations
Let s = (a0 , a1 , . . . , a(t−1) ) be the vector of starting positions in S,
where ai ∈ {0, 1, . . . , (n − l)}.
Let A(s) denote the alignment made by the l-mers in the set
{S0,a0 , S1,a1 , . . . , S(t−1),a(t−1) }.
10. Introduction Notations
Notations
Example 3 (Alignment matrix A(s))
Suppose we have a starting position vector s = (7, 18, 2, 4, 30, 26, 14).
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
S0 : C G G G G C T A T G G A A C T G G G T C G T C A C A T T C C C C T T T C G A T A
S1 : T T T G A G G G T G C C C A A T A A A T G C C A C T C C A A A G C G G A C A A A
S2 : G G A T G C A A C T G A T G C C G T T T G A C G A C C T A A A T C A A C G G C C
S3 : A A G G A T G C A A C T C C A G G A G C G C C T T T G C T G G T T C T A C C T G
S4 : A A T T T T C T A A A A A G A T T A T A A T G T C G G T C C A T G C A A C T T C
S5 : C T G C T G T A C A A C T G A G A T C A T G C T G C A T G C A A C T T T C A A C
S6 : T A C A T G A T C T T T T G A T G C A A C G T G G A T G A G G G A A T G A T G C
11. Introduction Notations
Notations
A profile matrix P(s) with dimension (|ΣDNA | × l) is derived
from the frequency of each letter in each column of A(s).
Example 4 (Profile Matrix P(s))
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1
12. Introduction Notations
Notations
From P(s), we define MP(s) (j), where 0 ≤ j ≤ (l − 1), to be the maximum
entry in the jth column of the profile matrix.
Example 5 (MP(s) (j))
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1
MP(s): 7 7 7 6 6 7 7 6
13. Introduction Notations
Notations
A consensus string is an l-mer, where each of its elements is the
nucleotide base corresponding to MP(s) (i).
Example 6 (Consensus String)
A(s):
S0,7 : A T G G A A C T
S1,18 : A T G C C A C T
S2,2 : A T G C A A C T
S3,4 : A T G C A A C T
S4,30 : A T G C A A C T
S5,26 : A T G C A A C T
S6,14 : A T G C A A C G
P(s):
A: 7 0 0 0 6 7 0 0
T: 0 7 0 0 0 0 0 6
C: 0 0 0 6 1 0 7 0
G: 0 0 7 1 0 0 0 1
Consensus String: A T G C A A C T
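The notation in Examples 3 to 6 can be checked with a short Python sketch. It builds the profile matrix from the alignment, picks the most frequent base per column, and sums the column maxima. The function names are illustrative only, and taking the score to be the sum of the column maxima is an assumption based on the worked Score(s, S) computation on the "Why challenging?" slide.

```python
from collections import Counter

ALPHABET = "ATCG"  # row order used in the profile matrices of the examples

def profile(alignment):
    """Count, for every column, how often each base occurs."""
    length = len(alignment[0])
    counts = [Counter(row[j] for row in alignment) for j in range(length)]
    return {base: [c[base] for c in counts] for base in ALPHABET}

def consensus_and_score(alignment):
    """Return the consensus string and the sum of the column maxima."""
    p = profile(alignment)
    consensus = []
    score = 0
    for j in range(len(alignment[0])):
        # Ties are broken by ALPHABET order; this is an arbitrary choice.
        best = max(ALPHABET, key=lambda b: p[b][j])
        consensus.append(best)
        score += p[best][j]
    return "".join(consensus), score

# The alignment A(s) for s = (7, 18, 2, 4, 30, 26, 14).
alignment = ["ATGGAACT", "ATGCCACT", "ATGCAACT", "ATGCAACT",
             "ATGCAACT", "ATGCAACT", "ATGCAACG"]

p = profile(alignment)
cons, score = consensus_and_score(alignment)
print(p["C"])        # [0, 0, 0, 6, 1, 0, 7, 0]
print(cons, score)   # ATGCAACT 53
```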
16. Introduction Motif Finding Problem
Motif Finding Problem
Definition 8 (Motif Finding Problem [Pevzner,2004])
INPUT:
A motif length l
A set of t sequences S = {S0 , S1 , S2 , . . . , S(t−1) },
where each Si is of length ni
OUTPUT:
An array of starting positions s = (a0 , a1 , . . . , a(t−1) )
maximizing consensus Score(s,S)
17. Introduction Motif Finding Problem
Naive MFP Solver [Pevzner,2004]
Input: DNA (sequences), motif length l
Output: Starting position vector s and the consensus string corresponding to s
1 For each possible starting position vector s in S,
  i.e. s ∈ {(0, 0, . . . , 0), . . . , ((n − l), (n − l), . . . , (n − l))}:
    1.1 Get alignment A(s).
    1.2 Compute P(s).
    1.3 Evaluate Score(s, S).
2 From the s with the maximum Score, get the consensus string.
3 Output the consensus string.
Step 1 iterates (n − l + 1)^t times, because the set of all possible starting
position vectors is
{s = (a0 , a1 , . . . , a(t−1) ) | ai ∈ {0, . . . , (n − l)} for all i}.
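The exhaustive search can be sketched in a few lines of Python to make the (n − l + 1)^t cost concrete. This is a minimal illustration, not the implementation from [Pevzner,2004]. The function names are hypothetical, the score is assumed to be the sum of the column maxima of the profile, and the three toy sequences are borrowed from the FMURP example later in the deck.

```python
from itertools import product

def score(seqs, starts, l):
    """Sum over columns of the count of the most frequent base."""
    total = 0
    for col in range(l):
        column = [seq[a + col] for seq, a in zip(seqs, starts)]
        total += max(column.count(b) for b in "ACGT")
    return total

def naive_motif_search(seqs, l):
    """Try every starting position vector and keep one with maximum score."""
    ranges = [range(len(seq) - l + 1) for seq in seqs]
    best_starts = max(product(*ranges), key=lambda s: score(seqs, s, l))
    return best_starts, score(seqs, best_starts, l)

seqs = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]   # t = 3, n = 8
starts, best = naive_motif_search(seqs, 4)     # examines 5**3 = 125 vectors
print(starts, best)
```

Even for this toy input the search visits 125 candidate vectors. For t = 20 and n = 600 the same loop would need 586^20 iterations, which is why the deck moves on to random projection.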
19. Introduction Planted (l, d)-Motif Finding Problem
Definitions
Definition 9 (Challenge Problem [Pevzner,2000])
INPUT:
Motif length l = 15,
Expected mismatches d = 4,
20 DNA sequences each with ni = 600 nucleotide bases
OUTPUT:
A consensus string M from an alignment A(s), where each l-mer Si,ai in A(s)
has
dE (M, Si,ai ) = 4,
for all i ∈ {0, . . . , (t − 1)}.
20. Introduction Planted (l, d)-Motif Finding Problem
Why challenging?
Suppose we have A(s),
S0,a0 : A C T T G G G G C A A G A G G
S1,a1 : G G A C G G G G C A G A C T G
S2,a2 : A C T T G C T A A A G A C T G
S3,a3 : A C T G C G G G C A C A G T G
S4,a4 : A C C T G G G T C G T A C T G
P(s):
A: 4 0 1 0 0 0 0 1 1 4 1 4 1 0 0
C: 0 4 1 1 1 1 0 0 4 0 1 0 3 0 0
T: 0 0 3 3 0 0 1 1 0 0 1 0 0 4 0
G: 1 1 0 1 4 4 4 3 0 1 2 1 1 1 5
Consensus: A C T T G G G G C A G A C T G
dE (S0,a0 , S1,a1 ) = 2d = 8
Score(s, S) = 4 + 4 + 3 + 3 + 4 + 4 + 4 + 3 + 4 + 4 + 2 + 4 + 3 + 4 + 5 = 55
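The distances in this example can be verified with a short Python check. It assumes that dE counts mismatching positions between two equal-length strings, which matches the "expected mismatches" wording of the definitions. The helper name is illustrative.

```python
def mismatches(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))

M = "ACTTGGGGCAGACTG"                      # consensus string of the alignment
occurrences = ["ACTTGGGGCAAGAGG", "GGACGGGGCAGACTG", "ACTTGCTAAAGACTG",
               "ACTGCGGGCACAGTG", "ACCTGGGTCGTACTG"]

distances = [mismatches(M, s) for s in occurrences]
worst_pair = mismatches(occurrences[0], occurrences[1])
print(distances)    # [4, 4, 4, 4, 4]
print(worst_pair)   # 8
```

Every occurrence is d = 4 mismatches away from the consensus, yet two occurrences can differ from each other in 2d = 8 of the 15 positions, so pairwise comparison alone gives a weak signal.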
21. Introduction Planted (l, d)-Motif Finding Problem
Definitions
Definition 10 (Planted (l, d)-Motif Finding Problem [Tompa,2001])
INPUT:
Motif length l,
Expected number of mismatches d, and
A set of t sequences S = {S0 , S1 , S2 , . . . , S(t−1) }, where each Si is of
length ni
OUTPUT:
A consensus string M from an alignment A(s), where each l-mer Si,ai in A(s)
has
dE (M, Si,ai ) = d,
for all i ∈ {0, . . . , (t − 1)}.
22. Introduction Planted (l, d)-Motif Finding Problem
Solutions for Planted (l, d)-Motif Finding
SP-STAR [Pevzner,2000]
Winnower [Pevzner,2000]
Random Projection [Tompa,2001]
Aggregation [Mohammed,2004]
GibbsDST [Shida,2006]
23. Finding Motifs using Random Projection (FMURP)
Finding Motifs using Random Projection (FMURP)
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 Projection
    1.1 Get all l-mers Si,j in S.
    1.2 Get the projection hI (Si,j ) of each Si,j in S.
    1.3 Hash each Si,j to the bucket with identifier hI (Si,j ).
    1.4 Get the enriched buckets.
2 Refine each enriched bucket using EM
3 Refine each enriched bucket using SP-STARσ
4 Maximize score to output best motif
24. Finding Motifs using Random Projection (FMURP)
Definition 11 (Random Projection)
Given an l-mer Si,j , a projection dimension k, and a set
I ⊂ L = {0, . . . , (l − 1)} with |I| = k, whose elements are chosen randomly
from L and sorted in increasing order, the k-dimensional projection of
Si,j is
hI (Si,j ) = Si,j (I0 ), Si,j (I1 ), . . . , Si,j (I(k−1) ),
where hI (Si,j ) is a k-mer and Ii denotes the ith element of I.
25. Finding Motifs using Random Projection (FMURP)
FMURP: Example
Example 12
Given a set of DNA sequences S, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3.
S0 : C G G T C A G G
S1 : T T C G A C A T
S2 : A C G A T G A A
Figure: Set of t = 3 sequences each with n = 8
Let I = {0, 1}.
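Since the projection figures did not survive extraction, the example can be worked through with a small Python sketch. It projects every l-mer onto the positions in I, groups the l-mers by projection, and keeps the buckets with at least δ members (the threshold test |B| ≥ δ is the one used in the correctness slides). The function names are illustrative.

```python
from collections import defaultdict

def project(lmer, positions):
    """h_I: keep only the symbols at the chosen positions."""
    return "".join(lmer[p] for p in positions)

def enriched_buckets(seqs, l, positions, delta):
    """Map each projection to its (i, j) members; keep buckets of size >= delta."""
    buckets = defaultdict(list)
    for i, seq in enumerate(seqs):
        for j in range(len(seq) - l + 1):
            buckets[project(seq[j:j + l], positions)].append((i, j))
    return {key: members for key, members in buckets.items()
            if len(members) >= delta}

seqs = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
result = enriched_buckets(seqs, l=4, positions=[0, 1], delta=3)
print(result)   # {'CG': [(0, 0), (1, 2), (2, 1)]}
```

With I = {0, 1} only the bucket CG reaches the threshold. It contains S0,0 = CGGT, S1,2 = CGAC, and S2,1 = CGAT.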
26. Finding Motifs using Random Projection (FMURP)
Projection
30. Parallel Motif Finding using Random Projection
How do we parallelize FMURP?
Sequential FMURP:
1 Projection
    1.1 Get all l-mers Si,j in S.
    1.2 Get projection hI (Si,j ) for each Si,j in S.
    1.3 Hash each Si,j to buckets with identifier hI (Si,j ).
    1.4 Get enriched buckets.
2 Refine each enriched bucket using EM
3 Refine each enriched bucket using SP-STARσ
4 Maximize score to output best motif
Parallel FMURP:
1 Projection
    1.1 Get all l-mers Si,j in S in parallel.
    1.2 Get projection hI (Si,j ) for each Si,j in S in parallel.
    1.3 Hash each Si,j to buckets with identifier hI (Si,j ) in parallel.
    1.4 Get enriched buckets in parallel.
2 Refine each enriched bucket using EM in parallel
3 Refine each enriched bucket using SP-STARσ in parallel
4 Maximize score to output best motif.
31. Parallel Motif Finding using Random Projection
Parallel Algorithms for Motif Finding
CUDA-MEME
CUDA-Gibbs Sampling
32. Parallel Motif Finding using Random Projection
CUDA
33. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Computing Framework
Figure: Flowchart showing the processes done in the CPU and GPU
34. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-FMURP v1
Figure: Thread ID is denoted by an ordered pair (i, j), 0 ≤ i ≤ w and 0 ≤ j ≤ v, where v is
the maximum number of threads per block and w is the number of allocated thread blocks in the
grid. The algorithm uses a total of x = t · (n − l + 1) threads that are linearly arranged on the GPU.
35. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-FMURP v1
INPUT: Set of sequences S, motif length l, expected mismatches d, projection dimension k,
and bucket threshold δ
OUTPUT: Motif
1 In CPU, generate k random positions for projection.
  Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l-1)\}\}$ and $|I| = k$.
2 In GPU, for each thread tid in {0, . . . , (x − 1)},
    2.1 Get hI (Si,j ) for each Si,j in S.
    2.2 Convert each k-mer hI (Si,j ) to its corresponding integer representation $k^*_{i,j}$.
    2.3 Perform a linear search over all $k^*_{i,j}$s to determine which l-mers
        are 'hashed' in the same bucket. The tids of the matched $k^*_{i,j}$s are noted instead
        of the actual l-mers.
3 In CPU, identify the set of enriched buckets,
  and prune duplicates in preparation for EM refinement.
4 In GPU, for each tid in {0, . . . , (e − 1)},
    4.1 Perform EM refinement for each enriched bucket.
    4.2 Perform SP-STARσ for each enriched bucket.
    4.3 Maximize σ score to output best motif.
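Steps 2.3 and 3 can be simulated sequentially in Python to show what each thread contributes. This is only a sketch of the idea and not the CUDA kernel from the talk. The outer loop stands in for the x concurrent threads, the function names are hypothetical, and the keys are the integer codes of the 2-mer projections of the toy example (CG = 6, GG = 10, and so on, under the A, C, G, T → 0, 1, 2, 3 mapping).

```python
def linear_search_buckets(keys):
    """For every thread id, list the thread ids whose key equals its own."""
    x = len(keys)
    matches = []
    for tid in range(x):   # on the GPU each tid would run concurrently
        matches.append([other for other in range(x) if keys[other] == keys[tid]])
    return matches

def enriched(matches, delta):
    """Keep buckets with at least delta members, pruning duplicate buckets."""
    unique = {tuple(m) for m in matches if len(m) >= delta}
    return sorted(unique)

# Keys of the 15 l-mers of the toy example, in tid order.
keys = [6, 10, 11, 13, 4, 15, 13, 6, 8, 1, 1, 6, 8, 3, 14]
buckets = enriched(linear_search_buckets(keys), delta=3)
print(buckets)   # [(0, 7, 11)]
```

Each simulated thread scans all x keys, so every bucket is found once per member. That is why the CPU step has to prune duplicates before refinement.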
36. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Integer Conversion
Step 2.2 converts each hI (Si,j ) to its corresponding integer representation
$k^*_{i,j}$. Given a unique k-mer from projection, a corresponding integer is
computed using the following mapping. Let us define
$f : \Sigma_{DNA} \to \{0, 1, 2, 3\}$,
A → 0
C → 1
G → 2
T → 3
where each symbol in the DNA alphabet is mapped to a unique integer.
For a string v of length k,
$f^* : \Sigma_{DNA}^{+} \to \mathbb{Z}^{+} \cup \{0\}$
$v \mapsto \sum_{i=0}^{k-1} f(v_i)\, 4^{i}$
where $v_i$ denotes the symbol at the ith position starting from the least significant
digit, and the integer representation is defined only on the positive integers
including 0.
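The mapping can be illustrated with a few lines of Python. The sketch assumes that the rightmost symbol of the k-mer is the least significant digit, which is one reading of the definition, and the function name is illustrative.

```python
from itertools import product

F = {"A": 0, "C": 1, "G": 2, "T": 3}   # the symbol mapping f

def encode(kmer):
    """f*: sum of f(v_i) * 4**i, taking v_0 as the rightmost symbol."""
    return sum(F[symbol] * 4 ** i for i, symbol in enumerate(reversed(kmer)))

print(encode("CG"))   # 1*4 + 2 = 6
print(encode("TT"))   # 3*4 + 3 = 15

# Every 2-mer receives a distinct code in {0, ..., 15}.
codes = [encode("".join(p)) for p in product("ACGT", repeat=2)]
print(sorted(codes) == list(range(16)))   # True
```

For a fixed k, the k-mers map one-to-one onto the integers 0 to 4^k − 1, so comparing integers is equivalent to comparing projections.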
37. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection v1: Example
Given a set of DNA sequences, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3. Projection in parallel is shown as follows
38. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection v1: Integer Conversion example
39. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection: Parallel Integer Conversion Example
40. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-Projection: Getting enriched buckets
42. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-EM
43. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
CUDA-SP-STARσ
45. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
The uniqueness of the representation we defined using $f^*$ follows from the
results below.
Let $\Sigma_k = \{0, 1, 2, \ldots, k-1\}$, and let $C_k$ be a regular language such that
$C_k = \{\epsilon\} \cup (\Sigma_k - \{0\})\Sigma_k^{*}$.
Theorem 4.1 (Fundamental Theorem of base-k Representation
[Allouche,2003])
Let k ≥ 2 be an integer. Then every non-negative integer has a unique
representation of the form
$N = \sum_{i=0}^{t} a_i k^{i}$,
where $a_t \neq 0$ and $0 \le a_i < k$ for $0 \le i \le t$.
46. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
In the case of our representation f ∗ , we have k = 4 and ai = f (vi ), where
vi ∈ ΣDNA . Note that the mapping f is one-to-one and onto by definition. Thus
we have the following:
Proposition 4.1
f ∗ provides a unique representation of hI (Si,j ), for each i, j, and element of I.
48. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
We have to show that the set of enriched buckets $EB$ obtained in FMURP is
equivalent to the set of enriched buckets $\overline{EB}$ obtained in CUDA-FMURP v1.
$EB = \{B \mid |B| \ge \delta\}$.
Two elements $S_{i,j}$ and $S_{i',j'}$ belong to the same bucket B if they satisfy the
relation R defined below.
Definition 13 (Relation R)
$(S_{i,j}, S_{i',j'}) \in B \Leftrightarrow (S_{i,j}, S_{i',j'}) \in R$
$(S_{i,j}, S_{i',j'}) \in R \Leftrightarrow h_I(S_{i,j}) = h_I(S_{i',j'})$
Proposition 4.2
R is an equivalence relation.
52. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness v1
In CUDA-FMURP v1, the set of enriched buckets is defined as
$\overline{EB} = \{\bar{B} \mid |\bar{B}| \ge \delta\}$,
where $\bar{B}$ is a bucket in CUDA-FMURP and two elements p and q belong to
the same bucket $\bar{B}$ if they satisfy the relation $\bar{R}$ defined below.
Definition 14 (Relation $\bar{R}$)
$(p, q) \in \bar{B} \Leftrightarrow (p, q) \in \bar{R}$
$(p, q) \in \bar{R} \Leftrightarrow k^*_{i,j} = k^*_{\bar{i},\bar{j}}$
where $i = \lfloor p/(n-l+1) \rfloor$, $j = p \bmod (n-l+1)$, $\bar{i} = \lfloor q/(n-l+1) \rfloor$, and
$\bar{j} = q \bmod (n-l+1)$.
Lemma 15
Relations $R$ and $\bar{R}$ are equivalent.
55. Parallel Motif Finding using Random Projection Parallel Motifs Finding using Random Projection (CUDA-FMURP v1)
Correctness
Note that the elements of $B$ are l-mers $S_{i,j}$, while the elements of $\bar{B}$ are
integers p ∈ {0, . . . , (x − 1)}. Using the equations
$\mathrm{tid} = i \times (n-l+1) + j$ (2)
$i = \lfloor \mathrm{tid} / (n-l+1) \rfloor$ (3)
$j = \mathrm{tid} \bmod (n-l+1)$ (4)
we can retrieve the l-mer $S_{i,j}$ corresponding to tid and vice versa. The theorem
below follows from the fact that $R$ and $\bar{R}$ are equivalent.
Theorem 4.2
The sets of enriched buckets $EB$ and $\overline{EB}$ are equivalent.
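Equations (2) to (4) describe a bijection between thread ids and (i, j) pairs, which a short Python sketch can confirm. The function names are illustrative, and the division in Equation (3) is taken to be integer (floor) division.

```python
def to_tid(i, j, n, l):
    """Equation (2): linear thread id of the l-mer S_{i,j}."""
    return i * (n - l + 1) + j

def from_tid(tid, n, l):
    """Equations (3) and (4): recover (i, j) from a thread id."""
    return divmod(tid, n - l + 1)

n, l, t = 8, 4, 3                # the toy example: 5 l-mers per sequence
print(to_tid(1, 2, n, l))        # 7
print(from_tid(11, n, l))        # (2, 1)

# The mapping is one-to-one over all x = t * (n - l + 1) thread ids.
round_trip = all(to_tid(*from_tid(tid, n, l), n, l) == tid
                 for tid in range(t * (n - l + 1)))
print(round_trip)                # True
```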
57. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v2)
CUDA-FMURP v2
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 In CPU, generate k random positions for projection.
  Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l-1)\}\}$ and $|I| = k$.
2 In GPU, for each thread tid in {0, . . . , (x − 1)},
    2.1 Get hI (Si,j ) for all Si,j in S,
        where i ∈ {0, . . . , (t − 1)} and j ∈ {0, . . . , (n − l)}.
    2.2 Convert each k-mer hI (Si,j ) to its corresponding
        integer representation $k^*_{i,j}$.
3 In CPU, hash the list of $k^*_{i,j}$s.
4 In CPU, identify the set of enriched buckets.
5 In GPU, for each tid in {0, . . . , (e − 1)},
    5.1 Perform EM refinement for each enriched bucket.
    5.2 Perform SP-STARσ for each enriched bucket.
    5.3 Maximize σ score to output best motif.
59. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v2)
Hash Table in CPU
To resolve collisions between two items with different $k^*_{i,j}$s, linear probing is
implemented.
Suppose we hash item p with key $k^*_{i',j'}$ and find that position $h(k^*_{i',j'})$ is not
empty,
i.e. $\exists\, k^*_{i,j}$ such that $h(k^*_{i,j}) = h(k^*_{i',j'})$ and $k^*_{i,j} \neq k^*_{i',j'}$.
We have to look for an empty position in the table where we can place item p.
We explore positions
$h'(k^*_{i',j'}, i) = (h(k^*_{i',j'}) + i) \bmod x$
for i from 0 to (m − 1), until an empty hash table position is found.
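Linear probing can be sketched in Python as follows. The slides do not state the hash function h or the table size, so the sketch assumes h(key) = key mod size and a deliberately small table. Items with the same key are appended to the same slot, so each slot plays the role of one bucket. All names are illustrative.

```python
def insert(table, key, item):
    """Place item in the slot for key, probing linearly on collisions."""
    size = len(table)
    for i in range(size):
        pos = (key % size + i) % size   # h'(key, i) = (h(key) + i) mod size
        if table[pos] is None:          # empty slot: start a new bucket
            table[pos] = (key, [item])
            return pos
        if table[pos][0] == key:        # same key: same bucket
            table[pos][1].append(item)
            return pos
    raise RuntimeError("hash table is full")

table = [None] * 7                      # assumed table size
keys = [6, 13, 6, 1, 8]                 # keys 6 and 13 collide, as do 1 and 8
positions = [insert(table, key, tid) for tid, key in enumerate(keys)]
print(positions)    # [6, 0, 6, 1, 2]
print(table[6])     # (6, [0, 2])
```

Key 13 hashes to the occupied slot 6 and is moved to slot 0. Key 8 hashes to slot 1 and is moved to slot 2. The second item with key 6 joins the existing bucket.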
66. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
CUDA-FMURP v3
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 In CPU, generate k random positions for projection.
  Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \ldots, (l-1)\}\}$ and $|I| = k$.
2 In GPU, for each thread tid in {0, . . . , (t − 1)},
    2.1 Get hI (Stid,j ) for all Stid,j in S,
        where j ∈ {0, . . . , (n − l)}.
    2.2 Convert each k-mer hI (Stid,j ) to its corresponding
        integer representation $k^*_{tid,j}$.
3 In CPU, hash the list of $k^*_{i,j}$s.
4 In CPU, identify the set of enriched buckets.
5 In GPU, for each tid in {0, . . . , (e − 1)},
    5.1 Perform EM refinement for each enriched bucket.
    5.2 Perform SP-STARσ for each enriched bucket.
    5.3 Maximize σ score to output best motif.
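The difference between assigning one thread per l-mer (versions 1 and 2) and one thread per sequence (version 3) can be illustrated with a sequential Python simulation. The loops stand in for GPU threads, all sequences are assumed to have the same length n, and the function names are hypothetical. Both partitionings produce the same list of keys.

```python
F = {"A": 0, "C": 1, "G": 2, "T": 3}

def key(lmer, positions):
    """Integer code of the projection; the first position is most significant."""
    value = 0
    for p in positions:
        value = value * 4 + F[lmer[p]]
    return value

def keys_per_lmer(seqs, l, positions):
    """x simulated threads, one per l-mer: tid = i * (n - l + 1) + j."""
    per_seq = len(seqs[0]) - l + 1
    out = []
    for tid in range(len(seqs) * per_seq):
        i, j = divmod(tid, per_seq)
        out.append(key(seqs[i][j:j + l], positions))
    return out

def keys_per_sequence(seqs, l, positions):
    """t simulated threads, one per sequence: thread tid loops over every j."""
    out = []
    for tid, seq in enumerate(seqs):
        out.extend(key(seq[j:j + l], positions)
                   for j in range(len(seq) - l + 1))
    return out

seqs = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
a = keys_per_lmer(seqs, 4, [0, 1])
b = keys_per_sequence(seqs, 4, [0, 1])
print(a == b)   # True
print(a)        # [6, 10, 11, 13, 4, 15, 13, 6, 8, 1, 1, 6, 8, 3, 14]
```

Version 3 uses fewer threads, but each thread performs (n − l + 1) projections instead of one.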
68. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
CUDA-Projection v3
70. Parallel Motif Finding using Random Projection Parallel Motif Finding using Random Projection (CUDA-FMURP v3)
Integer Conversion
71. Result and Analysis
Running Time and Space Complexity
Algorithm        Time          Space              Number of Processors
FMURP            O(x log x)    O(x)               1
SEQ-FMURP        O(x^2)        O(e(n − l + 1))    1
CUDA-FMURP v1    O(x)          O(e(n − l + 1))    x
CUDA-FMURP v2    O(x)          O(e(n − l + 1))    x
CUDA-FMURP v3    O(x)          O(e(n − l + 1))    t
Table: Total running time and space complexity of the three parallel algorithms for
CUDA-FMURP in comparison with the two sequential implementations.
72. Result and Analysis
Speedup and Efficiency
FMURP: O(x log x)
Speedup is the ratio of the sequential running time to the parallel running
time:
$S_P = \frac{\text{Sequential}}{\text{Parallel}}$
The speedups $S_{P_1}$, $S_{P_2}$, and $S_{P_3}$ for CUDA-FMURP versions 1 to 3,
respectively, compare as follows:
$S_{P_1} = S_{P_2} = S_{P_3} = \frac{O(x \log x)}{O(x)} = O(\log x)$
73. Result and Analysis
Speedup and Efficiency
Computation of processor Efficiency makes use of the speedup $S_P$ and the
number of processors used $\hat{p}$:
$E_P = \frac{1}{\hat{p}} \cdot S_P$
The efficiencies $E_{P_1}$, $E_{P_2}$, and $E_{P_3}$ for CUDA-FMURP versions 1 to
3, respectively, compare as follows:
$E_{P_1} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x}$ (5)
$E_{P_2} = \frac{1}{x} \cdot O(\log x) = \frac{\log x}{x}$ (6)
$E_{P_3} = \frac{1}{t} \cdot O(\log x) = \frac{\log x}{t}$ (7)
$E_{P_1} = E_{P_2} < E_{P_3}$
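To get a feel for the magnitudes, the expressions can be evaluated for one row of the dataset (t = 20, n = 600, l = 15). This is only a rough illustration: the asymptotic bounds hide constant factors, and the choice of base-2 logarithm is an assumption of this sketch, as is the function name.

```python
import math

def speedup_and_efficiency(t, n, l):
    """Evaluate S_P = log x and E_P = S_P / p with all constants set to 1."""
    x = t * (n - l + 1)                 # total number of l-mers (threads in v1, v2)
    speedup = math.log2(x)              # O(log x), base 2 assumed
    efficiency = {"v1": speedup / x, "v2": speedup / x, "v3": speedup / t}
    return x, speedup, efficiency

x, s, eff = speedup_and_efficiency(t=20, n=600, l=15)
print(x)                                # 11720
print(round(s, 2))
print(eff["v1"] == eff["v2"] < eff["v3"])   # True
```

Because t is much smaller than x, version 3 has the highest theoretical efficiency even though all three versions share the same asymptotic speedup.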
74. Result and Analysis Dataset
Dataset
t n l d Instances generated
20 600 10 2 100
20 600 11 2 100
20 600 12 3 100
20 600 13 3 100
20 600 14 4 100
20 600 15 4 100
20 600 16 5 100
20 600 17 5 100
20 600 18 6 100
20 600 19 6 100
Table: Summary of the generated dataset used to determine the accuracy of
CUDA-FMURP. For each instance generated, the OOPS search model is
assumed, that is, each sequence contains exactly one occurrence of the planted motif.
75. Result and Analysis Dataset
Accuracy
t n l d FMURP FMURP∗ SEQ-FMURP CUDA-FMURP m
20 600 10 2 13 100 98 98 72
20 600 11 2 99 100 100 100 16
20 600 12 3 3 96 83 83 259
20 600 13 3 81 100 100 100 62
20 600 14 4 1 86 79 79 645
20 600 15 4 49 100 100 100 172
20 600 16 5 0 77 53 53 1292
20 600 17 5 19 98 98 98 378
20 600 18 6 0 82 38 38 2217
20 600 19 6 9 98 94 94 711
Table: Number of correctly identified planted motifs over 100
random input instances. For each instance, parameters k = 7 and s = 4 are
used. The column labelled FMURP∗ is taken from the results presented in
[Tompa,2001] using the dataset they generated.
76. Result and Analysis Machine Setups
Machine Setups
System specifications       Values
Host processors (procs)     2 × Intel Quad-core 2.26GHz
Total number of cores       8
Max host RAM                12GB
Device/s (GPU/s)            2 × NVIDIA GT120
Compute capability          1.1
CUDA Cores/GPU              4 (multiprocs) × 8 (cores/proc) = 32
GPU clock rate              1.40 GHz
Memory clock rate           500 MHz
Max device global memory    512MB
Operating system            Mac OS X 10.6.8
CUDA version                3.2

System specifications       Values
Host processors (procs)     Core(TM) i7-2600 CPU 3.40GHz
Total number of cores       4 × 2 (hyperthreaded) = 8
Max host RAM                8GB
Device/s (GPU/s)            1 × NVIDIA GeForce GTX 580
Compute capability          2.0
CUDA Cores/GPU              16 (multiprocs) × 32 (cores/proc) = 512
GPU clock rate              1.54 GHz
Memory clock rate           2004 MHz
Max device global memory    1535MB
Operating system            64-bit Ubuntu 10.0.4
CUDA version                4.1
77. Result and Analysis Actual Speedup
Actual speed of CUDA-Projection v3 with respect to
CUDA-Projection v1
78. Result and Analysis Actual Speedup
Actual speed of CUDA-FMURP v1 and CUDA-Projection
v3
79. Result and Analysis Actual Speedup
Actual Speed Result: Setup1
80. Result and Analysis Actual Speedup
Memory Requirement
81. Result and Analysis Actual Speedup
Actual speed comparison and speedup of CUDA-FMURP
v1 with respect to SEQ-FMURP and FMURP using Setup 2
82. Conclusion
Conclusion
In this work, we presented three versions of parallel algorithms for FMURP.
Algorithm        Processors    S_P wrt FMURP    S_P wrt SEQ-FMURP    Efficiency
CUDA-FMURP v1    x             O(log x)         O(x)                 (log x)/x
CUDA-FMURP v2    x             O(log x)         O(x)                 (log x)/x
CUDA-FMURP v3    t             O(log x)         O(x)                 (log x)/t
We implemented CUDA-FMURP v1 and CUDA-FMURP v2 and achieved maximum
actual speedups of 6.8 and 6.6, respectively, with respect to
SEQ-FMURP.
84. References
References
J.P. Allouche and J. Shallit, “Automatic Sequences: Theory, Applications,
Generalizations”, Cambridge University Press, Chapter 3:
Numeration Systems, pp. 70-73, 2003.
P. Pevzner and S.H. Sze, “Combinatorial Approaches to Finding Subtle
Signals in DNA Sequences”, Proceedings of the 8th Int. Conf. on Intelligent
Systems for Molecular Biology (ISMB), pp. 269-278, 2000.
J. Buhler and M. Tompa, “Finding Motifs Using Random Projections”,
RECOMB ’01: Proceedings of the Fifth Annual International Conference on
Computational Biology, 2001.
D. Kirk and W. Hwu, “Programming Massively Parallel Processors: A Hands-on
Approach”, 1st ed., MA, USA: Morgan Kaufmann, 2010.
M. Harris, “Mapping Computational Concepts to GPUs”, ACM
SIGGRAPH 2005 Courses, NY, USA, 2005.
N. Jones and P. Pevzner, “An Introduction to Bioinformatics Algorithms”,
Massachusetts Institute of Technology Press, 2004.
85. Extra Slides
Finding Motifs using Random Projection (FMURP)
INPUT: Set of sequences S, motif length l, expected mismatches d, projection
dimension k, and bucket threshold δ
OUTPUT: Motif
1 Projection
    1 Generate k random positions for projection.
      Let this be the set $I = \{\hat{i} \mid \hat{i} \in \{0, \dots, l-1\}\}$ with $|I| = k$.
    2 For each $S_{i,j}$ in S,
        1 Get $h_I(S_{i,j})$ for all $S_{i,j}$ in S,
          where $i \in \{0, \dots, t-1\}$ and $j \in \{0, \dots, n-l\}$.
        2 Sort the $S_{i,j}$s with respect to $h_I(S_{i,j})$.
        3 Perform a linear search over all $h_I(S_{i,j})$s to determine which l-mers
          are ‘hashed’ in the same bucket.
2 Refine each enriched bucket using Expectation Maximization (EM)
3 Refine each enriched bucket using SP-STARσ
4 Maximize score to output best motif
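The Projection step (step 1) can be sketched sequentially as follows. This is a minimal Python illustration, not the CUDA implementation: the function names are hypothetical, and the positions are fixed to I = {0, 1} rather than drawn at random so that the run is reproducible. The input is the three-sequence example used on the following slides.

```python
import random
from itertools import groupby


def random_positions(l, k, rng=random):
    """Step 1.1: choose k distinct positions from {0, ..., l - 1}."""
    return sorted(rng.sample(range(l), k))


def project(lmer, positions):
    """h_I(lmer): keep only the characters at the chosen positions."""
    return "".join(lmer[p] for p in positions)


def find_enriched_buckets(seqs, l, positions, threshold):
    """Hash every l-mer, sort by hash value, then scan linearly for buckets."""
    hashed = [
        (project(s[j:j + l], positions), (i, j), s[j:j + l])
        for i, s in enumerate(seqs)
        for j in range(len(s) - l + 1)
    ]
    hashed.sort(key=lambda entry: entry[0])            # step 1.2.2
    buckets = {
        key: [(label, lmer) for _, label, lmer in group]
        for key, group in groupby(hashed, key=lambda entry: entry[0])  # step 1.2.3
    }
    # A bucket is enriched when it holds at least `threshold` l-mers.
    return {k: v for k, v in buckets.items() if len(v) >= threshold}


seqs = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
positions = [0, 1]  # fixed for the example; random_positions(4, 2) in general
enriched = find_enriched_buckets(seqs, l=4, positions=positions, threshold=3)
print(enriched)
```

With a threshold of 3, the only bucket returned for this input is CG, holding CGGT, CGAC, and CGAT. Sorting first lets a single linear pass collect equal hash values, which is the structure the numbered steps describe.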
86. Extra Slides Projection
Projection: Example
Given a set of DNA sequences S, pattern length l = 4, projection dimension
k = 2, and bucket threshold δ = 3.
S0 : C G G T C A G G
S1 : T T C G A C A T
S2 : A C G A T G A A
Figure: Set of t = 3 sequences each with n = 8
We generate the set of k random positions used in the actual projection.
Suppose we have the set I = {0, 1}.
For all $S_{i,j}$ in S, we compute $h_I(S_{i,j})$ using the random positions in I generated
in step 1.
To hash the $S_{i,j}$s to their corresponding buckets, the resulting list is sorted
lexicographically by $h_I(S_{i,j})$, keeping each $S_{i,j}$ together with its hash value.
87. Extra Slides Projection
Projection: Example
Label    $S_{i,j}$    $h_I(S_{i,j})$      Label    Sorted $S_{i,j}$    Sorted $h_I(S_{i,j})$
S0,0     CGGT         CG                  S2,0     ACGA                AC
S0,1     GGTC         GG                  S1,4     ACAT                AC
S0,2     GTCA         GT                  S2,3     ATGA                AT
S0,3     TCAG         TC                  S0,4     CAGG                CA
S0,4     CAGG         CA                  S0,0     CGGT                CG
S1,0     TTCG         TT                  S2,1     CGAT                CG
S1,1     TCGA         TC                  S1,2     CGAC                CG
S1,2     CGAC         CG                  S1,3     GACA                GA
S1,3     GACA         GA                  S2,2     GATG                GA
S1,4     ACAT         AC                  S0,1     GGTC                GG
S2,0     ACGA         AC                  S0,2     GTCA                GT
S2,1     CGAT         CG                  S1,1     TCGA                TC
S2,2     GATG         GA                  S0,3     TCAG                TC
S2,3     ATGA         AT                  S2,4     TGAA                TG
S2,4     TGAA         TG                  S1,0     TTCG                TT
Figure: The set of $h_I(S_{i,j})$s computed from step 2 (left) and the sorted list (right).
88. Extra Slides Projection
Projection: Example
To get the list of buckets, we perform a linear search over the sorted $h_I(S_{i,j})$s
and group the $S_{i,j}$s that have equal $h_I(S_{i,j})$.
$h_I(S_{i,j})$    Count    $S_{i,j}$
AC                2        {ACGA, ACAT}
AT                1        {ATGA}
CA                1        {CAGG}
CG                3        {CGGT, CGAT, CGAC}
GA                2        {GACA, GATG}
GG                1        {GGTC}
GT                1        {GTCA}
TC                2        {TCGA, TCAG}
TG                1        {TGAA}
TT                1        {TTCG}
Figure: Buckets obtained from Projection
89. Extra Slides Projection
Projection: Example
From the set of buckets obtained, we identify those that contain at
least δ hashed l-mers and consider them enriched. With δ = 3, only bucket CG,
which holds 3 l-mers, is enriched.
91. Extra Slides Expectation Maximization (EM)
Expectation Maximization (EM)
INPUT: Motif model $\theta_0$ from one enriched bucket, maximum number of
iterations y, and threshold for convergence $\delta_{EM}$
OUTPUT: Motif model $\theta_y$
1 For j in {1, . . . , y} or until convergence
    1 (E-step) For every l-mer in each sequence $S_i$,
      compute $E(S_{i,a_i} \mid \theta_j)$ given the current motif model.
    2 (M-step) For all $S_i$ in S,
      get the starting positions s such that, for each $a_i \in s$,
      $E(S_{i,a_i} \mid \theta_j)$ is maximum over all $a_i \in \{0, \dots, n-l\}$.
    3 (Test for Convergence) Compute $L(\theta_j)$. Compare the previous
      likelihood $L(\theta_{j-1})$ to the current $L(\theta_j)$.
      If the difference satisfies the threshold $\delta_{EM}$, stop iteration.
    4 (Update step) For the alignment made by the starting position vector s
      identified in the M-step,
      get the motif model $\theta_{j+1}$.
92. Extra Slides Expectation Maximization (EM)
EM: Example
For each enriched bucket obtained from Projection, EM performs the following
operations.
From the enriched bucket EB, get the alignment made by the hashed l-mers.
C G G T
C G A C
C G A T
From the alignment made, a profile matrix is computed.
A: 0 0 2 0
C: 3 0 0 1
G: 0 3 1 0
T: 0 0 0 2
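The profile matrix above is a per-column symbol count over the alignment. A minimal Python sketch of that count follows; the function name `profile_counts` is hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"


def profile_counts(alignment):
    """Count, for each symbol, its occurrences in every alignment column."""
    columns = [Counter(col) for col in zip(*alignment)]
    # Counter returns 0 for symbols that never occur in a column.
    return {a: [col[a] for col in columns] for a in ALPHABET}


counts = profile_counts(["CGGT", "CGAC", "CGAT"])
for symbol, row in counts.items():
    print(f"{symbol}: {' '.join(map(str, row))}")
```

Transposing the alignment with `zip(*alignment)` turns rows of l-mers into columns, so each column can be counted independently.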
93. Extra Slides Expectation Maximization (EM)
EM: Example
Normalize the profile matrix obtained.
A: 0.00 0.00 0.66 0.00
C: 1.00 0.00 0.00 0.33
G: 0.00 1.00 0.33 0.00
T: 0.00 0.00 0.00 0.66
To avoid zero values for $\Pr(S_{i,j} \mid \theta)$, [Tompa,2001] performed Laplace
correction. For each symbol a, the probability $p_a$ that a appears in the
sequences is added to every entry of the row corresponding to a. Since all
symbols in $\Sigma_{DNA}$ have a uniform frequency distribution, 0.25 is added to each row.
A: 0.25 0.25 0.91 0.25
C: 1.25 0.25 0.25 0.58
G: 0.25 1.25 0.58 0.25
T: 0.25 0.25 0.25 0.91
94. Extra Slides Expectation Maximization (EM)
EM: Example
Normalize the matrix obtained and let the resulting matrix be the initial
motif model $\theta_0$.
A: 0.125 0.125 0.455 0.125
C: 0.625 0.125 0.125 0.290
G: 0.125 0.625 0.290 0.125
T: 0.125 0.125 0.125 0.455
For each $S_i$ in S, get the $j \in \{0, \dots, n-l\}$ for which $E(S_{i,j} \mid \theta_0)$ is
maximum. For instance, let’s identify the l-mer in sequence $S_0$ with
maximum expectation $E(S_{0,j} \mid \theta_0)$.
$E(S_{0,0} \mid \theta_0) = E(\text{CGGT} \mid \theta_0) = ((0.625)(0.625)(0.290)(0.455))/(0.25^4) = 13.195$
$E(S_{0,1} \mid \theta_0) = E(\text{GGTC} \mid \theta_0) = ((0.125)(0.625)(0.125)(0.290))/(0.25^4) = 0.725$
$E(S_{0,2} \mid \theta_0) = E(\text{GTCA} \mid \theta_0) = ((0.125)(0.125)(0.125)(0.125))/(0.25^4) = 0.063$
$E(S_{0,3} \mid \theta_0) = E(\text{TCAG} \mid \theta_0) = ((0.125)(0.125)(0.455)(0.125))/(0.25^4) = 0.228$
$E(S_{0,4} \mid \theta_0) = E(\text{CAGG} \mid \theta_0) = ((0.625)(0.125)(0.290)(0.125))/(0.25^4) = 0.725$
Of all $S_{0,j}$s in $S_0$, l-mer $S_{0,0}$ obtains the highest expectation.
95. Extra Slides Expectation Maximization (EM)
EM: Example
The set of l-mers with the highest expectation in each sequence
defines another alignment, as in Step 1. From this set of l-mers, we can
obtain the next motif model $\theta_1$.
S0,0 : C G G T : 13.20
S1,2 : C G A C : 13.20
S2,1 : C G A T : 20.70
We compute the likelihood of a motif model $\theta_y$ using the best
expectations.
$L(\theta) = 13.20 + 13.20 + 20.70 = 47.10$
Update the motif model $\theta_0$ to get $\theta_1$, using the set of l-mers from each
sequence that maximize the expectation.
Stop iteration if $L(\theta_y) - L(\theta_{y-1}) \le \delta_{EM}$.
The output of EM in this example is the consensus string CGAT.
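The EM refinement walked through above can be sketched as follows. This is a simplified Python illustration under stated assumptions: the function names, the default iteration limit, and the default threshold are hypothetical; the model is built from unrounded column frequencies, so scores can differ in the later decimals from hand-rounded values; and convergence is tested on the absolute change in likelihood.

```python
from collections import Counter

ALPHABET = "ACGT"
BACKGROUND = 0.25  # uniform symbol frequency, as in the example


def motif_model(alignment):
    """Column frequencies, plus 0.25 per entry, renormalised per column."""
    model = []
    for col in zip(*alignment):
        freq = Counter(col)
        raw = {a: freq[a] / len(col) + BACKGROUND for a in ALPHABET}
        total = sum(raw.values())
        model.append({a: v / total for a, v in raw.items()})
    return model


def expectation(lmer, model):
    """E(lmer | theta): product of column probabilities over the background."""
    score = 1.0
    for symbol, column in zip(lmer, model):
        score *= column[symbol] / BACKGROUND
    return score


def best_lmers(seqs, model):
    """E-step and M-step: per sequence, the l-mer with the highest expectation."""
    l = len(model)
    picks = []
    for s in seqs:
        lmers = [s[j:j + l] for j in range(len(s) - l + 1)]
        picks.append(max(lmers, key=lambda m: expectation(m, model)))
    return picks


def em(seqs, alignment, max_iter=10, delta=1e-6):
    model, previous = motif_model(alignment), None
    for _ in range(max_iter):
        picks = best_lmers(seqs, model)
        likelihood = sum(expectation(m, model) for m in picks)
        if previous is not None and abs(likelihood - previous) <= delta:
            break                                   # convergence test
        model, previous = motif_model(picks), likelihood  # update step
    # Consensus string: the most probable symbol in every column.
    return "".join(max(col, key=col.get) for col in model)


seqs = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
print(em(seqs, ["CGGT", "CGAC", "CGAT"]))
```

For the three example sequences and the alignment from the enriched bucket, the selected l-mers are CGGT, CGAC, and CGAT, and the sketch prints the consensus string CGAT.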
97. Extra Slides Expectation Maximization (EM)
SP-STARσ
INPUT: Consensus string M from $\theta_y$ and expected mismatches d
OUTPUT: Refined consensus string $M^*$
1 For j in {1, . . . , y} or until convergence
    1 Compute $S_b$, the set that contains, from each sequence, the l-mer that
      has the least edit distance from M.
      $S_b = \{S_{i,j} \mid d_E(M, S_{i,j}) \text{ is minimum } \forall S_{i,j} \text{ in } S_i\}$
    2 Compute the score $\sigma(S_b)$, which is equal to the number of sequences in
      $S_b$ such that
      $d_E(M, S_{i,j}) \le d$
    3 Compute the consensus string $M'$ from the alignment made by $S_b$.
    4 Compute $S_b'$ from $M'$.
    5 Compute $\sigma(S_b')$.
    6 If $\sigma(S_b') > \sigma(S_b)$, continue iteration using $M = M'$,
      else $M^* = M$.
98. Extra Slides Expectation Maximization (EM)
SP-STARσ: Example
Use M = CGAT and expected mismatches d = 1.
Compute $S_b$. For $S_0$, the closest $S_{0,j}$ is identified as follows.
$d_E(M, S_{0,0}) = d_E(\text{CGAT}, \text{CGGT}) = 1$
$d_E(M, S_{0,1}) = d_E(\text{CGAT}, \text{GGTC}) = 3$
$d_E(M, S_{0,2}) = d_E(\text{CGAT}, \text{GTCA}) = 4$
$d_E(M, S_{0,3}) = d_E(\text{CGAT}, \text{TCAG}) = 3$
$d_E(M, S_{0,4}) = d_E(\text{CGAT}, \text{CAGG}) = 3$
The set $S_b$ contains
$S_b = \{S_{0,0}, S_{1,2}, S_{2,1}\}$
$S_b = \{\text{CGGT}, \text{CGAC}, \text{CGAT}\}$
99. Extra Slides Expectation Maximization (EM)
SP-STARσ: Example
The score for $S_b$ is
$\sigma(S_b) = 3$
because the least edit distances in the three sequences are 1, 1, and 0. That is, all 3
sequences satisfy
$d_E(M, S_{i,j}) \le 1$
The consensus string from $S_b$ is $M' = \text{CGAT}$.
$S_b'$ computed from $M'$ is the same as $S_b$.
$S_b' = \{S_{0,0}, S_{1,2}, S_{2,1}\}$
$S_b' = \{\text{CGGT}, \text{CGAC}, \text{CGAT}\}$
Since $\sigma(S_b') = \sigma(S_b)$,
$M^* = M = \text{CGAT}$.
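The SP-STARσ refinement can be sketched in Python as follows. The function names and the iteration limit are hypothetical. The distance is implemented here as the Hamming distance (substitutions only), which is an assumption; it reproduces the distances 1, 3, 4, 3, 3 listed for sequence S0 in the example.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def closest_lmers(seqs, motif):
    """S_b: from each sequence, the l-mer nearest to the motif."""
    l = len(motif)
    return [
        min((s[j:j + l] for j in range(len(s) - l + 1)),
            key=lambda m: hamming(motif, m))
        for s in seqs
    ]


def score(lmers, motif, d):
    """sigma(S_b): how many chosen l-mers lie within d mismatches of the motif."""
    return sum(hamming(motif, m) <= d for m in lmers)


def consensus(lmers):
    """Most frequent symbol in every column of the alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*lmers))


def sp_star(seqs, motif, d, max_iter=10):
    best = closest_lmers(seqs, motif)
    for _ in range(max_iter):
        candidate = consensus(best)
        candidate_best = closest_lmers(seqs, candidate)
        # Keep iterating only while the score strictly improves.
        if score(candidate_best, candidate, d) > score(best, motif, d):
            motif, best = candidate, candidate_best
        else:
            break
    return motif


seqs = ["CGGTCAGG", "TTCGACAT", "ACGATGAA"]
print(sp_star(seqs, "CGAT", d=1))
```

On the example input the score does not improve after the first pass, so the sketch returns the starting motif CGAT unchanged.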