Motif finding plays a vital role in identifying transcription factor binding sites (TFBS), which helps in understanding gene expression regulation. The document discusses various approaches to motif finding, including:
1. Enumerative approaches that search for consensus sequences, such as YMF, DREME, CisFinder, Weeder, FMotif, and MCES.
2. Probabilistic approaches using position weight matrices (PWM) and algorithms like EM, MEME, STEME, EXTREME, Gibbs sampling, AlignACE, and BioProspector.
3. Other approaches like nature-inspired algorithms, combinatorial methods, and the LOGOS algorithm based on Bayesian modeling.
The document provides details on the methodology of each approach.
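The probabilistic approaches above typically represent a motif as a position weight matrix (PWM). As a rough illustration (the counts below are made up, not taken from the document), a minimal log-odds PWM scorer might look like:

```python
import math

# Hypothetical position frequency counts for a length-4 motif
# (one dict per motif column; these numbers are invented for illustration).
counts = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 0, "G": 8, "T": 1},
    {"A": 0, "C": 1, "G": 1, "T": 8},
]
BACKGROUND = 0.25  # uniform background base frequency

def log_odds_pwm(counts, pseudo=1.0):
    """Turn raw counts into a log-odds position weight matrix."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudo
        pwm.append({b: math.log2(((col[b] + pseudo) / total) / BACKGROUND)
                    for b in "ACGT"})
    return pwm

def pwm_score(pwm, site):
    """Sum the log-odds contribution of each base at each position."""
    return sum(col[base] for col, base in zip(pwm, site))

pwm = log_odds_pwm(counts)
# The consensus site ACGT scores higher than an unrelated site.
print(pwm_score(pwm, "ACGT") > pwm_score(pwm, "TTAA"))  # True
```

Candidate sites in a promoter can then be ranked by sliding this scorer along the sequence, which is the basic move shared by EM-style and Gibbs-sampling motif finders.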
The Protein Information Resource is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies, and contains protein sequence databases.
Protein databases contain information on protein sequences, structures, and functions. The major protein databases are:
- Protein Data Bank (PDB) which contains 3D protein structures determined via X-ray crystallography or NMR.
- Swiss-Prot which contains manually annotated protein sequences and functions.
- TrEMBL which supplements Swiss-Prot with automatically annotated translations of DNA sequences.
Protein databases are important for comparing proteins, understanding relationships between proteins, and aiding the study of new proteins. Searching databases is often the first step in protein research.
This document discusses gene identification and discovery. It begins by describing gene identification in prokaryotes, unicellular eukaryotes, and multicellular eukaryotes. It then discusses the components of protein-coding genes and different approaches to gene prediction for prokaryotes and eukaryotes. The document also covers gene structure in prokaryotes and eukaryotes, as well as software tools and methods for gene prediction. Finally, it discusses several approaches to classifying genes by function.
The Protein Data Bank (PDB) is an open database that archives 3D structural data of biological macromolecules. It was established in 1971 and currently holds over 150,000 structures determined by X-ray crystallography or NMR spectroscopy. The PDB is overseen by the Worldwide Protein Data Bank and freely accessible online. It serves as a key resource for structural biology and many other databases rely on protein structures deposited in the PDB.
Genomic databases are online repositories of genomic variants, described either for a single gene (locus-specific) or for multiple genes (general), or specifically for a population or ethnic group (national/ethnic).
Sequence Alignment: Pairwise alignment (naveed ul mushtaq)
The presentation covers the two types of pairwise alignment, global alignment and local alignment, along with progressive programs for multiple sequence alignment, BLOSUM, point accepted mutation (PAM), and a comparison of PAM vs. BLOSUM.
Contents:
What does sequence mean?
Examples of sequences
Sequence Homology
Sequence Alignment
What is the use of sequence alignment?
Alignment methods
Tools for Sequence Alignment
FASTA Format
BLAST
Principle of BLAST
Variants of BLAST Program
BLAST input
BLAST output
Multiple sequence alignment
What is the use of multiple alignments?
Multiple Alignment Method
Tool for multiple alignments
ClustalW input
ClustalW output
E (Expectation) value
Demerits of progressive alignment
This document provides information about several nucleotide and protein sequence databases including:
- INSDC (International Nucleotide Sequence Database Collaboration) which includes GenBank, EMBL, and DDBJ.
- GenBank which contains over 80 billion nucleotide bases from 76 million sequences and doubles in size every 18 months. The top species represented are human, mouse, rat, cattle, and maize.
- EMBL and DDBJ which are similar to GenBank in content and format but maintained by different collaborations. Secondary databases like UniProt, PROSITE and PRINTS/BLOCKS provide additional annotation and analysis of sequences.
UniProt is a comprehensive, freely accessible database that is a central repository for protein data. It is produced through collaboration between the European Bioinformatics Institute, Swiss Institute of Bioinformatics, and Protein Information Resource. UniProt contains protein sequences, functional information, evolutionary data, and details about biological processes, post-translational modifications, interactions, and subcellular locations to characterize proteins.
- Automated sequencing of genomes requires automated gene assignment.
- This includes detection of open reading frames (ORFs).
- Identification of introns and exons.
- Gene prediction is a very difficult problem in pattern recognition.
- Coding regions generally do not have conserved sequences.
- Much progress has been made with prokaryotic gene prediction.
- Eukaryotic genes are more difficult to predict correctly.
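The ORF-detection step mentioned above can be sketched very simply for the prokaryotic case: scan each forward reading frame for an ATG start codon followed in-frame by a stop codon (a toy sketch, not any particular tool's method; real gene finders also scan the reverse strand and apply length and coding-statistics filters):

```python
# Minimal forward-strand ORF scan: report open reading frames that
# start with ATG and end at the first in-frame stop codon.
STOP = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) spans of ATG...stop ORFs in all three forward frames."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3))
                        break
            i += 3
    return orfs

# ATG GCT TAA -> one short ORF starting at position 3
print(find_orfs("TTTATGGCTTAAGG"))  # [(3, 12)]
```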
The CATH database hierarchically classifies protein domains obtained from protein structures deposited in the Protein Data Bank. Domain identification and classification uses both manual and automated procedures. CATH includes domains from structures determined at 4 angstrom resolution or better that are at least 40 residues long with 70% or more residues having defined side chains. Submitted protein chains are divided into domains, which are then classified in CATH.
The Protein Data Bank (PDB) is a single worldwide database that stores 3D structural data of proteins and nucleic acids. It is operated by Rutgers University, the San Diego Supercomputer Center, and the Research Collaboratory for Structural Bioinformatics. The PDB is freely accessible online and contains over 76,000 biomolecular structure entries as of 2011. It uses a common file format to represent structural data and is updated weekly as new entries are submitted by researchers.
The document discusses various methods for predicting protein function, including homology-based transfer of annotation and prediction of functional motifs and domains. Homology-based transfer can infer molecular function from sequence similarity, but biological process is only transferable between orthologs. Orthologs can be detected through phylogenetic trees or automated methods like InParanoid. Each protein domain contributes to molecular function, while short motifs like phosphorylation sites are also important. Functional annotation involves describing proteins at the molecular, biological process, and cellular component levels.
Protein structure classification/domain prediction: SCOP and CATH (Bioinformatics) by SELF-EXPLANATORY
This pdf is about protein structure classification and domain prediction with SCOP and CATH (Bioinformatics).
- An integrated, publicly accessible bioinformatics resource to support genomic/proteomic research and scientific discovery.
- Established in 1984 by the National Biomedical Research Foundation (NBRF), Georgetown University Medical Center, Washington, D.C., USA.
- It is a source of annotated protein databases and analysis tools for researchers.
- Serves as a primary resource for the exploration of protein information.
- Accessible by text search for entry and list retrieval, and also by BLAST search and peptide match.
This document provides an overview of bioinformatics. It defines bioinformatics as the science of collecting, analyzing and conceptualizing biological data through computational techniques. It discusses that bioinformatics involves managing, organizing and processing biological information from databases, as well as analyzing, visualizing and sharing biological data over the internet. It also outlines some of the goals of bioinformatics like organizing the human and mouse genomes, as well as some applications like genomic and protein sequence analysis, protein structure prediction, and characterizing genomes.
The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It is located at the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a member of the International Nucleotide Sequence Database Collaboration or INSDC.
The document summarizes the process a molecular biologist would take to identify the gene causing a new disease phenotype in a patient. It involves collecting ESTs from the patient's tissue, searching for matching sequences in the human genome using BLAST, and analyzing the matching genes and variants like SNPs to see if any are known to cause the observed phenotype. The document walks through this process using a real example where an EST matches the HFE gene, and a variant changing a cysteine to tyrosine is known to cause hemochromatosis.
1. BLAST is a program that uses computer algorithms to compare a query DNA or protein sequence to sequence databases and identify sequences that resemble the query sequence above a certain threshold.
2. BLAST works by searching for short, exact matches between the query and database sequences, then extends the matches to find similar though not exact alignments.
3. Analyzing the BLAST results can provide information about the evolutionary relationship between the query sequence and matched sequences, such as whether they come from the same gene or protein family.
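The seed-and-extend idea in point 2 can be sketched in a toy form (scoring parameters here are illustrative; real BLAST uses substitution matrices, two-hit seeding, gapped extension, and E-value statistics):

```python
def seed_and_extend(query, db, k=3, match=1, mismatch=-1, dropoff=2):
    """Index every k-letter word of db, then extend each exact seed hit
    to the right until the score falls `dropoff` below its maximum
    (a full implementation also extends left and allows gaps)."""
    index = {}
    for i in range(len(db) - k + 1):
        index.setdefault(db[i:i + k], []).append(i)
    hits = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            score = best = k * match        # the seed is an exact match
            qi, di = q + k, d + k
            while qi < len(query) and di < len(db):
                score += match if query[qi] == db[di] else mismatch
                qi, di = qi + 1, di + 1
                best = max(best, score)
                if best - score >= dropoff:  # X-drop termination
                    break
            hits.append((q, d, best))
    return hits

hits = seed_and_extend("ACGTAC", "TTACGTACTT")
print(max(hits, key=lambda h: h[2]))  # (0, 2, 6)
```

The word index is what makes BLAST fast: only database positions sharing an exact word with the query are ever considered for extension.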
The document discusses several sequence similarity tools: BLAST, FASTA, and CLUSTAL. It describes BLAST as an algorithm that finds locally and globally alignable sequences by calculating segment pairs between a query and database sequences above a scoring threshold. FASTA is described as a fast protein or nucleotide comparison tool that is more sensitive than BLAST but slower. CLUSTAL W is introduced as a tool that produces meaningful multiple sequence alignments of divergent sequences and can reveal evolutionary relationships.
UniProt is a comprehensive database of protein sequences and functional information that is curated by the European Bioinformatics Institute, SIB Swiss Institute of Bioinformatics, and the Protein Information Resource. It contains three main components: UniProtKB for functional information, UniRef for clustered sequences, and UniParc for comprehensive protein sequences. As genome sequencing increases, UniProt provides centralized storage and interconnection of protein data from various sources to further scientific understanding of proteins.
The document discusses different text-based database retrieval systems for accessing biological data, including Entrez, SRS, and DBGET/LinkDB. It describes their key features and how each system allows users to search text databases using queries, with Entrez providing linked related data across multiple databases. An example shows how each system can be used to retrieve and view related information for a SwissProt protein entry.
This document provides an introduction to biological databases and bioinformatics tools. It defines biological sequences and databases, and describes the types of bioinformatics databases including primary, secondary, and composite databases. Examples of specific biological databases like GenBank, EMBL, and SwissProt are outlined. Common bioinformatics tools for sequence analysis, structural analysis, protein function analysis, and homology/similarity searches are listed, including BLAST, FASTA, EMBOSS, ClustalW, and RasMol. Finally, important bioinformatics resources on the web are highlighted.
History and Development of Bioinformatics.ppt by Madan Kumar C A
Dear Sir/Madam,
Name: Madan Kumar C A
Topic: History and Development of Bioinformatics
Guide: Dr. Ramesh C K, Associate Professor, Dept. of Biotechnology, Sahyadri Science College, Shivamogga
The document discusses global sequence alignment and the Needleman-Wunsch algorithm. Global alignment finds the optimal alignment of two sequences over their entire lengths by allowing gaps. It uses a dynamic programming approach, filling a matrix to find the highest-scoring alignment based on matches, mismatches, and gap penalties; the resulting alignment score accounts for the total number of matches and mismatches as well as the gap penalties.
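The dynamic program just described can be written compactly (a sketch with an illustrative +1/-1/-2 scoring scheme and a linear gap penalty; real tools use substitution matrices and affine gaps, and trace back through the matrix to recover the alignment itself):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b."""
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] against b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap                     # leading gaps in b
    for j in range(1, m + 1):
        F[0][j] = j * gap                     # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag,               # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,  # gap in b
                          F[i][j - 1] + gap)  # gap in a
    return F[n][m]

print(needleman_wunsch("ACGT", "ACGT"))  # 4 (four matches)
print(needleman_wunsch("ACGT", "AGT"))   # 1 (three matches, one gap)
```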
Secondary structure prediction tools analyze a protein's amino acid sequence to predict its 3D structure and function. These tools use various methods like Chou-Fasman, GOR, neural networks, and hidden Markov models to identify alpha helices and beta sheets based on characteristics like residue propensity values, sequence homology, and patterns in windows of amino acids. Accurate prediction of secondary structure is important for determining a protein's tertiary structure and biological role.
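The window-averaging idea behind Chou-Fasman-style prediction can be illustrated with a toy helix caller. The propensity values below follow the published Chou-Fasman table for those residues, but the window size, threshold, and default value for unlisted residues are simplified assumptions:

```python
# Chou-Fasman helix propensities for a handful of residues.
HELIX_PROPENSITY = {"A": 1.42, "E": 1.51, "L": 1.21,
                    "G": 0.57, "P": 0.57, "S": 0.77}

def predict_helix(seq, window=4, threshold=1.0):
    """Mark every residue covered by some window whose average
    helix propensity exceeds the threshold."""
    calls = ["-"] * len(seq)
    for i in range(len(seq) - window + 1):
        avg = sum(HELIX_PROPENSITY.get(r, 1.0) for r in seq[i:i + window]) / window
        if avg > threshold:
            for j in range(i, i + window):
                calls[j] = "H"
    return "".join(calls)

print(predict_helix("AELAGPGS"))  # HHHHH---
```

Helix-favoring residues (A, E, L) light up while helix breakers (G, P) suppress the call, which is the core intuition the later neural-network and HMM methods refine.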
IRJET: A Novel Approach for Motif Discovery using Data Mining Algorithms (IRJET Journal)
This document presents a novel approach for motif discovery using data mining algorithms. It proposes the MDWB algorithm which uses a word-based approach and MapReduce functions to identify motifs from large ChIP-seq datasets. The algorithm labels control and test datasets to determine thresholds and extracts frequently occurring substrings. MapReduce is used to reduce memory and time costs during the mining stage. Experimental results on ChIP-seq datasets show the MDWB algorithm achieves higher accuracy and runs faster than previous methods for identifying transcription factor binding sites.
A general frame for building optimal multiple SVM kernels (infopapers)
Dana Simian, Florin Stoica, A General Frame for Building Optimal Multiple SVM Kernels, Large-Scale Scientific Computing, Lecture Notes in Computer Science, 2012, Volume 7116/2012, 256-263, DOI: 10.1007/978-3-642-29843-1_29
The document discusses different types of sequence analysis and alignment methods. It describes analyzing DNA, RNA, and protein sequences to understand their features, functions, and evolution. Methods include aligning sequences globally or locally to identify similar regions. Pairwise alignment involves two sequences while multiple sequence alignment incorporates more sequences using techniques like dynamic programming, progressive alignment, and motif finding. Structural alignments also use 3D protein or RNA structure information.
International Journal of Computer Science, Engineering and Information Technology (IJCSEIT Journal)
As more data is added in the field of proteomics, the computational methods need to become more efficient. The parts of molecular sequences that are functionally most important to the molecule are the most resistant to change, and comparative approaches are used to ensure the reliability of sequence alignment. Multiple sequence alignment amounts to a proposition about evolutionary history: for each column in the alignment, an explicit homologous correspondence between individual sequence positions is established. The present work elaborates the different pairwise sequence alignment methods, but these methods are only suitable for aligning a limited number of sequences of small length. A new method is introduced for aligning sequences based on local alignment with consensus sequences. Triticum (wheat) varieties are loaded from the NCBI databank, phylogenetic trees are constructed for divided parts of the dataset, and a single new tree is constructed from the previously generated trees using an advanced pruning technique. The closely related sequences are then extracted by applying threshold conditions, and an optimal sequence alignment is obtained by using shift operations in both directions.
Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification? (IJORCS)
An algorithm for locating all occurrences of a finite number of keywords in an arbitrary string, also known as multiple string matching, is commonly required in information retrieval (such as sequence analysis, evolutionary biological studies, gene/protein identification, and network intrusion detection) and in text-editing applications. Although Aho-Corasick has been one of the most commonly used exact multiple string matching algorithms, Commentz-Walter has been introduced as a better alternative in the recent past. The Commentz-Walter algorithm combines ideas from both Aho-Corasick and Boyer-Moore. Large-scale, rapid, and accurate peptide identification is critical in computational proteomics. In this paper, we critically analyze the time complexity of Aho-Corasick and Commentz-Walter for their suitability in large-scale peptide identification. According to the results we obtained for our dataset, we conclude that Aho-Corasick performs better than Commentz-Walter, contrary to common belief.
This document discusses methods for constructing phylogenetic trees including distance-based and character-based approaches. Distance-based methods include UPGMA, Neighbor-Joining (NJ), and Fitch-Margoliash (FM) which use genetic distances between sequences. Character-based methods include Maximum Parsimony (MP) which finds the tree requiring the fewest evolutionary changes, and Maximum Likelihood (ML) which calculates the probability of the observed sequence changes. NJ is the fastest method while ML is the slowest but uses all available sequence data. The appropriate method depends on factors like number of sequences and computational requirements.
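Of the distance-based methods above, UPGMA is the simplest to sketch: repeatedly merge the closest pair of clusters and recompute distances to the merged cluster as size-weighted averages. A minimal version (with a toy distance matrix assumed for illustration; real implementations also track branch lengths) could be:

```python
def upgma(names, D):
    """names: leaf labels; D: symmetric distance matrix (list of lists).
    Returns the merge order as nested tuples."""
    clusters = {n: (n, 1) for n in names}            # label -> (subtree, size)
    d = {(a, b): D[i][j]
         for i, a in enumerate(names)
         for j, b in enumerate(names) if i < j}
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                     # closest pair of clusters
        ta, na = clusters.pop(a)
        tb, nb = clusters.pop(b)
        new = a + b
        # distance from the merged cluster to every remaining cluster
        # is the size-weighted average of the two old distances
        for c in list(clusters):
            pa = d.pop((a, c)) if (a, c) in d else d.pop((c, a))
            pb = d.pop((b, c)) if (b, c) in d else d.pop((c, b))
            d[(new, c)] = (na * pa + nb * pb) / (na + nb)
        del d[(a, b)]
        clusters[new] = ((ta, tb), na + nb)
    (tree, _), = clusters.values()
    return tree

# A and B are close (2); both are far from C (6), so A and B merge first.
print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

Neighbor-Joining differs in its choice of which pair to merge (it corrects for average divergence), which is why it handles unequal evolutionary rates better than UPGMA.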
A Methodology For Motif Discovery Employing Iterated Cluster Re-Assignment (Angela Tyger)
The document proposes a methodology for iterative motif discovery that combines cluster-first and regression approaches. It begins with an initial clustering of genes based on expression data. Motifs are discovered in each cluster and optionally filtered via regression. Genes are then reassigned to clusters based on motif profiles, and the process iterates until cluster assignments converge. The goal is to obtain homogeneous gene subsets for improved motif discovery performance compared to using random subsets or a single clustering.
Holistic Approach for Arabic Word Recognition (Editor IJCATR)
Optical character recognition (OCR) is an important research area, and segmenting words into characters is one of the most challenging steps in OCR. As a result of advances in machine speeds and memory sizes, as well as the availability of large training datasets, researchers currently study the holistic approach: recognition of a word without segmentation. This paper describes a method to recognize off-line handwritten Arabic names. The classification approach is based on hidden Markov models. For each Arabic word, many HMM models with different numbers of states have been trained. The experimental results are encouraging; they also show that the best number of states for each word needs careful selection and consideration.
Myanmar Named Entity Recognition with Hidden Markov Model (ijtsrd)
This document presents a study on Named Entity Recognition for the Myanmar language using Hidden Markov Models. It discusses how the HMM approach works in three phases: annotation to tag text, training to estimate model parameters, and testing to apply the model. Parameters like transition probabilities between tags and emission probabilities of words for each tag are estimated from annotated training data. The model achieves 95.2% accuracy, 99.3% precision, 95.2% recall, and 97.2% F-measure on test data, showing the HMM approach is effective for Myanmar NER.
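The testing phase described above amounts to Viterbi decoding: finding the most probable tag path given the transition and emission probabilities. A tiny decoder over a made-up two-tag model (the tags, words, and probabilities below are illustrative assumptions, not the paper's) might look like:

```python
def viterbi(obs, states, start, trans, emit):
    """Most probable state path for an observation sequence."""
    V = [{s: start[s] * emit[s].get(obs[0], 1e-6) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            row[s] = V[-1][prev] * trans[prev][s] * emit[s].get(o, 1e-6)
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):       # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["NE", "O"]                 # named entity vs. other
start = {"NE": 0.3, "O": 0.7}
trans = {"NE": {"NE": 0.5, "O": 0.5}, "O": {"NE": 0.2, "O": 0.8}}
emit = {"NE": {"Yangon": 0.8, "in": 0.01},
        "O": {"Yangon": 0.01, "in": 0.6}}
print(viterbi(["in", "Yangon"], states, start, trans, emit))  # ['O', 'NE']
```

The transition and emission tables here play the role of the parameters the paper estimates from annotated training data.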
This document provides an overview of machine learning methods, including supervised and unsupervised learning. It discusses commonly used machine learning algorithms like support vector machines (SVM), hidden Markov models, decision trees, random forests, Bayesian networks, and neural networks. It also covers datasets, assessment metrics, and caveats to consider when using machine learning.
An efficient algorithm for sequence generation in data miningijcisjournal
Data mining is the method or the activity of analyzing data from different perspectives and summarizing it
into useful information. There are several major data mining techniques that have been developed and are
used in the data mining projects which include association, classification, clustering, sequential patterns,
prediction and decision tree. Among different tasks in data mining, sequential pattern mining is one of the
most important tasks. Sequential pattern mining involves the mining of the subsequences that appear
frequently in a set of sequences. It has a variety of applications in several domains such as the analysis of
customer purchase patterns, protein sequence analysis, DNA analysis, gene sequence analysis, web access
patterns, seismologic data and weather observations. Various models and algorithms have been developed
for the efficient mining of sequential patterns in large amount of data. This research paper analyzes the
efficiency of three sequence generation algorithms namely GSP, SPADE and PrefixSpan on a retail dataset
by applying various performance factors. From the experimental results, it is observed that the PrefixSpan
algorithm is more efficient than other two algorithms.
The document discusses exact comparative motif discovery in monocots. It describes using a Branch Length Speller algorithm to detect sequence motifs across multiple plant genomes in an alignment-free manner. The algorithm uses a statistical validation approach involving control motifs to evaluate motif conservation. It indexes the genome sequences with generalized suffix trees to allow exhaustive motif searching and handles imperfect data robustly. The dataset involves promoters from four monocot species, including maize and rice.
This document describes a study on deep keyphrase generation from scientific papers. The researchers developed a recurrent neural network model with an encoder-decoder architecture and a copy mechanism to generate keyphrases. The model reads the input text, understands the context, summarizes meaningful phrases, and copies relevant phrases from the text. It outperforms previous methods on four benchmark datasets for predicting present keyphrases and shows improved recall for predicting absent keyphrases. The researchers also demonstrate the model can transfer to generating keyphrases for news articles.
Deep Learning for Biomedical Unstructured Time SeriesPetteriTeikariPhD
1D Convolutional neural networks (CNNs) for time series analysis, and inspiration from beyond biomedical field. Short intro for various different steps involved in Time Series Analysis including outlier detection, imputation, denoising, segmentation, classification and forecasting.
Available also from:
https://www.dropbox.com/s/cql2jhrt5mdyxne/timeSeries_deepLearning.pdf?dl=0
To mine out relevant facts at the time of need from web has been a tenuous task. Research on diverse fields
are fine tuning methodologies toward these goals that extracts the best of information relevant to the users
search query. In the proposed methodology discussed in this paper find ways to ease the search complexity
tackling the severe issues hindering the performance of traditional approaches in use. The proposed
methodology find effective means to find all possible semantic relatable frequent sets with FP Growth
algorithm. The outcome of which is the further source of fuel for Bio inspired Fuzzy PSO to find the optimal
attractive points for the web documents to get clustered meeting the requirement of the search query
without losing the relevance.
AUTOMATED INFORMATION RETRIEVAL MODEL USING FP GROWTH BASED FUZZY PARTICLE SW...ijcseit
To mine out relevant facts at the time of need from web has been a tenuous task. Research on diverse fields
are fine tuning methodologies toward these goals that extracts the best of information relevant to the users
search query. In the proposed methodology discussed in this paper find ways to ease the search complexity
tackling the severe issues hindering the performance of traditional approaches in use. The proposed
methodology find effective means to find all possible semantic relatable frequent sets with FP Growth
algorithm. The outcome of which is the further source of fuel for Bio inspired Fuzzy PSO to find the optimal
attractive points for the web documents to get clustered meeting the requirement of the search query
without losing the relevance. On the whole the proposed system optimizes the objective function of
minimizing the intra cluster differences and maximizes the inter cluster distances along with retention of all
possible relationships with the search context intact. The major contribution being the system finds all
possible combinations matching the user search transaction and thereby making the system more
meaningful. These relatable sets form the set of particles for Fuzzy Clustering as well as PSO and thus
being unbiased and maintains a innate behaviour for any number of new additions to follow the herd
behaviour’s evaluations reveals the proposed methodology fares well as an optimized and effective
enhancements over the conventional approaches.
Mining closed sequential patterns in large sequence databasesijdms
Sequential pattern mining is studied widely in the data mining community. Finding sequential patterns is a basic data mining method with broad applications. Closed sequential pattern mining is an important technique among the different types of sequential pattern mining, since it preserves the details of the full pattern set and it is more compact than sequential pattern mining. In this paper, we propose an efficient algorithm CSpan for mining closed sequential patterns. CSpan uses a new pruning method called occurrence checking that allows the early detection of closed sequential patterns during the mining process. Our extensive performance study on various real and synthetic datasets shows that the proposed algorithm CSpan outperforms the CloSpan and a recently proposed algorithm ClaSP by an order of magnitude.
Lecture 9 slides: Machine learning for Protein Structure ...butest
This document introduces machine learning approaches for protein structure prediction. It discusses using machine learning to predict a protein's structure given its sequence by looking at regions of the protein and learning to classify them. Two main questions are addressed: how to describe protein structures and how to train predictors on examples. Common machine learning techniques for this problem include decision trees, neural networks, and logic programs. The importance of testing predictors on unseen data to avoid overfitting is also covered.
This document contains lecture slides about software quality assurance from a course at Auburn University. It discusses definitions of software quality, important quality attributes, approaches to ensuring quality like establishing organizational and project quality standards and plans. It also covers quality assurance activities like reviews, testing, measurement and topics like defect detection and removal, reliability improvements. The slides include descriptions and checklists for different stages of the software development life cycle.
This document discusses two modes of operation for block ciphers: electronic codebook (ECB) mode and cipher block chaining (CBC) mode. ECB encrypts each plaintext block independently with the same key, resulting in the same ciphertext if the same plaintext is repeated. CBC improves on ECB by XORing each plaintext block with the previous ciphertext block before encryption to prevent repetitions in the ciphertext. The document outlines the encryption and decryption processes for ECB and CBC, noting that CBC is more secure for long messages.
This document summarizes key aspects of the Data Encryption Standard (DES) block cipher. It describes how DES operates on 64-bit blocks using a 56-bit key in 16 rounds based on a Feistel network structure. Each round uses 48-bit subkeys generated from the main key. The document also discusses DES modes of operation like ECB, CBC, CFB, OFB and CTR and how they encrypt blocks of plaintext. Finally, it notes NIST's role in establishing encryption standards and the history of DES adoption as a standard in 1977.
The document provides an agenda for an introduction to computers lab covering repetition structures (loops), introduction to Visual Studio, and C++ programs. It discusses while, do-while, and for loops through flowcharts and C++ code examples. It also covers solutions containers, projects containers, program components in Visual Studio, and provides examples of C++ programs for various algorithms including determining if a number is positive/negative, even/odd, in a range, and baby weight classification. Exercises include finding maximum of numbers, GCD calculation, decimal to binary conversion, and factorial calculation.
This document contains 5 problems and solutions related to writing C++ programs. Problem 1 asks to write a program that repeatedly collects positive integers from the user until a negative number is entered, then outputs the product of the positive numbers. Problem 2 asks to write a program that counts the even and odd numbers entered by the user until a negative number is entered. Problem 3 asks to output all divisors of a number entered by the user in ascending order. Problem 4 asks to output the two largest numbers from a list of numbers entered by the user until a negative number is entered. Problem 5 asks to convert a number of seconds after midnight to hours:minutes:seconds format. Solutions in C++ are provided for each problem.
This document contains solutions to exam problems on introductory computer programming concepts. It includes sample pseudocode and flowcharts for programs that:
1) Calculate employee bonuses based on years of service and percentage tables
2) Determine company anniversary and corresponding raise amounts based on number of years in business
3) Calculate the absolute difference between two numbers
4) Calculate net salary for managers versus employees
5) Represent a bank withdrawal process
6) Simulate a doctor's diagnosis based on a child's temperature
This document discusses programming and the program development process. It defines programming as giving a computer instructions to accomplish a task. The typical development process involves two phases: problem solving to design an algorithm of steps to solve the problem, and implementation to code the algorithm into a programming language. The key steps of program development are: 1) analyzing the problem, 2) designing the algorithm, 3) checking the algorithm, 4) coding it into a program, 5) testing the program, and 6) documenting and maintaining the program. Popular programming languages include C++, Visual Basic, C#, Java, and Python.
Cloud computing introduces new security concerns due to multi-tenancy and users losing direct control over their data. The main security factors are confidentiality, integrity, and availability. Traditional threats like authentication issues, DDoS attacks, and SQL injection remain problems but are amplified in cloud environments. Availability of cloud services and third-party control of data are also security risks. Legal protections for cloud users in contracts with cloud service providers are important.
MapReduce is a programming model for processing large datasets in parallel across clusters of machines. It involves splitting the input data into independent chunks which are processed by the "map" step, and then grouping the outputs of the maps together and inputting them to the "reduce" step to produce the final results. The MapReduce paper presented Google's implementation which ran on a large cluster of commodity machines and used the Google File System for fault tolerance. It demonstrated that MapReduce can efficiently process very large amounts of data for applications like search, sorting and counting word frequencies.
The document summarizes two distributed storage systems developed by Google: the Google File System (GFS) and Bigtable. GFS was developed in the late 1990s to provide petabytes of storage for large files across thousands of machines. It uses a master/slave architecture with chunk replication for fault tolerance. Bigtable is a distributed storage system for structured data that scales to petabytes of data and thousands of machines. It uses a table abstraction with rows, columns, and timestamps to store data in a sparse, sorted, multi-dimensional map.
This document discusses improving the performance of MapReduce jobs in heterogeneous computing environments. The original MapReduce scheduler assumes tasks will progress at a constant rate, but in clouds with diverse hardware, tasks can finish at different speeds. The proposed LATE scheduler estimates each task's finish time and prioritizes speculative execution of the task predicted to finish last. This approach cut response times by half compared to the original scheduler in tests on Amazon EC2.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
2. What is Motif finding?
Motif discovery is a primary step in many systems for studying gene function. It plays a vital role in identifying Transcription Factor Binding Sites (TFBSs), which help in understanding the mechanisms that regulate gene expression.
A DNA motif is a short, similar, repeated pattern of nucleotides that has biological meaning.
3. Motif sizes and types
Sequence motifs have a constant size and are often repeated and conserved, but they are also tiny (about 6–12 bp), while the intergenic regions are very long and highly variable, which makes motif discovery a difficult task.
Different types of motifs include planted motifs, structured motifs, sequence motifs, gapped motifs and network motifs.
4. Motif discovery difficulties
As sequencing technology has improved, the volume of biological sequence data in public databases has increased, raising the importance of motif discovery in computer science and molecular biology. Motif discovery has several difficulties:
(1) Motif instances are not identical to each other.
(2) The motif sequence is unknown.
(3) The motif location is unknown.
(4) Random motifs exist.
(5) The location of a motif in each sequence is unrelated to its location in the others.
6. A. Pre-processing
Pre-processing prepares the DNA sequences for accurate motif discovery through assembling and cleaning steps.
In the assembling step, it is advised to select as many target sequences as possible that may contain motifs, keep the sequences as short as possible, and remove sequences that are unlikely to contain any motifs. Assembling is done by clustering the input sequences based on some prior information and then extracting the desired sequences into an appropriate sequence database.
Then, cleaning the input sequences to mask or remove confounding sequences is necessary.
7. B. Discovering
The middle stage is the motif discovery approach, which begins by representing the sequences. There are two ways to represent motifs: consensus strings and Position-specific Weight Matrices (PWMs).
After motif representation, a suitable objective function is determined, and finally an appropriate search algorithm is applied. There are hundreds of algorithms for motif extraction.
11. Types of motif discovery algorithms
There are four principal types of motif discovery algorithms:
1. Enumeration approach.
2. Probabilistic technique.
3. Nature-inspired approach.
4. Combinatorial approach.
12. Enumeration approach
The enumeration approach searches for consensus sequences. It is based on enumerating words and computing word similarities.
This approach is sometimes called the word enumeration approach and is used to solve the Planted (l, d) Motif Problem (PMP), with motif length (l) and a maximum number of mismatches (d).
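A brute-force word-enumeration solver for the Planted (l, d) Motif Problem can be sketched as follows. This is a minimal illustrative sketch (it enumerates the d-neighborhood of every l-mer of the first sequence), not the implementation of any specific tool named in these slides:

```python
from itertools import product  # alphabet handling could also use product

def hamming(a, b):
    # Number of mismatching positions between two equal-length strings
    return sum(x != y for x, y in zip(a, b))

def neighbors(pattern, d, alphabet="ACGT"):
    # All strings within Hamming distance d of `pattern`
    if d == 0:
        return {pattern}
    if not pattern:
        return {""}
    out = set()
    for suffix in neighbors(pattern[1:], d):
        if hamming(pattern[1:], suffix) < d:
            for b in alphabet:       # still have mismatch budget left
                out.add(b + suffix)
        else:                        # budget used up: keep first letter
            out.add(pattern[0] + suffix)
    return out

def planted_motifs(seqs, l, d):
    # An (l, d) motif must lie within distance d of some l-mer of EVERY sequence
    found = set()
    first = seqs[0]
    for i in range(len(first) - l + 1):
        for cand in neighbors(first[i:i + l], d):
            if all(any(hamming(cand, s[j:j + l]) <= d
                       for j in range(len(s) - l + 1)) for s in seqs):
                found.add(cand)
    return found
```

The exhaustive neighborhood generation is exactly why the slides below list "exhaustive long processing time" as a disadvantage of this approach.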
13. Enumerative approach adv. & dis.
Disadvantages
Advantages
(1) It needs to be post-processed with
some clustering systems as the
typical transcription factor motifs
often have several weak
constrained positions.
(2) producing too many spurious
motifs.
(3) It represents certain numbers of
motifs.
(4) Exhaustive Long time processing.
(5) it is very slow, and requires a lot of
parameters.
(6) The degenerative positions are
limited because of restricted
representation of motifs.
(1) Global optimality.
(2) It is appropriate for short motifs;
therefore, it is useful for motif
finding in eukaryotic genomes.
(3) With optimized data structures, it
becomes fast.
(4) It is the only technique that ensures
to find all motifs (Except weak
motifs).
14. Probability approach
Advantages
(1) It overcomes many weak points of the enumerative approach, such as speed, dealing with long motifs and big data, the number of required parameters, and degenerate positions.
(2) It constructs a probabilistic model called a Position-Specific Weight Matrix (PSWM).
(3) It can find weak motifs.
(4) It is more appropriate for prokaryotes.
Disadvantages
(1) It is a complex concept.
(2) It is not guaranteed to find globally optimal solutions, since it employs some form of local search such as Gibbs sampling or Expectation Maximization (EM), so it cannot find all motifs.
15. Nature-inspired approach
Advantages
(1) It is a simple concept and a global search, so it combines the main features of the first two approaches.
(2) It solves complex and dynamic problems in reasonable time and at optimal cost.
(3) It can deal with big data and long motifs.
(4) It simulates the behavior of insects or other animals for problem-solving.
(5) It has a flexible representation of motifs, which leads to an unlimited number of degenerate positions.
(6) It provides flexibility in evaluating solutions by using fitness functions that score them.
Disadvantages
(1) It is harder to code.
(2) Premature convergence can happen.
(3) There is a possibility of trapping into local optima.
(4) It has a lower precision factor.
16. The combinatorial approach
(1) It mixes multiple algorithms together.
(2) Its ability depends on the hybrid algorithms that are combined to form the required algorithm.
(3) The common feature of all such algorithms is the flexibility of the objective function.
19. 1. Simple (YMF)
YMF (Yeast Motif Finder) algorithm.
Advantages
1. YMF detects short motifs with a small number of degenerate positions in yeast genomes using a consensus representation.
2. YMF enumerates all motifs in the search space.
3. It calculates a z-score and reports the motifs with the greatest z-scores.
Disadvantages
1. It cannot detect motifs of large length, or motifs where the number of degenerate positions is significant.
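The z-score ranking YMF uses can be sketched as below. The expected counts and standard deviations here are illustrative placeholders (YMF derives them from a background model of the yeast genome), so treat this as a minimal sketch of the scoring idea, not YMF's actual statistics:

```python
def motif_zscore(observed, expected, std_dev):
    # z = (observed count - expected count) / standard deviation
    return (observed - expected) / std_dev

def top_motifs(counts, background, k=3):
    # counts: motif -> observed count
    # background: motif -> (expected count, standard deviation)
    scored = {m: motif_zscore(c, *background[m]) for m, c in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

A motif that is common in absolute terms but also expected to be common under the background (like a poly-A run) gets a low z-score, while a modestly frequent but unexpected word ranks highly.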
20. 1. Simple (DREME)
DREME algorithm also calculates the significance of motifs using Fisher’s Exact
test.
It starts with generating a set of short k-mers, followed by applying Fisher’s
Exact test on two sets of DNA sequences (Input set and background set) using
a significance threshold to calculate the significance of each word of length 3
to 8 that occurs in the positive sequences and select the 100 most significant
words for being used in the inner loop where they passed as “seed” REs to
perform a beam search that determines the most significant generalizations
of them (One wildcard).
To find multiple, non-redundant motifs in a set of sequences, outer loop
determines the most significant motif using the heuristic search of RE motifs
and the best motif found replaces its occurrences by a special letter; then the
search process is repeated again many times until E-value of the new motif is
less than the determined significance threshold.
DREME is compared to MEME algorithm and the results show that DREME
algorithm can correctly predict motifs on ChIPseq experiment sequences in a
shorter runtime than MEME.
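The one-sided Fisher's Exact test that DREME applies to a word's 2x2 contingency table (positive vs. background sequences, with vs. without the word) can be computed directly from the hypergeometric distribution. The counts below are hypothetical; this is a sketch of the test itself, not of DREME's internals:

```python
from math import comb

def fisher_exact_enrichment(a, b, c, d):
    """One-sided Fisher's exact test p-value for enrichment.

    a: positive sequences containing the word
    b: positive sequences without the word
    c: background sequences containing the word
    d: background sequences without the word
    """
    n = a + b + c + d
    draws, successes = a + b, a + c
    denom = comb(n, draws)
    # P(X >= a) under the hypergeometric null distribution
    return sum(comb(successes, k) * comb(n - successes, draws - k)
               for k in range(a, min(draws, successes) + 1)) / denom
```

A small p-value means the word appears in the positive set more often than the background would predict, which is what qualifies it as a seed RE.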
21. 2. Clustering-based method (CisFinder)
Instead of using two loops for finding multiple motifs,
this second class was proposed as the small
enhancement of simple-based method.
CisFinder algorithm
Disadvantages
Advantages
1. it does not support outputting motifs
of a specified length.
1. CisFinder detect short motif with high
processing speed in large sequences.
2. CisFinder technique extended to deal
with whole ChIP-seq peak data sets.
3. CisFinder can accurately identify PFMs
(position frequency matrices) of TF
binding motifs and it is faster than
MEME, WeederH, and RSAT.
4. CisFinder can find motifs with a low
level of enrichment.
22. 2. Clustering-based method (CisFinder)
Firstly, one should define nucleotide substitution
matrix for each n-mer word, then calculate
Position Frequency Matrices (PFMs) for n-mer
word counts with and without gaps in both test
and control sets.
To generate non-redundant motifs, PFMs are
extended over flanking and gap regions
followed by means of clustering.
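A PFM is just a per-column count of bases across aligned motif occurrences. A minimal sketch (the occurrence list is made-up; CisFinder builds its PFMs from n-mer counts in the test and control sets):

```python
def position_frequency_matrix(occurrences):
    # Count each nucleotide per column across aligned motif occurrences
    w = len(occurrences[0])
    pfm = [{b: 0 for b in "ACGT"} for _ in range(w)]
    for occ in occurrences:
        for j, base in enumerate(occ):
            pfm[j][base] += 1
    return pfm
```

Dividing each column by the number of occurrences (optionally with pseudocounts) turns this count matrix into the probability-based PWM used later in the probabilistic approaches.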
23. 3. Tree-based method (Weeder algorithm)
Weeder algorithm accelerates the word enumeration
technique. It based on count matching patterns with
specific and most extreme mismatches.
At first, the motifs are represented using consensus
sequence and based on the difference between the k-
mers of the input sequences and the consensus under a
limited number of substitutions, k-mers are assembled
and each group is evaluated with a specific measure of
significance.
Weeder algorithm
Disadvantages
Advantages
1. it operates with a low efficiency for
long motifs.
1. Weeder algorithm accelerates the
word enumeration technique by using a
suffix tree.
25. 3. Tree-based method (FMotif )
The constructed suffix tree was proposed in FMotif
algorithm for finding long (l, d) motifs in large DNA
sequences under the ZOMOPS (Zero, one or
multiple occurrence(s) of the motif instance(s) per
sequence) constraints.
This proposed tree is faster than standard suffix
tree as it avoids a large number of repeated scans
of sequences on the suffix tree.
26. 3. Tree-based method (MCES)
MCES algorithm is proposed for a PMP (Planted Motif
Problem) that used both suffix tree and parallel processing
to deal with large datasets.
MCES algorithm starts with mining step that constructs the
Suffix Array (SA) and the Longest Common Prefix array
(LCP) for the input datasets.
At that point, combining step clusters substrings of various
lengths to get predicted motifs.
MCES has many advantages like:
1. identifying motifs without OOPS constraint.
2. handling very large data sets, handling the full-size input
sequences and making full use of the information contained
in the sequences.
3. completing the computation with a good time performance
and good identification accuracy.
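The SA and LCP structures MCES builds in its mining step can be illustrated with a naive construction. This is a teaching sketch only; production implementations (including MCES) use much faster linear-time algorithms:

```python
def suffix_array(s):
    # Naive construction: sort the suffix start positions lexicographically
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # LCP value between each pair of adjacent suffixes in suffix-array order
    def lcp(i, j):
        k = 0
        while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
            k += 1
        return k
    return [lcp(sa[i], sa[i + 1]) for i in range(len(sa) - 1)]
```

Runs of high LCP values mark repeated substrings, which is exactly the signal a combining step can cluster into candidate motifs.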
27. 4. Graph theoretic-based method
The graph-theoretic method represents a motif instance,
as a clique to facilitate the search strategy.
the graph G is built by representing each l-mer in the
input sequences by vertex and the edge between a
pair of vertices representing a pair of l-mer in different
input sequences having the Hamming distance between
the substrings which is less than or equal to 2d.
Cliques of size N are searched for in this graph.
Popular graph-theoretic methods are WINNOWER,
Pruner, and cWINNOWER.
28. 4. Graph theoretic-based method
The Hamming distance between:
"karolin" and "kathrin" is 3.
"karolin" and "kerstin" is 3.
1011101 and 1001001 is 2.
2173896 and 2233796 is 3.
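The Hamming distance used for the edge test above is straightforward to compute; the examples from this slide serve as a check:

```python
def hamming(a, b):
    # Number of positions at which the corresponding symbols differ
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))
```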
29. 5. Hashing-based method
It is a random projection algorithm for a PMP
(Planted Motif Problem) that projects every l-mer in
the input data into a smaller space by hashing.
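The projection step can be sketched as follows: hash every l-mer by the letters at k randomly chosen positions, so occurrences of the same planted motif tend to collide in the same bucket even when they differ at the unprojected positions. A minimal sketch of the bucketing only (the real algorithm repeats this with many projections and refines crowded buckets):

```python
import random
from collections import defaultdict

def random_projection_buckets(seqs, l, k, seed=0):
    # Project each l-mer onto k randomly chosen positions and bucket by the
    # resulting k-letter key; bucket members are candidate motif occurrences
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for si, s in enumerate(seqs):
        for i in range(len(s) - l + 1):
            lmer = s[i:i + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((si, i))
    return buckets
```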
30. 6. Fixed candidate & modified candidate-based methods
In both fixed-candidate and modified-candidate techniques, the algorithm scans all input sequences to find the matched motifs.
The fixed-candidate class selects candidate motifs from the input sequences and uses them for motif scanning. The modified-candidate class selects one candidate from the input sequence and modifies it letter by letter.
31. Other
Finally, there is a proposed algorithm called RefSelect that selects reference sequences for the PMP. The reference sequences are the sequences that do not contain motif instances. This method tries to select reference sequences that generate as few candidate motifs as possible.
The algorithm consists of two steps. First, for every two sequences in the input dataset D, the number of candidate motifs generated from them is computed using the Hamming distance between every two l-mers. Then, the set generating as few candidate motifs as possible is selected as the reference set.
33. B. Probabilistic approach
In the probabilistic approach, the probabilities of each nucleotide base being present at each position of the sequence are multiplied to yield the probability of the sequence.
The PWM is an appealing model due to its simplicity and wide application, and it can represent an infinite number of motifs, but it has some problems:
(1) It scales poorly with dataset size.
(2) The PWM representation assumes the independence of each position within a binding site, which may not be true in reality.
(3) It converges to a locally optimal solution.
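The "multiply per-position probabilities" scoring described above is a one-liner given a PWM; the two-column PWM here is an arbitrary made-up example:

```python
def sequence_probability(seq, pwm):
    # Multiply the per-position base probabilities given by the PWM columns
    p = 1.0
    for j, base in enumerate(seq):
        p *= pwm[j][base]
    return p

# Hypothetical 2-position PWM: strongly prefers "A" then "T"
example_pwm = [
    {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
```

Note that this product is exactly where the independence assumption in problem (2) enters: each column contributes its factor regardless of the neighboring bases.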
34. 1. Deterministic approach (EM)
EM for motif finding consists of two main steps, the
first called “Expectation step” that estimates the
values of some set of unknowns based on a set of
parameters.
The second step is “Maximization step” that uses
those estimated values to refine the parameters
over several iterations.
EM is used to identify conserved areas in unaligned
DNA and proteins with an assumption that each
sequence must contain one common site.
35. 1. Deterministic approach (EM)
The parameters are the entries in the PWM and the
background nucleotide probabilities while our
unknowns are the scores for each possible motif
position in all of the sequences.
it has some limitations:
(1) It converges to a local maximum.
(2) It is extremely sensitive to initial conditions.
(3) It assumes one motif per sequence.
(4) The running time of EM is linear with the length of
the input sequences.
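The two EM steps can be sketched for the one-motif-per-sequence case. This is a simplified sketch (uniform background and prior, pseudocounts for smoothing; the sequences and the seeded starting PWM in the test are made-up), not any tool's exact implementation:

```python
def e_step(seqs, pwm, w):
    # Expectation: posterior probability of each motif start position
    # in each sequence, given the current PWM (uniform prior assumed)
    Z = []
    for s in seqs:
        liks = []
        for i in range(len(s) - w + 1):
            p = 1.0
            for j in range(w):
                p *= pwm[j][s[i + j]]
            liks.append(p)
        tot = sum(liks)
        Z.append([p / tot for p in liks])
    return Z

def m_step(seqs, Z, w, pseudo=0.1):
    # Maximization: re-estimate the PWM from position-weighted base counts
    pwm = [{b: pseudo for b in "ACGT"} for _ in range(w)]
    for s, z in zip(seqs, Z):
        for i, zi in enumerate(z):
            for j in range(w):
                pwm[j][s[i + j]] += zi
    for col in pwm:
        tot = sum(col.values())
        for b in col:
            col[b] /= tot
    return pwm

def em_motif(seqs, init_pwm, w, iters=20):
    pwm = init_pwm
    for _ in range(iters):
        pwm = m_step(seqs, e_step(seqs, pwm, w), w)
    return pwm, e_step(seqs, pwm, w)
```

Limitation (2) above is visible here: the fixed point EM converges to depends entirely on `init_pwm`.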
39. 1. Deterministic approach (MEME)
MEME (Multiple EM for Motif Elicitation) is a popular motif
recognition program that optimizes PWMs using the EM
algorithm and it has several versions.
The idea is to find an initial motif and then use expectation
and maximization steps to improve the motif until the values
in the PWM do not improve or the maximum number of
iterations is reached.
The MEME algorithm starts from a single site, i.e. k-mer
(Random or specified) and estimates motif model (PWM).
For every possible location in every input sequence, the
probability, given the PWM model, should be identified to
detect examples of the model; then, the motif model should
be re-estimated by calculating a new PWM.
40. 1. Deterministic approach (STEME) &(EXTREME)
STEME (Suffix Tree EM for Motif Elicitation) algorithm
developed to accelerate the MEME algorithm and to be
the first application of suffix trees to EM algorithm.
it finds motifs up to width 8 as its efficiency decreases
quickly as the motif width increases.
EXTREME (Online EM algorithm for motif discovery)
algorithm developed also to accelerate the MEME
algorithm.
EXTREME is an online web server that implements the
MEME algorithm and can work on a large dataset
without discarding any sequences.
EXTREME achieves better running time, but at the same
time, it requires too much storage space for processing
large data.
41. 2. Stochastic approach (Gibbs sampling)
Gibbs sampling is a famous stochastic approach, similar to EM
algorithm. Pseudocode of the Gibbs sampling algorithm for motif
detection follows these steps:
1. Random initializing of motif positions in the input N sequences with
an assumption of the presence of one motif per sequence.
2. Choosing one sequence at random.
3. Computing PWM for the other N-1 sequences using starting positions
of motifs and background probabilities for each base using the non-
motif positions.
4. Calculating probability of each possible motif location in the
removed sequence using PWM and background probabilities.
5. For the removed sequence, choosing a new starting position based
on step 4.
Steps 2–5 should be iterated until the values in the PWM do not
improve or the maximum number of iterations has been reached.
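The five steps above can be sketched directly. This is a simplified sketch (it omits the explicit background model of step 3 and stops after a fixed number of iterations; pseudocounts keep every position sampleable), not the implementation of any of the tools below:

```python
import random

def build_pwm(seqs, starts, w, skip, pseudo=1.0):
    # Step 3: PWM from the motif occurrences in all sequences except `skip`
    pwm = [{b: pseudo for b in "ACGT"} for _ in range(w)]
    for idx, (s, st) in enumerate(zip(seqs, starts)):
        if idx == skip:
            continue
        for j in range(w):
            pwm[j][s[st + j]] += 1
    for col in pwm:
        tot = sum(col.values())
        for b in col:
            col[b] /= tot
    return pwm

def gibbs_motif(seqs, w, iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial motif start in each sequence
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(len(seqs))          # step 2: pick one sequence
        pwm = build_pwm(seqs, starts, w, i)
        s = seqs[i]
        weights = []
        for st in range(len(s) - w + 1):      # step 4: score each position
            p = 1.0
            for j in range(w):
                p *= pwm[j][s[st + j]]
            weights.append(p)
        # step 5: sample a new start for the held-out sequence
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

Unlike EM's deterministic argmax-style updates, step 5 samples the new position, which is what lets Gibbs sampling sometimes escape local optima.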
42. 2. Stochastic approach (Align ACE)
Many methods have been developed that implement the
concept of Gibbs’ sampling to extend its functionality like
Align ACE (Aligns Nucleic Acid Conserved Elements)
algorithm which based on Gibbs’ sampling with some
improvements:
(1) The motif model was changed to fit the source genome
because the base frequencies for non-site sequence is
fixed.
(2) Both strands of the input sequence are considered and no
circumstance overlapping is allowed.
(3) Iteratively, aligned sites were masked out to find multiple
different motifs.
(4) It uses an improved near-optimum sampling method.
43. 2. Stochastic approach (BioProspector)
BioProspector algorithm is also based on Gibbs’
sampling with several improvements:
(1) It uses a Mar-kov model estimated from all
promoter noncoding sequences to represent the
non-motif background in order to improve the
motif specificity.
(2) It can find two-block motifs with variable gap.
(3) Sampling with two thresholds allows every input
sequence to include zero or multiple copies of the
motif.
44. 3. Advanced approach (LOGOS)
Different algorithms were proposed based on
Bayesian approach with Markov chain Monte Carlo.
LOGOS (Integrated LOcal and GlObal motif
sequence model) algorithm combines between
HMDM (Hidden Markov Dirichlet-Multinomial) for
local alignment model for each different motif and
HMM (Hidden Markov model) for global motif
distribution model for the occurrence of multiple
motifs.
45. 3. Advanced approach (BaMM)
BaMM (Bayesian Markov Model)
BaMM algorithm
Disadvantages
Advantages
1. It is more complex than PWMs.
2. It uses EM algorithm which has
some disadvantages as described
before.
3. The running time is longer than
EM algorithm which makes it
difficult to apply on a big data
set.
4. It defines ZOOPS model only.
5. It requires motif length as an
input parameter.
1. It makes optimal use of the
available information.
2. It avoids training by the
decrease in number of
parameters.
3. BaMM was tested on human
datasets, the results show that the
precision increases by 30–40%
compared to PWM.
46. 4. Others (EPP)
EPP (Entropy-based position projection) algorithm
was proposed to escape from local optima.
This algorithm based on projection process
depends on the relative entropy in each position of
motif instead of random projection.
EPP algorithm can be applied to OOPS, ZOOPS,
and TCM sequence models and the results indicate
that it can efficiently and effectively recognize
motifs.
47. C. Nature-inspired algorithms
Evolutionary algorithms are recognized for combining local search with global search.
Nature-inspired algorithms are classified by their source of inspiration into three main categories: swarm intelligence, non-swarm intelligence, and physics/chemistry.
The most popular evolutionary algorithms used in motif discovery are GA and PSO; others include ABC, CS, and ACO.
49. 1. GA (Genetic Algorithm)
GA is a probabilistic optimization algorithm based on evolutionary computing, inspired by biological evolution processes such as selection, crossover, and mutation.
The motivation for using GA is to reduce the number of searches over a large number of DNA sequences.
The basic structure of GA evolves a population of candidate solutions through several generations to find the best solution or a set of possible solutions.
50. 1. GA
The algorithm starts by randomly generating individuals, which are then evaluated by a fitness function.
A selection process then produces new individuals, called offspring, that inherit features of their parents, while the others are discarded; the genetic operators are then applied to the offspring.
Parents are selected by a probabilistic process biased by their fitness values, using selection techniques that keep population diversity high and prevent premature convergence on poor solutions.
51. 1. GA
After the two parents are selected, crossover picks a random site and swaps the rightmost strings to produce new children, followed by mutation, which changes the value at some selected positions.
The process is repeated iteratively until a stopping criterion or a satisfactory fitness level is reached.
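The GA loop described on these slides can be sketched on the toy OneMax problem (fitness = number of 1-bits); all names and parameter values below are illustrative choices, not tied to any motif-finding tool.

```python
import random

def genetic_algorithm(fitness, length=8, pop_size=20, generations=60,
                      crossover_rate=0.9, mutation_rate=0.05, seed=1):
    """Minimal GA: fitness-biased (tournament) selection, one-point
    crossover that swaps the rightmost strings, and bit-flip mutation."""
    rng = random.Random(seed)
    # random generation of individuals
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        next_pop = [max(pop, key=fitness)]       # elitism: keep the best parent
        while len(next_pop) < pop_size:
            # probabilistic, fitness-biased selection of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            if rng.random() < crossover_rate:    # crossover at a random site
                cut = rng.randrange(1, length)
                child = p1[:cut] + p2[cut:]      # swap the rightmost strings
            else:
                child = p1[:]
            # mutation: change the value at some randomly selected positions
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

# OneMax toy problem: fitness is the number of 1-bits; the optimum is all ones.
best = genetic_algorithm(sum)
```

In a motif-finding setting the individuals would instead encode candidate start positions or consensus strings, and the fitness function would be an alignment score.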
53. 1. GA (FMGA & MDGA)
FMGA: Fast Messy Genetic Algorithm.
Its genetic operators are specialized for the motif discovery problem: the mutation operator uses a PWM to preserve completely conserved positions, and the crossover operator is implemented with specially designed gap penalties that optimize a scoring function.
FMGA performs better than the MEME and Gibbs sampler algorithms.
The MDGA algorithm is also based on a simple GA. Compared with a Gibbs sampling algorithm on real datasets, it achieves higher accuracy in a short computation time, and its computation time does not explicitly depend on the sequence length.
54. 1. GA (Population clustering technique)
A clustering scheme makes it possible to retain population diversity over the generations and to find various motifs; solutions are grouped by similarity using data clustering algorithms.
The population clustering technique is used for the PMP (planted motif problem) to detect multiple and weak motifs.
First, the population is initialized by randomly selecting subsequences of motif length to form a candidate consensus motif; all input sequences are then scanned to detect similar substrings, which are sorted by their number of mismatches from the candidate motif.
Next, a scoring function is applied to them (a new scoring function that takes into account factors such as the number of mutations and the number of motifs per sequence).
Finally, the fitness of each individual (cluster) is calculated to select the parents used in the GA.
55. 1. GA (GARP, GAEM)
The GARP algorithm optimizes GA with the random projection strategy (RPS) to identify planted (l, d) motifs.
The idea behind applying RPS before GA is to find good starting positions to use as the initial population of a simple GA, instead of a random population.
Among hybrid methods, this algorithm outperforms the projection algorithm when tested on both simulated and real biological data.
The GAEM algorithm combines GA and EM for the planted edited (l, d)-motif finding problem.
The EM algorithm is applied after the random initial population to obtain the best starting positions, which are used as a seed for the GA.
56. 1. GA (GAME)
GAME (Genetic Algorithm for Motif Elicitation) proposes two operators, ADJUST and SHIFT, to escape from local optima.
The ADJUST operator escapes a local optimum that occurs when any of the motif sites has not been aligned correctly, by checking every possible site position in a sequence and choosing the best match to the sites in the other sequences.
The SHIFT operator escapes local optima that occur when all motif sites are slightly misaligned, by shifting the subsequences in the direction that gives the best fitness.
A new operator, ADDITION, complements the standard GA operators: it starts with a motif length of three and is applied until the optimal motif reaches the required length.
GAME achieves a higher score than three other methods (Gibbs sampler, GA, and GARPS) and performs better than MEME and BioProspector on simulated and real data.
57. 2. PSO
PSO is a global optimization technique for solving continuous optimization problems, characterized by simple computations and information sharing within the algorithm.
It is an exploration-exploitation trade-off:
Exploration is the ability to reach the global optimum by searching various regions.
Exploitation is the ability to locate the optimum by concentrating the search around a promising candidate solution.
PSO is conceptually simple and easily programmable, but it can easily fall into a local optimum and has a low convergence rate.
It simulates the social behavior of organisms moving in flocks of birds or schools of fish to find food sources and defend against predators.
Each particle uses its own flying experience and the flying experience of other particles to adjust its "flying", combining self-experience with social experience.
For its self-experience, each particle stores the best position it has found so far.
58. 2. PSO
The best solution a particle has visited so far is stored in its memory as pbest, and the particle is attracted towards this solution as it navigates the solution search space.
Social experience provides the global best position: the best solution visited by any particle, called gbest, towards which particles are also attracted.
For an n-dimensional search space, the position and velocity of the ith particle are represented by Yi = (yi1, yi2, …, yin) and Vi = (vi1, vi2, …, vin), respectively.
The previous best position is denoted Pi = (pi1, pi2, …, pin), and Pg is the global best particle in the swarm.
For the swarm S, the new velocity of each particle is calculated according to the following equation:
vin(t+1) = vin(t) + c1 r1 (pin − yin) + c2 r2 (pgn − yin)
59. 2. PSO
The position is updated using:
yin(t+1) = yin(t) + vin(t+1)
where i = 1, 2, …, S is the particle index and n = 1, 2, …, N is the dimension; c1 and c2 are the cognitive and social scaling parameters, respectively.
At each iteration, pbest and gbest are updated for every particle according to their fitness values.
The procedure is repeated until a stopping criterion or a satisfactory fitness level is reached.
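The two update equations can be sketched directly in Python on a toy sphere function. The inertia weight w = 0.7 is an added assumption for stable convergence (the slide's velocity equation uses an implicit weight of 1); all other names and parameter values are illustrative.

```python
import random

def pso(fitness, dim=2, swarm=20, iters=100, c1=1.5, c2=1.5, w=0.7, seed=0):
    """Minimal PSO: v <- w*v + c1*r1*(pbest - y) + c2*r2*(gbest - y); y <- y + v.
    The inertia weight w is an assumption added for stable convergence."""
    rng = random.Random(seed)
    Y = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(swarm)]
    V = [[0.0] * dim for _ in range(swarm)]
    P = [y[:] for y in Y]                       # pbest: best position per particle
    g = min(P, key=fitness)[:]                  # gbest: best position in the swarm
    for _ in range(iters):
        for i in range(swarm):
            for n in range(dim):
                r1, r2 = rng.random(), rng.random()
                V[i][n] = (w * V[i][n]
                           + c1 * r1 * (P[i][n] - Y[i][n])
                           + c2 * r2 * (g[n] - Y[i][n]))
                Y[i][n] += V[i][n]              # y(t+1) = y(t) + v(t+1)
            if fitness(Y[i]) < fitness(P[i]):   # update pbest by fitness
                P[i] = Y[i][:]
                if fitness(P[i]) < fitness(g):  # update gbest by fitness
                    g = P[i][:]
    return g

sphere = lambda y: sum(x * x for x in y)        # toy minimization problem
best = pso(sphere)
```

For motif finding the particles would encode candidate start positions, with an alignment score as the fitness function.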
61. 2. Modified PSO
PSO has wide applications and has proven effective in motif finding problems.
In recent years, a few studies have utilized PSO to solve different types of motif finding problems.
The modified PSO algorithm adjusts the position and velocity updates to escape from local optima.
A hybrid motif discovery approach combines PSO and the EM algorithm, where PSO generates a seed for the EM algorithm.
62. 2. PSO (PMbPSO)
The PMbPSO (PSO-based algorithm for Planted Motif Finding) algorithm randomly selects initial positions for all motifs, generates ten children for each parent (motif), and computes the fitness function for each parent and its children to get the best position.
The best position over all particles is then determined, and the velocity and position of each particle are updated over a number of iterations.
63. 2. PSO (LPBS & GSA-PSO)
Linear PSO integrated with a binary search technique (LPBS) minimizes execution time and increases validity in motif discovery for the DNA sequences of specific species.
The LPBS algorithm initializes the population by selecting the target motif from the reference set and searching for similar motifs with binary search, then applies the standard PSO algorithm to discover motifs.
Recently, a hybrid GSA-PSO algorithm was introduced to combine the local search capabilities of the Gravitational Search Algorithm (GSA) with the global search capabilities of PSO.
64. GA vs. PSO: algorithm differences
GA and PSO are the best-known evolutionary algorithms. Their main differences:
GA:
1. GA is a discrete technique.
2. GA changes the population from one generation to the next.
3. GA has genetic operators.
PSO:
1. PSO is a continuous technique that must be modified to handle discrete design variables.
2. PSO keeps the same population.
3. PSO has no genetic operators and no notion of "survival of the fittest".
65. 3. ABC algorithm
The ABC algorithm is a swarm-based algorithm that simulates the behavior of honey bees finding food sources.
Two fundamental properties that produce swarm-intelligent behavior in honey bee colonies are self-organization and division of labor.
The bee colony contains two main groups: employed and unemployed foragers. Employed bees go to food sources they have visited previously and are responsible for informing the unemployed foragers about the quality of their assigned nectar supply.
Many factors determine the value of a food source, such as its distance from the nest, its energy concentration, and the difficulty of extracting that energy.
Unemployed bees are categorized into scout and onlooker bees. A scout bee searches randomly around the nest for new food sources, while an onlooker bee uses the information shared by employed foragers to choose a food source.
66. 3. ABC algorithm
The number of employed bees equals the number of food sources around the hive.
At first, scout bees initialize all positions of the food sources, which represent possible solutions to the problem.
Then, employed and onlooker bees exploit the nectar of the food sources, which corresponds to the quality (fitness) of the associated solutions, and this continual exploitation eventually exhausts them.
An employed bee whose food source becomes exhausted turns into a scout bee looking for new food sources.
67. 3. ABC algorithm
The ABC algorithm involves four fundamental phases. A neighbor food source FS is determined by the following equation:
FSi = fi + rand(−1, 1) × (fi − fk)
where i is a randomly selected parameter index, fk is a randomly selected food source, and rand(−1, 1) is a random number between −1 and 1.
The fitness of this new food source is calculated and a greedy selection is applied between it and its parent, i.e. between FS and F.
From that point, employed bees share information about their food sources with the onlooker bees waiting inside the hive by dancing on the dancing area.
The consensus ABC algorithm achieves the highest similarity values for motifs of different lengths; it uses a neighbor selection strategy based on similarity values between consensus sequences, which improves the probability of selecting a good food source.
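The neighbor equation plus greedy selection can be sketched as a single employed-bee pass; the toy fitness function and parameter choices below are illustrative assumptions, not the consensus ABC algorithm itself.

```python
import random

def abc_neighbor_step(foods, fitness, rng):
    """One employed-bee pass: for each food source f, propose a neighbor with
    FS[d] = f[d] + rand(-1, 1) * (f[d] - partner[d]) in one random parameter d,
    then keep the better of parent and neighbor (greedy selection)."""
    for i, f in enumerate(foods):
        k = rng.choice([j for j in range(len(foods)) if j != i])  # random partner fk
        d = rng.randrange(len(f))                                 # random parameter index
        neighbor = f[:]
        neighbor[d] = f[d] + rng.uniform(-1, 1) * (f[d] - foods[k][d])
        if fitness(neighbor) > fitness(f):                        # greedy selection
            foods[i] = neighbor

rng = random.Random(0)
fitness = lambda p: -(p[0] ** 2 + p[1] ** 2)   # toy problem: optimum at the origin
foods = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(10)]
start = max(fitness(f) for f in foods)
for _ in range(200):                            # repeated exploitation of the sources
    abc_neighbor_step(foods, fitness, rng)
end = max(fitness(f) for f in foods)
```

Greedy selection guarantees the quality of each food source never decreases, which is why the exploitation phase monotonically improves the swarm.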
70. 3. Multiobjective ABC algorithm
A multiobjective ABC algorithm adapts the ABC algorithm to the multiobjective context.
Multiobjective optimization optimizes more than one objective function at the same time to obtain a set of optimal solutions known as the Pareto set.
71. 3. Multiobjective ABC algorithm (MO-ABC/DE)
The Multiobjective Artificial Bee Colony with Differential Evolution algorithm (MO-ABC/DE) combines the general scheme of ABC with Differential Evolution.
Recently, the consensus ABC algorithm was developed as a discrete model of ABC based on similarity values between consensus sequences.
The algorithm starts with random initialization, then calculates the similarity value of the resulting consensus sequence, followed by a new neighborhood selection method based on the similarity values of the consensus sequences.
72. 4. ACO algorithm
The ACO (Ant Colony Optimization) algorithm is a metaheuristic optimization technique that mimics the behavior of real ants, which find the shortest path from their nest to food.
The ants randomly explore the area surrounding their nest and, while moving, leave a chemical pheromone trail on the ground that helps them find their way back.
Ants interact with each other through this chemical component.
The quantity of pheromone is proportional to the quantity and quality of the food, and the pheromone guides other ants to the food source.
Why do ants prefer the shortest path? Pheromone evaporates over time, reducing its attractive strength; ants complete a shorter path more quickly, so its pheromone is replenished faster than it evaporates and stays more concentrated than on longer paths.
Ants choose the paths with the strongest pheromone concentrations.
74. 4. ACO algorithm
The characteristics of the ACO algorithm are:
(1) It depends on two variables: the amount and the evaporation of pheromone. The amount of pheromone is directly proportional to the richness of the food and inversely proportional to evaporation.
(2) Ants act concurrently and independently.
(3) The interaction between ants is indirect.
(4) Ants can explore vast areas without a global view of the ground.
(5) The starting point is selected at random.
75. 4. ACO algorithm
Many methods use the ACO algorithm in motif discovery, and some combine it with a Gibbs sampling algorithm: ACO finds better starting positions in the sequences, which are given to the Gibbs sampler instead of random initialization.
The algorithm starts with each ant choosing a path to construct a sample of motif length m, guided by pheromone probabilities.
Each ant compares its selected sample m with every substring of the input sequences to obtain the set of best-matching substrings.
Next, the fitness function of each selected set is calculated.
The amount of pheromone is then updated, and the process iterates until no further change occurs.
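The pheromone-guided choice of starting positions can be illustrated with a toy ACO over candidate motif start positions. The column-agreement scoring, parameter values, and planted example below are illustrative assumptions, not the published ACO-Gibbs method.

```python
import random

def aco_motif_starts(seqs, m, ants=10, iters=60, rho=0.1, seed=0):
    """Toy ACO over motif start positions: each ant picks one start per sequence
    with probability proportional to pheromone, alignments are scored by column
    agreement, good choices receive pheromone, and pheromone evaporates."""
    rng = random.Random(seed)
    pher = [[1.0] * (len(s) - m + 1) for s in seqs]   # pheromone per start position

    def score(starts):  # sum over columns of the most frequent base's count
        cols = zip(*(s[p:p + m] for s, p in zip(seqs, starts)))
        return sum(max(col.count(b) for b in "ACGT") for col in cols)

    best, best_score = None, -1
    for _ in range(iters):
        for _ in range(ants):
            starts = [rng.choices(range(len(row)), weights=row)[0] for row in pher]
            sc = score(starts)
            if sc > best_score:
                best, best_score = starts, sc
            for s, p in enumerate(starts):            # deposit pheromone
                pher[s][p] += sc / (len(seqs) * m)
        for row in pher:                              # evaporation
            for p in range(len(row)):
                row[p] *= 1 - rho
    return best, best_score

# "ACGTGC" is planted below; a perfect alignment scores 3 sequences * 6 columns = 18.
seqs = ["TTTTACGTGCAA", "GGACGTGCTTTT", "CCCCCACGTGCG"]
best, sc = aco_motif_starts(seqs, m=6)
```

The returned start positions could then seed a Gibbs sampler in place of random initialization, as the slide describes.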
76. 4. ACO algorithm Adv. & Disadv.
ACO algorithm
Advantages:
1. It usually avoids false convergence due to distributed computation.
2. It can be used in dynamic applications.
Disadvantages:
1. It can easily get trapped in a local optimum.
2. The time to converge is uncertain, i.e. it may take a short or long time.
3. The code is not obvious.
4. The ants must visit all points to get a good result, which is unsuitable for motif discovery since checking every substring in the input sequences takes a long time.
5. ACO requires many parameters.
77. 5. CS algorithm
CS is a simple heuristic search algorithm reported to be more efficient than GA and PSO.
CS is inspired by the brood parasitism of some cuckoo species, combined with Lévy flight behavior.
Cuckoos lay their eggs in the nests of other birds, selecting recently spawned nests and removing existing eggs to increase the hatching probability of their own eggs.
If the host birds discover these eggs, they either throw them away or abandon the nest and build a new one.
CS is characterized by the following rules:
1. Each cuckoo lays one egg at a time and deposits it in a randomly selected nest.
2. The nests with the highest-quality eggs (solutions) are the best and carry over to the following generations.
3. The number of available host nests is fixed, and a host bird discovers a parasitic egg with a probability pa ∈ [0, 1].
78. 5. CS algorithm
To simulate cuckoo reproduction, each egg in a nest is a solution and each cuckoo egg is a new solution.
The aim is to supplant not-so-good solutions in the nests with newer and better solutions generated by Lévy flights:
yi(t+1) = yi(t) + α ⊕ Lévy(λ)
where yi(t+1) is the new solution, yi(t) is the current location, α is the step size, and Lévy(λ) is a random walk with step lengths drawn from the Lévy distribution.
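A minimal sketch of this update, assuming Mantegna's algorithm for drawing the Lévy step and a simple abandon-the-worst-nests rule for pa; the parameter values and sphere test function are illustrative, not from the slides.

```python
import math
import random

def levy_step(rng, lam=1.5):
    """One Lévy-distributed step via Mantegna's algorithm (an assumption here;
    the slides do not specify how Levy(lambda) is drawn)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return rng.gauss(0, sigma) / abs(rng.gauss(0, 1)) ** (1 / lam)

def cuckoo_search(fitness, dim=2, nests=15, iters=200, pa=0.25, alpha=0.01, seed=0):
    """Simplified CS: y(t+1) = y(t) + alpha * Levy(lambda) moves, greedy
    replacement of a random nest, and abandoning a fraction pa of the worst nests."""
    rng = random.Random(seed)
    N = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(nests)]
    best = min(N, key=fitness)[:]
    for _ in range(iters):
        for i in range(nests):
            cand = [y + alpha * levy_step(rng) for y in N[i]]  # Lévy flight from nest i
            j = rng.randrange(nests)
            if fitness(cand) < fitness(N[j]):                  # supplant a worse solution
                N[j] = cand
        N.sort(key=fitness)
        for i in range(int((1 - pa) * nests), nests):          # hosts discover parasitic
            N[i] = [rng.uniform(-5, 5) for _ in range(dim)]    # eggs: rebuild those nests
        if fitness(N[0]) < fitness(best):
            best = N[0][:]
    return best

sphere = lambda y: sum(x * x for x in y)
best = cuckoo_search(sphere)
```

The occasional very long Lévy steps give the global exploration that rule (3) on the next slides credits for CS's efficiency.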
80. 5. Modified Adaptive CS algorithm
CS is an effective global optimization algorithm with applications in many fields.
The Modified Adaptive Cuckoo Search (MACS) algorithm has been applied to the PMP.
MACS enhances the basic CS algorithm by combining parallel, incentive, information, and adaptive strategies.
81. 5. CS algorithm advantages
The CS algorithm has several advantages:
(1) It usually converges to the global optimum.
(2) It combines local and global search capabilities: local search takes about a quarter of the total search time and the rest goes to global search, which makes CS more efficient on the global scale.
(3) Lévy flights are used in its global search instead of standard random walks, so CS can explore the search space more efficiently.
(4) It is easy to implement compared with other metaheuristic searches, since it essentially depends on only a single parameter (pa).
Recently, Grey Wolf Optimization for motif finding (GWOMF) was applied to find motifs in DNA input sequences.
82. DNA Motif Databases
Synthetic DNA sequences: Simulated datasets were introduced as planted (l, d) motif challenges, in which a motif is planted in n sequences selected randomly from a set of N sequences, where n ≤ N. Online tools can automatically create planted (l, d) motif datasets.
Real DNA sequences: Several online databases of DNA motifs are listed on the next slide with a short description of each.
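A planted (l, d) dataset like the one described can be generated in a few lines. The defaults below (N = 10 sequences of length n = 600, l = 15, d = 4, exactly d mutations per instance) are illustrative choices for one common variant of the challenge.

```python
import random

def planted_motif_dataset(N=10, n=600, l=15, d=4, seed=0):
    """Plant a randomly chosen l-length motif, mutated in exactly d positions,
    once in each of N random background sequences of length n."""
    rng = random.Random(seed)
    bases = "ACGT"
    motif = "".join(rng.choice(bases) for _ in range(l))
    seqs, sites = [], []
    for _ in range(N):
        instance = list(motif)
        for p in rng.sample(range(l), d):      # mutate exactly d distinct positions
            instance[p] = rng.choice([b for b in bases if b != instance[p]])
        pos = rng.randrange(n - l + 1)         # plant site
        background = [rng.choice(bases) for _ in range(n)]
        background[pos:pos + l] = instance
        seqs.append("".join(background))
        sites.append(pos)
    return seqs, motif, sites

seqs, motif, sites = planted_motif_dataset()
```

Returning the true motif and plant sites lets a benchmark measure how accurately an algorithm recovers them.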
83. Real datasets of DNA motifs
Database | Link | Description
TRANSFAC | http://gene-regulation.com/pub/databases.html | Database of eukaryotic TFs, their genomic binding sites, and DNA-binding profiles
JASPAR | http://jaspar.genereg.net/ | A public dataset of motifs for multicellular eukaryotes
PROSITE & ELM | http://prosite.expasy.org/ and http://elm.eu.org | PROSITE includes documentation sections describing protein domains, families, and functional sites, in addition to related patterns and profiles to recognize them
YEASTRACT | http://www.yeastract.com/ | Contains predicted TFs for S. cerevisiae
SCPD | http://rulai.cshl.edu/SCPD/ | The promoter database of S. cerevisiae
RegulonDB | http://regulondb.ccg.unam.mx | Provides curated information on the transcriptional regulatory network of E. coli, containing both computational and experimental data on predicted objects
CisBP | http://cisbp.ccbr.utoronto.ca/ | Contains a list of >160,000 predicted TFs from >300 species
DBTBS | http://dbtbs.hgc.jp/ | Contains TFs for Bacillus subtilis
85. The good tool for motif discovery
From the various methods suggested for the motif discovery problem, a good motif discovery tool can be specified. It should have these features:
(1) It should support all models, i.e. OOPS, ZOOPS, and TCM.
(2) It should possess global search ability.
(3) It should optimize the scoring function.
(4) Parallel processing ability is a necessity.
(5) It should have optimized data structures.
(6) It should be able to detect both long and short motifs.
(7) It should discover multiple motifs at the same time, i.e. without removing a discovered motif to find the next.
(8) It should discover multiple motifs with variable lengths.
(9) It should be largely automatic, minimizing the number of parameters the user must supply.