This document describes a new method called BlockMSA for performing local multiple sequence alignment (MSA) of non-coding RNA sequences. BlockMSA uses a biclustering approach that simultaneously clusters sequences and identifies conserved subsequences within the clusters. The authors test BlockMSA on benchmark RNA datasets and two large biological datasets, finding it outperforms other MSA tools for larger problems with highly variable sequences. BlockMSA is able to scale to larger datasets while identifying functionally conserved regions missed by other methods.
The Influence of Age Assignments on the Performance of Immune AlgorithmsMario Pavone
How long a B cell remains, evolves and matures inside a population plays a crucial role on the capability for an immune algorithm to jump out from local optima, and find the global optimum. Assigning the right age to each clone (or offspring, in general) means to find the proper balancing between the exploration and exploitation. In this research work we present an experimental study conducted on an immune algorithm, based on the clonal selection principle, and performed on eleven different age assignments, with the main aim to verify if at least one, or two, of the top 4 in the previous efficiency ranking produced on the one-max problem, still appear among the top 4 in the new efficiency ranking obtained on a different complex problem. Thus, the NK landscape model has been considered as the test problem, which is a mathematical model formulated for the study of tunably rugged fitness landscape. From the many experiments performed is possible to assert that in the elitism variant of the immune algorithm, two of the best age assignments previously discovered, still continue to appear among the top 3 of the new rankings produced; whilst they become three in the no elitism version. Further, in the first variant none of the 4 top previous ones ranks ever in the first position, unlike on the no elitism variant, where the previous best one continues to appear in 1st position more than the others. Finally, this study confirms that the idea to assign the same age of the parent to the cloned B cell is not a good strategy since it continues to be as the worst also in the new
efficiency ranking.
A recurrent neural network with a self-organizing structure based on the dynamic analysis of a task is presented in this paper. The stability of the recurrent neural network is guaranteed by design. A dynamic analysis method to sequence the subsystems of the recurrent neural network according to the fitness between the subsystems and the target system is developed. The network is trained with the network's structure self-organized by dynamically activating subsystems of the network according to tasks. The experiments showed the proposed network is capable of activating appropriate subsystems to approximate different nonlinear dynamic systems regardless of the inputs. When the network was applied to the problem of simultaneously soft measuring the chemical oxygen demand (COD) and NH3-N in wastewater treatment process, it showed its ability of avoiding the coupling influence of the two parameters and thus achieved a more desirable outcome.
The Influence of Age Assignments on the Performance of Immune AlgorithmsMario Pavone
How long a B cell remains, evolves and matures inside a population plays a crucial role on the capability for an immune algorithm to jump out from local optima, and find the global optimum. Assigning the right age to each clone (or offspring, in general) means to find the proper balancing between the exploration and exploitation. In this research work we present an experimental study conducted on an immune algorithm, based on the clonal selection principle, and performed on eleven different age assignments, with the main aim to verify if at least one, or two, of the top 4 in the previous efficiency ranking produced on the one-max problem, still appear among the top 4 in the new efficiency ranking obtained on a different complex problem. Thus, the NK landscape model has been considered as the test problem, which is a mathematical model formulated for the study of tunably rugged fitness landscape. From the many experiments performed is possible to assert that in the elitism variant of the immune algorithm, two of the best age assignments previously discovered, still continue to appear among the top 3 of the new rankings produced; whilst they become three in the no elitism version. Further, in the first variant none of the 4 top previous ones ranks ever in the first position, unlike on the no elitism variant, where the previous best one continues to appear in 1st position more than the others. Finally, this study confirms that the idea to assign the same age of the parent to the cloned B cell is not a good strategy since it continues to be as the worst also in the new
efficiency ranking.
A recurrent neural network with a self-organizing structure based on the dynamic analysis of a task is presented in this paper. The stability of the recurrent neural network is guaranteed by design. A dynamic analysis method to sequence the subsystems of the recurrent neural network according to the fitness between the subsystems and the target system is developed. The network is trained with the network's structure self-organized by dynamically activating subsystems of the network according to tasks. The experiments showed the proposed network is capable of activating appropriate subsystems to approximate different nonlinear dynamic systems regardless of the inputs. When the network was applied to the problem of simultaneously soft measuring the chemical oxygen demand (COD) and NH3-N in wastewater treatment process, it showed its ability of avoiding the coupling influence of the two parameters and thus achieved a more desirable outcome.
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
Sequence Alignment Pairwise alignment:- Global Alignment and Local AlignmentTwo types of alignment Progressive Programs for multiple sequence alignment BLOSUM Point accepted mutation (PAM)PAM VS BLOSUM
Multiple Alignment Sequence using Clustal Omega/ Shumaila RiazShumailaRiaz6
Alignment of three or more biological sequences of Protein, DNA, RNA of similar length
Clustal Omega is tool for analyzing the Multiple sequence alignments of proteins
Understanding Protein Function on a Genome-scale through the Analysis of Molecular Networks
Cornell Medical School, Physiology, Biophysics and Systems Biology (PBSB) graduate program, 2009.01.26, 16:00-17:00; [I:CORNELL-PBSB] (Long networks talk, incl. the following topics: why networks w. amsci*, funnygene*, net. prediction intro, memint*, tse*, essen*, sandy*, metagenomics*, netpossel*, tyna*+ topnet*, & pubnet* . Fits easily into 60’ w. 10’ questions. PPT works on mac & PC and has many photos w. EXIF tag kwcornellpbsb .)
Date Given: 01/26/2009
Milbury C.A., Lee J.C., Cannone J.J., Gaffney P.M., and Gutell R.R. (2010).
Fragmentation of the Large Subunit Ribosomal RNA Gene in Oyster Mitochondrial Genomes.
BMC Genomics, 11(1):485.
Xia Z., Gardner D.P., Gutell R.R., and Ren P. (2010).
Coarse-Grained Model for Simulation of RNA Three-Dimensional Structures.
The Journal of Physical Chemistry B, 114(42):13497-13506.
Ozer S., Doshi K.J., Xu W., and Gutell R.R. (2011).
rCAD: A Novel Database Schema for the Comparative Analysis of RNA.
7th IEEE International Conference on e-Science, Stockholm, Sweden. December 5-8, 2011. pp 15-22.
Sequence alig Sequence Alignment Pairwise alignment:-naveed ul mushtaq
Sequence Alignment Pairwise alignment:- Global Alignment and Local AlignmentTwo types of alignment Progressive Programs for multiple sequence alignment BLOSUM Point accepted mutation (PAM)PAM VS BLOSUM
Multiple Alignment Sequence using Clustal Omega/ Shumaila RiazShumailaRiaz6
Alignment of three or more biological sequences of Protein, DNA, RNA of similar length
Clustal Omega is tool for analyzing the Multiple sequence alignments of proteins
Understanding Protein Function on a Genome-scale through the Analysis of Molecular Networks
Cornell Medical School, Physiology, Biophysics and Systems Biology (PBSB) graduate program, 2009.01.26, 16:00-17:00; [I:CORNELL-PBSB] (Long networks talk, incl. the following topics: why networks w. amsci*, funnygene*, net. prediction intro, memint*, tse*, essen*, sandy*, metagenomics*, netpossel*, tyna*+ topnet*, & pubnet* . Fits easily into 60’ w. 10’ questions. PPT works on mac & PC and has many photos w. EXIF tag kwcornellpbsb .)
Date Given: 01/26/2009
Milbury C.A., Lee J.C., Cannone J.J., Gaffney P.M., and Gutell R.R. (2010).
Fragmentation of the Large Subunit Ribosomal RNA Gene in Oyster Mitochondrial Genomes.
BMC Genomics, 11(1):485.
Xia Z., Gardner D.P., Gutell R.R., and Ren P. (2010).
Coarse-Grained Model for Simulation of RNA Three-Dimensional Structures.
The Journal of Physical Chemistry B, 114(42):13497-13506.
Ozer S., Doshi K.J., Xu W., and Gutell R.R. (2011).
rCAD: A Novel Database Schema for the Comparative Analysis of RNA.
7th IEEE International Conference on e-Science, Stockholm, Sweden. December 5-8, 2011. pp 15-22.
Jiang Y., Xu W., Thompson L.P., Gutell R., and Miranker D. (2011).
R-PASS: A Fast Structure-based RNA Sequence Alignment Algorithm.
Proceedings of 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2011), Atlanta, GA. November 12-15, 2011. IEEE Computer Society, Washington, DC, USA. pp. 618-622.
Xu W., Ozer S., and Gutell R.R. (2009).
Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA.
21st International Conference on Scientific and Statistical Database Management. June 2-4, 2009. Springer-Verlag. pp. 200-216.
Gardner D.P., Xu W., Miranker D.P., Ozer S., Cannone J.J., and Gutell R.R. (2012).
An Accurate Scalable Template-based Alignment Algorithm.
Proceedings of 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012), Philadelphia, PA. October 4-7, 2012. IEEE Computer Society, Washington, DC, USA. pp. 237-243.
SIMILARITY ANALYSIS OF DNA SEQUENCES BASED ON THE CHEMICAL PROPERTIES OF NUCL...cscpconf
The DNA sequences similarity analysis approaches have been based on the representation and the frequency of sequences components; however, the position inside sequence is important information for the sequence data. Whereas, insufficient information in sequences representations is important reason that causes poor similarity results. Based on three classifications of the DNA bases according to their chemical properties, the frequencies and average positions of group mutations have been grouped into two twelve-components vectors, the Euclidean distances among introduced vectors applied to compare the coding sequences of the first exon of beta globin gene of 11 species.
SIMILARITY ANALYSIS OF DNA SEQUENCES BASED ON THE CHEMICAL PROPERTIES OF NUCL...csandit
The DNA sequences similarity analysis approaches have been based on the representation and the frequency of sequences components; however, the position inside sequence is important information for the sequence data. Whereas, insufficient information in sequences
representations is important reason that causes poor similarity results. Based on three classifications of the DNA bases according to their chemical properties, the frequencies and
average positions of group mutations have been grouped into two twelve-components vectors,the Euclidean distances among introduced vectors applied to compare the coding sequences of the first exon of beta globin gene of 11 species.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications rangingfrom phylogenetic analyses to domain identification. There are several ways to perform multiple sequencealignment, an important way of which is the progressive alignment approach studied in this work.Progressive alignment involves three steps: find the distance between each pair of sequences; construct a guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance and more accurate in terms of overall cost minimization.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications ranging from phylogenetic analyses to domain identification. There are several ways to perform multiple sequence alignment, an important way of which is the progressive alignment approach studied in this work.Progressive alignment involves three steps: find the distance between each pair of sequences; construct a
guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our experimental results indicate that the Neighbor Join method is both more efficient in terms of performance
and more accurate in terms of overall cost minimization.
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...ijcseit
Multiple sequence alignment is increasingly important to bioinformatics, with several applications ranging
from phylogenetic analyses to domain identification. There are several ways to perform multiple sequence
alignment, an important way of which is the progressive alignment approach studied in this work.
Progressive alignment involves three steps: find the distance between each pair of sequences; construct a
guide tree based on the distance matrix; finally based on the guide tree align sequences using the concept
of aligned profiles. Our contribution is in comparing two main methods of guide tree construction in terms
of both efficiency and accuracy of the overall alignment: UPGMA and Neighbor Join methods. Our
experimental results indicate that the Neighbor Join method is both more efficient in terms of performance
and more accurate in terms of overall cost minimization.
A NEW SURVEY ON BICLUSTERING OF MICROARRAY DATAcscpconf
There are subsets of genes that have similar behavior under subsets of conditions, so we say
that they coexpress, but behave independently under other subsets of conditions. Discovering
such coexpressions can be helpful to uncover genomic knowledge such as gene networks or
gene interactions. That is why, it is of utmost importance to make a simultaneous clustering of
genes and conditions to identify clusters of genes that are coexpressed under clusters of
conditions. This type of clustering is called biclustering.
Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to
approximate this problem by finding suboptimal solutions. In this paper, we make a new survey
on biclustering of gene expression data, also called microarray data.
There are subsets of genes that have similar behavior under subsets of conditions, so we say
that they coexpress, but behave independently under other subsets of conditions. Discovering
such coexpressions can be helpful to uncover genomic knowledge such as gene networks or
gene interactions. That is why, it is of utmost importance to make a simultaneous clustering of
genes and conditions to identify clusters of genes that are coexpressed under clusters of
conditions. This type of clustering is called biclustering.
Biclustering is an NP-hard problem. Consequently, heuristic algorithms are typically used to
approximate this problem by finding suboptimal solutions. In this paper, we make a new survey
on biclustering of gene expression data, also called microarray data.
A Biological Sequence Compression Based on cross chromosomal similarities usi...CSCJournals
While modern hardware can provide vast amounts of inexpensive storage for biological databases, the compression of Biological sequences is still of paramount importance in order to facilitate fast search and retrieval operations through a reduction in disk traffic. This issue becomes even more important in light of the recent increase of very large data sets, such as meta genomes. The present Biological sequence compression algorithms work by finding similar repeated regions within the Biological sequence and then encode these repeated regions together for compression. The previous research on chromosome sequence similarity reveals that the length of similar repeated regions within one chromosome is about 4.5% of the total sequence length. The compression gain is often not high because of these short lengths of repeated regions. It is well recognized that similarities exist among different regions of chromosome sequences. This implies that similar repeated sequences are found among different regions of chromosome sequences. Here, we apply cross-chromosomal similarity for a Biological sequence compression. The length and location of similar repeated regions among the different Biological sequences are studied. It is found that the average percentage of similar subsequences found between two chromosome sequences is about 10% in which 8% comes from cross-chromosomal prediction and 2% from self-chromosomal prediction. The percentage of similar subsequences is about 18% in which only 1.2% comes from self-chromosomal prediction while the rest is from cross-chromosomal prediction among the different Biological sequences studied. This suggests the significance of cross-chromosomal similarities in addition to self-chromosomal similarities in the Biological sequence compression. An additional 23% of storage space could be reduced on average using self-chromosomal and cross-chromosomal predictions in compressing the different Biological sequences.
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSISijcseit
HMM has found its application in almost every field. Applying Hmm to biological sequences has its own
advantages. HMM’s being more systematic and specific, yield a result better than consensus techniques.
Profile HMMs use position specific scoring for the matching & substitution of a residue and for the
opening or extension of a gap. HMMs apply a statistical method to estimate the true frequency of a residue
at a given position in the alignment from its observed frequency while standard profiles use the observed
frequency itself to assign the score for that residue. This means that a profile HMM derived from only 10 to
20 aligned sequences can be of equivalent quality to a standard profile created from 40 to 50 aligned
sequences.
Deep Learning Meets Biology: How Does a Protein Helix Know Where to Start and...Melissa Moody
UVA Data Science Institute Master of Science in Data Science students Sean Mullane, Ruoyan Chen and Sri Vaishnavi Vemulapalli were motivated to apply data science tools and techniques to the problem, and see if protein structures can be quantitatively described, compared and otherwise analyzed in a more robust, efficient and automated manner. Potential applications include more effectively designed drugs to inhibit disease-related proteins, or even newly engineered ones.
The researchers received the award for Best Paper in the Data Science for Health category at the 2019 Systems & Information Design Symposium (SIEDS) meeting. Their project, "Machine Learning for Classification of Protein Helix Capping Motifs," focused on small segments of a protein called secondary structural elements. These structural elements are the basic molecular-scale building blocks that all proteins—and therefore life—build upon.
AMINO ACID INTERACTION NETWORK PREDICTION USING MULTI-OBJECTIVE OPTIMIZATIONcscpconf
Protein can be represented by amino acid interaction network. This network is a graph whose
vertices are the proteins amino acids and whose edges are the interactions between them. This
interaction network is the first step of proteins three-dimensional structure prediction. In this
paper we present a multi-objective evolutionary algorithm for interaction prediction and ant
colony probabilistic optimization algorithm is used to confirm the interaction.
Bacterial virulence proteins, which have been classified on structure of virulence, causes
several diseases. For instance, Adhesins play an important role in the host cells. They are
inserted DNA sequences for a variety of virulence properties. Several important methods
conducted for the prediction of bacterial virulence proteins for finding new drugs or vaccines.
In this study, we propose a method for feature selection about classification of bacterial
virulence protein. The features are constituted directly from the amino acid sequence of a given
protein. Amino acids form proteins, which are critical to life, and have many important
functions in living cells. They occurring with different physicochemical properties by a vector of
20 numerical values, and collected in AAIndex databases of known 544 indices.
For all that, this approach have two steps. Firstly, the amino acid sequence of a given protein
analysed with Lyapunov Exponents that they have a chaotic structure in accordance with the
chaos theory. After that, if the results show characterization over the complete distribution in
the phase space from the point of deterministic system, it means related protein will show a
chaotic structure.
Empirical results revealed that generated feature vectors give the best performance with chaotic
structure of physicochemical features of amino acids with Adhesins and non-Adhesins data sets.
The Chaotic Structure of Bacterial Virulence Protein Sequencescsandit
Bacterial virulence proteins, which have been classified on structure of virulence, causes
several diseases. For instance, Adhesins play an important role in the host cells. They are
inserted DNA sequences for a variety of virulence properties. Several important methods
conducted for the prediction of bacterial virulence proteins for finding new drugs or vaccines.
In this study, we propose a method for feature selection about classification of bacterial
virulence protein. The features are constituted directly from the amino acid sequence of a given
protein. Amino acids form proteins, which are critical to life, and have many important
functions in living cells. They occurring with different physicochemical properties by a vector of
20 numerical values, and collected in AAIndex databases of known 544 indices.
For all that, this approach have two steps. Firstly, the amino acid sequence of a given protein
analysed with Lyapunov Exponents that they have a chaotic structure in accordance with the
chaos theory. After that, if the results show characterization over the complete distribution in
the phase space from the point of deterministic system, it means related protein will show a
chaotic structure.
Empirical results revealed that generated feature vectors give the best performance with chaotic
structure of physicochemical features of amino acids with Adhesins and non-Adhesins data sets.
Gutell 123.app environ micro_2013_79_1803Robin Gutell
McFrederick Q.S., Cannone J.J., Gutell R.R., Kellner K., Plowes R.M., and Mueller U.G. (2013).
Specificity between Lactobacilli and Hymenopteran Hosts is the Exception Rather than the Rule.
Applied and Environmental Microbiology, 79:1803-1812.
Gutell R.R. (2013).
Comparative Analysis of the Higher-Order Structure of RNA.
in: Biophysics of RNA Folding. Volume editor: Rick Russell. Series title: Biophysics for the Life Sciences. Series editors: Norma Allewell, Ivan Rayment, Bertrand Garcia-Moreno, Jonathan Dinman, and Michael McCarthy. pp. 11-22. Publisher: Springer, New York, NY.
Lee J.C. and Gutell R.R. (2012).
A Comparison of the Crystal Structures of the Eukaryotic and Bacterial SSU Ribosomal RNAs Reveals Common Structural Features in the Hypervariable Regions.
PLoS ONE, 7(5):e38203.
Gardner D.P., Ren P., Ozer S., and Gutell R.R. (2011).
Statistical Potentials for Hairpin and Internal Loops Improve the Accuracy of the Predicted RNA Structure.
Journal of Molecular Biology, 413(2):473-483.2011. pp 15-22.
Xu W., Wongsa A., Lee J., Shang L., Cannone J.J., and Gutell R.R. (2011).
RNA2DMap: A Visual Exploration Tool of the Information in RNA's Higher-Order Structure.
Proceedings of 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2011), Atlanta, GA. November 12-15, 2011. IEEE Computer Society, Washington, DC, USA. pp. 613-617.
Muralidhara C., Gross A.M., Gutell R.R., and Alter O. (2011).
Tensor Decomposition Reveals Concurrent Evolutionary Convergences and Divergences and Correlations with Structural Motifs in Ribosomal RNA.
PLoS ONE, 6(4):e18768.
Mueller U.G., Ishak H., Lee J.C., Sen R., and Gutell R.R. (2010).
Placement of attine ant-associated Pseudonocardia in a global phylogeny (Pseudonocardiaceae, Actinomycetales): a test of two symbiont-association models.
Antonie van Leeuwenhoek International Journal of General and Molecular Microbiology, 98(2):195-212.
Theriot E.C., Cannone J.J., Gutell R.R., and Alverson A.J. (2009).
The limits of nuclear encoded SSU rDNA for resolving the diatom phylogeny.
European Journal of Phycology, 44(3):277-290.
Wu J.C., Gardner D.P., Ozer S., Gutell R.R. and Ren P. (2009).
Correlation of RNA Secondary Structure Statistics with Thermodynamic Stability and Applications to Folding.
Journal of Molecular Biology, 391(4):769-783.
Chen Y.P., Evans J.D., Murphy C., Gutell R., Zuker M., Gundersen-Rindal D., and Pettis J.S. (2009).
Morphological, Molecular, and Phylogenetic Characterization of Nosema cerenae, a Microsporidian Parasite Isolated from the European Honey Bee, Apis mellifera.
The Journal of Eukaryotic Microbiology, 56(2):142-147.
Maddison D.R., Moore W., Baker M.D., Ellis T.M., Ober K.A., Cannone J.J., and Gutell R.R. (2009).
Monophyly of terrestrial adephagan beetles as indicated by three nuclear genes (Coleoptera: Carabidae and Trachypachidae).
Zoologica Scripta, 38(1):43-62.
Chandramouli P., Topf M., Ménétret J.-F., Eswar N., Cannone J.J., Gutell R.R., Sali A., and Akey C.W. (2008).
Structure of the Mammalian 80S Ribosome at 8.7 Å Resolution.
Structure, 16(4):535-548.
Lee C.-Y., Lee J.C., and Gutell R.R. (2007).
Networks of interactions in the secondary and tertiary structure of ribosomal RNA.
Physica A, 386(1):564-572.
Gillespie J.J., Johnston J.S., Cannone J.J., and Gutell R.R. (2006).
Characteristics of the nuclear (18S, 5.8S, 28S and 5S) and mitochondrial (12S and 16S) rRNA genes of Apis mellifera (Insecta:Hymenoptera): structure, organization, and retrotransposable elements.
Insect Molecular Biology, 15(5):657-686.
Weinstock et al. (81 authors), Gillespie J.J., Cannone J.J., Gutell R.R., et al. (100 authors) (2006).
Insights into social insects from the genome of the honeybee Apis mellifera.
Nature, 443(7114):931-949.
Lee J.C., Gutell R.R., and Russell R. (2006).
The UAA/GAN internal loop motif: a new RNA structural element that forms a cross-strand AAA stack and long-range tertiary interactions.
Journal of Molecular Biology, 360(5):978-988.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35: Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 To discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
Gutell 102.bioinformatics.2007.23.3289
1. Vol. 23 no. 24 2007, pages 3289–3296
BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm485
Sequence analysis
Biclustering as a method for RNA local multiple
sequence alignment
Shu Wang1,
*, Robin R. Gutell2
and Daniel P. Miranker3
1
Department of Electrical and Computer Engineering, 2
School of Biological Sciences, Section of Integrative
Biology and 3
Department of Computer Science, University of Texas At Austin, Austin, TX 78712, USA
Received on April 21, 2007; revised on August 20, 2007; accepted on September 14, 2007
Advance Access publication October 6, 2007
Associate Editor: Keith Crandall
ABSTRACT
Motivations: Biclustering is a clustering method that simultaneously
clusters both the domain and range of a relation. A challenge in
multiple sequence alignment (MSA) is that the alignment of
sequences is often intended to reveal groups of conserved functional
subsequences. Simultaneously, the grouping of the sequences can
impact the alignment; precisely the kind of dual situation biclustering
is intended to address.
Results: We define a representation of the MSA problem enabling
the application of biclustering algorithms. We develop a computer
program for local MSA, BlockMSA, that combines biclustering with
divide-and-conquer. BlockMSA simultaneously finds groups of
similar sequences and locally aligns subsequences within them.
Further alignment is accomplished by dividing both the set of
sequences and their contents. The net result is both a multiple
sequence alignment and a hierarchical clustering of the sequences.
BlockMSA was tested on the subsets of the BRAliBase 2.1
benchmark suite that display high variability and on an extension
to that suite to larger problem sizes. Also, alignments were evaluated
of two large datasets of current biological interest, T box sequences
and Group IC1 Introns. The results were compared with alignments
computed by ClustalW, MAFFT, MUCLE and PROBCONS alignment
programs using Sum of Pairs (SPS) and Consensus Count.
Results for the benchmark suite are sensitive to problem size.
On problems of 15 or greater sequences, BlockMSA is consistently
the best. On none of the problems in the test suite are there
appreciable differences in scores among BlockMSA, MAFFT and
PROBCONS. On the T box sequences, BlockMSA does the most
faithful job of reproducing known annotations. MAFFT and
PROBCONS do not. On the Intron sequences, BlockMSA, MAFFT
and MUSCLE are comparable at identifying conserved regions.
Availability: BlockMSA is implemented in Java. Source code and
supplementary datasets are available at http://aug.csres.utexas.
edu/msa/
Contact: shuwang2006@gmail.com
Supplementary information: Supplementary data are available at
Bioinformatics online.
1 INTRODUCTION
Clustering, as it is commonly used in Bioinformatics, takes
input arranged in a data matrix, where the rows correspond to
data objects and the columns correspond to their features/
attributes. Biclustering clusters both rows and columns
simultaneously (Barkow et al., 2006; Dhillon, 2001; Madeira
and Oliveira, 2004). In biclustering, each object in a cluster is
selected only using a subset of features and each feature in a
cluster is selected only using a subset of objects. In this way,
it discovers local signals/coherences in a data matrix, and
derives local clusters within the data matrix.
A challenge in multiple sequence alignment (MSA) is that
sets of sequences usually contain unknown subsets representing
phylogenetic or functional groups (Notrdame, 2002). While
aligning such a set of sequences is intended to reveal such
groupings, the sequence alignment itself is the result of the
grouping; precisely the chicken-and-egg scenario biclustering is
intended to address.
In order to apply biclustering to the MSA problem, we define
the following mapping from MSA to biclustering (Fig. 1).
Specifically, we represent the MSA problem in a biclustering
matrix. Given a set of sequences, we first identify candidate
‘blocks’, possible local alignments of multiple subsequences.
We then use them to construct a biclustering matrix where each
row corresponds to a candidate block and each column
corresponds to a sequence. The value in the matrix is ‘1’ if
a block is on the sequence, ‘0’, otherwise. Biclustering is a
two-way clustering. Instead of clustering sequences over all
blocks, biclustering can cluster sequences with respect to
subsets of blocks and vice versa (Fig. 2). For each identified
bicluster matrix, its columns consist of a subset of sequences,
corresponding to a sequence group and its rows consist of a
subset of blocks, corresponding to the conserved features for
the sequence group (Fig. 3).
We have developed a program, BlockMSA, that exploits
biclustering to align non-coding RNA (ncRNA) datasets.
We recursively apply biclustering by excluding the aligned
blocks from further considerations and continue the MSA
in a divide-and-conquer fashion, one sub-problem for each
sequence group.
ncRNA genes produce some of the cell’s most important
products, including transfer RNA (tRNA) and ribosomal RNA*To whom correspondence should be addressed.
ß 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
2. (rRNA) (Cannone et al., 2002; Griffiths-Jones et al., 2005).
However, the role of ncRNA has been underestimated for a
long time. These RNAs are now being implicated in various
regulatory functions, in addition to their roles in protein
synthesis (Economist, 2007). As a result of this recent change,
the number of ncRNAs and our understanding of their
importance in cellular metabolism will increase dramatically
in the next few years.
Improving the speed, accuracy and scalability of MSA is
becoming an increasingly important task. Large-scale MSA is
especially needed for highly variable ncRNA families to reveal
their functionally conserved regions. Experience is showing
that the larger the set of sequences considered, the more
biological details reliably emerge. For example, comparative
analysis of multiple RNA sequence alignments has revealed
new types of base pairings (e.g. U:U and C:C instead of the
canonical A:U, G:C and G:U) and new structural motifs
(Gutell et al., 1994). Current global MSA methods, such as
ClustalW, are useful for closely related RNAs (Thompson
et al., 1994). They become less effective when the number of
sequences increases and the sequences are more variable.
Current local MSA methods such as Dialign 2 also become
less effective for large and variable sequence alignment
problems (Morgenstern, 1999).
We applied BlockMSA to highly variable ncRNA for local
MSA. We tested BlockMSA on three sources of problems
and compared its results with alignments computed by
ClustalW (Thompson et al., 1994), MAFFT (Katoh et al.,
2005), MUSCLE (Edgar, 2004) and PROBCONS (Do et al.,
2005). Since the intention is to support the analysis of large sets
of highly variable ncRNAs, first we determined the subsets
of BRAliBase 2.1 (Wilm et al., 2006) that display high
variability. Second, since the maximum problem size in
BRAliBase 2.1 is 15 sequences, we increased test set size and
formed problem instances as large as 80 sequences. Third, two
large datasets of current biological interest, from the
Comparative RNA Web (CRW) Site, are evaluated,
T box sequences (Grundy and Henkin, 2003) and Group
IC1 Introns (Cannone et al., 2002).
We found on the large-scale benchmark-based testing that
the alignment programs score in a consistent fashion.
BlocksMSA consistently scores best for larger problems,
15 sequences or greater. Otherwise, BlockMSAs performance
scores competitively with the best scores produced by the other
alignment programs.
Results on the two large biological tests demonstrate
that BlocksMSA is the most effective. This is only to be
expected as BlocksMSA was explicitly developed for large-scale
problems. What was unexpected is that the formal measure of
alignment quality, Sum-of-Pairs Score (SPS), does not always
bear correlation with biological effectiveness.
1.1 Review of current MSA
The combinatorial complexity of MSA makes optimal
solutions infeasible. Thus, popular MSA programs are heur-
istic. MSA programs can be grouped into sequence or block-
based methods.
The main sequence-based method is progressive alignment.
Progressive methods such as ClustalW first estimate a guide
Fig. 1. MSA to biclustering mapping. Given a set of sequences,
{S1, S2, . . ., S7}, we first identify the possible ‘blocks’, local multiple
alignments of subsequences. We then represent the MSA problem in
a biclustering matrix M, where each row corresponds to a block and
each column corresponds to a sequence. The value in the matrix is ‘1’
if a block is on the sequence, ‘0’, otherwise.
Fig. 3. Map biclustering results back to MSA. For each bicluster (U or
V), the columns (sequences) correspond to a sequence group and the
rows (blocks) represent the local alignments in the sequence group.
Bicluster V consists of the sequence group {S1, S2, S4, S6} and its local
alignments are {Block1, Block3 and Block5}. Bicluster U consists of the
sequences {S3, S5, S7} and its local alignments on them are {Block2,
Block4}.
Fig. 2. Biclustering Matrix M. To bicluster the matrix M, we applied
the biclustering program, BiMax, to it. BiMax rearranged the rows and
columns of Matrix M and identified two biclusters U and V. In this
example, a bicluster is defined as a sub-matrix with all its elements
equal to 1.
S.Wang et al.
3290
3. tree, and then follow the tree order to add the sequences into
the MSA (Thompson et al., 1994). Iterative and/or consistency
based methods can be used to improve the performance of
progressive methods (Robert and Batzoglou, 2006; Notredame,
2007). An iterative algorithm such as MAFFT, MUSCLE and
PROBCONS, refines the initial tree and alignment through a
series of iterations. Consistency-based methods such as
PROBCONS and T-Coffee (Notredame et al., 2000), incorpo-
rate pairwise alignment consistency into the scoring matrix for
alignment accuracy improvement. However, all the sequence-
based methods still have difficulty when sequences are highly
variable. This is because conserved regions may occur within
long unaligned regions. Such regions can often only be
identified by a broad and explicit search. Sequence-based
methods tend to make local decisions.
Block-based MSA methods directly look for the vertically
conserved regions among sequences. Block-based methods can
be further characterized as consistency or probability-based.
Dialign 2 is representative of consistency-based methods. It is a
greedy method that first identifies conserved regions (blocks)
shared by two sequences, and then adds blocks to construct the
final alignment based on their consistency weights. Dialign
2’s time complexity, O(n4
l2
) for n sequences of length l, prevents
it from scaling to large problems. Probability-based methods,
such as MEME (Bailey and Elkan, 1994) and Gibbs Sampling
(Lawrence et al., 1993), start from an initial random choice
of an alignment then recursively realign the sequences.
The process repeats until the system converges. Current
formulations require conserved regions to be present in all
the sequences. They need ‘prior’ assumptions on the number of
blocks, which is rarely known a priori. Like iterative refinement
methods, probabilistic methods tend to be local in nature.
Probabilistic methods are computation intensive making them
a difficult starting point for scalable algorithms.
2 SYSTEMS AND METHODS
2.1 Alignment test sets
Currently, there is no standard benchmark for RNA local MSA.
The BRAliBase 2.1 benchmark suite is not specifically designed for
local alignment testing. It has the datasets with percentage identity
ranging from 20% to 95% and the maximum number of sequences per
test set is 15, which is not enough for testing scalability and robustness.
Further, the ratio between the average sequence length and the total
number of columns in a reference alignment is usually high, which does
not represent cases with large insertion/deletions.
Based on the analysis above, we chose three types of test sets:
(1) The subsets of BRAliBase that are highly variable and suitable
for local MSA; (2) LocalExtR, an extension of BRAliBase 2.1,
comprising larger-scale test groups and patterned on BRAliBase 2.1;
(3) LSet, a pair of large-scale test sets representative of current
biological problems.
The subsets from BRAliBase are selected from the most variable
test sets within the suite. They are from the THI, Glycine riboswitch
and Yybp-Ykoy RNA families, and contain 232 test datasets.
LocalExtR uses the same seed alignments from Rfam that BRAliBase
uses and forms larger test groups. The BRAliBase 2.1 convention is
to label a test group ki, where i is the number of sequences for each
test set in the group. We created four new test groups, k20, k40, k60 and
k80, totaling 90 test sets. The new test sets maintain the percentage
identity to be 560%. To model large insertions/deletions, the ratio of
the average sequence length to the total number of columns in the
reference alignments is as small as 0.36 (Table 1)
LSet contains a set of 248 T box leader sequences and a set of
90 Group IC1 Introns from the CRW Site. T box leader sequences
are located upstream of many aminoacyl-tRNA synthetase (AARS),
amino acid biosynthesis and amino acid transporter genes in Gram-
positive bacteria. Group IC1 Introns represent a family of RNA
molecules with a specific higher-order structure and the ability
to catalyze their own excision by a common splicing mechanism
(Cech, 1990). In addition to containing many more sequences, both
the average sequence length and the differences in sequence lengths are
much larger than for the BRAliBase test sets (Table 2).
2.2 Scoring alignments
We used two independent yet complementary scores to evaluate
alignment quality, the SPS and Consensus Count.
SPS (Thompson et al., 1999) measures the level of sequence
consistency between a test and a reference alignment by comparing
all possible pairs per column between both alignments; it ranges from
0 (no agreement) to 1 (complete agreement).
Table 1. Measures of the BRAliBase and LocalExtR test groups
Test group Number
of
datasets
Sequence length,
average of each
value over ki set
Average
ratio:
avg-length
/reference
Avg
PI(%)
Avg. Min. Max.
k5 124 109 96 125 0.79 51.1
BRAliBase k7 62 110 94 131 0.75 49.8
2.1 Subsets k10 32 108 94 129 0.72 49.3
k15 14 110 88 137 0.67 49.3
k20 30 115 90 172 0.53 48.4
LocalExtR k40 25 114 87 180 0.47 48.5
k60 20 107 81 189 0.40 50.7
k80 15 106 77 204 0.36 54.8
ki indicates a test group containing i-sequence datasets. Note there are many test
datasets for each ki, (see Supplementary material). The table details, the number
of test datasets, the average sequence length and the average of the minimum and
maximum sequence length per test sets within a group. Similarly, the ratio of the
average sequence length to the reference alignment’s length and percentage
identity (PI).
Table 2. LSet
Test group Data-set
size
Sequence
length
Ratio
avg-length/
reference
PI
(%)
Avg. Min Max
T box 248 269 78 365 0.40 40.0
Group IC1
intron
90 563 347 1910 0.12 38.0
The table details the number of sequences in a test dataset, average, minimum and
maximum sequence length, the ratio of the average sequence length to the
reference alignment’s length and the percentage identity (PI).
Biclustering as a method for RNA local MSA
3291
4. The Consensus Count is a measurement for column conservation.
It can be computed without identifying a reference alignment.
Given a threshold, a consensus sequence represents each column
of an alignment with the majority character in that column. Specifically
the major character’s percentage in the column should be greater or
equal to the given threshold. The consensus sequence represents a
column with a gap if the major character’s percentage in that column is
less than the specified threshold. After we obtain the consensus
sequence in an alignment, the Consensus Count is the number of non-
gap residues within the consensus sequence. It measures the vertical
conservation in an alignment.
2.3 Biological validation
In order to test the biological effectiveness of the different alignments
we compared the output of the five programs (BlockMSA, ClustalW,
MAFFT, MUSCLE and PROBCONS) to each other and noted
their ability to correctly align conserved areas. The T box sequences
have been studied enough that we could make a quantifiable assessment
with respect to known conserved functional subsequences. First, all the
T box gene sequences have evolved from a common ancestor and
contain a conserved 14-nt sequence, 50
-AGGGUGGNACCGCG-30
(Grundy and Henkin, 2003). In addition, the T box dataset contains
sequences from 12 major gene groups. Each gene group shares a
common triplet sequence representing a specifier sequence codon for
the amino acid matching the amino acid class of the downsteam coding
region (Grundy and Henkin, 2003). For example, all tyrosyl genes
contain a UAC tyrosine codon, leucine genes contain a CUC leucine
codon. Identification of the specifier sequence in each gene group can
provide insights into amino acid specificity. Here, a major gene group in
the reference alignment is defined as having more than four sequences
and strongly showing a specifier sequence pattern. This is because if a
gene group does not have enough representative sequences, the
specifier sequence feature may not be apparent. Thus, consistent
with the biological goals of a local multiple RNA sequence alignment,
the T box test set allows us to check if a program correctly align the
T box motif and count the number of specifier sequences successfully
identified by a program.
The Intron dataset is not yet as well annotated as the T box dataset.
We simply note the program’s relative ability to identify conserved
regions.
3 ALGORITHM
The algorithm has three main steps:
(1) Identify candidate blocks.
(2) Represent MSA as a biclustering problem. Apply
biclustering to simultaneously cluster sequences into
groups and find the conserved regions within each group.
(3) For each sequence group, recursively apply the above
two steps, until the sequence group is small enough or no
more conserved regions are found within it.
We first provide key definitions and then expand on each of
the three steps.
Definition 1. A k-block (conserved region) represents an
un-gapped alignment region in k sequences, consisting of
equal-length fragments from each of k sequences, respectively,
B ¼ {f1, f2,. . .,fk}, where fi represents a fragment from sequence i.
Definition 2. A 2-block is a special case of k-block with k ¼ 2,
which represents an un-gapped alignment region in a pairwise
alignment.
Definition 3. Block Similarity Score. Given a k-block
B ¼ {f1, f2, . . . ,fk}, the score of B is the sum of all the
similarity scores of the ð
k
2
Þ pairs of fragments(2-blocks)
within B:
SðBÞ ¼
X
1 i5j k
Sð fi, fjÞ
where S( fi, fj) is the similarity score of fragments fi and fj.
Definition 4. Fragment Order. Given two fragments a and b
on a sequence, a ¼ (a1,. . .,ak1) and b ¼ (b1,. . .,bk2), where ai and
bi represent two residues from the same sequence, a is said to be
less than b (written as a5b) if and only if the ending position of
a is less than the beginning position of b.
Definition 5. Block Order. Given two k-blocks on k
sequences, F ¼ ( f1, f2, . . ., fk) and G ¼ (g1, g2, . . ., gk), where fi
and gi represent two fragments from the same sequence i, Block
F is less than Block G (written as F5G) if and only if i 82[1, k],
fi5gi.
Definition 6. Non-overlapping Blocks. Given two k-blocks
F and G on k sequences, they are non-overlapping if and only
if F is less than G or G is less than F.
Definition 7. Chain. A set of k-blocks B ¼ {b1, b2,. . .,bn} on
k sequences is called a chain if all the blocks in B are
non-overlapped. The score of a chain is the sum of the scores of
all its blocks minus the gap costs between them.
Scorechain Bð Þ ¼
X
1 i5n
S bið Þ À Gap bi,biþ1ð Þ
" #
þ S bnð Þ
Definition 8. An Optimal Set of Blocks. Given a set of blocks
B, an optimal set of blocks is the chain with maximum score
over all the chains of B.
3.1 Identify candidate blocks
Typically a given set of RNA sequences has multiple conserved
regions within it. Our goal here is to identify the set of possible
conserved regions. Our algorithm is heuristic, consisting
of three main steps: (1) dividing the sequence into overlapping
k-length fragments, (2) calculation of fragment similarity scores
and (3) construction of candidate blocks. Each of the above
steps is detailed as follows:
Step (1) Dividing sequences into fragments
For each sequence, it is divided into overlapping k-length
fragments.
Step (2) Calculation of fragments’ similarity scores
We first generate a library of pairwise alignments for all
possible sequence pairs. The pairwise alignments can be
obtained from any external sequence alignment program.
We used ClustalW to do the pairwise alignment.
For each pairwise alignment, a window of length k is
slid through it and each position of the window produces
a k-length sub-region. We only keep those un-gapped
sub-regions whose similarity scores are greater than a pre-
specified threshold. Thus each sub-region is actually a
2-block. Hence, for each 2-block, its un-gapped alignment
region provides its fragments’ similarity score. For the
fragments not appearing in the alignments, their similarity
score is set to 0.
S.Wang et al.
3292
5. Step (3) Construction of candidate blocks
In this step, we construct block candidates from the pool of
2-blocks. To do so, we use a greedy clustering-based method.
We can think of each fragment as a point and an n-block as an
n-point cluster. To determine an n-point cluster, we start the
initial cluster with a seed point (fragment) and gradually
add one fragment from each sequence to the cluster. Hence,
an n-point cluster will consist of n fragments from n sequences
and form an n-block. To add a fragment to the cluster, we pick
the fragment that has the maximum sum of similarity scores
with the fragments in the cluster. The reason behind this is
that if a fragment has a higher similarity score with a cluster,
it means it has more alignment consistency with the
fragments in the cluster and shows more vertical conservation
consistency. By using the clustering method to construct blocks,
we directly integrate consistency calculation into the block
construction process and do not need to exhaustively calculate
all the consistency possibilities. For seed fragments, we choose
them from two closest sequences.
3.2 Represent MSA as a biclustering problem
Our biclustering procedure has two main steps:
(1) Convert to Biclustering Matrix.
(2) Biclustering with respect to the matrix to identify
sub-groups of sequences and conserved regions among
them.
3.2.1 Convert to biclustering matrix The initial clustering
matrix is defined on matrix M. The rows of M represent the
set of block candidates B ¼ {B1, B2,. . .,Bn}, the columns of M
represent the set of sequences S ¼ {S1, S2,. . .,Sm} and each
element Mij of the matrix is set as ‘1’ if the block Bi covers the
sequence Sj. Otherwise Mij is set as ‘0’.
3.2.2 Special case in biclustering matrix Before we apply
biclustering to the matrix, we need to first consider a special
case: there may exist highly conserved regions (blocks) across
all the sequences.
In the biclustering matrix M, each row corresponds to a
block’s coverage on sequences. If a block spans all the
sequences, its values in the matrix are all-1s. So before applying
biclustering, we check if there are any all-1’s rows, note the
corresponding blocks as being conserved across all the
sequences. We refine the matrix M by excluding those
all-1’s rows.
3.2.3 Bi-clustering with respect to the matrix In this step, the
matrix is biclustered. Among many biclustering packages,
BiMax (Barkow et al., 2006; Prelic et al., 2006) is the most
suitable. BiMax does not need the users to pre-specify the
number of the biclusters. It uses a divide-and-conquer approach
to find the inclusion-maximal biclusters. Although it allows
overlapped biclusters, it provides a filtering function, which we
used to filter results into non-overlapping biclusters.
After applying biclustering, for each obtained bicluster
matrix, its columns consist of a subset of sequences, corre-
sponding to a sequence group and its rows consist of a subset of
blocks, corresponding to the conserved features of the sequence
group. The blocks for each sequence group can be further
refined via block assembly and post-processing.
3.2.4 Block assembly and post-processing After we obtain
the blocks for a sequence group, which may be too short, we
extend them to both sides until their similarity score falls below
a predefined threshold. We also merge two blocks if they are
within a relatively short distance. After extending and merging,
we can identify an optimal set of non-overlapping blocks.
The identification of an optimal set of blocks is actually a
classic chaining problem, and can be interpreted as a maximum
weight path problem in a directed acyclic graph (Gusfield,
1997). We have implemented a single source DAG shortest path
algorithm to find the optimal chain. The time complexity is
O(n2
) time for a given set of n blocks.
Sometimes a conserved region may not cover all sequences.
A block may miss one or two sequences. After we identify all
the well-conserved blocks across all the sequences, we can look
for those weakly conserved blocks, which may miss a few
fragments, and add these blocks back to the optimal chain.
3.3 Recursion
For each sequence group, BlockMSA recursively applies the
above two steps 3.1 and 3.2, until each sequence group is small
enough or no more conserved regions are found within it.
3.4 Time complexity
BlockMSA consists of three main procedures. The time for step
(1), the identification of block candidates takes O(n2
l2
) þ
O(nfsl), where n is the number of sequences, fs is the number of
seed fragments and l is the average sequence length in an
unaligned region. Here O(n2
l2
) is the time for pairwise
alignments, and O(nfsl) is the time for candidate block
construction. In step (2), constructing biclustering matrix
takes O(bn), where b is the number of block candidates. For
biclustering, we used BiMax, which takes O(bnMin(b,n)
10. ).
Step (3) is a recursion process, which recursively clusters
sequences and identifies the blocks in them. This step should
be much faster, since we decompose the problems into smaller
and smaller problems. In order to be time efficient, we do not
do pairwise alignments again during the recursive process.
Instead, we find the corresponding alignments in the original
pairwise alignment library and use them to derive fragment
similarity. So the worst running time of BlockMSA is
O(n2
l2
) þ O(n2
fsl ) þ O(bn2
Min(b,n)
11. ).
4 IMPLEMENTATION
The BlockMSA program is implemented in Java. The pairwise
alignments can be obtained from any sequence alignment
program. We used ClustalW to do the pairwise alignment.
BlockMSA has been tested on Linux, Unix and OS-X.
Biclustering as a method for RNA local MSA
3293
12. BlockMSA is available under open source licensing and is
distributed with documents and examples.
5 RESULTS
ClustalW, MAFFT(L-INS-i), MUSCLE and PROBCONS
were run using their default parameter settings. BlockMSA
was run using block size of 11 per the following.
5.1 Choice of BlocksMSA block size
We evaluated a choice of block size ranging from 3 to 15 on the
322 datasets from BRAliBase and LocaExtR. For each block
size, we calculate the mean of SPSs over all the test groups.
Block size of 11 gives the best result. (see Fig. 4 and the
Supplementary Material Table 2).
5.2 Evaluation
5.2.1 SPS scores The SPS score of each dataset is calculated
by using the compalign program (Eddy, 2007). See Figure 5 and
the Supplementary Material Table 3. BlockMSA has the
leading performance for the benchmark suite containing
15 sequences or larger, i.e. k15 through k80 of BRAliBase
and LocalExtR. BlockMSA demonstrates a trend of increas-
ingly better performance with larger problem size. This trend is
contradicted for the T box test set but not the Intron test set.
For the three smallest test sets, k5, k7 and k10, MAFFT has the
best scores, followed by PROBCONS and then BlockMSA.
5.2.2 Consensus count Consensus Counts were calculated
using a 90% consensus threshold. See Figure 6 and
Supplementary Material Table 4. BlockMSA displays the
highest Consensus Count for k15 and ties with MAFFT for
the highest consensus count for the benchmark sets larger
than 15 sequences. On the smaller problems, PROBCONS has
the leading performance. On all the ki test groups, except k5,
all three of these programs attain Consensus Counts no
different than 2 from each other, on counts that range from
19 to 25.
BlockMSA also displays the highest Consensus Count for the
two biological datasets of LSet. On the T box data, BlockMSA
achieves a Consensus Count of 27. Both PROBCONS and
MAFFT fall behind, scoring 20 and 19. For the Intron dataset,
BlockMSA and MAFFT tie for the highest score, 55, followed
by MUSCLE, 42, and PROBCONS, 33.
5.3 Biological validation
BlockMSA demonstrates the best results on the T box dataset.
BlockMSA identifies the T box motif as a conserved region.
It identifies 10 of 12 specifier sequences in the dataset.
MUSCLE and ClustalW also identify the T box motif,
and 9 and 6 specifier sequences, respectively. MAFFT and
PROBCONS, which score well using SPS and Consensus
Count does not identify the T box motif. Their ability to
identify the specifier sequences, 5 and 0, respectively, is
consistent with their results on the T box motif (Table 3).
The entire set of Intron sequences is not yet annotated, thus
we cannot quantify the results as we did for the T box.
Disappointingly, none of these five programs, separately or
together, produce a sufficiently palpable alignment for us to
promptly annotate the sequences.
We computed consensus sequences and their Consensus
Counts for each of the five alignments with four consensus
Fig. 6. Consensus Count comparison. BlockMSA, ClustalW, MAFFT,
MUSCLE and PROBCONS are run on BRAliBase, LocalExtR and
LSet. For each of the test group, we calculate their average Consensus
Count over all the datasets of the test group with the 90% consensus
threshold. BlocMSA has leading performance as the number of
sequences increases.
Fig. 4. Block size comparison. We test the block size from 3 to 15 on
the test groups from BRAliBase and LocaExtR. For each block size,
we calculate the mean of SPSs over all the test groups. Block size of 11
gave the best result.
Fig. 5. SPS comparison. BlockMSA, ClustalW, MAFFT, MUSCLE
and PROBCONS are run on BRAliBase, LocalExtR and LSet.
In BRAliBase and LocalExtR, BlockMSA has the leading performance
as the input set size increases to 15. For Intron dataset, BlockMSA also
leads.
S.Wang et al.
3294
13. thresholds (Supplementary Table 6). By Consensus Count,
MAFFT and BlockMSA continue to be very close in
performance for consensus thresholds of 80% or greater.
At lower consensus threshold, 70%, MAFFT and MUSCLE
hold an edge.
We manually aligned the 20 consensus sequences. From an
inspection of that alignment, we are able to make the following
qualitative assessment. The five alignment programs largely
agreed that the group IC1 Introns contains eight conserved
regions. At comparable Consensus Counts, independent of
the consensus threshold needed to achieve that count, the five
programs largely agreed on the location and contents of the
conserved region. Thus, qualitatively, we conclude that
BlockMSA, MAFFT and MUSCLE were better at aligning
the Intron sequences than PROBCONS and ClustalW,
a conclusion that correlates with Consensus Count.
6 DISCUSSION
This article is the first effort to represent a MSA problem as
a biclustering problem. We incorporate biclustering into a
divide-and-conquer approach for local MSA, where conserved
blocks of subsequences are identified and further alignment
is accomplished by creating sub-problems where the sub-
problems constitute subsequences of subsets of the sequences.
The net result is both a multiple sequence alignment and a
hierarchical clustering of the sequences.
The algorithm was tested on ncRNA datasets. ncRNAs
pose special problems compared to other sequences. Unlike
gene sequences, where single nucleotide polymorphism is often
significant, a nucleotide substitution in a helix of a ncRNA is
easily compensated by a substitution in the corresponding
base pair. Consequently, large homologous sets of ncRNAs
often display a high of variability in both length and content.
In our use of biclustering, intuitively, we first perform a
large-scale search for sets of short, local, multiple alignments,
(blocks). Many of these blocks overlap in sequences and
suggest different sequence groupings. Rather than simply
marking conflicting blocks as mutually exclusive and maximiz-
ing the number of consistent blocks, biclustering seeks to
maximize the total consistency. Conflicting aspects of the block
definitions are simply removed. Blocks, once chosen are not
further subdivided. The approach finds and then maintains
local areas of agreement. With respect to quantitative measures
of MSA, BlockMSA scores comparable to or better than
the other leading MSA programs. A contrast surfaces between
BlockMSA and the other leading MSA programs in
the alignment of large sets of T box and Intron sequences.
The other leading MSA programs lag BlockMSA in their
ability to identify the most highly conserved regions, and
therefore functionally and structurally important parts of the
ncRNA sequences. Our conjecture is that PROBCONS and
likely MAFFT, which are iterative optimization methods, are
introducing gaps to improve SPS score which then results in the
break up and misalignment of contiguous conserved regions.
It is plausible that this strategy is effective for proteins and
coding genes but not ncRNAs. In highly variable ncRNAs,
critical (conserved) structures are often flanked by less
conserved sequences. These flanks, incorrectly, may provide
iterative methods with flexibility to improve score. Without a
penalty for breaking up contiguous conserved regions, local
optimizations may insert gaps in those contiguous areas.
In proteins the larger alphabet size may be enough to limit
this effect. In coding genes, misaligning reading frames can also
be expected to come at a large price.
Our alignment results depend on the performance of the
biclustering program. Currently, biclustering is still an active
research area (Barkow et al., 2006; Cheng and Church, 2000).
We expect as the biclustering techniques improve and mature,
the performance of BlockMSA will improve.
ACKNOWLEDGEMENTS
The authors acknowledge support from the National Science
Foundation, DBI-0241180, IIS-0325116 and the National
Institutes of Health, GM067317. We thank Tina Henkin for
sharing her T box alignment.
Conflict of Interest: none declared.
REFERENCES
Bailey,T.L. and Elkan,C. (1994) Fitting a mixture model by expectation
maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell.
Syst. Mol. Biol., 2, 28–36.
Barkow,S. et al. (2006) BicAT: a biclustering analysis toolbox. Bioinformatics, 22,
1282–1283.
Cannone,J.J. et al. (2002) The comparative RNA web (CRW) site: an online
database of comparative sequence and structure information for ribosomal,
intron, and other RNAs. BMC Bioinformatics, 3, 2.
Cech,T.R. (1990) Self-splicing of group I introns. Annu. Rev. Biochem., 59,
543–568.
Cheng,Y. and Church,G.M. (2000) Biclustering of expression data.
Proceedings of the 8th International Conference on Intelligent Systems for
Molecular Biology. AAAI Press, Menlo Park. pp. 93–103.
Dhillon,I.S (2001) Co-clustering documents and words using bipartite
spectral graph partitioning. Proceedings of the Seventh ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining.
ACM Press, San Francisco, CA. pp. 269–274.
Do,C.B. et al. (2005) PROBCONS: probabilistic consistency-based multiple
sequence alignment. Genome Res., 15, 330–340. http://PROBCONS.stanford.
edu/download.html
Economist (2007) Really new advances. http://www.economist.com/science/
displaystory.cfm?story_id¼9333471
Eddy,S. (2007) SQUID1.9g-C function library for sequence analysis. http://
selab.wustl.edu/cgi-bin/selab.pl?mode=software#squid
Table 3. T box motif and specifier sequence identification
MSA program Identified
T box motif ?
Number of identified
specifier sequences
BlockMSA Yes 10
ClustalW Yes 6
MAFFT No 5
MUSCLE Yes 9
PROBCONS No 0
A T box motif is defined as identified if a 60% consensus of an alignment
produces the motif in its entirety and no gap is inserted within the motif to break
its contiguousness. Specifier sequence, (maximum of 12), is defined as identified if
all three nucleotides in the codon are present by simple majority consensus.
Biclustering as a method for RNA local MSA
3295
14. Edgar,R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy
and high throughput. Nucleic Acids Res., 32, 1792–97. MUSCLE Version 3.6.
http://www.drive5.com/MUSCLE/download3.6.html
Griffiths-Jones,S. et al. (2005) Rfam: annotating non-coding RNAs in complete
genomes. Nucleic Acids Res., 33, 121–124.
Grundy,F.J. and Henkin,T.M (2003) The T box and S box transcription
termination control systems. Front. Biosci., 8, 20–31.
Gusfield,D. (1997) Algorithms on Strings, Trees, and Sequences. Cambridge
University Press, Cambridge, UK.
Gutell,R.R. et al. (2002) The accuracy of ribosomal RNA comparative structure
models. Curr. Opin. Struct. Biol., 12, 301–310.
Gutell,R.R. et al. (1994) Lessons from an evolving ribosomal RNA: 16S and 23S
rRNA structure from a comparative perspective. Microbiol. Rev., 58, 10–26.
Katoh,K. et al. (2005) MAFFT version 5: improvement in accuracy of multiple
sequence alignment. Nucleic Acids Res., 33, 511–518.
Lawrence,C.E. et al. (1993) Detecting subtle sequence signals: a Gibbs sampling
strategy for multiple alignment. Science, 262, 208–214.
Madeira,S.C. and Oliveira,A.L. (2004) Biclustering algorithms for
biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol.
Bioinform., 1, 24–25.
Morgenstern,B. (1999) DIALIGN 2: improvement of the segment-to-segment
approach to multiple sequence alignment. Bioinformatics, 15, 211–218.
Notredame,C. et al. (2000) T-Coffee: a novel method for fast and accurate
multiple sequence alignment. J. Mol. Biol., 302, 205–217.
Notredame,C. (2002) Recent progress in multiple sequence alignment: a survey.
Pharmacogenomics, 3, 131–144.
Notredame,C. (2007) Recent evolutions of multiple sequence alignment
algorithms. PLoS Comput. Biol., 3, 1405–1408.
Prelic,A. et al. (2006) A systematic comparison and evaluation of biclustering
methods for gene expression data. Bioinformatics, 22, 1122–1129.
Robert,C.G. and Batzoglou,S. (2006) Multiple sequence alignment. Curr. opin.
Struct. Biol., 16, 368–373.
Thompson,J.D. et al. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting posi-
tion-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22,
4673–4690.
Thompson,J.D. et al. (1999) BAliBASE: a benchmark alignment database
for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.
Wilm,A. et al. (2006) An enhanced RNA alignment benchmark for sequence
alignment programs. Algorithms Mol. Biol., 1, 19.
S.Wang et al.
3296