The document discusses database searching algorithms like FASTA and BLAST. It explains that FASTA uses heuristics to search for exact word matches and join high-scoring regions, while BLAST uses heuristics to compile a neighborhood of high-scoring words and then search for these words in the database to find local alignments faster than dynamic programming. It also discusses parameters that influence the speed and sensitivity of the searches.
This document discusses FASTA and BLAST algorithms for database searching to find similar sequences to a query. It explains that FASTA uses a "hit and extend" method to search for short identical matches, while BLAST searches for words above a threshold score rather than exact matches. BLAST is generally faster than FASTA and Smith-Waterman as it uses heuristics. The document provides details on how BLAST works including compiling a word list, searching the database for hits, and extending hits into alignments.
Query Distributed RDF Graphs: The Effects of Partitioning Paper DBOnto
Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and efficient query processing. Existing data partitioning schemes are commonly based on hashing or graph partitioning techniques. The latter techniques split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate efficient query answering, considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and we show that our query answering scheme can efficiently answer many queries.
- Dynamic programming is used to find the optimal alignment between two protein sequences by recursively computing sub-alignments and storing them in a lookup table.
- The example shows calculating the alignment score between a zinc-finger core sequence and a viral sequence fragment by filling a table and tracking the cumulative scores.
- Filling the table from left to right and top to bottom allows reconstructing the highest scoring alignment between the two sequences.
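The table-filling idea in the bullets above can be sketched as follows. This is a minimal illustration of global alignment scoring in the Needleman-Wunsch style, not the example from the summarized slides: the function name and the match, mismatch, and gap values are assumptions chosen for readability, and the sketch returns only the final score rather than reconstructing the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Fill a dynamic programming table and return the best global score.

    Scoring values are illustrative assumptions, not taken from the source.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]

    # First column and first row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap

    # Fill top to bottom, left to right; each cell reuses three stored sub-results.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap
            left = table[i][j - 1] + gap
            table[i][j] = max(diagonal, up, left)

    return table[-1][-1]
```

Keeping the whole table, rather than only the last row, is what would allow a traceback from the bottom-right cell to recover the alignment.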
The document discusses bioinformatics and computational biology. It describes a lab with over 100 people from diverse backgrounds, including engineers, scientists, technicians, geneticists and clinicians. The lab applies information technology to analyze biological data, focusing on areas like sequence analysis, molecular modeling, phylogeny, medical applications, statistics and more. Specific applications mentioned include analyzing genomes to study genetic diseases and drug design, as well as using the same techniques in agriculture and animal health.
The document describes the BLAST algorithm for comparing biological sequences. BLAST stands for Basic Local Alignment Search Tool. It allows for fast comparison of a query sequence against large databases. BLAST uses heuristics to find locally similar regions between sequences and scores alignments based on identities without considering gaps. This rapid approximation allows BLAST to be applied to search large databases on common computers, providing a significant improvement over previous algorithms. The document outlines the methods used in BLAST, including compiling high-scoring words from the query, scanning the database for hits, and extending hits to determine significant alignments. It also discusses evaluating the statistical significance of results and how parameters like word length and score thresholds can impact BLAST's speed and accuracy.
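The first two BLAST stages named above (compiling high-scoring words, then scanning for hits) can be sketched roughly as below. This is a toy under stated assumptions: it uses a DNA alphabet with a simple match/mismatch score instead of a substitution matrix, the function names and default values are hypothetical, and the extension stage is omitted.

```python
from itertools import product


def word_score(u, v, match=2, mismatch=-1):
    """Score two equal-length words position by position (toy scoring, an assumption)."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))


def neighborhood(query, w, threshold, alphabet="ACGT"):
    """Map every word scoring >= threshold against some query word to its query positions."""
    table = {}
    for pos in range(len(query) - w + 1):
        query_word = query[pos:pos + w]
        for candidate in map("".join, product(alphabet, repeat=w)):
            if word_score(query_word, candidate) >= threshold:
                table.setdefault(candidate, []).append(pos)
    return table


def find_hits(query, subject, w=3, threshold=3):
    """Scan the subject once and report (query_pos, subject_pos) pairs for listed words."""
    table = neighborhood(query, w, threshold)
    hits = []
    for spos in range(len(subject) - w + 1):
        for qpos in table.get(subject[spos:spos + w], []):
            hits.append((qpos, spos))
    return hits
```

Raising `threshold` shrinks the word list, which makes the scan faster but less sensitive; lowering it does the opposite. This is the speed and sensitivity trade-off the summary attributes to the word length and score threshold parameters.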
Postgres vs Elasticsearch while enriching data - Vlad Somov | Ruby Meditaiton... Ruby Meditation
Postgres and Elasticsearch can both be used for enriching and searching unstructured data. Postgres performance improved significantly with the addition of multicolumn indexes and a GIN index with trgm_ops on text fields. Elasticsearch was generally faster for searches but Postgres optimization closed the gap. The document also compared index types like B-tree, GIN, GIST and the inner workings of analyzers and indexing in Elasticsearch.
Presentation for blast algorithm bio-informaticezahid6
Presentation for BLAST algorithm
Publisher Md.Zahid Hasan
Bio-informatics blast is the use of computational tools for the process of acquisition, visualization, analysis and distribution of these datasets obtained by imaging modalities.
This document discusses establishing a dry lab called ADAM that provides bioinformatics solutions using a combination of existing platforms. It will use a WordPress communication platform, a Mantis ticketing platform, and a Galaxy workflow platform to enable knowledge sharing and manage projects. The goal is to empower researchers and clinicians with molecular profiling and personal genomics tools through reusable pipelines and a version control system. Next steps include getting feedback, deploying the selected platforms, and running use cases to evaluate performance.
This document discusses various methods for predicting genes and analyzing unknown DNA sequences, including:
- Using profiles, patterns, and hidden Markov models (HMMs) to find conserved sequences and predict protein function
- Ontologies like Gene Ontology that organize genes and gene products in a structured network to facilitate annotation and analysis
- Computational tools like Genefinder and Glimmer that use signals like coding potential, open reading frames, start/stop codons, and sequence similarity to known genes to predict gene structures in sequences
- Integrating multiple lines of evidence, such as HMMs, EST alignments, repeats, and CpG islands, to improve gene prediction over a single method
The document discusses a lab for bioinformatics and computational genomics at Ghent University. The lab has 10 "genome hackers" who are mostly engineers and 42 scientists, technicians, geneticists and clinicians. The lab focuses on bioinformatics, epigenetics, personal genomics and 3D printing. Bioinformatics is defined as the application of information technology to biological information, facilitated by computers. The document then discusses various topics related to genetics, genomics and personalized medicine.
This document provides an introduction and overview of the Perl programming language. It discusses what Perl is, its history and origins, its uses for bioinformatics tasks, examples of bioinformatics problems suited to Perl, and includes examples of basic Perl programs and concepts like variables, operators, conditionals and loops.
This document provides an overview of hidden Markov models (HMMs) and their application to gene prediction. It discusses how HMMs can model insertions and deletions in sequence alignments through their graphical representation using states and transitions. The document also explains how HMMs assign probabilities to sequences based on allowed state emissions and transitions. HMMs allow for more flexible modeling of gapped alignments than profiles or patterns alone.
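How an HMM assigns a probability to a sequence from its emissions and transitions can be sketched with the forward algorithm. The two-state model below ("coding" and "noncoding") and all of its probabilities are invented for illustration only; they do not come from the summarized document.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under the model, summed over all state paths."""
    if not observations:
        return 1.0
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: every number here is an assumption.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

p_gc = forward_probability("GC", STATES, START, TRANS, EMIT)
```

Insert and delete states in a profile HMM work the same way: they are additional states with their own transition probabilities, which is what lets the model score gapped alignments.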
This document discusses protein structure and bioinformatics. It begins by explaining the rationale for understanding protein structure and function, including determining protein sequences, structures, and relating this to function. It then covers levels of protein structure from primary to quaternary, methods for determining protein structures like X-ray crystallography, and uses of protein modeling and databases. The document provides examples of protein domains, folds, and membrane protein topology. It emphasizes that sequence determines conformation and that structure implies function.
2014 03 28_next_generation_epigenetic_profling_v_les_epigenetica_vweb Prof. Wim Van Criekinge
This document discusses next generation epigenetic profiling and biomarkers. It provides an overview of epigenetics and methylation in oncology. MDxHealth is developing next-generation epigenetic biomarkers including methylation-based companion diagnostics. Deep sequencing techniques like MBD-Seq are being used to discover biomarkers from genome-wide methylation profiling and provide high resolution of methylation heterogeneity. Validation studies of biomarkers like MGMT promoter methylation are also discussed.
This document provides an overview of GitHub as a code hosting platform and introduces some basic Python concepts including control structures, lists, dictionaries, regular expressions, and BioPython. It demonstrates how to install packages like Biopython, parse sequence data from files or online sources, and extract information from sequence records like identifiers, lengths, and subsequences.
The document discusses database searching algorithms like FASTA and BLAST. It explains the mathematical concepts behind BLAST, such as using Erdős-Rényi theory to model random sequence alignments and calculate the expected length of the longest random match. It also describes the Karlin-Altschul equation used in BLAST to calculate the statistical significance of matches as the expected number of alignments (E) based on the size of the search space and alignment score. The document provides details on parameters and scoring approaches used in database searching algorithms.
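The Karlin-Altschul relationship is usually written as E = K·m·n·e^(-λS), where m and n are the query and database lengths and S is the raw alignment score. A minimal sketch follows; K and λ depend on the scoring system, so the values used in the example call are placeholders and not real BLAST parameters.

```python
import math


def expected_hits(raw_score, query_len, db_len, k, lam):
    """E-value: expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, k, lam):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Placeholder parameters (assumptions): k=0.1, lam=0.3.
e_value = expected_hits(raw_score=50, query_len=300, db_len=1_000_000, k=0.1, lam=0.3)
```

Because E scales linearly with the search space m·n and falls exponentially with the score, doubling the database size doubles E, while a modest increase in score reduces it sharply.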
The document discusses algorithms for database searching and sequence alignment. It introduces BLAST and FASTA, two widely used algorithms for database searching. BLAST works by finding short words in sequences that score above a threshold and then extending any alignments found. FASTA uses a "hit and extend" heuristic to find locally similar regions. The document then discusses the statistical models that BLAST uses to calculate expected values and rank matching sequences by significance. It describes how BLAST models alignments as coin tosses to apply the Erdős-Rényi theorem and derive the Karlin-Altschul equation for calculating expected values.
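The coin-toss model can be made concrete with a short sketch. The Erdős-Rényi result says that in n tosses of a coin with heads probability p, the longest run of heads is approximately log base 1/p of n. When this is applied to two sequences, a common simplification (assumed here) is to treat each aligned pair as a toss, with p the chance of a match and n replaced by the product of the sequence lengths. The function names are illustrative.

```python
import math


def expected_longest_run(trials, p):
    """Approximate length of the longest run of successes in `trials` tosses."""
    return math.log(trials) / math.log(1 / p)


def longest_run(tosses, symbol="H"):
    """Observed longest run of `symbol` in a toss record such as 'HHTHHHT'."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best


# Two random DNA sequences of length 1000 with match probability 0.25 (an assumption):
longest_random_match = expected_longest_run(1000 * 1000, 0.25)
```

The logarithmic growth is the key point: a much larger search space only adds a few letters to the longest match expected by chance.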
The document discusses sequence alignment and database searching techniques. It provides code examples and explanations for the Needleman-Wunsch algorithm, dynamic programming, FASTA, and BLAST. For BLAST, it explains the key steps of breaking sequences into words, searching the database for word matches, and extending matches to find high-scoring pairs and hits between query and database sequences. Parameters like word threshold, word size, extension cutoff, and E-value are also discussed for customizing BLAST searches.
The document provides information about various bioinformatics lessons that will take place on Thursdays, including topics like biological databases, sequence alignments, database searching using FASTA and BLAST, phylogenetics, and protein structure. It also includes details about database searching methods like dynamic programming, FASTA, BLAST, and parameters that can be adjusted for BLAST searches.
This document discusses the BLAST algorithm for comparing biological sequences. It explains that BLAST allows rapid sequence comparison of a query sequence against a database. BLAST is fast, accurate, and accessible online. The document then describes the four main components of a BLAST search: choosing the query sequence, BLAST program, database, and optional parameters. It provides details on how to interpret BLAST search results, including the expect value, and how BLAST works by compiling word pairs from the query and database in three phases of searching and alignment.
The document discusses various sequence comparison techniques including pairwise alignment, local alignment, global alignment, and multiple alignment. It describes heuristic methods like FASTA and BLAST that are faster than dynamic programming but may miss optimal alignments. It provides details on the FASTA algorithm including finding identical words, re-scoring, joining segments, and dynamic programming. It also explains the BLAST algorithm and steps including query preprocessing, database scanning, and hit extension. Specialized BLAST databases and tools are also listed.
This document summarizes key concepts in sequence alignment including:
1) Sequence alignment involves finding the linear correspondence between symbols in one sequence to another that maximizes similarity. Dynamic programming is commonly used to compute optimal alignments.
2) BLAST is an extremely fast database search tool that uses heuristics like word matching to find local alignments and statistical analysis to assess significance.
3) Multiple sequence alignments make conserved features more apparent but are more difficult to compute than pairwise alignments. Progressive alignment gradually merges pairwise alignments based on a phylogenetic tree.
The document discusses BLAST (Basic Local Alignment Search Tool), an algorithm used to compare a query DNA or protein sequence against a database of sequences. BLAST works by identifying exact or approximate matches between words of 3-11 letters in the query and database sequences. Matches are extended to find local alignments with high scores. Significant alignments are identified based on their score and the expected number of matches by chance (E-value). The document provides examples of how BLAST finds local alignments and calculates E-values. It also describes different BLAST programs and suggestions for using BLAST.
BLAST and FASTA are algorithms for searching sequence databases to find local alignments between a query sequence and database sequences, with BLAST providing faster searches and improved statistical analysis compared to FASTA. Both algorithms work by first identifying short exact matches between sequences and then extending these matches to identify longer regions of similarity. The algorithms model DNA and protein sequence alignments as coin tosses to determine the expected length of the longest matching region between random sequences.
FASTA is a bioinformatic tool for fast sequence searching that allows for comparison of a query sequence against a database to find similar sequences. It uses heuristic methods like focusing on diagonal areas of alignment matrices to achieve high sensitivity searches at high speed. The FASTA algorithm works by first finding regions of identity, rescoring matches using substitution matrices, joining matching segments while eliminating low scoring segments, and constructing an optimal alignment using dynamic programming.
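The first FASTA stage, finding regions of identity along diagonals, can be sketched as a lookup-table scan. This is a simplified illustration with hypothetical names; the rescoring, joining, and dynamic programming stages described above are not shown.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count identical k-tuple matches on each diagonal (subject_pos - query_pos)."""
    # Index every k-tuple of the query by its start positions.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    # One pass over the subject; each shared k-tuple votes for its diagonal.
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], []):
            diagonals[j - i] += 1
    return diagonals


best_diagonal = diagonal_hits("ABCD", "XXABCD").most_common(1)
```

Diagonals with many hits mark candidate regions of similarity, so later and more expensive stages can be restricted to a narrow band around them instead of the full alignment matrix.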
This document provides an overview of the BLAST algorithm used for comparing biological sequences and identifying sequence similarities. It describes how BLAST works by generating words from a query sequence and searching a database for exact matches. Significant matches are extended locally to identify Maximal Segment Pairs (MSPs) based on scoring. MSPs are evaluated and ranked using E-values, which estimate the statistical significance of matches. The document also discusses different BLAST programs and provides examples of running BLAST searches and interpreting results. Homework assignments are included applying BLAST to specific sequence analysis tasks.
This document discusses biological database searching. It begins by defining biological databases as organized collections of persistent biological data that can be queried and retrieved. Examples are provided such as GenBank and SwissProt. Database searching involves using a query sequence to find related sequences in the database based on homology. Parameters like E-values and bit scores are used to define the significance of matches. Popular search algorithms like BLAST and FASTA first use heuristics to find high-scoring regions before applying dynamic programming for alignment.
The document discusses sequence similarity searching and comparison. It describes how programs like BLAST and FASTA are used to rapidly identify similarities between sequences and determine evolutionary relationships. BLAST and FASTA utilize word matching and heuristics to efficiently search large databases and return local or global alignments with scoring of matches. They provide a powerful method for function prediction by comparing new sequences to known genes and proteins.
The document provides an overview of computational methods for sequence alignment. It discusses different types of sequence alignment including global and local alignment. It also describes various methods for sequence alignment, such as dot matrix analysis, dynamic programming algorithms (e.g. Needleman-Wunsch, Smith-Waterman), and word/k-tuple methods. Scoring matrices like PAM and BLOSUM that are used for sequence alignments are also explained.
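The difference between global and local alignment shows up in a small change to the dynamic programming recurrence: a local alignment in the Smith-Waterman style never lets a cell drop below zero and takes the best cell anywhere in the table. The sketch below uses a plain match/mismatch score in place of a PAM or BLOSUM matrix; that substitution and the parameter values are assumptions made to keep the example short.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b."""
    cols = len(b) + 1
    previous = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * cols
        for j in range(1, cols):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            # The floor at zero lets an alignment restart anywhere.
            current[j] = max(0, diagonal, up, left)
            best = max(best, current[j])
        previous = current
    return best
```

Only two rows are kept because this version reports the score alone; recovering the aligned region would require storing the full table for a traceback.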
Heuristic design of experiments w meta gradient searchGreg Makowski
Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?
* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines both accuracy and generalization to rank models
* How to strategically search over the model training parameters - use a gradient descent approach
* One way to describe an arbitrarily complex predictive system is by using sensitivity analysis
This document provides an overview of pairwise sequence alignment and BLAST. It discusses how pairwise alignment works using substitution matrices to assign homology between sites. It demonstrates the dynamic programming approach to pairwise alignment calculation and describes how local alignments are identified. The document also introduces BLAST and how it uses word matching to rapidly identify similar sequences in a database and then performs local alignments on matching regions.
This document provides an overview of the BLAST algorithm for calculating sequence similarity and its various outputs. BLAST (Basic Local Alignment Search Tool) finds local alignments between biological sequences and produces outputs in various formats including pairwise alignments, hit tables, and structured outputs like ASN.1 and XML. It works by creating a lookup table of words from the query and scanning the database for matches to initiate gapped or ungapped extensions. The document discusses strategies for improving BLAST throughput, such as using MegaBLAST for very similar sequences, increasing word size thresholds, and concatenating queries to minimize database scanning.
This document provides an overview of the bioinformatics tools BLAST and FASTA for database searching. It discusses how BLAST uses heuristics like rapid word lookup methods to skip comparing most database entries, making it extremely fast compared to dynamic programming methods. FASTA also uses heuristics like identifying short identical matches to focus alignments, but BLAST is now one order of magnitude faster than FASTA and two orders of magnitude faster than Smith-Waterman. The document provides details on how each tool works and their advantages and limitations.
A review of two alignment-free methods for sequence comparison. In this presentation two alignment-free methods are studied:
- "Similarity analysis of DNA sequences based on LZ complexity and dynamic programming algorithm" by Guo et al.
- "Alignment-free comparison of genome sequences by a new numerical characterization" by Huang et al.
5. DataBase Searching
Dynamic Programming
Reloaded
Database Searching
Fasta
Blast
Statistics
Practical Guide
Extensions
PSI-Blast
PHI-Blast
Local Blast
BLAT
6. Needleman-Wunsch-edu.pl
The Score Matrix
----------------
Seq1(j)            1    2    3    4    5    6    7
Seq2          *    C    K    H    V    F    C    R
(i)   *       0   -1   -2   -3   -4   -5   -6   -7
 1    C      -1    1a   0   -1   -2   -3   -4   -5
 2    K      -2    0c   2b   1    0   -1   -2   -3
 3    K      -3   -1    1    1    0   -1   -2   -3
 4    C      -4   -2    0    0    0   -1    0   -1
 5    F      -5   -3   -1   -1   -1    1    0   -1
 6    C      -6   -4   -2   -2   -2    0    2    1
 7    K      -7   -5   -3   -3   -3   -1    1    1
 8    C      -8   -6   -4   -4   -4   -2    0    0
 9    V      -9   -7   -5   -5   -3   -3   -1   -1
A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH
if (substr(seq1,j-1,1) eq substr(seq2,i-1,1))
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
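The recurrence A/B/C can be sketched in Python as follows. This is a minimal illustration, not the original Needleman-Wunsch-edu.pl: the scoring values (match +1, mismatch -1, gap -1) are assumptions inferred from the numbers in the matrix above, and the function name is hypothetical.

```python
def needleman_wunsch_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment score matrix; rows follow seq2, columns seq1."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    matrix = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs only gaps.
    for j in range(1, cols):
        matrix[0][j] = matrix[0][j - 1] + gap
    for i in range(1, rows):
        matrix[i][0] = matrix[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq1[j - 1] == seq2[i - 1] else mismatch
            diagonal = matrix[i - 1][j - 1] + pair  # A
            up_score = matrix[i - 1][j] + gap       # B
            left_score = matrix[i][j - 1] + gap     # C
            matrix[i][j] = max(diagonal, up_score, left_score)
    return matrix


matrix = needleman_wunsch_matrix("CKHVFCR", "CKKCFCKCV")
for row in matrix:
    print(" ".join(f"{value:3d}" for value in row))
```

Under these assumed scores the printed rows agree with the matrix on the slide; the bottom-right cell is the global alignment score.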
7. • The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of
pairwise alignment methods.
• The principle is that a multiple alignment
is achieved by successive application
of pairwise methods.
– First do all pairwise alignments (not just one
sequence with all others)
– Then combine pairwise alignments to generate
overall alignment
Multiple Alignment Method
8. • Consider the task of searching
SWISSPROT against a query
sequence:
– say our query sequence is 362
amino acids long
– SWISSPROT release 38 contains
29,085,265 amino acids
– finding local alignments via
dynamic programming would
entail O(10^10) matrix operations
• Given size of databases, more
efficient methods needed
Database Searching
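The size of the dynamic programming problem follows directly from the two lengths quoted on the slide. A quick check in Python (variable names are illustrative):

```python
query_length = 362              # amino acids in the query
database_residues = 29_085_265  # amino acids in SWISSPROT release 38

# A full dynamic programming search fills one cell per (query, database) residue pair.
cells = query_length * database_residues
print(f"{cells:,} matrix cells, roughly {cells:.1e}")
```

The product is about 1.05 x 10^10, which is where the order-of-magnitude figure comes from.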
9. Heuristic approaches to DP for database searching
FASTA (Pearson 1995)
• Uses heuristics to avoid calculating the full dynamic programming matrix
• Speeds up searches by an order of magnitude compared to full Smith-Waterman
• The statistical side of FASTA is still stronger than that of BLAST
BLAST (Altschul 1990, 1997)
• Uses rapid word lookup methods to completely skip most of the database entries
• Extremely fast
– One order of magnitude faster than FASTA
– Two orders of magnitude faster than Smith-Waterman
• Almost as sensitive as FASTA
10. « Hit and extend heuristic»
• Problem: Too many calculations
“wasted” by comparing regions
that have nothing in common
• Initial insight: Regions that are
similar between two sequences
are likely to share short
stretches that are identical
• Basic method: Look for similar
regions only near short
stretches that match exactly
FASTA
11. FASTA-Stages
1. Find k-tups in the two sequences (k=1,2 for
proteins, 4-6 for DNA sequences)
2. Score and select top 10 scoring “local diagonals”
3. Rescan top 10 regions, score with PAM250
(proteins) or DNA scoring matrix. Trim off the
ends of the regions to achieve highest scores.
4. Try to join regions with gapped alignments. Join
if similarity score is one standard deviation above
average expected score
5. After finding the best initial region, FASTA
performs a global alignment of a 32 residue wide
region centered on the best initial region, and
uses the score as the optimized score.
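Stages 1 and 2 can be sketched as a word lookup followed by a tally per diagonal. This is a simplified illustration only: it counts exact k-tup matches on each diagonal, whereas FASTA's own diagonal scoring also accounts for the spacing between matches. The function names and the example sequences are assumptions.

```python
from collections import Counter, defaultdict


def ktup_index(seq, k):
    """Map every k-tup in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def diagonal_hits(query, subject, k):
    """Count exact k-tup matches per diagonal (subject position - query position)."""
    index = ktup_index(query, k)
    diagonals = Counter()
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], ()):
            diagonals[s_pos - q_pos] += 1
    return diagonals


hits = diagonal_hits("CKHVFCR", "AACKHVFCRAA", k=2)
top_diagonals = hits.most_common(10)  # stage 2 keeps the ten best diagonals
print(top_diagonals)
```

Only the diagonals returned here would be rescanned with a scoring matrix in stage 3, which is what lets FASTA skip most of the dynamic programming matrix.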
14. • Sensitivity: the ability of a
program to identify weak but
biologically significant sequence
similarity.
• Selectivity: the ability of a
program to discriminate between
true matches and matches
occurring by chance alone.
– A decrease in selectivity results in
more false positives being reported.
FastA
15. FastA (http://www.ebi.ac.uk/fasta33/)
• Blosum50 is the default matrix. Use a lower PAM or higher Blosum matrix to detect close sequences, and a higher PAM or lower Blosum matrix to detect distant sequences.
• Gap opening penalty: -12 (proteins) and -16 (DNA) by default
• Gap extension penalty: -2 (proteins) and -4 (DNA) by default
• The larger the word length, the less sensitive but faster the search will be
• The maximum number of scores and alignments is 100
16. FastA Output
• Database code, hyperlinked to the SRS database at EBI
• Accession number
• Description
• Length
• Initn, init1, opt and z-score, calculated during the run
• E score (expectation value): how many hits are expected to be found by chance with such a score while comparing this query to this database. E() does not represent the % similarity.
17. FastA is a family of programs
FastA, TFastA, FastX, FastY
Query: DNA Protein
Database:DNA Protein
18. FASTA can miss significant
similarity since
– For proteins, similar sequences do
not have to share identical residues
• Asp-Lys-Val is quite similar to
• Glu-Arg-Ile yet it is missed even with
ktuple size of 1 since no amino acid
matches
• Gly-Asp-Gly-Lys-Gly is quite similar
to Gly-Glu-Gly-Arg-Gly but there is
no match with ktuple size of 2
FASTA problems
19. FASTA can miss significant
similarity since
– For nucleic acids, due to codon
“wobble”, DNA sequences may
look like XXyXXyXXy where X’s
are conserved and y’s are not
• GGuUCuACgAAg and
GGcUCcACaAAa both code for
the same peptide sequence (Gly-Ser-Thr-Lys)
but they don’t match with
ktuple size of 3 or higher
FASTA problems
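The examples on the two "FASTA problems" slides can be checked by listing the words two sequences have in common. A minimal sketch, with hypothetical function names and the peptides written in one-letter code:

```python
def kmers(seq, k):
    """Return the set of all words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """Words of length k present in both sequences (what a k-tup search can seed on)."""
    return kmers(a, k) & kmers(b, k)


# Asp-Lys-Val vs Glu-Arg-Ile: no identical residue, even with ktuple size 1
protein_k1 = shared_kmers("DKV", "ERI", 1)
# Gly-Asp-Gly-Lys-Gly vs Gly-Glu-Gly-Arg-Gly: no identical pair with ktuple size 2
protein_k2 = shared_kmers("GDGKG", "GEGRG", 2)
# Codon wobble: same peptide, different third bases
rna_k3 = shared_kmers("GGUUCUACGAAG", "GGCUCCACAAAA", 3)
rna_k2 = sorted(shared_kmers("GGUUCUACGAAG", "GGCUCCACAAAA", 2))

print(protein_k1, protein_k2, rna_k3, rna_k2)
```

The first three results are empty sets, so an exact-match seed of that size never fires; only with ktuple size 2 do the two RNA sequences share words.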
20. DataBase Searching
Dynamic Programming
Reloaded
Database Searching
Fasta
Blast
Statistics
Practical Guide
Extensions
PSI-Blast
PHI-Blast
Local Blast
Blast
22. What does BLAST do?
• Search a large target set of sequences...
• …for hits to a query sequence...
• …and return the alignments and scores from those
hits...
• Do it fast.
Show me those sequences that deserve a second look.
Blast programs were designed for fast database
searching, with minimal sacrifice of sensitivity to
distantly related sequences.
23. The big red button
Do My Job
It is dangerous to hide too much of the
underlying complexity from the scientists.
24. • Approach: find segment pairs
by first finding word pairs that
score above a threshold, i.e.,
find word pairs of fixed length w
with a score of at least T
• Key concept “Neighborhood”:
Seems similar to FASTA, but
we are searching for words
which score above T rather than
that match exactly
• Calculate neighborhood (T) for
substrings of query (size w)
Overview
25. Overview
Compile a list of words which give a score
above T when paired with the query sequence.
– Example using PAM-120 for query sequence ACDE
(w=4, T=17):
A C D E
A C D E = +3 +9 +5 +5 = 22
• try all possibilities:
A A A A = +3 -3 0 0 = 0 no good
A A A C = +3 -3 0 -7 = -7 no good
• ...too slow, try directed change
26. Overview
A C D E
A C D E = +3 +9 +5 +5 = 22
• change 1st pos. to all acceptable substitutions
g C D E = +1 +9 +5 +5 = 20 ok
n C D E = +0 +9 +5 +5 = 19 ok
I C D E = -1 +9 +5 +5 = 18 ok
k C D E = -2 +9 +5 +5 = 17 ok
• change 2nd pos.: can't - all alternatives negative
and the other three positions only add up to 13
• change 3rd pos. in combination with first position
gCnE = 1 9 2 5 = 17 ok
• continue - use recursion
• For "best" values of w and T there are typically
about 50 words in the list for every residue in the
query sequence
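The "directed change" with recursion can be sketched as a branch-and-bound search: a partial word is abandoned as soon as even the best possible scores for the remaining positions cannot reach T. The sketch below is an illustration under stated assumptions: it uses only the PAM-120 values quoted on these slides, and every pair not quoted falls back to a hypothetical DEFAULT of -8, so the resulting list is smaller than the true PAM-120 neighborhood.

```python
# Scores quoted on the slides (query residue, candidate residue).
SLIDE_SCORES = {
    ("A", "A"): 3, ("A", "G"): 1, ("A", "N"): 0, ("A", "I"): -1, ("A", "K"): -2,
    ("C", "C"): 9,
    ("D", "D"): 5, ("D", "N"): 2,
    ("E", "E"): 5,
}
DEFAULT = -8  # assumption: stand-in for every pair the slides do not list
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def score(query_residue, candidate_residue):
    return SLIDE_SCORES.get((query_residue, candidate_residue), DEFAULT)


def neighborhood(word, threshold):
    """Return {neighbor: score} for all words scoring at least threshold against word."""
    # best_rest[pos] = highest score still obtainable from positions pos..end
    best_rest = [0] * (len(word) + 1)
    for pos in range(len(word) - 1, -1, -1):
        best_rest[pos] = best_rest[pos + 1] + max(score(word[pos], s) for s in ALPHABET)

    found = {}

    def extend(pos, prefix, total):
        if total + best_rest[pos] < threshold:
            return  # prune: this prefix can no longer reach the threshold
        if pos == len(word):
            found[prefix] = total
            return
        for s in ALPHABET:
            extend(pos + 1, prefix + s, total + score(word[pos], s))

    extend(0, "", 0)
    return found


neighbors = neighborhood("ACDE", 17)
for candidate, value in sorted(neighbors.items(), key=lambda item: (-item[1], item[0])):
    print(candidate, value)
```

With these values the second position is fixed to C immediately, because any substitution there leaves at most 13 from the other three positions, which matches the reasoning on the slide.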
27. Neighborhood.pl
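# Calculate neighborhood: score every three-letter word over the alphabet @A
# against the query word @W with the scoring matrix %S, and keep the words
# scoring at least $T (@A, @W, %S and $T are assumed to be set up earlier)
my %NH;
for (my $i = 0; $i < @A; $i++) {
    my $s1 = $S{$W[0]}{$A[$i]};
    for (my $j = 0; $j < @A; $j++) {
        my $s2 = $S{$W[1]}{$A[$j]};
        for (my $k = 0; $k < @A; $k++) {
            my $s3 = $S{$W[2]}{$A[$k]};
            my $score = $s1 + $s2 + $s3;
            my $word = "$A[$i]$A[$j]$A[$k]";
            next if $word =~ /[BZX*]/;    # skip ambiguity codes and stop symbol
            $NH{$word} = $score if $score >= $T;
        }
    }
}
# Output neighborhood, highest score first, ties in alphabetical order
foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
    print "$word $NH{$word}\n";
}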
30. [Figure: plot of Score against Length of extension, with the labels S, "Trim to max" and "indexed"]
* Two non-overlapping HSPs on a diagonal within distance A
31. [Figure: same as slide 30]
32. The BLAST algorithm
• Break the search sequence into words
– W = 3 for proteins, W = 12 for DNA
– MCGPFILGTYC gives MCG, CGP, GPF, PFI, FIL, ILG, LGT, GTY, TYC
• Include in the search all words that score
above a certain value (T) for any search word
– MCG: MCG, MCT, MCN, …
– CGP: CGP, MGP, CTP, …
• This list can be computed in linear time
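Breaking the query into overlapping words is a single pass over the sequence, which is why the list is cheap to build. A minimal sketch (the function name is illustrative):

```python
def query_words(seq, w):
    """Return every overlapping word of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


words = query_words("MCGPFILGTYC", 3)
print(words)
```

A sequence of length n yields n - w + 1 words, here 11 - 3 + 1 = 9; each of them would then be expanded into its neighborhood of words scoring above T.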
33. The Blast Algorithm (2)
• Search for the words in the database
– Word locations can be precomputed and indexed
– Searching for a short string in a long string
• HSP (High-scoring Segment Pair) = A match between
a query word and the database
• Find a “hit”: Two non-overlapping HSPs on a
diagonal within distance A
• Extend the hit until the score falls below a
threshold value, S
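The search, two-hit and extension steps can be sketched as follows. Several simplifications are assumed here and are not taken from the slides: only exact query words are looked up (no neighborhood), scoring is +1 for a match and -1 for a mismatch, and the extension stops once the running score has dropped a fixed amount below the best score seen, after which it is trimmed back to that best point. All function names are hypothetical.

```python
from collections import defaultdict


def index_words(subject, w):
    """Precompute the start positions of every word of length w in the subject."""
    index = defaultdict(list)
    for pos in range(len(subject) - w + 1):
        index[subject[pos:pos + w]].append(pos)
    return index


def two_hit_diagonals(query, subject, w, max_distance):
    """Diagonals carrying two non-overlapping word hits at most max_distance apart."""
    index = index_words(subject, w)
    by_diagonal = defaultdict(list)
    for q_pos in range(len(query) - w + 1):
        for s_pos in index.get(query[q_pos:q_pos + w], ()):
            by_diagonal[s_pos - q_pos].append(q_pos)
    seeds = {}
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first in positions:
            for second in positions:
                if w <= second - first <= max_distance:
                    seeds.setdefault(diagonal, (first, second))
    return seeds


def extend_right(query, subject, q_pos, s_pos, drop, match=1, mismatch=-1):
    """Gap-free extension to the right; returns (best score, length at best score)."""
    running = best = best_length = length = 0
    while q_pos + length < len(query) and s_pos + length < len(subject):
        same = query[q_pos + length] == subject[s_pos + length]
        running += match if same else mismatch
        length += 1
        if running > best:
            best, best_length = running, length
        elif best - running >= drop:
            break  # score fell too far below the maximum: stop and trim to max
    return best, best_length


query, subject = "MCGPFILGTYC", "AAMCGPFILGAAA"
seeds = two_hit_diagonals(query, subject, w=3, max_distance=40)
extension = extend_right(query, subject, q_pos=0, s_pos=2, drop=2)
print(seeds, extension)
```

Requiring two hits on the same diagonal before extending means most isolated word matches are never extended at all, which is where much of the speed comes from.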
35. BLAST parameters
• Lowering the neighborhood word threshold (T)
allows more distantly related sequences to be found,
at the expense of increased noise in the results set.
• Choosing a value for w
– small w: many matches to expand
– big w: many words to be generated
– w=4 is a good compromise
• Lowering the segment extension cutoff (S) returns
longer extensions for each hit.
• Changing the minimum E-value changes the
threshold for reporting a hit.
36. Critical parameters: T, W and scoring matrix
• The proper value of T depends on both the
values in the scoring matrix and the balance
between speed and sensitivity
• Higher values of T progressively remove
more word hits and reduce the search space.
• Word size (W) of 1 will produce more hits
than a word size of 10. In general, if T is
scaled uniformly with W, smaller word
sizes increase sensitivity and decrease
speed.
• The interplay between W, T and the scoring
matrix is critical and choosing them wisely
is the most effective way of controlling the
speed and sensitivity of BLAST
37. DataBase Searching
Dynamic Programming
Reloaded
Database Searching
Fasta
Blast
Statistics
Practical Guide
Extensions
PSI-Blast
PHI-Blast
Local Blast
BLAT
38. Database Searching
• How can we find a particular short sequence
in a database of sequences (or one HUGE
sequence)?
• Problem is identical to local sequence
alignment, but on a much larger scale.
• We must also have some idea of the
significance of a database hit.
– Databases always return some kind of hit, how
much attention should be paid to the result?
• How can we determine how “unusual” a
particular alignment score is?
39. Sentence 1:
“These algorithms are trying to find the best way to match up
two sequences”
Sentence 2:
“This does not mean that they will find anything profound”
ALIGNMENT:
THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
:: :.. . .. ...: : ::::.. :: . : ...
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------
12 exact matches
14 conservative substitutions
Is this a good alignment?
Significance
40. • A key to the utility of BLAST is
the ability to calculate expected
probabilities of occurrence of
Maximal Segment Pairs
(MSPs) given w and T
• This allows BLAST to rank
matching sequences in order of
“significance” and to cut off
listings at a user-specified
probability
Overview
41. Mathematical Basis of BLAST
• Model matches as a sequence of coin tosses
• Let p be the probability of a “head”
– For a “fair” coin, p = 0.5
• (Erdős-Rényi) If there are n throws, then the
expected length R of the longest run of heads is
$R = \log_{1/p}(n)$.
• Example: Suppose n = 20 for a “fair” coin
$R = \log_2(20) \approx 4.32$
• Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
42. Mathematical Basis of BLAST
• To model random sequence alignments, replace a
match with a “head” and mismatch with a “tail”.
AATCAT
ATTCAG
HTHHHT
• For DNA, the probability of a “head” is 1/4
– What is it for amino acid sequences?
43. Mathematical Basis of BLAST
• So, for one particular alignment, the Erdős-Rényi
property can be applied
• What about for all possible alignments?
– Consider that the sequences are being shifted back and forth
against each other, as in a dot matrix plot
• The expected length of the longest match is
$R = \log_{1/p}(mn)$
where m and n are the lengths of the two sequences.
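The coin-toss model can be sketched in a few lines. The function names are hypothetical; the code only restates the slides: matches become heads, mismatches become tails, and the expected longest run is the logarithm to base 1/p of the search space.

```python
import math

def to_coin_tosses(seq_a, seq_b):
    """Map an ungapped alignment to heads (match) and tails (mismatch)."""
    return "".join("H" if a == b else "T" for a, b in zip(seq_a, seq_b))

def longest_run(tosses, symbol="H"):
    """Length of the longest uninterrupted run of `symbol`."""
    best = current = 0
    for toss in tosses:
        current = current + 1 if toss == symbol else 0
        best = max(best, current)
    return best

def expected_longest_match(m, n, p):
    """R = log_{1/p}(m * n); use m = 1 for a single series of n tosses."""
    return math.log(m * n) / math.log(1.0 / p)

if __name__ == "__main__":
    tosses = to_coin_tosses("AATCAT", "ATTCAG")
    print(tosses, longest_run(tosses))
    print(round(expected_longest_match(1, 20, 0.5), 2))        # fair coin, 20 throws
    print(round(expected_longest_match(1000, 1000, 0.25), 2))  # two 1000-base DNA sequences
```

For two random DNA sequences of 1000 bases each, the model predicts a longest chance match of about 10 bases, so exact matches of that length are not surprising on their own.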
45. Karlin-Altschul Statistics
$E = k m n e^{-\lambda S}$
This equation states that the number of alignments
expected by chance (E) during the sequence
database search is a function of the size of the
search space (m*n), the normalized score (λS)
and a minor constant (k, mostly 0.1).
The E-value grows linearly with the product of target and
query sizes. Doubling the target set size and doubling
the query length have the same effect on the E-value.
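A minimal numerical sketch of this equation follows. The default k = 0.1 comes from the slide; the value of λ is a placeholder assumption, since the real value depends on the scoring system.

```python
import math

def expected_hits(score, m, n, k=0.1, lam=0.3):
    """E = k * m * n * exp(-lambda * S); lam=0.3 is an illustrative placeholder."""
    return k * m * n * math.exp(-lam * score)

if __name__ == "__main__":
    base = expected_hits(score=60, m=300, n=1_000_000)
    print(base)
    # Doubling the query length or the target size doubles E.
    print(expected_hits(60, 600, 1_000_000) / base)
    print(expected_hits(60, 300, 2_000_000) / base)
```

Because the score enters through an exponential, a modest increase in S lowers E far more than any realistic change in m or n raises it.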
47. Scoring alignments
• Score: S (~R)
– $S = \sum_i M(q_i, t_i) - \sum \text{gap penalties}$
• Any alignment has a score
• Any two sequences have a(t least one)
optimal alignment
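As a sketch of this scoring rule, the function below sums substitution scores over aligned pairs and subtracts a penalty for every gap character. The match, mismatch and linear gap values are assumptions for illustration; real searches use a substitution matrix and separate gap opening and extension costs.

```python
def alignment_score(aligned_q, aligned_t, match=1, mismatch=-1, gap_penalty=2):
    """Score a finished alignment given as two equal-length gapped strings."""
    if len(aligned_q) != len(aligned_t):
        raise ValueError("aligned sequences must have equal length")
    substitution_total = 0
    gap_total = 0
    for q, t in zip(aligned_q, aligned_t):
        if q == "-" or t == "-":
            gap_total += gap_penalty          # linear gap cost (assumption)
        else:
            substitution_total += match if q == t else mismatch
    return substitution_total - gap_total

if __name__ == "__main__":
    print(alignment_score("AATCAT", "ATTCAG"))  # 4 matches, 2 mismatches
    print(alignment_score("AC-T", "ACGT"))      # 3 matches, 1 gap
```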
48. • For a particular scoring matrix and its
associated gap initiation and extension costs
one must calculate λ and k
• Unfortunately (for gapped alignments), you
can’t do this analytically and the values must
be estimated empirically
– The procedure involves aligning random
sequences (Monte Carlo approach) with a specific
scoring scheme and observing the alignment
properties (scores, target frequencies and
lengths)
49. Significance
“Monte Carlo” Approach:
• Compares result to randomized result,
similarly to results generated by a roulette
wheel at Monte Carlo
• Typical procedure for alignments
– Randomize sequence A
– Align to sequence B
– Repeat many times (hundreds)
– Keep track of the optimal score
• Histogram of scores …
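The procedure above can be sketched as follows. The local alignment scorer is a plain Smith-Waterman recurrence with a linear gap cost, and the scoring values, trial count and seed are assumptions chosen to keep the example small.

```python
import random

def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def monte_carlo_scores(seq_a, seq_b, trials=200, seed=0):
    """Shuffle seq_a repeatedly and record the optimal score against seq_b."""
    rng = random.Random(seed)
    residues = list(seq_a)
    scores = []
    for _ in range(trials):
        rng.shuffle(residues)
        scores.append(local_alignment_score("".join(residues), seq_b))
    return scores

def empirical_p_value(observed, random_scores):
    """Fraction of randomized scores at least as high as the observed one."""
    at_least = sum(1 for s in random_scores if s >= observed)
    return (at_least + 1) / (len(random_scores) + 1)

if __name__ == "__main__":
    a, b = "ACGTACGTAC", "TTACGTAGGA"
    observed = local_alignment_score(a, b)
    background = monte_carlo_scores(a, b)
    print(observed, empirical_p_value(observed, background))
```

The list returned by `monte_carlo_scores` is what would be plotted as the histogram of scores.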
53. Normal Distribution does NOT Fit Alignment Scores !!
• In seeking optimal Alignments between two
sequences, one desires those that have the highest
score - i.e. one is seeking a distribution of maxima
• In seeking optimal Matches between an Input
Sequence and Sequence Entries in a Database, one
again desires the matches that have the highest
score, and these are obtained via examination of the
distribution of such scores for the entries in the
database - this is again a distribution of maxima.
“A Normal Distribution is a distribution of Sums of
independent variables rather than of their
Maxima.”
Significance
54. Comparing distributions
Gaussian: $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Extreme Value: $f(x) = e^{-x} e^{-e^{-x}}$
55. Alignment scores follow extreme value distributions
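Alignment of unrelated/random sequences results in scores
following an extreme value distribution:
$P(x \ge S) = 1 - \exp(-k m n e^{-\lambda S})$
m, n: sequence lengths.
k, λ: free parameters.
Writing $E = k m n e^{-\lambda S}$ for the expected number of chance hits:
$P = 1 - e^{-E}$
$E = -\ln(1 - P)$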
This can be shown analytically for ungapped alignments and has
been found empirically to also hold for gapped alignments under
commonly used conditions.
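The conversion between E and P on this slide is easy to check numerically. The sketch below uses hypothetical function names and shows that the two values are nearly identical when E is small.

```python
import math

def p_from_e(e_value):
    """P = 1 - exp(-E), computed stably for very small E."""
    return -math.expm1(-e_value)

def e_from_p(p_value):
    """E = -ln(1 - P), the inverse of p_from_e."""
    return -math.log1p(-p_value)

if __name__ == "__main__":
    for e in (10.0, 1.0, 0.01, 1e-6):
        print(e, p_from_e(e))
```

For E below about 0.01 the two numbers agree to within one percent, which is why they are often used interchangeably for strong hits.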
56. Alignment scores follow extreme value distributions
Alignment algorithms will always produce alignments,
regardless of whether they are meaningful or not
=> it is important to have a way of selecting significant alignments
from a large set of database hits.
Solution: fit distribution of scores from database search to
extreme value distribution; determine p-value of hit from this
fitted distribution.
Example: scores fitted to
extreme value distribution.
99.9% of this distribution is
located below score=112
=> hit with score = 112 has a
p-value of 0.1%
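One simple way to carry out such a fit is the method of moments for a Gumbel (extreme value) distribution. This is a swapped-in illustration, not necessarily the fitting procedure used by any particular search program, and the example scores are made up.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Estimate Gumbel location and scale from the sample mean and stdev."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)
    scale = stdev * math.sqrt(6) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale

def gumbel_p_value(score, location, scale):
    """Probability that a random score is at least `score`."""
    return 1.0 - math.exp(-math.exp(-(score - location) / scale))

if __name__ == "__main__":
    background = [30, 32, 35, 31, 40, 33, 36, 34]  # made-up scores
    location, scale = fit_gumbel(background)
    for s in (35, 45, 60):
        print(s, gumbel_p_value(s, location, scale))
```

A hit is then reported with the p-value that the fitted curve assigns to its score, as in the score = 112 example above.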
57. BLAST uses precomputed extreme
value distributions to calculate E-values
from alignment scores
For this reason BLAST only allows
certain combinations of substitution
matrices and gap penalties
This also means that the fit is based on
a different data set than the one you
are working on
Significance
A word of caution: BLAST tends to overestimate the significance of its
matches
E-values from BLAST are fine for identifying sure hits
One should be careful using BLAST’s E-values to judge if a marginal hit
can be trusted (e.g., you may want to use E-values of $10^{-4}$ to $10^{-5}$).
58. Determining P-values
• If we can estimate the location and scale parameters of
the extreme value distribution, then we can
determine, for a given match score x, the
probability that a random match with score x
or greater would have occurred in the
database.
• For sequence matches, a scoring system and
database can be parameterized by two
parameters, k and λ, related to that location and scale.
– It would be nice if we could compare hit
significance without regard to the scoring system
used!
59. Bit Scores
• The expected number of hits with score at least S
is:
$E = K m n e^{-\lambda S}$
– Where m and n are the sequence lengths
• Normalize the raw score using:
$S' = \frac{\lambda S - \ln K}{\ln 2}$
• Obtains a “bit score” S’, with a standard set of
units.
• The new E-value is:
$E = m n 2^{-S'}$
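A short check that the bit-score form gives the same E-value as the raw-score form is sketched below. The values of K and λ are placeholders chosen for illustration.

```python
import math

def bit_score(raw_score, k, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_from_raw(raw_score, m, n, k, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)

def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S'); K and lambda are no longer needed."""
    return m * n * 2.0 ** (-bits)

if __name__ == "__main__":
    k, lam = 0.1, 0.3  # placeholder parameters (assumption)
    bits = bit_score(50, k, lam)
    print(bits)
    print(e_from_raw(50, 300, 1_000_000, k, lam))
    print(e_from_bits(bits, 300, 1_000_000))
```

Because K and λ are absorbed into S', bit scores from searches with different scoring systems can be compared directly.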
61. • The distribution-of-scores graph shows the
frequency of observed scores
• The expected curve (asterisks) follows
the extreme value distribution
–the theoretical curve should be
similar to the observed results
• Deviations indicate that the fitting
parameters are wrong
–gap penalties that are too weak
–compositional biases
FastA Output
64. • A summary of the statistics and of the
program parameters follows the histogram.
– An important number in this summary is the
Kolmogorov-Smirnov statistic, which indicates
how well the actual data fit the theoretical
statistical distribution. The lower this value, the
better the fit, and the more reliable the statistical
estimates.
– In general, a Kolmogorov-Smirnov statistic under
0.1 indicates a good fit with the theoretical model.
If the statistic is higher than 0.2, the statistics may
not be valid, and it is recommended to repeat the
search, using more stringent (more negative)
values for the gap penalty parameters.
FastA Output
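The Kolmogorov-Smirnov statistic is the largest vertical distance between the empirical distribution of the observed scores and the theoretical curve. A minimal sketch, with hypothetical function names and a Gumbel curve as the theoretical model:

```python
import math

def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of `samples` and `cdf`."""
    ordered = sorted(samples)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        theoretical = cdf(x)
        # Compare just before and just after the empirical step at x.
        distance = max(distance,
                       abs((i + 1) / n - theoretical),
                       abs(i / n - theoretical))
    return distance

def gumbel_cdf(location, scale):
    """Return the CDF of an extreme value distribution as a function."""
    return lambda x: math.exp(-math.exp(-(x - location) / scale))

if __name__ == "__main__":
    scores = [30, 32, 35, 31, 40, 33, 36, 34]  # made-up scores
    print(ks_statistic(scores, gumbel_cdf(32.0, 2.5)))
```

Using the guideline above, a result under 0.1 would suggest a good fit and a result above 0.2 would call the statistical estimates into question.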
65. Statistics summary
• Optimal local alignment scores for pairs of random
amino acid sequences of the same length follow an
extreme-value distribution. For any score S, the
probability of observing a score >= S is given by the
Karlin-Altschul statistic:
$P(\text{score} \ge S) = 1 - \exp(-k m n e^{-\lambda S})$
• k and λ are parameters related to the position
of the maximum and the width of the distribution.
• Note the long tail at the right. This means that a
score several standard deviations above the mean
has a higher probability of arising by chance (that is, it
is less significant) than if the scores followed a
normal distribution.
66. P-values
• Many programs report P = the probability that the
alignment is no better than random. The relationship
between Z and P depends on the distribution of the
scores from the control population, which do NOT
follow a normal distribution
– P <= $10^{-100}$ (exact match)
– P in range $10^{-100}$ to $10^{-50}$ (sequences nearly identical, e.g.
alleles or SNPs)
– P in range $10^{-50}$ to $10^{-10}$ (closely related sequences,
homology certain)
– P in range $10^{-5}$ to $10^{-1}$ (usually distant relatives)
– P > $10^{-1}$ (match probably insignificant)
67. E
• For database searches, most programs report E-values. The
E-value of an alignment is the expected number of sequences
that give the same Z-score or better if the database is probed
with a random sequence. E is found by multiplying the value
of P by the size of the database probed. Note that E, but not P,
depends on the size of the database. Values of P are
between 0 and 1. Values of E are between 0 and the number
of sequences in the database searched:
– E <= 0.02: sequences probably homologous
– E between 0.02 and 1: homology cannot be ruled out
– E > 1: a match this good would be expected by chance alone
69. BLAST is actually a family of programs:
• BLASTN - Nucleotide query searching a
nucleotide database.
• BLASTP - Protein query searching a
protein database.
• BLASTX - Translated nucleotide query
sequence (6 frames) searching a protein
database.
• TBLASTN - Protein query searching a
translated nucleotide (6 frames) database.
• TBLASTX - Translated nucleotide query (6
frames) searching a translated nucleotide
(6 frames) database.
Blast
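The "6 frames" used by BLASTX, TBLASTN and TBLASTX are the three reading frames on each strand. The sketch below translates all six with the standard genetic code; the function names are hypothetical and unknown codons are shown as X by assumption.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; anything unrecognized becomes 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(dna):
    """Translate frames +1..+3 and -1..-3 of a nucleotide sequence."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six conceptual translations is then compared at the protein level, which is why translated searches cost several times more than a plain protein search.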
85. • Be aware of what options you
have selected when using
BLAST or FASTA
implementations.
• Treat BLAST searches as
scientific experiments
• So you should try your searches
with the filters on and off to see
whether it makes any difference
to the output
Tips
86. Tips: Low-complexity and Gapped Blast Algorithm
• The common Web-based implementations often have
default settings that will affect the outcome
of your searches. By default all NCBI BLAST
implementations filter out biased sequence
composition from your query sequence (e.g.
signal peptide and transmembrane
sequences - beware!).
• The SEG program has been implemented
as part of the blast routine in order to mask
low-complexity regions
• Low-complexity regions are denoted by
strings of Xs in the query sequence
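The idea of masking can be illustrated with a much simpler stand-in for SEG: compute the Shannon entropy of the residue composition in a sliding window and replace low-entropy windows with X. The window length and entropy threshold below are illustrative assumptions, and this is not the SEG algorithm itself.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of `window`."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every window whose entropy is below `min_entropy` with Xs."""
    masked = list(seq)
    for start in range(0, max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if shannon_entropy(chunk) < min_entropy:
            for i in range(start, start + len(chunk)):
                masked[i] = "X"
    return "".join(masked)

if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVK" + "SSSSSSSSSSSSSSSS" + "HFSRQLEERLGLIEVQ"
    print(mask_low_complexity(query))
```

Running a query through a filter like this before and after switching the filter off shows which hits depend only on the biased region.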
87. • The sequence databases contain a
wealth of information. They also
contain a lot of errors. Contaminants
…
• Annotation errors, frameshifts that
may result in erroneous conceptual
translations.
• Hypothetical proteins ?
• In the words of Fox Mulder, "Trust
no one."
Tips
88. • Once you get a match to things
in the databases, check whether
the match is to the entire protein,
or to a domain. Don't
immediately assume that a
match means that your protein
carries out the same function
(see above). Compare your
protein and the match protein(s)
along their entire lengths before
making this assumption.
Tips
89. • Domain matches can also cause problems
by hiding other informative matches. For
instance if your protein contains a common
domain you'll get significant matches to
every homologous sequence in the
database. BLAST only reports back a
limited number of matches, ordered by P
value.
• If this list consists only of matches to the
same domain, cut this bit out of your query
sequence and do the BLAST search again
with the edited sequence (e.g. NHR).
Tips
90. • Do controls wherever possible, in
particular when you use a
search program for the first time.
• Suitable positive controls would be protein
sequences known to have distant
homologues in the databases to check
how good the software is at detecting such
matches.
• Negative controls can be employed to
make sure the compositional bias of the
sequence isn't giving you false positives.
Shuffle your query sequence and see what
difference this makes to the matches that
are returned. A real match should be lost
upon shuffling of your sequence.
Tips
91. • Perform Controls
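#!/usr/bin/perl -w
# Shuffle the residues of a FASTA sequence (negative control for a search).
use strict;
my ($def, @seq) = <>;                   # first line is the header, the rest is sequence
print $def;
chomp @seq;
@seq = split(//, join("", @seq));       # one array element per residue
my $count = 0;
while (@seq) {
    my $index = rand(@seq);             # random position among the remaining residues
    my $base = splice(@seq, $index, 1); # remove and return that residue
    print $base;
    print "\n" if ++$count % 60 == 0;   # wrap output at 60 characters per line
}
print "\n" unless $count % 60 == 0;     # terminate a final partial line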
Tips
92. • Read the footer first
• View results graphically
• Parse Blasts with Bioperl
Tips
93. • BLAST's major advantage is its speed.
– 2-3 minutes for BLAST versus several hours
for a sensitive FastA search of the whole of
GenBank.
• When both programs use their default
settings, BLAST is usually more sensitive
than FastA for detecting protein sequence
similarity.
– Since it doesn't require a perfect sequence
match in the first stage of the search.
FastA vs. Blast
94. Weakness of BLAST:
– The long word size it uses in the initial stage of DNA
sequence similarity searches was chosen for speed, and not
sensitivity.
– For a thorough DNA similarity search, FastA is the
program of choice, especially when run with a lowered
KTup value.
– FastA is also better suited to the specialised task of
detecting genomic DNA regions using a cDNA query
sequence, because it allows the use of a gap extension
penalty of 0. The original BLAST, which only creates ungapped
alignments, will usually detect only the longest exon, or fail
altogether.
• In general, a BLAST search using the default
parameters should be the first step in a database
similarity search strategy. In many cases, this is all
that may be required to yield all the information
needed, in a very short time.
FastA vs. Blast
96. 1. Old (ungapped) BLAST
2. New BLAST (allows gaps)
3. Profile -> PSI Blast - Position Specific
Iterated
Strategy: Multiple alignment of the hits
Calculates a position-specific score matrix
Searches with this matrix
In many cases is much more sensitive to weak but
biologically relevant sequence similarities
PSSM !!!
PSI-Blast
97. • Patterns of conservation from the alignment of
related sequences can aid the recognition of
distant similarities.
– These patterns have been variously called motifs,
profiles, position-specific score matrices, and
Hidden Markov Models.
For each position in the derived pattern, every
amino acid is assigned a score.
(1) Highly conserved residue at a position: that
residue is assigned a high positive score, and
others are assigned high negative scores.
(2) Weakly conserved positions: all residues receive
scores near zero.
(3) Position-specific scores can also be assigned to
potential insertions and deletions.
PSI-Blast
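A position-specific score matrix of this kind can be sketched as log-odds scores computed column by column from an ungapped multiple alignment. The uniform background frequency and the pseudocount of 1 are simplifying assumptions; PSI-BLAST uses a more refined weighting scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

def score_segment(pssm, segment):
    """Sum the position-specific scores of `segment` against the matrix."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment))

if __name__ == "__main__":
    alignment = ["ACD", "AGE", "AKF", "ALH"]  # toy alignment
    pssm = build_pssm(alignment)
    print(round(pssm[0]["A"], 2), round(pssm[0]["C"], 2))  # conserved column
    print(round(pssm[1]["C"], 2), round(pssm[1]["W"], 2))  # variable column
```

In the toy alignment the first column is fully conserved, so A scores strongly positive and every other residue negative, while the variable second column gives scores close to zero.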
98. Pattern
• a set of alternative
sequences, using
“regular expressions”
• Prosite
(http://www.expasy.org/prosite/)
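PROSITE-style patterns map almost directly onto regular expressions. The converter below is a minimal sketch that handles the common syntax elements (x for any residue, square brackets for alternatives, curly braces for exclusions, parentheses for repeats, < and > for the termini). The example patterns are illustrations only.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string to a Python regex string."""
    regex = pattern.strip().rstrip(".")
    regex = regex.replace("-", "")                      # element separators
    regex = regex.replace("x", ".")                     # any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminus
    return regex

def find_motif(pattern, sequence):
    """Return (start, end) of the first match, or None if there is none."""
    match = re.search(prosite_to_regex(pattern), sequence)
    return match.span() if match else None

if __name__ == "__main__":
    zinc_finger_like = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
    print(prosite_to_regex(zinc_finger_like))
    print(find_motif("<[KR](4)", "KKRRALSTV"))
```

Anchors such as `<` make it possible to restrict a pattern to the N-terminus of a sequence.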
102. • The power of profile methods can be
further enhanced through iteration of
the search procedure.
– After a profile is run against a database,
new similar sequences can be detected. A
new multiple alignment, which includes
these sequences, can be constructed, a
new profile abstracted, and a new
database search performed.
– The procedure can be iterated as often as
desired or until convergence, when no new
statistically significant sequences are
detected.
PSI-Blast
103. (1) PSI-BLAST takes as an input a single protein sequence
and compares it to a protein database, using the gapped
BLAST program.
(2) The program constructs a multiple alignment, and then a
profile, from any significant local alignments found.
The original query sequence serves as a template for the multiple
alignment and profile, whose lengths are identical to that of the
query. Different numbers of sequences can be aligned in different
template positions.
(3) The profile is compared to the protein database, again
seeking local alignments using the BLAST algorithm.
(4) PSI-BLAST estimates the statistical significance of the local
alignments found.
Because profile substitution scores are constructed to a fixed
scale, and gap scores remain independent of position, the
statistical theory and parameters for gapped BLAST alignments
remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), a
specified number of times or until convergence.
PSI-Blast
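The control flow of steps (1) to (5) can be sketched as a loop that stops at convergence. The `search` and `build_profile` callables are hypothetical stand-ins supplied by the caller, and the toy versions below only mimic the behaviour on a three-sequence "database"; nothing here reflects the internals of the real program.

```python
def iterate_search(query, search, build_profile, max_iterations=5):
    """Repeat search and profile rebuilding until no new hits appear."""
    included = set()
    profile = build_profile(query, [])
    for iteration in range(1, max_iterations + 1):
        new_hits = set(search(profile)) - included
        if not new_hits:
            return included, iteration  # converged: nothing new was found
        included |= new_hits
        profile = build_profile(query, sorted(included))
    return included, max_iterations

# Toy stand-ins: each sequence "detects" the sequences it is linked to.
TOY_LINKS = {"query": {"a"}, "a": {"b"}, "b": set()}

def toy_search(profile):
    found = set()
    for member in profile:
        found |= TOY_LINKS.get(member, set())
    return found

def toy_profile(query, members):
    return [query] + list(members)

if __name__ == "__main__":
    print(iterate_search("query", toy_search, toy_profile))
```

In the toy run, sequence "b" is invisible to the query alone and is only found once "a" has been added to the profile, which is the gain in sensitivity that iteration provides.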
109. PSI-BLAST pitfalls
• Avoid sequences that are too closely related: they cause overfitting!
• False homologs can be included! Therefore check
the matches carefully: include or exclude
sequences based on biological knowledge.
• The E-value reflects the significance of the
match to the previous training set, not to the
original sequence!
• Choose your query sequence carefully.
• Try the reverse experiment to confirm.
110. Reduce overfitting risk by Cobbler
• A single sequence is selected
from a set of blocks and enriched
by replacing the conserved
regions delineated by the blocks
by consensus residues derived
from the blocks.
• Embedding consensus residues
improves performance
• S. Henikoff and J.G. Henikoff;
Protein Science (1997) 6:698-705.
122. BLAT method
• Align sequence with BLAT, get alignment
info
• Per BLAT hit, pick up additional info from
connected databases:
– mRNAs
– ESTs
– RepeatMasker
– CpG Islands
– RefSeq Genes
124. Weblems
W5.1: Submit the amino acid sequence of papaya
papain to a BLAST (gapped and ungapped) and to a
PSI-BLAST search. What are the main differences in the
results?
W5.2: Is there a relationship between Klebsiella
aerogenes urease, Pseudomonas diminuta
phosphotriesterase and mouse adenosine deaminase?
Also use DALI, ClustalW and T-coffee.
W5.3: Yeast two-hybrid typically yields DNA
sequences. How would you find the corresponding
protein?
W5.4: When and why would you use tblastn?
W5.5: How would you search a database if you want to
restrict the search space to those entries having a
secretion signal consisting of 4 consecutive (N-terminal)
basic residues?