The document discusses algorithms for database searching and sequence alignment. It introduces BLAST and FASTA, two widely used algorithms for database searching. BLAST works by finding short words in sequences that score above a threshold and then extending any alignments found. FASTA uses a "hit and extend" heuristic to find locally similar regions. The document then discusses the statistical models that BLAST uses to calculate expected values and rank matching sequences by significance. It describes how BLAST models alignments as coin tosses to apply the Erdős-Rényi theorem and derive the Karlin-Altschul equation for calculating expected values.
Whole genome sequencing is a technique to sequence the entire genome of an organism. It involves breaking the genome into small fragments, copying the fragments, sequencing the fragments, and reassembling the sequence data into the full genome. Key steps include isolating DNA, fragmenting it, ligating fragments into plasmids, amplifying the plasmids, sequencing the fragments using Sanger sequencing, and assembling the sequence reads into the complete genome. Whole genome sequencing allows researchers to discover coding and non-coding regions, predict disease susceptibility, and perform evolutionary studies by comparing species.
The document discusses several sequence similarity tools: BLAST, FASTA, and CLUSTAL. It describes BLAST as an algorithm that finds locally alignable regions by identifying segment pairs between a query and database sequences that score above a threshold. FASTA is described as a fast protein or nucleotide comparison tool that is more sensitive than BLAST but slower. CLUSTAL W is introduced as a tool that produces meaningful multiple sequence alignments of divergent sequences and can reveal evolutionary relationships.
Systems biology is the computational and mathematical modeling of complex biological systems. It is a biology-based interdisciplinary field of study that focuses on complex interactions within biological systems, using a holistic approach (holism instead of the more traditional reductionism) to biological research.
1. Next-generation sequencing methods such as Roche 454, Illumina GAII, and ABI SOLiD allow for high throughput DNA sequencing through massive parallel sequencing.
2. These methods involve clonal amplification of DNA fragments on solid surfaces or in emulsion PCR followed by sequencing using pyrosequencing, sequencing by synthesis with reversible terminators, or sequencing by ligation approaches.
3. The resulting sequencing data requires high throughput management and analysis pipelines to process the large volumes of sequence data produced.
The document discusses transcriptomics and the relationship between transcriptome size and organism complexity. It questions how gene expression contributes to transcriptome size and what new studies reveal about size and complexity. Specifically, it notes that alternative splicing and RNA editing increase transcriptome size and complexity. It also discusses that the human genome is pervasively transcribed, with one stretch of DNA encoding many RNAs, including microRNAs, which control mRNA expression and are involved in development, gene regulation, and diseases like cancer.
Protein threading is a protein structure prediction method that involves "threading" or placing an amino acid sequence into known protein structure templates to find the best matching fold. The key steps are:
1) A query sequence is threaded into structural positions of templates from a structure library to find sequence-structure alignments
2) Alignments are scored and optimized using an objective function accounting for residue interactions and preferences
3) The highest scoring template is selected as the predicted structure, though loop regions are often not accurately predicted
Scoring schemes in bioinformatics (BLOSUM), SumatiHajela
This document discusses scoring schemes in bioinformatics, specifically BLOSUM (BLOcks SUbstitution Matrix). It introduces BLOSUM, describing that it is based on conserved amino acid patterns from multiple sequence alignments. It then explains the BLOSUM-62 matrix and the BLOSUM scoring algorithm. The document contrasts BLOSUM with PAM matrices, noting key differences like BLOSUM being based on direct observations while PAM uses evolutionary modeling. Finally, it outlines the significance of scoring matrices for detecting distant evolutionary relationships between protein sequences.
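As a rough illustration of how a BLOSUM-style entry is derived, the sketch below computes a rounded log-odds score from an observed pair frequency and two background frequencies. The function name, its parameters, and the example frequencies are assumptions made for illustration; they are not taken from the summarized document or from the published BLOSUM-62 counts.

```python
import math

def log_odds_score(p_ij, q_i, q_j, scale=2.0):
    """Rounded log-odds score for aligning residue i with residue j.

    p_ij  -- observed frequency of the aligned pair in conserved blocks
    q_i   -- background frequency of residue i
    q_j   -- background frequency of residue j
    scale -- 2.0 expresses the score in half-bit units (an assumption here)
    """
    return round(scale * math.log2(p_ij / (q_i * q_j)))

# Hypothetical frequencies: a pair seen four times more often than chance
# scores positive, a pair seen four times less often scores negative.
print(log_odds_score(0.04, 0.1, 0.1))
print(log_odds_score(0.0025, 0.1, 0.1))
```

Positive scores mark substitutions observed more often than expected by chance, and negative scores mark those observed less often.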
The document describes FASTA, a sequence similarity search tool that compares nucleotide or amino acid sequences. FASTA was first described in 1985, before BLAST, and trades some sensitivity for speed compared with full dynamic programming. It finds patches of similarity between a query sequence and database. There are different FASTA programs that compare query sequences to protein or DNA libraries or translate DNA sequences. FASTA is rapid, allows gaps, and is useful for tasks like species identification, phylogeny, DNA mapping, and understanding protein function.
The document discusses various methods for structurally aligning proteins, including combinatorial extension, VAST, DALI, SSAP, and TM-align. It also describes Ramachandran plots, which show allowed and favored phi/psi dihedral angle combinations for protein backbone chains based on steric constraints. Structural alignment methods are useful for detecting evolutionary relationships between proteins with low sequence similarity. Ramachandran plots help validate protein structures by identifying conformations not allowed by steric hindrance.
The document discusses BLAST (Basic Local Alignment Search Tool), an algorithm used to compare a query DNA or protein sequence against a database of sequences. BLAST works by identifying exact or approximate matches between words of 3-11 letters in the query and database sequences. Matches are extended to find local alignments with high scores. Significant alignments are identified based on their score and the expected number of matches by chance (E-value). The document provides examples of how BLAST finds local alignments and calculates E-values. It also describes different BLAST programs and suggestions for using BLAST.
Ab initio protein structure prediction uses computational methods to predict a protein's 3D structure from its amino acid sequence. It relies on conformational searching to generate structure decoys and selecting native-like models. The key factors for success are an accurate energy function, efficient search methods like molecular dynamics or genetic algorithms, and effective selection of models close to the native structure. Model selection approaches include energy evaluations, compatibility scores, clustering of similar decoys, and identifying the lowest energy conformations.
This document discusses isoschizomers, neoschizomers, and isocaudomers - types of restriction enzymes that recognize the same or similar DNA sequences but may cut the DNA differently. It provides examples like SphI and BbuI which are isoschizomers that recognize the same sequence (GCATG/C) but were isolated from different bacteria. It also discusses the uses of restriction enzymes in recombinant DNA technology and their biological role in bacteria to modify or destroy incoming DNA.
The document discusses FASTA, a sequence alignment software tool. It describes the history and development of FASTA, which was originally designed for protein sequence similarity searching and later expanded to support DNA and translated DNA searches. FASTA uses local sequence alignment and heuristic methods to quickly search databases and find similar sequences. It supports various types of searches for protein, nucleotide, and translated sequences.
Secondary structure prediction tools analyze a protein's amino acid sequence to predict its secondary structure elements. These tools use various methods like Chou-Fasman, GOR, neural networks, and hidden Markov models to identify alpha helices and beta sheets based on characteristics like residue propensity values, sequence homology, and patterns in windows of amino acids. Accurate prediction of secondary structure is important for determining a protein's tertiary structure and biological role.
Next generation sequencing (NGS) is a modern and cost-effective sequencing technology that enables scientists to sequence nucleic acids at a much faster rate. In this presentation, you will learn what NGS is, the idea behind NGS, its methodology and protocol, widely adopted NGS protocols, applications, and references for further study.
It contains information about: DNA sequencing; history and eras of sequencing; next generation sequencing (introduction, workflow, Illumina/Solexa sequencing, Roche/454 sequencing, Ion Torrent sequencing, ABI SOLiD sequencing); a comparison between NGS and Sanger sequencing and among NGS platforms; advantages and applications of NGS; and future applications of NGS.
Microsatellites are short tandem repeats of DNA motifs between 1-9 base pairs found throughout genomes. They have high mutation rates and genetic diversity. Microsatellites are used for DNA fingerprinting, identifying individuals in forensics and paternity testing, and studying population genetics and genetic diseases like triplet expansion disorders.
Next generation sequencing (NGS) refers to modern DNA sequencing technologies that allow for high-speed, low-cost sequencing of entire genomes. NGS works by massively parallel sequencing of millions of DNA fragments. The Illumina sequencing by synthesis method is the most commonly used NGS approach. It involves library preparation, cluster generation on a flow cell, sequencing via reversible dye-terminator chemistry, and computational analysis of sequenced reads. Key advantages of NGS include its scalability, unlimited dynamic range, tunable coverage levels, and ability to multiplex many samples simultaneously in a single run.
This document discusses global and local sequence alignment. Global alignment aims to align the entire sequences, treating gaps equally across the sequences. It is useful for closely related sequences of similar length. Local alignment finds locally similar regions, allowing gaps to be treated differently. It is useful for more distantly related sequences that may contain similar subsequences. Both use dynamic programming, with global alignment using Needleman-Wunsch and local using Smith-Waterman. Dynamic programming breaks the problem into subproblems by filling a matrix to find the highest scoring alignment.
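The matrix-filling idea shared by Needleman-Wunsch and Smith-Waterman can be sketched in one small function. This is a minimal score-only version that assumes a linear gap penalty and simple match/mismatch scores; the function name and default values are illustrative and are not taken from the summarized document.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Return the optimal alignment score of strings a and b.

    local=False fills the matrix as in Needleman-Wunsch (global);
    local=True fills it as in Smith-Waterman (local).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps along the borders.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else score[-1][-1]

print(align_score("ACGT", "AGT"))                                  # global
print(align_score("AAAGATTACATTT", "CCGATTACAGG", local=True))     # local
```

The only differences between the two modes are the border initialization, the floor at zero, and where the final score is read from. Recovering the alignment itself would additionally require a traceback, which is omitted here.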
This document discusses Hardy-Weinberg equilibrium, which describes the expected genotype and allele frequencies in a population that is not evolving. It will be in equilibrium if 5 assumptions are met: large population size, no migration, negligible mutations, random mating, no natural selection. The model consists of two equations to calculate expected allele and genotype frequencies. Observed frequencies in a sample California population at the EST locus match the expected frequencies, indicating the population is in equilibrium at this locus and not evolving. However, the assumptions are often violated in real populations.
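The two Hardy-Weinberg equations, p + q = 1 and p^2 + 2pq + q^2 = 1, can be applied to genotype counts as in the sketch below. The function names and the genotype counts are hypothetical; they are not the California EST locus data mentioned in the summary.

```python
def hardy_weinberg_expected(n_AA, n_Aa, n_aa):
    """Return allele frequencies (p, q) and expected genotype counts."""
    n = n_AA + n_Aa + n_aa
    # Each individual carries two alleles, so there are 2n alleles in total.
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    return p, q, expected

def chi_square(observed, expected):
    """Chi-square statistic comparing observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = (36, 48, 16)  # hypothetical counts of AA, Aa, aa
p, q, expected = hardy_weinberg_expected(*observed)
print(p, q, expected, chi_square(observed, expected))
```

A chi-square value near zero means the observed genotype counts are close to those expected under equilibrium.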
This document provides an overview of the FASTA software. FASTA is a program used by biologists to study and analyze DNA and protein sequences. It uses a simple text-based format to present sequences and allows for the naming of sequences and inclusion of comments. FASTA is a rapid program that can be used locally or through email servers to find regional similarities between sequences and identify potential matches, trading some sensitivity for speed. It has become a standard tool in biology for analyzing proteins and DNA.
This document discusses protein threading modeling methods. Protein threading, also called fold recognition, is used to model proteins that have the same fold as proteins with known structures but no homologous sequences. It differs from homology modeling which is used for proteins that have homologous sequences. Protein threading works by using statistical knowledge of relationships between structures in the Protein Data Bank and the sequence of the protein being modeled. It is based on observations that there are a limited number of folds in nature and most new structures have similar folds to ones already in the PDB. The document then describes the general steps of the protein threading method.
The document discusses similarity searches of sequence databases using BLAST and FASTA. It describes the importance of identifying similar sequences to infer shared biological function. BLAST and FASTA use heuristic algorithms to rapidly identify local similarities between query and database sequences. The statistical significance of alignments is assessed using an extreme value distribution to calculate p-values and e-values, which estimate the probability of observing a given alignment score by chance. This allows filtering of random matches and identification of biologically meaningful similarities.
Protein-protein interactions are important for many biological processes. There are various types of interactions depending on their composition and duration. Methods to study interactions include yeast two-hybrid, co-immunoprecipitation, affinity chromatography, and chromatin immunoprecipitation. Databases such as IntAct and MINT provide repositories for protein interaction data.
1) Flat files are a simple way to store sequence data as plain text files on a hard drive, with each file containing sequences in a format like Genbank or EMBL.
2) Relational databases offer a more structured way to store and query biological sequence data by organizing it across multiple tables linked by fields.
3) Popular biological sequence databases like GenBank store DNA, protein, and structure data in flat files in formats like GBFF, but also provide relational database access and web query tools for programmatic searching.
The document discusses various topics in bioinformatics including:
1) Control structures, lists, dictionaries, and regular expressions in Python.
2) Parsing Swiss-Prot files and extracting amino acid frequencies using Biopython.
3) Functions for working with biological sequences like transcription, translation, and translating between different genetic codes using the Biopython module.
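To show what transcription and translation do to a sequence, here is a dependency-free sketch in plain Python. It deliberately does not use Biopython, so the function names below are hypothetical stand-ins rather than Biopython's API, and only the standard genetic code is covered.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return coding_dna.upper().replace("T", "U")

def translate(coding_dna):
    """Translate coding-strand DNA codon by codon; '*' marks a stop codon."""
    dna = coding_dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(transcribe(dna))
print(translate(dna))
```

Other genetic codes differ only in a handful of codon assignments, so supporting them in this sketch would amount to swapping the amino acid string.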
The document discusses reading and writing files in Python. It provides examples of opening files for reading, writing, and appending. It demonstrates how to read an entire file, individual lines, and loop through lines. It also shows how to write strings to files and close files once writing is complete. Additional topics covered include a template for reading files line by line and examples of counting lines, words, and characters in a file.
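A line-by-line reading template of the kind described might look like the following sketch. The function names are illustrative assumptions; the counting logic works on any iterable of lines, including an open file handle.

```python
def count_text(lines):
    """Count lines, whitespace-separated words, and characters."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)  # includes the trailing newline, if present
    return n_lines, n_words, n_chars

def count_file(path):
    """Open a text file and count its lines, words, and characters."""
    # The with-statement closes the file even if an error occurs.
    with open(path, "r", encoding="utf-8") as handle:
        return count_text(handle)

print(count_text(["first line\n", "second\n"]))
```

Iterating over the file handle reads one line at a time, so the whole file never has to fit in memory.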
This document provides an overview of phylogenetic methodologies. It defines key phylogenetic terms like clade, internal node, and outgroups. It discusses different species concepts and how phylogenetic trees illustrate evolutionary relationships. It also covers popular phylogenetic methodologies like distance methods, maximum parsimony, and maximum likelihood. Distance methods calculate pairwise distances and cluster sequences into trees. UPGMA averages these distances while neighbor joining finds the shortest branches. The document highlights the use of phylogenetic analysis across various fields.
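The averaging step of UPGMA can be sketched as follows. This minimal version returns only the tree topology as a nested-parenthesis string and omits branch lengths; the function name, the input format (a dictionary keyed by `frozenset` pairs), and the distances are assumptions made for illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as a string.

    names -- list of taxon labels
    dist  -- dict mapping frozenset({a, b}) to the distance between a and b
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        na, nb = sizes.pop(a), sizes.pop(b)
        merged = f"({a},{b})"
        for other in sizes:
            # New distance is the size-weighted average of the old distances.
            d[frozenset((merged, other))] = (
                na * d[frozenset((a, other))] + nb * d[frozenset((b, other))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))

# Hypothetical pairwise distances for four taxa.
pairs = {("A", "B"): 2, ("C", "D"): 4,
         ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10}
dist = {frozenset(k): v for k, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], dist))
```

Neighbor joining uses a different pair-selection criterion that corrects for unequal rates among lineages, so it is not covered by this sketch.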
This document provides an overview of GitHub as a hosted Git service and introduces some basic Python concepts including control structures, lists, dictionaries, regular expressions, and BioPython. It demonstrates how to install Biopython and parse sequence data from Swiss-Prot using Biopython modules. It also includes example questions for analyzing sequence data from Swiss-Prot.
The document discusses various topics related to drug discovery through bioinformatics and computational approaches. It begins by discussing comparative genomics and using knowledge about model organisms to identify similar biological areas and pathways in other species. It also discusses topics like high-throughput screening of large libraries, the definitions of targets, hits and leads in drug discovery, and approaches like using RNAi and phenotypic screening in model organisms. Finally, it discusses computational methods that can be used throughout the drug discovery process, including for target identification and validation, virtual screening, assessing drug-likeness of compounds, and describing compounds using structural and physicochemical descriptors.
This document discusses various topics relating to protein structure and bioinformatics. It begins with an overview of protein structure and why understanding protein structure is important. It then discusses the different levels of protein structure from primary to quaternary structure. Methods for determining protein structure like X-ray crystallography and NMR are mentioned. Databases for storing protein structures like the Protein Data Bank are also summarized. The document touches on topics like protein folding, domains, membrane protein topology, and secondary structure prediction methods.
This document provides an overview of biological databases and SQL. It discusses different types of data in biological research, including primary data and derived data. It lists several major biological databases and whether they support direct SQL querying. It also shows an example 3-tier model for biological databases. The rationale for learning SQL to query biological databases is described. The document then provides definitions and explanations of key SQL concepts like tables, records, queries, data types, keys, relationships, and normalization. It also covers creating tables, integrity constraints, authorization, and privileges in SQL.
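The table, key, and relationship concepts can be tried out with a small in-memory database. The sketch below uses Python's built-in `sqlite3` module rather than any specific biological database, and the table names, column names, and accession are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database held in memory
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE seq_record (
    record_id   INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL UNIQUE,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    residues    TEXT NOT NULL
);
""")

# Parameterized inserts keep data separate from the SQL text.
conn.execute("INSERT INTO organism VALUES (?, ?)", (1, "Homo sapiens"))
conn.execute("INSERT INTO seq_record VALUES (?, ?, ?, ?)",
             (1, "DEMO0001", 1, "ATGGCC"))

# Join the two tables through the foreign key.
rows = conn.execute("""
    SELECT s.accession, o.name, LENGTH(s.residues)
    FROM seq_record AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
""").fetchall()
print(rows)
```

Storing the organism once and referencing it by key is a small example of normalization: the name is not repeated in every sequence row.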
This document discusses biological databases and bioinformatics. It begins by listing various related fields including biology, computer science, bioinformatics, statistics, and machine learning. It then describes different types of searches that can be performed in biological databases, including annotation searches, homology searches, pattern searches, and predictions. Finally, it mentions that databases can be used for comparisons, such as gene families and phylogenetic trees.
This document outlines an introduction to relational database management systems (RDBMS). It discusses installing MySQL, connecting to databases using the MySQL monitor, and provides an overview of basic SQL commands. The topics covered include creating databases and users, granting privileges, and executing SQL statements to define tables and populate them with data.
The document discusses database searching algorithms like FASTA and BLAST. It explains the mathematical concepts behind BLAST like using Erdos-Renyi theory to model random sequence alignments and calculate the expected length of the longest random match. It also describes the Karlin-Alschul equation used in BLAST to calculate the statistical significance of matches as the expected number of alignments (E) based on the size of the search space and alignment score. The document provides details on parameters and scoring approaches used in database searching algorithms.
The document provides information about various bioinformatics lessons that will take place on Thursdays, including topics like biological databases, sequence alignments, database searching using FASTA and BLAST, phylogenetics, and protein structure. It also includes details about database searching methods like dynamic programming, FASTA, BLAST, and parameters that can be adjusted for BLAST searches.
The document discusses database searching algorithms like FASTA and BLAST. It explains that FASTA uses heuristics to search for exact word matches and join high-scoring regions, while BLAST uses heuristics to compile a neighborhood of high-scoring words and then search for these words in the database to find local alignments faster than dynamic programming. It also discusses parameters that influence the speed and sensitivity of the searches.
The document discusses sequence alignment and database searching techniques. It provides code examples and explanations for the Needleman-Wunsch algorithm, dynamic programming, FASTA, and BLAST. For BLAST, it explains the key steps of breaking sequences into words, searching the database for word matches, and extending matches to find high-scoring pairs and hits between query and database sequences. Parameters like word threshold, word size, extension cutoff, and E-value are also discussed for customizing BLAST searches.
This document summarizes key concepts in sequence alignment including:
1) Sequence alignment involves finding the linear correspondence between symbols in one sequence to another that maximizes similarity. Dynamic programming is commonly used to compute optimal alignments.
2) BLAST is an extremely fast database search tool that uses heuristics like word matching to find local alignments and statistical analysis to assess significance.
3) Multiple sequence alignments make conserved features more apparent but are more difficult to compute than pairwise alignments. Progressive alignment gradually merges pairwise alignments based on a phylogenetic tree.
Presentation for blast algorithm bio-informaticezahid6
Presentation for BLAST algorithm
Publisher Md.Zahid Hasan
Bio-informatics blast is the use of computational tools for the process of acquisition, visualization, analysis and distribution of these datasets obtained by imaging modalities.
This document provides an overview of the BLAST algorithm used for comparing biological sequences and identifying sequence similarities. It describes how BLAST works by generating words from a query sequence and searching a database for exact matches. Significant matches are extended locally to identify Maximal Segment Pairs (MSPs) based on scoring. MSPs are evaluated and ranked using E-values, which estimate the statistical significance of matches. The document also discusses different BLAST programs and provides examples of running BLAST searches and interpreting results. Homework assignments are included applying BLAST to specific sequence analysis tasks.
The document discusses various sequence comparison techniques including pairwise alignment, local alignment, global alignment, and multiple alignment. It describes heuristic methods like FASTA and BLAST that are faster than dynamic programming but may miss optimal alignments. It provides details on the FASTA algorithm including finding identical words, re-scoring, joining segments, and dynamic programming. It also explains the BLAST algorithm and steps including query preprocessing, database scanning, and hit extension. Specialized BLAST databases and tools are also listed.
The document describes the BLAST algorithm for comparing biological sequences. BLAST stands for Basic Local Alignment Search Tool. It allows for fast comparison of a query sequence against large databases. BLAST uses heuristics to find locally similar regions between sequences and scores alignments based on identities without considering gaps. This rapid approximation allows BLAST to be applied to search large databases on common computers, providing a significant improvement over previous algorithms. The document outlines the methods used in BLAST, including compiling high-scoring words from the query, scanning the database for hits, and extending hits to determine significant alignments. It also discusses evaluating the statistical significance of results and how parameters like word length and score thresholds can impact BLAST's speed and accuracy.
This document discusses the BLAST algorithm for comparing biological sequences. It explains that BLAST allows rapid sequence comparison of a query sequence against a database. BLAST is fast, accurate, and accessible online. The document then describes the four main components of a BLAST search: choosing the query sequence, BLAST program, database, and optional parameters. It provides details on how to interpret BLAST search results, including the expect value, and how BLAST works by compiling word pairs from the query and database in three phases of searching and alignment.
This document discusses biological database searching. It begins by defining biological databases as organized collections of persistent biological data that can be queried and retrieved. Examples are provided such as GenBank and SwissProt. Database searching involves using a query sequence to find related sequences in the database based on homology. Parameters like E-values and bit scores are used to define the significance of matches. Popular search algorithms like BLAST and FASTA first use heuristics to find high-scoring regions before applying dynamic programming for alignment.
Spatio-Temporal Pseudo Relevance Feedback for Large-Scale and Heterogeneous S...Komei Sugiura
The document proposes a novel spatio-temporal pseudo relevance feedback approach for scientific data retrieval. It uses Space-Time-Text (STT) information from the top retrieved results to expand the initial query. An evaluation on a test collection showed the STT approach improved recall, average precision, and number of hits compared to baselines, demonstrating its effectiveness especially for datasets with limited textual metadata. The method could potentially be applied to retrieve heterogeneous scientific and sensor data from large repositories.
B.sc biochem i bobi u 3.1 sequence alignmentRai University
This document provides an outline of basic concepts in bioinformatics including sequence alignment, scoring alignments, inserting gaps, dynamic programming, and database searches. It discusses comparing biological sequences to determine similarity and homology for predicting gene/protein function and constructing phylogenies. Scoring matrices like BLOSUM and PAM are described for quantifying sequence similarity. Dynamic programming algorithms like Needleman-Wunsch and Smith-Waterman are summarized for global and local sequence alignment. Database search tools like FASTA and BLAST are introduced for searching sequence databases.
B.sc biochem i bobi u 3.1 sequence alignmentRai University
This document provides an outline of basic concepts in bioinformatics including sequence alignment, scoring alignments, inserting gaps, dynamic programming, and database searches. It discusses comparing biological sequences to determine similarity and homology for predicting gene and protein function, constructing phylogeny, and finding motifs. It describes scoring matrices, gap penalties, global and local alignment, and algorithms for database searches including FASTA and BLAST.
Topic Set Size Design with the Evaluation Measures for Short Text ConversationTetsuya Sakai
This document discusses topic set size design for evaluating short text conversation systems. It proposes using statistical techniques to determine the optimal number of test topics needed to reliably compare multiple systems. Based on sample pilot results, the study recommends a test set size of 100 topics which would allow distinguishing differences between systems while controlling for type I and type II errors. Obtaining evaluation metrics from the official task results could provide more accurate estimates for designing future test collections.
This document discusses FASTA and BLAST algorithms for database searching to find similar sequences to a query. It explains that FASTA uses a "hit and extend" method to search for short identical matches, while BLAST searches for words above a threshold score rather than exact matches. BLAST is generally faster than FASTA and Smith-Waterman as it uses heuristics. The document provides details on how BLAST works including compiling a word list, searching the database for hits, and extending hits into alignments.
The document discusses different types of adversarial search algorithms. It describes min-max algorithm and alpha-beta pruning. Min-max algorithm searches through the game tree recursively to find the optimal move assuming the opponent plays optimally. Alpha-beta pruning improves on min-max by pruning parts of the tree that cannot contain better moves based on the alpha and beta values being passed down the tree.
This document discusses sequence database searching algorithms. It describes the key criteria of sensitivity, selectivity, and speed. Heuristic algorithms like BLAST and FASTA use word searches to rapidly identify similar sequences. BLAST finds ungapped segments while FASTA joins diagonals to produce gapped alignments. Both use substitution matrices and E-values to statistically evaluate results. The document also discusses different BLAST and FASTA variants and output formats.
EUGM 2013 - Dragos Horváth (Labooratoire de Chemoinformatique Univ Strasbourg...ChemAxon
1) The document discusses methods for setting up similarity-driven virtual screening using various molecular similarity metrics and descriptor spaces.
2) It finds that traditional dogmas like only using Tanimoto similarity above 0.85 can be inaccurate, and recommends calibrating similarity cutoffs specifically for each target, query, and chemical space.
3) Tversky similarity with an alpha value of 0.7-0.9, which more heavily penalizes the query missing features of actives, is found to often give excellent results. The best approach is to test multiple options and calibrate for each individual virtual screening project.
Computational Biology, Part 4 Protein Coding Regionsbutest
The document discusses different machine learning approaches for supervised classification and sequence analysis. It describes several classification algorithms like k-nearest neighbors, decision trees, linear discriminants, and support vector machines. It also discusses evaluating classifiers using cross-validation and confusion matrices. For sequence analysis, it covers using position-specific scoring matrices, hidden Markov models, cobbling, and family pairwise search to identify new members of protein families. It compares the performance of these different machine learning methods on sequence analysis tasks.
6. Needleman-Wunsch-edu.pl
The Score Matrix
Seq1 (j) across, Seq2 (i) down
  i  Seq2    *    C    K    H    V    F    C    R
     *       0   -1   -2   -3   -4   -5   -6   -7
  1  C      -1    1a   0   -1   -2   -3   -4   -5
  2  K      -2    0c   2b   1    0   -1   -2   -3
  3  K      -3   -1    1    1    0   -1   -2   -3
  4  C      -4   -2    0    0    0   -1    0   -1
  5  F      -5   -3   -1   -1   -1    1    0   -1
  6  C      -6   -4   -2   -2   -2    0    2    1
  7  K      -7   -5   -3   -3   -3   -1    1    1
  8  C      -8   -6   -4   -4   -4   -2    0    0
  9  V      -9   -7   -5   -5   -3   -3   -1   -1
A: matrix(i,j) = matrix(i-1,j-1) + (MIS)MATCH if (substr(seq1,j-1,1) eq substr(seq2,i-1,1))
B: up_score = matrix(i-1,j) + GAP
C: left_score = matrix(i,j-1) + GAP
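The three rules A, B and C on this slide can be sketched in a few lines of Python (used here instead of the Perl of the slides). The function name nw_matrix is hypothetical, and the scores match = +1, mismatch = -1, gap = -1 are an assumption inferred from the values in the matrix, not something the slide states.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill a Needleman-Wunsch score matrix; rows follow seq2 (i), columns seq1 (j).

    The default scores are assumptions inferred from the slide's matrix.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    m = [[0] * cols for _ in range(rows)]
    # First row and column: leading gaps only.
    for j in range(1, cols):
        m[0][j] = j * gap
    for i in range(1, rows):
        m[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diag = m[i - 1][j - 1] + (match if same else mismatch)  # rule A
            up = m[i - 1][j] + gap                                  # rule B
            left = m[i][j - 1] + gap                                # rule C
            m[i][j] = max(diag, up, left)
    return m


matrix = nw_matrix("CKHVFCR", "CKKCFCKCV")
for row in matrix:
    print(" ".join(f"{v:3d}" for v in row))
```

Under these assumed scores the printed rows agree with the score matrix above, which suggests the slide used the same simple scheme.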
7. Multiple Alignment Method
• The most practical and widely used
method in multiple sequence alignment
is the hierarchical extension of
pairwise alignment methods.
• The principle is that multiple alignment
is achieved by successive application
of pairwise methods.
– First do all pairwise alignments (not just one
sequence with all others)
– Then combine pairwise alignments to generate
overall alignment
8. Database Searching
• Consider the task of searching
SWISS-PROT with a query
sequence:
– say our query sequence is 362
amino acids long
– SWISS-PROT release 38
contains 29,085,265 amino acids
– finding local alignments via
dynamic programming would
entail O(10^10) matrix operations
• Given size of databases, more
efficient methods needed
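A quick check of that estimate, as a Python sketch: a full dynamic programming search fills one matrix cell per (query residue, database residue) pair, so the cell count is the product of the two lengths given on the slide. The variable names are only illustrative.

```python
query_length = 362           # residues in the query (from the slide)
database_size = 29_085_265   # residues in SWISS-PROT release 38 (from the slide)

# One dynamic programming cell per pair of residues.
cells = query_length * database_size
print(f"{cells:,} matrix cells, about {cells:.1e}")
```

The product is 10,528,865,930, which is where the order of 10^10 comes from.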
9. Heuristic approaches to DP for database searching
FASTA (Pearson 1995)
• Uses heuristics to avoid calculating the full dynamic programming matrix
• Speeds up searches by an order of magnitude compared to full Smith-Waterman
• The statistical side of FASTA is still stronger than BLAST
BLAST (Altschul 1990, 1997)
• Uses rapid word lookup methods to completely skip most of the database entries
• Extremely fast
– One order of magnitude faster than FASTA
– Two orders of magnitude faster than Smith-Waterman
• Almost as sensitive as FASTA
10. FASTA
« Hit and extend heuristic»
• Problem: Too many calculations
“wasted” by comparing regions
that have nothing in common
• Initial insight: Regions that are
similar between two sequences
are likely to share short
stretches that are identical
• Basic method: Look for similar
regions only near short
stretches that match exactly
11. FASTA-Stages
1. Find k-tups in the two sequences (k=1,2 for proteins, 4-6 for DNA sequences)
2. Score and select top 10 scoring “local diagonals”
3. Rescan top 10 regions, score with PAM250 (proteins) or DNA scoring matrix. Trim off the ends of the regions to achieve highest scores.
4. Try to join regions with gapped alignments. Join if similarity score is one standard deviation above average expected score
5. After finding the best initial region, FASTA performs a global alignment of a 32 residue wide region centered on the best initial region, and uses the score as the optimized score.
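The first two stages can be illustrated with a minimal Python sketch. It only counts exact k-tup matches per diagonal and keeps the best ones; the real FASTA diagonal score also accounts for the spacing between matches, so treat this as an approximation. The function names and the two example sequences are hypothetical.

```python
from collections import defaultdict


def ktup_diagonals(query, subject, k=2):
    """Count exact k-tup matches per diagonal (subject offset minus query offset)."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    diagonals = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


def top_diagonals(query, subject, k=2, n=10):
    """Return up to n (diagonal, hit count) pairs, best first."""
    counts = ktup_diagonals(query, subject, k)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical example: the subject holds the query with one substitution (L -> W).
hits = top_diagonals("MCGPFILGTYC", "AAMCGPFIWGTYCAA", k=2)
print(hits)
```

All eight shared 2-tups fall on the same diagonal (offset 2), which is the kind of region the later stages would rescan and try to join.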
14. FastA
• Sensitivity: the ability of a
program to identify weak but
biologically significant sequence
similarity.
• Selectivity: the ability of a
program to discriminate between
true matches and matches
occurring by chance alone.
– A decrease in selectivity results in
more false positives being reported.
15. FastA (http://www.ebi.ac.uk/fasta33/)
• Gap opening penalty: -12, -16 by default for fasta with proteins and DNA, respectively
• Gap extension penalty: -2, -4 by default for fasta with proteins and DNA, respectively
• Max number of scores and alignments is 100
• Blosum50 default. Lower PAM, higher blosum to detect close sequences; higher PAM and lower blosum to detect distant sequences
• The larger the word-length the less sensitive, but faster the search will be
16. FastA Output
• Initn, init1, opt, zscore calculated during run
• E score: expectation value, how many hits are expected to be found by chance with such a score while comparing this query to this database
• Database code hyperlinked to the SRS database at EBI
• Accession number
• Description
• Length
• E() does not represent the % similarity
17. FastA is a family of programs
FastA, TFastA, FastX, FastY
Query: DNA, Protein
Database: DNA, Protein
18. FASTA problems
FASTA can miss significant similarity
since
– For proteins, similar sequences do
not have to share identical residues
• Asp-Lys-Val is quite similar to
Glu-Arg-Ile, yet it is missed even with
ktuple size of 1 since no amino acid
matches
• Gly-Asp-Gly-Lys-Gly is quite similar
to Gly-Glu-Gly-Arg-Gly but there is
no match with ktuple size of 2
19. FASTA problems
FASTA can miss significant
similarity since
– For nucleic acids, due to codon
“wobble”, DNA sequences may
look like XXyXXyXXy where X’s
are conserved and y’s are not
• GGuUCuACgAAg and
GGcUCcACaAAa both code for
the same peptide sequence (Gly-Ser-Thr-Lys)
but they don’t match with
ktuple size of 3 or higher
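The examples on these two slides can be checked directly. The Python sketch below lists the k-tuples two sequences have in common; shared_ktups is a hypothetical helper, and the peptides are written in one-letter code (Asp-Lys-Val = DKV, Glu-Arg-Ile = ERI, and so on).

```python
def shared_ktups(a, b, k):
    """Return the set of length-k words that occur in both sequences (case-insensitive)."""
    a, b = a.upper(), b.upper()
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return words_a & words_b


# Protein examples: no identical residue (k=1) or residue pair (k=2) to seed from.
print(shared_ktups("DKV", "ERI", 1))
print(shared_ktups("GDGKG", "GEGRG", 2))

# Codon wobble example: nothing shared at k=3, several words shared at k=2.
print(shared_ktups("GGuUCuACgAAg", "GGcUCcACaAAa", 3))
print(sorted(shared_ktups("GGuUCuACgAAg", "GGcUCcACaAAa", 2)))
```

The first three calls return empty sets, so an exact-match seeding step has nothing to extend from even though the sequences are closely related.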
22. What does BLAST do?
• Search a large target set of sequences...
• …for hits to a query sequence...
• …and return the alignments and scores from those
hits...
• Do it fast.
Show me those sequences that deserve a second look.
BLAST programs were designed for fast database
searching, with minimal sacrifice of sensitivity to
distantly related sequences.
23. The big red button
Do My Job
It is dangerous to hide too much of the
underlying complexity from the scientists.
24. Overview
• Approach: find segment pairs
by first finding word pairs that
score above a threshold, i.e.,
find word pairs of fixed length
w with a score of at least T
• Key concept “Neighborhood”:
Seems similar to FASTA, but
we are searching for words
which score above T rather than
that match exactly
• Calculate neighborhood (T) for
substrings of query (size W)
25. Overview
Compile a list of words which give a score
above T when paired with the query sequence.
– Example using PAM-120 for query sequence ACDE
(w=4, T=17):
  A C D E
  A C D E = +3 +9 +5 +5 = 22
• try all possibilities:
  A A A A = +3 -3 0 0 = 0 no good
  A A A C = +3 -3 0 -7 = -7 no good
• ...too slow, try directed change
26. Overview
  A C D E
  A C D E = +3 +9 +5 +5 = 22
• change 1st pos. to all acceptable substitutions
  g C D E = +1 +9 +5 +5 = 20 ok
  n C D E = +0 +9 +5 +5 = 19 ok
  i C D E = -1 +9 +5 +5 = 18 ok
  k C D E = -2 +9 +5 +5 = 17 ok
• change 2nd pos.: can't - all alternatives negative
and the other three positions only add up to 13
• change 3rd pos. in combination with first position
  gCnE = 1 9 2 5 = 17 ok
• continue - use recursion
• For "best" values of w and T there are typically
about 50 words in the list for every residue in the
query sequence
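One way to read "directed change" with recursion is a search that abandons a partial word as soon as the remaining positions cannot lift it to T. The Python sketch below does that. The scoring function toy_score, its residue groups and the reduced alphabet are hypothetical stand-ins for PAM-120, chosen only to keep the example short.

```python
# Hypothetical toy scoring: +5 identical, +2 same group, -3 otherwise (NOT PAM-120).
GROUPS = [set("DE"), set("AG"), set("IV"), set("KR")]


def toy_score(a, b):
    if a == b:
        return 5
    if any(a in g and b in g for g in GROUPS):
        return 2
    return -3


def neighborhood(word, threshold, alphabet, score):
    """Return {neighbor: score} for all words scoring >= threshold against word."""
    # best_rest[p] = highest score still obtainable from positions p..end
    best_rest = [0] * (len(word) + 1)
    for p in range(len(word) - 1, -1, -1):
        best_rest[p] = best_rest[p + 1] + max(score(word[p], x) for x in alphabet)
    found = {}

    def extend(prefix, partial):
        p = len(prefix)
        if partial + best_rest[p] < threshold:
            return  # no completion of this prefix can reach the threshold
        if p == len(word):
            found[prefix] = partial
            return
        for x in alphabet:
            extend(prefix + x, partial + score(word[p], x))

    extend("", 0)
    return found


neighbors = neighborhood("ADE", 12, "ADEGIKRV", toy_score)
print(sorted(neighbors.items(), key=lambda kv: (-kv[1], kv[0])))
```

With these toy scores only the word itself and its three single conservative substitutions reach the threshold; the pruning test is what avoids enumerating every possible word.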
27. Neighborhood.pl
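# Calculate neighborhood
# Assumes the rest of the script defines: @A (alphabet), %S (scoring matrix
# as a hash of hashes), @W (the three residues of the query word) and
# $T (neighborhood score threshold).
my %NH;
for (my $i = 0; $i < @A; $i++) {
    my $s1 = $S{$W[0]}{$A[$i]};
    for (my $j = 0; $j < @A; $j++) {
        my $s2 = $S{$W[1]}{$A[$j]};
        for (my $k = 0; $k < @A; $k++) {
            my $s3 = $S{$W[2]}{$A[$k]};
            my $score = $s1 + $s2 + $s3;
            my $word = "$A[$i]$A[$j]$A[$k]";
            next if $word =~ /[BZX*]/;    # skip ambiguity codes and the stop symbol
            $NH{$word} = $score if $score >= $T;
        }
    }
}
# Output neighborhood, highest score first, ties in alphabetical order
foreach my $word (sort {$NH{$b} <=> $NH{$a} or $a cmp $b} keys %NH) {
    print "$word $NH{$word}\n";
}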
32. The BLAST algorithm
• Break the search sequence into words
– W = 3 for proteins, W = 12 for DNA
– MCGPFILGTYC gives MCG, CGP, GPF, PFI, FIL,
ILG, LGT, GTY, TYC
• Include in the search all words that score
above a certain value (T) for any search word
– MCG: MCT, MCN, …
– CGP: MGP, CTP, …
• This list can be computed in linear
time
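Breaking the query into overlapping words is a one-line sliding window. A minimal Python sketch, with query_words as a hypothetical helper name:

```python
def query_words(sequence, w=3):
    """Return every overlapping word of length w, in order of position."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


words = query_words("MCGPFILGTYC", w=3)
print(words)
```

A sequence of length n yields n - w + 1 words, here 11 - 3 + 1 = 9, matching the list on the slide.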
33. The Blast Algorithm (2)
• Search for the words in the database
– Word locations can be precomputed and indexed
– Searching for a short string in a long string
• HSP (High Scoring Pair) = A match between
a query word and the database
• Find a “hit”: Two non-overlapping HSP’s on a
diagonal within distance A
• Extend the hit until the score falls below a
threshold value, S
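The lookup and extension steps can be sketched in Python as follows. This is a simplified illustration under several assumptions: only exact query words are looked up (no neighborhood), a single word hit is extended (the two-hit rule is left out), scoring is +1/-1 instead of a substitution matrix, and extension stops once the running score has dropped more than `drop` below the best seen, which is one common way to implement "extend until the score falls off". All function names and example sequences are hypothetical.

```python
from collections import defaultdict


def word_index(subject, w):
    """Precompute the positions of every length-w word in the subject."""
    index = defaultdict(list)
    for s in range(len(subject) - w + 1):
        index[subject[s:s + w]].append(s)
    return index


def extend_one_way(query, subject, qpos, spos, step, match, mismatch, drop):
    """Walk along the diagonal in one direction; return (best gain, residues kept)."""
    gain = best = kept = n = 0
    q, s = qpos + step, spos + step
    while 0 <= q < len(query) and 0 <= s < len(subject):
        gain += match if query[q] == subject[s] else mismatch
        n += 1
        if gain > best:
            best, kept = gain, n
        elif best - gain > drop:
            break  # score has fallen too far below the best seen
        q += step
        s += step
    return best, kept


def extend_hit(query, subject, qpos, spos, w, match=1, mismatch=-1, drop=2):
    """Ungapped extension of a word hit; return (score, query start, query end)."""
    right_gain, right_len = extend_one_way(
        query, subject, qpos + w - 1, spos + w - 1, +1, match, mismatch, drop)
    left_gain, left_len = extend_one_way(
        query, subject, qpos, spos, -1, match, mismatch, drop)
    return w * match + left_gain + right_gain, qpos - left_len, qpos + w + right_len


query, subject, w = "MCGPFILGTYC", "AAMCGPFIWGTYCAA", 3
index = word_index(subject, w)
hits = [(q, s) for q in range(len(query) - w + 1) for s in index.get(query[q:q + w], [])]
best = max(extend_hit(query, subject, q, s, w) for q, s in hits)
print(hits)
print(best)
```

In this example every word hit lies on the same diagonal and extends across the single mismatch to cover the whole query, giving a score of 9 (ten matches, one mismatch).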
35. BLAST parameters
• Lowering the neighborhood word threshold (T)
allows more distantly related sequences to be found,
at the expense of increased noise in the results set.
• Choosing a value for w
– small w: many matches to expand
– big w: many words to be generated
– w=4 is a good compromise
• Lowering the segment extension cutoff (S) returns
longer extensions for each hit.
• Changing the E-value cutoff changes the
threshold for reporting a hit.
36. Critical parameters: T, W and scoring matrix
• The proper value of T depends on both the
values in the scoring matrix and the balance
between speed and sensitivity
• Higher values of T progressively remove
more word hits and reduce the search space.
• A word size (W) of 1 will produce more hits
than a word size of 10. In general, if T is
scaled uniformly with W, smaller word
sizes increase sensitivity and decrease
speed.
• The interplay between W, T and the scoring
matrix is critical, and choosing them wisely
is the most effective way of controlling the
speed and sensitivity of BLAST
38. Database Searching
• How can we find a particular short sequence
in a database of sequences (or one HUGE
sequence)?
• Problem is identical to local sequence
alignment, but on a much larger scale.
• We must also have some idea of the
significance of a database hit.
– Databases always return some kind of hit, how
much attention should be paid to the result?
• How can we determine how “unusual” a
particular alignment score is?
39. Significance
Sentence 1:
“These algorithms are trying to find the best way to match up
two sequences”
Sentence 2:
“This does not mean that they will find anything profound”
ALIGNMENT:
THESEALGRITHMARETR--YINGTFINDTHEBESTWAYTMATCHPTWSEQENCES
:: :.. . ..  ...:   :    ::::..       ::  .  : ...
THISDESNTMEANTHATTHEYWILLFINDAN-------YTHIN-GPRFND------
12 exact matches
14 conservative substitutions
Is this a good alignment?
40. Overview
• A key to the utility of BLAST is
the ability to calculate expected
probabilities of occurrence of
Maximum Segment Pairs
(MSPs) given w and T
• This allows BLAST to rank
matching sequences in order of
“significance” and to cut off
listings at a user-specified
probability
41. Mathematical Basis of BLAST
• Model matches as a sequence of coin tosses
• Let p be the probability of a “head”
– For a “fair” coin, p = 0.5
• (Erdős-Rényi) If there are n throws, then the
expected length R of the longest run of heads is
$R = \log_{1/p}(n)$.
• Example: Suppose n = 20 for a “fair” coin
$R = \log_2(20) \approx 4.32$
• Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
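A quick way to get a feel for the longest-run result is to simulate coin tosses and compare the average longest run of heads with log base 1/p of n. The sketch below is an illustration only; the function names, trial count and seed are assumptions, and the formula is a leading-order estimate, so the simulated mean will be of the same order rather than identical.

```python
import math
import random


def longest_run(tosses):
    """Length of the longest run of truthy values (heads)."""
    best = current = 0
    for t in tosses:
        current = current + 1 if t else 0
        best = max(best, current)
    return best


def expected_longest_run(n, p):
    """Leading-order estimate R = log_{1/p}(n)."""
    return math.log(n) / math.log(1 / p)


def simulate(n, p, trials=1000, seed=0):
    """Mean longest run of heads over repeated experiments."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += longest_run(rng.random() < p for _ in range(n))
    return total / trials


if __name__ == "__main__":
    print("formula  :", round(expected_longest_run(20, 0.5), 2))
    print("simulated:", simulate(20, 0.5))
```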
42. Mathematical Basis of BLAST
• To model random sequence alignments, replace a
match with a “head” and mismatch with a “tail”.
AATCAT
HTHHHT
ATTCAG
• For DNA, the probability of a “head” is 1/4
– What is it for amino acid sequences?
43. Mathematical Basis of BLAST
• So, for one particular alignment, the Erdős-Rényi
property can be applied
• What about for all possible alignments?
– Consider that sequences are being shifted back and
forth, dot matrix plot
• The expected length of the longest match is
$R = \log_{1/p}(mn)$
where m and n are the lengths of the two sequences.
45. Karlin-Altschul Statistics
$E = k\,m\,n\,e^{-\lambda S}$
This equation states that the number of alignments
expected by chance (E) during the sequence
database search is a function of the size of the
search space (m*n), the normalized score (λS)
and a minor constant (k mostly 0.1)
E-Value grows linearly with the product of target and
query sizes. Doubling target set size and doubling
query length have the same effect on e-value
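The scaling behaviour of the Karlin-Altschul equation is easy to check numerically. In the Python sketch below, the values used for λ (`lam = 0.3`) and k (`k = 0.1`) are placeholder assumptions chosen for illustration; real values depend on the scoring matrix and gap costs.

```python
import math


def expected_hits(score, m, n, lam, k):
    """Karlin-Altschul E-value: E = k * m * n * exp(-lam * score)."""
    return k * m * n * math.exp(-lam * score)


if __name__ == "__main__":
    lam, k = 0.3, 0.1          # assumed placeholder parameters
    base = expected_hits(50, 300, 1_000_000, lam, k)
    print("E                :", base)
    print("query length x2  :", expected_hits(50, 600, 1_000_000, lam, k))
    print("database size x2 :", expected_hits(50, 300, 2_000_000, lam, k))
```

Doubling either m or n doubles E, while each additional unit of score reduces E by a factor of exp(λ).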
47. Scoring alignments
• Score: S (~R)
– $S = \sum_i M(q_i, t_i) - \text{gap penalties}$
• Any alignment has a score
• Any two sequences have a(t least one)
optimal alignment
48.
• For a particular scoring matrix and its
associated gap initiation and extension costs
one must calculate λ and k
• Unfortunately (for gapped alignments), you
can’t do this analytically and the values must
be estimated empirically
– The procedure involves aligning random
sequences (Monte Carlo approach) with a specific
scoring scheme and observing the alignment
properties (scores, target frequencies and
lengths)
49. Significance
“Monte Carlo” Approach:
• Compares the observed result to results from randomized
data, much like outcomes generated by a
roulette wheel at Monte Carlo
• Typical procedure for alignments
– Randomize sequence A
– Align to sequence B
– Repeat many times (hundreds)
– Keep track of the optimal score
• Histogram of scores …
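The shuffle-and-realign procedure can be sketched in Python as below. The scoring scheme (match +1, mismatch -1, linear gap -2), the plain Smith-Waterman scorer and the function names are assumptions chosen to keep the example short; they are not the scoring used by BLAST or FASTA.

```python
import random


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def shuffled_scores(a, b, trials=200, seed=0):
    """Scores of b against randomly shuffled copies of a."""
    rng = random.Random(seed)
    letters = list(a)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(local_score("".join(letters), b))
    return scores


def empirical_p(observed, scores):
    """Fraction of shuffled scores >= observed (with +1 correction)."""
    return (1 + sum(s >= observed for s in scores)) / (1 + len(scores))


if __name__ == "__main__":
    a, b = "MCGPFILGTYCAAKLM", "WWMCGPFILGTYCWW"
    null = shuffled_scores(a, b)
    print(local_score(a, b), empirical_p(local_score(a, b), null))
```

The list returned by `shuffled_scores` is what would be plotted as the histogram of scores.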
53. Significance
Normal Distribution does NOT Fit Alignment Scores !!
• In seeking optimal Alignments between two
sequences, one desires those that have the highest
score - i.e. one is seeking a distribution of maxima
• In seeking optimal Matches between an Input
Sequence and Sequence Entries in a Database, one
again desires the matches that have the highest
score, and these are obtained via examination of the
distribution of such scores for the entries in the
database - this is again a distribution of maxima.
“A Normal Distribution is a distribution of Sums of
independent variables rather than a sum of their
Maxima.”
55. Alignment scores follow extreme value distributions
Alignment of unrelated/random sequences results in scores
following an extreme value distribution
$P(x \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$
m, n: sequence lengths.
k, λ: free parameters.
$P = 1 - e^{-E}$
$E = -\ln(1 - P)$
This can be shown analytically for ungapped alignments and has
been found empirically to also hold for gapped alignments under
commonly used conditions.
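The conversion between E and P, P = 1 - exp(-E) and E = -ln(1 - P), can be written as two small Python helpers (the function names are assumptions). The sketch uses `math.expm1` and `math.log1p` so that very small values are not lost to rounding.

```python
import math


def p_from_e(e_value):
    """P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def e_from_p(p_value):
    """E = -ln(1 - P)."""
    return -math.log1p(-p_value)


if __name__ == "__main__":
    for e in (1e-10, 0.01, 1.0, 10.0):
        print(e, p_from_e(e))
```

For small E the two quantities are nearly equal; they diverge once E approaches 1, where P saturates towards 1.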
56. Alignment scores follow extreme value distributions
Alignment algorithms will always produce
alignments, regardless of whether they are meaningful or not
=> important to have a way of selecting significant alignments
from a large set of database hits.
Solution: fit distribution of scores from database search to
extreme value distribution; determine p-value of hit from this
fitted distribution.
Example: scores fitted to
extreme value distribution.
99.9% of this distribution is
located below score=112
=> hit with score = 112 has a
p-value of 0.1%
57. Significance
BLAST uses precomputed extreme
value distributions to calculate E-values from alignment scores
For this reason BLAST only allows
certain combinations of substitution
matrices and gap penalties
This also means that the fit is based on
a different data set than the one you
are working on
A word of caution: BLAST tends to overestimate the significance of its
matches
E-values from BLAST are fine for identifying sure hits
One should be careful using BLAST’s E-values to judge if a marginal hit
can be trusted (e.g., you may want to use E-values of $10^{-4}$ to $10^{-5}$).
58. Determining P-values
• If we can estimate the two parameters of the
extreme value distribution (its location and scale), then we can
determine, for a given match score x, the
probability that a random match with score x
or greater would have occurred in the
database.
• For sequence matches, a scoring system and
database can be parameterized by two
parameters, k and λ, related to that location and scale.
– It would be nice if we could compare hit
significance without regard to the scoring system
used!
59. Bit Scores
• The expected number of hits with score ≥ S
is: $E = K\,m\,n\,e^{-\lambda S}$
– Where m and n are the sequence lengths
• Normalize the raw score using:
$S' = \dfrac{\lambda S - \ln K}{\ln 2}$
• Obtains a “bit score” S’, with a standard set of
units.
• The new E-value is: $E = m\,n\,2^{-S'}$
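The bit-score normalization S' = (λS - ln K) / ln 2 and the resulting E = m n 2^(-S') can be checked numerically against the raw-score form E = K m n exp(-λS). In this Python sketch the parameter values `lam = 0.3` and `k = 0.1` are assumed placeholders, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score: S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_from_bits(bits, m, n):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


def e_from_raw(raw_score, m, n, lam, k):
    """E-value from a raw score: E = K * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


if __name__ == "__main__":
    lam, k = 0.3, 0.1          # assumed placeholder parameters
    bits = bit_score(80, lam, k)
    print(bits, e_from_bits(bits, 250, 5_000_000))
    print(e_from_raw(80, 250, 5_000_000, lam, k))
```

Both routes give the same E-value, which is why bit scores can be compared across scoring systems without knowing λ and K.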
61. FastA Output
• The score distribution is shown as a graph of the
frequency of observed scores
• expected curve (asterisks) according
to the extreme value distribution
–the theoretical curve should be
similar to the observed results
• deviations indicate that the fitting
parameters are wrong
–too weak gap penalties
–compositional biases
64. FastA Output
• A summary of the statistics and of the
program parameters follows the histogram.
– An important number in this summary is the
Kolmogorov-Smirnov statistic, which indicates
how well the actual data fit the theoretical
statistical distribution. The lower this value, the
better the fit, and the more reliable the statistical
estimates.
– In general, a Kolmogorov-Smirnov statistic under
0.1 indicates a good fit with the theoretical model.
If the statistic is higher than 0.2, the statistics may
not be valid, and it is recommended to repeat the
search, using more stringent (more negative)
values for the gap penalty parameters.
65. Statistics summary
• Optimal local alignment scores for pairs of random
amino acid sequences of the same length follow an
extreme-value distribution. For any score S, the
probability of observing a score >= S is given by the
Karlin-Altschul statistic: $P(\text{score} \ge S) = 1 - \exp(-k\,m\,n\,e^{-\lambda S})$
• k and λ are parameters related to the position
of the maximum and the width of the distribution.
• Note the long tail at the right. This means that a
score several standard deviations above the mean
has a higher probability of arising by chance (that is, it
is less significant) than if the scores followed a
normal distribution.
66. P-values
• Many programs report P = the probability that the
alignment is no better than random. The relationship
between Z and P depends on the distribution of the
scores from the control population, which do NOT
follow the normal distribution
– P <= $10^{-100}$ (exact match)
– P in range $10^{-100}$ to $10^{-50}$ (sequences nearly identical, e.g.
alleles or SNPs)
– P in range $10^{-50}$ to $10^{-10}$ (closely related
sequences, homology certain)
– P in range $10^{-5}$ to $10^{-1}$ (usually distant relatives)
– P > $10^{-1}$ (match probably insignificant)
67. E
• For database searches, most programs report E-values. The
E-value of an alignment is the expected number of sequences
that give the same Z-score or better if the database is probed
with a random sequence. E is found by multiplying the value
of P by the size of the database probed. Note that E but not P
depends on the size of the database. Values of P are
between 0 and 1. Values of E are between 0 and the number
of sequences in the database searched:
– E <= 0.02: sequences probably homologous
– E between 0.02 and 1: homology cannot be ruled out
– E > 1: a match this good would be expected just by chance
85. Tips
• Be aware of what options you
have selected when using
BLAST, or FASTA
implementations.
• Treat BLAST searches as
scientific experiments
• So you should try your searches
with the filters on and off to see
whether it makes any difference
to the output
86. Tips: Low-complexity and Gapped Blast Algorithm
• The common Web-based implementations often have
default settings that will affect the outcome
of your searches. By default all NCBI BLAST
implementations filter out biased sequence
composition from your query sequence (e.g.
signal peptide and transmembrane
sequences - beware!).
• The SEG program has been implemented
as part of the blast routine in order to mask
low-complexity regions
• Low-complexity regions are denoted by
strings of Xs in the query sequence
87. Tips
• The sequence databases contain a
wealth of information. They also
contain a lot of errors. Contaminants
…
• Annotation errors, frameshifts that
may result in erroneous conceptual
translations.
• Hypothetical proteins ?
• In the words of Fox Mulder, "Trust
no one."
88. Tips
• Once you get a match to things
in the databases, check whether
the match is to the entire
protein, or to a domain. Don't
immediately assume that a
match means that your protein
carries out the same function
(see above). Compare your
protein and the match protein(s)
along their entire lengths before
making this assumption.
89. Tips
• Domain matches can also cause problems
by hiding other informative matches. For
instance if your protein contains a common
domain you'll get significant matches to
every homologous sequence in the
database. BLAST only reports back a
limited number of matches, ordered by P
value.
• If this list consists only of matches to the
same domain, cut this bit out of your query
sequence and do the BLAST search again
with the edited sequence (e.g. NHR).
90. Tips
• Do controls wherever possible,
especially when you use a given
search program for the first time.
• Suitable positive controls would be protein
sequences known to have distant
homologues in the databases to check
how good the software is at detecting such
matches.
• Negative controls can be employed to
make sure the compositional bias of the
sequence isn't giving you false positives.
Shuffle your query sequence and see what
difference this makes to the matches that
are returned. A real match should be lost
upon shuffling of your sequence.
91. Tips
• Perform Controls
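#!/usr/bin/perl -w
# Shuffle the residues of a FASTA sequence (negative control)
use strict;
my ($def, @seq) = <>;           # first input line is the FASTA header
print $def;
chomp @seq;
@seq = split(//, join("", @seq));
my $count = 0;
while (@seq) {
    my $index = rand(@seq);     # random position; splice truncates it to an integer
    my $base = splice(@seq, $index, 1);
    print $base;
    print "\n" if ++$count % 60 == 0;   # wrap output at 60 residues per line
}
print "\n" unless $count % 60 == 0;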
92. Tips
• Read the footer first
• View results graphically
• Parse Blasts with Bioperl
93. FastA vs. Blast
• BLAST's major advantage is its speed.
– 2-3 minutes for BLAST versus several hours
for a sensitive FastA search of the whole of
GenBank.
• When both programs use their default
settings, BLAST is usually more sensitive
than FastA for detecting protein sequence
similarity.
– Since it doesn't require a perfect sequence
match in the first stage of the search.
94. FastA vs. Blast
Weakness of BLAST:
– The long word size it uses in the initial stage of DNA
sequence similarity searches was chosen for speed, and not
sensitivity.
– For a thorough DNA similarity search, FastA is the
program of choice, especially when run with a lowered
KTup value.
– FastA is also better suited to the specialised task of
detecting genomic DNA regions using a cDNA query
sequence, because it allows the use of a gap extension
penalty of 0. BLAST, which only creates ungapped
alignments, will usually detect only the longest exon, or fail
altogether.
• In general, a BLAST search using the default
parameters should be the first step in a database
similarity search strategy. In many cases, this is all
that may be required to yield all the information
needed, in a very short time.
96. PSI-Blast
1. Old (ungapped) BLAST
2. New BLAST (allows gaps)
3. Profile -> PSI Blast - Position Specific
Iterated
Strategy: Multiple alignment of the hits
Calculates a position-specific score matrix
Searches with this matrix
In many cases is much more sensitive to weak but
biologically relevant sequence similarities
PSSM !!!
97. PSI-Blast
• Patterns of conservation from the alignment of
related sequences can aid the recognition of
distant similarities.
– These patterns have been variously called motifs,
profiles, position-specific score matrices, and
Hidden Markov Models.
For each position in the derived pattern, every
amino acid is assigned a score.
(1) Highly conserved residue at a position: that
residue is assigned a high positive score, and
others are assigned high negative scores.
(2) Weakly conserved positions: all residues receive
scores near zero.
(3) Position-specific scores can also be assigned to
potential insertions and deletions.
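How a position-specific score matrix turns conservation into scores can be illustrated with a small Python sketch. It makes several simplifying assumptions that differ from PSI-BLAST itself: a uniform background distribution, a flat pseudocount of 1, no sequence weighting and no position-specific gap scores. The toy alignment and the function names are hypothetical.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, background=None, pseudocount=1.0):
    """Log-odds score (in bits) for every residue at every column."""
    if background is None:
        # Assumption: uniform background frequencies
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        row = {}
        for a in alphabet:
            freq = (counts[a] + pseudocount) / total
            row[a] = math.log2(freq / background[a])
        pssm.append(row)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(row[c] for row, c in zip(pssm, segment))


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "ACD", "AGE"]   # hypothetical data
    pssm = build_pssm(toy_alignment, "ACDEG")
    for row in pssm:
        print({a: round(s, 2) for a, s in row.items()})
```

In the toy alignment the first column is fully conserved, so A receives a clearly positive score there and every other residue a negative one; the third column is split between D and E, which both score moderately.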
98. Pattern
• a set of alternative
sequences, using
“regular expressions”
• Prosite
(http://www.expasy.org/prosite/)
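PROSITE patterns map closely onto regular expressions. The Python sketch below converts the core of the PROSITE syntax (`x` for any residue, `[..]` for alternatives, `{..}` for exclusions, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) and applies it to the N-glycosylation pattern N-{P}-[ST]-{P}. The function name is an assumption, and rarer constructs such as `>` inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string."""
    pattern = pattern.rstrip(".")               # patterns end with a period
    parts = []
    for element in pattern.split("-"):          # elements are dash-separated
        element = element.replace("{", "[^").replace("}", "]")  # exclusions
        element = element.replace("(", "{").replace(")", "}")   # repeats
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # termini
        parts.append(element)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("N-{P}-[ST]-{P}.")
    print(regex)
    for m in re.finditer(regex, "MKNGSALLNPTQ"):
        print(m.start(), m.group())
```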
102. PSI-Blast
• The power of profile methods can be
further enhanced through iteration of
the search procedure.
– After a profile is run against a database,
new similar sequences can be detected. A
new multiple alignment, which includes
these sequences, can be constructed, a
new profile abstracted, and a new
database search performed.
– The procedure can be iterated as often as
desired or until convergence, when no new
statistically significant sequences are
detected.
103. PSI-Blast
(1) PSI-BLAST takes as an input a single protein sequence
and compares it to a protein database, using the gapped
BLAST program.
(2) The program constructs a multiple alignment, and then a
profile, from any significant local alignments found.
The original query sequence serves as a template for the multiple
alignment and profile, whose lengths are identical to that of the
query. Different numbers of sequences can be aligned in different
template positions.
(3) The profile is compared to the protein database, again
seeking local alignments using the BLAST algorithm.
(4) PSI-BLAST estimates the statistical significance of the local
alignments found.
Because profile substitution scores are constructed to a fixed
scale, and gap scores remain independent of position, the
statistical theory and parameters for gapped BLAST alignments
remain applicable to profile alignments.
(5) Finally, PSI-BLAST iterates, by returning to step (2), a
specified number of times or until convergence.
109. PSI-BLAST pitfalls
• Avoid sequences that are too close: overfitting!
• Can include false homologues! Therefore check
the matches carefully: include or exclude
sequences based on biological knowledge.
• The E-value reflects the significance of the
match to the previous training set, not to the
original sequence!
• Choose your query sequence carefully.
• Try the reverse experiment to confirm.
110. Reduce overfitting risk by Cobbler
• A single sequence is selected
from a set of blocks and enriched
by replacing the conserved
regions delineated by the blocks
by consensus residues derived
from the blocks.
• Embedding consensus residues
improves performance
• S. Henikoff and J.G. Henikoff;
Protein Science (1997) 6:698-705.
122. BLAT method
• Align sequence with BLAT, get alignment
info
• Per BLAT hit, pick up additional info from
connected databases:
– mRNAs
– ESTs
– RepeatMasker
– CpG Islands
– RefSeq Genes
124. Weblems
W5.1: Submit the amino acid sequence of papaya
papain to a BLAST (gapped and ungapped) and to a
PSI-BLAST search. What are the main differences in
the results?
W5.2: Is there a relationship between Klebsiella
aerogenes urease, Pseudomonas diminuta
phosphotriesterase and mouse adenosine deaminase?
Also use DALI, ClustalW and T-coffee.
W5.3: Yeast two-hybrid typically yields DNA
sequences. How would you find the corresponding
protein?
W5.4: When and why would you use tblastn?
W5.5: How would you search a database if you want to
restrict the search space to those entries having a
secretion signal consisting of 4 consecutive (N-terminal) basic residues?