The document discusses various bioinformatics concepts and tools including sequence alignment, BLAST, substitution matrices, and open reading frames. Sequence alignment compares sequences to find similar regions and can be local or global. BLAST is a tool that finds similar sequences in a database by searching for exact and near-exact word matches. Substitution matrices like BLOSUM and PAM assign scores to amino acid substitutions observed in protein evolution. Open reading frames are stretches of a nucleic acid sequence, read in one of the possible reading frames (three per strand), that can be translated into protein without hitting a stop codon.
This document discusses the Basic Local Alignment Search Tool (BLAST), which allows users to compare a query DNA or protein sequence against sequence databases to find regions of local similarity. BLAST breaks the query into short words that are then searched for in database sequences. When words are found in common, BLAST extends the alignment in both directions to find higher-scoring matches. BLAST outputs include a graphical display of alignments, a hit list ranking matches by similarity score, and detailed alignments. BLAST has many applications, such as identifying species, establishing evolutionary relationships, DNA mapping, and locating protein domains.
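The seed-and-extend idea described above can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: real BLAST also scores neighborhood words with a substitution matrix and performs gapped extension. The function names, word size, match/mismatch scores and drop-off threshold are all assumptions chosen for the example.

```python
from collections import defaultdict


def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


def extend_ungapped(query, subject, i, j, w=3, match=1, mismatch=-1, xdrop=2):
    """Extend a seed in both directions without gaps.

    Extension in a direction stops when the running score falls more than
    `xdrop` below the best score seen so far (hypothetical threshold).
    Returns (query_start, query_end, subject_start, best_score).
    """
    score = best = w * match
    right = 0
    k = 0
    # Extend to the right of the seed.
    while i + w + k < len(query) and j + w + k < len(subject):
        score += match if query[i + w + k] == subject[j + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > xdrop:
            break
    score = best
    left = 0
    k = 0
    # Extend to the left of the seed.
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        score += match if query[i - 1 - k] == subject[j - 1 - k] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > xdrop:
            break
    return (i - left, i + w + right, j - left, best)
```

For example, `find_seeds("GATTACA", "CCGATTACATT", w=4)` lists every shared 4-letter word, and `extend_ungapped` grows one of those seeds into a longer ungapped match.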
NCBI; Introduction, Homepage and about
Tools and database of NCBI
BLAST; Introduction, Homepage and types of BLAST
Some databases of NCBI
References
Acknowledgements
Electroporation is a method to transform cells by creating transient pores in the cell membrane through applying brief high-voltage electric pulses, allowing DNA to enter the cell. It involves suspending cells in a solution with DNA between electrodes and applying pulses of 4000-8000 V/cm for milliseconds. This forms pores in the membrane through which DNA can enter. It is commonly used to transform bacteria, yeast, plant protoplasts, and transfect eukaryotic cells. Key factors influencing electroporation include field strength, pulse length, DNA purity and concentration, and cell growth conditions.
In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
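A minimal sketch of reading this format, assuming the common convention that each record starts with a header line beginning with `>` and that the first whitespace-delimited token of the header is the sequence name. The function name `parse_fasta` is illustrative, not part of any library.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: name is the first token, the rest is a comment.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines.
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}
```

Sequence lines that appear before any header are ignored in this sketch; a stricter parser might raise an error instead.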
The document discusses BLAST (Basic Local Alignment Search Tool), an algorithm used to compare a query DNA or protein sequence against a database of sequences. BLAST works by identifying exact or approximate matches between words of 3-11 letters in the query and database sequences. Matches are extended to find local alignments with high scores. Significant alignments are identified based on their score and the expected number of matches by chance (E-value). The document provides examples of how BLAST finds local alignments and calculates E-values. It also describes different BLAST programs and suggestions for using BLAST.
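The E-value mentioned above is commonly computed with the Karlin-Altschul relation E = K·m·n·e^(−λS), where m and n are the query and database lengths and S is the raw alignment score. The sketch below assumes that form; K and λ depend on the scoring system, so they are passed in as parameters, and any values used with these functions are placeholders rather than the ones BLAST reports.

```python
import math


def expect_value(raw_score, query_len, db_len, K, lam):
    """Expected number of chance alignments scoring at least raw_score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score, so that E = query_len * db_len * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(K)) / math.log(2)
```

Higher scores give smaller E-values, and a larger database gives a larger E-value for the same score, which is why the same alignment can be significant in a small database and not in a large one.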
The document provides an overview of the history and scope of bioinformatics. It discusses how bioinformatics emerged from the fields of computer science and biology. The history section outlines major developments from Mendel's work in 1865 to the sequencing of the human genome in 2001. Bioinformatics has various applications in areas like drug development, personalized medicine, and biotechnology. It also has significant scope in India, with growing job opportunities in both the public and private sectors.
Protein databases can contain either sequence or structure information. Some key protein sequence databases include PIR, Swiss-Prot, and TrEMBL. PIR classifies entries by annotation level, Swiss-Prot aims to provide high annotation levels and interlink information, and TrEMBL contains all coding sequences with some entries eventually incorporated into Swiss-Prot. Important structure databases are PDB, which contains 3D protein structures, and SCOP and CATH, which classify evolutionary and structural relationships between protein domains.
The Protein Data Bank (PDB) is an open database that archives 3D structural data of biological macromolecules. It was established in 1971 and currently holds over 150,000 structures determined by X-ray crystallography or NMR spectroscopy. The PDB is overseen by the Worldwide Protein Data Bank and freely accessible online. It serves as a key resource for structural biology and many other databases rely on protein structures deposited in the PDB.
FASTA is a bioinformatics tool used to compare amino acid sequences of proteins or nucleotide sequences of DNA. It was first described in 1985 by Lipman and Pearson. FASTA performs fast homology searches to find similarities between a query sequence and sequences in a database. It is similar to BLAST; BLAST is generally the faster of the two, while FASTA can be more sensitive for some searches. It works by identifying patches of sequence similarity that may contain gaps. Some key FASTA programs include FASTA, TFASTA, FASTS, and FASTX/Y. FASTA is useful for applications like identification of species, establishing phylogeny, DNA mapping, and understanding protein function.
This document provides an overview of the FASTA software. FASTA is a program used by biologists to study and analyze DNA and protein sequences. It uses a simple text-based format to present sequences and allows for the naming of sequences and inclusion of comments. FASTA is a rapid program that can be used locally or through email servers to find regional similarities between sequences and identify potential matches, trading some sensitivity for speed. It has become a standard tool in biology for comparing and analyzing protein and DNA sequences.
Transgenesis involves introducing foreign DNA into an animal's genome. This allows for the production of transgenic animals that exhibit new traits. Common methods for creating transgenic animals include pronuclear microinjection, embryonic stem cell manipulation, and retrovirus-mediated gene transfer. Examples of transgenic animals include glowing fish, disease models like Alzheimer's mice, and farm animals engineered for increased wool/milk. While transgenic technology has benefits for research, agriculture, and medicine, it also carries some risks that require further study.
This document discusses sequence alignment methods. It describes global and local alignment, and algorithms used for alignment including dot matrix analysis, dynamic programming, and word/k-tuple methods as implemented in FASTA and BLAST programs. BLAST and FASTA are described as popular tools for sequence database searches that use heuristic methods and word matching to quickly identify regions of local similarity.
Knockout mice are mice that have had a specific gene inactivated through replacement or disruption with artificial DNA. This allows researchers to study the function of that gene. The technique was awarded the 2007 Nobel Prize in Physiology or Medicine. The procedure involves isolating the target gene, engineering a modified DNA sequence, introducing this into embryonic stem cells, and implanting the modified stem cells into mouse blastocysts. This generates chimeric mice that can pass the modified gene to offspring. Knockout mice provide insights into gene function in humans and are used as models for diseases. They also enable drug and therapy testing, though some genes cause developmental issues if knocked out.
Biological databases store and organize biological data and information. There are two main types - primary databases that contain original experimental data that cannot be changed, and secondary databases that contain derived data analyzed from primary sources. Examples of primary databases include GenBank for DNA sequences and SWISS-PROT for protein sequences. Secondary databases include PROSITE for protein families and domains, and Pfam for protein family alignments. Biological databases allow sharing of genomic and protein information worldwide and provide a foundation for research.
Relaxed plasmids & Regulation of copy numbersalvia16
Plasmids are classified as either stringent or relaxed based on their copy number in bacterial cells. Stringent plasmids exist in low copy numbers (<100 copies/cell) and rely on the bacterial genome for replication and segregation. Relaxed plasmids exist in high copy numbers (>100 copies/cell) and replicate independently of the bacterial genome. Relaxed plasmids include ColE1, which uses an antisense RNA mechanism to regulate its copy number based on plasmid concentration in the cell.
S.Prasanth Kumar is a bioinformatician who studies proteomics, 2D-PAGE, and proteome databases. Proteomics involves the study of proteins expressed by a genome through analysis of protein sequences, structures, modifications, and interactions. Major databases include Swiss-Prot, which contains annotated protein sequences, and TrEMBL, which contains automatically generated sequences. Other databases contain information on protein families and domains, nucleotide sequences, 2D-PAGE gel images, and post-translational modifications.
The document discusses codon-anticodon interactions and the wobble hypothesis. It explains that tRNAs have anticodons that interact with mRNA codons to add the correct amino acid to the growing polypeptide chain. The wobble hypothesis proposes that the third position of codons and first position of anticodons allow some variability, or "wobbling", in base pairing through interactions with inosine and other modified bases. This wobbling allows a single tRNA to bind to multiple codons, resolving the redundancy in the genetic code and allowing fewer tRNAs than codons.
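The wobble rules can be made concrete with a small lookup. The sketch below assumes Crick's classic pairing rules for the first (5') anticodon base: C pairs with G, A with U, U with A or G, G with C or U, and inosine (I) with U, C or A. The function and table names are illustrative.

```python
# Standard Watson-Crick complements for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# Bases allowed at the third codon position for each first anticodon base.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}


def codons_for_anticodon(anticodon):
    """Return the mRNA codons (5'->3') that an anticodon (5'->3') can read.

    The anticodon pairs antiparallel to the codon, so its last base pairs
    with the first codon base and its first base sits at the wobble position.
    Inosine is only handled at the wobble position in this sketch.
    """
    first, middle, last = anticodon
    return [COMPLEMENT[last] + COMPLEMENT[middle] + third for third in WOBBLE[first]]
```

For instance, the anticodon `IGC` reads the three alanine codons GCU, GCC and GCA, which shows how one tRNA can serve several codons.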
Microinjection is a gene transfer technique where DNA is directly injected into cells using a fine glass micropipette. It is highly efficient at the individual cell level and was originally used for transfecting hard-to-transfect cells. The procedure involves holding a cell using one pipette while another pipette is used to inject DNA into the cell's cytoplasm or nucleus. It allows for stable transfection efficiencies of around 20% and is used to generate transgenic animals by injecting DNA into oocytes, eggs or embryos. However, it is time-consuming and can only be done for a small number of cells.
This document discusses dot plot analysis, which allows comparison of two biological sequences to identify similar regions. It describes how dot plots are generated using a similarity matrix and defines different features that can be observed, such as identical sequences appearing on the principal diagonal, direct and inverted repeats appearing as multiple diagonals, and low complexity regions forming boxes. Applications of dot plot analysis include identifying alignments, self-base pairing, sequence transposition, and gene locations between genomes. Limitations include high memory needs for long sequences and low efficiency for global alignments.
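A dot plot in its simplest form can be sketched as below, marking every cell where the two residues are identical. This is an assumption-laden minimal version: practical tools usually apply a sliding window and a score threshold to suppress noise. The function name is illustrative.

```python
def dot_plot(seq_x, seq_y):
    """Return the dot plot as a list of strings, one row per residue of seq_y.

    A '*' marks a position where the residue of seq_x (columns) equals the
    residue of seq_y (rows); '.' marks a mismatch.
    """
    return ["".join("*" if a == b else "." for a in seq_x) for b in seq_y]
```

Comparing a sequence with itself, as in `dot_plot("ACG", "ACG")`, puts the identical residues on the principal diagonal; repeats would show up as additional parallel diagonals.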
The Protein Data Bank (PDB) is a database for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. This presentation covers the what, why, how, where and who of the PDB. It also briefly describes the file formats available in the PDB, with emphasis on the PDB file format.
FASTA is a sequence alignment tool that was developed before BLAST. It uses a hashing strategy to find matches between k-tuples, or short stretches of identical residues, in query and target sequences. FASTA breaks sequences down into k-tuples and searches target databases to find similarities. While faster than dynamic programming, FASTA and BLAST may not find optimal alignments or true homologs.
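The hashing step can be sketched as follows: index the k-tuples of the query, look up each k-tuple of the target, and count hits per diagonal (target position minus query position). Diagonals with many hits are candidate regions of similarity. This is only the first stage of a FASTA-like search under simplifying assumptions; rescoring and joining of diagonals are omitted, and the function name is illustrative.

```python
from collections import Counter, defaultdict


def diagonal_hits(query, target, k=2):
    """Count exact k-tuple matches on each diagonal (target_pos - query_pos)."""
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], []):
            hits[j - i] += 1
    return hits
```

Calling `diagonal_hits(query, target).most_common(1)` returns the diagonal with the most k-tuple matches, which is where a later alignment step would focus.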
This document discusses biological databases and nucleic acid sequence databases. It describes the three primary nucleotide sequence databases: GenBank, EMBL, and DDBJ. GenBank is hosted by the National Center for Biotechnology Information and contains over 286 million bases and 352,000 sequences. EMBL is hosted by the European Molecular Biology Laboratory and mirrors data daily with GenBank and DDBJ. DDBJ is the DNA Data Bank of Japan and also mirrors data daily with the other two databases. Biological databases are important tools for scientists to understand biology at multiple levels.
This document discusses various enzymes used for genetic engineering and DNA manipulation. It describes restriction endonucleases and DNA ligase which cut and join DNA fragments. It also discusses other DNA modifying enzymes like nucleases which degrade DNA, and polymerases which synthesize DNA copies. Specific enzymes covered in detail include DNA polymerase I, T4 DNA polymerase, T7 DNA polymerase, terminal transferase, T4 DNA ligase, and T4 RNA ligase.
Multiple sequence alignment (MSA) aligns three or more biological sequences, like proteins or nucleic acids, to infer homology and evolutionary relationships. There are three main methods - dynamic programming computes an optimal alignment but has high runtime; progressive alignment first does pairwise alignments and adds sequences; iterative alignment successively improves approximations without pairwise alignments. Popular tools for MSA include Clustal W, MAFFT, MUSCLE, and T-Coffee. MSA helps detect similarities, conserved motifs, and structural homologies between sequences.
This presentation gives detailed information about the Swiss-Prot database, which is part of UniProtKB. It also covers TrEMBL, a computer-annotated supplement to Swiss-Prot.
PSLDoc: Protein subcellular localization prediction based on gapped-dipeptide...JIA-MING CHANG
Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for PSL prediction based on protein sequences have been proposed in recent years for Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA) to solve this problem. A protein is considered as a term string composed by gapped-dipeptides, which are defined as any two residues separated by one or more positions. The weighting scheme of gapped-dipeptides is calculated according to a position specific score matrix, which includes sequence evolutionary information. Then, PLSA is applied for feature reduction, and reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. It has been reported that there is a strong correlation between sequence homology and subcellular localization (Nair and Rost, Protein Sci 2002;11:2836–2847; Yu et al., Proteins 2006;64:643–651). To properly evaluate the performance of PSLDoc, a target protein can be classified into low- or high-homology data sets. PSLDoc's overall accuracy of low- and high-homology data sets reaches 86.84% and 98.21%, respectively, and it compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643–651). In addition, we set a confidence threshold to achieve a high precision at specified levels of recall rates. When the confidence threshold is set at 0.7, PSLDoc achieves 97.89% in precision which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617–623). Our approach demonstrates that the specific feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. 
Besides, because of the generality of the representation, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/∼bioapp/PSLDoc/.
The document provides an overview of sequence alignment concepts including:
- Definitions of terms like identity, homology, orthologous, and paralogous genes
- Examples and explanations of scoring matrices used for nucleotide and protein sequence alignments like BLOSUM and PAM matrices
- An example multiple sequence alignment of glyceraldehyde-3-phosphate dehydrogenases from different species
- Descriptions of how scoring matrices are used to quantify sequence similarity and their importance in sequence analysis
2016.09.28
TOPIC REVIEW
• Exam
• PS2 Sequence Alignment
• Command Line Blast
• PS1 Molecular Biology
• Personal Microbiome Project
CURRENTLY
LET’S NEGOTIATE
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam (1) - 20%
• Research project - 45%
• Participation - 5%
OR
• Problem sets (4) - 10%
• Microbiome project - 20%
• Exam 1 - 15%
• Exam 2 - 15%
• Research project - 35%
• Participation - 5%
PS2 SEQUENCE ALIGNMENT
RefSeqs, protein (experimentally supported)
On chromosome 17
Reverse strand
PRCD Progressive rod-cone degeneration
PS2: GLOBAL ALIGNMENT
BLOSUM62
• substitutions less penalized and are
preferred to gaps. There is also a
decrease in the level of identity.
BLOSUM80
• Substitutions more penalized and
gaps are favored.
PAM60
• Substitutions more penalized and gaps
are favored.
PAM250
• substitutions less penalized and are
preferred to gaps. There is also a
decrease in the level of identity.
PS2: LOCAL ALIGNMENT
SEQ1: A L S C V W M I P (9 residues)
SEQ2: A I S C M I P T (8 residues)
Create matrix: (length of seq1 + 1) x (length of seq2 + 1) = 10 x 9
          A    L    S    C    V    W    M    I    P
     0   -2   -4   -6   -8  -10  -12  -14  -16  -18
A   -2
I   -4
S   -6
C   -8
M  -10
I  -12
P  -14
T  -16
Exercise: fill the scores of the alignment matrix
using the BLOSUM62 substitution matrix.
Gap opening penalty: -5
Gap extension penalty: -1
Sequence 1 (columns): S V E T D
Sequence 2 (rows): T S I N Q E T
Ala A 4
Arg R -1 5
Asn N -2 0 6
Asp D -2 -2 1 6
Cys C 0 -3 -3 -3 9
Gln Q -1 1 0 0 -3 5
Glu E -1 0 0 2 -4 2 5
Gly G 0 -2 0 -1 -3 -2 -2 6
His H -2 0 1 -1 -3 0 0 -2 8
Ile I -1 -3 -3 -3 -1 -3 -3 -4 -3 4
Leu L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
Lys K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
Met M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
Phe F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
Pro P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
Ser S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
Thr T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
Trp W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Tyr Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
Val V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
A R N D C Q E G H I L K M F P S T W Y V
Dynamic programming - global alignment
BLOSUM62
GAP COST: -2
At each cell, 3 scores are calculated:
• match score = diagonal cell score +
score from the substitution matrix.
• Vertical gap score = upper neighbor
+ gap cost
• Horizontal gap score = left neighbor
+ gap cost
• The highest score is retained and
the arrow is labelled
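The three-score rule above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the function name `global_alignment_matrix` and the scorer `toy_score` are hypothetical, and `toy_score` uses a simple match/mismatch rule as a stand-in for a BLOSUM62 lookup. The linear gap cost of -2 follows the slide.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a BLOSUM62 lookup: +1 match, -1 mismatch."""
    return 1 if a == b else -1


def global_alignment_matrix(seq1, seq2, score=toy_score, gap=-2):
    """Fill a global-alignment score matrix with a linear gap cost.

    seq1 runs along the columns and seq2 along the rows, as on the slide.
    """
    rows, cols = len(seq2) + 1, len(seq1) + 1
    F = [[0] * cols for _ in range(rows)]
    # First row and first column: leading gaps only.
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            match = F[i - 1][j - 1] + score(seq1[j - 1], seq2[i - 1])  # diagonal
            vertical = F[i - 1][j] + gap                               # upper neighbor
            horizontal = F[i][j - 1] + gap                             # left neighbor
            F[i][j] = max(match, vertical, horizontal)
    return F


F = global_alignment_matrix("ALSCVWMIP", "AISCMIPT")
print(F[0])        # first row of the matrix
print(F[-1][-1])   # score of the best global alignment under toy_score
```

To reproduce the slide exercise, `toy_score` would be replaced by a function that looks up each residue pair in the BLOSUM62 table. Recording which of the three candidates won in each cell gives the arrows needed for traceback.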
The document appears to be the output from running a BLAST search on a query sequence against a yeast genome database. The top hit from the BLAST search is a sequence on Saccharomyces cerevisiae chromosome XI with 85% identity over 88 amino acids and an E-value of 4e-12. Additional significant hits are also reported. The bottom of the document discusses retrieving the genomic sequence for the top hit and generating a dotplot comparison between the query and hit sequences.
1) This document introduces methods for detecting sequence similarity, which is a fundamental analysis in bioinformatics.
2) It describes how to search databases for similar sequences using BLAST or FASTA, and how to compare two sequences using dynamic programming algorithms like Needleman-Wunsch or Smith-Waterman.
3) Substitution matrices like BLOSUM62 are used to score alignments and measure sequence similarity based on amino acid properties.
This document provides an overview of databases, definitions, scoring matrices, and pairwise sequence alignment. It discusses major bioinformatics databases like NCBI, ExPASy, and EBI. It also defines key terms like identity, homology, orthologous, and paralogous sequences. Additionally, it examines the theoretical and empirical bases for scoring matrices like PAM, BLOSUM, and transition/transversion matrices, and how they are used in sequence alignment.
This document provides an overview of sequence alignment and scoring matrices. It defines key terms like identity, homology, orthologous, and paralogous genes. It discusses different types of scoring matrices, including unitary matrices that score matches as 1 and mismatches as 0, and transition/transversion matrices that account for the different likelihood of transition vs. transversion mutations in DNA. It explains that scoring matrices represent implicit models of evolution and influence sequence analysis outcomes. The document emphasizes that results depend critically on the chosen scoring matrix and model.
The document discusses bioinformatics and computational biology. It describes a lab with over 100 people from diverse backgrounds, including engineers, scientists, technicians, geneticists and clinicians. The lab applies information technology to analyze biological data, focusing on areas like sequence analysis, molecular modeling, phylogeny, medical applications, statistics and more. Specific applications mentioned include analyzing genomes to study genetic diseases and drug design, as well as using the same techniques in agriculture and animal health.
This document provides an overview of sequence alignment and scoring matrices. It defines key terms like identity, homology, orthologous, and paralogous genes. It discusses different types of scoring matrices including unitary matrices that score matches as 1 and mismatches as 0, and transition/transversion matrices that account for the higher likelihood of transitional mutations in nucleic acids. The document emphasizes that scoring matrices represent underlying evolutionary models and influence sequence analysis outcomes.
Selection of the optimal parameters for machine learning tasks is challenging. Some results may be bad not because the data is noisy or the used learning algorithm is weak, but due to the bad selection of the parameters values. This presentation gives a brief introduction about evolutionary algorithms (EAs) and describes genetic algorithm (GA) which is one of the simplest random-based EAs. A step-by-step example is given in addition to its implementation in Python 3.5.
---------------------------------
Read more about GA:
Yu, Xinjie, and Mitsuo Gen. Introduction to evolutionary algorithms. Springer Science & Business Media, 2010.
https://www.kdnuggets.com/2018/03/introduction-optimization-with-genetic-algorithm.html
https://www.linkedin.com/pulse/introduction-optimization-genetic-algorithm-ahmed-gad
Global and local alignment (bioinformatics) - Pritom Chaki
A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context.
This document discusses sequence alignment and contains four sections:
1) Global alignment which finds the highest scoring alignment between entire sequences using dynamic programming.
2) Scoring matrices which generalize alignment scoring by assigning scores to individual character matches/mismatches based on biological evidence.
3) Local alignment which finds the best scoring alignment between substrings of sequences to identify conserved regions, as global alignment may miss these.
4) Ways to solve the local alignment problem efficiently in quadratic time instead of quartic time by computing alignments from each vertex in the grid.
Global and local alignment in Bioinformatics - Mahmudul Alam
1. Global alignment finds the optimal alignment over the entire sequence length trying to match as many elements as possible, while local alignment finds the region of highest similarity between two sequences that may be of different lengths.
2. The Needleman-Wunsch algorithm is commonly used for global alignment using dynamic programming to find the optimal full sequence alignment with linear gap costs.
3. The Smith-Waterman algorithm is used for local sequence alignment to identify similar regions by calculating similarity scores and only retaining alignments with scores higher than a threshold.
The document analyzes the distribution of rRNA introns within the three-dimensional structures of the 30S and 50S ribosomal subunits. It finds that most intron insertion sites occur near conserved residues that form tRNA binding sites and the subunit interface, even though many positions are not accessible to solvent in the mature ribosome. This correlation between intron locations and functionally important residues suggests an association between intron evolution and rRNA function. Over 1200 introns have been identified within rRNA genes at over 150 unique sites, with the majority belonging to group I, group II, archaeal or spliceosomal intron types.
The document discusses pairwise sequence alignment methods. It defines key concepts like homology and orthology. It explains that dynamic programming is used to find optimal alignments through building a score matrix and backtracking. Global alignment finds the best match over full sequences while local alignment identifies regions of local similarity. Scoring systems like PAM matrices assign values based on substitutions and penalties for gaps.
The document discusses sequence similarity searching and comparison. It describes how programs like BLAST and FASTA are used to rapidly identify similarities between sequences and determine evolutionary relationships. BLAST and FASTA utilize word matching and heuristics to efficiently search large databases and return local or global alignments with scoring of matches. They provide a powerful method for functions prediction by comparing new sequences to known genes and proteins.
This document discusses using stochastic models to understand principles of gene regulation from regulatory DNA architecture. It summarizes that regulating promoters downstream of irreversible assembly steps reduces molecular noise compared to regulating initiation rates. Distributed binding sites across enhancers also helps reduce expression noise. The document proposes relating complex biochemical architectures of promoters and enhancers to their transcriptional properties using finite Markov chain approaches.
This document discusses various methods for aligning and comparing biological sequences like DNA and proteins, including local vs global alignment, exact vs heuristic algorithms, and tools like BLAST, PSI-BLAST, and statistical tests for assessing the significance of sequence similarities. Local alignment finds short similar regions, while global alignment considers the full sequence length. Exact methods are rigorous but slow, while heuristics sacrifice completeness for speed. BLAST uses ungapped segment pairs to identify regions for gapped extension and alignment. PSI-BLAST iteratively reweights sequences based on multiple alignments to identify more distant homologs. Statistical tests compare observed sequence similarities to random expectations to assess biological significance.
Anti-Universe And Emergent Gravity and the Dark Universe - Sérgio Sacani
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation, and commensalism, or negative, such as parasitism, predation, and competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: amensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each interacting organism benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
A mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
Mutualism allows organisms to exist in a habitat that could not be occupied by either species alone.
A mutualistic relationship allows the organisms to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are the association of specific fungi and certain genus of algae. In lichen, fungal partner is called mycobiont and algal partner is called
II. Syntrophism:
It is an association in which the growth of one organism either depends on or is improved by the substrate provided by another organism.
In syntrophism both organisms in the association benefit.
Compound A
Utilized by population 1
Compound B
Utilized by population 2
Compound C
utilized by both Population 1+2
Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the co-operation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Both populations 1 and 2 are then able to carry out the metabolic reaction that leads to the formation of an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends on interspecies hydrogen transfer from other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates; these are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal media, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
A synergistic relationship between E. faecalis and L. arabinosus occurs in which E. faecalis requires folic acid
4. Sequence alignment
Alignment: comparing two (pairwise) or more
(multiple) sequences by searching for a series of
identical or similar characters in the sequences.
- Similarity: residues with the same physicochemical properties.
- Identity: identical residues.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| || ||||| ||| || || ||
MVHLTPEEKTAVNALWGKVNVDAVGGE
5. Sequence alignment-why???
• The basis for comparing proteins and genes
by the similarity of their sequences is that the
proteins or genes are related by evolution;
they have a common ancestor.
• Random mutations in the sequences accumulate
over time, so that proteins or genes that have a
common ancestor far back in time are not as similar
as proteins or genes that diverged from each other
more recently.
6. Alignment
• A way of arranging the objects or alphabets to
find out the similarity and difference existing
between them.
• In case of bioinformatics, it is the arrangement
of sequence (DNA,RNA or protein) to find out
the regions of similarity and difference by
virtue of which homology can be predicted.
9. Why perform pairwise sequence
alignment?
To find homology between two sequences.
Example: protein prediction (sequence or
structure).
A similar sequence (or structure) suggests
a similar function.
10. Local Vs. Global
• Global alignment compares the sequences throughout their
length and gives the best overall alignment, but it may fail to find
a local region of similarity that contains the domain and
motif information.
• Local alignment finds regions of sequence with a high level
of similarity (ungapped in the simplest case). Best for finding a
motif even when the two sequences are otherwise different.
11. Local alignment – finds regions of high similarity in
parts of the sequences
Global alignment – finds the best alignment across
the entire two sequences
Local vs. Global
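The difference between the two modes can be illustrated with a short local-alignment sketch in the style of Smith-Waterman. The function name `local_alignment_score` and the match, mismatch and gap values are illustrative assumptions, not taken from the slides. The only change from the global recurrence is that a cell score is never allowed to drop below zero, and the best cell anywhere in the matrix is reported.

```python
def local_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between s and t (linear gap cost)."""
    best = 0
    prev = [0] * (len(t) + 1)          # previous row of the score matrix
    for a in s:
        cur = [0]                      # first column is always 0 in local mode
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            up = prev[j] + gap
            left = cur[j - 1] + gap
            cur.append(max(0, diag, up, left))   # 0 restarts the alignment
            best = max(best, cur[j])
        prev = cur
    return best


# The shared motif ACGT is found even though the flanks differ.
print(local_alignment_score("TTTACGTGGG", "CCCACGTAAA"))  # 4
```

A global alignment of the same two strings would be penalized for the mismatching flanks, which is why local alignment is preferred when only a domain or motif is shared.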
12. Three types of nucleotide changes:
1. Substitution – a replacement of one (or more)
sequence characters by another:
2. Insertion - an insertion of one (or more) sequence
characters:
3. Deletion – a deletion of one (or more) sequence
characters:
Evolutionary changes in sequences (figure): substitution, e.g. AAGA to AACA; insertion + deletion = indel.
13. Choosing an alignment:
• Many different alignments between two
sequences are possible. For the sequences
AAGCTGAATTCGAA
AGGCTCATTTCTGA
two of the possible alignments are:
A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
. . .
How can one determine which is the best alignment?
14. Exercise
Compute the score of each of the following alignments.
Scoring scheme:
• Match: +1
• Mismatch: -2
• Indel: -1 (gap penalty; opening = extending)
Substitution matrix:
     A   C   G   T
A    1  -2  -2  -2
C   -2   1  -2  -2
G   -2  -2   1  -2
T   -2  -2  -2   1
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-
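The exercise can be checked with a small script. This is a sketch under the scoring scheme stated on the slide (match +1, mismatch -2, indel -1, with gap opening equal to gap extension); the function name `score_alignment` is hypothetical.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, indel=-1):
    """Sum column scores of a pairwise alignment written as two gapped rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += indel        # gap column
        elif a == b:
            total += match        # identical characters
        else:
            total += mismatch     # substitution
    return total


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Under this scheme the first alignment (10 matches, 2 mismatches, 4 indels) scores higher than the second (9 matches, 2 mismatches, 6 indels), so it would be preferred.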
15. Open Reading Frames(ORFs)
• 6 possible reading frames
– frames 1, 2, and 3 in the 5’ to 3’ direction
of the given strand
– frames 1, 2, and 3 in the 5’ to 3’ direction
of the complementary strand.
The different reading frames give
entirely different proteins.
Each gene uses a single reading frame, so
once the ribosome gets started, it just has
to count off groups of 3 bases to produce
the proper protein.
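The six frames can be enumerated with a short sketch. The helper names `reverse_complement` and `six_frames` and the frame labels (`+1` to `-3`) are illustrative choices, and the input is assumed to be an uppercase DNA string containing only A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complementary strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Split a DNA string into codons for all six reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Only complete codons are kept; trailing 1-2 bases are dropped.
            codons = [seq[k:k + 3] for k in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGAAATAG").items():
    print(name, codons)
```

Translating each codon list with a genetic-code table and scanning for start and stop codons would then give the candidate ORFs in each frame.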
16. PAM matrices
• Family of matrices: PAM 80, PAM 120, PAM 250, …
• The number with a PAM matrix (the n in PAMn) represents
the evolutionary distance between the sequences on which
the matrix is based
• The (i, j) cell of the PAMn mutation probability matrix denotes
the probability that amino acid i will be replaced by amino acid j
in time n: $P_{i \to j,\,n}$. The PAMn scoring matrix used in
alignment is the log-odds form of these probabilities.
• Greater n denotes greater evolutionary distance
17. BLOSUM matrices
• Different BLOSUMn matrices are calculated independently
from BLOCKS (ungapped local alignments)
• BLOSUMn is built by clustering the sequences in each block
that share at least n percent identity, so that closely related
sequences are not over-counted
• The (i, j) cell in a BLOSUM matrix denotes the log-odds ratio
of the observed frequency to the expected frequency of amino
acids i and j in the same position in the data: $\log\bigl(p_{ij}/(q_i q_j)\bigr)$
• Higher n denotes higher identity between the
sequences on which the matrix is based
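The log-odds definition above can be made concrete with a small calculation. The frequencies below are invented for illustration and are not taken from the BLOCKS database. The half-bit scaling (2 x log base 2, rounded to an integer) is the convention commonly used for published BLOSUM matrices and is assumed here; the function name `log_odds_score` is hypothetical.

```python
import math


def log_odds_score(p_ij, q_i, q_j, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected)."""
    return round(scale * math.log2(p_ij / (q_i * q_j)))


# Hypothetical background frequencies of 0.1 for both residues,
# so the expected pair frequency is 0.01.
print(log_odds_score(0.04, 0.1, 0.1))   # observed 4x more than expected -> 4
print(log_odds_score(0.01, 0.1, 0.1))   # observed as expected -> 0
print(log_odds_score(0.005, 0.1, 0.1))  # observed half as often -> -2
```

Positive scores therefore mark substitutions seen more often than chance in related sequences, and negative scores mark substitutions seen less often.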
18. BLAST
(Basic Local Alignment Search Tool)
• The BLAST program was designed by Eugene
Myers, Stephen Altschul, Warren Gish, David J.
Lipman and Webb Miller at the NIH and was
published in J. Mol. Biol. in 1990.
• OBJECTIVE: find high-scoring ungapped segments
among related sequences.
• One of the most widely used bioinformatics programs,
because the algorithm emphasizes speed over sensitivity.
19. • An algorithm for comparing primary biological
sequence information to find the similarity
between sequences.
• Emphasizes regions of local alignment to
detect relationships among sequences that
share only isolated regions of similarity.
• Not only a tool for visualizing alignments but
also a way to compare structure and
function.
20. Steps for BLAST
1. Search the database for exact matches of a small fixed length
(words) to the query sequence; each such match is called a seed.
2. Starting at the seed, BLAST tries to extend the match in both
directions as an ungapped alignment; the result is a High-scoring
Segment Pair (HSP).
3. The highest-scoring HSPs are presented in the final report.
They are called Maximal-scoring Segment Pairs (MSPs).
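The seed-and-extend steps can be sketched as follows. This is a simplified illustration, not the BLAST implementation: seeds are exact words only, scoring is a flat match/mismatch rule, and extension stops when the running score falls more than `x_drop` below the best score seen. The names `find_seeds` and `extend_seed` and all parameter values are assumptions made for this example.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) for every exact word match of length w."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend_seed(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed in both directions.

    Returns (score, query_span, subject_span) with half-open spans.
    """
    def walk(q_start, s_start, step, start_score):
        score = best = start_score
        kept = k = 0
        q, s = q_start, s_start
        while 0 <= q < len(query) and 0 <= s < len(subject):
            score += match if query[q] == subject[s] else mismatch
            k += 1
            if score > best:
                best, kept = score, k      # remember the best extension so far
            elif best - score > x_drop:
                break                      # score dropped too far: stop
            q, s = q + step, s + step
        return best, kept

    best, right = walk(i + w, j + w, +1, w * match)
    best, left = walk(i - 1, j - 1, -1, best)
    return best, (i - left, i + w + right), (j - left, j + w + right)


query, subject = "GGACGTTC", "TTACGTAA"
print(find_seeds(query, subject))            # [(2, 2), (3, 3)]
print(extend_seed(query, subject, 2, 2))     # (4, (2, 6), (2, 6))
```

In this toy case both seeds lie on the same diagonal and extend to the same segment pair, ACGT against ACGT; a real search would merge such hits and keep only segment pairs scoring above a threshold.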
21. BLAST then performs a gapped alignment
between the query sequence and the database
sequence using a variation of the Smith-Waterman
algorithm. Statistically significant alignments
are then displayed to the user.
22. BLAST PROGRAMS
• BLASTP: protein query sequence against a protein
database, allowing for gaps.
• BLASTN: DNA query sequence against a DNA database,
allowing for gaps.
• BLASTX: DNA query sequence, translated into all six
reading frames, against a protein database, allowing for
gaps.
• TBLASTN: protein query sequence against a DNA
database, translated into all six reading frames, allowing
for gaps.
• TBLASTX: DNA query sequence, translated into all six
reading frames, against a DNA database, translated into
all six reading frames (No gaps allowed)
23. PSI-BLAST
(Position-Specific Iterated BLAST)
• Used to find distant relatives of a protein.
• First, a list of all closely related proteins is
created. These proteins are combined into a
general "profile", a position-specific scoring matrix.
• This profile is then used as the query, and the
search is repeated to find more distantly
related sequences.
• PSI-BLAST is much more sensitive in picking
up distant evolutionary relationships than a
standard protein-protein BLAST.
25. Matrix
• A key element in evaluating the quality of a
pairwise sequence alignment is the
"substitution matrix", which assigns a score for
aligning any possible pair of residues.
• BLAST provides BLOSUM and PAM matrices.
28. The Score Matrix
Sequence A: ACDEFGH
Sequence B: HICDYGH
Cell X(i,j) holds the substitution score for residue i of A (columns) against residue j of B (rows):
A C D E F G H
H -2 -3 -1 0 -3 -2 8
I -1 -1 -3 -3 0 -4 -3
C 0 9 -3 -4 -2 -3 -3
D -2 -3 6 2 -3 -1 -1
Y -2 -2 -3 -2 3 -3 2
G 0 -3 -1 -2 -3 6 -2
H -2 -3 -1 0 -3 -2 8
Example alignment showing gaps, similarity (F/Y) and identity (C, D, G, H):
-ACDEFGH
HICD-YGH
29. Paths in the Score Matrix
-ACDEFGH
HICD-YGH
A C D E F G H
H -2 -3 -1 0 -3 -2 8
I -1 -1 -3 -3 0 -4 -3
C 0 9 -3 -4 -2 -3 -3
D -2 -3 6 2 -3 -1 -1
Y -2 -2 -3 -2 3 -3 2
G 0 -3 -1 -2 -3 6 -2
H -2 -3 -1 0 -3 -2 8
Each step of a path is a match (diagonal move), a deletion or an insertion (horizontal or vertical move).
Alignments are in a one-to-one correspondence
with score matrix paths.
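The correspondence between alignments and paths can be shown by converting the example alignment into the list of matrix cells it visits. This is a sketch with an assumed coordinate convention: `i` counts residues consumed from sequence A (columns) and `j` counts residues consumed from sequence B (rows). The function name `alignment_to_path` is hypothetical.

```python
def alignment_to_path(row_a, row_b):
    """Convert a gapped pairwise alignment into a path of (i, j) cells."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a != "-":
            i += 1      # consume a residue of A: move one column
        if b != "-":
            j += 1      # consume a residue of B: move one row
        path.append((i, j))
    return path


print(alignment_to_path("-ACDEFGH", "HICD-YGH"))
```

Diagonal steps (both coordinates increase) are aligned residue pairs, and steps that change only one coordinate are gaps. Because each alignment column maps to exactly one step, distinct alignments give distinct paths from the top-left to the bottom-right cell.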
30. Low Complexity Regions
• Amino acid or DNA sequence regions that offer very
low information due to their highly biased content
– histidine-rich domains in amino acids
– poly-A tails in DNA sequences
– poly-G tails in nucleotides
– runs of purines
– runs of pyrimidines
– runs of a single amino acid, etc.
31. E-value
• Depends on database size
• Indicates the number of database
matches expected as a result of random
chance
• The lower the E-value, the more significant
the match and the less likely that the
database hit is a result of random chance
32. E = m x n x p
E = E-value
m = total number of residues in the database
n = number of residues in the query sequence
p = probability that a high-scoring pair is the result of
random chance
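33. • E-value between 10^-50 and 0.01: the match can be considered a result of homology
• E-value between 0.01 and 10: not significant, but may hint at
remote homology
• E-value > 10: unrelated, or too distantly related to detect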
34. Bit Score
• Measures sequence similarity independently of
query sequence length and database size; it is a normalized form of the raw
pairwise alignment score
• The higher the bit score, the more significant the match
• $S' = (\lambda S - \ln K)/\ln 2$
S’ = bit score
S = raw alignment score
λ = Gumbel distribution constant
K = constant associated with the scoring matrix
(λ and K are two statistical parameters)
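The bit-score formula and its link to the E-value can be tried out numerically. This sketch assumes the standard relation E = m x n x 2^(-S'), which matches E = m x n x p with p = 2^(-S'). The values of λ, K, the raw score and the database size below are illustrative placeholders, not parameters read from any BLAST run, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, m, n):
    """Expected number of chance hits: E = m * n * 2**(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameter values (assumptions for this example only).
lam, k = 0.267, 0.041
bits = bit_score(raw_score=100, lam=lam, k=k)
print(round(bits, 1))
print(e_value(bits, m=10_000_000, n=250))
```

Because the bit score already absorbs λ and K, bit scores from searches that used different scoring matrices can be compared directly, while the E-value still grows with the size of the search space m x n.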
35. Low Complexity Regions (LCR)
Masking:
(I) Hard masking
(II) Soft masking
Programs for masking:
(i) SEG: regions with high-frequency residues are declared LCRs
(ii) RepeatMasker: a sequence region scoring above a
certain threshold is declared an LCR. Residues are
masked with N’s (nucleotides) and X’s (amino acids)
37. BLAST result page
• The BLAST result page is divided into 3 parts.
• Part 1 contains information on the version, the database
used, the reference, and the length of the query sequence.
• Part 2 shows the conserved regions and a graphical representation
of the alignments, where each line represents the alignment of
the query sequence with one database sequence.
• It shows the results in 5 different colors depending on the bit
score.
• Part 3 contains the list of database sequences with
similarity found in the database search, and a detailed view of each
alignment along with bit score, E-value, identities, positives
and gaps.
42. BLAST Preferred
• BLAST uses a substitution matrix to find
matching words, while FASTA identifies identical
matching words using a hashing procedure. By
default FASTA scans smaller window sizes.
Thus it gives more sensitive results than
BLAST, with better coverage of
homologs, but it is usually slower than BLAST.
43. • BLAST uses low-complexity masking, which means it
may have higher specificity than FASTA
because false positives are reduced.
• BLAST sometimes gives multiple best-scoring
alignments from the same sequence, whereas FASTA
returns only one final alignment.
44. REFERENCES
Xiong, J. (2006). Essential Bioinformatics.
Cambridge University Press.
Mount, D. W. (2004). Bioinformatics: Sequence and
Genome Analysis. Cold Spring Harbor
Laboratory Press.
URL: www.ncbi.nlm.nih.gov