BIOINFORMATICS
Bioinformatics is an emerging field of science that uses computer technology for the storage, retrieval, manipulation and distribution of information related to biological data, specifically DNA, RNA and proteins.

DATABASE
Databases are the repositories in which biological data is stored in computer-readable form. They are classified on various bases such as data type, data source, organism, etc.

TOOLS
Tools are software developed to perform various tasks on the stored data, such as searches, analysis, submission, annotation, etc.

RESIDUE
The term stands for the building block of the macromolecules in the databases: for example, nucleotides for DNA and RNA, and amino acids for proteins.
On basis of Data Type
• Genome Databases
• Sequence Databases
• Structure Databases
• Microarray Databases
• Chemical Databases
• Metabolic Databases
• Enzyme Databases
• Disease Databases
• Literature Databases
• Taxonomy Database

On basis of Data Source
• Primary Databases
• Secondary Databases

Special Categories
• Integrated Database
• Composite Database
BLAST stands for Basic Local Alignment Search Tool.
BLAST is a program that uses specific scoring matrices (such as PAM or BLOSUM) to perform sequence-similarity searches against a variety of sequence databases, giving us high-scoring ungapped segments among related sequences.
• Complex: requires multiple steps and many parameters
• The BLAST algorithm is fast, accurate, and web-accessible
• Relatively faster than other sequence-similarity search tools
• Lets us perform different types of analysis with different programs
Program    Input (query)    Database    Searches
blastn     DNA              DNA         1
blastp     protein          protein     1
blastx     DNA              protein     6
tblastn    protein          DNA         6
tblastx    DNA              DNA         36
• blastn compares a DNA query sequence against a DNA database, allowing for gaps.
• blastp compares a protein query sequence against a protein database, allowing for gaps.
• blastx compares a DNA query sequence translated into six reading frames against a protein database, allowing for gaps.
• tblastn compares a protein query sequence against a DNA database translated into six reading frames, allowing for gaps.
• tblastx compares a DNA query sequence translated into six reading frames against a DNA database translated into six reading frames. tblastx doesn’t allow for gaps.
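The "six reading frames" used by blastx, tblastn and tblastx can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the code BLAST itself runs: it assumes the standard genetic code, marks stop codons with "*", and marks codons containing ambiguous bases with "X". The function names are hypothetical.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate one strand starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Unknown codons (e.g. containing N) become "X" (an assumption of this sketch).
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Three frames on the forward strand, then three on the reverse strand."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand, frame) for strand in strands for frame in range(3)]


if __name__ == "__main__":
    for number, protein in enumerate(six_frame_translations("ATGGCC"), start=1):
        print(number, protein)
```

Each DNA sequence yields six candidate protein sequences, which is why blastx and tblastn involve 6 comparisons per query and database sequence pair, and tblastx involves 6 x 6 = 36.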
• MEGABLAST: for comparison of large sets of long DNA sequences
• RPS-BLAST: conserved domain detection
• BLAST 2 Sequences: for performing pairwise alignments of two chosen sequences
• Genomic BLAST: for alignments against selected human, microbial or malarial genomes
• PSI-BLAST: constructs a multiple alignment from matches
• PHI-BLAST: lets you specify a pattern that hits must match
• Make specific primers with Primer-BLAST
• Search trace archives
• Find conserved domains in your sequence (cds)
• Find sequences with similar conserved domain architecture (cdart)
• Search sequences that have gene expression profiles (GEO)
• Search immunoglobulins (IgBLAST)
• Search using SNP flanks
• Screen sequence for vector contamination (vecscreen)
• Align two (or more) sequences using BLAST (bl2seq)
• Search protein or nucleotide targets in PubChem BioAssay
• Search SRA transcript and genomic libraries
• Constraint Based Protein Multiple Alignment Tool
• Needleman-Wunsch Global Sequence Alignment Tool
• Search RefSeqGene
http://blast.ncbi.nlm.nih.gov/Blast.cgi
How BLAST works is somewhat complicated and lengthy, so here is a brief explanation. BLAST works in the following two steps:
1. BLAST first compiles short “words” (substrings) of a given length (W) that score at least “T” when compared to the query sequence using a substitution matrix, and searches the sequences in the database (“target sequences”) for matches to these words.
2. For every pair of sequences (query and target) that have a word or words in common, BLAST extends the alignment in both directions to find alignments that score greater (are more similar) than a certain score threshold (S). These alignments are called high-scoring segment pairs, or HSPs; the maximal-scoring HSPs are called maximal segment pairs, or MSPs.
Figure: how BLAST works (Schneider and La Rota 2000). The query sequence is broken into “words” (subsequences of the query sequence). Query words are compared to the database (target sequences) and exact matches are identified. For each word match, the alignment is extended in both directions to find alignments that score greater than some threshold (maximal segment pairs, or MSPs).
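The two steps (find word matches, then extend them) can be sketched in a few lines. This is a simplified illustration, not the actual BLAST implementation: it seeds on exact word matches only, rather than on neighbourhood words scoring at least T under a substitution matrix; it scores +1 for a match and -1 for a mismatch; and it stops extending once the score falls too far below the best seen. The function names, the `x_drop` and `min_score` parameters and their default values are assumptions made for this example.

```python
def find_seeds(query, target, w):
    """Step 1: list (query_pos, target_pos) of exact word matches of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    seeds = []
    for j in range(len(target) - w + 1):
        for i in index.get(target[j:j + w], []):
            seeds.append((i, j))
    return seeds


def score_pair(a, b, match=1, mismatch=-1):
    """Toy scoring scheme standing in for a substitution matrix."""
    return match if a == b else mismatch


def extend_one_way(query, target, qi, ti, step, x_drop):
    """Ungapped extension in one direction; returns (best_gain, best_length)."""
    running = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        running += score_pair(query[qi], target[ti])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score dropped too far below the best seen so far
        qi += step
        ti += step
    return best, best_len


def extend_seed(query, target, qi, ti, w, x_drop=2):
    """Step 2: extend a seed both ways; returns (score, q_start, q_end, t_start)."""
    seed_score = sum(score_pair(query[qi + k], target[ti + k]) for k in range(w))
    right, right_len = extend_one_way(query, target, qi + w, ti + w, 1, x_drop)
    left, left_len = extend_one_way(query, target, qi - 1, ti - 1, -1, x_drop)
    return (seed_score + left + right, qi - left_len, qi + w + right_len, ti - left_len)


def best_hsp(query, target, w=3, min_score=4):
    """Return the highest-scoring segment pair reaching min_score, or None."""
    best = None
    for qi, ti in find_seeds(query, target, w):
        hsp = extend_seed(query, target, qi, ti, w)
        if hsp[0] >= min_score and (best is None or hsp[0] > best[0]):
            best = hsp
    return best


if __name__ == "__main__":
    print(best_hsp("GATTACAGG", "CCGATTTCAGGTT"))
```

Here `w` plays the role of the word length W and `min_score` the role of the threshold S in the description above. The returned tuple gives the score, the start and end (exclusive) of the aligned region in the query, and its start in the target.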
BLAST can handle various questions that commonly arise in the research laboratory. Some of the most common are:
• Which bacterial species have a protein that is related to a protein whose amino-acid sequence I know?
• Where does the DNA I’ve sequenced come from?
• What other genes encode proteins that exhibit structures similar to the one I’ve just determined?
• What does the protein structure look like?
• What is the function of the gene or the protein that I’ve sequenced? (If it’s not known, then you have some work to do.)
• What are the probable functions of the sequence I have?
To answer such questions, we use BLAST to search the database and then analyse the results it produces. An example illustrates this.
We have the following protein sequence from our experiments with Mycobacterium tuberculosis:
Sequence:
To see whether this protein is similar to proteins in other organisms, and so to understand its importance, we perform a BLAST search.
To perform BLAST, go to the following URL: http://blast.ncbi.nlm.nih.gov/
After running BLAST against a chosen database, we analyse the results. One chosen entry shows that the sequence for which we ran BLAST against a database (here Swiss-Prot) has 88% identity with “Full=Single-stranded DNA-binding protein”, accession number P46390.2.
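The identity percentage in such an entry is the share of alignment columns in which the query and the hit carry the same residue. A minimal sketch of that calculation, assuming the two aligned sequences are given as equal-length strings with "-" marking gaps (the function name and the example sequences are made up for illustration):

```python
def percent_identity(aligned_query, aligned_subject, gap="-"):
    """Percentage of alignment columns with identical residues.

    Gap columns count toward the alignment length but never as identities.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    identical = sum(
        1
        for a, b in zip(aligned_query, aligned_subject)
        if a == b and a != gap
    )
    return 100.0 * identical / len(aligned_query)


if __name__ == "__main__":
    # Hypothetical six-column alignment: 4 identical columns out of 6.
    print(round(percent_identity("MKT-AY", "MKSLAY"), 1))
```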
The entry shows a score that describes the quality of the match between the database entry and the query we sequenced in our experiment. With the accession number obtained from the BLAST search, we can easily access information about many aspects of the hit. Some of them are described below:
• The organism from which it came
• Function of the protein
• Region of DNA encoding the gene
• Length of the sequence
• Taxonomy of the organism
• FASTA sequence of the protein
• Links to the 3D structure, if it has been found
Similarly, we can see whether the sequence we have sequenced is homologous (similar) to any sequence in the database we are searching. As mentioned, we can search any database of interest to check its function, or the function of similar structures.
• BLAST is the most important program in bioinformatics (maybe in all of biology)
• BLAST is based on sound statistical principles (key to its speed and sensitivity)
• A basic understanding of its principles is key to using and interpreting BLAST output
• BLAST can play an essential role in helping us propose the following:
  • Structure of a protein
  • Function of a sequence
  • Relation with an organism
• Use blastn or MEGABLAST for DNA
• Use PSI-BLAST for protein searches