Amity Global Business School, Chennai
INFORMATION TECHNOLOGY
Bioinformatics
Arushi, Dinesh, Kasi, Shruthi
November 2009

BIOINFORMATICS

Introduction

Bioinformatics is the application of information technology to the field of molecular biology. The term was coined by Paulien Hogeweg and Ben Hesper in 1970 for the study of informatic processes in biotic systems. Since at least the late 1980s its primary use has been in genomics and genetics, particularly in those areas of genomics involving large-scale DNA sequencing. Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

Bioinformatics developed out of the need to understand the code of life, DNA. Massive DNA sequencing projects have evolved and aided the growth of the science of bioinformatics. DNA, the basic molecule of life, directly controls the fundamental biology of life. It codes for genes, which code for proteins, which determine the biological makeup of humans or any other living organism. Variations and errors in genomic DNA ultimately define the likelihood of developing diseases or resistance to those same disorders.

The ultimate goal of bioinformatics is to uncover the wealth of biological information hidden in the mass of sequence, structure, literature and other biological data, to obtain a clearer insight into the fundamental biology of organisms, and to use this information to enhance the standard of life for mankind. It is being used now, and will be for the foreseeable future, in molecular medicine to help produce better and more customised medicines to prevent or cure diseases. It has environmental benefits in identifying waste cleanup bacteria, and in agriculture it can be used to produce high-yield, low-maintenance crops. These are just a few of the many benefits bioinformatics will help develop.
What is bioinformatics?

What is done in Bioinformatics?

Applications of Bioinformatics

Bioinformatics can be used in various fields, as given below:
1. Molecular medicine
2. Personalised medicine
3. Preventative medicine
4. Gene therapy
5. Drug development
6. Microbial genome applications
7. Waste cleanup
8. Climate change studies
9. Alternative energy sources
10. Biotechnology
11. Antibiotic resistance
12. Forensic analysis of microbes
13. Bio-weapon creation
14. Evolutionary studies
15. Crop improvement
16. Insect resistance
17. Improvement of nutritional quality
18. Development of drought-resistant varieties
19. Veterinary science

Biological problems that computers can help with:

Why use bioinformatics?

Software and tools available

Software tools for bioinformatics range from simple command-line tools to more complex graphical programs and standalone web services available from various bioinformatics companies or public institutions. The computational biology tool best known among biologists is probably BLAST, an algorithm for determining the similarity of arbitrary sequences against other sequences, possibly from curated databases of protein or DNA sequences. BLAST is one of a number of generally available programs for doing sequence alignment. The NCBI provides a popular web-based implementation that searches their databases.

BLAST

In bioinformatics, the Basic Local Alignment Search Tool, or BLAST, compares primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and to identify library sequences that resemble the query sequence above a certain threshold.
For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see whether humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman and Webb Miller at the NIH and was published in J. Mol. Biol. in 1990. Choosing the right BLAST depends on the type of sequence you are searching with (long or short; nucleotide or protein) and on the desired database.

National Center for Biotechnology Information (NCBI) BLAST services

There are four ways to use BLAST services:

WWW BLAST
The easiest way to use BLAST is through the Web. Users may simply point their browsers at the NCBI home page (http://www.ncbi.nlm.nih.gov) and follow the link to the /BLAST pages for any number of different types of searches. A complete description of BLAST services is available at this location.

StandAlone BLAST
BLAST can be run locally as a full executable and can be used to run BLAST searches against private, local databases, or against downloaded copies of the NCBI databases. BLAST binaries are provided for Macintosh, Win32 (PC), LINUX, Solaris, IBM AIX, SGI, Compaq OSF, and HP UX systems. StandAlone BLAST executables may be found on the NCBI anonymous FTP server (ftp://ftp.ncbi.nih.gov) under /blast/executables/.

The StandAlone WWW BLAST Server allows you to set up your own in-house version of the NCBI BLAST Web pages, which can be accessed through web browsers on intranet web servers. You can set up the program to search your own custom databases or downloaded copies of the NCBI databases. The StandAlone WWW BLAST Server is available by anonymous FTP at ftp://ftp.ncbi.nih.gov/blast/server/current_release/. At this time, it is only available for UNIX Web servers.

Network BLAST
Blastcl3 is the NCBI BLAST network client.
It allows remote TCP/IP connections to the NCBI BLAST servers to run BLAST searches. No web browser is required. The BLAST network client can also be used to search many sequences against the NCBI databases at one time. You can download blastcl3 from the anonymous FTP location ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/.

BLAST URL API
The BLAST URL API is a standardized application program interface (API) for accessing the NCBI QBLAST system. It uses direct HTTP-encoded requests to the NCBI web server. The URL API allows you to write a script that passes sequences to the NCBI databases and returns the results locally. Its features include:
It does not require the download of an executable from NCBI FTP site.
There is no need to download database files from NCBI to your local machine.
The URL API is easy to maintain and keep backward compatible.
It is not necessary to hack the NCBI BLAST CGI to run scripts.
Instructions for the URL API are located in three formats: HTML, PDF, and PostScript.
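
As a rough illustration of the kind of script the URL API makes possible, the Python sketch below builds the two requests such a script would send: one to submit a query and one to fetch the results. This report does not give the endpoint address or the parameter names, so the ones used here (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_TYPE) are assumptions based on NCBI's published URL API documentation and should be checked against the current version. The function names are hypothetical, and the sketch only builds the requests; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API (not stated in this report).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_put_request(query, program="blastp", database="nr"):
    """Return the URL-encoded body for submitting a search (assumed CMD=Put)."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # e.g. blastp, blastn
        "DATABASE": database,  # e.g. nr
        "QUERY": query,        # sequence, accession or GI
    }
    return urlencode(params)


def build_get_request(rid, format_type="Text"):
    """Return the query string for fetching results (assumed CMD=Get).

    rid is the request identifier the server hands back after a submission.
    """
    return urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})


if __name__ == "__main__":
    print(build_put_request("GLKFA"))
    # "EXAMPLE_RID" is a placeholder, not a real request identifier.
    print(BLAST_URL + "?" + build_get_request("EXAMPLE_RID"))
```

A real script would send the first request, read the request identifier from the response, wait, and then poll with the second request until the results are ready.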
BLAST PROCESS

BLAST works through the use of a heuristic algorithm. Using a heuristic method, BLAST finds homologous sequences not by comparing either sequence in its entirety, but by locating short matches between the two sequences. This process of finding initial words is called seeding. Only after this first match does BLAST begin to make local alignments.

When looking for homology between sequences, short sets of letters, known as words, are very important. For example, suppose the sequence contains the stretch of letters GLKFA. If a BLASTp search were conducted under default conditions, the word size would be 3 letters, so the searched words would be GLK, LKF and KFA. The heuristic algorithm of BLAST locates all common words between the sequence of interest and the hit sequence, or sequences, from the database. These results are then used to build an alignment.

After the words for the sequence of interest have been made, neighborhood words are also assembled. These are words that score at least the threshold, T, when compared with a query word using a scoring matrix. For a BLASTp search, the scoring matrix would most likely be BLOSUM62. Once both words and neighborhood words have been assembled, they are compared with the sequences in the database in order to find matches. The threshold score, T, therefore determines whether a particular word will be included in the alignment or not.

Once seeding has been conducted, the alignment, which is only 3 residues long, is extended in both directions. Each extension either increases or decreases the score of the alignment. If the score of the extended alignment reaches a pre-determined cutoff (S, which is distinct from the word threshold T), the alignment is included in the results given by BLAST.
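
The seeding and extension steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it uses exact word matches only (no neighborhood words and no BLOSUM62), a hypothetical +1/-1 scoring scheme, and extension to the right only. The function names and the drop_off parameter are assumptions made for the example.

```python
def query_words(sequence, word_size=3):
    """Split a sequence into overlapping words of length word_size."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs in subject."""
    index = {}
    for pos, word in enumerate(query_words(subject, word_size)):
        index.setdefault(word, []).append(pos)
    seeds = []
    for qpos, word in enumerate(query_words(query, word_size)):
        for spos in index.get(word, []):
            seeds.append((qpos, spos))
    return seeds


def extend_right(query, subject, qpos, spos, word_size=3,
                 match=1, mismatch=-1, drop_off=2):
    """Extend a seed to the right without gaps.

    Stops when the running score falls drop_off below the best score seen.
    Returns (best_score, length_of_best_alignment).
    """
    score = best = word_size * match
    best_len = word_size
    i = word_size
    while qpos + i < len(query) and spos + i < len(subject):
        score += match if query[qpos + i] == subject[spos + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score >= drop_off:
            break
    return best, best_len


if __name__ == "__main__":
    print(query_words("GLKFA"))              # the three words from the text
    print(find_seeds("GLKFA", "AAGLKFAWW"))  # seed positions in a toy subject
    print(extend_right("GLKFA", "AAGLKFAWW", 0, 2))
```

Indexing the subject words in a dictionary is what makes seeding fast: each query word is looked up directly instead of being compared against every position in the database sequence.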
Extension does not continue indefinitely: once the score falls too far below the best value reached so far, the alignment stops extending, which prevents areas of poor alignment from being included in the BLAST results. Note that increasing the threshold T limits the search space by decreasing the number of neighborhood words, which at the same time speeds up the BLAST process.

BLAST PROGRAM

The BLAST program can either be downloaded and run as a command-line utility "blastall"
or accessed for free over the web. The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms. The BLAST program is based on an open-source format, giving everyone access to it and the ability to change the program code. This has led to the creation of several BLAST "spin-offs".

There are now a handful of different BLAST programs available, which can be used depending on what one is attempting to do and what one is working with. These programs vary in the query sequence input, the database being searched, and what is being compared. BLAST is actually a family of programs (all included in the blastall executable). These include:

Nucleotide-nucleotide BLAST (blastn)
This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.

Protein-protein BLAST (blastp)
This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.

Position-Specific Iterative BLAST (PSI-BLAST)
This program is used to find distant relatives of a protein. First, a list of all closely related proteins is created. These proteins are combined into a general "profile"
sequence, which summarises significant features present in these sequences. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated. By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST.

Nucleotide 6-frame translation-protein (blastx)
This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.

Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx)
This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.

Protein-nucleotide 6-frame translation (tblastn)
This program compares a protein query against all six reading frames of a nucleotide sequence database.

Large numbers of query sequences (megablast)
When comparing large numbers of input sequences via the command-line BLAST, "megablast"
is much faster than running BLAST multiple times. It concatenates many input sequences together to form one large sequence before searching the BLAST database, then post-analyzes the search results to glean individual alignments and statistical values.

Of these programs, BLASTn and BLASTp are the most commonly used because they use direct comparisons and do not require translations. However, since protein sequences are better conserved evolutionarily than nucleotide sequences, tBLASTn, tBLASTx and BLASTx produce more reliable and accurate results when dealing with coding DNA. They also let one see the likely function of the protein directly, since translating the sequence of interest before searching often yields annotated protein hits.

BLAST INFORMATION

This BLAST information guide is intended to assist new and veteran users in employing NCBI tools such as BLAST and PSI-BLAST in their research. The two tutorials (Query and BLAST) offer starting points for users with different backgrounds. A novice should start with the Query tutorial.

Sequence alignments provide a powerful way to compare novel sequences with previously characterized genes. Both functional and evolutionary information can be inferred from well-designed queries and alignments. BLAST 2.0 (Basic Local Alignment Search Tool) provides a method for rapid searching of nucleotide and protein databases. Since the BLAST algorithm detects local as well as global alignments, regions of similarity embedded in otherwise unrelated proteins can be detected. Both types of similarity may provide important clues to the function of uncharacterized proteins.

QUERY TUTORIAL

1. Introduction
This online tutorial is designed to help the first-time BLAST user. It will teach you to input a sequence into the Basic BLAST web page, choose a program and database, and examine the results. The core of NCBI's BLAST services is BLAST 2.0, otherwise known as "Gapped BLAST". This service is designed to take protein and nucleic acid sequences and compare them against a selection of NCBI databases. The BLAST algorithm was written to balance speed against increased sensitivity for distant sequence relationships. Instead of relying on global alignments (commonly seen in multiple sequence alignment programs), BLAST emphasizes regions of local alignment to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990). BLAST is therefore more than a tool to view sequences aligned with each other or to find homology; it is a program to locate regions of sequence similarity with a view to comparing structure and function.

2. Selecting the BLAST Program
The BLAST search pages allow you to select from several different programs. Below is a table of these programs.

Program: Description
blastp: Compares an amino acid query sequence against a protein sequence database.
blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.

To select a BLAST program for your search:
1. Open the Basic BLAST search page.
2. From the "Program"
Pull Down Menu, select the appropriate program.

Figure 1. Using the pull down menu to select a BLAST program.

3. Selecting the BLAST Database
You can select several NCBI databases to compare your query sequences against. Note that some databases are specific to proteins or nucleotides and cannot be used in combination with certain BLAST programs (for example, a blastn search against swissprot).

Proteins
Database: Description
nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF.
month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR released in the last 30 days.
swissprot: The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.
patents: Protein sequences derived from the Patent division of GenBank.
yeast: Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be confused with a listing of all yeast protein sequences; it is a database of the protein translations of the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic CDS translations.
pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank.
kabat [kabatpro]: Kabat's database of sequences of immunological interest. For more information see http://immuno.bme.nwu.edu/
alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu . See "Alu alert"
by Claverie and Makalowski, Nature vol. 371, page 752 (1994).

Nucleotides
Database: Description
nr: All non-redundant GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
dbest: Non-redundant database of GenBank + EMBL + DDBJ EST Divisions.
dbsts: Non-redundant database of GenBank + EMBL + DDBJ STS Divisions.
mouse ests: The non-redundant database of GenBank + EMBL + DDBJ EST Divisions limited to the organism mouse.
human ests: The non-redundant database of GenBank + EMBL + DDBJ EST Divisions limited to the organism human.
other ests: The non-redundant database of GenBank + EMBL + DDBJ EST Divisions for all organisms except mouse and human.
yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all yeast nucleotide sequences, but the sequence fragments from the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic nucleotide sequences.
pdb: Sequences derived from the 3-dimensional structure of proteins.
kabat [kabatnuc]: Kabat's database of sequences of immunological interest. For more information see http://immuno.bme.nwu.edu/
patents: Nucleotide sequences derived from the Patent division of GenBank.
vector: Vector subset of GenBank(R), NCBI (ftp://ncbi.nlm.nih.gov/pub/blast/db/directory).
mito: Database of mitochondrial sequences (Rel. 1.0, July 1995).
alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu . See "Alu alert"
by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
epd: Eukaryotic Promoter Database, ISREC, Epalinges s/Lausanne (Switzerland).
gss: Genome Survey Sequence; includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
htgs: High Throughput Genomic Sequences.

Figure 2. Using the Pull Down Menu to select the BLAST database.

4. Entering your Sequence
The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GIs.

5. FASTA Format
A description of the FASTA format is located on the Basic BLAST search pages.
1. Open your FASTA formatted sequence in a text editor as plain text.
2. Use your mouse to highlight the entire sequence.
3. Select Edit/Copy from the menu in your text editor.
4. Go to the BLAST search page in your web browser.
5. Use your mouse to select the main input field titled "Enter your input data here" by clicking it once.
6. Select Edit/Paste from the browser's menu.
7. You should now see your FASTA sequence in this field.
8. Set the pull down menu to "Sequence in FASTA format".

Figure 3. Example of a FASTA sequence in the input field.

6. Accession or GI number
If you know the Accession number or the GI of a sequence in GenBank, you can use this as the query sequence in a BLAST search.
1. Go to the BLAST search page in your web browser.
2. Use your mouse to select the main input field titled "Enter your input data here" by clicking it once.
3. Using the keyboard, enter the GenBank Accession number or the GI number.
4. Set the Pull Down Menu to "Accession or GI".

7. Submitting your Search
1. Make sure you have selected the correct BLAST program and BLAST database.
2. If you have entered your FASTA sequence or an Accession or GI number, click the "Submit Query" button.
3. BLAST will now open a new window and tell you it is working on your search.
4. Once your results are computed, they will be presented in the window.

BLAST TUTORIAL

This BLAST tutorial is designed to help both the novice and the experienced BLAST user to set up and perform a BLAST search, decipher the output and analyze the results. The tutorial illustrates the potential for BLAST and PSI-BLAST searches to identify even weak (subtle) homologies to annotated entries in the database. It demonstrates that BLAST and PSI-BLAST (see the separate PSI-BLAST tutorial) are important tools for predicting both biochemical activities and function from sequence relationships. In addition to the tutorial, the BLAST guide may be useful in becoming acquainted with the ins and outs of BLAST searching.

Introduction to a BLAST Query
Open a new browser window so that the BLAST program can be compared with the tutorial. Notice that the tutorial page resembles the Query form for an ADVANCED BLAST search; however, the elements of the Query form have been reorganized on the tutorial page to make them easier to describe. Explanatory notes have been added in light grey boxes. Additional details about BLAST are available through the details buttons. The BLAST browser window may be left open and used in parallel, or it may be closed while browsing through this tutorial. Scroll down the tutorial page to learn how to submit a BLAST search, step by step. When you are ready, the search button will take you to the BLAST output page where the results of this search can be examined.

Step 1. Choose the program to use and the database to search.
As an example, consider the uncharacterized archaebacterial protein MJ0577 from Methanococcus jannaschii. The amino acid sequence derived from the MJ0577 open reading frame will be used as a query in a search for sequence relatives in the amino acid database nr (non-redundant).
blastp is the appropriate search routine for all searches in which an amino acid query is to be compared with an amino acid database. The nr database is a good choice for a comprehensive search.

Step 2. Input the data.
Adjust the "Query data is formatted as" pull down menu so that the selection (FASTA format vs. Accession or GI) matches the format of the sequence in the input window. In this case, FASTA format is chosen to correspond with the FASTA formatted sequence in the input box. The GI (GenBank Identifier) is the number (2501594, in this case) located between two vertical lines in the Entrez entry. Entering the GI or the Accession number in the sequence box will accomplish the same thing as entering the sequence in FASTA format.

Step 3. Set the program options or choose defaults.
Note: certain options (*) are available only when using the Advanced BLAST Web site.

Perform ungapped alignment
Leaving this box unchecked will allow gaps to be introduced into sequence alignments. This default option ensures that any similarities, even those that define a domain within the coding region, will be identified if the extent of local similarity is high enough.

* Choose an Organism from the list to limit your search, or enter your Organism Name or Taxonomic Class
This search is not limited to a particular organism. Relationships to proteins in any kingdom may provide clues about the functional classification of the hypothetical ORF in question.

* Expect
The E value threshold for the MJ0577 search has been changed from the default value of 10 to a setting of 1. Although hits with E values much higher than 0.1 are unlikely to reflect true sequence relatives, it is useful to examine hits with lower significance (E values between 0.1 and 10) for short regions of similarity. In the absence of longer similarities, these short regions may allow the tentative assignment of biochemical activities to the ORF in question. The significance of any such regions must be assessed on a case-by-case basis.

In trying to find a function for the unannotated open reading frame MJ0577, look first for homologous proteins in other organisms that may already be annotated. Secondarily, note any short regions that bear significant similarity to portions of one or more proteins in the database that have been biochemically characterized. In this example, we will restrict our interest to BLAST hits with E values less than or equal to 1.0.

Filter: Low complexity, * Human repeats
It is appropriate to filter most queries for low complexity sequences.
By taking an advance peek at the first alignment in the BLAST output, it can be seen that MJ0577 has no low complexity regions that are detected by the SEG filtering algorithm. Low complexity regions would appear as X's in the alignment of MJ0577 with itself. Since this is not a human sequence, the human repeat check box is left unchecked.

Some types of low complexity sequences may not be detected by the filtering option in BLAST. For example, coiled-coil and transmembrane regions need to be detected using the appropriate programs outside of BLAST. As an example, the COILS algorithm was used to analyze the MJ0577 open reading frame for the presence of coiled-coil regions. The analysis shows that MJ0577 does, in fact, have a coiled-coil region. Since coiled-coil encoding sequence can lead to matches with other coiled-coil proteins and thus obscure more meaningful hits, the user might consider manually masking the region to optimize the sensitivity of the search. To do this, the amino acids between aa 71 (SLLL) and aa 120 (IIVV) would be replaced with X's.

* Query Genetic Codes (blastx only)
When employing the blastx program (in which a translated nucleotide sequence is used as a query against a protein database), the genetic code to be used in the translation can be specified here. The standard genetic code is used by default. Since this tutorial employs blastp and not blastx, this option is not pertinent.

* Matrix, Gap existence cost, Per residue gap cost, Lambda ratio
BLOSUM62 is a general purpose matrix and the default choice in BLAST 2.0. The BLOSUM matrix assigns a probability score for each position in an alignment, based on the frequency with which that substitution is known to occur among consensus blocks within related proteins.
BLOSUM62 is among the best of the available matrices for detecting weak protein similarities. Other supported options include PAM30, PAM70, BLOSUM80, and BLOSUM45.

* Other advanced options
In the "Other advanced options" field it is possible to specify gap costs, word size, and other parameters not otherwise selectable on the query form. Output formatting options may also be adjusted here in case the formatting choices available through the form (see Step 4 below) are not adequate. For example, the user might enter an option here to cause 150 descriptions (rather than the 100 or 250 available through the pull-down menu) to be displayed. Find out how to specify these options using the details button.

Step 4. Set the output formatting options
Form fields: NCBI-gi, Graphical overview, Alignment view, Descriptions, Alignments.
These items are needed only for formatting. Note, however, that for queries with numerous significant hits in the selected database, the choice of a low number of descriptions or alignments may override the chosen E value threshold. For instance, a list of the 100 most significant hits (descriptions = 100) may (depending on the query) only contain sequences with E values less than 1. In that case, even though the E value threshold may have been set at 10, hits with E values between 1 and 10 will not be listed. In the current example, the number of descriptions to be displayed has been left at the default value of 100, and alignments have been set at 50 to save space.

Step 5. Perform the search
Form fields: Send reply to the Email address, In HTML format.
Click on the search button now to initiate the search. In a short time, the query sequence will have been compared with all of the entries in the specified database. Each comparison is scored and the top scores are listed in rank order. You will automatically be taken to an intermediate formatting page, from which you can change several of the formatting options. If no changes are desired, simply click on the format
button to see the results of your search.

USES OF BLAST

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.

Identifying Species
With the use of BLAST, you may be able to identify a species correctly and/or find homologous species. This can be useful, for example, when you are working with a DNA sequence from an unknown species.

Locating Domains
When working with a protein sequence, you can input it into BLAST to locate known domains within the sequence of interest.

Establishing Phylogeny
Using the results received through BLAST, you can create a phylogenetic tree using the BLAST web page.

DNA Mapping
When working with a known species and looking to sequence a gene at an unknown location, BLAST can compare the chromosomal position of the sequence of interest with relevant sequences in the database(s).

Comparison
When working with genes, BLAST can locate common genes in two related species, and can be used to map annotations from one organism to another.

SMITH-WATERMAN ALGORITHM

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences. Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure.

While both Smith-Waterman and BLAST are used to find homologous sequences by searching and comparing a query sequence with those in the databases, they do have their differences. Because BLAST is based on a heuristic algorithm, the hits it returns may not be the best possible results, and it will not necessarily report every hit within the database: BLAST misses hard-to-find matches. A better alternative for finding the best possible results is the Smith-Waterman algorithm.
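
To make the idea of comparing segments of all possible lengths concrete, the following Python sketch fills the Smith-Waterman scoring table and traces back the best local alignment. It is a minimal illustration under stated assumptions: the match, mismatch and gap values are hypothetical (a real protein search would use a substitution matrix such as BLOSUM62), a simple linear gap penalty is used, and the function name is chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(smith_waterman("GLKFA", "AAGLKFAWW"))
```

Every cell of the table is computed, so the running time grows with the product of the two sequence lengths. That exhaustive fill is the reason the method is slower than BLAST but does not miss the optimal local alignment.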
This method differs from the BLAST method in two areas: accuracy and speed. The Smith-Waterman option provides better accuracy, in that it finds matches that BLAST cannot, because it does not miss any information. This makes it valuable for detecting remote homology. However, compared with BLAST it is more time consuming and requires large amounts of computing time and memory. Fortunately, technologies have been found that dramatically reduce the time needed to perform a Smith-Waterman search. These include FPGA chips and SIMD technology.

FASTA

FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985. The original FASTP program was designed for protein sequence similarity searching. FASTA added the ability to do DNA:DNA searches and translated protein:DNA searches, and also provided a more sophisticated shuffling program for evaluating statistical significance. There are several programs in this package that allow the alignment of protein sequences and DNA sequences. FASTA is pronounced "fast A" and stands for "FAST-All", because it works with any alphabet, an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.

The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA (with frameshifts), and ordered or unordered peptide searches. Recent versions of the FASTA package include special translated search algorithms that correctly handle frameshift errors (which six-frame-translated searches do not handle very well) when comparing nucleotide to protein sequence data. In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm.

A major focus of the package is the calculation of accurate similarity statistics, so that biologists can judge whether an alignment is likely to have occurred by chance or whether it can be used to infer homology. The FASTA package is available from fasta.bioch.virginia.edu. A web interface for running FASTA searches against the online databases of the European Bioinformatics Institute (EBI) is also available. The FASTA file format used as input for this software is now widely used by other sequence database search tools (such as BLAST) and by sequence alignment programs.

CONCLUSION

REFERENCES
BIOINFORMATICS - D.R. Westhead, J.H. Parish and R.M. Twyman
APPENDIX - GLOSSARY
Alignment: The process of lining up two or more sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Algorithm: A fixed procedure embodied in a computer program.
Bioinformatics: The merger of biotechnology and information technology with the goal of revealing new insights and principles in biology.
Bit score: A value S' derived from the raw alignment score S, in which the statistical properties of the scoring system used have been taken into account. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
BLAST: Basic Local Alignment Search Tool (Altschul et al.). A sequence comparison algorithm, optimized for speed, used to search sequence databases for optimal local alignments to a query. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.
BLOSUM: Blocks Substitution Matrix (Henikoff and Henikoff). A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members.
Conservation: Changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Domain: A discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
DUST: A program for filtering low complexity regions from nucleic acid sequences.
E value: Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
FASTA: The first widely used algorithm for database similarity searching (Pearson and Lipman). The program looks for optimal local alignments by scanning the sequence for small matches called "words". Initially, the scores of segments in which there are multiple word hits are calculated ("init1"). Later the scores of several segments may be summed to generate an "initn" score. An optimized alignment that includes gaps is shown in the output as "opt". The sensitivity and speed of the search are inversely related and controlled by the "k-tup" variable, which specifies the size of a "word".
Filtering: Also known as Masking. The process of hiding regions of (nucleic acid or amino acid) sequence having characteristics that frequently lead to spurious high scores. See SEG and DUST.
Gap: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
Global Alignment: The alignment of two nucleic acid or protein sequences over their entire length.
H: The relative entropy of the target and background residue frequencies (Karlin and Altschul, 1990). H can be thought of as a measure of the average information (in bits) available per position that distinguishes an alignment from chance. At high values of H, short alignments can be distinguished from chance, whereas at lower H values a longer alignment may be necessary (Altschul, 1991).
Homology: Similarity attributed to descent from a common ancestor.
HSP: High-scoring segment pair. Local alignments with no gaps that achieve one of the top alignment scores in a given search.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
K: A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for search space size. The value K is used in converting a raw score (S) to a bit score (S').
Lambda: A statistical parameter used in calculating BLAST scores that can be thought of as a natural scale for the scoring system. The value lambda is used in converting a raw score (S) to a bit score (S').
Local Alignment: The alignment of some portion of two nucleic acid or protein sequences.
Low Complexity Region (LCR): Regions of biased composition including homopolymeric runs, short-period repeats, and more subtle overrepresentation of one or a few residues. The SEG program is used to mask or filter LCRs in amino acid queries. The DUST program is used to mask or filter LCRs in nucleic acid queries.
Masking: Also known as Filtering. The removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence.
Motif: A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains.
Multiple Sequence Alignment: An alignment of three or more sequences with gaps inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column. Clustal W is one of the most widely used multiple sequence alignment programs.
Optimal Alignment: An alignment of two sequences with the highest possible score.
Orthologous: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
P value: The probability of an alignment occurring with the score in question or better. The P value is calculated by relating the observed alignment score, S, to the expected distribution of HSP scores from comparisons of random sequences of the same length and composition as the query to the database. The most highly significant P values will be those close to 0. P values and E values are different ways of representing the significance of the alignment.
PAM: Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount of evolutionary change in a protein sequence. One PAM unit is the amount of evolution that will change, on average, 1% of the amino acids in a protein sequence. A PAM(x) substitution matrix is a look-up table in which scores for each amino acid substitution have been calculated based on the frequency of that substitution in closely related proteins that have experienced a certain amount (x) of evolutionary divergence.
Paralogous: Homologous sequences within a single species that arose by gene duplication.
Profile: A table that lists the frequencies of each amino acid in each position of a protein sequence. Frequencies are calculated from multiple alignments of sequences containing a domain of interest. See also PSSM.
Proteomics: Systematic analysis of protein expression of normal and diseased tissues that involves the separation, identification and characterization of all of the proteins in an organism.
PSI-BLAST: Position-Specific Iterative BLAST (Altschul et al.). An iterative search using the BLAST algorithm. A profile is built after the initial search and is then used in subsequent searches. The process may be repeated, if desired, with new sequences found in each cycle used to refine the profile.
PSSM: Position-specific scoring matrix; see Profile. The PSSM gives the log-odds score for finding a particular matching amino acid in a target sequence.
Query: The input sequence (or other type of search term) with which all of the entries in a database are to be compared.
Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST, similarity refers to a positive matrix score.
SEG: A program for filtering low complexity regions in amino acid sequences (Wootton and Federhen). Residues that have been masked are represented as "X" in an alignment. SEG filtering is performed by default in the blastp subroutine of BLAST 2.0.
Substitution: The presence of a non-identical amino acid at a given position in an alignment. If the aligned residues have similar physico-chemical properties, the substitution is said to be "conservative".
Substitution Matrix: A matrix containing values proportional to the probability that amino acid i mutates into amino acid j, for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution.
Unitary Matrix: Also known as Identity Matrix. A scoring system in which only identical characters receive a positive score.
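Several of the glossary terms (local alignment, optimal alignment, gap and unitary matrix) can be illustrated together with a small worked example. The following is a minimal sketch of Smith-Waterman local alignment in Python, not the implementation used by SSEARCH or any other package. The function name smith_waterman and the scoring values (match +1, mismatch -1, a fixed penalty of -2 per gap position) are assumptions chosen for illustration; real searches normally use a substitution matrix such as BLOSUM62 and separate gap opening and extension penalties.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal local alignment.

    Illustrative scoring only: a unitary-style matrix (only identical
    characters score positively) and a fixed penalty per gap position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score of a local alignment ending at
    # a[i-1] and b[j-1]; it never drops below zero.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a new alignment here
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("TTACGTTT", "GGACGGG"))  # (3, 'ACG', 'ACG')
```

Because every cell of the table is filled in, the running time and memory grow with the product of the two sequence lengths, which is why this exhaustive approach is slower than heuristic word-based searches such as BLAST and FASTA.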