Bioinformatics – An Overview

Kudipudi.Srinivas
Research Scholar, Dept of Computer Science, S.V.K.P & Dr.K.S Raju Arts & Science College, Penugonda-534320, India
Kudipudi_sri@yahoo.com

ABSTRACT: This presentation gives an overview of Bioinformatics, covering the major databases available online as well as at major research centers. The major databases, called mother databases, are the nucleic acid databases and the protein sequence databases. Bioinformatics has been visualized as an interface between biological information and information technology, employed for protein sequencing, DNA sequencing, etc. The transcription and translation processes are explained through the central dogma of molecular biology, which states that the sequence of a strand of DNA corresponds to the amino acid sequence of a protein. Two or more sequences can be compared by alignment methods such as pairwise and multiple alignment. Database search tools such as BLAST and FASTA perform intensive pairwise alignment of a query sequence against all the database sequence entries and report the sequences with the best scores. Phylogenetic methods are used to reconstruct the relationships between macromolecular sequences, finding the genetic connections and relationships between species. The paper also explains the application of bioinformatics in various industries, e.g. food, pharmaceutical, agricultural and medical, and the technologies that have enabled the analysis of biological problems in multiple dimensions.

Keywords: Protein, DNA, FASTA, BLAST, Phylogenetic Tree, Orthologous

Introduction:
• Bioinformatics is the application of computational techniques to the management and analysis of biological information.
• Bioinformatics describes the use of computational techniques to access, analyze, and interpret the biological information in any of the available biological databases.
1. DATABASES:

1.1. Primary Databases
Sequences obtained by various sequencing techniques, such as
• EST: Expressed Sequence Tags
• GSS: Genome Survey Sequences
• STS: Sequence Tagged Sites
• HTG: High Throughput Genomic Sequences
have been deposited in different nucleic acid and protein databases, which can be accessed by people all over the world through the World Wide Web. The major databases, called mother databases, are the nucleic acid and protein sequence databases.

1.1.1. Nucleic Acid Databases:
A nucleic acid sequence database holds annotated nucleic acid sequences (DNA and RNA), with information such as the source organism and region and the date on which the sequence was determined. The major nucleic acid databases are:
• European Molecular Biology Laboratory (EMBL) http://www.ebi.ac.uk/
• GenBank (National Center for Biotechnology Information, NCBI) http://www.ncbi.nlm.nih.gov/
• DNA Data Bank of Japan (DDBJ) http://www.ddbj.nig.ac.jp/
These three databases collaborate with one another and exchange data every day.

1.1.2. Protein Sequence Databases:
A protein sequence database holds information on proteins that have been translated from RNA sequences and on proteins sequenced directly by methods such as N-terminal sequencing. The major protein sequence databases are:
• Protein Information Resource (PIR) http://pir.georgetown.edu/
• Swiss-Prot http://us.expasy.org/sprot/
1.2. Secondary Databases:
Derived databases, which are built from the sequence information available in the primary databases, are called secondary databases. Fine examples of secondary databases are:
• CUTG: Codon Usage Database of Japan
• COGs: Clusters of Orthologous Groups of proteins, from NCBI
• PROSITE, which describes motifs as regular expressions
• PRINTS, which holds aligned motifs
• BLOCKS, which holds aligned motifs as blocks

1.3. Structure Databases:
The major structure databases hold the structural data of proteins or DNA whose structure has been determined by either X-ray crystallography or NMR (Nuclear Magnetic Resonance). The Protein Data Bank gives details of the coordinates, bond angles and torsion angles of various proteins, and the Nucleic Acid Database gives the same details for DNA and its forms, i.e. A-DNA, B-DNA, etc.
• Protein Data Bank (PDB) http://www.rcsb.org/pdb/
• The Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu/NDB/ndb.html
• Cambridge Structural Database (CSD) http://www.ccdc.cam.ac.uk/

These databases are an organized way to store the tremendous amount of sequence information that accumulates from laboratories worldwide. Each database has its own specific format. Three major database organizations around the world are responsible for maintaining most of this data; they largely ‘mirror’ one another.
2. The Central Dogma of Biology:

Central Dogma: Flow of Information
The flow of information is explained by the central dogma of molecular biology, which states that the sequence of a strand of DNA corresponds to the amino acid sequence of a protein.

2.1. Transcription
Transcription is the process by which messenger RNA (mRNA) molecules are synthesized from DNA molecules. Transcription takes place in the nucleus. During transcription only one of the strands of DNA corresponding to a gene (the template strand) is copied into mRNA. This mRNA molecule is complementary to the bases that compose the template strand. The mRNA molecules have short lives. They travel out to the cytoplasm, where they direct the synthesis of a protein, and then they are destroyed.
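The copying step described above can be sketched in a few lines of Python. This is only an illustration, not a model of the cellular machinery: the function name `transcribe` and the example sequence are hypothetical, and the template strand is assumed to be written 3′ to 5′ so that the mRNA reads 5′ to 3′. It relies on the standard pairing rule in which A in the DNA template pairs with U in the RNA, T with A, C with G and G with C.

```python
# Pairing rule used during transcription: DNA template base -> RNA base.
PAIRING = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5: str) -> str:
    """Return the mRNA (read 5' to 3') complementary to a DNA template
    strand that is written 3' to 5' (an assumption of this sketch)."""
    return "".join(PAIRING[base] for base in template_3_to_5.upper())


if __name__ == "__main__":
    # Hypothetical template strand chosen for illustration only.
    print(transcribe("TACCGGATT"))
```

Each template base is mapped independently, so the resulting mRNA is single stranded and has the same length as the template.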
Transcription depends on complementary base pairing. A in the DNA template pairs with U in the mRNA, T with A, C with G and G with C. Only one of the DNA strands is transcribed, and therefore the resulting mRNA molecule is single stranded. The amount of transcription of any given gene can be directly controlled by the cell. Once the mRNA molecules leave the nucleus and enter the cytoplasm, they are loaded onto the ribosome. It is at the ribosomes that protein synthesis occurs, by a process called translation. The ribosomes are composed of ribosomal RNA (rRNA) and ribosomal proteins.

2.2. Translation
Translation is the process by which mRNA molecules are translated into proteins at the ribosome. The nucleotides of the mRNA molecule are read by the ribosome so that each set of three nucleotides, called a codon, specifies a single amino acid. Therefore, the first three nucleotides of the coding sequence encode the first amino acid, the next three bases the second amino acid, and so on. The rules by which the base sequence of the mRNA molecule is translated into the primary amino acid sequence of a protein are called the genetic code.

There are 64 different possible codons (there are 4 bases, A, U, C and G, and each codon has 3 bases, so 4^3 = 64) and 20 amino acids. Some amino acids are coded for by more than one codon, and therefore the genetic code is said to be degenerate. No codon codes for more than one amino acid.

Three of the codons do not specify the incorporation of any amino acid. These are known as the stop codons: UAA, UAG and UGA. They are found at the end of the mRNA coding sequence, and they tell the ribosome to stop translating the message and release the protein. The mRNA is translated from the 5′ end and read one codon at a time towards the 3′ end. Translation usually starts at a start codon (AUG), which codes for methionine. Each successive codon is read and the corresponding amino acid incorporated into the protein chain until a stop codon is encountered.
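The codon arithmetic and the reading rule above can be checked with a short Python sketch. It builds the standard genetic code as a 64-entry table and reads an mRNA codon by codon from the first AUG until a stop codon. The names `CODON_TABLE` and `translate` and the example sequence are hypothetical, amino acids are written in one-letter codes with `*` marking a stop codon, and the sketch ignores everything a real cell does beyond table lookup.

```python
from itertools import product

# Standard genetic code, listed with bases in the order U, C, A, G.
# One-letter amino acid codes; "*" marks the three stop codons.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with U
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA (5' to 3') starting at the first AUG and
    stopping at the first in-frame stop codon or the end of the sequence."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the protein
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))            # number of possible codons
    print(translate("GGAUGGCCUAAGG"))  # hypothetical example sequence
```

The table makes the degeneracy visible: 61 codons are shared among 20 amino acids (leucine alone has six), while each codon maps to exactly one entry.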
The codons in an mRNA molecule do not directly recognize the amino acids that must be incorporated. Instead this process is directed by a group of adapter molecules called transfer RNAs (tRNAs). Every codon, except the stop codons, is recognized by a tRNA molecule. A tRNA molecule has an anti-codon end, which is made of a set of three bases. These bases can pair with the complementary codon in the mRNA. The 3′ end of a
tRNA molecule is attached to an amino acid. In the translation process, a ribosome reads an mRNA molecule codon by codon. At each codon, a tRNA molecule with an anti-codon complementary to that codon attaches to the mRNA. It brings with it the appropriate amino acid, which is then incorporated into the growing polypeptide chain. Once the amino acid has been added, the tRNA molecule is released and the ribosome moves on to read the next codon in the mRNA chain. This process continues until the ribosome reads a stop codon. At this point the ribosome releases the mRNA molecule and the completed protein. The tRNA molecule functions as an interpreter, reading codons in the mRNA molecule and translating them into amino acids. In this way, the sequence of base pairs in a given gene determines the amino acid sequence of the protein.

3. Alignment:
An alignment is a representation of two or more protein or nucleotide sequences in which homologous amino acids or nucleotides are placed in the same columns and missing amino acids or nucleotides are replaced with gaps.

3.1. Pairwise Alignment:
In pairwise alignment only two sequences are compared. Two sequences can be compared either by global alignment or by local alignment. In global alignment the sequences are aligned over their entire length to get the maximum number of matches and the minimum number of gaps. In local alignment, the alignment is restricted to the region that shows the highest similarity. Local alignment uses the Smith-Waterman algorithm and global alignment uses the Needleman-Wunsch algorithm. The best alignment is the one with the maximum score, where positive scores are given for matches and negative scores for gaps and mismatches.

Pairwise alignment is used to infer the function of unknown genes or proteins by finding similar sequences of known function. This is done by comparing the unknown sequence with the whole nucleic acid or protein databases.
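The scoring idea behind global and local alignment can be sketched as two small dynamic-programming functions. This is a simplified illustration under stated assumptions, not the implementation used by any particular tool: the function names `global_score` and `local_score` are hypothetical, the scoring values (+1 match, -1 mismatch, -1 per gap position) are arbitrary choices, and only the best score is returned, not the alignment itself.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    # Row 0: aligning an empty prefix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Smith-Waterman style score: best-scoring pair of sub-regions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))
    print(local_score("AAACGTAAA", "CCACGTCC"))
```

The only difference between the two functions is the floor at zero and where the result is read from: the global score is taken from the final cell, while the local score is the maximum over all cells.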
Database search tools such as BLAST and FASTA perform intensive pairwise alignment of the query sequence against all the database sequence entries and report the sequences with the best scores.
3.2. Multiple Alignment:
Multiple alignment, in which more than two sequences are compared, is used for finding conserved regions among gene sequences and protein sequences, and for studying the phylogenetic relationships of macromolecular sequences, i.e. for finding evolutionarily related organisms. The major multiple alignment programs are ClustalW, ClustalX and T-Coffee.

ClustalW: A general purpose multiple sequence alignment program for DNA or protein sequences. It gives biologically meaningful multiple sequence alignments of divergent sequences, calculates the best match for the selected sequences, and lines them up so that the identities, similarities and differences can be seen. The cladograms or phylograms obtained are used to see the evolutionary relationships between species. ClustalW can be either downloaded or used online at http://www.ebi.ac.uk/clustalW/. ClustalX is the X-window based, user-friendly version of ClustalW, which can be downloaded and used locally. T-Coffee is more accurate than ClustalW for sequences with less than 30% identity, but it is slower. http://www.ch.embnet.org/software/TCoffee.html

Basic Local Alignment Search Tool (BLAST): BLAST is a heuristic search algorithm for sequence similarity searching, for example to identify homologs of a query sequence. If a particular sequence is submitted to the BLAST program, it is searched against all the sequences of the database of the user's choice, and the result lists those sequences whose percent identity exceeds a particular threshold value. The threshold value is set by the user.

BLASTing protein sequences: This is what we want to do if we already have a protein sequence and want to find other, similar protein sequences in a sequence database. Two flavors of BLAST deal with protein queries:
• blastp: compares a protein sequence with a protein database.
• tblastn: compares a protein sequence with a nucleotide database.
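The thresholding step described above can be illustrated with a small Python sketch. It is not BLAST itself and makes no claim about how BLAST computes or ranks its hits (real BLAST output also reports statistics such as E-values and bit scores). The function names `percent_identity` and `filter_hits`, the hit names and the sequences are hypothetical, and each hit is assumed to be supplied as an already aligned pair of equal-length strings with `-` marking gaps.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percentage of alignment columns in which both sequences carry the
    same residue. Gap columns ('-') never count as identical."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    if columns == 0:
        return 0.0
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * identical / columns


def filter_hits(hits, threshold: float):
    """Keep hits whose percent identity exceeds `threshold`, best first.

    `hits` is a list of (name, aligned_query, aligned_subject) tuples.
    """
    scored = [(name, percent_identity(q, s)) for name, q, s in hits]
    kept = [(name, pid) for name, pid in scored if pid > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical hits, for illustration only.
    hits = [
        ("hit_a", "ACGT", "ACGT"),
        ("hit_b", "ACGT", "TTTT"),
        ("hit_c", "ACGT", "ACGA"),
    ]
    print(filter_hits(hits, threshold=50))
```

Dividing by the number of alignment columns is one of several possible conventions; dividing by the length of the shorter sequence would give different values for gapped alignments.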
FASTA: FASTA was the first widely used program for database similarity searching. For nucleotide searches, FASTA may be more sensitive than BLAST. FASTA can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. The FASTA submission form is available at http://www.ebi.ac.uk/fasta33/
4. Phylogenetic Analysis:
Phylogenetic methods are used to reconstruct the relationships between macromolecular sequences, finding the genetic connections and relationships between species. The results of a phylogenetic analysis may be depicted as a hierarchical branching diagram, a ‘cladogram’ or ‘phylogenetic tree’. Programs for phylogenetic analysis are available at http://evolution.genetics.washington.edu/phylip.html. This software (PHYLIP) can be downloaded free of cost and used locally, or it can be used online at http://bioportal.bic.nus.edu.sg/phylip/. TreeView and PhyloDraw are the major user-friendly programs for displaying the hierarchical clustering in different formats, for publication and easy analysis. Besides PHYLIP, other programs such as PAUP, MEGA, TreeconW and Winboot are popular for phylogenetic analysis.

5. Applications of Bioinformatics

5.1. Food Industry:
Functional genomics is playing a major role in the food biotechnology industry. The complete genome sequence information available in different databases can be used for finding metabolic pathways and various digestive enzymes, improving cell factories, and developing novel preservation methods. Information about the various microbes that assist in food digestion, such as E. coli, also plays a vital role in the major achievements of the food industry using Bioinformatics.

5.2. Agriculture:
Crops are improved by producing plants that carry genes conferring resistance to pathogens such as fungi and bacteria. Homology searches, the finding of conserved motifs, and molecular modeling are useful in identifying disease resistance genes. Pesticides and insecticides that can efficiently kill pathogens and pests are designed by molecular modeling.

5.3. Pharmaceutical industry and Medical science:
Bioinformatics, computational biology and cheminformatics are playing a key role in the pharmaceutical industry in identifying new drug targets from genomic data at a much faster rate.
Disease-causing genes are identified using the tools of genomics and proteomics. Drug lead identification and drug optimization have become easier using the same tools. Besides drugs, the pharmaceutical industry is using sequence information in the production of vaccines and therapeutic proteins. The process of designing a new drug using bioinformatics
tools has been of great help in identifying the target disease and interesting lead compounds, and, through docking studies, in finding the effective interaction between the drug and its target.

Pharmacoinformatics is the area of Medical Informatics concerned with modeling and simulation of the behavior of drugs, and with control of such behavior by individualized dosage regimens for each patient to achieve explicitly chosen therapeutic goals. The credibility of serum concentration data is a major factor in such modeling.

Medical informatics is a scientific discipline concerned with the systematic processing of data, information and knowledge in medicine and health care. Computerization of the patient record is expected to resolve long-standing problems with the current paper-based system.

6. Bioinformatics in India
In India there are various research and development units, centers and sub-centers, and pharmaceutical industries doing research on various aspects of bioinformatics, such as proteomics, genomics, development of sequence analysis tools, molecular modeling and drug design. The Department of Biotechnology (DBT), New Delhi, has emphasized starting Bioinformatics centers with the help of BTISnet (Biotechnology Information System) for the proper application of Bioinformatics in various sectors of science and technology for the benefit of researchers. DBT has sponsored various Bioinformatics Distributed Information Centers (DICs) and Distributed Information Sub-Centers (Sub-DICs) all over India. The lists of the DICs and the Sub-DICs can be seen at the following websites:
http://dbtindia.nic.in/btis/dic.html
http://dbtindia.nic.in/bits/subdic.html

References:
1. Bioinformatics – A Beginner’s Guide by Jean-Michel Claverie, PhD & Cedric Notredame, PhD
2. Introduction to Bioinformatics by Arthu