916215 bioinformatics-over-view

Bioinformatics – An Overview
Kudipudi.Srinivas

Research Scholar, Dept of Computer Science, S.V.K.P & Dr.K.S Raju Arts & Science College, Penugonda-534320, India
Kudipudi_sri@yahoo.com

ABSTRACT: This presentation gives an overview of bioinformatics, covering the major databases
available online as well as at major research centers. The major databases, called mother databases,
are the nucleic acid databases and the protein sequence databases. Bioinformatics has been described
as an interface between biological information and information technology, employed for
protein sequencing, DNA sequencing, etc. The processes of transcription and translation
are explained by the central dogma of molecular biology, which states that the sequence of a strand
of DNA corresponds to the amino acid sequence of a protein. Two or more
sequences can be compared by alignment methods such as pairwise and multiple alignment.
Database search tools like BLAST and FASTA perform intensive pairwise
alignment of a query sequence against all the sequence entries in a database and report the sequences
with the best scores. Phylogenetic methods are used to reconstruct the relationships between
macromolecular sequences, revealing the genetic connections and relationships between species. The
paper also explains the application of bioinformatics in various industries, e.g. food,
pharmaceutical, agricultural and medical, and the technologies that have enabled the analysis of
biological problems in multiple dimensions.

Keywords: Protein, DNA, FASTA, BLAST, Phylogenetic Tree, Orthologous

Introduction:
• Bioinformatics is the application of computational techniques to the management and analysis
of biological information.

• Bioinformatics describes using computational techniques to access, analyze, and interpret the
biological information in any of the available biological databases.

1. DATABASES:
1.1. Primary Databases
Sequences obtained by various sequencing techniques, such as
• EST: Expressed Sequence Tags
• GSS: Genome Survey Sequences
• STS: Sequence Tagged Sites and
• HTG: High Throughput Genomic sequences
have been deposited in different nucleic acid and protein databases, which can be accessed by
people all over the world through the World Wide Web. The major databases, called mother
databases, are the nucleic acid and protein sequence databases.

1.1.1. Nucleic Acid Databases:
The nucleic acid sequence databases consist of annotated entries for
nucleic acid sequences (DNA and RNA), with information such as the source organism, the
sequenced region and the date on which the sequence was determined.
The major nucleic acid databases are:
• European Molecular Biology Laboratory (EMBL)
http://www.ebi.ac.uk/
• GenBank (National Center for Biotechnology Information, NCBI)
http://www.ncbi.nlm.nih.gov/
• DNA Data Bank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp/
These three databases collaborate with one another and exchange data
every day.

1.1.2. Protein Sequence Databases:
A protein sequence database consists of information of all the proteins that have been
translated from the RNA sequences and the proteins sequenced by methods like N-terminal
sequencing.
The major protein sequence databases are
• Protein Information Resource(PIR)
http://pir.georgetown.edu/
• Swiss-Prot
http://us.expasy.org/sprot/

1.2. Secondary Databases:

The derived databases, which are obtained by making use of the sequence information
available in the primary databases, are called secondary databases. Fine examples of secondary
databases are:
• CUTG: Codon Usage Database of Japan
• COGs: Clusters of Orthologous Groups of proteins, from NCBI
• PROSITE, which stores motifs as regular expressions
• PRINTS, which stores aligned motifs, and
• BLOCKS, which stores aligned motifs as blocks.
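PROSITE motifs are written in a compact pattern syntax that maps closely onto ordinary regular expressions. The sketch below is only an illustration: the function name `prosite_to_regex` and the test sequence are assumptions made here, and only the common syntax elements (x, [..], {..}, repeat counts and terminal anchors) are handled.

```python
import re

def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression string."""
    regex = pattern.rstrip(".")                          # optional trailing period
    regex = regex.replace("-", "")                       # '-' only separates elements
    regex = regex.replace("{", "[^").replace("}", "]")   # {P}   -> any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) -> repeat counts
    regex = regex.replace("x", ".")                      # x     -> any residue
    regex = regex.replace("<", "^").replace(">", "$")    # N- / C-terminal anchors
    return regex

# Pattern for an N-glycosylation site, applied to a made-up sequence.
site = re.compile(prosite_to_regex("N-{P}-[ST]-{P}."))
for match in site.finditer("MKNGSAANPTQ"):
    print(match.start(), match.group())  # 2 NGSA
```

The second asparagine in the test sequence is followed by proline, so the pattern rejects it, which is exactly what the {P} element expresses.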

1.3. Structure Databases:
The major structure databases consist of the structural data of proteins or DNA whose
structure has been determined by either X-ray crystallography or NMR (Nuclear Magnetic
Resonance). The Protein Data Bank gives details of the coordinates, bond angles and torsion angles of
various proteins, and the Nucleic Acid Database gives the same details for DNA and its forms, i.e.,
A-DNA, B-DNA, etc.
Protein Data Bank (PDB)
http://www.rcsb.org/pdb/
The Nucleic Acid Database (NDB)
http://ndbserver.rutgers.edu/NDB/ndb.html
Cambridge Structural Database (CSD)
http://www.ccdc.cam.ac.uk/

These databases are an organized way to store the tremendous amount of sequence
information that accumulates from laboratories worldwide. Each database has its own specific
format. Three major database organizations around the world are responsible for maintaining most of
this data; they largely ‘mirror’ one another.

2. The Central Dogma of Biology:

Central Dogma: Flow of Information

The flow of genetic information is described by the central dogma of molecular biology, which
states that the sequence of a strand of DNA corresponds to the amino acid sequence of a protein.

2.1. Transcription

Transcription is the process in which messenger RNA (mRNA) molecules are synthesized
from DNA molecules. In eukaryotic cells, transcription takes place in the nucleus. During
transcription only one of the strands of DNA corresponding to a gene (the template strand) is copied
into mRNA. This mRNA molecule is complementary to the bases that compose the template strand. The
mRNA molecules have short lives. They travel out to the cytoplasm, where they direct the synthesis of a
protein, and then they are destroyed.

Transcription depends on complementary base pairing. A in the DNA template pairs with U in the
mRNA, T with A, C with G and G with C. Only one of the DNA strands is transcribed and therefore the
resulting mRNA molecule is single stranded. The amount of transcription of any given gene can be
directly controlled by the cell. Once the mRNA molecules leave the nucleus and enter the cytoplasm, they
are loaded onto the ribosomes. It is at the ribosomes that protein synthesis occurs by a process
called translation. The ribosomes are composed of ribosomal RNA (rRNA) and ribosomal
proteins.
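The base-pairing rule used in transcription can be sketched in a few lines. This is a minimal illustration, not a model of the cellular machinery: the function name `transcribe` and the example template are assumptions, and the template strand is assumed to be written 3' to 5' so that the mRNA comes out 5' to 3'.

```python
# DNA template base -> complementary RNA base
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe(template_3_to_5):
    """Return the mRNA (5'->3') copied from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())

print(transcribe("TACGGCATT"))  # AUGCCGUAA
```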

2.2. Translation

Translation is the process in which mRNA molecules are translated into proteins at the ribosome.
The nucleotides of the mRNA molecule are read by the ribosome so that each set of three nucleotides,
called a codon, specifies a single amino acid. Therefore, the first three nucleotides of the mRNA
encode the first amino acid, the second three bases the second amino acid, and so on. The rules by
which the base sequence of the mRNA molecule is translated into the primary amino acid sequence of a
protein are called the genetic code.
There are 64 different possible codons (this is because there are 4 bases: A, U, C, G, and
each codon has 3 bases, so 4^3 = 64) and 20 amino acids. Most amino acids are specified by more
than one codon and therefore the genetic code is said to be degenerate. No codon codes for more
than one amino acid.
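The counting argument can be checked by enumerating the codons directly. The sketch below assumes the standard genetic code, written as the usual 64-letter string with codons listed in U, C, A, G order; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(c) for c in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
codons_per_amino_acid = Counter(aa for aa in codon_table.values() if aa != "*")

print(len(codons))                 # 64
print(stop_codons)                 # ['UAA', 'UAG', 'UGA']
print(len(codons_per_amino_acid))  # 20
print(codons_per_amino_acid["L"])  # 6
```

Each codon appears once as a dictionary key, so it maps to exactly one amino acid, while an amino acid such as leucine (L) is reached from several codons. That asymmetry is what degeneracy means.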
Three of the codons do not specify the incorporation of any amino acids. These are known
as the stop codons - UAA, UAG and UGA. They are found at the end of the mRNA coding
sequence and they tell the ribosome to stop translating the message and release the protein. The
mRNA is translated from the 5' end and read one codon at a time to the 3' end. Translation
usually starts at a start codon (AUG) which codes for methionine.
Each successive codon is read and the amino acid incorporated into the protein chain until
a stop codon is encountered. The codons in an mRNA molecule do not directly recognize the
amino acids that must be incorporated. Instead this process is directed by a group of adapter
molecules called transfer RNAs (tRNAs). Each codon, except the stop codons, is recognized by a tRNA
molecule. A tRNA molecule has an anti-codon end, which is made of a set of three bases.
These bases can pair with the complementary codon in the mRNA. The 3' end of a
tRNA molecule is attached to an amino acid. In the translation process, a ribosome reads an
mRNA molecule codon by codon.

At each codon, a tRNA molecule with an anti-codon complementary to that codon attaches
to the mRNA. It brings with it the appropriate amino acid that is then incorporated into the growing
polypeptide chain. Once the amino acid has been added, the tRNA molecule is released and the
ribosome moves onto reading the next codon in the mRNA chain. This process continues until the
ribosome reads a stop codon. At this point the ribosome releases the mRNA molecule and the
completed protein. The tRNA molecule functions as an interpreter reading codons in the mRNA
molecule and translating them into amino acids. In this way, the sequence of base pairs in a given
gene determines the amino acid sequence of the protein.
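The reading of an mRNA codon by codon can be sketched as follows. This is a simplified illustration under stated assumptions: the standard genetic code is used, translation starts at the first AUG found, and the function name `translate` and the example sequence are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))

def translate(mrna):
    """Translate an mRNA (5'->3') from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: release the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGCCGUUUUAAGC"))  # MPF
```

The dictionary lookup plays the role of the tRNA adapter: it is the only place where a codon is turned into an amino acid.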

3. Alignment:
An alignment is a representation of two or more protein or nucleotide sequences in which
homologous amino acids or nucleotides are placed in the same columns, while missing amino acids or
nucleotides are replaced with gaps.

3.1. Pairwise Alignment:
In pairwise alignment, only two sequences are compared. Two sequences can be
compared either by global alignment or by local alignment. In global alignment the sequences are
aligned over their entire length to get the maximum number of matches and the minimum number of
gaps. In local alignment, the alignment is restricted to the region that has the
highest similarity. Local alignment uses the Smith-Waterman algorithm and
global alignment uses the Needleman-Wunsch algorithm. The best alignment is the
one with the maximum score, where positive scores are given for matches and negative scores for gaps
and mismatches.
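A global alignment of the Needleman-Wunsch kind can be sketched with a small dynamic-programming table. The scoring values (match +1, mismatch -1, gap -2), the linear gap penalty and the function name are assumptions chosen for illustration; real tools use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a prefix aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)  # 4
print(top)
print(bottom)
```

With these values the best score for the example is 4: six matches and one unavoidable gap. Several alignments can share that score, and the traceback returns one of them.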
Pairwise alignment is used to find the function of unknown genes or proteins by finding similar
sequences of known function. This is done by comparing the unknown sequence with all the sequences
in the nucleic acid or protein databases. Database search tools like BLAST and FASTA
perform intensive pairwise alignment of a query sequence against all the database
sequence entries and report the sequences with the best scores.
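The idea of scoring a query against every database entry and reporting the best hits can be sketched as below. This is a deliberately naive illustration: it runs a full Smith-Waterman local alignment on each entry, whereas BLAST and FASTA use faster heuristics. The scoring values, function names and the three toy entries are assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best

def search(query, database):
    """Rank database entries by local alignment score against the query, best first."""
    hits = [(smith_waterman_score(query, seq), name) for name, seq in database.items()]
    return sorted(hits, reverse=True)

database = {  # hypothetical toy entries
    "seq1": "TTGACCGATTACAGGT",
    "seq2": "CCCCCCCC",
    "seq3": "AGATTAGA",
}
for score, name in search("GATTACA", database):
    print(name, score)
```

Here seq1 contains the query exactly and ranks first, seq3 differs by one base, and seq2 shares only a single base with the query.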

3.2. Multiple Alignment:
Multiple alignment, in which more than two sequences are compared, is used for finding
conserved regions among gene sequences and protein sequences, and for studying the phylogenetic
relationships of macromolecular sequences, i.e., to find evolutionarily related organisms. The major
multiple alignment programs are ClustalW, ClustalX and T-Coffee.

ClustalW: It is a general purpose multiple sequence alignment program for DNA or protein
sequences. It gives biologically meaningful multiple sequence alignments of divergent sequences,
calculates the best match for the selected sequences, and lines them up so that the identities,
similarities and differences can be seen. The cladograms or phylograms obtained are used to see the
evolutionary relationships between species. ClustalW can be either downloaded or used online at
http://www.ebi.ac.uk/clustalW/. ClustalX is the X-window based, user-friendly version of ClustalW,
which can be downloaded and used locally on our machine. T-Coffee is more accurate than ClustalW
for sequences with less than 30% identity, but it is slower.
http://www.ch.embnet.org/software/TCoffee.html
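Once sequences have been aligned, conserved regions can be read off column by column. The sketch below marks fully identical columns with '*', in the spirit of the consensus line printed under Clustal alignments. It does not perform the alignment itself, and the function name and the three aligned sequences are made up for illustration.

```python
def conservation_line(aligned):
    """Return a string with '*' under every column where all sequences agree."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)

alignment = ["MKT-AYIA", "MKTQAYLA", "MRT-AYIA"]  # hypothetical, already aligned
for seq in alignment:
    print(seq)
print(conservation_line(alignment))
```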

Basic Local Alignment Search Tool (BLAST):
BLAST is a heuristic search algorithm for sequence similarity searching, for example to
identify homologs of a query sequence. When a sequence is submitted to the BLAST program, it
is searched against all the sequences in a database of the user's choice, and the result lists those
sequences whose similarity to the query passes a particular threshold value (typically an
expectation value, or E-value, cutoff). The threshold value is set by the user.
BLASTing Protein sequences:
BLASTing protein sequences is what we want to do if we already have a protein sequence
and we want to find other similar protein sequences in a sequence database. Two flavors of
BLAST that take a protein sequence as the query are:
blastp : Compares a protein sequence with a protein database.
tblastn : Compares a protein sequence with a nucleotide database.

FASTA:
FASTA was the first widely used program for database similarity searching. For nucleotide
searches, FASTA may be more sensitive than BLAST. FASTA can be very specific when identifying
long regions of low similarity, especially for highly diverged sequences. The FASTA submission form
is available at http://www.ebi.ac.uk/fasta33/

4. Phylogenetic Analysis:
Phylogenetic methods are used to reconstruct the relationships between macromolecular
sequences, revealing the genetic connections and relationships between species. The results of
phylogenetic analysis may be depicted as a hierarchical branching diagram, a ‘cladogram’ or
‘phylogenetic tree’. The PHYLIP package of programs for phylogenetic analysis is available at
http://evolution.genetics.washington.edu/phylip.html. This software can be downloaded free of cost
and used locally, or it can be used online at http://bioportal.bic.nus.edu.sg/phylip/. TreeView and
PhyloDraw are the major user-friendly programs for displaying the hierarchical clustering in different
formats for publication and easy analysis. Besides PHYLIP, other
software popular for phylogenetic analysis includes PAUP, MEGA, TreeconW and Winboot.
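To give a feel for how a hierarchical tree can be built from sequences, the sketch below clusters aligned sequences with UPGMA, one simple distance-based method. It is an illustration only and is not presented as the method of any package named above; the function names, the use of the plain fraction of differing sites as the distance, and the four short sequences are assumptions.

```python
def p_distance(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(sequences):
    """Cluster aligned sequences with UPGMA; return the topology in Newick format."""
    sizes = {name: 1 for name in sequences}  # cluster label -> number of members
    names = list(sequences)
    dist = {frozenset((a, b)): p_distance(sequences[a], sequences[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)       # closest pair of clusters
        a, b = sorted(pair)
        n_a, n_b = sizes.pop(a), sizes.pop(b)
        merged = "(" + a + "," + b + ")"
        del dist[pair]
        for other in sizes:
            # Distance to the merged cluster is the size-weighted average.
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            dist[frozenset((merged, other))] = (n_a * d_a + n_b * d_b) / (n_a + n_b)
        sizes[merged] = n_a + n_b
    return next(iter(sizes)) + ";"

sequences = {  # hypothetical aligned sequences
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "mouse": "ACGAACTA",
    "fly":   "TCGAGCTA",
}
print(upgma(sequences))  # ((chimp,human),(fly,mouse));
```

The Newick string that is printed is a text format for trees that tree-drawing programs commonly accept as input.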

5. Applications of Bioinformatics
5.1. Food Industry:
Functional genomics is playing a major role in the food biotechnology industry. The complete
genome sequence information available in different databases can be
used for finding metabolic pathways and various digestive enzymes, improving cell factories and
developing novel preservation methods. Information about the various microbes that
assist in food digestion, like E. coli, also plays a vital role in the major achievements of the food
industry using bioinformatics.

5.2. Agriculture:
Crops are improved by producing plants that carry genes for resistance to pathogens
like fungi and bacteria. Homology searches, finding conserved motifs, and molecular modeling are
useful in identifying disease resistance genes. Pesticides and insecticides that can efficiently kill
pathogens and pests are designed by molecular modeling.

5.3. Pharmaceutical industry and Medical science:
Bioinformatics, computational biology and cheminformatics are playing a key role in the
pharmaceutical industry in identifying new drug targets from genomic data at a much faster rate.
Disease-causing genes are identified using the tools of genomics and proteomics. Drug lead
identification and drug optimization have become easier using these tools. Besides
drugs, the pharmaceutical industry is using sequence information in the production of
vaccines and therapeutic proteins. Designing a new drug using bioinformatics
tools has been of great help in identifying the target disease and interesting lead compounds and,
through docking studies, in finding the effective interaction between the drug and its target.
Pharmacoinformatics is the area of medical informatics concerned with modeling and
simulation of the behavior of drugs, and control of such behavior by individualized dosage
regimens for each patient to achieve explicitly chosen therapeutic goals. The credibility of serum
concentration data is a major factor in such modeling.
Medical informatics is a scientific discipline concerned with the systematic
processing of data, information and knowledge in medicine and health care. Computerization of
the patient record is expected to resolve long-standing problems with the current paper-based
system.

6. Bioinformatics in India

In India there are various research and development units, centers and sub-centers, and
pharmaceutical companies doing research on various aspects of bioinformatics like proteomics,
genomics, developing sequence analysis tools, molecular modeling, drug designing etc. The Department
of Biotechnology (DBT), New Delhi, has emphasized starting bioinformatics centers with the help
of BTISnet (Biotechnology Information System) for the proper application of bioinformatics in various
sectors of science and technology for the benefit of researchers. DBT has sponsored various
Bioinformatics Distributed Information Centers (DICs) and Distributed Information Sub-Centers
(Sub-DICs) all over India.

The list of the DICs and the Sub DICs can be seen in the following websites.
http://dbtindia.nic.in/btis/dic.html
http://dbtindia.nic.in/bits/subdic.html

