LECTURE NOTES ON BIOINFORMATICS

Lecture notes in Bioinformatics
2018
SARDAR HUSSAIN

Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding
biological data. As an interdisciplinary field of science, bioinformatics combines Computer Science,
Biology, Mathematics, and Engineering to analyze and interpret biological data. Bioinformatics has
been used for in silico analyses of biological queries using mathematical and statistical techniques.
More broadly, bioinformatics is the application of statistics and computing to biological science.
Bioinformatics is both an umbrella term for the body of biological studies that use computer
programming as part of their methodology, as well as a reference to specific analysis "pipelines" that
are repeatedly used, particularly in the field of genomics. Common uses of bioinformatics include the
identification of candidate genes and single nucleotide polymorphisms (SNPs). Often, such
identification is made with the aim of better understanding the genetic basis of disease, unique
adaptations, desirable properties (esp. in agricultural species), or differences between populations. In a
less formal way, bioinformatics also tries to understand the organizational principles within nucleic acid
and protein sequences; the large-scale study of proteins in particular is called proteomics.[1]
Introduction
Bioinformatics has become an important part of many areas of biology. In experimental molecular
biology, bioinformatics techniques such as image and signal processing allow extraction of useful
results from large amounts of raw data. In the field of genetics and genomics, it aids in sequencing and
annotating genomes and their observed mutations. It plays a role in the text mining of biological
literature and the development of biological and gene ontologies to organize and query biological data.
It also plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools
aid in the comparison of genetic and genomic data and more generally in the understanding of
evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue
the biological pathways and networks that are an important part of systems biology. In structural
biology, it aids in the simulation and modeling of DNA,[2] RNA,[2][3] proteins[4] as well as biomolecular
interactions.[5][6][7]
History
Historically, the term bioinformatics did not mean what it means today. Paulien Hogeweg and Ben
Hesper coined it in 1970 to refer to the study of information processes in biotic systems.[8][9][10] This
definition placed bioinformatics as a field parallel to biophysics (the study of physical processes in
biological systems) or biochemistry (the study of chemical processes in biological systems).[8]
Sequences
Sequences of genetic material are frequently used in bioinformatics and are easier to manage using
computers than manually.
Computers became essential in molecular biology when protein sequences became available after
Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences
manually turned out to be impractical. A pioneer in the field was Margaret Oakley Dayhoff, who has
been hailed by David Lipman, director of the National Center for Biotechnology Information, as the
"mother and father of bioinformatics." Dayhoff compiled one of the first protein sequence databases,
initially published as books[12] and pioneered methods of sequence alignment and molecular
evolution.[13] Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological
sequence analysis in 1970 with his comprehensive volumes of antibody sequences released with Tai Te
Wu between 1980 and 1991.[14]
Goals/scope
To study how normal cellular activities are altered in different disease states, the biological data must
be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics
has evolved such that the most pressing task now involves the analysis and interpretation of various
types of data. This includes nucleotide and amino acid sequences, protein domains, and protein
structures.[15] The actual process of analyzing and interpreting data is referred to as computational
biology. Important sub-disciplines within bioinformatics and computational biology include:
• Development and implementation of computer programs that enable efficient access to, use
and management of, various types of information
• Development of new algorithms (mathematical formulas) and statistical measures that assess
relationships among members of large data sets. For example, there are methods to locate a
gene within a sequence, to predict protein structure and/or function, and to cluster protein
sequences into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it
apart from other approaches, however, is its focus on developing and applying computationally
intensive techniques to achieve this goal. Examples include: pattern recognition, data mining, machine
learning algorithms, and visualization. Major research efforts in the field include sequence alignment,
gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein
structure prediction, prediction of gene expression and protein–protein interactions, genome-wide
association studies, the modeling of evolution and cell division/mitosis.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational and
statistical techniques, and theory to solve formal and practical problems arising from the management
and analysis of biological data.
Over the past few decades, rapid developments in genomic and other molecular research technologies
and developments in information technologies have combined to produce a tremendous amount of
information related to molecular biology. Bioinformatics is the name given to these mathematical and
computing approaches used to glean understanding of biological processes.
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences,
aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein
structures.
Relation to other fields
Bioinformatics is a science field that is similar to but distinct from biological computation, while it is
often considered synonymous with computational biology. Biological computation uses bioengineering
and biology to build biological computers, whereas bioinformatics uses computation to better
understand biology. Bioinformatics and computational biology involve the analysis of biological data,
particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive
growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in
DNA sequencing technology.

Analyzing biological data to produce meaningful information involves writing and running software
programs that use algorithms from graph theory, artificial intelligence[16], soft computing, data mining,
image processing, and computer simulation. The algorithms in turn depend on theoretical foundations
such as discrete mathematics, control theory, system theory, information theory, and statistics.
Sequence analysis
The sequences of different genes or proteins may be aligned side-by-side to measure their similarity.
Such an alignment can compare, for example, protein sequences or genomic sequences containing WPP domains.
Since the Phage Φ-X174 was sequenced in 1977,[17] the DNA sequences of thousands of organisms have
been decoded and stored in databases. This sequence information is analyzed to determine genes that
encode proteins, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A
comparison of genes within a species or between different species can show similarities between
protein functions, or relations between species (the use of molecular systematics to construct
phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA
sequences manually. Today, computer programs such as BLAST are used daily to search sequences
from more than 260 000 organisms, containing over 190 billion nucleotides.[18] These programs can
compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, to identify
sequences that are related, but not identical. A variant of this sequence alignment is used in the
sequencing process itself.
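The idea of scoring an alignment while tolerating exchanged, deleted or inserted bases can be illustrated with a small dynamic-programming sketch. This is the classic Needleman-Wunsch global alignment score, not the heuristic that BLAST itself uses; the function name and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b.

    The scoring scheme is an illustrative assumption, not a recommended one.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            # Exchanged base (or match): consume one base from each sequence.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Deleted or inserted base: consume from one sequence only.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("GATTACA", "GATTAA")
```

Two related sequences that differ by a single deleted base still receive a high score, which is how such methods recognise sequences that are related but not identical.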
DNA sequencing
Before sequences can be analyzed they have to be obtained. DNA sequencing is still a non-trivial
problem as the raw data may be noisy or afflicted by weak signals. Algorithms have been developed for
base calling for the various experimental approaches to DNA sequencing.
Sequence assembly
Most DNA sequencing techniques produce short fragments of sequence that need to be assembled to
obtain complete gene or genome sequences. The so-called shotgun sequencing technique (which was
used, for example, by The Institute for Genomic Research (TIGR) to sequence the first bacterial
genome, Haemophilus influenzae)[19] generates the sequences of many thousands of small DNA
fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The
ends of these fragments overlap and, when aligned properly by a genome assembly program, can be
used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the
task of assembling the fragments can be quite complicated for larger genomes. For a genome as large
as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers
to assemble the fragments, and the resulting assembly usually contains numerous gaps that must be
filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today,
and genome assembly algorithms are a critical area of bioinformatics research.
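The way overlapping fragment ends can be used to reconstruct a longer sequence can be sketched as a greedy merge. This is a toy illustration under strong assumptions (error-free reads, a single strand, no repeats); real genome assemblers are far more elaborate, and the function names and the min_len parameter are hypothetical.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    k = overlap(left, right, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


fragments = ["ATGGCGT", "GCGTACC", "TACCTGA"]
assembled = greedy_assemble(fragments)
```

When no pair of fragments overlaps, the sketch returns several separate contigs, which corresponds to the gaps mentioned above.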
Genome annotation
In the context of genomics, annotation is the process of marking the genes and other biological
features in a DNA sequence. This process needs to be automated because most genomes are too large
to annotate by hand, not to mention the desire to annotate as many genomes as possible, as the rate
of sequencing has ceased to pose a bottleneck. Annotation is made possible by the fact that genes
have recognisable start and stop regions, although the exact sequence found in these regions can vary
between genes.
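The observation that genes have recognisable start and stop regions can be illustrated by a simple open reading frame (ORF) scan. This is only a minimal sketch, assuming a bacterial-style gene without introns, the forward strand only, and ATG as the sole start signal; programs such as GeneMark rely on statistical models rather than this naive rule. The names find_orfs and min_codons are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ATG...stop ORFs on the forward strand.

    min_codons counts the coding codons (including ATG, excluding the stop).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


orfs = find_orfs("CCATGAAATGACC")
```

A fuller treatment would also scan the reverse complement strand and score candidate ORFs by codon usage.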
The first description of a comprehensive genome annotation system was published in 1995 [19] by the
team at The Institute for Genomic Research that performed the first complete sequencing and analysis
of the genome of a free-living organism, the bacterium Haemophilus influenzae.[19] Owen White
designed and built a software system to identify the genes encoding all proteins, transfer RNAs,
ribosomal RNAs (and other sites) and to make initial functional assignments. Most current genome
annotation systems work similarly, but the programs available for analysis of genomic DNA, such as the
GeneMark program trained and used to find protein-coding genes in Haemophilus influenzae, are
constantly changing and improving.
Following the goals that the Human Genome Project left to achieve after its closure in 2003, a new
project developed by the National Human Genome Research Institute in the U.S. appeared. The so-called
ENCODE project is a collaborative data collection of the functional elements of the human
genome that uses next-generation DNA-sequencing technologies and genomic tiling arrays,
technologies able to automatically generate large amounts of data at a dramatically reduced per-base
cost but with the same accuracy (base call error) and fidelity (assembly error).
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change over time.
Informatics has assisted evolutionary biologists by enabling researchers to:
• trace the evolution of a large number of organisms by measuring changes in their DNA, rather
than through physical taxonomy or physiological observations alone,
• more recently, compare entire genomes, which permits the study of more complex evolutionary
events, such as gene duplication, horizontal gene transfer, and the prediction of factors
important in bacterial speciation,
• build complex computational population genetics models to predict the outcome of the system
over time,[20]
• track and share information on an increasingly large number of species and organisms.
Future work endeavours to reconstruct the now more complex tree of life.
The area of research within computer science that uses genetic algorithms is sometimes confused with
computational evolutionary biology, but the two areas are not necessarily related.
Comparative genomics

The core of comparative genome analysis is the establishment of the correspondence between genes
(orthology analysis) or other genomic features in different organisms. It is these intergenomic maps
that make it possible to trace the evolutionary processes responsible for the divergence of two
genomes. A multitude of evolutionary events acting at various organizational levels shape genome
evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large
chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and
insertion.[21] Ultimately, whole genomes are involved in processes of hybridization, polyploidization and
endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many
exciting challenges to developers of mathematical models and algorithms, who have recourse to a
spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic,
fixed-parameter and approximation algorithms for problems based on parsimony models to Markov chain
Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on homology detection and the computation of protein families.
Pan genomics
Pan genomics is a concept introduced in 2005 by Tettelin and Medini which eventually took root in
bioinformatics. The pan genome is the complete gene repertoire of a particular taxonomic group: although
initially applied to closely related strains of a species, it can be applied to a larger context such as a
genus or phylum. It is divided into two parts: the core genome, the set of genes common to all the genomes
under study (these are often housekeeping genes vital for survival), and the dispensable (or flexible)
genome, the set of genes present in only one or some, but not all, of the genomes under study. The
bioinformatics tool BPGA can be used to characterize the pan genome of bacterial species.[23]
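The split into core and dispensable genomes amounts to set operations once genes have been grouped into families. The sketch below assumes that grouping has already been done; it does not reproduce how BPGA works, and the strain names and gene identifiers are invented for illustration.

```python
def pan_genome(genomes):
    """Split gene families into pan, core and dispensable sets.

    genomes maps a strain name to the set of gene-family identifiers found in it.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)  # every family seen in any genome
    core = set.intersection(*gene_sets) if gene_sets else set()  # seen in all
    dispensable = pan - core  # seen in some genomes but not all
    return pan, core, dispensable


# Hypothetical strains and gene families, for illustration only.
strains = {
    "strain_1": {"dnaA", "gyrB", "toxA"},
    "strain_2": {"dnaA", "gyrB"},
    "strain_3": {"dnaA", "gyrB", "pilX"},
}
pan, core, dispensable = pan_genome(strains)
```

Here the two genes shared by all three strains form the core genome, and the strain-specific genes form the dispensable genome.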
Genetics of disease
With the advent of next-generation sequencing we are obtaining enough sequence data to map the
genes of complex diseases such as diabetes,[24] infertility,[25] breast cancer[26] or Alzheimer's Disease.[27]
Genome-wide association studies are a useful approach to pinpoint the mutations responsible for such
complex diseases.[28] Through these studies, thousands of DNA variants have been identified that are
associated with similar diseases and traits.[29] Furthermore, the use of genes for prognosis, diagnosis
or treatment is one of the most important applications. Many studies are discussing
both the promising ways to choose the genes to be used and the problems and pitfalls of using genes
to predict disease presence or prognosis.[30]
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways.
Massive sequencing efforts are used to identify previously unknown point mutations in a variety of
genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the
sheer volume of sequence data produced, and they create new algorithms and software to compare
the sequencing results to the growing collection of human genome sequences and germline
polymorphisms. New physical detection technologies are employed, such as oligonucleotide
microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and
single-nucleotide polymorphism arrays to detect known point mutations. These detection methods
simultaneously measure several hundred thousand sites throughout the genome, and when used in
high-throughput to measure thousands of samples, generate terabytes of data per experiment. Again
the massive amounts and new types of data generate new opportunities for bioinformaticians. The
data is often found to contain considerable variability, or noise, and thus hidden Markov model and
change-point analysis methods are being developed to infer real copy number changes.
Two important principles can be used in the bioinformatic analysis of cancer genomes, pertaining to
the identification of mutations in the exome. First, cancer is a disease of accumulated somatic
mutations in genes. Second, cancer contains driver mutations which need to be distinguished from
passengers.[31]
With the breakthroughs that this next-generation sequencing technology is providing to the field of
Bioinformatics, cancer genomics could drastically change. These new methods and software allow
bioinformaticians to sequence many cancer genomes quickly and affordably. This could create a more
flexible process for classifying types of cancer by analysis of cancer-driving mutations in the genome.
Furthermore, tracking of patients while the disease progresses may be possible in the future with the
sequence of cancer samples.[32]
Another type of data that requires novel informatics development is the analysis of lesions found to be
recurrent among many tumors.
Gene and protein expression
Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple techniques
including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene
expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq, also
known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed
in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the
biological measurement, and a major research area in computational biology involves developing
statistical tools to separate signal from noise in high-throughput gene expression studies.[33] Such
studies are often used to determine the genes implicated in a disorder: one might compare microarray
data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that
are up-regulated and down-regulated in a particular population of cancer cells.
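The comparison of cancerous and non-cancerous samples can be sketched as a log2 fold change per gene. This is a deliberately naive illustration: it assumes one already-normalised expression value per gene and condition, uses a pseudocount and threshold chosen arbitrarily, and omits the replicates and statistical testing that real studies need to separate signal from noise. All gene names and values are invented.

```python
import math


def log2_fold_changes(cancer, normal, pseudocount=1.0):
    """log2(cancer / normal) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cancer[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in cancer
        if gene in normal
    }


def classify(changes, threshold=1.0):
    """Label genes up- or down-regulated when |log2 fold change| >= threshold."""
    up = sorted(g for g, lfc in changes.items() if lfc >= threshold)
    down = sorted(g for g, lfc in changes.items() if lfc <= -threshold)
    return up, down


# Hypothetical expression values, for illustration only.
cancer_cells = {"geneA": 7, "geneB": 0, "geneC": 3}
normal_cells = {"geneA": 1, "geneB": 3, "geneC": 3}
changes = log2_fold_changes(cancer_cells, normal_cells)
up_regulated, down_regulated = classify(changes)
```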
Analysis of protein expression
Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the
proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein
microarray and HT MS data; the former approach faces similar problems as with microarrays targeted
at mRNA, the latter involves the problem of matching large amounts of mass data against predicted
masses from protein sequence databases, and the complicated statistical analysis of samples where
multiple, but incomplete peptides from each protein are detected. Cellular protein localization in a
tissue context can be achieved through affinity proteomics displayed as spatial data based on
immunohistochemistry and tissue microarrays.[34]
Analysis of regulation
Regulation is the complex orchestration of events by which a signal, potentially an extracellular signal
such as a hormone, eventually leads to an increase or decrease in the activity of one or more proteins.
Bioinformatics techniques have been applied to explore various steps in this process.

For example, gene expression can be regulated by nearby elements in the genome. Promoter analysis
involves the identification and study of sequence motifs in the DNA surrounding the coding region of a
gene. These motifs influence the extent to which that region is transcribed into mRNA. Enhancer
elements far away from the promoter can also regulate gene expression, through three-dimensional
looping interactions. These interactions can be determined by bioinformatic analysis of chromosome
conformation capture experiments.
Expression data can be used to infer gene regulation: one might compare microarray data from a wide
variety of states of an organism to form hypotheses about the genes involved in each state. In a single-
cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat
shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine
which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes
can be searched for over-represented regulatory elements. Examples of clustering algorithms applied
in gene clustering are k-means clustering, self-organizing maps (SOMs), hierarchical clustering, and
consensus clustering methods.
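Of the clustering algorithms listed above, k-means is the simplest to sketch. The version below is a minimal pure-Python illustration, assuming each gene is represented by a list of expression values across conditions and that starting centroids are supplied by the caller; the function name and the example profiles are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on lists of floats; centroids gives the starting centres."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each expression profile to its nearest centroid.
        for idx, p in enumerate(points):
            dists = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its member profiles.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression profiles (one row per gene, one column per condition).
profiles = [
    [1.0, 1.2, 0.9],
    [0.9, 1.1, 1.0],
    [5.0, 5.2, 4.8],
    [5.1, 4.9, 5.0],
]
labels, centres = kmeans(profiles, centroids=[profiles[0], profiles[2]])
```

Genes that receive the same label are treated as co-expressed, and their upstream regions could then be searched for shared regulatory elements.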
Analysis of cellular organization
Several approaches have been developed to analyze the location of organelles, genes, proteins, and
other components within cells. This is relevant as the location of these components affects the events
within a cell and thus helps us to predict the behavior of biological systems. A gene ontology category,
cellular compartment, has been devised to capture subcellular localization in many biological databases.
Microscopy and image analysis
Microscopic pictures allow us to locate both organelles as well as molecules. It may also help us to
distinguish between normal and abnormal cells, e.g. in cancer.
Protein localization
The localization of proteins helps us to evaluate the role of a protein. For instance, if a protein is found
in the nucleus it may be involved in gene regulation or splicing. By contrast, if a protein is found in
mitochondria, it may be involved in respiration or other metabolic processes. Protein localization is thus
an important component of protein function prediction. There are well-developed protein subcellular
localization prediction resources available, including protein subcellular location databases and
prediction tools.[35][36]
Nuclear organization of chromatin
Data from high-throughput chromosome conformation capture experiments, such as Hi-C
and ChIA-PET, can provide information on the spatial proximity of DNA loci. Analysis of these
experiments can determine the three-dimensional structure and nuclear organization of chromatin.
Bioinformatic challenges in this field include partitioning the genome into domains, such as
Topologically Associating Domains (TADs), that are organised together in three-dimensional space.[37]
Structural bioinformatics

Three-dimensional protein structures are common subjects in bioinformatic analyses.
Protein structure prediction is another important application of bioinformatics. The amino acid
sequence of a protein, the so-called primary structure, can be easily determined from the sequence on
the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a
structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform
encephalopathy – a.k.a. Mad Cow Disease – prion.) Knowledge of this structure is vital in understanding
the function of the protein. Structural information is usually classified as one of secondary, tertiary and
quaternary structure. A viable general solution to such predictions remains an open problem. Most
efforts have so far been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of
bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose
function is known, is homologous to the sequence of gene B, whose function is unknown, one could
infer that B may share A's function. In the structural branch of bioinformatics, homology is used to
determine which parts of a protein are important in structure formation and interaction with other
proteins. In a technique called homology modeling, this information is used to predict the structure of a
protein once the structure of a homologous protein is known. This currently remains the only way to
predict protein structures reliably.
One example of this is the homology between hemoglobin in humans and the hemoglobin in legumes
(leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. Though these
proteins have very different amino acid sequences, their protein structures are virtually identical,
which reflects their near-identical purposes.[38]
Other techniques for predicting protein structure include protein threading and de novo (from scratch)
physics-based modeling.
Network and systems biology
Network analysis seeks to understand the relationships within biological networks such as metabolic or
protein–protein interaction networks. Although biological networks can be constructed from a single
type of molecule or entity (such as genes), network biology often attempts to integrate many different
data types, such as proteins, small molecules, gene expression data, and others, which are all
connected physically, functionally, or both.

Systems biology involves the use of computer simulations of cellular subsystems (such as the networks
of metabolites and enzymes that comprise metabolism, signal transduction pathways and gene
regulatory networks) to both analyze and visualize the complex connections of these cellular
processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the
computer simulation of simple (artificial) life forms.
Molecular interaction networks
Interactions between proteins are frequently visualized and analyzed using networks, for example a
network made up of protein–protein interactions from Treponema pallidum, the causative agent of
syphilis and other diseases.
Tens of thousands of three-dimensional protein structures have been determined by X-ray
crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR) and a central
question in structural bioinformatics is whether it is practical to predict possible protein–protein
interactions only based on these 3D shapes, without performing protein–protein interaction
experiments. A variety of methods have been developed to tackle the protein–protein docking
problem, though it seems that there is still much work to be done in this field.
Other interactions encountered in the field include protein–ligand (including drug) and
protein–peptide interactions. Molecular dynamic simulation of movement of atoms about rotatable bonds is the
fundamental principle behind computational algorithms, termed docking algorithms, for studying
molecular interactions.
Sequence development
We start with a very basic review of biology, necessary for any further work, but largely sufficient
for getting started in computational biology. One can (and must) learn more “on the job”.
Biomolecules are sequences of monomers (DNA, RNA=nucleotide sequences, proteins=amino
acid sequences). DNA is the molecule that contains the entire blueprint for an organism. It contains
genes that encode the sequences for every protein in the organism, as well as non-coding regions
that, among other things, contain regulatory mechanisms for when and in what order different
genes get turned on, and may have other functions as well.
Most genes code for proteins; some genes code for RNA molecules that play various roles
in the cell. Both DNA and RNA are polymers of “nucleotides” which are bases of four kinds
[adenine=A, cytosine=C, guanine=G, thymine=T (DNA only), uracil=U (RNA only)] attached
to sugar-phosphate backbones. Apart from the one difference in bases, RNA and DNA are very
similar except that DNA usually exists in double-stranded “base-paired” form and RNA is in
single-stranded form.
The backbone of DNA (or RNA) is not symmetrical: each monomer has a 5’-phosphate group
at one end and a 3’-hydroxyl group at the other. Each strand is usually read from the 5’ to the 3’
end. The two strands go in opposite directions. The nucleic acids are base-paired A to T, G to C.

A-T pairs are weaker (two hydrogen bonds), G-C pairs are stronger (three hydrogen bonds).
Proteins are the “building blocks” of life, responsible for a vast number of cellular processes. They
regulate genes, catalyse various biochemical reactions, form machinery for synthesis of other molecules
(including other proteins) and are important parts of organelles and tissues. They are polymers of amino
acids (carboxylic acids with an amine group and a side chain). There are twenty naturally occurring
amino acids, differing in their side chains. Proteins tend to “fold” into complex three-dimensional
conformations; usually the fold is unique and misfolding is rare. The details of the fold are
biochemically important. Usually a few active “domains” (for example, binding to DNA, interaction with
other proteins) help the protein play important roles in gene regulation, catalysis, etc; these domains
tend to be well conserved across species, while the rest of the protein sequence can mutate a lot. Much
computational effort goes into studying protein structure and function, but we will not discuss this
vast subject here.
Genes that code for proteins are first “transcribed” to “messenger RNA” (mRNA) molecules,
and then the mRNA is “translated” to proteins. Each “codon” of three nucleotides corresponds to a
single amino acid. Since there are 4 nucleotides, there are 4 × 4 × 4 = 64 possible codons; three of
these are “stop codons” (TAA, TAG, TGA) (sometimes called “nonsense codons”) and don’t code for amino
acids, instead indicating a stop to translation. The remaining 61 code for 20 amino acids. Several
codons (up to six) thus can code for the same amino acid. The “start codon” is ATG, which codes
for the amino acid methionine.
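The codon arithmetic above can be checked with a short sketch that builds the standard genetic code and translates a coding sequence. It assumes the standard code written with DNA letters (T rather than U) and a sequence that is already in the correct reading frame; the compact table string follows the conventional TCAG ordering, and the function name translate is chosen here only for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
coding_codons = sum(1 for aa in CODON_TABLE.values() if aa != "*")
peptide = translate("ATGGCCTAA")
```

The table contains 64 codons, of which three are stop codons and 61 code for the 20 amino acids, in agreement with the counts given in the text.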
What are the biological problems?
There are of course a huge number of problems in biology that can benefit from a quantitative
treatment, ranging from single molecule behaviour to population biology and ecology. From the
title, we are already restricting ourselves to bioinformatics, but we will mainly focus on DNA
sequence analysis, with only occasional mention of proteins.
The following are a few issues of interest to biologists (and often of medical importance) that
could benefit from analysis of DNA sequence:
• Cellular processes: how the cell carries out its normal tasks; how it responds to external
events like heat shock and starvation; how it carries out complex cascades of events such as
the process of cell division (mitosis).
• Development: How a complex organism (e.g. a worm, a fly, a human) develops from a single
fertilised egg. As this embryonic cell divides, the daughter cells also slowly differentiate in
function. This happens as a result of “gradients” of various factors (some of them maternal)
that change gene regulation in different parts of the embryo and ultimately cause different
cells to develop in highly specialised ways.
• Evolution: How different species evolve, how new functionality develops.
All cellular and developmental processes are controlled by genes that get turned on in response
to some external condition (stress, starvation, embryonic gradients) or cyclically (cell cycle).
Computational study of how these genes are regulated and how they function is very useful. This is
done by analysing the gene sequence and regulatory DNA sequence of the organism itself, and
by comparison of this sequence with already-annotated sequence from other organisms. Highly
similar (homologous) genes exist among widely different organisms; such genes are called
“orthologues”.
Many subsystems in widely different organisms are very similar and are regulated by
orthologous proteins; some proteins exist largely unchanged from primitive archaebacteria all the
way to humans.
Moreover, many genes with high sequence identity often exist in the same organism, arising
from ancestral “gene duplication” events; their function is often slightly differentiated, and in fact
this is a major driving factor in evolution.
There are now “high-throughput” microarray experiments that can essentially give the response
of every gene in the genome; analysing, clustering and interpreting this data, and combining it with
other computational tasks in gene regulation, is of great interest.
Finally, the study of phylogeny (evolutionary history of organisms) and the classification (taxonomy)
of organisms has been revolutionised by DNA sequencing.
 Aims and tasks of Bioinformatics
The aims of bioinformatics are threefold. First, at its simplest bioinformatics organizes data in a way
that allows researchers to access existing information and to submit new entries as they are produced,
e.g. the Protein Data Bank for 3D macromolecular structures. While data curation is an essential task,
the information stored in these databases is essentially useless until analyzed. Thus the purpose of
bioinformatics extends much further.
The second aim is to develop tools and resources that aid in the analysis of data. For example, having
sequenced a particular protein, it is of interest to compare it with previously characterized sequences.
This needs more than just a simple text-based search, and programs such as FASTA and PSI-BLAST must
consider what comprises a biologically significant match. Development of such resources dictates
expertise in computational theory as well as a thorough understanding of biology.
The third aim is to use these tools to analyze the data and interpret the results in a biologically
meaningful manner. Traditionally, biological studies examined individual systems in detail, and
frequently compared those with a few that are related. In bioinformatics, we can now conduct global
analyses of all the available data with the aim of uncovering common principles that apply across many
systems and highlight novel features.
 Application of Bioinformatics in current research
Almost every field of biological research, from molecular biology and genetics to agriculture, now
relies on bioinformatics. Genome informatics, an emerging field, is based entirely on
bioinformatics tools. Bioinformatics also plays a primary role in predicting structural and functional
similarity, including in research on novel drug molecules. Typical tasks include the following:
• Submitting DNA Sequences to the Databases
Scientists routinely sequence DNA and RNA, but until a sequence is deposited in a public sequence
database, the scientific community cannot benefit from it. It is therefore essential to submit all
sequence data to public sequence repositories. Some of the important public repositories are DDBJ,
EMBL, and GenBank.
Sequence data can be submitted to these repositories in two ways: by email or online through
sequence submission tools. Each public sequence repository has its own submission tools (Table 1).

Table 1: Public sequence repositories
Table 2: Human Genome Databases, Browsers and Variation Resources

Table 2a: Vertebrate databases and genome browsers

Table 3: List of some invertebrate databases and genome browsers
After submission, each database verifies the sequence, checks it for duplication, and assigns it a
unique accession number. Accession numbers traditionally consist of a single letter followed by five
digits; because of the large number of submissions, a format of two letters followed by six digits has
also been introduced.
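As a rough illustration, the two accession number formats described above can be recognised with a regular expression. This is a minimal sketch that covers only those two formats; the function name is hypothetical, the example strings are made up to show the shape, and real repositories accept further variants (for example version suffixes) that are not handled here.

```python
import re

# One letter + five digits, or two letters + six digits (the two formats
# described in the text). Other real-world formats are deliberately ignored.
ACCESSION_PATTERN = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def looks_like_accession(text):
    """Return True if text has the shape of one of the two formats above."""
    return ACCESSION_PATTERN.fullmatch(text.strip().upper()) is not None
```

For example, looks_like_accession("U12345") and looks_like_accession("AF123456") are True, while looks_like_accession("ABC123") is False. A match only says the string is well formed, not that a record with that number exists.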
• Genomic Mapping and Mapping Databases
Gene mapping is a technique for determining the position of a gene and the distance between it and
related genes.
When this has been done across the whole genome, the result is a genome map for that
particular organism.
• Information Retrieval From Biological Database
Building biological databases and making them available online was a primary concern in the early
days of the field. Many biological databases now exist, holding data as text, tables, pictures and
many other formats, so it is important to know how to retrieve exactly the data needed from a
suitable database. Retrieval may involve text, sequences, or structural data.

• Sequence Alignment and Database Searching
Aligning a sequence against other relevant, similar sequences is essential in biological research, both
to understand the relationship between sequences and to predict structure and function from sequence
similarity.
BLAST is very commonly used for basic sequence alignment. Based on the number of sequences
involved, alignments are classified as pairwise alignments or multiple sequence
alignments.
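To make the idea of a pairwise alignment concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach). It is not how BLAST works, since BLAST uses a faster local heuristic. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration only.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global pairwise alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of match/mismatch, or a gap
    # in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))
```

For instance, global_alignment("ACGT", "AGT") returns a score of 2 with the alignment ACGT over A-GT: three matches and one gap.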
• Predictive Methods Using DNA Sequences
Gene-finding strategies can be classified into three major categories.
Content-based methods rely on the overall, bulk properties of a sequence in making a determination.
Characteristics considered here include how often particular codons are used, the periodicity of
repeats, and the compositional complexity of the sequence. Because different organisms use
synonymous codons with different frequency, such clues can provide insight into determining regions
that are more likely to be exons.
In site-based methods, the focus turns to the presence or absence of a specific sequence, pattern, or
consensus. These methods are used to detect features such as donor and acceptor splice sites, binding
sites for transcription factors, polyA tracts, and start and stop codons.
Comparative methods make determinations based on sequence homology. Here, translated sequences
are subjected to database searches against protein sequences to determine whether a previously
characterized coding region corresponds to a region in the query sequence. Although this is
conceptually the most straightforward of the methods, it is restrictive because most newly discovered
genes do not have gene products that match anything in the protein databases.
Tools associated with these methods include GRAIL, GENSCAN, FGENES, PROCRUSTES and many
others.
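The site-based idea can be illustrated with a very small open reading frame (ORF) scanner that looks only for start and stop codons. This is a sketch under simplifying assumptions: it scans the three forward reading frames only, ignores introns and splice sites, and the function name and the min_codons parameter are hypothetical. Tools such as those named above use far richer models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ORFs on the forward strand.

    start is the 0-based index of the A in ATG; end is one past the last
    base of the stop codon. min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in the same frame
    return orfs
```

For example, find_orfs("CCATGGCCTAAGG") returns [(2, 11, "ATGGCCTAA")].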
• Predictive Methods Using Protein Sequences
There are tools based on predictive methods using protein sequences, such as PSLpred, NRpred and
PSEApred. Other methods work at the motif, residue, signal, peptide, domain and profile
levels.
• Sequences Assembly and Finishing Methods
At present, the sequencing process is often talked of as consisting of two parts, namely, assembly and
finishing, but in practice there is considerable overlap between the two. Assembly is the process of
attempting to order and align the readings, and finishing is the task of checking and editing the
assembled data. This includes performing new sequencing experiments to fill gaps or to cover
segments where the data is poor and adjudicating between conflicting readings when editing.
• Phylogenetic Analysis
Phylogenetic analysis is another important application of bioinformatics in biological
research. It is the study of the ancestral history of organisms. Using sequence and
structural similarities, we try to show how organisms are related to one another through
common ancestry and in what order they evolved; in other words, phylogenetic analysis reconstructs
evolutionary history.
Many tools are available online, along with software packages such as PHYLIP. They build trees
using algorithms such as UPGMA and neighbor joining.
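As an illustration of how a distance-based method builds a tree, here is a minimal UPGMA sketch: repeatedly merge the two closest clusters and average their distances to every other cluster, weighted by cluster size. It returns only the tree shape as a nested string, without branch lengths. The input format (a dictionary keyed by frozenset pairs) and the taxon names are assumptions for this example and are not PHYLIP's interface.

```python
def upgma(names, distances):
    """Cluster taxa by UPGMA and return the tree as a nested string.

    distances maps frozenset({a, b}) to the distance between taxa a and b.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of taxa
    dist = dict(distances)               # working copy, updated as we merge
    while len(sizes) > 1:
        closest = min(dist, key=dist.get)  # pair with the smallest distance
        a, b = sorted(closest)
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del dist[closest]
        for other in sizes:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            # Size-weighted average keeps every original taxon equally weighted.
            dist[frozenset((merged, other))] = (
                (size_a * d_a + size_b * d_b) / (size_a + size_b)
            )
        sizes[merged] = size_a + size_b
    return next(iter(sizes))
```

With four hypothetical taxa where A and B are closest (distance 2), C is at distance 6 from both, and D is at distance 10 from all others, the function returns "(((A,B),C),D)".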
• Comparative Genome Analysis
Comparative genome analysis is carried out in both academic and professional research. By comparing the
finished reference sequence of the human
genome with genomes of other organisms, researchers can identify regions of similarity and difference.
This information can help scientists better understand the structure and function of human genes and
thereby develop new strategies to combat human disease. Comparative genomics also provides a
powerful tool for studying evolutionary changes among organisms, helping to identify genes that are
conserved among species, as well as genes that give each organism its unique characteristics.

• Large-Scale Genome Analysis
Large-scale genome analysis covers complete genome sequencing. It has advanced considerably with
next-generation sequencing platforms, such as those from Illumina, and with bioinformatics tools that
analyze their output very quickly. The instruments themselves are called sequencers, and they play a
vital role in modern biological research.
Bioinformatics also has many applications in pharmaceutical research, where it is used in systems
biology and in the study of metabolic pathways and their relation to biological function.
Recent Advancement
Bioinformatics is still a young field, but continuous improvement is making it more
efficient. Recent advances come mostly from the use of various programming languages in the field and
from the development of software packages for analyzing biological data. Drug design software
packages such as “Sanjeevini”, developed by IIT Delhi, India, and Maestro from Schrödinger have
contributed a great deal. The Indian Agricultural Statistics Research Institute (IASRI) also contributes
substantially to bioinformatics research by creating many databases in agricultural and biological
areas. Among the most recent uses of bioinformatics are novel drug molecule discovery and ligand
analysis for human protein targets, with the aim of finding cures for lethal diseases in a short period.
Many docking software packages are available and have proved both efficient and accurate.
Review and conclusion
With a large number of tools available and bioinformatics applied in many areas of biological
research, the field has established both its presence and its importance. Nowadays nearly every
experiment in biological research is associated with bioinformatics. It has made research simpler and
faster, although the accuracy of many techniques is still being validated. Results that would not be
obtainable with wet-lab techniques alone can now be produced in minutes.
 Guide to NCBI Databases tools & Services/uses.
The National Center for Biotechnology Information (NCBI), a division of the National
Library of Medicine (NLM), has created a large number of databases that are freely
available to researchers. These databases represent a vast store of information about
genetics, genomics, proteomics, and medicine. All of these databases can be reached
from the Entrez search page. This page also allows cross searching of all the NCBI
databases, a feature called Entrez Global Query.
 There are two main functions of biological databases:
1. Make biological data available to scientists.
As much as possible of a particular type of information should be available in one single place
(book, site, database). Published data may be difficult to find or access, and collecting it from
the literature is very time-consuming. And not all data is actually published in an article.
2. Make biological data available in computer-readable form.
Since analysis of biological data almost always involves computers, having the data in computer-
readable form (rather than printed on paper) is a necessary first step.
One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and
Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein
sequences determined at the time, and new editions of the book were published well into the 1970s. Its
data became the foundation for the PIR database.

The computer became the storage medium of choice as soon as it was accessible to ordinary scientists.
Databases were distributed on tape, and later on various kinds of disks. Once universities and
academic institutes were connected to the Internet or its precursors (national computer networks), the
network naturally became the distribution medium of choice. Since the beginning of the 1990s, the World
Wide Web (WWW, which runs over the Internet using the HTTP protocol) has been the
standard method of communication and access for nearly all biological databases.
As biology has increasingly turned into a data-rich science, the need for storing and communicating
large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the
protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular
NMR. A new field of science dealing with issues, challenges and new possibilities created by these
databases has emerged: bioinformatics. Other types of data that are or will soon be available in
databases are metabolic pathways, gene expression data (microarrays) and other types of data relating
to biological function and processes.
One very important issue is the frequency and type of errors that the entries in a database have.
Naturally, this depends strongly on the type of data, and whether the database is curated (added,
deleted, or modified by a defined group of people) or not. For the sequence databases, the errors may
be either in the sequence itself (misprint, wrong on entry, genuine experimental error...) or in the
annotation (mistaken features, errors in references,...). In the 3D structure database (PDB), structures
have been deposited which were later discovered to contain severe errors. The error handling policy
differs considerably between databases. If one needs to use any particular database heavily, then the
implications of its particular policy need to be considered.
The present document will touch on only the largest and most frequently used databases.
We will begin with an introduction to the Entrez search interface and will then proceed to
the details of some of the individual NCBI databases.
Entrez
Entrez is the unified search interface for NCBI databases. This common interface allows
easy linking between results in different databases.
Entrez Search Tips
• The Boolean operators AND, OR, and NOT may be used and must be in all caps.
• To see exactly how Entrez has interpreted your query, see the “Search details”
box on the right side of the screen.
• Use quotation marks to enclose a phrase.
• Use the asterisk for truncation (e.g., bacteri* will retrieve bacteria, bacterium,
bacteriophage).
• Enter author names in the format Johnson AB with no punctuation. Entrez
will recognize this as an author name and search only that field. When in doubt,
or when the initials are not known, use Johnson[AUTHOR].
• Clicking on “Advanced Search” will display a numbered list of searches for the
current session. Previously run searches may be combined using the syntax #2
AND #3.
• Field specific searching is also available under “Advanced Search.” Alternately,
one can use Entrez search field qualifiers (e.g., rbcL[GENE] to search only the
gene name field).
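The search tips above can also be applied programmatically. The sketch below only builds a query string and a search URL; it does not contact NCBI. It assumes the E-utilities "esearch" endpoint and the database name "nucleotide"; the helper names build_query and esearch_url are hypothetical, and the query terms are the examples used in the tips.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities search service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(*terms, operator="AND"):
    """Join search terms with a Boolean operator written in capitals."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT (all caps)")
    return f" {operator} ".join(terms)


def esearch_url(database, term):
    """Return a URL-encoded search URL for the given database and query."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": database, "term": term})


# Field qualifiers restrict each term to one field, as in the tips above.
query = build_query("rbcL[GENE]", "Johnson AB[AUTHOR]")
url = esearch_url("nucleotide", query)
```

Here query is "rbcL[GENE] AND Johnson AB[AUTHOR]". The square brackets and spaces are percent-encoded in url, so it can be pasted into a browser or passed to an HTTP client.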
BLAST
Like Entrez, BLAST (Basic Local Alignment Search Tool) is not a database itself, but a
means of accessing the data in NCBI databases, particularly in the nucleotide and protein
databases. BLAST allows researchers to directly search nucleotide or protein sequence
data. For instance, a researcher can submit a sequence through BLAST to see if there are
similar sequences already in the NCBI databases. In addition, BLAST can be used to
align two sequences using a tool called bl2seq.
PubMed
PubMed is the major bibliographic database from NCBI. It searches MEDLINE, a
database from the NLM that covers “medicine, nursing, dentistry, veterinary medicine,
the health care system, and the preclinical sciences, such as molecular biology.” PubMed
also allows access to articles that are out of scope for MEDLINE, but which appear in
journals indexed by MEDLINE. All material included in PubMed Central (NLM’s
online journal archive) is indexed, as well as a few additional databases from NLM.
PubMed employs a system called Automatic Term Mapping to match search terms to the
Medical Subject Headings (MeSH) vocabulary. To see how terms have been mapped,
see the “Search details” box on the right side of the screen. This is an invaluable tool for
troubleshooting a search.
Clicking “Advanced Search” on the PubMed search page provides many options for
customizing a search. Limits include human vs. animal subjects, male vs. female
subjects, age of subjects, article type, and journal type. Advanced search will also
display your search history.
The PubMed help file provides guidance on structuring searches and managing search
results.
Online Mendelian Inheritance in Man (OMIM)
OMIM is a database from Johns Hopkins University for human genetics containing short
articles with references on genetic disorders. It is an excellent starting point for any
question involving human genetics as it links out to bibliographic records in PubMed and
to sequence records.
Nucleotide Databases
The nucleotide sequence data in NCBI is a composite of the data from GenBank, the
European Molecular Biology Laboratory (EMBL) and the DNA Databank of Japan
(DDBJ).
NCBI’s nucleotide data is divided into three sub-databases:
1. GenBank Expressed Sequence Tags (EST) – these are generally short
sequences derived from mRNA isolated from a particular tissue at a particular
stage of development.
2. GenBank Genome Survey Sequence (GSS) – these are sequences derived from
whole-genome sequencing projects.
3. CoreNucleotide – all nucleotide sequences that are not ESTs or GSSs.
Confusingly, the links on the Entrez search page to EST, GSS, and CoreNucleotide all go
to the same Entrez Nucleotide search interface, so when a search is performed in any one
of the sub-databases, results are returned from all three. Links are provided at the top of
the results page so that results from a particular sub-database may be isolated.
Among the nucleotide sequences, there are some that are uncurated, meaning that they
are in the database just as they were submitted by researchers. Other records are referred
to as reference sequences (or RefSeq) and are curated by NCBI. RefSeq records are
identified by accession numbers beginning with two letters and an underscore (e.g., NM_,
XP_).
For more information about the nucleotide databases, see Chapter 1 of the NCBI
Handbook. For more information about the RefSeq project, see Chapter 18.
Protein Database
The protein database contains data from GenBank, EMBL, and DDBJ as well as
sequences submitted to various other sources including SWISS-PROT. As with the
nucleotide database, RefSeq records are identified by accession numbers beginning with
two letters and an underscore.
Genome Database
The genome database provides views of entire genomes and chromosomes. Results are
displayed via NCBI’s Map Viewer, from which the user can zoom in on a region of
interest. The Map Viewer is highly customizable, allowing users to control what types of
maps are displayed and the level of resolution. Links to other NCBI databases are
provided. For help using the Map Viewer, see Chapter 20 of the NCBI Handbook.
Structure Database
The structure database contains three-dimensional images of proteins from the protein
database. It is searchable by keyword or by protein or nucleotide sequence. Protein
images can be manipulated using the free Cn3D tool. Help is available from the
database help screen.
Gene Database
The gene database allows the user to search for individual genes from among the
genomes represented in RefSeq, providing useful summary statements about the gene and
links to other NCBI databases. As in the Genome database, results may be examined
using the sequence viewer.
Taxonomy
The taxonomy database contains the names of all organisms that are represented by
nucleotide or protein sequences in the NCBI databases. Records contain links to higher
taxa, nomenclatural synonyms, and links to the various databases in which records for a
given organism reside.
 DNA and Protein sequencing and analysis.
Introduction: Before the 1970s there was no direct method to determine a nucleotide sequence. In the
mid-1970s, two methods were developed for the direct sequencing of DNA: the Sanger-Coulson
chain-termination method and the Maxam-Gilbert chemical cleavage method. For this work, Sanger
and Gilbert each received a share of the 1980 Nobel Prize in Chemistry.
DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule.
It includes any method or technology that is used to determine the order of the four bases—adenine,
guanine, cytosine, and thymine—in a strand of DNA. The advent of rapid DNA sequencing methods has
greatly accelerated biological and medical research and discovery.[1]
Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous
applied fields such as medical diagnosis, biotechnology, forensic biology, virology and biological
systematics. The rapid speed of sequencing attained with modern DNA sequencing technology has
been instrumental in the sequencing of complete DNA sequences, or genomes of numerous types and
species of life, including the human genome and other complete DNA sequences of many animal, plant,
and microbial species.
DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions
(i.e. clusters of genes or operons), full chromosomes or entire genomes, of any organism. DNA
sequencing is also the most efficient way to indirectly sequence RNA or proteins (via their open reading frames).
In fact, DNA sequencing has become a key technology in many areas of biology and other sciences such
as medicine, forensics, or anthropology.

An example of the results of automated chain-termination DNA sequencing.
The first DNA sequences were obtained in the early 1970s by academic researchers using laborious
methods based on two-dimensional chromatography. Following the development of
fluorescence-based sequencing methods with a DNA sequencer, DNA sequencing has become easier and
orders of magnitude faster.
Methods of DNA sequencing.
Maxam-Gilbert sequencing
Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977 based on chemical
modification of DNA and subsequent cleavage at specific bases. Also known as chemical sequencing,
this method allowed purified samples of double-stranded DNA to be used without further cloning. This
method's use of radioactive labeling and its technical complexity discouraged extensive use after
refinements in the Sanger methods had been made.
Maxam-Gilbert sequencing requires radioactive labeling at one 5' end of the DNA and purification of the
DNA fragment to be sequenced. Chemical treatment then generates breaks at a small proportion of
one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). The concentration of
the modifying chemicals is controlled to introduce on average one modification per DNA molecule.
Thus a series of labeled fragments is generated, from the radiolabeled end to the first "cut" site in each
molecule. The fragments in the four reactions are electrophoresed side by side in denaturing
acrylamide gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for
autoradiography, yielding a series of dark bands each corresponding to a radiolabeled DNA fragment,
from which the sequence may be inferred.
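The logic of reading a Maxam-Gilbert gel can be sketched in code. This is an idealised model, assuming every position produces a clean band: a cut at the base with 0-based index i leaves a labeled fragment of i nucleotides, so each band position identifies one base. The function names are hypothetical.

```python
# Which bases each of the four chemical reactions cleaves at.
LANES = {"G": "G", "A+G": "AG", "C": "C", "C+T": "CT"}


def band_positions(sequence):
    """Simulate the gel: for each lane, the set of positions that give a band."""
    return {
        lane: {i for i, base in enumerate(sequence.upper()) if base in bases}
        for lane, bases in LANES.items()
    }


def read_gel(bands, length):
    """Infer the sequence from the four lanes, position by position."""
    sequence = []
    for i in range(length):
        if i in bands["G"]:
            sequence.append("G")      # band in both G and A+G lanes
        elif i in bands["A+G"]:
            sequence.append("A")      # band in A+G only
        elif i in bands["C"]:
            sequence.append("C")      # band in both C and C+T lanes
        elif i in bands["C+T"]:
            sequence.append("T")      # band in C+T only
        else:
            raise ValueError(f"no band at position {i}")
    return "".join(sequence)
```

For example, read_gel(band_positions("GATTACA"), 7) recovers "GATTACA". On a real gel the shortest fragments are hard to resolve, so the first few bases usually cannot be read this way.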
Chain-termination methods
The chain-termination method developed by Frederick Sanger and coworkers in 1977 soon became the
method of choice, owing to its relative ease and reliability. When invented, the chain-terminator method
used fewer toxic chemicals and lower amounts of radioactivity than the Maxam and Gilbert method.
Because of its comparative ease, the Sanger method was soon automated and was the method used in
the first generation of DNA sequencers.

Sanger sequencing is the method which prevailed from the 1980s until the mid-2000s. Over that period,
great advances were made in the technique, such as fluorescent labelling, capillary electrophoresis, and
general automation. These developments allowed much more efficient sequencing, leading to lower
costs. The Sanger method, in mass production form, is the technology which produced the first human
genome in 2001, ushering in the age of genomics. However, later in the decade, radically different
approaches reached the market, bringing the cost per genome down from $100 million in 2001 to
$10,000 in 2011.
Some tasks in DNA sequence analysis
Large quantities of sequence data are being published, for organisms from bacteria to higher
mammals. It will take decades to analyse the data.
Here are a few of the tasks involved in the analysis:
• Sequence assembly: Sequencing involves a “shotgun” approach where DNA fragments roughly 1000 bp
long are sequenced with significant redundant coverage. These then have to be “assembled”.
• Annotation: The assembled genomes have to be “annotated”: genes identified and marked
out, their functions identified, and so on.
• Motif finding: Non-coding DNA contains “regulatory” regions where proteins called “transcription
factors” bind to “turn on” genes. Identifying such regions, and binding sites for
individual TFs, is of great importance. TFs typically bind to small “motifs”, so the task is
to find overrepresented short “motifs” in larger quantities of sequence.
• Sequence alignment: In the last two tasks, it is very useful to compare genomes of previously
sequenced species. “Comparative genomics” is becoming a very important subfield. Detection
and alignment of homologous sequence is an important task here.
• Phylogenetic trees: Given sequence data from different species, it is useful to reconstruct their
phylogenetic relationship.
Algorithms exist for all these tasks, but all are evolving with increasing understanding of the
function of non-coding DNA, increasing mathematical and algorithmic sophistication in the methods,
and increasing raw computational power available to tackle these tasks.
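As a first step towards the motif-finding task listed above, one can simply count every short word (k-mer) in a set of sequences and look at the most frequent ones. This is only a sketch: the sequences below are made up, and a real motif finder would compare the counts against a background model rather than use raw frequencies.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Hypothetical upstream regions, used only to show the call.
regions = ["ACGTACGT", "TTACGTAA"]
top = kmer_counts(regions, 4).most_common(1)
```

Here top is [("ACGT", 3)]: the 4-mer ACGT occurs twice in the first region and once in the second.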
There is not much explicit mention of parallel programming in what follows. But most problems
are intrinsically parallelisable, and many tasks require several independent runs that can be done
trivially in parallel.

An overview of DNA Sequencing
 Protein sequencing and analysis
Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or
peptide. This may serve to identify the protein or characterize its post-translational
modifications. Typically, partial sequencing of a protein provides sufficient information (one or
more sequence tags) to identify it with reference to databases of protein sequences derived
from the conceptual translation of genes.
The two major direct methods of protein sequencing are mass spectrometry and Edman degradation
using a protein sequenator (sequencer). Mass spectrometry methods are now the most widely used for
protein sequencing and identification but Edman degradation remains a valuable tool for characterizing
a protein's N-terminus.
Determining amino acid composition
It is often desirable to know the unordered amino acid composition of a protein prior to attempting to
find the ordered sequence, as this knowledge can be used to facilitate the discovery of errors in the
sequencing process or to distinguish between ambiguous results. Knowledge of the frequency of
certain amino acids may also be used to choose which protease to use for digestion of the protein. The
misincorporation of low levels of non-standard amino acids (e.g. norleucine) into proteins may also be
determined.[1] A generalized method often referred to as amino acid analysis[2] for determining amino
acid frequency is as follows:
1. Hydrolyse a known quantity of protein into its constituent amino acids.
2. Separate and quantify the amino acids in some way.
Hydrolysis
Hydrolysis is done by heating a sample of the protein in 6 M hydrochloric acid to 100–110 °C for 24 hours
or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However,
these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan,
glutamine, and cysteine) are degraded. To circumvent this problem, Biochemistry Online suggests
heating separate samples for different times, analysing each resulting solution, and extrapolating back
to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation, such as
thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre-oxidising
cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of
amide hydrolysis.
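The extrapolation back to zero hydrolysis time can be sketched as a straight-line fit. The example assumes the loss of a labile amino acid is roughly linear over the times measured, which is a simplification, and the serine values are invented for illustration. The intercept of the fitted line estimates the amount present before any degradation.

```python
def extrapolate_to_zero(times, amounts):
    """Least-squares line through (time, amount); return (intercept, slope)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(amounts) / n
    s_ty = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, amounts))
    s_tt = sum((t - mean_t) ** 2 for t in times)
    slope = s_ty / s_tt
    intercept = mean_y - slope * mean_t  # estimated amount at time zero
    return intercept, slope


# Hypothetical serine recoveries (nmol) after 24, 48 and 72 hours of hydrolysis.
hours = [24, 48, 72]
serine_nmol = [9.0, 8.0, 7.0]
serine_at_zero, loss_per_hour = extrapolate_to_zero(hours, serine_nmol)
```

With these made-up numbers the fitted intercept is 10.0 nmol, that is, about 1 nmol is lost for every 24 hours of heating.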
Separation and quantitation
The amino acids can be separated by ion-exchange chromatography then derivatized to facilitate their
detection. More commonly, the amino acids are derivatized then resolved by reversed phase HPLC.
An example of the ion-exchange chromatography is given by the NTRC using sulfonated polystyrene as
a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through
the column. Amino acids are eluted when the pH reaches their respective isoelectric points. Once the
amino acids have been separated, their respective quantities are determined by adding a reagent that
will form a coloured derivative. If the amounts of amino acids are in excess of 10 nmol, ninhydrin can be
used for this; it gives a yellow colour when reacted with proline, and a vivid purple with other amino
acids. The concentration of amino acid is proportional to the absorbance of the resulting solution. With
very small quantities, down to 10 pmol, fluorescent derivatives can be formed using reagents such as
ortho-phthalaldehyde (OPA) or fluorescamine.
Pre-column derivatization may use the Edman reagent to produce a derivative that is detected by UV
light. Greater sensitivity is achieved using a reagent that generates a fluorescent derivative. The
derivatized amino acids are subjected to reversed phase chromatography, typically using a C8 or C18
silica column and an optimised elution gradient. The eluting amino acids are detected using a UV or
fluorescence detector and the peak areas compared with those for derivatised standards in order to
quantify each amino acid in the sample.
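The comparison of peak areas with standards amounts to a simple proportion. The sketch below assumes the detector response is linear and that each standard was injected at a known amount; the peak areas and function names are hypothetical.

```python
def quantify(sample_areas, standard_areas, standard_amount_nmol):
    """Convert sample peak areas to amounts using single-point standards.

    amount = standard amount x (sample peak area / standard peak area)
    """
    return {
        aa: standard_amount_nmol * area / standard_areas[aa]
        for aa, area in sample_areas.items()
    }


def mole_fractions(amounts):
    """Express each amino acid as a fraction of the total amount."""
    total = sum(amounts.values())
    return {aa: amount / total for aa, amount in amounts.items()}


# Hypothetical peak areas; each standard corresponds to 1.0 nmol.
standards = {"Ala": 2000.0, "Gly": 1000.0}
sample = {"Ala": 4000.0, "Gly": 1000.0}
amounts = quantify(sample, standards, 1.0)
composition = mole_fractions(amounts)
```

Here amounts is {"Ala": 2.0, "Gly": 1.0} in nmol, so alanine makes up two thirds of the residues measured and glycine one third.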
N-terminal amino acid analysis
Sanger's method of peptide end-group analysis: (A) derivatization of the N-terminal end with Sanger's
reagent (DNFB); (B) total acid hydrolysis of the dinitrophenyl peptide.

Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid
the ordering of individual peptide fragments' sequences into a whole chain, and because the first round
of Edman degradation is often contaminated by impurities and therefore does not give an accurate
determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis
follows:
1. React the peptide with a reagent that will selectively label the terminal amino acid.
2. Hydrolyse the protein.
3. Determine the amino acid by chromatography and comparison with standards.
There are many different reagents which can be used to label terminal amino acids. They all react with
amine groups and will therefore also bind to amine groups in the side chains of amino acids such as
lysine - for this reason it is necessary to be careful in interpreting chromatograms to ensure that the
right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-
dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for
the Edman degradation, can also be used. The same questions apply here as in the determination of
amino acid composition, with the exception that no stain is needed, as the reagents produce coloured
derivatives and only qualitative analysis is required. So the amino acid does not have to be eluted from
the chromatography column, just compared with a standard. Another consideration to take into
account is that, since any amine groups will have reacted with the labelling reagent, ion exchange
chromatography cannot be used, and thin layer chromatography or high-pressure liquid
chromatography should be used instead.
C-terminal amino acid analysis
The number of methods available for C-terminal amino acid analysis is much smaller than the number of
available methods of N-terminal analysis. The most common method is to add carboxypeptidases to a
solution of the protein, take samples at regular intervals, and determine the terminal amino acid by
analysing a plot of amino acid concentrations against time. This method is particularly useful for
polypeptides and proteins with blocked N-termini. C-terminal sequencing also helps to verify
the primary structures of proteins predicted from DNA sequences and to detect any post-translational
processing of gene products from known codon sequences.
Edman degradation
The Edman degradation is a very important reaction for protein sequencing, because it allows the
ordered amino acid composition of a protein to be discovered. Automated Edman sequencers are now
in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A
reaction scheme for sequencing a protein by the Edman degradation follows; some of the steps are
elaborated on subsequently.
1. Break any disulfide bridges in the protein with a reducing agent like 2-mercaptoethanol. A
protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.
2. Separate and purify the individual chains of the protein complex, if there is more than one.
3. Determine the amino acid composition of each chain.
4. Determine the terminal amino acids of each chain.
5. Break each chain into fragments under 50 amino acids long.
6. Separate and purify the fragments.
7. Determine the sequence of each fragment.
8. Repeat with a different pattern of cleavage.
9. Construct the sequence of the overall protein.
Digestion into peptide fragments
Peptides longer than about 50-70 amino acids cannot be sequenced reliably by the Edman
degradation. Because of this, long protein chains need to be broken up into small fragments that can
then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or
by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns,
and the overlap between fragments can be used to construct an overall sequence.
Reaction
The peptide to be sequenced is adsorbed onto a solid surface. One common substrate is glass fibre
coated with polybrene, a cationic polymer. The Edman reagent, phenylisothiocyanate (PITC), is added
to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine. This reacts
with the amine group of the N-terminal amino acid.
The terminal amino acid can then be selectively detached by the addition of anhydrous acid. The
derivative then isomerises to give a substituted phenylthiohydantoin, which can be washed off and
identified by chromatography, and the cycle can be repeated. The efficiency of each step is about 98%,
which allows about 50 amino acids to be reliably determined.
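The link between a per-cycle efficiency of about 98% and a practical limit of about 50 residues can be made concrete with a small calculation. The sketch below assumes that losses simply compound from cycle to cycle (a simplification; real sequencers also accumulate background signal), and the function name is only illustrative.

```python
def cumulative_yield(efficiency, cycles):
    """Fraction of peptide chains still in phase after a number of Edman cycles.

    Assumes each cycle succeeds independently with the same efficiency.
    """
    return efficiency ** cycles


if __name__ == "__main__":
    # Tabulate how the in-phase signal decays with the number of cycles.
    for n in (10, 30, 50, 70):
        print(f"{n:3d} cycles: {cumulative_yield(0.98, n):.1%} of chains still in phase")
```

Under this assumption only about 36% of the chains are still in phase after 50 cycles, which is consistent with the reliable read length quoted above.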
Figure: A Beckman-Coulter Porton LF3000G protein sequencing machine
Protein sequenator
A protein sequenator is a machine that performs Edman degradation in an automated manner. A
sample of the protein or peptide is immobilized in the reaction vessel of the protein sequenator and the
Edman degradation is performed. Each cycle releases and derivatises one amino acid from the protein
or peptide's N-terminus and the released amino-acid derivative is then identified by HPLC. The
sequencing process is done repetitively for the whole polypeptide until the entire measurable sequence
is established or for a pre-determined number of cycles.
Identification by mass spectrometry
Protein identification is the process of assigning a name to a protein of interest (POI), based on its
amino-acid sequence. Typically, only part of the protein’s sequence needs to be determined
experimentally in order to identify the protein with reference to databases of protein sequences
deduced from the DNA sequences of their genes. Further protein characterization may include
confirmation of the actual N- and C-termini of the POI, determination of sequence variants and
identification of any post-translational modifications present.
Proteolytic digests
A general scheme for protein identification is described below. The POI is isolated, typically by SDS-PAGE or
chromatography.
• The isolated POI may be chemically modified to stabilise cysteine residues (e.g. S-amidomethylation
or S-carboxymethylation).
• The POI is digested with a specific protease to generate peptides. Trypsin, which cleaves
selectively on the C-terminal side of lysine or arginine residues, is the most commonly used
protease. Its advantages include i) the frequency of Lys and Arg residues in proteins, ii) the high
specificity of the enzyme, iii) the stability of the enzyme and iv) the suitability of tryptic peptides
for mass spectrometry.
• The peptides may be desalted to remove ionizable contaminants and subjected to MALDI-TOF
mass spectrometry. Direct measurement of the masses of the peptides may provide sufficient
information to identify the protein (see Peptide mass fingerprinting) but further fragmentation
of the peptides inside the mass spectrometer is often used to gain information about the
peptides’ sequences. Alternatively, peptides may be desalted and separated by reversed phase
HPLC and introduced into a mass spectrometer via an ESI source. LC-ESI-MS may provide more
information than MALDI-MS for protein identification but uses more instrument time.
• Depending on the type of mass spectrometer, fragmentation of peptide ions may occur via a
variety of mechanisms such as collision-induced dissociation (CID) or post-source decay (PSD).
In each case, the pattern of fragment ions of a peptide provides information about its sequence.
• Information including the measured mass of the putative peptide ions and those of their
fragment ions is then matched against calculated mass values from the conceptual (in-silico)
proteolysis and fragmentation of databases of protein sequences. A match is considered
successful if its score exceeds a threshold based on the analysis parameters. Even if the actual
protein is not represented in the database, error-tolerant matching allows for the putative
identification of a protein based on similarity to homologous proteins. A variety of software
packages are available to perform this analysis.
• Software packages usually generate a report showing the identity (accession code) of each
identified protein, its matching score, and a measure of the relative strength of the
matching where multiple proteins are identified.
• A diagram of the matched peptides on the sequence of the identified protein is often used to
show the sequence coverage (% of the protein detected as peptides). Where the POI is thought
to be significantly smaller than the matched protein, the diagram may suggest whether the POI
is an N- or C-terminal fragment of the identified protein.
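The conceptual (in-silico) proteolysis mentioned in the scheme above can be sketched in a few lines. This minimal example applies only the rule stated in these notes, namely that trypsin cleaves on the C-terminal side of lysine (K) or arginine (R); missed cleavages and other refinements used by real search engines are deliberately ignored, and the function name and the example sequence are hypothetical.

```python
def tryptic_digest(sequence):
    """Split a protein sequence after every K or R (simplified trypsin rule)."""
    peptides = []
    start = 0
    for position, residue in enumerate(sequence):
        if residue in "KR":
            # Cleave on the C-terminal side of the residue.
            peptides.append(sequence[start:position + 1])
            start = position + 1
    if start < len(sequence):
        # Remaining C-terminal peptide that does not end in K or R.
        peptides.append(sequence[start:])
    return peptides


if __name__ == "__main__":
    # Hypothetical sequence used only for illustration.
    print(tryptic_digest("MKTAYIAKQRQISFVK"))
```

The resulting peptide list is what a search program would convert to calculated masses before comparing them with the measured values.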
De novo sequencing
The pattern of fragmentation of a peptide allows for direct determination of its sequence by de novo
sequencing. This sequence may be used to match databases of protein sequences or to investigate
post-translational or chemical modifications. It may provide additional evidence for protein
identifications performed as above.
N- and C-termini
The peptides matched during protein identification do not necessarily include the N- or C-termini
predicted for the matched protein. This may result from the N- or C-terminal peptides being difficult to
identify by MS (e.g. being either too short or too long), being post-translationally modified (e.g. N-
terminal acetylation) or genuinely differing from the prediction. Post-translational modifications or
truncated termini may be identified by closer examination of the data (e.g. by de novo sequencing). A
repeat digest using a protease of different specificity may also be useful.
Post-translational modifications
Whilst detailed comparison of the MS data with predictions based on the known protein sequence may
be used to define post-translational modifications, targeted approaches to data acquisition may also be
used. For instance, specific enrichment of phosphopeptides may assist in identifying phosphorylation
sites in a protein. Alternative methods of peptide fragmentation in the mass spectrometer, such as ETD
or ECD, may give complementary sequence information.
Whole-mass determination
The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water
molecule, adjusted for any post-translational modifications. Although proteins ionize less well than
the peptides derived from them, a protein in solution can often be analysed by ESI-MS and its
mass measured to an accuracy of 1 part in 20,000 or better. This is often sufficient to confirm the
termini (i.e. that the protein’s measured mass matches that predicted from its sequence) and to infer the
presence or absence of many post-translational modifications.
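The mass calculation described above is easy to sketch. The example below uses standard monoisotopic residue masses (in daltons) for the 20 common amino acids; the function names and the single `modifications` argument, which lumps all modification mass shifts into one number, are assumptions made for illustration.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def whole_mass(sequence, modifications=0.0):
    """Sum of residue masses plus one water, plus any modification mass shifts."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + modifications


def matches_prediction(measured, predicted, parts=20000):
    """True if the measured mass agrees with the prediction to 1 part in `parts`."""
    return abs(measured - predicted) <= predicted / parts


if __name__ == "__main__":
    print(round(whole_mass("GA"), 5))
    # Leucine and isoleucine are isomeric, so the two masses are identical.
    print(whole_mass("PEPTLDE") == whole_mass("PEPTIDE"))
```

Because leucine and isoleucine have the same residue mass, a calculation of this kind cannot tell them apart. A modification is handled by passing its mass shift, for example about 42.01056 Da for an N-terminal acetylation.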
Limitations
Proteolysis does not always yield a set of readily analyzable peptides covering the entire sequence of
the POI. The fragmentation of peptides in the mass spectrometer often does not yield ions
corresponding to cleavage at each peptide bond. Thus, the deduced sequence for each peptide is not
necessarily complete. The standard methods of fragmentation do not distinguish between leucine and
isoleucine residues since they are isomeric.

Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-
terminus has been chemically modified (e.g. by acetylation or formation of pyroglutamic acid). Edman
degradation is generally not useful to determine the positions of disulfide bridges. It also requires
peptide amounts of 1 picomole or above for discernible results, making it less sensitive than mass
spectrometry.
Introduction to sequence alignment
Sequence similarity search and sequence alignment
We can do a similarity search to learn if our sequenced DNA can be found in a public nucleotide
database (i.e. it has already been cloned by others) and/or whether it is evolutionarily related (i.e.
homologous) to other sequences. In a simple similarity search, one can compare a sequence with
sequences found in an entire nucleotide database (see later the BLAST program), while for a homology
search the method of choice is multiple sequence alignment by the ClustalW program. By comparing
either nucleotide or amino acid sequences we can find homologs. If these are from different species
(that had a common ancestor) but have identical or similar functions they are called orthologs; while
those homologs that are found in the same organism and originate from a gene duplication event
followed by divergent evolution within the species are called paralogs. We will not cover the
construction of evolutionary trees in this e-book—one can learn about these in bioinformatics or
evolutionary biology courses.
The BLAST program
If we sequence a DNA clone, the first bioinformatics analysis is a similarity search against a nucleotide
database. The most widely used similarity search program accessible on the internet is BLAST (Basic
Local Alignment Search Tool), which will be described here and will be used by the students during the
laboratory practice. The BLAST program is available online at several servers including the one at NCBI:
http://blast.ncbi.nlm.nih.gov/Blast.cgi.
BLAST uses a heuristic algorithm that makes it possible to search a huge database in a very short period
of time by using a query sequence. The high speed of the algorithm stems from the fact that the query
sequence is divided into short "words" that are used, instead of the full-length sequence, during the
alignment process. These words are searched for in the database first (a step called "seeding", which
locates short local matches). The most relevant hits are then scored with the help of a scoring matrix,
extended to neighbouring words, and finally assembled and compiled into a final list of similarity hits.
Note that the query sequences must be in the so-called FASTA format (FASTA was a previously
popular but much slower similarity search program). The FASTA format is shown in Figure 11.10.
Figure 11.10. The FASTA sequence format
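For readers without access to the figure, a FASTA record consists of a header line starting with ">" followed by one or more lines of sequence. The following minimal parser is a sketch of how such a file can be read; the record names and sequences are invented for illustration, and established libraries handle many more edge cases.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    # Hypothetical two-record example.
    example = ">seq1 demo nucleotide record\nACGTACGT\nTTGACA\n>seq2 demo protein record\nMKVLA\n"
    for name, sequence in parse_fasta(example):
        print(name, len(sequence), sequence)
```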
If we want to search using a nucleotide query sequence within a nucleotide database, we can use the
BLASTN version of the program. If we have an amino acid sequence, we can search a protein database
by the BLASTP version of the program. The BLASTX version of the program translates a nucleotide
sequence in all six reading frames (three on each strand) and allows searching a protein database.
Finally, with the TBLAST subprogram, we can search against a translated nucleotide database using
either a protein (TBLASTN) or a nucleotide (TBLASTX) query sequence. These similarity search options
are summarised in Figure 11.11.
Figure 11.11. Search possibilities in the BLAST program
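The six-frame translation that BLASTX performs on a nucleotide query can be illustrated as follows. This is a sketch using the standard genetic code and assuming an unambiguous DNA sequence containing only A, C, G and T; it is not the BLAST implementation, and the frame labels (+1 to +3, -1 to -3) are just a convenient naming convention.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of the sequence ('*' = stop)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three frames on each strand, as done for a BLASTX query."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translated sequences can then be compared with a protein database in the same way as an ordinary protein query.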
The result of a BLAST analysis is a list of sequences from the searched database that show significant
similarity to the query sequence. Besides the sequence identifiers of the similar sequence hits in the
database, the final list of alignments contains a score number and a statistical significance number, the
E-value. The E-value is a parameter that describes the number of hits one can expect to see by chance
when searching a database of a particular size. It decreases exponentially as the score (S) of the match
increases. Essentially, the E-value describes the random background noise. The lower the E-value, or the
closer it is to zero, the more "significant" the match (E < 0.01 is usually considered to reflect a
homologous, i.e. evolutionarily related, sequence). The score value is calculated based on the
alignment, taking into account the gaps and the similarity of the amino acids at the aligned positions.
The most often used similarity matrix (an amino acid substitution matrix) is the BLOSUM (BLOcks
SUbstitution Matrix) matrix. The numbers within a BLOSUM are "log-odds" scores: for each pair of
amino acids, the logarithm of the ratio of the likelihood that the two are aligned because the sequences
are biologically related to the likelihood that the same two amino acids are aligned by chance.
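Both quantities described above can be written down compactly. The sketch below assumes the commonly used Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the normalised (bit) score, m the query length and n the database length; the notes themselves only state that E falls exponentially with the score. The log-odds function assumes base-2 logarithms scaled by a factor of 2 (half-bit units) and rounded to an integer. The input frequencies in the example are invented.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E-value under the assumed relation E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def log_odds_score(p_aligned, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score for one amino acid pair.

    p_aligned: frequency with which the pair is aligned in related sequences.
    freq_a, freq_b: background frequencies of the two amino acids.
    scale: assumed scaling factor (2.0 gives half-bit units).
    """
    return round(scale * math.log2(p_aligned / (freq_a * freq_b)))


if __name__ == "__main__":
    # Higher scores give exponentially smaller E-values for the same search space.
    for score in (20, 40, 60):
        print(score, expected_hits(score, query_length=300, database_length=1_000_000))
    # A pair seen twice as often as expected by chance scores positively.
    print(log_odds_score(0.02, 0.1, 0.1))
```

A pair that is aligned more often than chance predicts receives a positive score, and one that is aligned less often receives a negative score.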
The similarity hits can be found and downloaded from the database using their accession number
(identifier). BLAST hits are usually hyperlinked directly to the corresponding entries in the GenBank
database where we can learn much more about the related sequences, the gene, cDNA and/or the
coded protein. As we have already mentioned, the most comprehensive information on a given protein
can be found in the UniProt database. In Figure 11.12, a detail of a BLAST run is shown in which the
BLASTP program was used to search the UniProt database using a human skeletal actin query
sequence.
Figure 11.12. Result of a sequence similarity search by the BLAST program (human skeletal muscle actin was
used as a query sequence against the UniProt database)
It is important to note that, since 3-D structure is more conserved than primary structure, it is easier to
recognise two related proteins by comparing their three-dimensional structure than their amino acid
sequence. Obviously, it is more convenient to compare primary sequences, since they are available for
many more proteins than atomic-resolution structures are. Similarity searches and protein structure
comparisons are dealt with in more detail in bioinformatics (or structural bioinformatics) courses.

Figure : The wide range of in silico analysis possibilities of protein sequences. (Most of these options are
also available for nucleic acid sequences.)
FASTA
FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J.
Lipman and William R. Pearson in 1985.[1] Its legacy is the FASTA format which is now ubiquitous in
bioinformatics.
The original FASTP program was designed for protein sequence similarity searching. Because of the
exponentially expanding genetic information and the limited speed and memory of computers in the
1980s, heuristic methods were introduced for aligning a query sequence against entire databases. FASTA
(developed in 1988) added the ability to do DNA:DNA searches, translated protein:DNA searches, and
also provided a more sophisticated shuffling program for evaluating statistical significance.[2] There are
several programs in this package that allow the alignment of protein sequences and DNA sequences.
Nowadays, increased computer performance makes it possible to perform searches for local alignment
detection in a database using the Smith-Waterman algorithm.
FASTA is pronounced "fast A", and stands for "FAST-All", because it works with any alphabet; it is an
extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.
Uses
The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA
(with frameshifts), and ordered or unordered peptide searches. Recent versions of the FASTA package
include special translated search algorithms that correctly handle frameshift errors (which six-frame-
translated searches do not handle very well) when comparing nucleotide to protein sequence data.
In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an
implementation of the optimal Smith-Waterman algorithm.
A major focus of the package is the calculation of accurate similarity statistics, so that biologists can
judge whether an alignment is likely to have occurred by chance, or whether it can be used to infer
homology. The FASTA package is available from the University of Virginia[3] and the European
Bioinformatics Institute.[4]
A web interface for submitting sequences and searching the online databases of the European
Bioinformatics Institute (EBI) with the FASTA programs is also available.
The FASTA file format used as input for this software is now widely used by other sequence database
search tools (such as BLAST) and sequence alignment programs (Clustal, T-Coffee, etc.).
FASTA takes a given nucleotide or amino acid sequence and searches a corresponding sequence
database by using local sequence alignment to find matches of similar database sequences.
The FASTA program follows a largely heuristic method which contributes to the high speed of its
execution. It initially observes the pattern of word hits, word-to-word matches of a given length, and
marks potential matches before performing a more time-consuming optimized search using a Smith-
Waterman type of algorithm.
The size taken for a word, given by the parameter kmer, controls the sensitivity and speed of the
program. Increasing the kmer value decreases the number of background hits that are found. From the
word hits that are returned the program looks for segments that contain a cluster of nearby hits. It
then investigates these segments for a possible match.
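The word-hit stage can be illustrated with a look-up table of k-mers and a count of hits per diagonal of the dot plot. This is a simplified sketch, not the FASTA source code: it counts exact word matches only, without the mismatch penalties or matrix-based scoring that the program applies, and the function names are illustrative.

```python
from collections import defaultdict


def kmer_index(sequence, k):
    """Look-up table mapping each k-mer to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count exact k-mer matches on each diagonal (subject position - query position)."""
    index = kmer_index(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


if __name__ == "__main__":
    # A diagonal with many hits marks a candidate region of local similarity.
    print(diagonal_hits("ACGT", "TTACGT", k=2))
    # A larger k gives fewer hits, trading sensitivity for speed.
    print(diagonal_hits("ACGT", "TTACGT", k=4))
```

Diagonals that collect many nearby hits correspond to the segments that the program goes on to examine more closely.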
There are some differences between fastn and fastp relating to the type of sequences used, but both
use four steps and calculate three scores to describe and format the sequence similarity results. These
are:
1. Identify regions of highest density in each sequence comparison, taking kmer equal to 1 or 2.
In this step all or a group of the identities between two sequences are found using a look-up
table. The kmer value determines how many consecutive identities are required for a match to
be declared. Thus the smaller the kmer value, the more sensitive the search. kmer=2 is frequently
taken by users for protein sequences and kmer=4 or 6 for nucleotide sequences. Short
oligonucleotides are usually run with kmer=1. The program then finds all similar local regions,
represented as diagonals of a certain length in a dot plot, between the two sequences by
counting kmer matches and penalizing for intervening mismatches. This way, local regions of
highest density matches in a diagonal are isolated from background hits. For protein sequences
BLOSUM50 values are used for scoring kmer matches. This ensures that groups of identities
with high similarity scores contribute more to the local diagonal score than identities with low
similarity scores. Nucleotide sequences use the identity matrix for the same purpose. The best
10 local regions selected from all the diagonals put together are then saved.
2. Rescan the regions taken using the scoring matrices, trimming the ends of each region to include
only those residues contributing to the highest score.
Rescan the 10 regions taken. This time use the relevant scoring matrix while rescoring to allow
runs of identities shorter than the kmer value. Conservative replacements that contribute to the
similarity score are also counted while rescoring. Though protein sequences use the BLOSUM50
matrix, scoring matrices based on the minimum number of base changes required for a specific
replacement, on identities alone, or on an alternative measure of similarity such as PAM, can
also be used with the program. For each of the diagonal regions rescanned this way, a subregion
with the maximum score is identified. The initial scores found in step 1 are used to rank the
library sequences. The highest score is referred to as the init1 score.
3. If several initial regions with scores greater than a CUTOFF value are found in an alignment,
check whether the trimmed initial regions can be joined to form an approximate alignment with
gaps. Calculate a similarity score that is the sum of the scores of the joined regions, with a penalty
of 20 points for each gap. This initial similarity score (initn) is used to rank the library sequences.
The score of the single best initial region found in step 2 is reported (init1).
Here the program calculates an optimal alignment of initial regions as a combination of
compatible regions with maximal score. This optimal alignment of initial regions can be rapidly
calculated using a dynamic programming algorithm. The resulting score initn is used to rank the
library sequences. This joining process increases sensitivity but decreases selectivity. A carefully
calculated cut-off value is thus used to control where this step is implemented, a value that is
approximately one standard deviation above the average score expected from unrelated
sequences in the library. A 200-residue query sequence with kmer=2 uses a value of 28.
4. Use a banded Smith-Waterman algorithm to calculate an optimal score for alignment.
This step uses a banded Smith-Waterman algorithm to create an optimised score (opt) for each
alignment of the query sequence to a database (library) sequence. It takes a band of 32 residues
centered on the init1 region of step 2 for calculating the optimal alignment. After all sequences
are searched the program plots the initial scores of each database sequence in a histogram, and
calculates the statistical significance of the "opt" score. For protein sequences, the final
alignment is produced using a full Smith-Waterman alignment. For DNA sequences, a banded
alignment is provided.
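The banded Smith-Waterman calculation of step 4 can be sketched as below. The sketch restricts the dynamic programming to cells within half a band width of a chosen diagonal. Its scoring scheme (match +2, mismatch -1, linear gap penalty -2) is an assumption chosen for brevity; the FASTA programs use substitution matrices such as BLOSUM50 and their own gap penalties.

```python
def banded_smith_waterman(a, b, diagonal=0, band=32, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, restricted to a band around one diagonal.

    diagonal: offset (position in b minus position in a) at the band centre.
    band: total band width; cells further than band // 2 from the centre are skipped.
    """
    half = band // 2
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        # Only columns inside the band are evaluated for this row.
        low = max(1, i + diagonal - half)
        high = min(len(b), i + diagonal + half)
        for j in range(low, high + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # extend along the diagonal
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(banded_smith_waterman("ACGT", "TTACGT", diagonal=2, band=0))
    print(banded_smith_waterman("ACGT", "TTACGT", diagonal=0, band=0))
```

Centring a narrow band on the diagonal found in the earlier steps recovers the match, whereas a narrow band on the wrong diagonal misses it. This is why the band is placed on the init1 region.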
Unlike BLAST, FASTA cannot remove low-complexity regions before aligning the sequences. This can be
problematic: when the query sequence contains such regions, e.g. mini- or microsatellites repeating
the same short sequence many times, the scores of unrelated database sequences that match only in
these frequently occurring repeats are inflated.
For this reason the program PRSS is included in the FASTA distribution package. PRSS shuffles the matching
sequences in the database, either at the level of single letters or in short segments whose length the
user can set. The shuffled sequences are then aligned again; if the score is still higher than
expected, this is caused by the shuffled low-complexity regions still matching the query. From the
scores that the shuffled sequences still attain, PRSS can estimate the significance of the
score of the original sequences. The higher the score of the shuffled sequences, the less significant the
matches found between the original database sequence and the query sequence.[5]
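The idea behind such a shuffle test can be sketched with a simple Monte Carlo procedure. The sketch is not the PRSS implementation: it shuffles single letters only, scores alignments with a plain Smith-Waterman function under an assumed scheme (match +2, mismatch -1, gap -2), and reports the fraction of shuffles that score at least as well as the original pair.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with an assumed simple scoring scheme."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(0, previous[j - 1] + substitution,
                             previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def shuffle_test(query, subject, n_shuffles=200, seed=0):
    """Return the observed score and the fraction of shuffles scoring at least as high."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    observed = local_score(query, subject)
    letters = list(subject)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys residue order
        if local_score(query, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    print(shuffle_test("ACGTTGCAATGCCGTA", "ACGTTGCAATGCCGTA", n_shuffles=100))
    # A low-complexity pair scores just as well after shuffling.
    print(shuffle_test("AAAAAAAA", "AAAAAAAA", n_shuffles=100))
```

For the low-complexity pair every shuffle reproduces the original score, so the estimated fraction is 1 and the match carries no significance, which is the situation described in the paragraph above.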
The FASTA programs find regions of local or global similarity between Protein or DNA sequences, either
by searching Protein or DNA databases, or by identifying local duplications within a sequence. Other
programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be
used to infer functional and evolutionary relationships between sequences as well as help identify
members of gene families.
Protein
• Protein–protein FASTA
• Protein–protein Smith–Waterman (ssearch)
• Global protein–protein (Needleman–Wunsch) (ggsearch)
• Global/local protein–protein (glsearch)
• Protein–protein with unordered peptides (fasts)
• Protein–protein with mixed peptide sequences (fastf)
Nucleotide
• Nucleotide–nucleotide (DNA/RNA fasta)
• Ordered nucleotides vs nucleotide (fastm)
• Unordered nucleotides vs nucleotide (fasts)
Translated
• Translated DNA (with frameshifts, e.g. ESTs) vs proteins (fastx/fasty)
• Protein vs translated DNA (with frameshifts) (tfastx/tfasty)
• Peptides vs translated DNA (tfasts)
Statistical significance
• Protein vs protein shuffle (prss)
• DNA vs DNA shuffle (prss)
• Translated DNA vs protein shuffle (prfx)
Local duplications
• Local protein alignments (lalign)
• Plot protein alignment "dot-plot" (plalign)
• Local DNA alignments (lalign)
• Plot DNA alignment "dot-plot" (plalign)
