Lecture notes in Bioinformatics
2018
SARDAR HUSSAIN
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding
biological data. As an interdisciplinary field of science, bioinformatics combines Computer Science,
Biology, Mathematics, and Engineering to analyze and interpret biological data. Bioinformatics has
been used for in silico analyses of biological queries using mathematical and statistical techniques.
More broadly, bioinformatics is the application of statistics and computing to biological science.
Bioinformatics is both an umbrella term for the body of biological studies that use computer
programming as part of their methodology, as well as a reference to specific analysis "pipelines" that
are repeatedly used, particularly in the field of genomics. Common uses of bioinformatics include the
identification of candidate genes and single nucleotide polymorphisms (SNPs). Often, such
identification is made with the aim of better understanding the genetic basis of disease, unique
adaptations, desirable properties (esp. in agricultural species), or differences between populations. In a
less formal way, bioinformatics also tries to understand the organizational principles within nucleic acid
and protein sequences, called proteomics.[1]
Introduction
Bioinformatics has become an important part of many areas of biology. In experimental molecular
biology, bioinformatics techniques such as image and signal processing allow extraction of useful
results from large amounts of raw data. In the field of genetics and genomics, it aids in sequencing and
annotating genomes and their observed mutations. It plays a role in the text mining of biological
literature and the development of biological and gene ontologies to organize and query biological data.
It also plays a role in the analysis of gene and protein expression and regulation. Bioinformatics tools
aid in the comparison of genetic and genomic data and more generally in the understanding of
evolutionary aspects of molecular biology. At a more integrative level, it helps analyze and catalogue
the biological pathways and networks that are an important part of systems biology. In structural
biology, it aids in the simulation and modeling of DNA,[2] RNA,[2][3] proteins[4] as well as biomolecular
interactions.[5][6][7]
History
Historically, the term bioinformatics did not mean what it means today. Paulien Hogeweg and Ben
Hesper coined it in 1970 to refer to the study of information processes in biotic systems.[8][9][10] This
definition placed bioinformatics as a field parallel to biophysics (the study of physical processes in
biological systems) or biochemistry (the study of chemical processes in biological systems).[8]
Sequences
Sequences of genetic material are frequently used in bioinformatics and are easier to manage using
computers than manually.
Computers became essential in molecular biology when protein sequences became available after
Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences
manually turned out to be impractical. A pioneer in the field was Margaret Oakley Dayhoff, who has
been hailed by David Lipman, director of the National Center for Biotechnology Information, as the
"mother and father of bioinformatics." Dayhoff compiled one of the first protein sequence databases,
initially published as books[12] and pioneered methods of sequence alignment and molecular
evolution.[13] Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological
sequence analysis in 1970 with his comprehensive volumes of antibody sequences released with Tai Te
Wu between 1980 and 1991.[14]
Goals/scope
To study how normal cellular activities are altered in different disease states, the biological data must
be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics
has evolved such that the most pressing task now involves the analysis and interpretation of various
types of data. This includes nucleotide and amino acid sequences, protein domains, and protein
structures.[15] The actual process of analyzing and interpreting data is referred to as computational
biology. Important sub-disciplines within bioinformatics and computational biology include:
 Development and implementation of computer programs that enable efficient access to, use
and management of, various types of information
 Development of new algorithms (mathematical formulas) and statistical measures that assess
relationships among members of large data sets. For example, there are methods to locate a
gene within a sequence, to predict protein structure and/or function, and to cluster protein
sequences into families of related sequences.
The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it
apart from other approaches, however, is its focus on developing and applying computationally
intensive techniques to achieve this goal. Examples include: pattern recognition, data mining, machine
learning algorithms, and visualization. Major research efforts in the field include sequence alignment,
gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein
structure prediction, prediction of gene expression and protein–protein interactions, genome-wide
association studies, the modeling of evolution and cell division/mitosis.
Bioinformatics now entails the creation and advancement of databases, algorithms, computational and
statistical techniques, and theory to solve formal and practical problems arising from the management
and analysis of biological data.
Over the past few decades, rapid developments in genomic and other molecular research technologies
and developments in information technologies have combined to produce a tremendous amount of
information related to molecular biology. Bioinformatics is the name given to these mathematical and
computing approaches used to glean understanding of biological processes.
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences,
aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein
structures.
Relation to other fields
Bioinformatics is a science field that is similar to but distinct from biological computation, while it is
often considered synonymous to computational biology. Biological computation uses bioengineering
and biology to build biological computers, whereas bioinformatics uses computation to better
understand biology. Bioinformatics and computational biology involve the analysis of biological data,
particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive
growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in
DNA sequencing technology.
Analyzing biological data to produce meaningful information involves writing and running software
programs that use algorithms from graph theory, artificial intelligence[16], soft computing, data mining,
image processing, and computer simulation. The algorithms in turn depend on theoretical foundations
such as discrete mathematics, control theory, system theory, information theory, and statistics.
Sequence analysis
The sequences of different genes or proteins may be aligned side-by-side to measure their similarity.
An alignment can compare, for example, protein sequences containing WPP domains.
Since the Phage Φ-X174 was sequenced in 1977,[17] the DNA sequences of thousands of organisms have
been decoded and stored in databases. This sequence information is analyzed to determine genes that
encode proteins, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A
comparison of genes within a species or between different species can show similarities between
protein functions, or relations between species (the use of molecular systematics to construct
phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA
sequences manually. Today, computer programs such as BLAST are used daily to search sequences
from more than 260 000 organisms, containing over 190 billion nucleotides.[18] These programs can
compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, to identify
sequences that are related, but not identical. A variant of this sequence alignment is used in the
sequencing process itself.
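To make the idea of an alignment that tolerates exchanged, deleted or inserted bases concrete, here is a minimal sketch of the textbook Needleman-Wunsch global alignment score. It is not how BLAST works internally (BLAST uses a faster heuristic local search), and the function name and the scoring values (match, mismatch, gap) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, score only)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            # Exchanged or identical base: consume one symbol from each sequence.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap (deleted in b)
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap (inserted in b)
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Toy sequences, made up for illustration.
identical = global_alignment_score("GATTACA", "GATTACA")
one_deletion = global_alignment_score("GATTACA", "GATTAA")
```

Related but non-identical sequences still receive a high score, because a single mismatch or gap only subtracts a small penalty from the run of matches.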
DNA sequencing
Before sequences can be analyzed they have to be obtained. DNA sequencing is still a non-trivial
problem as the raw data may be noisy or afflicted by weak signals. Algorithms have been developed for
base calling for the various experimental approaches to DNA sequencing.
Sequence assembly
Most DNA sequencing techniques produce short fragments of sequence that need to be assembled to
obtain complete gene or genome sequences. The so-called shotgun sequencing technique (which was
used, for example, by The Institute for Genomic Research (TIGR) to sequence the first bacterial
genome, Haemophilus influenzae)[19] generates the sequences of many thousands of small DNA
fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The
ends of these fragments overlap and, when aligned properly by a genome assembly program, can be
used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the
task of assembling the fragments can be quite complicated for larger genomes. For a genome as large
as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers
to assemble the fragments, and the resulting assembly usually contains numerous gaps that must be
filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today,
and genome assembly algorithms are a critical area of bioinformatics research.
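The overlap-and-merge idea behind fragment assembly can be sketched with a naive greedy procedure. This is a toy illustration under simplifying assumptions (error-free reads, one strand, no repeats); production assemblers use far more elaborate methods, and the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest suffix/prefix overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: the remaining pieces are contigs separated by gaps
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads


# Toy reads, made up for illustration.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
```

When no pair of reads overlaps, the loop stops and returns several contigs, which mirrors the gaps mentioned above.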
Genome annotation
In the context of genomics, annotation is the process of marking the genes and other biological
features in a DNA sequence. This process needs to be automated because most genomes are too large
to annotate by hand, not to mention the desire to annotate as many genomes as possible, as the rate
of sequencing has ceased to pose a bottleneck. Annotation is made possible by the fact that genes
have recognisable start and stop regions, although the exact sequence found in these regions can vary
between genes.
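As a rough illustration of how recognisable start and stop signals can drive automated annotation, the sketch below scans the three forward reading frames for open reading frames that begin with ATG and end at a stop codon. It is a deliberately simplified assumption of what gene finders do (tools such as GeneMark use statistical models), and `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) pairs of ORFs on the forward strand only.

    start is the index of the ATG; end is one past the last base of the stop codon.
    min_codons counts the codons before the stop, including the ATG itself.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Toy sequence, made up for illustration.
toy = "CCATGGCCAAATAGGG"
orfs = find_orfs(toy)
```

A fuller version would also scan the reverse complement strand and would need extra evidence to decide which candidate ORFs are real genes.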
The first description of a comprehensive genome annotation system was published in 1995 [19] by the
team at The Institute for Genomic Research that performed the first complete sequencing and analysis
of the genome of a free-living organism, the bacterium Haemophilus influenzae.[19] Owen White
designed and built a software system to identify the genes encoding all proteins, transfer RNAs,
ribosomal RNAs (and other sites) and to make initial functional assignments. Most current genome
annotation systems work similarly, but the programs available for analysis of genomic DNA, such as the
GeneMark program trained and used to find protein-coding genes in Haemophilus influenzae, are
constantly changing and improving.
To pursue the goals that the Human Genome Project left unfinished at its completion in 2003, the
National Human Genome Research Institute in the U.S. launched a new project. The so-called
ENCODE project is a collaborative data collection of the functional elements of the human
genome that uses next-generation DNA-sequencing technologies and genomic tiling arrays,
technologies able to automatically generate large amounts of data at a dramatically reduced per-base
cost but with the same accuracy (base call error) and fidelity (assembly error).
Computational evolutionary biology
Evolutionary biology is the study of the origin and descent of species, as well as their change over time.
Informatics has assisted evolutionary biologists by enabling researchers to:
 trace the evolution of a large number of organisms by measuring changes in their DNA, rather
than through physical taxonomy or physiological observations alone,
 more recently, compare entire genomes, which permits the study of more complex evolutionary
events, such as gene duplication, horizontal gene transfer, and the prediction of factors
important in bacterial speciation,
 build complex computational population genetics models to predict the outcome of the system
over time[20]
 track and share information on an increasingly large number of species and organisms
Future work endeavours to reconstruct the now more complex tree of life.
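One very simple way to measure changes in DNA between organisms is the proportion of differing sites (the p-distance) between sequences that have already been aligned. The sketch below assumes equal-length, gap-free aligned sequences; the taxon names and sequences are toy data, and real phylogenetic analyses use evolutionary models rather than raw proportions.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def distance_table(seqs):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {
        (n1, n2): p_distance(seqs[n1], seqs[n2])
        for n1, n2 in combinations(sorted(seqs), 2)
    }


# Hypothetical taxa with made-up aligned sequences.
toy = {"taxon_a": "ACGTACGT", "taxon_b": "ACGTACGA", "taxon_c": "ACGAACTA"}
table = distance_table(toy)
```

A distance table of this kind is the usual input to distance-based tree-building methods, where smaller distances suggest more recent common ancestry.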
The area of research within computer science that uses genetic algorithms is sometimes confused with
computational evolutionary biology, but the two areas are not necessarily related.
Comparative genomics
The core of comparative genome analysis is the establishment of the correspondence between genes
(orthology analysis) or other genomic features in different organisms. It is these intergenomic maps
that make it possible to trace the evolutionary processes responsible for the divergence of two
genomes. A multitude of evolutionary events acting at various organizational levels shape genome
evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large
chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and
insertion.[21] Ultimately, whole genomes are involved in processes of hybridization, polyploidization and
endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many
exciting challenges to developers of mathematical models and algorithms, who have recourse to a
spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristics, fixed
parameter and approximation algorithms for problems based on parsimony models to Markov chain
Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on homology detection and the computation of protein families.
Pan genomics
Pan genomics is a concept introduced in 2005 by Tettelin and Medini which eventually took root in
bioinformatics. The pan genome is the complete gene repertoire of a particular taxonomic group: although
initially applied to closely related strains of a species, it can be applied to a larger context such as a genus
or phylum. It is divided into two parts: the core genome, the set of genes common to all the genomes under
study (these are often housekeeping genes vital for survival), and the dispensable (flexible) genome, the
set of genes present in only one or some, but not all, of the genomes under study. The bioinformatics tool BPGA can
be used to characterize the pan genome of bacterial species.[23]
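The core/dispensable split maps directly onto set operations. The following sketch assumes each genome has already been reduced to a set of gene identifiers (in practice this needs orthologue clustering first); it is not how BPGA is implemented, and the strain and gene names are toy placeholders.

```python
def pan_genome(genomes):
    """Split the genes of several genomes into pan, core and dispensable sets.

    genomes maps a genome name to an iterable of gene identifiers (at least one genome).
    """
    gene_sets = [set(genes) for genes in genomes.values()]
    pan = set().union(*gene_sets)            # every gene seen in any genome
    core = set.intersection(*gene_sets)      # genes present in all genomes
    dispensable = pan - core                 # genes missing from at least one genome
    return pan, core, dispensable


# Toy input: strain names and gene labels are placeholders.
toy = {
    "strain1": {"dnaA", "gyrB", "toxA"},
    "strain2": {"dnaA", "gyrB"},
    "strain3": {"dnaA", "gyrB", "resX"},
}
pan, core, dispensable = pan_genome(toy)
```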
Genetics of disease
With the advent of next-generation sequencing we are obtaining enough sequence data to map the
genes of complex diseases such as diabetes,[24] infertility,[25] breast cancer[26] or Alzheimer's Disease.[27]
Genome-wide association studies are a useful approach to pinpoint the mutations responsible for such
complex diseases.[28] Through these studies, thousands of DNA variants have been identified that are
associated with similar diseases and traits.[29] Furthermore, the possible use of genes in
prognosis, diagnosis or treatment is one of the most essential applications. Many studies are discussing
both the promising ways to choose the genes to be used and the problems and pitfalls of using genes
to predict disease presence or prognosis.[30]
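For a single variant in a case-control study, one basic association test is Pearson's chi-square statistic on the 2×2 table of allele counts. The sketch below is only an illustration of that one statistic: the function name and the counts are made up, and a real genome-wide study would also need multiple-testing correction and control for population structure.

```python
def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column of the table needs a non-zero total")
    # Shortcut formula for 2x2 tables: n * (ad - bc)^2 / product of the margins.
    return n * (a * d - b * c) ** 2 / denom


# Made-up counts: the alternative allele is more common in cases than in controls.
stat = allele_chi_square(case_alt=30, case_ref=70, control_alt=10, control_ref=90)
```

A larger statistic means the allele frequencies differ more between cases and controls than expected by chance; identical proportions give a statistic of zero.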
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways.
Massive sequencing efforts are used to identify previously unknown point mutations in a variety of
genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the
sheer volume of sequence data produced, and they create new algorithms and software to compare
the sequencing results to the growing collection of human genome sequences and germline
polymorphisms. New physical detection technologies are employed, such as oligonucleotide
microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and
single-nucleotide polymorphism arrays to detect known point mutations. These detection methods
simultaneously measure several hundred thousand sites throughout the genome, and when used in
high-throughput to measure thousands of samples, generate terabytes of data per experiment. Again
the massive amounts and new types of data generate new opportunities for bioinformaticians. The
data is often found to contain considerable variability, or noise, and thus Hidden Markov model and
change-point analysis methods are being developed to infer real copy number changes.
Two important principles can be used in the analysis of cancer genomes bioinformatically pertaining to
the identification of mutations in the exome. First, cancer is a disease of accumulated somatic
mutations in genes. Second, cancer contains driver mutations which need to be distinguished from
passengers.[31]
With the breakthroughs that this next-generation sequencing technology is providing to the field of
Bioinformatics, cancer genomics could drastically change. These new methods and software allow
bioinformaticians to sequence many cancer genomes quickly and affordably. This could create a more
flexible process for classifying types of cancer by analysis of cancer driver mutations in the genome.
Furthermore, tracking of patients while the disease progresses may be possible in the future with the
sequence of cancer samples.[32]
Another type of data that requires novel informatics development is the analysis of lesions found to be
recurrent among many tumors.
Gene and protein expression
Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple techniques
including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene
expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq, also
known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed
in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the
biological measurement, and a major research area in computational biology involves developing
statistical tools to separate signal from noise in high-throughput gene expression studies.[33] Such
studies are often used to determine the genes implicated in a disorder: one might compare microarray
data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that
are up-regulated and down-regulated in a particular population of cancer cells.
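The comparison of cancerous and non-cancerous expression levels is often summarised as a log2 fold change per transcript. The sketch below assumes one expression value per gene per condition, with made-up gene labels and values; the pseudocount and threshold are arbitrary assumptions, and because these measurements are noisy a real analysis would use replicates and a statistical test rather than a bare ratio.

```python
import math


def log2_fold_changes(tumour, normal, pseudocount=1.0):
    """log2((tumour + pseudocount) / (normal + pseudocount)) for genes present in both dicts."""
    return {
        gene: math.log2((tumour[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in tumour
        if gene in normal
    }


def classify(lfc, threshold=1.0):
    """Split genes into up- and down-regulated lists by a log2 fold-change threshold."""
    up = sorted(g for g, v in lfc.items() if v >= threshold)
    down = sorted(g for g, v in lfc.items() if v <= -threshold)
    return up, down


# Made-up expression values for three placeholder genes.
tumour = {"geneA": 7, "geneB": 0, "geneC": 3}
normal = {"geneA": 1, "geneB": 3, "geneC": 3}
up, down = classify(log2_fold_changes(tumour, normal))
```

The pseudocount keeps the ratio defined when a transcript is not detected in one of the two samples.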
Analysis of protein expression
Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the
proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein
microarray and HT MS data; the former approach faces similar problems as with microarrays targeted
at mRNA, the latter involves the problem of matching large amounts of mass data against predicted
masses from protein sequence databases, and the complicated statistical analysis of samples where
multiple, but incomplete peptides from each protein are detected. Cellular protein localization in a
tissue context can be achieved through affinity proteomics displayed as spatial data based on
immunohistochemistry and tissue microarrays.[34]
Analysis of regulation
Regulation is the complex orchestration of events by which a signal, potentially an extracellular signal
such as a hormone, eventually leads to an increase or decrease in the activity of one or more proteins.
Bioinformatics techniques have been applied to explore various steps in this process.
For example, gene expression can be regulated by nearby elements in the genome. Promoter analysis
involves the identification and study of sequence motifs in the DNA surrounding the coding region of a
gene. These motifs influence the extent to which that region is transcribed into mRNA. Enhancer
elements far away from the promoter can also regulate gene expression, through three-dimensional
looping interactions. These interactions can be determined by bioinformatic analysis of chromosome
conformation capture experiments.
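A first step in promoter analysis can be as simple as scanning the DNA upstream of a gene for a known motif. The sketch below uses a regular expression with a lookahead so that overlapping matches are reported; the pattern `TATA[AT]A` is a simplified TATA-box-like consensus assumed here for illustration, and the sequence is made up. Real motif finders typically use position weight matrices rather than exact patterns.

```python
import re


def find_motif(upstream, motif):
    """Start positions of all (possibly overlapping) matches of motif in one sequence."""
    # The zero-width lookahead lets finditer report matches that overlap each other.
    pattern = re.compile("(?=" + motif + ")")
    return [m.start() for m in pattern.finditer(upstream.upper())]


# Made-up upstream region; lower-case input is accepted and upper-cased.
positions = find_motif("ggTATAAAccTATATAtt", "TATA[AT]A")
```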
Expression data can be used to infer gene regulation: one might compare microarray data from a wide
variety of states of an organism to form hypotheses about the genes involved in each state. In a single-
cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat
shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine
which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes
can be searched for over-represented regulatory elements. Examples of clustering algorithms applied
in gene clustering are k-means clustering, self-organizing maps (SOMs), hierarchical clustering, and
consensus clustering methods.
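To show what clustering expression data means in practice, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) applied to toy expression profiles. The starting centroids are passed in explicitly so that the run is deterministic, which is an assumption made for illustration (real implementations usually pick random starts and repeat), and the gene names and values are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on equal-length numeric tuples; returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each profile goes to its nearest centroid (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of the profiles assigned to it.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Made-up expression profiles over three conditions.
profiles = {
    "g1": (1.0, 5.0, 9.0),
    "g2": (1.2, 5.1, 8.8),
    "g3": (9.0, 5.0, 1.0),
    "g4": (8.7, 4.9, 1.1),
}
points = list(profiles.values())
labels, centres = kmeans(points, centroids=[points[0], points[2]])
```

Genes that receive the same label have similar profiles across conditions and are candidates for being co-expressed, so their upstream regions can then be searched for shared regulatory elements.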
Analysis of cellular organization
Several approaches have been developed to analyze the location of organelles, genes, proteins, and
other components within cells. This is relevant as the location of these components affects the events
within a cell and thus helps us to predict the behavior of biological systems. A gene ontology category,
cellular component, has been devised to capture subcellular localization in many biological databases.
Microscopy and image analysis
Microscopic pictures allow us to locate both organelles and molecules. They may also help us to
distinguish between normal and abnormal cells, e.g. in cancer.
Protein localization
The localization of proteins helps us to evaluate the role of a protein. For instance, if a protein is found
in the nucleus it may be involved in gene regulation or splicing. By contrast, if a protein is found in
mitochondria, it may be involved in respiration or other metabolic processes. Protein localization is thus
an important component of protein function prediction. There are well developed protein subcellular
localization prediction resources available, including protein subcellular location databases and
prediction tools.[35][36]
Nuclear organization of chromatin
Data from high-throughput chromosome conformation capture experiments, such as Hi-C
and ChIA-PET, can provide information on the spatial proximity of DNA loci. Analysis of these
experiments can determine the three-dimensional structure and nuclear organization of chromatin.
Bioinformatic challenges in this field include partitioning the genome into domains, such as
Topologically Associating Domains (TADs), that are organised together in three-dimensional space.[37]
Structural bioinformatics
Three-dimensional protein structures are common subjects in bioinformatic analyses.
Protein structure prediction is another important application of bioinformatics. The amino acid
sequence of a protein, the so-called primary structure, can be easily determined from the sequence on
the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a
structure in its native environment. (Of course, there are exceptions, such as the bovine spongiform
encephalopathy – a.k.a. Mad Cow Disease – prion.) Knowledge of this structure is vital in understanding
the function of the protein. Structural information is usually classified as one of secondary, tertiary and
quaternary structure. A viable general solution to such predictions remains an open problem. Most
efforts have so far been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of
bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose
function is known, is homologous to the sequence of gene B, whose function is unknown, one could
infer that B may share A's function. In the structural branch of bioinformatics, homology is used to
determine which parts of a protein are important in structure formation and interaction with other
proteins. In a technique called homology modeling, this information is used to predict the structure of a
protein once the structure of a homologous protein is known. This currently remains the only way to
predict protein structures reliably.
One example of this is the structural similarity between hemoglobin in humans and the
hemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the
organism. Though both of these proteins have completely different amino acid sequences, their protein
structures are virtually identical, which reflects their near identical purposes.[38]
Other techniques for predicting protein structure include protein threading and de novo (from scratch)
physics-based modeling.
Network and systems biology
Network analysis seeks to understand the relationships within biological networks such as metabolic or
protein–protein interaction networks. Although biological networks can be constructed from a single
type of molecule or entity (such as genes), network biology often attempts to integrate many different
data types, such as proteins, small molecules, gene expression data, and others, which are all
connected physically, functionally, or both.
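A biological network of this kind is commonly represented as a graph, with molecules as nodes and interactions as edges. The sketch below builds an undirected adjacency structure from a list of interaction pairs and ranks nodes by degree; the protein labels are placeholders and the helper names are hypothetical.

```python
from collections import defaultdict


def build_network(edges):
    """Undirected adjacency sets from (node, node) interaction pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def hubs(graph, top=1):
    """The top most-connected nodes, ties broken alphabetically."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:top]


# Placeholder interaction list, not real data.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P3", "P4")]
network = build_network(edges)
top_hub = hubs(network)
```

Highly connected nodes (hubs) are often a starting point when looking for functionally important members of a network.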
Systems biology involves the use of computer simulations of cellular subsystems (such as the networks
of metabolites and enzymes that comprise metabolism, signal transduction pathways and gene
regulatory networks) to both analyze and visualize the complex connections of these cellular
processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the
computer simulation of simple (artificial) life forms.
Molecular interaction networks
Interactions between proteins are frequently visualized and analyzed using networks, for example the
network of protein–protein interactions from Treponema pallidum, the causative agent of syphilis and
other diseases.
Tens of thousands of three-dimensional protein structures have been determined by X-ray
crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR) and a central
question in structural bioinformatics is whether it is practical to predict possible protein–protein
interactions only based on these 3D shapes, without performing protein–protein interaction
experiments. A variety of methods have been developed to tackle the protein–protein docking
problem, though it seems that there is still much work to be done in this field.
Other interactions encountered in the field include Protein–ligand (including drug) and protein–
peptide. Molecular dynamic simulation of movement of atoms about rotatable bonds is the
fundamental principle behind computational algorithms, termed docking algorithms, for studying
molecular interactions.
 Sequence development
We start with a very basic review of biology, necessary for any further work, but largely sufficient
for getting started in computational biology. One can (and must) learn more “on the job”.
Biomolecules are sequences of monomers (DNA and RNA are nucleotide sequences; proteins are amino
acid sequences). DNA is the molecule that contains the entire blueprint for an organism. It contains
genes that encode the sequences for every protein in the organism, as well as non-coding regions
that, among other things, contain regulatory mechanisms for when and in what order different
genes get turned on, and may have other functions as well.
Most genes code for proteins; some genes code for RNA molecules that play various roles
in the cell. Both DNA and RNA are polymers of “nucleotides” which are bases of four kinds
[adenine=A, cytosine=C, guanine=G, thymine=T (DNA only), uracil=U (RNA only)] attached
to sugar-phosphate backbones. Apart from the one difference in bases, RNA and DNA are very
similar except that DNA usually exists in double-stranded “base-paired” form and RNA is in
single-stranded form.
The backbone of DNA (or RNA) is not symmetrical: each monomer has a 5’-phosphate group
at one end and a 3’-hydroxyl group at the other. Each strand is usually read from the 5’ to the 3’
end. The two strands go in opposite directions. The nucleic acids are base-paired A to T, G to C.
A-T pairs are weaker (two hydrogen bonds), G-C pairs are stronger (three hydrogen bonds).
Proteins are the “building
blocks” of life, responsible for a vast number of cellular processes. They regulate genes, catalyse various
biochemical reactions, form machinery for synthesis of other molecules (including other proteins) and
are important parts of organelles and tissues. They are polymers of amino acids (carboxylic acids with
an amine group and a side chain). There are twenty naturally occurring amino acids, differing in their
side chains. Proteins tend to “fold” into complex three-dimensional conformations; usually the fold is
unique and misfolding is rare. The details of the fold are biochemically important. Usually a few active
“domains” (for example, binding to DNA, interaction with other proteins) help the protein play
important roles in gene regulation, catalysis, etc; these domains tend to be well conserved across
species, while the rest of the protein sequence can mutate a lot. Much computational effort goes
into studying protein structure and function, but we will not discuss this vast subject here.
Genes that code for proteins are first “transcribed” to “messenger RNA” (mRNA) molecules,
and then the RNA is “translated” to proteins. Each “codon” of three nucleotides corresponds to a
single amino acid. Since there are 4 nucleotides, there are 4 × 4 × 4 = 64 possible codons; three of these are
“stop codons” (TAA, TAG, TGA) (sometimes called “nonsense codons”) and don’t code for amino
acids, instead indicating a stop to translation. The remaining 61 code for 20 amino acids. Several
codons (up to six) thus can code for the same amino acid. The “start codon” is ATG, which codes
for the amino acid methionine.
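The codon arithmetic above can be checked with a short sketch that builds the standard genetic code as a lookup table and translates a coding sequence until the first stop codon. Codons are written in the DNA alphabet, as in the text; the compact amino-acid string follows the standard table with bases ordered T, C, A, G, and the function name `translate` is chosen here for illustration.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes for the 64 codons TTT, TTC, TTA, TTG, TCT, ... ("*" marks a stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# ATG (methionine), GCC (alanine), TAA (stop).
peptide = translate("ATGGCCTAA")
```

Counting the table entries confirms the numbers in the text: 64 codons in total, 3 stop codons, and 61 codons shared among 20 amino acids, with leucine, serine and arginine each encoded by six codons.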
What are the biological problems?
There are of course a huge number of problems in biology that can benefit from a quantitative
treatment, ranging from single molecule behaviour to population biology and ecology. From the
title, we are already restricting ourselves to bioinformatics, but we will mainly focus on DNA
sequence analysis, with only occasional mention of proteins.
The following are a few issues of interest to biologists (and often of medical importance) that
could benefit from analysis of DNA sequence:
• Cellular processes: how the cell carries out its normal tasks; how it responds to external
events like heat shock and starvation; how it carries out complex cascades of events such as
the process of cell division (mitosis).
• Development: How a complex organism (e.g. a worm, a fly, a human) develops from a single
fertilised egg. As this embryonic cell divides, the daughter cells also slowly differentiate into
specialised functions. This happens as a result of “gradients” of various factors (some of them maternal)
that change gene regulation in different parts of the embryo and ultimately cause different
cells to develop in highly specialised ways.
• Evolution: How different species evolve, how new functionality develops.
All cellular and developmental processes are controlled by genes that get turned on in response
to some external condition (stress, starvation, embryonic gradients) or cyclically (cell cycle).
Computational study of how these genes are regulated and how they function is very useful. This is
done by analysing the gene sequence and regulatory DNA sequence of the organism itself, and
by comparison of this sequence with already-annotated sequence from other organisms. Highly
similar (homologous) genes exist among widely different organisms; such genes are called
“orthologues”.
Many subsystems in widely different organisms are very similar and are regulated by
orthologous proteins; some proteins exist largely unchanged from primitive archaebacteria all the
way to humans.
Moreover, many genes with high sequence identity often exist in the same organism, arising
from ancestral “gene duplication” events; their function is often slightly differentiated, and in fact
this is a major driving factor in evolution.
There are now “high-throughput” microarray experiments that can essentially give the response
of every gene in the genome; analysing, clustering and interpreting this data, and combining it with
other computational tasks in gene regulation, is of great interest.
Finally, the study of phylogeny (evolutionary history of organisms) and the classification (taxonomy)
of organisms has been revolutionised by DNA sequencing.
Aims and tasks of Bioinformatics
The aims of bioinformatics are threefold. First, at its simplest bioinformatics organizes data in a way
that allows researchers to access existing information and to submit new entries as they are produced,
e.g. the Protein Data Bank for 3D macromolecular structures. While data curation is an essential task,
the information stored in these databases is essentially useless until analyzed. Thus the purpose of
bioinformatics extends much further.
The second aim is to develop tools and resources that aid in the analysis of data. For example, having
sequenced a particular protein, it is of interest to compare it with previously characterized sequences.
This needs more than just a simple text-based search, and programs such as FASTA and PSI-BLAST must
consider what comprises a biologically significant match. Development of such resources requires
expertise in computational theory as well as a thorough understanding of biology.
The third aim is to use these tools to analyze the data and interpret the results in a biologically
meaningful manner. Traditionally, biological studies examined individual systems in detail, and
frequently compared those with a few that are related. In bioinformatics, we can now conduct global
analyses of all the available data with the aim of uncovering common principles that apply across many
systems and highlight novel features.
Application of Bioinformatics in current research
Almost every field of biological research, whether molecular biology, genetics, or even agriculture,
now relies on bioinformatics. An entire emerging field, genome informatics, is based completely on
bioinformatics tools. Bioinformatics also plays a primary role in predicting structural and functional
similarity, including in research on novel drug molecules. Typical tasks include the following:
• Submitting DNA Sequences to the Databases
Submission is one of the most important steps in biological research. Scientists sequence DNA and RNA, but until
a sequence is deposited in a public sequence database it cannot benefit the scientific
community. It has therefore become essential to submit all sequenced data to public sequence repositories.
Some of the important public repositories are DDBJ, EMBL, and GenBank.
These sequence data can be submitted to repositories in two ways, either by email submission or by
online submission through sequence submission tools. There are specific tools for every public
sequence repository (Table 1).
Table 1: Public sequence repositories
Table 2: Human Genome Databases, Browsers and Variation Resources
Table 2a: Vertebrate databases and genome browsers
Table 3: List of some invertebrate databases and genome browsers
After submission, and after verification and duplication checks, every database assigns a unique accession
number to the submitted sequence. Traditionally the accession number was a
single letter followed by five digits, but because of the huge number of submissions a format of two letters
followed by six digits is now also used.
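As a rough illustration, the two accession formats described above can be recognised with a regular expression. This is a minimal Python sketch: the function name `accession_style` is hypothetical, the optional `.N` version suffix is an added assumption, and real databases accept further formats that are not covered here.

```python
import re

# One letter + five digits, or two letters + six digits, as described in the
# text. The optional ".N" version suffix is an assumption added here.
ONE_PLUS_FIVE = re.compile(r"^[A-Z]\d{5}(\.\d+)?$")
TWO_PLUS_SIX = re.compile(r"^[A-Z]{2}\d{6}(\.\d+)?$")


def accession_style(accession):
    """Return '1+5', '2+6', or None if neither pattern matches."""
    if ONE_PLUS_FIVE.match(accession):
        return "1+5"
    if TWO_PLUS_SIX.match(accession):
        return "2+6"
    return None


for acc in ["U49845", "AF123456", "AF123456.2", "not-an-accession"]:
    print(acc, accession_style(acc))
```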
• Genomic Mapping and Mapping Databases
Gene mapping is a technique for estimating the position of a gene and the
distance between related genes.
Once all genes have been placed, the result is a genome map for the complete genome of that
particular organism.
• Information Retrieval From Biological Database
Developing biological databases and making them available online was one of the primary concerns in the
early days of the field. Now that many biological databases exist, holding data as text,
tables, pictures and many other formats, the challenge is knowing how to retrieve exactly the right data
from a suitable database. Retrieval may involve text, sequence, or
structural data.
• Sequence Alignment and Database Searching
Aligning a sequence with other relevant, similar sequences is essential in
biological research, both to understand the relationship between sequences and to predict structure and
function based on sequence similarity.
BLAST is very commonly used for basic alignment of sequences. Based on the number of sequences
involved in the alignment, alignments are classified as pairwise alignments or multiple sequence
alignments.
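To make the idea of a pairwise alignment concrete, here is a minimal Python sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch approach). This is not how BLAST works internally: BLAST uses a faster heuristic local search. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (linear gap cost)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal,            # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
```

Only the score is computed here; recovering the alignment itself would require keeping the full table and tracing back through it.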
• Predictive Methods Using DNA Sequences
Gene-finding strategies can be classified into three major categories.
Content-based methods rely on the overall, bulk properties of a sequence in making a determination.
Characteristics considered here include how often particular codons are used, the periodicity of
repeats, and the compositional complexity of the sequence. Because different organisms use
synonymous codons with different frequency, such clues can provide insight into determining regions
that are more likely to be exons.
In site-based methods, the focus turns to the presence or absence of a specific sequence, pattern, or
consensus. These methods are used to detect features such as donor and acceptor splice sites, binding
sites for transcription factors, polyA tracts, and start and stop codons.
Comparative methods make determinations based on sequence homology. Here, translated sequences
are subjected to database searches against protein sequences to determine whether a previously
characterized coding region corresponds to a region in the query sequence. Although this is
conceptually the most straightforward of the methods, it is restrictive because most newly discovered
genes do not have gene products that match anything in the protein databases.
Tools associated with these methods include Grail, Genscan, Fgenes, Procrustes and many others.
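The site-based idea of looking for start and stop codons can be sketched in a few lines of Python. The example below scans the three forward reading frames for stretches that begin with ATG and end at the first in-frame stop codon. It is a toy illustration under simplifying assumptions (forward strand only, no introns, hypothetical `min_codons` threshold) and is far simpler than the tools named above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    start and end are 0-based slice indices, so dna[start:end] includes the
    stop codon. min_codons counts the codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


sequence = "AAATGGCCTGATT"
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```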
• Predictive Methods Using Protein Sequences
There are tools based on predictive methods using protein sequences, such as PSLPred, NRpred, and
PSEAPred. Other methods work at the motif, residue, signal, peptide,
domain, or profile level.
• Sequences Assembly and Finishing Methods
At present, the sequencing process is often talked of as consisting of two parts, namely, assembly and
finishing, but in practice there is considerable overlap between the two. Assembly is the process of
attempting to order and align the readings, and finishing is the task of checking and editing the
assembled data. This includes performing new sequencing experiments to fill gaps or to cover
segments where the data is poor and adjudicating between conflicting readings when editing.
• Phylogenetic Analysis
Phylogenetic analysis is another important application of bioinformatics in biological
research. It is the study of the ancestral history of organisms. Using sequence and
structural similarities, we try to reconstruct how organisms are
related to one another and in what order they evolved.
Many tools are available, both online and as downloadable packages such as PHYLIP. They build trees
with algorithms based on methods such as UPGMA and neighbor joining.
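As an illustration of one of the tree-building methods named above, here is a minimal Python sketch of UPGMA. It repeatedly merges the two closest clusters and recomputes distances as size-weighted averages. It returns only the tree shape as nested tuples (no branch lengths), the distances are made-up values, and it is not the implementation used by PHYLIP or any other package.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa by UPGMA and return the tree as nested tuples.

    distances maps a pair (a, b) of names to their distance.
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(size) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        # Distance from the new cluster to every other cluster is the
        # average of the old distances, weighted by cluster size.
        for other in size:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


example = {
    ("A", "B"): 2, ("C", "D"): 4,
    ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], example))
```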
• Comparative Genome Analysis
Comparative genome analysis is performed in both academic and
professional research. By comparing the finished reference sequence of the human
genome with genomes of other organisms, researchers can identify regions of similarity and difference.
This information can help scientists better understand the structure and function of human genes and
thereby develop new strategies to combat human disease. Comparative genomics also provides a
powerful tool for studying evolutionary changes among organisms, helping to identify genes that are
conserved among species, as well as genes that give each organism its unique characteristics.
• Large-Scale Genome Analysis
Large-scale genome analysis centres on complete genome sequencing. It has advanced considerably
with next-generation sequencing platforms such as Illumina and with bioinformatics tools
developed to analyse their output quickly. The instruments are generally termed sequencers and play a
vital role in modern biological research.
Bioinformatics also has many applications in pharmaceutical research, where it
deals with systems biology, metabolic pathways, and their relation to biological
function.
Recent Advancement
Although bioinformatics is still in its nascent stage, continuous improvement is making it more
efficient. The incorporation of various programming languages into the field and the
development of software packages for analysing biological data are driving recent
advances. Drug design software packages such as “Sanjeevini”, developed by IIT Delhi,
India, and Maestro from Schrödinger are contributing a great deal. The Indian Agricultural Statistics Research
Institute (IASRI) is also making a large contribution to bioinformatics research by creating
many databases in agriculture and biology. The most recent uses of bioinformatics are in
novel drug molecule discovery and ligand analysis for protein targets in human physiology, with the aim of
finding cures for lethal diseases in a short period. Plenty of docking software is
available that is efficient and has demonstrated its accuracy.
Review and conclusion
With the inclusion of a large number of tools and the adoption of bioinformatics in many areas of biological
research, its importance is now evident. Nowadays almost every
experiment in biological research is associated with bioinformatics. It has made research
simpler and faster, although validation of the accuracy of various techniques is still in progress. We can
now obtain in minutes results that were not possible using wet-lab techniques in
biotechnology.
Guide to NCBI Databases, Tools and Services
The National Center for Biotechnology Information (NCBI), a division of the National
Library of Medicine (NLM), has created a large number of databases that are freely
available to researchers. These databases represent a vast store of information about
genetics, genomics, proteomics, and medicine. All of these databases can be reached
from the Entrez search page. This page also allows cross searching of all the NCBI
databases, a feature called Entrez Global Query.
There are two main functions of biological databases:
1. Make biological data available to scientists.
As much as possible of a particular type of information should be available in one single place
(book, site, database). Published data may be difficult to find or access, and collecting it from
the literature is very time-consuming. And not all data is actually published in an article.
2. Make biological data available in computer-readable form.
Since analysis of biological data almost always involves computers, having the data in computer-
readable form (rather than printed on paper) is a necessary first step.
One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and
Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein
sequences determined at the time, and new editions of the book were published well into the 1970s. Its
data became the foundation for the PIR database.
The computer became the storage medium of choice as soon as it was accessible to ordinary scientists.
Databases were distributed on tape, and later on various kinds of disks. Once universities and
academic institutes were connected to the Internet or its precursors (national computer networks), the
network naturally became the distribution medium of choice. For the same reason, the World
Wide Web (WWW, based on the Internet protocol HTTP) has been, since the beginning of the 1990s, the
standard method of communication and access for nearly all biological databases.
As biology has increasingly turned into a data-rich science, the need for storing and communicating
large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the
protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular
NMR. A new field of science dealing with the issues, challenges and new possibilities created by these
databases has emerged: bioinformatics. Other types of data that are or will soon be available in
databases are metabolic pathways, gene expression data (microarrays) and other types of data relating
to biological function and processes.
One very important issue is the frequency and type of errors that the entries in a database have.
Naturally, this depends strongly on the type of data, and whether the database is curated (added,
deleted, or modified by a defined group of people) or not. For the sequence databases, the errors may
be either in the sequence itself (misprint, wrong on entry, genuine experimental error...) or in the
annotation (mistaken features, errors in references,...). In the 3D structure database (PDB), structures
have been deposited which were later discovered to contain severe errors. The error handling policy
differs considerably between databases. If one needs to use any particular database heavily, then the
implications of its particular policy need to be considered.
The present document will touch on only the largest and most frequently used databases.
We will begin with an introduction to the Entrez search interface and will then proceed to
the details of some of the individual NCBI databases.
Entrez
Entrez is the unified search interface for NCBI databases. This common interface allows
easy linking between results in different databases.
Entrez Search Tips
• The Boolean operators AND, OR, and NOT may be used and must be in all caps.
• To see exactly how Entrez has interpreted your query, see the “Search details”
box on the right side of the screen.
• Use quotation marks to enclose a phrase.
• Use the asterisk for truncation (e.g., bacteri* will retrieve bacteria, bacterium,
bacteriophage).
• Enter author names in the format Johnson AB with no punctuation. Entrez
will recognize this as an author name and search only that field. When in doubt,
or when the initials are not known, use Johnson[AUTHOR].
• Clicking on “Advanced Search” will display a numbered list of searches for the
current session. Previously run searches may be combined using the syntax #2
AND #3.
• Field specific searching is also available under “Advanced Search.” Alternately,
one can use Entrez search field qualifiers (e.g., rbcL[GENE] to search only the
gene name field).
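The search tips above can also be applied when composing queries in a script. The Python sketch below only builds query strings and a search URL; it sends nothing over the network. The helper names are hypothetical, and the use of NCBI's E-utilities `esearch` endpoint with its `db` and `term` parameters is an assumption that goes beyond what these notes describe.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not covered in these notes).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, name):
    """Restrict a term to one search field, e.g. rbcL[GENE]."""
    return f"{term}[{name}]"


def phrase(text):
    """Enclose a multi-word phrase in quotation marks."""
    return f'"{text}"'


def all_of(*terms):
    """Combine terms with the Boolean operator AND (must be upper case)."""
    return " AND ".join(terms)


def any_of(*terms):
    """Combine terms with OR, parenthesised so the group stays together."""
    return "(" + " OR ".join(terms) + ")"


def esearch_url(db, query):
    """Build a URL-encoded search URL for the given database and query."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


query = all_of(field("rbcL", "GENE"), any_of("bacteri*", phrase("heat shock")))
print(query)
print(esearch_url("nucleotide", query))
```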
BLAST
Like Entrez, BLAST (Basic Local Alignment Search Tool) is not a database itself, but a
means of accessing the data in NCBI databases, particularly in the nucleotide and protein
databases. BLAST allows researchers to directly search nucleotide or protein sequence
data. For instance, a researcher can submit a sequence through BLAST to see if there are
similar sequences already in the NCBI databases. In addition, BLAST can be used to
align two sequences using a tool called bl2seq.
PubMed
PubMed is the major bibliographic database from NCBI. It searches MEDLINE, a
database from the NLM that covers “medicine, nursing, dentistry, veterinary medicine,
the health care system, and the preclinical sciences, such as molecular biology.” PubMed
also allows access to articles that are out of scope for MEDLINE, but which appear in
journals indexed by MEDLINE. All material included in PubMed Central (NLM’s
online journal archive) is indexed, as well as a few additional databases from NLM.
PubMed employs a system called Automatic Term Mapping to match search terms to the
Medical Subject Headings (MeSH) vocabulary. To see how terms have been mapped,
see the “Search details” box on the right side of the screen. This is an invaluable tool for
troubleshooting a search.
Clicking “Advanced Search” on the PubMed search page provides many options for
customizing a search. Limits include human vs. animal subjects, male vs. female
subjects, age of subjects, article type, and journal type. Advanced search will also
display your search history.
The PubMed help file provides guidance on structuring searches and managing search
results.
Online Mendelian Inheritance in Man (OMIM)
OMIM is a database from Johns Hopkins University for human genetics containing short
articles with references on genetic disorders. It is an excellent starting point for any
question involving human genetics as it links out to bibliographic records in PubMed and
to sequence records.
Nucleotide Databases
The nucleotide sequence data in NCBI is a composite of the data from GenBank, the
European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan
(DDBJ).
NCBI’s nucleotide data is divided into three sub-databases:
1. GenBank Expressed Sequence Tags (EST) – these are generally short
sequences derived from mRNA isolated from a particular tissue at a particular
stage of development.
2. GenBank Genome Survey Sequence (GSS) – these are sequences derived from
whole-genome sequencing projects.
3. CoreNucleotide – all nucleotide sequences that are not ESTs or GSSs.
Confusingly, the links on the Entrez search page to EST, GSS, and CoreNucleotide all go
to the same Entrez Nucleotide search interface, so when a search is performed in any one
of the sub-databases, results are returned from all three. Links are provided at the top of
the results page so that results from a particular sub-database may be isolated.
Among the nucleotide sequences, there are some that are uncurated, meaning that they
are in the database just as they were submitted by researchers. Other records are referred
to as reference sequences (or RefSeq) and are curated by NCBI. RefSeq records are
identified by accession numbers beginning with two letters and an underscore (e.g., NM_,
XP_).
For more information about the nucleotide databases, see Chapter 1 of the NCBI
Handbook. For more information about the RefSeq project, see Chapter 18.
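The RefSeq naming rule above (two letters and an underscore) is easy to test for in code. The following Python sketch is illustrative only: the function name is hypothetical, and the prefix table lists just a few common prefixes rather than the full set.

```python
import re

# Two letters, an underscore, digits, and an optional ".version" suffix.
REFSEQ_RE = re.compile(r"^([A-Z]{2})_\d+(\.\d+)?$")

# A few common RefSeq prefixes; this list is deliberately incomplete.
REFSEQ_PREFIXES = {
    "NM": "mRNA",
    "NP": "protein",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "NC": "complete genomic molecule",
}


def refseq_kind(accession):
    """Describe a RefSeq accession, or return None if it is not RefSeq-style."""
    match = REFSEQ_RE.match(accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1), "other RefSeq record")


for acc in ["NM_000546.6", "XP_123456", "U49845"]:
    print(acc, refseq_kind(acc))
```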
Protein Database
The protein database contains data from GenBank, EMBL, and DDBJ as well as
sequences submitted to various other sources including SWISS-PROT. As with the
nucleotide database, RefSeq records are identified by accession numbers beginning with
two letters and an underscore.
Genome Database
The genome database provides views of entire genomes and chromosomes. Results are
displayed via NCBI’s Map Viewer, from which the user can zoom in on a region of
interest. The Map Viewer is highly customizable, allowing users to control what types of
maps are displayed and the level of resolution. Links to other NCBI databases are
provided. For help using the Map Viewer, see Chapter 20 of the NCBI Handbook.
Structure Database
The structure database contains three-dimensional images of proteins from the protein
database. It is searchable by keyword or by protein or nucleotide sequence. Protein
images can be manipulated using the free Cn3D tool. Help is available from the
database help screen.
Gene Database
The gene database allows the user to search for individual genes from among the
genomes represented in RefSeq, providing useful summary statements about the gene and
links to other NCBI databases. As in the Genome database, results may be examined
using the sequence viewer.
Taxonomy
The taxonomy database contains the names of all organisms that are represented by
nucleotide or protein sequences in the NCBI databases. Records contain links to higher
taxa, nomenclatural synonyms, and links to the various databases in which records for a
given organism reside.
DNA and Protein sequencing and analysis.
Introduction: Before the 1970s there was no direct method to determine a nucleotide sequence. In the
mid-1970s, two methods were developed for the direct sequencing of DNA: the Sanger-Coulson
chain-termination method and the Maxam-Gilbert chemical cleavage method. Sanger and Gilbert
each received a share of the 1980 Nobel Prize in Chemistry for this work.
DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule.
It includes any method or technology that is used to determine the order of the four bases—adenine,
guanine, cytosine, and thymine—in a strand of DNA. The advent of rapid DNA sequencing methods has
greatly accelerated biological and medical research and discovery.[1]
Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous
applied fields such as medical diagnosis, biotechnology, forensic biology, virology and biological
systematics. The rapid speed of sequencing attained with modern DNA sequencing technology has
been instrumental in the sequencing of complete DNA sequences, or genomes of numerous types and
species of life, including the human genome and other complete DNA sequences of many animal, plant,
and microbial species.
DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions
(i.e. clusters of genes or operons), full chromosomes or entire genomes, of any organism. DNA
sequencing is also the most efficient way to sequence RNA or proteins (via their open reading frames).
In fact, DNA sequencing has become a key technology in many areas of biology and other sciences such
as medicine, forensics, or anthropology.
An example of the results of automated chain-termination DNA sequencing.
The first DNA sequences were obtained in the early 1970s by academic researchers using laborious
methods based on two-dimensional chromatography. Following the development of fluorescence-based
sequencing methods with a DNA sequencer, DNA sequencing has become easier and orders of
magnitude faster.
Methods of DNA sequencing.
Maxam-Gilbert sequencing
Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977 based on chemical
modification of DNA and subsequent cleavage at specific bases. Also known as chemical sequencing,
this method allowed purified samples of double-stranded DNA to be used without further cloning. This
method's use of radioactive labeling and its technical complexity discouraged extensive use after
refinements in the Sanger methods had been made.
Maxam-Gilbert sequencing requires radioactive labeling at one 5' end of the DNA and purification of the
DNA fragment to be sequenced. Chemical treatment then generates breaks at a small proportion of
one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). The concentration of
the modifying chemicals is controlled to introduce on average one modification per DNA molecule.
Thus a series of labeled fragments is generated, from the radiolabeled end to the first "cut" site in each
molecule. The fragments in the four reactions are electrophoresed side by side in denaturing
acrylamide gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for
autoradiography, yielding a series of dark bands each corresponding to a radiolabeled DNA fragment,
from which the sequence may be inferred.
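The logic of reading a Maxam-Gilbert gel can be sketched in Python. This is a simplified model: each band is labelled by the position of the cleaved base (the real labelled fragment is one base shorter), every position is assumed to give a clean band, and the function names are illustrative. A band present in both the G and A+G lanes is read as G, a band in A+G only as A, and likewise for C and T.

```python
# Bases cleaved in each of the four reactions named in the text.
REACTIONS = {"G": "G", "A+G": "AG", "C": "C", "C+T": "CT"}


def run_lanes(sequence):
    """Simulate the gel: for each lane, the positions (1-based) that give a band."""
    lanes = {}
    for lane, bases in REACTIONS.items():
        lanes[lane] = {
            pos for pos, base in enumerate(sequence, start=1) if base in bases
        }
    return lanes


def read_gel(lanes):
    """Infer the sequence by reading bands from shortest to longest fragment."""
    positions = sorted(set().union(*lanes.values()))
    calls = []
    for pos in positions:
        if pos in lanes["G"]:
            calls.append("G")  # band in both G and A+G lanes
        elif pos in lanes["A+G"]:
            calls.append("A")  # band in A+G lane only
        elif pos in lanes["C"]:
            calls.append("C")  # band in both C and C+T lanes
        else:
            calls.append("T")  # band in C+T lane only
    return "".join(calls)


gel = run_lanes("GATTACA")
print(gel)
print(read_gel(gel))
```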
Chain-termination methods
The chain-termination method developed by Frederick Sanger and coworkers in 1977 soon became the
method of choice, owing to its relative ease and reliability. When invented, the chain-terminator method
used fewer toxic chemicals and lower amounts of radioactivity than the Maxam and Gilbert method.
Because of its comparative ease, the Sanger method was soon automated and was the method used in
the first generation of DNA sequencers.
Sanger sequencing is the method which prevailed from the 1980s until the mid-2000s. Over that period,
great advances were made in the technique, such as fluorescent labelling, capillary electrophoresis, and
general automation. These developments allowed much more efficient sequencing, leading to lower
costs. The Sanger method, in mass production form, is the technology which produced the first human
genome in 2001, ushering in the age of genomics. However, later in the decade, radically different
approaches reached the market, bringing the cost per genome down from $100 million in 2001 to
$10,000 in 2011.
Some tasks in DNA sequence analysis
Large quantities of sequence data are being published, for organisms from bacteria to higher
mammals. It will take decades to analyse the data.
Here are a few of the tasks involved in the analysis:
• Sequence assembly: Sequencing involves a “shotgun” approach where DNA fragments about 1000 bp
long are sequenced with significant overlapping coverage. These then have to be “assembled”.
• Annotation: The assembled genomes have to be “annotated”: genes identified and marked
out, their functions identified, and so on.
• Motif finding: Non-coding DNA contains “regulatory” regions where proteins called “transcription
factors” bind to “turn on” genes. Identifying such regions, and binding sites for
individual TFs, is of great importance. TFs typically bind to small “motifs”, so the task is
to find overrepresented short “motifs” in larger quantities of sequence.
• Sequence alignment: In the last two tasks, it is very useful to compare genomes of previously
sequenced species. “Comparative genomics” is becoming a very important subfield. Detection
and alignment of homologous sequence is an important task here.
• Phylogenetic trees: Given sequence data from different species, it is useful to reconstruct their
phylogenetic relationship.
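The sequence assembly task listed above can be illustrated with a toy greedy assembler in Python. It repeatedly merges the two reads with the longest suffix-prefix overlap. This is a deliberately naive sketch: real assemblers must handle sequencing errors, repeats, reverse-complement reads and far larger data, and the `min_len` threshold and example reads are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by best overlap; returns the remaining contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


print(greedy_assemble(["ATGGCC", "GCCTGA", "TGATTT"]))
```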
Algorithms exist for all these tasks, but all are evolving with increasing understanding of the
function of non-coding DNA, increasing mathematical and algorithmic sophistication in the methods,
and increasing raw computational power available to tackle these tasks.
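For the motif-finding task, the simplest possible approach is to count every short word (k-mer) in a set of sequences and report the most frequent ones. The Python sketch below does this and compares each count with what a uniform random background would give. The uniform background is a strong simplifying assumption, the sequences are invented, and practical motif finders use much more careful statistical models.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every overlapping k-mer across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented(sequences, k, top=3):
    """Return (k-mer, count, enrichment) for the most frequent k-mers.

    Enrichment is the observed count divided by the count expected if all
    4**k k-mers were equally likely (a crude background model).
    """
    counts = kmer_counts(sequences, k)
    expected = sum(counts.values()) / 4 ** k
    return [(kmer, n, n / expected) for kmer, n in counts.most_common(top)]


upstream_regions = ["TTGACAAA", "CCTTGACA", "TTGACAGG"]
for kmer, n, enrichment in overrepresented(upstream_regions, k=6, top=1):
    print(kmer, n, round(enrichment, 1))
```

Because each sequence is processed independently, the counting step is one of the tasks that could be split across processors and the partial counts added together afterwards.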
There is not much explicit mention of parallel programming in what follows. But most problems
are intrinsically parallelisable, and many tasks require several independent runs that can be done
trivially in parallel.
An overview of DNA Sequencing
Protein sequencing and analysis
Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or
peptide. This may serve to identify the protein or characterize its post-translational
modifications. Typically, partial sequencing of a protein provides sufficient information (one or
more sequence tags) to identify it with reference to databases of protein sequences derived
from the conceptual translation of genes.
The two major direct methods of protein sequencing are mass spectrometry and Edman degradation
using a protein sequenator (sequencer). Mass spectrometry methods are now the most widely used for
protein sequencing and identification but Edman degradation remains a valuable tool for characterizing
a protein's N-terminus.
Determining amino acid composition
It is often desirable to know the unordered amino acid composition of a protein prior to attempting to
find the ordered sequence, as this knowledge can be used to facilitate the discovery of errors in the
sequencing process or to distinguish between ambiguous results. Knowledge of the frequency of
certain amino acids may also be used to choose which protease to use for digestion of the protein. The
misincorporation of low levels of non-standard amino acids (e.g. norleucine) into proteins may also be
determined.[1] A generalized method often referred to as amino acid analysis[2] for determining amino
acid frequency is as follows:
1. Hydrolyse a known quantity of protein into its constituent amino acids.
2. Separate and quantify the amino acids in some way.
Hydrolysis
Hydrolysis is done by heating a sample of the protein in 6 M hydrochloric acid to 100–110 °C for 24 hours
or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However,
these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan,
glutamine, and cysteine) are degraded. To circumvent this problem, Biochemistry Online suggests
heating separate samples for different times, analysing each resulting solution, and extrapolating back
to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation, such as
thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre-oxidising
cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of
amide hydrolysis.
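The extrapolation back to zero hydrolysis time mentioned above can be sketched as a straight-line fit in Python. The assumption that losses are linear in time is made here only for illustration (a different decay model may fit real data better), and the measurements are invented numbers, not experimental results.

```python
def extrapolate_to_zero(times, amounts):
    """Fit amount = intercept + slope * time by least squares; return the intercept.

    The intercept estimates the amount of amino acid that would have been
    measured with no degradation during hydrolysis.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(amounts) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, amounts))
    slope = sxy / sxx
    return mean_y - slope * mean_t


# Hypothetical serine recoveries (nmol) after 24, 48 and 72 hours of hydrolysis.
hours = [24, 48, 72]
serine_nmol = [9.0, 8.0, 7.0]
print(extrapolate_to_zero(hours, serine_nmol))
```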
Separation and quantitation
The amino acids can be separated by ion-exchange chromatography then derivatized to facilitate their
detection. More commonly, the amino acids are derivatized then resolved by reversed phase HPLC.
An example of the ion-exchange chromatography is given by the NTRC using sulfonated polystyrene as
a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through
the column. Amino acids are eluted when the pH reaches their respective isoelectric points. Once the
amino acids have been separated, their respective quantities are determined by adding a reagent that
will form a coloured derivative. If the amounts of amino acids are in excess of 10 nmol, ninhydrin can be
used for this; it gives a yellow colour when reacted with proline, and a vivid purple with other amino
acids. The concentration of amino acid is proportional to the absorbance of the resulting solution. With
very small quantities, down to 10 pmol, fluorescent derivatives can be formed using reagents such as
ortho-phthalaldehyde (OPA) or fluorescamine.
Pre-column derivatization may use the Edman reagent to produce a derivative that is detected by UV
light. Greater sensitivity is achieved using a reagent that generates a fluorescent derivative. The
derivatized amino acids are subjected to reversed phase chromatography, typically using a C8 or C18
silica column and an optimised elution gradient. The eluting amino acids are detected using a UV or
fluorescence detector and the peak areas compared with those for derivatised standards in order to
quantify each amino acid in the sample.
N-terminal amino acid analysis
Figure: Sanger's method of peptide end-group analysis. (A) Derivatization of the N-terminal end with
Sanger's reagent (DNFB); (B) total acid hydrolysis of the dinitrophenyl peptide.
Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid
the ordering of individual peptide fragments' sequences into a whole chain, and because the first round
of Edman degradation is often contaminated by impurities and therefore does not give an accurate
determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis
follows:
1. React the peptide with a reagent that will selectively label the terminal amino acid.
2. Hydrolyse the protein.
3. Determine the amino acid by chromatography and comparison with standards.
There are many different reagents which can be used to label terminal amino acids. They all react with
amine groups and will therefore also bind to amine groups in the side chains of amino acids such as
lysine - for this reason it is necessary to be careful in interpreting chromatograms to ensure that the
right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-
dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for
the Edman degradation, can also be used. The same questions apply here as in the determination of
amino acid composition, with the exception that no stain is needed, as the reagents produce coloured
derivatives and only qualitative analysis is required. So the amino acid does not have to be eluted from
the chromatography column, just compared with a standard. Another consideration to take into
account is that, since any amine groups will have reacted with the labelling reagent, ion exchange
chromatography cannot be used, and thin layer chromatography or high-pressure liquid
chromatography should be used instead.
C-terminal amino acid analysis
The number of methods available for C-terminal amino acid analysis is much smaller than the number of
available methods of N-terminal analysis. The most common method is to add carboxypeptidases to a
solution of the protein, take samples at regular intervals, and determine the terminal amino acid by
analysing a plot of amino acid concentrations against time. This method is very useful in the case
of polypeptides and proteins with blocked N-termini. C-terminal sequencing would greatly help in verifying
the primary structures of proteins predicted from DNA sequences and in detecting any post-translational
processing of gene products from known codon sequences.
Edman degradation
The Edman degradation is a very important reaction for protein sequencing, because it allows the
ordered amino acid composition of a protein to be discovered. Automated Edman sequencers are now
in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A
reaction scheme for sequencing a protein by the Edman degradation follows; some of the steps are
elaborated on subsequently.
1. Break any disulfide bridges in the protein with a reducing agent like 2-mercaptoethanol. A
protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.
2. Separate and purify the individual chains of the protein complex, if there is more than one.
3. Determine the amino acid composition of each chain.
4. Determine the terminal amino acids of each chain.
5. Break each chain into fragments under 50 amino acids long.
6. Separate and purify the fragments.
7. Determine the sequence of each fragment.
8. Repeat with a different pattern of cleavage.
9. Construct the sequence of the overall protein.
Digestion into peptide fragments
Peptides longer than about 50-70 amino acids cannot be sequenced reliably by the Edman
degradation. Because of this, long protein chains need to be broken up into small fragments that can
then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or
by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns,
and the overlap between fragments can be used to construct an overall sequence.
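The use of overlapping fragments can be illustrated with a small sketch. The peptide and the two sets of fragments below are invented for illustration, and the greedy suffix/prefix merge (function names `overlap` and `assemble` are hypothetical) is only one simple way to exploit the overlaps; it is not a description of how sequencing data are assembled in practice.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments):
    """Greedily merge fragments on their largest suffix/prefix overlap."""
    frags = sorted(set(fragments))
    # Drop fragments that are wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        k, a, b = max(
            ((overlap(a, b), a, b) for a in frags for b in frags if a != b),
            key=lambda t: t[0],
        )
        if k == 0:  # nothing overlaps any more: leave separate pieces
            break
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])
    return frags


# Hypothetical fragments from two different cleavage patterns of one chain.
digest_1 = ["MK", "TAYIAK", "QR", "QISFVK"]
digest_2 = ["MKTAY", "IAKQRQ", "ISFVK"]
print(assemble(digest_1 + digest_2))
```

With real data, a one-residue overlap would be far too weak to justify a join; a minimum overlap length would normally be required.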
Reaction
The peptide to be sequenced is adsorbed onto a solid surface. One common substrate is glass fibre
coated with polybrene, a cationic polymer. The Edman reagent, phenylisothiocyanate (PITC), is added
to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine. This reacts
with the amine group of the N-terminal amino acid.
The terminal amino acid can then be selectively detached by the addition of anhydrous acid. The
derivative then isomerises to give a substituted phenylthiohydantoin, which can be washed off and
identified by chromatography, and the cycle can be repeated. The efficiency of each step is about 98%,
which allows about 50 amino acids to be reliably determined.
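The link between a 98% step efficiency and a read length of about 50 residues can be made concrete with a short calculation. The sketch below assumes a deliberately simplified model in which the usable signal after n cycles is the step efficiency raised to the power n; the function name `edman_yield` is hypothetical.

```python
def edman_yield(step_efficiency: float, cycles: int) -> float:
    """Fraction of chains still in phase after `cycles` rounds,
    assuming every cycle has the same independent efficiency."""
    return step_efficiency ** cycles


for n in (10, 30, 50, 70):
    print(n, round(edman_yield(0.98, n), 3))
```

Under this model only roughly a third of the starting material still gives the correct signal at cycle 50, which is consistent with the practical limit quoted above.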
A Beckman-Coulter Porton LF3000G protein sequencing machine
Protein sequenator
A protein sequenator is a machine that performs Edman degradation in an automated manner. A
sample of the protein or peptide is immobilized in the reaction vessel of the protein sequenator and the
Edman degradation is performed. Each cycle releases and derivatises one amino acid from the protein
or peptide's N-terminus and the released amino-acid derivative is then identified by HPLC. The
sequencing process is done repetitively for the whole polypeptide until the entire measurable sequence
is established or for a pre-determined number of cycles.
Identification by mass spectrometry
Protein identification is the process of assigning a name to a protein of interest (POI), based on its
amino-acid sequence. Typically, only part of the protein’s sequence needs to be determined
experimentally in order to identify the protein with reference to databases of protein sequences
deduced from the DNA sequences of their genes. Further protein characterization may include
confirmation of the actual N- and C-termini of the POI, determination of sequence variants and
identification of any post-translational modifications present.
Proteolytic digests
A general scheme for protein identification is described below.
• The POI is isolated, typically by SDS-PAGE or chromatography.
• The isolated POI may be chemically modified to stabilise cysteine residues (e.g. S-
amidomethylation or S-carboxymethylation).
• The POI is digested with a specific protease to generate peptides. Trypsin, which cleaves
selectively on the C-terminal side of lysine or arginine residues, is the most commonly used
protease. Its advantages include i) the frequency of Lys and Arg residues in proteins, ii) the high
specificity of the enzyme, iii) the stability of the enzyme and iv) the suitability of tryptic peptides
for mass spectrometry.
• The peptides may be desalted to remove ionizable contaminants and subjected to MALDI-TOF
mass spectrometry. Direct measurement of the masses of the peptides may provide sufficient
information to identify the protein (see Peptide mass fingerprinting) but further fragmentation
of the peptides inside the mass spectrometer is often used to gain information about the
peptides’ sequences. Alternatively, peptides may be desalted and separated by reversed phase
HPLC and introduced into a mass spectrometer via an ESI source. LC-ESI-MS may provide more
information than MALDI-MS for protein identification but uses more instrument time.
• Depending on the type of mass spectrometer, fragmentation of peptide ions may occur via a
variety of mechanisms such as collision-induced dissociation (CID) or post-source decay (PSD).
In each case, the pattern of fragment ions of a peptide provides information about its sequence.
• Information including the measured mass of the putative peptide ions and those of their
fragment ions is then matched against calculated mass values from the conceptual (in silico)
proteolysis and fragmentation of databases of protein sequences. A match is considered
successful if its score exceeds a threshold based on the analysis parameters. Even if the actual
protein is not represented in the database, error-tolerant matching allows for the putative
identification of a protein based on similarity to homologous proteins. A variety of software
packages are available to perform this analysis.
• Software packages usually generate a report showing the identity (accession code) of each
identified protein, its matching score, and a measure of the relative strength of the
matching where multiple proteins are identified.
• A diagram of the matched peptides on the sequence of the identified protein is often used to
show the sequence coverage (% of the protein detected as peptides). Where the POI is thought
to be significantly smaller than the matched protein, the diagram may suggest whether the POI
is an N- or C-terminal fragment of the identified protein.
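The conceptual (in silico) proteolysis mentioned in the scheme above can be sketched for trypsin as follows. The function name `tryptic_digest` and the example sequence are hypothetical. The optional rule that trypsin does not cleave before proline is a commonly used convention that is not stated in the text above, so it is exposed as a switch and should be treated as an assumption.

```python
def tryptic_digest(seq, proline_rule=True):
    """Split `seq` on the C-terminal side of every Lys (K) or Arg (R).

    proline_rule=True skips sites followed by Pro (P); this is an added
    assumption, not part of the scheme described in the text.
    """
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR":
            following = seq[i + 1] if i + 1 < len(seq) else ""
            if proline_rule and following == "P":
                continue
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):  # C-terminal peptide without a trailing K/R
        peptides.append(seq[start:])
    return peptides


print(tryptic_digest("AKRPGKLR"))
print(tryptic_digest("AKRPGKLR", proline_rule=False))
```

Search software performs this kind of digestion on every database entry and then compares the calculated peptide masses with the measured ones.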
De novo sequencing
The pattern of fragmentation of a peptide allows for direct determination of its sequence by de novo
sequencing. This sequence may be used to match databases of protein sequences or to investigate
post-translational or chemical modifications. It may provide additional evidence for protein
identifications performed as above.
N- and C-termini
The peptides matched during protein identification do not necessarily include the N- or C-termini
predicted for the matched protein. This may result from the N- or C-terminal peptides being difficult to
identify by MS (e.g. being either too short or too long), being post-translationally modified (e.g. N-
terminal acetylation) or genuinely differing from the prediction. Post-translational modifications or
truncated termini may be identified by closer examination of the data (i.e. de novo sequencing). A
repeat digest using a protease of different specificity may also be useful.
Post-translational modifications
Whilst detailed comparison of the MS data with predictions based on the known protein sequence may
be used to define post-translational modifications, targeted approaches to data acquisition may also be
used. For instance, specific enrichment of phosphopeptides may assist in identifying phosphorylation
sites in a protein. Alternative methods of peptide fragmentation in the mass spectrometer, such as ETD
or ECD, may give complementary sequence information.
Whole-mass determination
The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water
molecule, adjusted for any post-translational modifications. Although proteins ionize less well than
the peptides derived from them, a protein in solution can often be subjected to ESI-MS and its
mass measured to an accuracy of 1 part in 20,000 or better. This is often sufficient to confirm the
termini (i.e. that the protein’s measured mass matches that predicted from its sequence) and to infer the
presence or absence of many post-translational modifications.
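The mass calculation described above can be sketched in a few lines. The residue masses are standard monoisotopic values rounded to five decimals and are supplied here for illustration, not taken from these notes; the function name `peptide_mass` and the `modifications` parameter are hypothetical.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq, modifications=0.0):
    """Sum of residue masses plus one water, plus any modification mass."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER + modifications


print(round(peptide_mass("GAS"), 5))
# Leucine and isoleucine are isomers, so swapping them leaves the mass unchanged.
print(peptide_mass("PEPTLDE") == peptide_mass("PEPTIDE"))
# One phosphorylation adds the mass of HPO3 (about 79.966 Da).
print(round(peptide_mass("GAS", modifications=79.96633), 5))
```

The identical masses for leucine and isoleucine are the reason mass-based methods alone do not distinguish these two residues.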
Limitations
Proteolysis does not always yield a set of readily analyzable peptides covering the entire sequence of
the POI. The fragmentation of peptides in the mass spectrometer often does not yield ions
corresponding to cleavage at each peptide bond. Thus, the deduced sequence for each peptide is not
necessarily complete. The standard methods of fragmentation do not distinguish between leucine and
isoleucine residues since they are isomeric.
Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-
terminus has been chemically modified (e.g. by acetylation or formation of pyroglutamic acid). Edman
degradation is generally not useful to determine the positions of disulfide bridges. It also requires
peptide amounts of 1 picomole or above for discernible results, making it less sensitive than mass
spectrometry.
Introduction to sequence alignment
Sequence similarity search and sequence alignment
We can do a similarity search to learn if our sequenced DNA can be found in a public nucleotide
database (i.e. it has already been cloned by others) and/or whether it is evolutionarily related (i.e.
homologous) to other sequences. In a simple similarity search, one can compare a sequence with
sequences found in an entire nucleotide database (see later the BLAST program), while for a homology
search the method of choice is multiple sequence alignment by the ClustalW program. By comparing
either nucleotide or amino acid sequences we can find homologs. If these are from different species
(that had a common ancestor) but have identical or similar functions they are called orthologs; while
those homologs that are found in the same organism and originate from a gene duplication event
followed by divergent evolution within the species are called paralogs. We will not cover the
construction of evolutionary trees in this e-book—one can learn about these in bioinformatics or
evolutionary biology courses.
The BLAST program
If we sequence a DNA clone, the first bioinformatics analysis is a similarity search against a nucleotide
database. The most widely used similarity search program accessible on the internet is BLAST (Basic
Local Alignment Search Tool), which will be described here and will be used by the students during the
laboratory practice. The BLAST program is available online at several servers including the one at NCBI:
http://blast.ncbi.nlm.nih.gov/Blast.cgi.
BLAST uses a heuristic algorithm that makes it possible to search a huge database in a very short period
of time by using a query sequence. The high speed of the algorithm stems from the fact that the query
sequence is divided into short "words" that are used, instead of the full-length sequence, during the
alignment process. These words are searched for in the database first (a step called "seeding", which
locates short local matches). The most relevant hits are then scored with the help of a scoring matrix,
extended to neighbouring words, and finally assembled and compiled into a final list of similarity hits.
The query sequence must be in the so-called FASTA format (FASTA was a previously
popular but much slower similarity search program). The FASTA format is shown in Figure 11.10.
Figure 11.10. The FASTA sequence format
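As a rough illustration of the format: a FASTA record consists of a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such text into a dictionary; the record names and sequences are invented, and `parse_fasta` is a hypothetical helper rather than part of BLAST or FASTA.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Sequence lines belonging to one record are joined into one string.
    return {h: "".join(parts) for h, parts in records.items()}


example = """>seq1 hypothetical example protein
MKTAYIAKQR
QISFVK
>seq2 hypothetical example DNA
ATGGCCATTG
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```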
If we want to search using a nucleotide query sequence within a nucleotide database, we can use the
BLASTN version of the program. If we have an amino acid sequence, we can search a protein database
by the BLASTP version of the program. The BLASTX version of the program translates a nucleotide
sequence in all six reading frames (three on each strand) and allows searching a protein database.
Finally, with the TBLAST subprogram, we can search against a translated nucleotide database using
either a protein (TBLASTN) or a nucleotide (TBLASTX) query sequence. These similarity search options
are summarised in Figure 11.11.
Figure 11.11. Search possibilities in the BLAST program
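The six-frame translation used by the translated search modes can be sketched as follows. This is an illustration only: it uses the standard genetic code, translates stop codons as "*", and marks any codon containing a non-ACGT character as "X"; the function names are hypothetical and do not correspond to anything in the BLAST software.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three frames on the given strand (+1..+3) and three on its
    reverse complement (-1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCATTGTAATGGGCCGC").items():
    print(name, protein)
```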
The result of a BLAST analysis is a list of sequences from the searched database that show significant
similarity to the query sequence. Besides the sequence identifiers of the similar sequence hits in the
database, the final list of alignments contains a score number and a statistical significance number, the
E-value. The E-value is a parameter that describes the number of hits one can expect to see by chance
when searching a database of a particular size. It decreases exponentially as the score (S) of the match
increases. Essentially, the E-value describes the random background noise. The lower the E-value, or the
closer it is to zero, the more "significant" the match (E < 0.01 is usually considered to reflect a
homologous, i.e. evolutionarily related, sequence). The score value is calculated based on the
alignment, taking into account the gaps and the similarity of the amino acids at the aligned positions.
The most frequently used similarity matrix (an amino acid substitution matrix) is the BLOSUM (BLOcks
SUbstitution Matrix) matrix. The numbers within a BLOSUM matrix are "log-odds" scores: the logarithm
of the ratio between the likelihood that two amino acids are aligned for a biological reason
and the likelihood that the same amino acids are aligned by chance.
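Both quantities in this paragraph can be written down in a few lines. The sketch below assumes the commonly quoted Karlin-Altschul form, in which the expected number of chance hits is m × n × 2^(−S′) for a bit score S′, query length m and database length n, and it ignores refinements such as effective lengths. The log-odds function uses half-bit scaling, as is usually quoted for BLOSUM62. All numerical inputs are made up, and the function names are hypothetical.

```python
import math


def evalue_from_bit_score(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


def log_odds_half_bits(p_pair, q_a, q_b):
    """Substitution score: 2 * log2(observed / expected by chance), rounded.

    p_pair   : observed frequency of the aligned pair (hypothetical input)
    q_a, q_b : background frequencies of the two amino acids
    """
    return round(2 * math.log2(p_pair / (q_a * q_b)))


# Every additional bit of score halves the E-value.
for bits in (30, 40, 50):
    print(bits, evalue_from_bit_score(bits, query_len=300, db_len=1_000_000_000))

# A pair seen twice as often as chance predicts scores +2; half as often, -2.
print(log_odds_half_bits(0.02, 0.1, 0.1), log_odds_half_bits(0.005, 0.1, 0.1))
```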
The similarity hits can be found and downloaded from the database using their accession number
(identifier). BLAST hits are usually hyperlinked directly to the corresponding entries in the GenBank
database where we can learn much more about the related sequences, the gene, cDNA and/or the
coded protein. As we have already mentioned, the most comprehensive information on a given protein
can be found in the UniProt database. In Figure 11.12, a detail of a BLAST run is shown in which the
BLASTP program was used to search the UniProt database using a human skeletal actin query
sequence.
Figure 11.12. Result of a sequence similarity search by the BLAST program (human skeletal muscle actin was
used as a query sequence against the UniProt database)
It is important to note that, since 3-D structure is more conserved than primary structure, it is easier to
recognise two related proteins by comparing their three-dimensional structure than their amino acid
sequence. Obviously, it is more convenient to compare primary sequences, since they are available for
many more proteins than atomic-resolution structures are. Similarity searches and protein structure
comparisons are dealt with in more detail in bioinformatics (or structural bioinformatics) courses.
Figure: The wide range of in silico analysis possibilities of protein sequences. (Most of these options are
also available for nucleic acid sequences.)
FASTA
FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J.
Lipman and William R. Pearson in 1985.[1] Its legacy is the FASTA format, which is now ubiquitous in
bioinformatics.
The original FASTP program was designed for protein sequence similarity searching. Because of the
exponentially expanding genetic information and the limited speed and memory of computers in the
1980s, heuristic methods were introduced for aligning a query sequence against entire databases. FASTA
(developed in 1988) added the ability to do DNA:DNA searches and translated protein:DNA searches, and
also provided a more sophisticated shuffling program for evaluating statistical significance.[2] There are
several programs in this package that allow the alignment of protein sequences and DNA sequences.
Nowadays, increased computer performance makes it possible to perform searches for local alignment
detection in a database using the Smith-Waterman algorithm.
FASTA is pronounced "fast A" and stands for "FAST-All", because it works with any alphabet; it is an
extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment.
Uses
The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA
(with frameshifts), and ordered or unordered peptide searches. Recent versions of the FASTA package
include special translated search algorithms that correctly handle frameshift errors (which six-frame-
translated searches do not handle very well) when comparing nucleotide to protein sequence data.
In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an
implementation of the optimal Smith-Waterman algorithm.
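For orientation, the core recurrence of the Smith-Waterman algorithm can be sketched as below. This is a minimal score-only version with a simple match/mismatch scheme and a linear gap penalty chosen for illustration; it is not the SSEARCH implementation, which uses substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences `a` and `b`."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # extend along the diagonal
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap, # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Because every cell of the table is filled, the running time grows with the product of the two sequence lengths, which is why heuristic pre-filtering was introduced for database searches.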
A major focus of the package is the calculation of accurate similarity statistics, so that biologists can
judge whether an alignment is likely to have occurred by chance, or whether it can be used to infer
homology. The FASTA package is available from the University of Virginia[3] and the European
Bioinformatics Institute.[4]
A web interface for running FASTA searches against the online databases of the European Bioinformatics
Institute (EBI) is also available.
The FASTA file format used as input for this software is now widely used by other sequence database
search tools (such as BLAST) and sequence alignment programs (Clustal, T-Coffee, etc.).
FASTA takes a given nucleotide or amino acid sequence and searches a corresponding sequence
database by using local sequence alignment to find matches of similar database sequences.
The FASTA program follows a largely heuristic method which contributes to the high speed of its
execution. It initially observes the pattern of word hits, word-to-word matches of a given length, and
marks potential matches before performing a more time-consuming optimized search using a Smith-
Waterman type of algorithm.
The size taken for a word, given by the parameter kmer, controls the sensitivity and speed of the
program. Increasing the kmer value decreases the number of background hits that are found. From the
word hits that are returned, the program looks for segments that contain a cluster of nearby hits. It
then investigates these segments for a possible match.
There are some differences between fastn and fastp relating to the type of sequences used, but both
use four steps and calculate three scores to describe and format the sequence similarity results. These
are:
1. Identify regions of highest density in each sequence comparison, taking a kmer equal to 1 or 2.
In this step all, or a group of, the identities between two sequences are found using a look-up
table. The kmer value determines how many consecutive identities are required for a match to
be declared. Thus the smaller the kmer value, the more sensitive the search. kmer=2 is frequently
used for protein sequences and kmer=4 or 6 for nucleotide sequences. Short
oligonucleotides are usually run with kmer=1. The program then finds all similar local regions,
represented as diagonals of a certain length in a dot plot, between the two sequences by
counting kmer matches and penalizing for intervening mismatches. In this way, local regions of
highest-density matches in a diagonal are isolated from background hits. For protein sequences,
BLOSUM50 values are used for scoring kmer matches. This ensures that groups of identities
with high similarity scores contribute more to the local diagonal score than identities with low
similarity scores. Nucleotide sequences use the identity matrix for the same purpose. The best
10 local regions selected from all the diagonals are then saved.
2. Rescan the regions taken using the scoring matrices, trimming the ends of each region to include
only those residues contributing to the highest score.
The 10 regions are rescanned, this time using the relevant scoring matrix so that
runs of identities shorter than the kmer value are allowed. Conservative replacements
that contribute to the similarity score are also counted. Although protein sequences use the BLOSUM50
matrix, scoring matrices based on the minimum number of base changes required for a specific
replacement, on identities alone, or on an alternative measure of similarity such as PAM, can
also be used with the program. For each of the diagonal regions rescanned in this way, a subregion
with the maximum score is identified. The initial scores found in step 1 are used to rank the
library sequences. The highest score is referred to as the init1 score.
3. If several initial regions with scores greater than a CUTOFF value are found in an alignment,
check whether the trimmed initial regions can be joined to form an approximate alignment with
gaps. Calculate a similarity score that is the sum of the joined regions, with a penalty of 20
points for each gap. This initial similarity score (initn) is used to rank the library sequences. The
score of the single best initial region found in step 2 is reported (init1).
Here the program calculates an optimal alignment of initial regions as a combination of
compatible regions with maximal score. This optimal alignment of initial regions can be rapidly
calculated using a dynamic programming algorithm. The resulting score, initn, is used to rank the
library sequences. This joining process increases sensitivity but decreases selectivity. A carefully
calculated cut-off value is thus used to control where this step is implemented, a value that is
approximately one standard deviation above the average score expected from unrelated
sequences in the library. For example, a 200-residue query sequence with kmer=2 uses a value of 28.
4. Use a banded Smith-Waterman algorithm to calculate an optimal score for alignment.
This step uses a banded Smith-Waterman algorithm to create an optimised score (opt) for each
alignment of the query sequence to a database (library) sequence. It takes a band of 32 residues
centered on the init1 region of step 2 for calculating the optimal alignment. After all sequences
are searched, the program plots the initial scores of each database sequence in a histogram and
calculates the statistical significance of the "opt" score. For protein sequences, the final
alignment is produced using a full Smith-Waterman alignment. For DNA sequences, a banded
alignment is provided.
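The first of the four steps, the look-up table and the search for dense diagonals, can be sketched as follows. This is a simplified illustration: it only counts exact kmer hits per diagonal and leaves out the mismatch penalties and BLOSUM50 scoring described above. The function names are hypothetical and are not taken from the FASTA package.

```python
from collections import defaultdict


def kmer_table(seq, k):
    """Look-up table: each word of length k -> positions where it starts."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, subject, k=2):
    """Count word hits on each diagonal (query position - subject position)."""
    table = kmer_table(query, k)
    hits = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], ()):
            hits[i - j] += 1
    return dict(hits)


def best_diagonals(query, subject, k=2, top=10):
    """The `top` diagonals with the most word hits, densest first."""
    hits = diagonal_hits(query, subject, k)
    return sorted(hits.items(), key=lambda kv: kv[1], reverse=True)[:top]


# Hypothetical sequences: the subject carries the query shifted by two residues.
print(best_diagonals("ACDEFG", "XXACDEFG", k=2))
```

A run of hits on the same diagonal corresponds to an ungapped stretch of similarity; joining regions on different diagonals, as in step 3, is what introduces gaps.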
Unlike BLAST, FASTA cannot remove low-complexity regions before aligning the sequences. This can be
problematic: when the query sequence contains such regions (e.g. mini- or microsatellites that repeat
the same short sequence many times), unrelated database sequences that match only in these
frequently occurring repeats receive inflated scores.
For this reason the program PRSS is included in the FASTA distribution package. PRSS shuffles the matching
sequences in the database, either at the level of single letters or in short segments whose length the
user can set. The shuffled sequences are then aligned again; if the score is still higher than
expected, this is because the shuffled low-complexity regions still map to the query. From the
scores that the shuffled sequences still attain, PRSS estimates the significance of the
score of the original sequences. The higher the scores of the shuffled sequences, the less significant the
matches found between the original database sequence and the query sequence.[5]
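The shuffling idea can be sketched as a simple permutation test. This is not the PRSS implementation: here the subject is shuffled letter by letter, scored with a small Smith-Waterman routine using illustrative parameters, and the significance is reported as an empirical fraction rather than through a fitted score distribution. All names and sequences are hypothetical.

```python
import random


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score-only Smith-Waterman with a linear gap penalty."""
    prev, best = [0] * (len(b) + 1), 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(query, subject, n_shuffles=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the original."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    observed = local_score(query, subject)
    letters = list(subject)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # keeps composition, destroys order
        if local_score(query, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_shuffles + 1)


query = "MKTAYIAKQRQISFVKSHFSRQ"
print(shuffle_p_value(query, query))
```

Shuffling preserves composition, so a match that depends only on biased composition or simple repeats tends to keep a relatively high score after shuffling and therefore receives a larger, less significant value.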
The FASTA programs find regions of local or global similarity between Protein or DNA sequences, either
by searching Protein or DNA databases, or by identifying local duplications within a sequence. Other
programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be
used to infer functional and evolutionary relationships between sequences as well as help identify
members of gene families.
Protein
• Protein–protein FASTA
• Protein–protein Smith–Waterman (ssearch)
• Global protein–protein (Needleman–Wunsch) (ggsearch)
• Global/local protein–protein (glsearch)
• Protein–protein with unordered peptides (fasts)
• Protein–protein with mixed peptide sequences (fastf)
Nucleotide
• Nucleotide–nucleotide (DNA/RNA fasta)
• Ordered nucleotides vs nucleotide (fastm)
• Unordered nucleotides vs nucleotide (fasts)
Translated
• Translated DNA (with frameshifts, e.g. ESTs) vs proteins (fastx/fasty)
• Protein vs translated DNA (with frameshifts) (tfastx/tfasty)
• Peptides vs translated DNA (tfasts)
Statistical significance
• Protein vs protein shuffle (prss)
• DNA vs DNA shuffle (prss)
• Translated DNA vs protein shuffle (prfx)
Local duplications
• Local protein alignments (lalign)
• Plot protein alignment "dot-plot" (plalign)
• Local DNA alignments (lalign)
• Plot DNA alignment "dot-plot" (plalign)

More Related Content

What's hot

Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to BioinformaticsAsad Afridi
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBIgeetikaJethra
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformaticsnadeem akhter
 
sequence alignment
sequence alignmentsequence alignment
sequence alignmentammar kareem
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matricesAshwini
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformaticsbiinoida
 
Introduction of bioinformatics
Introduction of bioinformaticsIntroduction of bioinformatics
Introduction of bioinformaticsDr NEETHU ASOKAN
 
Bioinformatics
BioinformaticsBioinformatics
BioinformaticsAmna Jalil
 
Role of bioinformatics in drug designing
Role of bioinformatics in drug designingRole of bioinformatics in drug designing
Role of bioinformatics in drug designingW Roseybala Devi
 

What's hot (20)

Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
 
Introduction to NCBI
Introduction to NCBIIntroduction to NCBI
Introduction to NCBI
 
blast bioinformatics
blast bioinformaticsblast bioinformatics
blast bioinformatics
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
Prosite
PrositeProsite
Prosite
 
databases in bioinformatics
databases in bioinformaticsdatabases in bioinformatics
databases in bioinformatics
 
TOOLS AND DATA BASES OF NCBI
TOOLS AND DATA BASES OF NCBITOOLS AND DATA BASES OF NCBI
TOOLS AND DATA BASES OF NCBI
 
SEQUENCE ANALYSIS
SEQUENCE ANALYSISSEQUENCE ANALYSIS
SEQUENCE ANALYSIS
 
Biological database
Biological databaseBiological database
Biological database
 
NCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology InformationNCBI National Center for Biotechnology Information
NCBI National Center for Biotechnology Information
 
sequence alignment
sequence alignmentsequence alignment
sequence alignment
 
Ddbj
DdbjDdbj
Ddbj
 
EMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology LaboratoryEMBL- European Molecular Biology Laboratory
EMBL- European Molecular Biology Laboratory
 
Scoring matrices
Scoring matricesScoring matrices
Scoring matrices
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)European molecular biology laboratory (EMBL)
European molecular biology laboratory (EMBL)
 
Introduction of bioinformatics
Introduction of bioinformaticsIntroduction of bioinformatics
Introduction of bioinformatics
 
Bioinformatics
BioinformaticsBioinformatics
Bioinformatics
 
Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment   Clustal W - Multiple Sequence alignment
Clustal W - Multiple Sequence alignment
 
Role of bioinformatics in drug designing
Role of bioinformatics in drug designingRole of bioinformatics in drug designing
Role of bioinformatics in drug designing
 

LECTURE NOTES ON BIOINFORMATICS

  • 1. Lecture notes in Bioinformatics 2018 SARDAR HUSSAIN
  • 2. Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines computer science, biology, mathematics, and engineering to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. More broadly, bioinformatics is the application of statistics and computing to biological science.

Bioinformatics is both an umbrella term for the body of biological studies that use computer programming as part of their methodology and a reference to specific analysis "pipelines" that are repeatedly used, particularly in the field of genomics. Common uses of bioinformatics include the identification of candidate genes and single nucleotide polymorphisms (SNPs). Often, such identification is made with the aim of better understanding the genetic basis of disease, unique adaptations, desirable properties (especially in agricultural species), or differences between populations. In a less formal way, bioinformatics also tries to understand the organizational principles within nucleic acid and protein sequences, called proteomics.[1]

Introduction

Bioinformatics has become an important part of many areas of biology. In experimental molecular biology, bioinformatics techniques such as image and signal processing allow extraction of useful results from large amounts of raw data. In the field of genetics and genomics, it aids in sequencing and annotating genomes and their observed mutations. It plays a role in the text mining of biological literature and the development of biological and gene ontologies to organize and query biological data. It also plays a role in the analysis of gene and protein expression and regulation.
Bioinformatics tools aid in the comparison of genetic and genomic data and more generally in the understanding of evolutionary aspects of molecular biology. At a more integrative level, bioinformatics helps analyze and catalogue the biological pathways and networks that are an important part of systems biology. In structural biology, it aids in the simulation and modeling of DNA,[2] RNA,[2][3] and proteins,[4] as well as biomolecular interactions.[5][6][7]

History

Historically, the term bioinformatics did not mean what it means today. Paulien Hogeweg and Ben Hesper coined it in 1970 to refer to the study of information processes in biotic systems.[8][9][10] This definition placed bioinformatics as a field parallel to biophysics (the study of physical processes in biological systems) or biochemistry (the study of chemical processes in biological systems).[8]

Sequences

Sequences of genetic material are frequently used in bioinformatics and are easier to manage using computers than manually. Computers became essential in molecular biology when protein sequences became available after Frederick Sanger determined the sequence of insulin in the early 1950s. Comparing multiple sequences manually turned out to be impractical. A pioneer in the field was Margaret Oakley Dayhoff, who has been hailed by David Lipman, director of the National Center for Biotechnology Information, as the "mother and father of bioinformatics." Dayhoff compiled one of the first protein sequence databases, initially published as books,[12] and pioneered methods of sequence alignment and molecular
  • 3. evolution.[13] Another early contributor to bioinformatics was Elvin A. Kabat, who pioneered biological sequence analysis in 1970 with his comprehensive volumes of antibody sequences released with Tai Te Wu between 1980 and 1991.[14]

Goals/scope

To study how normal cellular activities are altered in different disease states, the biological data must be combined to form a comprehensive picture of these activities. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data. This includes nucleotide and amino acid sequences, protein domains, and protein structures.[15] The actual process of analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

    - Development and implementation of computer programs that enable efficient access to, use of, and management of various types of information
    - Development of new algorithms (mathematical formulas) and statistical measures that assess relationships among members of large data sets. For example, there are methods to locate a gene within a sequence, to predict protein structure and/or function, and to cluster protein sequences into families of related sequences.

The primary goal of bioinformatics is to increase the understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques to achieve this goal. Examples include pattern recognition, data mining, machine learning algorithms, and visualization.
Major research efforts in the field include sequence alignment, gene finding, genome assembly, drug design, drug discovery, protein structure alignment, protein structure prediction, prediction of gene expression and protein–protein interactions, genome-wide association studies, and the modeling of evolution and cell division/mitosis.

Bioinformatics now entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

Over the past few decades, rapid developments in genomic and other molecular research technologies and developments in information technologies have combined to produce a tremendous amount of information related to molecular biology. Bioinformatics is the name given to these mathematical and computing approaches used to glean understanding of biological processes.

Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.

Relation to other fields

Bioinformatics is a science field that is similar to but distinct from biological computation, while it is often considered synonymous with computational biology. Biological computation uses bioengineering and biology to build biological computers, whereas bioinformatics uses computation to better understand biology. Bioinformatics and computational biology involve the analysis of biological data, particularly DNA, RNA, and protein sequences. The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology.
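Aligning two sequences to compare them, listed above among the common activities, can be sketched with a small dynamic-programming routine. The code below is a minimal illustration of global pairwise alignment in the style of the Needleman-Wunsch algorithm, not the heuristic used by any particular tool. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""

    def sub(x, y):
        # Substitution score for one pair of residues (assumed simple scheme).
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths. That cost is one reason database search programs rely on faster heuristics rather than full dynamic programming against every stored sequence.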
  • 4. Analyzing biological data to produce meaningful information involves writing and running software programs that use algorithms from graph theory, artificial intelligence,[16] soft computing, data mining, image processing, and computer simulation. The algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, system theory, information theory, and statistics.

Sequence analysis

The sequences of different genes or proteins may be aligned side-by-side to measure their similarity. For example, such an alignment can compare protein sequences and genomic sequences containing WPP domains.

Since the phage Φ-X174 was sequenced in 1977,[17] the DNA sequences of thousands of organisms have been decoded and stored in databases. This sequence information is analyzed to determine genes that encode proteins, RNA genes, regulatory sequences, structural motifs, and repetitive sequences. A comparison of genes within a species or between different species can show similarities between protein functions, or relations between species (the use of molecular systematics to construct phylogenetic trees). With the growing amount of data, it long ago became impractical to analyze DNA sequences manually. Today, computer programs such as BLAST are used daily to search sequences from more than 260,000 organisms, containing over 190 billion nucleotides.[18] These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence, to identify sequences that are related but not identical. A variant of this sequence alignment is used in the sequencing process itself.

DNA sequencing

Before sequences can be analyzed they have to be obtained. DNA sequencing is still a non-trivial problem, as the raw data may be noisy or afflicted by weak signals.
Algorithms have been developed for base calling for the various experimental approaches to DNA sequencing.

Sequence assembly

Most DNA sequencing techniques produce short fragments of sequence that need to be assembled to obtain complete gene or genome sequences. The so-called shotgun sequencing technique (which was used, for example, by The Institute for Genomic Research (TIGR) to sequence the first bacterial genome, Haemophilus influenzae)[19] generates the sequences of many thousands of small DNA fragments (ranging from 35 to 900 nucleotides long, depending on the sequencing technology). The ends of these fragments overlap and, when aligned properly by a genome assembly program, can be used to reconstruct the complete genome. Shotgun sequencing yields sequence data quickly, but the task of assembling the fragments can be quite complicated for larger genomes. For a genome as large as the human genome, it may take many days of CPU time on large-memory, multiprocessor computers to assemble the fragments, and the resulting assembly usually contains numerous gaps that must be
  • 5. filled in later. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.

Genome annotation

In the context of genomics, annotation is the process of marking the genes and other biological features in a DNA sequence. This process needs to be automated because most genomes are too large to annotate by hand, not to mention the desire to annotate as many genomes as possible, as the rate of sequencing has ceased to pose a bottleneck. Annotation is made possible by the fact that genes have recognisable start and stop regions, although the exact sequence found in these regions can vary between genes.

The first description of a comprehensive genome annotation system was published in 1995[19] by the team at The Institute for Genomic Research that performed the first complete sequencing and analysis of the genome of a free-living organism, the bacterium Haemophilus influenzae.[19] Owen White designed and built a software system to identify the genes encoding all proteins, transfer RNAs, ribosomal RNAs (and other sites) and to make initial functional assignments. Most current genome annotation systems work similarly, but the programs available for analysis of genomic DNA, such as the GeneMark program trained and used to find protein-coding genes in Haemophilus influenzae, are constantly changing and improving.

Following the goals that the Human Genome Project left to achieve after its closure in 2003, a new project developed by the National Human Genome Research Institute in the U.S. appeared.
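The remark in the genome annotation section that genes have recognisable start and stop regions can be illustrated with a naive open reading frame (ORF) scan. This is only a sketch under simplifying assumptions: it looks at the forward strand only, treats ATG as the sole start codon and TAA, TAG, TGA as stop codons, and ignores everything else a real annotation system considers. The function name and the `min_codons` parameter are hypothetical; programs such as GeneMark rely on trained statistical models rather than a scan like this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for forward-strand ORFs in all three frames.

    start and end are 0-based, end-exclusive positions; the reported sequence
    runs from the ATG through the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF, if any
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```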
The so-called ENCODE project is a collaborative data collection of the functional elements of the human genome that uses next-generation DNA-sequencing technologies and genomic tiling arrays, technologies able to automatically generate large amounts of data at a dramatically reduced per-base cost but with the same accuracy (base call error) and fidelity (assembly error).

Computational evolutionary biology

Evolutionary biology is the study of the origin and descent of species, as well as their change over time. Informatics has assisted evolutionary biologists by enabling researchers to:

    - trace the evolution of a large number of organisms by measuring changes in their DNA, rather than through physical taxonomy or physiological observations alone,
    - more recently, compare entire genomes, which permits the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer, and the prediction of factors important in bacterial speciation,
    - build complex computational population genetics models to predict the outcome of the system over time,[20]
    - track and share information on an increasingly large number of species and organisms.

Future work endeavours to reconstruct the now more complex tree of life. The area of research within computer science that uses genetic algorithms is sometimes confused with computational evolutionary biology, but the two areas are not necessarily related.

Comparative genomics
  • 6. 5 Lecture notes in Bioinformatics The core of comparative genome analysis is the establishment of the correspondence between genes (orthology analysis) or other genomic features in different organisms. It is these intergenomic maps that make it possible to trace the evolutionary processes responsible for the divergence of two genomes. A multitude of evolutionary events acting at various organizational levels shape genome evolution. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion.[21] Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation. The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who have recourse to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristics, fixed parameter and approximation algorithms for problems based on parsimony models to Markov chain Monte Carlo algorithms for Bayesian analysis of problems based on probabilistic models. Many of these studies are based on the homology detection and protein families computation. Pan genomics Pan genomics is a concept introduced in 2005 by Tettelin and Medini which eventually took root in bioinformatics. Pan genome is the complete gene repertoire of a particular taxonomic group: although initially applied to closely related strains of a species, it can be applied to a larger context like genus, phylum etc. It is divided in two parts- The Core genome: Set of genes common to all the genomes under study (These are often housekeeping genes vital for survival) and The Dispensable/Flexible Genome: Set of genes not present in all but one or some genomes under study. 
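The split into core and dispensable genomes described above can be sketched with plain set operations. This is only an illustration: the strain names and gene identifiers are hypothetical, and real pan-genome tools first cluster genes into families by sequence similarity, a step omitted here.

```python
def partition_pan_genome(genomes):
    """Split genes into pan, core and dispensable sets.

    genomes: dict mapping a genome name to the set of gene IDs found in it.
    """
    gene_sets = list(genomes.values())
    pan = set().union(*gene_sets)          # every gene seen in any genome
    core = set.intersection(*gene_sets)    # genes present in all genomes
    dispensable = pan - core               # genes missing from at least one genome
    return pan, core, dispensable


# Hypothetical strains and gene names, for illustration only.
strains = {
    "strain_A": {"dnaA", "gyrB", "recA", "toxX"},
    "strain_B": {"dnaA", "gyrB", "recA", "pilY"},
    "strain_C": {"dnaA", "gyrB", "recA", "toxX", "capZ"},
}
pan, core, dispensable = partition_pan_genome(strains)
print("core:", sorted(core))
print("dispensable:", sorted(dispensable))
```

In this toy data the three shared genes form the core genome, and the genes found in only one or two strains form the dispensable genome.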
A bioinformatics tool BPGA can be used to characterize the Pan Genome of bacterial species.[23] Genetics of disease With the advent of next-generation sequencing we are obtaining enough sequence data to map the genes of complex diseases such as diabetes,[24] infertility,[25] breast cancer[26] or Alzheimer's Disease.[27] Genome-wide association studies are a useful approach to pinpoint the mutations responsible for such complex diseases.[28] Through these studies, thousands of DNA variants have been identified that are associated with similar diseases and traits.[29] Furthermore, the possibility for genes to be used at prognosis, diagnosis or treatment is one of the most essential applications. Many studies are discussing both the promising ways to choose the genes to be used and the problems and pitfalls of using genes to predict disease presence or prognosis.[30] Analysis of mutations in cancer In cancer, the genomes of affected cells are rearranged in complex or even unpredictable ways. Massive sequencing efforts are used to identify previously unknown point mutations in a variety of genes in cancer. Bioinformaticians continue to produce specialized automated systems to manage the sheer volume of sequence data produced, and they create new algorithms and software to compare the sequencing results to the growing collection of human genome sequences and germline polymorphisms. New physical detection technologies are employed, such as oligonucleotide microarrays to identify chromosomal gains and losses (called comparative genomic hybridization), and single-nucleotide polymorphism arrays to detect known point mutations. These detection methods simultaneously measure several hundred thousand sites throughout the genome, and when used in high-throughput to measure thousands of samples, generate terabytes of data per experiment. Again the massive amounts and new types of data generate new opportunities for bioinformaticians. The
  • 7. 6 Lecture notes in Bioinformatics data is often found to contain considerable variability, or noise, and thus Hidden Markov model and change-point analysis methods are being developed to infer real copy number changes. Two important principles can be used in the analysis of cancer genomes bioinformatically pertaining to the identification of mutations in the exome. First, cancer is a disease of accumulated somatic mutations in genes. Second cancer contains driver mutations which need to be distinguished from passengers.[31] With the breakthroughs that this next-generation sequencing technology is providing to the field of Bioinformatics, cancer genomics could drastically change. These new methods and software allow bioinformaticians to sequence many cancer genomes quickly and affordably. This could create a more flexible process for classifying types of cancer by analysis of cancer driven mutations in the genome. Furthermore, tracking of patients while the disease progresses may be possible in the future with the sequence of cancer samples.[32] Another type of data that requires novel informatics development is the analysis of lesions found to be recurrent among many tumors. Gene and protein expression Analysis of gene expression The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. 
All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies.[33] Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells. Analysis of protein expression Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as with microarrays targeted at mRNA, the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples where multiple, but incomplete peptides from each protein are detected. Cellular protein localization in a tissue context can be achieved through affinity proteomics displayed as spatial data based on immunohistochemistry and tissue microarrays.[34] Analysis of regulation Regulation is the complex orchestration of events by which a signal, potentially an extracellular signal such as a hormone, eventually leads to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process.
  • 8. 7 Lecture notes in Bioinformatics For example, gene expression can be regulated by nearby elements in the genome. Promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Enhancer elements far away from the promoter can also regulate gene expression, through three-dimensional looping interactions. These interactions can be determined by bioinformatic analysis of chromosome conformation capture experiments. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single- cell organism, one might compare stages of the cell cycle, along with various stress conditions (heat shock, starvation, etc.). One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements. Examples of clustering algorithms applied in gene clustering are k-means clustering, self-organizing maps (SOMs), hierarchical clustering, and consensus clustering methods. Analysis of cellular organization Several approaches have been developed to analyze the location of organelles, genes, proteins, and other components within cells. This is relevant as the location of these components affects the events within a cell and thus helps us to predict the behavior of biological systems. A gene ontology category, cellular compartment, has been devised to capture subcellular localization in many biological databases. Microscopy and image analysis Microscopic pictures allow us to locate both organelles as well as molecules. It may also help us to distinguish between normal and abnormal cells, e.g. in cancer. 
Protein localization
The localization of proteins helps us to evaluate the role of a protein. For instance, if a protein is found in the nucleus it may be involved in gene regulation or splicing. By contrast, if a protein is found in mitochondria, it may be involved in respiration or other metabolic processes. Protein localization is thus an important component of protein function prediction. Well-developed protein subcellular localization prediction resources are available, including protein subcellular location databases and prediction tools.[35][36]
Nuclear organization of chromatin
Data from high-throughput chromosome conformation capture experiments, such as Hi-C and ChIA-PET, can provide information on the spatial proximity of DNA loci. Analysis of these experiments can determine the three-dimensional structure and nuclear organization of chromatin. Bioinformatic challenges in this field include partitioning the genome into domains, such as Topologically Associating Domains (TADs), that are organised together in three-dimensional space.[37]
Structural bioinformatics
Figure: 3-dimensional protein structures are common subjects in bioinformatic analyses.
Protein structure prediction is another important application of bioinformatics. The amino acid sequence of a protein, the so-called primary structure, can be easily determined from the sequence of the gene that codes for it. In the vast majority of cases, this primary structure uniquely determines a structure in its native environment. (There are exceptions, such as the prion responsible for bovine spongiform encephalopathy, a.k.a. Mad Cow Disease.) Knowledge of this structure is vital in understanding the function of the protein. Structural information is usually classified as one of secondary, tertiary and quaternary structure. A viable general solution to such predictions remains an open problem. Most efforts have so far been directed towards heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, one could infer that B may share A's function. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in structure formation and interaction with other proteins. In a technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.
One example of this is the structural homology between hemoglobin in humans and the hemoglobin in legumes (leghemoglobin). Both serve the same purpose of transporting oxygen in the organism. 
Though both of these proteins have completely different amino acid sequences, their protein structures are virtually identical, which reflects their near identical purposes.[38] Other techniques for predicting protein structure include protein threading and de novo (from scratch) physics-based modeling. Network and systems biology Network analysis seeks to understand the relationships within biological networks such as metabolic or protein–protein interaction networks. Although biological networks can be constructed from a single type of molecule or entity (such as genes), network biology often attempts to integrate many different data types, such as proteins, small molecules, gene expression data, and others, which are all connected physically, functionally, or both.
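As a minimal sketch of the network view described above, an interaction network can be stored as an adjacency map and queried for highly connected nodes. The protein names P1 to P5 and the interaction list are hypothetical placeholders, not data from any real organism.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def degrees(adjacency):
    """Number of distinct interaction partners for each protein."""
    return {node: len(partners) for node, partners in adjacency.items()}


# Hypothetical protein-protein interactions, for illustration only.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
net = build_network(edges)
deg = degrees(net)
hub = max(deg, key=deg.get)   # protein with the most partners
print(hub, deg[hub])
```

Node degree is only the simplest network statistic; integrating several data types, as the text describes, would need typed nodes and edges rather than this single adjacency map.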
Systems biology involves the use of computer simulations of cellular subsystems (such as the networks of metabolites and enzymes that comprise metabolism, signal transduction pathways and gene regulatory networks) to both analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution attempts to understand evolutionary processes via the computer simulation of simple (artificial) life forms.
Molecular interaction networks
Figure: Interactions between proteins are frequently visualized and analyzed using networks. This network is made up of protein–protein interactions from Treponema pallidum, the causative agent of syphilis and other diseases.
Tens of thousands of three-dimensional protein structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR), and a central question in structural bioinformatics is whether it is practical to predict possible protein–protein interactions based only on these 3D shapes, without performing protein–protein interaction experiments. A variety of methods have been developed to tackle the protein–protein docking problem, though it seems that there is still much work to be done in this field. Other interactions encountered in the field include protein–ligand (including drug) and protein–peptide. Molecular dynamic simulation of the movement of atoms about rotatable bonds is the fundamental principle behind computational algorithms, termed docking algorithms, for studying molecular interactions.
Sequence development
We start with a very basic review of biology, necessary for any further work, but largely sufficient for getting started in computational biology. One can (and must) learn more “on the job”. Biomolecules are sequences of monomers (DNA, RNA=nucleotide sequences, proteins=amino acid sequences). DNA is the molecule that contains the entire blueprint for an organism. 
It contains genes that encode the sequences for every protein in the organism, as well as non-coding regions that, among other things, contain regulatory mechanisms for when and in what order different genes get turned on, and may have other functions as well. Most genes code for proteins; some genes code for RNA molecules that play various roles in the cell. Both DNA and RNA are polymers of “nucleotides” which are bases of four kinds [adenine=A, cytosine=C, guanine=G, thymine=T (DNA only), uracil=U (RNA only)] attached to sugar-phosphate backbones. Apart from the one difference in bases, RNA and DNA are very similar except that DNA usually exists in double-stranded “base-paired” form and RNA is in single-stranded form. The backbone of DNA (or RNA) is not symmetrical: each monomer has a 5’-phosphate group at one end and a 3’-hydroxyl group at the other. Each strand is usually read from the 5’ to the 3’ end. The two strands go in opposite directions. The nucleic acids are base-paired A to T, G to C.
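The strand orientation and base-pairing rules above can be illustrated with a short sketch. The function names are chosen here for illustration, and the code assumes an unambiguous DNA string containing only A, C, G and T.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the partner strand, written 5' to 3'.

    The two strands run in opposite directions, so the partner is the
    complement of the input read backwards.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def to_rna(dna):
    """Rewrite a DNA sequence in the RNA alphabet (thymine becomes uracil)."""
    return dna.upper().replace("T", "U")


print(reverse_complement("ATGCCG"))
print(to_rna("ATGCCG"))
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.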
A-T base pairs are weaker (two hydrogen bonds); G-C base pairs are stronger (three hydrogen bonds).
Proteins are the “building blocks” of life, responsible for a vast number of cellular processes. They regulate genes, catalyse various biochemical reactions, form machinery for synthesis of other molecules (including other proteins) and are important parts of organelles and tissues. They are polymers of amino acids (carboxylic acids with an amine group and a side chain). There are twenty standard amino acids, differing in their side chains.
Proteins tend to “fold” into complex three-dimensional conformations; usually the fold is unique and misfolding is rare. The details of the fold are biochemically important. Usually a few active “domains” (for example, binding to DNA, interaction with other proteins) help the protein play important roles in gene regulation, catalysis, etc; these domains tend to be well conserved across species, while the rest of the protein sequence can mutate a lot. Much computational effort goes into studying protein structure and function, but we will not discuss this vast subject here.
Genes that code for proteins are first “transcribed” to “messenger RNA” (mRNA) molecules, and then the mRNA is “translated” to proteins. Each “codon” of three nucleotides corresponds to a single amino acid. Since there are 4 nucleotides, there are 4 × 4 × 4 = 64 possible codons; three of these are “stop codons” (TAA, TAG, TGA), sometimes called “nonsense codons”, which don’t code for amino acids and instead signal a stop to translation. The remaining 61 code for 20 amino acids. Several codons (up to six) thus can code for the same amino acid. The “start codon” is ATG, which codes for the amino acid methionine.
What are the biological problems?
There are of course a huge number of problems in biology that can benefit from a quantitative treatment, ranging from single molecule behaviour to population biology and ecology. 
From the title, we are already restricting ourselves to bioinformatics, but we will mainly focus on DNA sequence analysis, with only occasional mention of proteins. The following are a few issues of interest to biologists (and often of medical importance) that could benefit from analysis of DNA sequence: • Cellular processes: how the cell carries out its normal tasks; how it responds to external events like heat shock and starvation; how it carries out complex cascades of events such as the process of cell division (mitosis). • Development: How a complex organism (eg a worm, a fly, a human) develops from a single fertilised egg. As this embryonic cell divides, the daughter cells also slowly differentiate into functions. This happens as a result of “gradients” of various factors (some of them maternal) that change gene regulation in different parts of the embryo and ultimately cause different cells to develop in highly specialised ways. • Evolution: How different species evolve, how new functionality develops. All cellular and developmental processes are controlled by genes that get turned on in response to some external condition (stress, starvation, embryonic gradients) or cyclically (cell cycle). Computational study of how these genes are regulated and how they function is very useful. This is done by analysing the gene sequence and regulatory DNA sequence of the organism itself, and by comparison of this sequence with already-annotated sequence from other organisms. Highly similar (homologous) genes exist among widely different organisms; such genes are called “orthologues”. Many subsystems in widely different organisms are very similar and are regulated by orthologous proteins; some proteins exist largely unchanged from primitive archaebacteria all the way to humans. Moreover, many genes with heavy sequence identity often exist in the same organism, arising from ancestral “gene duplication” events; their function is often slightly differentiated, and in fact
  • 12. 11 Lecture notes in Bioinformatics this is a major driving factor in evolution. There are now “high-throughput” microarray experiments that can essentially give the response of every gene in the genome; analysing, clustering and interpreting this data, and combining it with other computational tasks in gene regulation, is of great interest. Finally, the study of phylogeny (evolutionary history of organisms) and the classification (taxonomy) of organisms has been revolutionised by DNA sequencing.  Aims and tasks of Bioinformatics The aims of bioinformatics are threefold. First, at its simplest bioinformatics organizes data in a way that allows researchers to access existing information and to submit new entries as they are produced, eg the Protein Data Bank for 3D macromolecular structures. While data- curation is an essential task, the information stored in these databases is essentially useless until analyzed. Thus the purpose of bioinformatics extends much further. The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences. This needs more than just a simple text-based search and programs such as FASTA and PSI-BLAST must consider what comprises a biologically significant match. Development of such resources dictates expertise in computational theory as well as a thorough understanding of biology. The third aim is to use these tools to analyze the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared those with a few that are related. In bioinformatics, we can now conduct global analyses of all the available data with the aim of uncovering common principles that apply across many systems and highlight novel features. 
Application of Bioinformatics in current research
Almost every field of biological research, from molecular biology and genetics to agriculture, now uses bioinformatics. A new field, genome informatics, is based entirely on bioinformatics tools. Bioinformatics is also widely used in novel drug molecule research, where its primary role is the prediction of structural and functional similarity. Typical tasks include the following.
• Submitting DNA Sequences to the Databases
Scientists sequence DNA and RNA, but a sequence benefits the scientific community only once it has been deposited in a public sequence database. It has therefore become essential to submit all sequenced data to public sequence repositories. Some of the important public repositories are DDBJ, EMBL, and GenBank. Sequence data can be submitted in two ways: by email or online through sequence submission tools. There are specific tools for every public sequence repository (Table 1).
Table 1: Public sequence repositories
Table 2: Human Genome Databases, Browsers and Variation Resources
Table 2a: Vertebrate databases and genome browsers
Table 3: List of some invertebrate databases and genome browsers
After submission, and after verification and duplication checks, every database assigns a unique accession number to the submitted sequence. Accession numbers originally consisted of a single letter followed by five digits; because of the huge number of submissions, a format of two letters followed by six digits has since been introduced.
• Genomic Mapping and Mapping Databases
Gene mapping is a technique for estimating the position of a gene and its distance from related genes. Once all genes have been mapped, the result is a genome map for that particular organism.
• Information Retrieval From Biological Database
In the early days of the field, developing biological databases and making them available online was a primary concern. Now many biological databases exist, holding data as text, tables, pictures and many other formats, so it is important to know how to retrieve exactly the data needed from a suitable database. Retrieval may target text, sequences, or structural data.
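The two accession number layouts described in this section (one letter followed by five digits, or two letters followed by six digits) can be checked with a small pattern. This is a sketch of those two layouts only; it makes no claim to cover every format a repository accepts, and the sample identifiers are used purely as format examples.

```python
import re

# One letter + five digits, or two letters + six digits (layouts from the text).
ACCESSION_PATTERN = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text):
    """Return True if the whole string matches one of the two layouts."""
    return ACCESSION_PATTERN.fullmatch(text.strip().upper()) is not None


for candidate in ["U49845", "AF123456", "12345", "ABC123"]:
    print(candidate, looks_like_accession(candidate))
```

A format check like this only catches typing errors; whether an identifier actually exists still has to be confirmed by querying the database.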
• Sequence Alignment and Database Searching
Aligning a sequence with other relevant, similar sequences is essential in biological research, both to understand the relationship between two sequences and to predict structure and function from sequence similarity. BLAST is very commonly used for basic sequence alignment. Based on the number of sequences involved, alignments are classified as pairwise alignments or multiple sequence alignments.
• Predictive Methods Using DNA Sequences
Gene-finding strategies can be classified into three major categories. Content-based methods rely on the overall, bulk properties of a sequence in making a determination. Characteristics considered here include how often particular codons are used, the periodicity of repeats, and the compositional complexity of the sequence. Because different organisms use synonymous codons with different frequency, such clues can provide insight into determining regions that are more likely to be exons. In site-based methods, the focus turns to the presence or absence of a specific sequence, pattern, or consensus. These methods are used to detect features such as donor and acceptor splice sites, binding sites for transcription factors, polyA tracts, and start and stop codons. Comparative methods make determinations based on sequence homology. Here, translated sequences are subjected to database searches against protein sequences to determine whether a previously characterized coding region corresponds to a region in the query sequence. Although this is conceptually the most straightforward of the methods, it is restrictive because most newly discovered genes do not have gene products that match anything in the protein databases. Tools associated with these methods include Grail, Genscan, Fgenes, Procrustes and many others.
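A site-based approach can be sketched as a scan for start and stop codons. The code below is a deliberately naive illustration, not how Grail or Genscan work: it looks only at the forward strand, ignores introns and splice sites, and the `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for open reading frames on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon;
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                    # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

Scanning the reverse strand as well would mean running the same function on the reverse complement of the input.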
• Predictive Methods Using Protein Sequences
There are tools based on predictive methods using protein sequences, such as PSLPred, NRpred, and PSEAPred. Other methods work at the motif, residue, signal, peptide, domain, or profile level.
• Sequences Assembly and Finishing Methods
At present, the sequencing process is often talked of as consisting of two parts, namely assembly and finishing, but in practice there is considerable overlap between the two. Assembly is the process of attempting to order and align the readings, and finishing is the task of checking and editing the assembled data. This includes performing new sequencing experiments to fill gaps or to cover segments where the data is poor, and adjudicating between conflicting readings when editing.
• Phylogenetic Analysis
Phylogenetic analysis is another important application of bioinformatics in biological research. It is the study of the evolutionary history of organisms. Using sequence and structural similarities, we relate the ancestry of organisms to show how they are related to one another and in what order they evolved.
Many tools are available, both online and as standalone packages such as PHYLIP. They generate trees with algorithms such as UPGMA and neighbor joining.
• Comparative Genome Analysis
Comparative genome analysis is performed in both academic and professional research. By comparing the finished reference sequence of the human genome with genomes of other organisms, researchers can identify regions of similarity and difference. This information can help scientists better understand the structure and function of human genes and thereby develop new strategies to combat human disease. 
Comparative genomics also provides a powerful tool for studying evolutionary changes among organisms, helping to identify genes that are conserved among species, as well as genes that give each organism its unique characteristics.
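Identifying regions of similarity and difference usually starts from an alignment. As a minimal sketch, assuming the two sequences have already been aligned to equal length with "-" marking gaps, percent identity can be computed as follows; the example sequences are made up.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned, ungapped positions at which two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue                  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned fragments, for illustration only.
print(percent_identity("ATGCATGC", "ATGCTTGC"))
print(percent_identity("ATG-AT", "ATGCAT"))
```

One minus the identity fraction gives a simple distance between two sequences, which is the kind of pairwise value that distance-based tree methods such as UPGMA and neighbor joining take as input.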
• Large-Scale Genome Analysis
Large-scale genome analysis involves complete genome sequencing. This application has advanced greatly as next-generation sequencing platforms such as Illumina, and bioinformatics tools to analyze their output quickly, have been developed. Sequencers play a vital role in modern biological research. There are also many applications in pharmaceutical research, which draws on systems biology and on metabolic pathways and their relation to biological function.
Recent Advancement
Bioinformatics is still in its nascent stage, but continuous improvement is making it more efficient. The adoption of various programming languages in the field and the development of software packages for the analysis of biological data are the main contributors to recent advancement. Drug design software packages such as “Sanjeevini”, developed by IIT Delhi, India, and Maestro from Schrödinger also contribute. The Indian Agricultural Statistics Research Institute (IASRI) contributes substantially to bioinformatics research by creating many databases in agricultural and biological areas. The most recent uses of bioinformatics are in novel drug molecule discovery and ligand analysis for protein targets in human physiology, with the aim of finding cures for lethal diseases in a short period. A large amount of docking software is available, and many of these programs are efficient and have demonstrated their accuracy.
Review and conclusion
With the inclusion of a large number of tools and the application of bioinformatics in various areas of biological research, the field is demonstrating its importance. Nowadays nearly every experiment in biological research is associated with bioinformatics. 
It has made research simpler and faster, although the accuracy of various techniques is still being validated. Many results can now be obtained within minutes, which was not possible using wet-lab techniques alone.
Guide to NCBI Databases tools & Services/uses
The National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), has created a large number of databases that are freely available to researchers. These databases represent a vast store of information about genetics, genomics, proteomics, and medicine. All of these databases can be reached from the Entrez search page. This page also allows cross searching of all the NCBI databases, a feature called Entrez Global Query.
There are two main functions of biological databases:
1. To make biological data available to scientists. As much as possible of a particular type of information should be available in one single place (book, site, database). Published data may be difficult to find or access, and collecting it from the literature is very time-consuming. Moreover, not all data is actually published in an article.
2. To make biological data available in computer-readable form. Since analysis of biological data almost always involves computers, having the data in computer-readable form (rather than printed on paper) is a necessary first step.
One of the first biological sequence databases was probably the book "Atlas of Protein Sequence and Structure" by Margaret Dayhoff and colleagues, first published in 1965. It contained the protein sequences determined at the time, and new editions of the book were published well into the 1970s. Its data became the foundation for the PIR database.
The computer became the storage medium of choice as soon as it was accessible to ordinary scientists. Databases were distributed on tape, and later on various kinds of disks. Once universities and academic institutes were connected to the Internet or its precursors (national computer networks), the network naturally became the distribution medium of choice. It is even easier to see why the World Wide Web (WWW, based on the Internet protocol HTTP) has been, since the beginning of the 1990s, the standard method of communication and access for nearly all biological databases.
As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown tremendously. The obvious examples are the nucleotide sequences, the protein sequences, and the 3D structural data produced by X-ray crystallography and macromolecular NMR. A new field of science dealing with the issues, challenges and new possibilities created by these databases has emerged: bioinformatics. Other types of data that are or will soon be available in databases are metabolic pathways, gene expression data (microarrays) and other types of data relating to biological function and processes.
One very important issue is the frequency and type of errors in the entries of a database. Naturally, this depends strongly on the type of data, and on whether the database is curated (entries are added, deleted, or modified by a defined group of people) or not. For the sequence databases, the errors may be either in the sequence itself (misprint, wrong on entry, genuine experimental error...) or in the annotation (mistaken features, errors in references,...). In the 3D structure database (PDB), structures have been deposited which were later discovered to contain severe errors. The error handling policy differs considerably between databases.
If one needs to use any particular database heavily, then the implications of its particular policy need to be considered. The present document will touch on only the largest and most frequently used databases. We will begin with an introduction to the Entrez search interface and will then proceed to the details of some of the individual NCBI databases.
Entrez
Entrez is the unified search interface for NCBI databases. This common interface allows easy linking between results in different databases.
Entrez Search Tips
• The Boolean operators AND, OR, and NOT may be used and must be in all caps.
• To see exactly how Entrez has interpreted your query, see the “Search details” box on the right side of the screen.
• Use quotation marks to enclose a phrase.
• Use the asterisk for truncation (e.g., bacteri* will retrieve bacteria, bacterium, bacteriophage).
• Enter author names in the format Johnson AB with no punctuation. Entrez will recognize this as an author name and search only that field. When in doubt, or when the initials are not known, use Johnson[AUTHOR].
• Clicking on “Advanced Search” will display a numbered list of searches for the current session. Previously run searches may be combined using the syntax #2 AND #3.
• Field specific searching is also available under “Advanced Search.” Alternately, one can use Entrez search field qualifiers (e.g., rbcL[GENE] to search only the gene name field).
BLAST
Like Entrez, BLAST (Basic Local Alignment Search Tool) is not a database itself, but a means of accessing the data in NCBI databases, particularly in the nucleotide and protein databases. BLAST allows researchers to directly search nucleotide or protein sequence
data. For instance, a researcher can submit a sequence through BLAST to see if there are similar sequences already in the NCBI databases. In addition, BLAST can be used to align two sequences using a tool called bl2seq.
PubMed
PubMed is the major bibliographic database from NCBI. It searches MEDLINE, a database from the NLM that covers “medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences, such as molecular biology.” PubMed also allows access to articles that are out of scope for MEDLINE, but which appear in journals indexed by MEDLINE. All material included in PubMed Central (NLM’s online journal archive) is indexed, as well as a few additional databases from NLM. PubMed employs a system called Automatic Term Mapping to match search terms to the Medical Subject Headings (MeSH) vocabulary. To see how terms have been mapped, see the “Search details” box on the right side of the screen. This is an invaluable tool for troubleshooting a search. Clicking “Advanced Search” on the PubMed search page provides many options for customizing a search. Limits include human vs. animal subjects, male vs. female subjects, age of subjects, article type, and journal type. Advanced search will also display your search history. The PubMed help file provides guidance on structuring searches and managing search results.
Online Mendelian Inheritance in Man (OMIM)
OMIM is a database from Johns Hopkins University for human genetics containing short articles with references on genetic disorders. It is an excellent starting point for any question involving human genetics as it links out to bibliographic records in PubMed and to sequence records.
Nucleotide Databases
The nucleotide sequence data in NCBI is a composite of the data from GenBank, the European Molecular Biology Laboratory (EMBL) and the DNA Databank of Japan (DDBJ). NCBI’s nucleotide data is divided into three sub-databases:
1. 
GenBank Expressed Sequence Tags (EST) – these are generally short sequences derived from mRNA isolated from a particular tissue at a particular stage of development. 2. GenBank Genome Survey Sequence (GSS) – these are sequences derived from whole-genome sequencing projects. 3. CoreNucleotide – all nucleotide sequences that are not ESTs or GSSs. Confusingly, the links on the Entrez search page to EST, GSS, and CoreNucleotide all go to the same Entrez Nucleotide search interface, so when a search is performed in any one of the sub-databases, results are returned from all three. Links are provided at the top of the results page so that results from a particular sub-database may be isolated. Among the nucleotide sequences, there are some that are uncurated, meaning that they are in the database just as they were submitted by researchers. Other records are referred to as reference sequences (or RefSeq) and are curated by NCBI. RefSeq records are identified by accession numbers beginning with two letters and an underscore (e.g., NM_, XP_). For more information about the nucleotide databases, see Chapter 1 of the NCBI Handbook. For more information about the RefSeq project, see Chapter 18. Protein Database The protein database contains data from GenBank, EMBL, and DDBJ as well as sequences submitted to various other sources including SWISS-PROT. As with the
nucleotide database, RefSeq records are identified by accession numbers beginning with two letters and an underscore.
Genome Database
The genome database provides views of entire genomes and chromosomes. Results are displayed via NCBI’s Map Viewer, from which the user can zoom in on a region of interest. The Map Viewer is highly customizable, allowing users to control what types of maps are displayed and the level of resolution. Links to other NCBI databases are provided. For help using the Map Viewer, see Chapter 20 of the NCBI Handbook.
Structure Database
The structure database contains three-dimensional structures of proteins from the protein database. It is searchable by keyword or by protein or nucleotide sequence. Protein structures can be viewed and manipulated using the free Cn3D tool. Help is available from the database help screen.
Gene Database
The gene database allows the user to search for individual genes from among the genomes represented in RefSeq, providing useful summary statements about the gene and links to other NCBI databases. As in the Genome database, results may be examined using the sequence viewer.
Taxonomy
The taxonomy database contains the names of all organisms that are represented by nucleotide or protein sequences in the NCBI databases. Records contain links to higher taxa, nomenclatural synonyms, and links to the various databases in which records for a given organism reside.
DNA and Protein sequencing and analysis
Introduction: Before the 1970s there was no direct method to determine a nucleotide sequence. In the mid-1970s, two methods were developed for the direct sequencing of DNA: the Sanger-Coulson chain-termination method and the Maxam-Gilbert chemical cleavage method. For this work Frederick Sanger and Walter Gilbert shared in the 1980 Nobel Prize in Chemistry. DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. 
It includes any method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.[1] Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous applied fields such as medical diagnosis, biotechnology, forensic biology, virology and biological systematics. The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental in the sequencing of complete DNA sequences, or genomes of numerous types and species of life, including the human genome and other complete DNA sequences of many animal, plant, and microbial species. DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions (i.e. clusters of genes or operons), full chromosomes or entire genomes, of any organism. DNA sequencing is also the most efficient way to sequence RNA or proteins (via their open reading frames). In fact, DNA sequencing has become a key technology in many areas of biology and other sciences such as medicine, forensics, or anthropology
Figure: An example of the results of automated chain-termination DNA sequencing.
The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based on two-dimensional chromatography. Following the development of fluorescence-based sequencing methods with a DNA sequencer, DNA sequencing has become easier and orders of magnitude faster.
Methods of DNA sequencing
Maxam-Gilbert sequencing
Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977 based on chemical modification of DNA and subsequent cleavage at specific bases. Also known as chemical sequencing, this method allowed purified samples of double-stranded DNA to be used without further cloning. The method's use of radioactive labeling and its technical complexity discouraged extensive use once the Sanger methods had been refined. Maxam-Gilbert sequencing requires radioactive labeling at one 5' end of the DNA and purification of the DNA fragment to be sequenced. Chemical treatment then generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). The concentration of the modifying chemicals is controlled to introduce on average one modification per DNA molecule. Thus a series of labeled fragments is generated, from the radiolabeled end to the first "cut" site in each molecule. The fragments in the four reactions are electrophoresed side by side in denaturing acrylamide gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands each corresponding to a radiolabeled DNA fragment, from which the sequence may be inferred. 
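How a sequence is inferred from the four lanes (G, A+G, C, C+T) can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function names `simulate_gel` and `read_gel` are hypothetical, every possible cut is assumed to produce a visible band, and each band is indexed by the position of the cleaved base counted from the labelled 5' end (on a real gel the fragment is one nucleotide shorter and the shortest fragments may not be resolved).

```python
# Which bases are cleaved in each of the four Maxam-Gilbert reactions.
LANES = {"G": "G", "A+G": "AG", "C": "C", "C+T": "CT"}


def simulate_gel(seq):
    """Return {lane: set of band positions} for a 5'-labelled strand.

    A band at position p means a cut occurred at the p-th base
    (1-based, counted from the labelled end).
    """
    gel = {lane: set() for lane in LANES}
    for pos, base in enumerate(seq.upper(), start=1):
        for lane, targets in LANES.items():
            if base in targets:
                gel[lane].add(pos)
    return gel


def read_gel(gel, length):
    """Infer the sequence by comparing the four lanes band by band."""
    calls = []
    for pos in range(1, length + 1):
        in_g, in_ag = pos in gel["G"], pos in gel["A+G"]
        in_c, in_ct = pos in gel["C"], pos in gel["C+T"]
        if in_g and in_ag:
            calls.append("G")   # band in both purine lanes
        elif in_ag:
            calls.append("A")   # band only in the A+G lane
        elif in_c and in_ct:
            calls.append("C")   # band in both pyrimidine lanes
        elif in_ct:
            calls.append("T")   # band only in the C+T lane
        else:
            calls.append("N")   # no band: base cannot be called
    return "".join(calls)


if __name__ == "__main__":
    gel = simulate_gel("GATTACA")
    print(read_gel(gel, 7))
```

The point of the sketch is the decision rule in `read_gel`: A and T are never cleaved alone, so they are called by the absence of a band in the G lane or C lane, respectively.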
Chain-termination methods
The chain-termination method developed by Frederick Sanger and coworkers in 1977 soon became the method of choice, owing to its relative ease and reliability. When invented, the chain-terminator method used fewer toxic chemicals and lower amounts of radioactivity than the Maxam and Gilbert method. Because of its comparative ease, the Sanger method was soon automated and was the method used in the first generation of DNA sequencers.
Sanger sequencing is the method which prevailed from the 1980s until the mid-2000s. Over that period, great advances were made in the technique, such as fluorescent labelling, capillary electrophoresis, and general automation. These developments allowed much more efficient sequencing, leading to lower costs. The Sanger method, in mass production form, is the technology which produced the first human genome in 2001, ushering in the age of genomics. However, later in the decade, radically different approaches reached the market, bringing the cost per genome down from $100 million in 2001 to $10,000 in 2011.
Some tasks in DNA sequence analysis
Large quantities of sequence data are being published, for organisms from bacteria to higher mammals. It will take decades to analyse the data. Here are a few of the tasks involved in the analysis:
• Sequence assembly: Sequencing involves a “shotgun” approach in which DNA fragments about 1000 bp long are sequenced with significant redundant coverage. These fragments then have to be “assembled”.
• Annotation: The assembled genomes have to be “annotated”: genes identified and marked out, their functions identified, and so on.
• Motif finding: Non-coding DNA contains “regulatory” regions where proteins called “transcription factors” (TFs) bind to “turn on” genes. Identifying such regions, and binding sites for individual TFs, is of great importance. TFs typically bind to small “motifs”, so the task is to find overrepresented short “motifs” in larger quantities of sequence.
• Sequence alignment: In the last two tasks, it is very useful to compare genomes of previously sequenced species. “Comparative genomics” is becoming a very important subfield. Detection and alignment of homologous sequence is an important task here.
• Phylogenetic trees: Given sequence data from different species, it is useful to reconstruct their phylogenetic relationship. 
Algorithms exist for all these tasks, but all are evolving with increasing understanding of the function of non-coding DNA, increasing mathematical and algorithmic sophistication in the methods, and increasing raw computational power available to tackle these tasks. There is not much explicit mention of parallel programming in what follows. But most problems are intrinsically parallelisable, and many tasks require several independent runs that can be done trivially in parallel.
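As a concrete illustration of the motif-finding task listed above, the following minimal sketch counts how many input sequences contain each short word (k-mer) and reports the most widely shared ones. It is a toy under stated assumptions, not a real motif finder: the function names and the example sequences are hypothetical, only exact matches are counted, and no background model is used to judge whether a word is truly overrepresented.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count, for every k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Use a set so that a k-mer is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts


def top_motifs(sequences, k, n=3):
    """Return the n k-mers shared by the largest number of sequences."""
    return count_kmers(sequences, k).most_common(n)


if __name__ == "__main__":
    # Made-up example "promoter" fragments sharing one 8-base word.
    promoters = ["ACGTTGACGTCA", "TTTGACGTCAGG", "CCTGACGTCATT"]
    print(top_motifs(promoters, 8, n=1))
```

Each sequence is processed independently inside `count_kmers`, which is one reason tasks of this kind parallelise easily: the per-sequence sets could be computed by separate workers and merged afterwards.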
Figure: An overview of DNA sequencing
Protein sequencing and analysis
Protein sequencing is the practical process of determining the amino acid sequence of all or part of a protein or peptide. This may serve to identify the protein or characterize its post-translational modifications. Typically, partial sequencing of a protein provides sufficient information (one or more sequence tags) to identify it with reference to databases of protein sequences derived from the conceptual translation of genes. The two major direct methods of protein sequencing are mass spectrometry and Edman degradation using a protein sequenator (sequencer). Mass spectrometry methods are now the most widely used for protein sequencing and identification, but Edman degradation remains a valuable tool for characterizing a protein's N-terminus.
Determining amino acid composition
It is often desirable to know the unordered amino acid composition of a protein prior to attempting to find the ordered sequence, as this knowledge can be used to facilitate the discovery of errors in the sequencing process or to distinguish between ambiguous results. Knowledge of the frequency of certain amino acids may also be used to choose which protease to use for digestion of the protein. The misincorporation of low levels of non-standard amino acids (e.g. norleucine) into proteins may also be
determined.[1] A generalized method often referred to as amino acid analysis[2] for determining amino acid frequency is as follows:
1. Hydrolyse a known quantity of protein into its constituent amino acids.
2. Separate and quantify the amino acids in some way.
Hydrolysis
Hydrolysis is done by heating a sample of the protein in 6 M hydrochloric acid to 100–110 °C for 24 hours or longer. Proteins with many bulky hydrophobic groups may require longer heating periods. However, these conditions are so vigorous that some amino acids (serine, threonine, tyrosine, tryptophan, glutamine, and cysteine) are degraded. To circumvent this problem, Biochemistry Online suggests heating separate samples for different times, analysing each resulting solution, and extrapolating back to zero hydrolysis time. Rastall suggests a variety of reagents to prevent or reduce degradation, such as thiol reagents or phenol to protect tryptophan and tyrosine from attack by chlorine, and pre-oxidising cysteine. He also suggests measuring the quantity of ammonia evolved to determine the extent of amide hydrolysis.
Separation and quantitation
The amino acids can be separated by ion-exchange chromatography then derivatized to facilitate their detection. More commonly, the amino acids are derivatized then resolved by reversed phase HPLC. An example of the ion-exchange chromatography is given by the NTRC using sulfonated polystyrene as a matrix, adding the amino acids in acid solution and passing a buffer of steadily increasing pH through the column. Amino acids are eluted when the pH reaches their respective isoelectric points. Once the amino acids have been separated, their respective quantities are determined by adding a reagent that will form a coloured derivative. If the amounts of amino acids are in excess of 10 nmol, ninhydrin can be used for this; it gives a yellow colour when reacted with proline, and a vivid purple with other amino acids. 
The concentration of amino acid is proportional to the absorbance of the resulting solution. With very small quantities, down to 10 pmol, fluorescent derivatives can be formed using reagents such as ortho-phthalaldehyde (OPA) or fluorescamine. Pre-column derivatization may use the Edman reagent to produce a derivative that is detected by UV light. Greater sensitivity is achieved using a reagent that generates a fluorescent derivative. The derivatized amino acids are subjected to reversed phase chromatography, typically using a C8 or C18 silica column and an optimised elution gradient. The eluting amino acids are detected using a UV or fluorescence detector and the peak areas compared with those for derivatised standards in order to quantify each amino acid in the sample.
N-terminal amino acid analysis
Figure: Sanger's method of peptide end-group analysis: (A) derivatization of the N-terminal end with Sanger's reagent (DNFB); (B) total acid hydrolysis of the dinitrophenyl peptide
Determining which amino acid forms the N-terminus of a peptide chain is useful for two reasons: to aid the ordering of individual peptide fragments' sequences into a whole chain, and because the first round of Edman degradation is often contaminated by impurities and therefore does not give an accurate determination of the N-terminal amino acid. A generalised method for N-terminal amino acid analysis follows:
1. React the peptide with a reagent that will selectively label the terminal amino acid.
2. Hydrolyse the protein.
3. Determine the amino acid by chromatography and comparison with standards.
There are many different reagents which can be used to label terminal amino acids. They all react with amine groups and will therefore also bind to amine groups in the side chains of amino acids such as lysine; for this reason it is necessary to be careful in interpreting chromatograms to ensure that the right spot is chosen. Two of the more common reagents are Sanger's reagent (1-fluoro-2,4-dinitrobenzene) and dansyl derivatives such as dansyl chloride. Phenylisothiocyanate, the reagent for the Edman degradation, can also be used. The same questions apply here as in the determination of amino acid composition, with the exception that no stain is needed, as the reagents produce coloured derivatives and only qualitative analysis is required. So the amino acid does not have to be eluted from the chromatography column, just compared with a standard. Another consideration to take into account is that, since any amine groups will have reacted with the labelling reagent, ion exchange chromatography cannot be used, and thin layer chromatography or high-pressure liquid chromatography should be used instead.
C-terminal amino acid analysis
The number of methods available for C-terminal amino acid analysis is much smaller than the number of available methods of N-terminal analysis. 
The most common method is to add carboxypeptidases to a solution of the protein, take samples at regular intervals, and determine the terminal amino acid by analysing a plot of amino acid concentrations against time. This method is particularly useful for polypeptides and proteins with blocked N-termini. C-terminal sequencing also greatly helps in verifying the primary structures of proteins predicted from DNA sequences and in detecting any post-translational processing of gene products from known codon sequences.
Edman degradation
The Edman degradation is a very important reaction for protein sequencing, because it allows the ordered amino acid sequence of a protein to be discovered. Automated Edman sequencers are now in widespread use, and are able to sequence peptides up to approximately 50 amino acids long. A reaction scheme for sequencing a protein by the Edman degradation follows; some of the steps are elaborated on subsequently.
1. Break any disulfide bridges in the protein with a reducing agent like 2-mercaptoethanol. A protecting group such as iodoacetic acid may be necessary to prevent the bonds from re-forming.
2. Separate and purify the individual chains of the protein complex, if there is more than one.
3. Determine the amino acid composition of each chain.
4. Determine the terminal amino acids of each chain.
5. Break each chain into fragments under 50 amino acids long.
6. Separate and purify the fragments.
7. Determine the sequence of each fragment.
8. Repeat with a different pattern of cleavage.
9. Construct the sequence of the overall protein.
Digestion into peptide fragments
Peptides longer than about 50-70 amino acids cannot be sequenced reliably by the Edman degradation. Because of this, long protein chains need to be broken up into small fragments that can then be sequenced individually. Digestion is done either by endopeptidases such as trypsin or pepsin or
by chemical reagents such as cyanogen bromide. Different enzymes give different cleavage patterns, and the overlap between fragments can be used to construct an overall sequence.
Reaction
The peptide to be sequenced is adsorbed onto a solid surface. One common substrate is glass fibre coated with polybrene, a cationic polymer. The Edman reagent, phenylisothiocyanate (PITC), is added to the adsorbed peptide, together with a mildly basic buffer solution of 12% trimethylamine. This reacts with the amine group of the N-terminal amino acid. The terminal amino acid can then be selectively detached by the addition of anhydrous acid. The derivative then isomerises to give a substituted phenylthiohydantoin, which can be washed off and identified by chromatography, and the cycle can be repeated. The efficiency of each step is about 98%, which allows about 50 amino acids to be reliably determined: after 50 cycles the cumulative yield has fallen to roughly 0.98^50 ≈ 0.36 of the starting material.
Figure: A Beckman-Coulter Porton LF3000G protein sequencing machine
Protein sequenator
A protein sequenator is a machine that performs Edman degradation in an automated manner. A sample of the protein or peptide is immobilized in the reaction vessel of the protein sequenator and the Edman degradation is performed. Each cycle releases and derivatises one amino acid from the protein or peptide's N-terminus, and the released amino-acid derivative is then identified by HPLC. The sequencing process is repeated for the whole polypeptide until the entire measurable sequence is established, or for a pre-determined number of cycles.
Identification by mass spectrometry
Protein identification is the process of assigning a name to a protein of interest (POI), based on its amino-acid sequence. Typically, only part of the protein’s sequence needs to be determined experimentally in order to identify the protein with reference to databases of protein sequences deduced from the DNA sequences of their genes. 
Further protein characterization may include confirmation of the actual N- and C-termini of the POI, determination of sequence variants and identification of any post-translational modifications present.
Proteolytic digests
A general scheme for protein identification is described.
• The POI is isolated, typically by SDS-PAGE or chromatography.
• The isolated POI may be chemically modified to stabilise cysteine residues (e.g. S-amidomethylation or S-carboxymethylation).
• The POI is digested with a specific protease to generate peptides. Trypsin, which cleaves selectively on the C-terminal side of lysine or arginine residues, is the most commonly used protease. Its advantages include i) the frequency of Lys and Arg residues in proteins, ii) the high specificity of the enzyme, iii) the stability of the enzyme and iv) the suitability of tryptic peptides for mass spectrometry.
• The peptides may be desalted to remove ionizable contaminants and subjected to MALDI-TOF mass spectrometry. Direct measurement of the masses of the peptides may provide sufficient information to identify the protein (see Peptide mass fingerprinting) but further fragmentation of the peptides inside the mass spectrometer is often used to gain information about the peptides’ sequences. Alternatively, peptides may be desalted and separated by reversed phase
HPLC and introduced into a mass spectrometer via an ESI source. LC-ESI-MS may provide more information than MALDI-MS for protein identification but uses more instrument time.
• Depending on the type of mass spectrometer, fragmentation of peptide ions may occur via a variety of mechanisms such as collision-induced dissociation (CID) or post-source decay (PSD). In each case, the pattern of fragment ions of a peptide provides information about its sequence.
• Information including the measured mass of the putative peptide ions and those of their fragment ions is then matched against calculated mass values from the conceptual (in-silico) proteolysis and fragmentation of databases of protein sequences. A successful match will be found if its score exceeds a threshold based on the analysis parameters. Even if the actual protein is not represented in the database, error-tolerant matching allows for the putative identification of a protein based on similarity to homologous proteins. A variety of software packages are available to perform this analysis.
• Software packages usually generate a report showing the identity (accession code) of each identified protein and its matching score, and provide a measure of the relative strength of the matching where multiple proteins are identified.
• A diagram of the matched peptides on the sequence of the identified protein is often used to show the sequence coverage (% of the protein detected as peptides). Where the POI is thought to be significantly smaller than the matched protein, the diagram may suggest whether the POI is an N- or C-terminal fragment of the identified protein.
De novo sequencing
The pattern of fragmentation of a peptide allows for direct determination of its sequence by de novo sequencing. This sequence may be used to match databases of protein sequences or to investigate post-translational or chemical modifications. 
It may provide additional evidence for protein identifications performed as above.
N- and C-termini
The peptides matched during protein identification do not necessarily include the N- or C-termini predicted for the matched protein. This may result from the N- or C-terminal peptides being difficult to identify by MS (e.g. being either too short or too long), being post-translationally modified (e.g. N-terminal acetylation) or genuinely differing from the prediction. Post-translational modifications or truncated termini may be identified by closer examination of the data (e.g. by de novo sequencing). A repeat digest using a protease of different specificity may also be useful.
Post-translational modifications
Whilst detailed comparison of the MS data with predictions based on the known protein sequence may be used to define post-translational modifications, targeted approaches to data acquisition may also be used. For instance, specific enrichment of phosphopeptides may assist in identifying phosphorylation sites in a protein. Alternative methods of peptide fragmentation in the mass spectrometer, such as ETD or ECD, may give complementary sequence information.
Whole-mass determination
The protein’s whole mass is the sum of the masses of its amino-acid residues plus the mass of a water molecule, adjusted for any post-translational modifications. Although proteins ionize less well than the peptides derived from them, a protein in solution may be subjected to ESI-MS and its mass measured to an accuracy of 1 part in 20,000 or better. This is often sufficient to confirm the termini (that is, to check that the protein’s measured mass matches that predicted from its sequence) and to infer the presence or absence of many post-translational modifications.
Limitations
Proteolysis does not always yield a set of readily analyzable peptides covering the entire sequence of the POI. 
The fragmentation of peptides in the mass spectrometer often does not yield ions corresponding to cleavage at each peptide bond. Thus, the deduced sequence for each peptide is not necessarily complete. The standard methods of fragmentation do not distinguish between leucine and isoleucine residues since they are isomeric.
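The in-silico proteolysis and mass calculation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the table holds standard monoisotopic residue masses rounded to five decimals, no modifications or missed cleavages are considered, and the trypsin rule is the simple one given in the text (cut after every Lys or Arg), ignoring refinements that real search software applies.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of one water molecule


def peptide_mass(seq):
    """Sum of the residue masses plus one water, as stated in the text."""
    return sum(RESIDUE_MASS[aa] for aa in seq.upper()) + WATER


def trypsin_digest(seq):
    """Cleave on the C-terminal side of every Lys (K) or Arg (R)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):          # trailing peptide without K/R
        peptides.append(seq[start:])
    return peptides


if __name__ == "__main__":
    for pep in trypsin_digest("MKWVTFISLLRAAG"):
        print(pep, round(peptide_mass(pep), 4))
    # Leucine and isoleucine are isomers, so the masses are identical.
    print(peptide_mass("LGK") == peptide_mass("IGK"))
```

The last line makes the limitation concrete: because L and I have the same entry in the mass table, no mass-only comparison can tell the two peptides apart.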
Because the Edman degradation proceeds from the N-terminus of the protein, it will not work if the N-terminus has been chemically modified (e.g. by acetylation or formation of pyroglutamic acid). Edman degradation is generally not useful for determining the positions of disulfide bridges. It also requires peptide amounts of 1 picomole or above for discernible results, making it less sensitive than mass spectrometry.
Introduction to sequence alignment
Sequence similarity search and sequence alignment
We can do a similarity search to learn whether our sequenced DNA can be found in a public nucleotide database (i.e. it has already been cloned by others) and/or whether it is evolutionarily related (i.e. homologous) to other sequences. In a simple similarity search, one can compare a sequence with the sequences found in an entire nucleotide database (see the BLAST program below), while for a homology search the method of choice is multiple sequence alignment by the ClustalW program. By comparing either nucleotide or amino acid sequences we can find homologs. If the homologs come from different species and descend from a single gene in their common ancestor, typically retaining identical or similar functions, they are called orthologs; homologs that are found in the same organism and originate from a gene duplication event followed by divergent evolution within the species are called paralogs. We will not cover the construction of evolutionary trees here; one can learn about these in bioinformatics or evolutionary biology courses.
The BLAST program
If we sequence a DNA clone, the first bioinformatics analysis is a similarity search against a nucleotide database. The most widely used similarity search program accessible on the internet is BLAST (Basic
  • 29. 28 Lecture notes in Bioinformatics Local Alignment Search Tool), which will be described here and will be used by the students during the laboratory practice. The BLAST program is available online at several servers including the one at NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi. BLAST uses a heuristic algorithm that makes it possible to search a huge database in a very short period of time by using a query sequence. The high speed of the algorithm stems from the fact that the query sequence is divided into short „words” that are used, instead of the full-length sequence, during the alignment process. These words are searched in the database first (called „seeding”, i.e. finding the best local alignments). The most relevant hits are then scored with the help of a scoring matrix, extended to neighbouring words, and finally assembled and compiled into a final list of similarity hits. It is important that the query sequences must be in the so-called FASTA format (FASTA was a previously popular but much slower similarity search program). The FASTA format is shown in Figure 11.10. Figure 11.10. The FASTA sequence format If we want to search using a nucleotide query sequence within a nucleotide database, we can use the BLASTN version of the program. If we have an amino acid sequence, we can search a protein database by the BLASTP version of the program. The BLASTX version of the program translates a nucleotide sequence in all six reading frames (three on each strand) and allows searching a protein database. Finally, with the TBLAST subprogram, we can search against a translated nucleotide database using either a protein (TBLASTN) or a nucleotide (TBLASTX) query sequence. These similarity search options are summarised in Figure 11.11. Figure 11.11. Search possibilities in the BLAST program The result of a BLAST analysis is a list a sequences from the searched database that show significant similarity to the query sequence. 
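To make the FASTA format and the six-frame translation performed by BLASTX more concrete, here is a minimal Python sketch. The function names (parse_fasta, six_frame_translations) and the example record are illustrative assumptions and are not part of BLAST. The codon table is the standard genetic code; details such as ambiguity codes and alternative genetic codes are ignored.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # a '>' line starts a new record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:          # sequence lines may be wrapped
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def six_frame_translations(dna):
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    example = ">clone_1 hypothetical example record\nATGG\nCC\n"
    for name, seq in parse_fasta(example):
        print(name, six_frame_translations(seq))
```

A record consists of a header line beginning with ">" followed by one or more sequence lines, which is why the parser joins wrapped lines before translating.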
Besides the sequence identifiers of the similar sequence hits in the database, the final list of alignments contains a score and a measure of statistical significance, the E-value. The E-value describes the number of hits one can expect to see by chance when searching a database of a particular size. It decreases exponentially as the score (S) of the match increases. Essentially, the E-value describes the random background noise. The lower the E-value, or the closer it is to zero, the more "significant" the match (E < 0.01 is usually considered to reflect a homologous, i.e. evolutionarily related, sequence).

The score is calculated from the alignment, taking into account the gaps and the similarity of the amino acids at the aligned positions. The most often used similarity matrix (an amino acid substitution matrix) is the BLOSUM (BLOcks SUbstitution Matrix) matrix. The numbers within a BLOSUM matrix are "log-odds" scores: for a pair of aligned amino acids, each score is the logarithm of the ratio between the likelihood that the pair is aligned for a biological reason and the likelihood that the same pair is aligned by chance.

The similarity hits can be found in and downloaded from the database using their accession number (identifier). BLAST hits are usually hyperlinked directly to the corresponding entries in the GenBank database, where we can learn much more about the related sequences, the gene, the cDNA and/or the encoded protein. As we have already mentioned, the most comprehensive information on a given protein can be found in the UniProt database. In Figure 11.12, a detail of a BLAST run is shown in which the BLASTP program was used to search the UniProt database using a human skeletal actin query sequence.

Figure: Result of a sequence similarity search by the BLAST program (human skeletal muscle actin was used as a query sequence against the UniProt database)

It is important to note that, since 3-D structure is more conserved than primary structure, it is easier to recognise two related proteins by comparing their three-dimensional structures than their amino acid sequences. Obviously, it is more convenient to compare primary sequences, since they are available for many more proteins than atomic-resolution structures are. Similarity searches and protein structure comparisons are dealt with in more detail in bioinformatics (or structural bioinformatics) courses.
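The two quantities discussed above, the log-odds score and the E-value, can be illustrated with a short Python sketch. It assumes the Karlin-Altschul form E = K·m·n·e^(−λS) for the expected number of chance hits, where m and n are the query and database lengths. The function names and all numeric inputs are illustrative assumptions; in practice λ and K depend on the scoring system and are reported by the search program.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Log-odds score for aligning residues i and j.

    q_ij  : observed frequency of the pair in trusted alignments
    p_i/j : background frequencies of the two residues
    scale : 2.0 gives half-bit units (an assumption for this sketch)
    """
    return round(scale * math.log2(q_ij / (p_i * p_j)))


def expected_hits(score, query_len, db_len, lam, k):
    """E-value: expected number of chance alignments scoring >= score."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    # A pair seen twice as often as chance predicts gets a positive score.
    print(log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1))
    # Illustrative parameters only; raising the score shrinks E exponentially.
    for s in (40, 60, 80):
        print(s, expected_hits(s, query_len=300, db_len=5_000_000, lam=0.3, k=0.1))
```

Pairs that occur more often than chance predicts receive positive scores, pairs that occur less often receive negative scores, and each additional unit of alignment score multiplies the E-value by the same factor e^(−λ).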
Figure: The wide range of in silico analysis possibilities of protein sequences. (Most of these options are also available for nucleic acid sequences.)

FASTA

FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.[1] Its legacy is the FASTA format, which is now ubiquitous in bioinformatics.

The original FASTP program was designed for protein sequence similarity searching. Because of the exponentially expanding body of genetic information and the limited speed and memory of computers in the 1980s, heuristic methods were introduced for aligning a query sequence to entire databases. FASTA (developed in 1988) added the ability to do DNA:DNA searches and translated protein:DNA searches, and also provided a more sophisticated shuffling program for evaluating statistical significance.[2] There are several programs in this package that allow the alignment of protein sequences and DNA sequences. Nowadays, increased computer performance makes it possible to search a database for local alignments using the Smith-Waterman algorithm.

Uses

FASTA is pronounced "fast A" and stands for "FAST-All", because it works with any alphabet; it is an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment. The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA (with frameshifts), and ordered or unordered peptide searches. Recent versions of the FASTA package include special translated search algorithms that correctly handle frameshift errors (which six-frame-translated searches do not handle very well) when comparing nucleotide to protein sequence data.

In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an implementation of the optimal Smith-Waterman algorithm. A major focus of the package is the calculation of accurate similarity statistics, so that biologists can judge whether an alignment is likely to have occurred by chance or whether it can be used to infer homology. The FASTA package is available from the University of Virginia[3] and the European Bioinformatics Institute (EBI).[4] The EBI also provides a web interface for submitting sequences to search its online databases with the FASTA programs.
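As a rough illustration of what an optimal local alignment computes, the following Python sketch scores two sequences with the Smith-Waterman recurrence. It is a simplified assumption-laden example, not the SSEARCH implementation: it uses a fixed match/mismatch score and a linear gap penalty (all three values are arbitrary choices), whereas real tools typically use a substitution matrix and more elaborate gap penalties, and it returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Uses two rows of the dynamic programming matrix, so memory grows
    with len(b) only. Every cell is floored at 0, which is what makes
    the alignment local rather than global.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared core "ACG" is found despite unrelated flanking residues.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Because every cell of the matrix is filled, the running time grows with the product of the two sequence lengths, which is why heuristic pre-filtering was attractive on 1980s hardware.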
The FASTA file format used as input for this software is now widely used by other sequence database search tools (such as BLAST) and sequence alignment programs (Clustal, T-Coffee, etc.).

FASTA takes a given nucleotide or amino acid sequence and searches a corresponding sequence database by using local sequence alignment to find matches of similar database sequences. The FASTA program follows a largely heuristic method, which contributes to its high speed of execution. It initially observes the pattern of word hits (word-to-word matches of a given length) and marks potential matches before performing a more time-consuming optimized search using a Smith-Waterman type of algorithm.

The size taken for a word, given by the parameter kmer, controls the sensitivity and speed of the program. Increasing the kmer value decreases the number of background hits that are found. From the word hits that are returned, the program looks for segments that contain a cluster of nearby hits. It then investigates these segments for a possible match.

There are some differences between fastn and fastp relating to the type of sequences used, but both use four steps and calculate three scores to describe and format the sequence similarity results. These are:

• Step 1: Identify regions of highest density in each sequence comparison, taking kmer to equal 1 or 2. In this step all or a group of the identities between two sequences are found using a look-up table. The kmer value determines how many consecutive identities are required for a match to be declared. Thus, the lower the kmer value, the more sensitive the search. kmer=2 is frequently chosen by users for protein sequences and kmer=4 or 6 for nucleotide sequences. Short oligonucleotides are usually run with kmer=1. The program then finds all similar local regions, represented as diagonals of a certain length in a dot plot, between the two sequences by counting kmer matches and penalizing for intervening mismatches. This way, local regions with the highest density of matches in a diagonal are isolated from background hits. For protein sequences, BLOSUM50 values are used for scoring kmer matches. This ensures that groups of identities with high similarity scores contribute more to the local diagonal score than identities with low similarity scores. Nucleotide sequences use the identity matrix for the same purpose. The best 10 local regions selected from all the diagonals put together are then saved.

• Step 2: Rescan the regions taken using the scoring matrices, trimming the ends of each region to include only those residues contributing to the highest score. The 10 regions saved in step 1 are rescanned, this time using the relevant scoring matrix, so that runs of identities shorter than the kmer value are allowed. Conservative replacements that contribute to the similarity score are also counted during rescoring. Although protein sequences use the BLOSUM50 matrix, scoring matrices based on the minimum number of base changes required for a specific replacement, on identities alone, or on an alternative measure of similarity such as PAM can also be used with the program. For each of the diagonal regions rescanned this way, a subregion with the maximum score is identified. The initial scores found in step 1 are used to rank the library sequences. The highest score is referred to as the init1 score.

• Step 3: In an alignment, if several initial regions with scores greater than a CUTOFF value are found, check whether the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the scores of the joined regions, with a penalty of 20 points for each gap. This initial similarity score (initn) is used to rank the library sequences. The score of the single best initial region found in step 2 is reported (init1). Here the program calculates an optimal alignment of initial regions as a combination of compatible regions with maximal score. This optimal alignment of initial regions can be rapidly calculated using a dynamic programming algorithm. The joining process increases sensitivity but decreases selectivity. A carefully calculated cut-off value is therefore used to control where this step is implemented; it is approximately one standard deviation above the average score expected from unrelated sequences in the library. A 200-residue query sequence with kmer=2 uses a value of 28.

• Step 4: Use a banded Smith-Waterman algorithm to calculate an optimal score for the alignment. This step creates an optimised score (opt) for each alignment of the query sequence to a database (library) sequence. It takes a band of 32 residues centered on the init1 region of step 2 for calculating the optimal alignment. After all sequences are searched, the program plots the initial scores of each database sequence in a histogram and calculates the statistical significance of the "opt" score. For protein sequences, the final alignment is produced using a full Smith-Waterman alignment. For DNA sequences, a banded alignment is provided.

FASTA cannot remove low-complexity regions before aligning the sequences, as is possible with BLAST. This can be problematic when the query sequence contains such regions, e.g. mini- or microsatellites that repeat the same short sequence many times. Because such repeats occur quite frequently, they raise the scores of unrelated database sequences that match the query only in these repeats. For this reason the program PRSS is included in the FASTA distribution package. PRSS shuffles the matching sequences in the database, either at the level of single letters or in short segments whose length the user can set. The shuffled sequences are then aligned again; if the score is still higher than expected, this is because the low-complexity regions, although mixed up, still map to the query. From the scores that the shuffled sequences still attain, PRSS can estimate the significance of the score of the original sequences. The higher the score of the shuffled sequences, the less significant the matches found between the original database sequence and the query sequence.[5]

The FASTA programs find regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases or by identifying local duplications within a sequence. Other programs provide information on the statistical significance of an alignment. Like BLAST, FASTA can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

Protein
• Protein–protein FASTA
• Protein–protein Smith–Waterman (ssearch)
• Global protein–protein (Needleman–Wunsch) (ggsearch)
• Global/local protein–protein (glsearch)
• Protein–protein with unordered peptides (fasts)
• Protein–protein with mixed peptide sequences (fastf)

Nucleotide
• Nucleotide–nucleotide (DNA/RNA fasta)
• Ordered nucleotides vs nucleotide (fastm)
• Unordered nucleotides vs nucleotide (fasts)

Translated
• Translated DNA (with frameshifts, e.g. ESTs) vs proteins (fastx/fasty)
• Protein vs translated DNA (with frameshifts) (tfastx/tfasty)
• Peptides vs translated DNA (tfasts)

Statistical significance
• Protein vs protein shuffle (prss)
• DNA vs DNA shuffle (prss)
• Translated DNA vs protein shuffle (prfx)

Local duplications
• Local protein alignments (lalign)
• Plot protein alignment "dot-plot" (plalign)
• Local DNA alignments (lalign)
• Plot DNA alignment "dot-plot" (plalign)
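Returning to the first step of the FASTA search described earlier, the look-up table and the idea of counting word hits per dot-plot diagonal can be sketched in Python as follows. This is a deliberately simplified illustration with hypothetical function names: it only counts exact kmer hits per diagonal, and it omits the mismatch penalties, the BLOSUM50 weighting and the later rescoring and joining steps.

```python
from collections import Counter, defaultdict


def kmer_lookup(query, k):
    """Map every word of length k in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(query, library_seq, k=2):
    """Count word hits per diagonal.

    A hit between query position i and library position j lies on
    diagonal j - i; many hits on one diagonal suggest an ungapped
    region of similarity.
    """
    table = kmer_lookup(query, k)
    hits = Counter()
    for j in range(len(library_seq) - k + 1):
        for i in table.get(library_seq[j:j + k], ()):
            hits[j - i] += 1
    return hits


def best_diagonals(query, library_seq, k=2, top=10):
    """Return up to `top` (diagonal, hit_count) pairs, densest first."""
    return diagonal_hits(query, library_seq, k).most_common(top)


if __name__ == "__main__":
    # The query occurs in the library sequence shifted by two positions.
    print(best_diagonals("ACGTACGT", "TTACGTACGTTT", k=2))
```

With a larger k fewer chance hits appear on the off-target diagonals, which reflects the trade-off between speed and sensitivity controlled by the word size.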