2. What is Bioinformatics
It is the discipline of quantitative analysis of information relating to biological macromolecules with
the aid of computers.
“Bioinformatics involves the technology that uses computers for storage, retrieval, manipulation,
and distribution of information related to biological macromolecules such as DNA, RNA, and
proteins.”
Bioinformatics is limited to sequence, structural, and functional analysis of genes and genomes and
their corresponding products and is often considered computational molecular biology.
3. Historical background
The development of bioinformatics as a field is the result of advances in both molecular
biology and computer science over the past 30–40 years.
The first major bioinformatics project was undertaken by Margaret Dayhoff in 1965, who
developed the first protein sequence database, called “Atlas of Protein Sequence and
Structure”.
In the 1970s, the Brookhaven National Laboratory established the “Protein Data Bank” for
archiving three-dimensional protein structures.
4. Historical background (continued)
The first sequence alignment algorithm was developed by Needleman and Wunsch in
1970. This was a fundamental step in the development of the field of bioinformatics.
The first protein structure prediction algorithm was developed by Chou and Fasman in
1974.
In the 1980s, GenBank was established and fast database searching algorithms were
developed, such as FASTA by William Pearson, followed in 1990 by BLAST by Stephen
Altschul and coworkers.
5. Our goals
The ultimate goal of bioinformatics is to better understand a living cell and how it
functions at the molecular level.
By analyzing raw molecular sequence and structural data, bioinformatics research can
generate new insights and provide a “global” perspective of the cell.
Analyzing sequence data is central to this goal because the flow of genetic information is
dictated by the “central dogma” of biology, in which DNA is transcribed to RNA, which is
translated to proteins.
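The flow from DNA to RNA to protein can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the input is the coding (sense) strand, so transcription is modeled as replacing T with U, and it uses the standard genetic code. The function names are chosen here for illustration only.

```python
# Build the standard genetic code from its compact 64-letter form.
# Codons are enumerated with bases in the order U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon ('*')."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[start:start + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    rna = transcribe("ATGGCCTAA")
    print(rna, translate(rna))
```

For the short example sequence, the start codon AUG gives methionine (M), GCC gives alanine (A), and UAA is a stop codon.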
6. Scope
Bioinformatics consists of two subfields: the development of computational tools and databases
and the application of these tools and databases in generating biological knowledge to better
understand living systems.
Tool development includes writing software for sequence, structural, and functional
analysis, as well as the construction and curation of biological databases.
These tools are used in three areas of genomic and molecular biological research: “Molecular
sequence analysis, Molecular structural analysis, and Molecular functional analysis.”
7. Points of discussion
Biological databases
Multiple Sequence alignment
Protein motifs and domain prediction
Structure prediction tools
Molecular phylogenetics
Structural Bioinformatics
Gene and Genomics
8. Biological databases
A database is a computerized archive used to store and organize data in such a way that
information can be retrieved easily via a variety of search criteria.
Depending on the type of data structure used, database management systems can be
classified into two types: “relational database management systems and object-oriented
database management systems”.
Based on their contents, biological databases can be roughly divided into three categories:
“primary databases, secondary databases, and specialized databases”.
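As a rough sketch of the relational model, the example below stores a few sequence records in an in-memory SQLite table and retrieves them by different search criteria. The table layout, the accession numbers (DEMO0001 and so on), and the sequences are hypothetical and exist only for illustration; real biological databases use far richer schemas.

```python
import sqlite3


def build_demo_database():
    """Create an in-memory relational database with a few made-up entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry ("
        "accession TEXT PRIMARY KEY, organism TEXT, molecule TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?, ?)",
        [
            ("DEMO0001", "Homo sapiens", "DNA", "ATGGCCTAA"),
            ("DEMO0002", "Drosophila melanogaster", "DNA", "ATGTTTTAG"),
            ("DEMO0003", "Homo sapiens", "protein", "MA"),
        ],
    )
    conn.commit()
    return conn


def find_entries(conn, organism=None, molecule=None):
    """Return accession numbers matching the optional search criteria."""
    query = "SELECT accession FROM entry WHERE 1 = 1"
    params = []
    if organism is not None:
        query += " AND organism = ?"
        params.append(organism)
    if molecule is not None:
        query += " AND molecule = ?"
        params.append(molecule)
    query += " ORDER BY accession"
    return [row[0] for row in conn.execute(query, params)]


if __name__ == "__main__":
    db = build_demo_database()
    print(find_entries(db, organism="Homo sapiens"))
    print(find_entries(db, organism="Homo sapiens", molecule="protein"))
```

The same table can be queried by organism, by molecule type, or by both, which is the "variety of search criteria" that a database is meant to support.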
10. Biological databases
Primary databases contain original biological data; they archive raw sequence and structural
data. Examples are GenBank, the European Molecular Biology Laboratory (EMBL) database, the DNA
Data Bank of Japan (DDBJ), and the Protein Data Bank (PDB).
Secondary databases contain computationally processed or manually curated information.
Examples are SWISS-Prot and the Protein Information Resource (PIR).
Specialized databases normally serve a specific research community or focus on a particular
organism. Examples are FlyBase, WormBase, AceDB, TAIR, and the Microarray Gene Expression
Database at the European Bioinformatics Institute (EBI).
11. NCBI
The National Center for Biotechnology Information
advances science and health by providing access to
biomedical and genomic information.
Protein sequences derived from the same DNA
sequences are explicitly linked as related entries.
Sequence variants from the same organism with very
minor differences, which may well be caused by
sequencing errors, are treated as distinctly related
entries.
12. Major Biological databases
Database     Brief summary                            URL
EMBL         Nucleotide database (Europe)             www.ebi.ac.uk/embl
ExPASy       Proteomics database                      http://us.expasy.org/
OMIM         Genetic information on human diseases    www.omim.org
PubMed       Biomedical literature information        www.ncbi.nlm.nih.gov/PubMed
SWISS-Prot   Curated protein sequence database        www.ebi.ac.uk/swissprot/acces
TAIR         Arabidopsis information database         www.Arabidopsis.org
FlyBase      Database of the Drosophila genome        flybase.bio.Indiana.edu
13. Sequence alignment
Sequence comparison lies at the heart of bioinformatics analysis. It is an important first
step toward structural and functional analysis of newly determined sequences.
The most fundamental process in this type of comparison is sequence alignment. This is
the process by which sequences are compared by searching for common character patterns
and establishing residue–residue correspondence among related sequences.
Pairwise sequence alignment is the process of aligning two sequences and is the basis of
database similarity searching and multiple sequence alignment.
14. Sequence alignment (continued)
In global alignment, two sequences to be aligned are assumed to be generally similar over their entire
length. Alignment is carried out from beginning to end of both sequences to find the best possible
alignment across the entire length between the two sequences.
Local alignment finds only the local regions with the highest level of similarity between the two
sequences and aligns these regions without regard for the alignment of the rest of the sequence regions.
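The difference between global and local alignment can be illustrated with a small dynamic programming sketch. The global version follows the Needleman-Wunsch approach; the local version follows the Smith-Waterman approach, which is not named in these slides and is brought in here for comparison. The scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration, and only the optimal score is computed, not the alignment itself.

```python
def _fill_matrix(a, b, match, mismatch, gap, local):
    """Fill the dynamic programming score matrix for sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized along the borders.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            best = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in sequence b
                score[i][j - 1] + gap,       # gap in sequence a
            )
            # Local alignment: a cell never drops below zero, so a new
            # alignment can start anywhere.
            score[i][j] = max(best, 0) if local else best
    return score


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning a and b over their entire length."""
    return _fill_matrix(a, b, match, mismatch, gap, local=False)[-1][-1]


def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all pairs of local regions of a and b."""
    matrix = _fill_matrix(a, b, match, mismatch, gap, local=True)
    return max(max(row) for row in matrix)


if __name__ == "__main__":
    x, y = "AAAGATTACATTT", "CCCGATTACAGGG"
    print(global_alignment_score(x, y), local_alignment_score(x, y))
```

In the example, the two sequences share the region GATTACA but differ at both ends. The local score reflects only the shared region, while the global score is lowered by the dissimilar flanks.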
15. Motifs and domains
A motif is a short conserved sequence pattern associated with distinct functions of a protein or DNA.
It is often associated with a distinct structural site performing a particular function. A domain is also
a conserved sequence pattern, defined as an independent functional and structural unit. Domains are
normally longer than motifs.
Because of evolutionary divergence, functional relationships between proteins often cannot be
distinguished through simple BLAST or FASTA database searches.
Motifs and domains in a protein sequence can be identified using databases such as
PROSITE, Pfam, PRINTS, ProDom, CDART, and SMART.
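A short motif of the kind stored in PROSITE can be searched for with a regular expression. The sketch below converts a simple PROSITE-style pattern to Python regex syntax and scans a sequence for it. It handles only the basic elements (single residues, `x`, `[..]`, `{..}`, and repeat counts) and ignores anchors, so it should be read as an illustration rather than a full parser. The example pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern; the peptide sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        repeat = ""
        if "(" in element:
            # x(2,4) means two to four repeats of the preceding element.
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # a residue or a [..] set
        parts.append(core + repeat)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("AANGSAANPSA", "N-{P}-[ST]-{P}"))
```

In the made-up peptide, the first asparagine (N) is followed by G, S, A and matches the pattern; the second is followed by proline (P) and is rejected.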
16. Sequence logos
A multiple sequence alignment or a motif is often represented by a graphic representation called
a logo.
In a logo, each position consists of stacked letters representing the residues appearing in a
particular column of a multiple alignment.
The overall height of a logo position reflects how conserved the position is, and the height of
each letter in a position reflects the relative frequency of the residue in the alignment.
Conserved positions have fewer residues and bigger symbols, whereas less conserved positions
have a more heterogeneous mixture of smaller symbols stacked together.
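The heights in a logo are commonly computed from the information content of each alignment column, measured in bits. The sketch below assumes the usual formulation, in which the column height is log2 of the alphabet size minus the Shannon entropy of the column, and each letter's height is its frequency times the column height. It omits the small-sample correction used by some logo programs and assumes the column contains no gap characters.

```python
import math
from collections import Counter


def column_heights(column, alphabet_size=4):
    """Return (total height in bits, per-residue heights) for one column.

    alphabet_size is 4 for DNA or RNA and 20 for proteins.
    """
    total = len(column)
    freqs = {res: n / total for res, n in Counter(column).items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return information, {res: f * information for res, f in freqs.items()}


if __name__ == "__main__":
    # Columns taken from a hypothetical DNA alignment of four sequences.
    for col in ("AAAA", "AAAT", "ACGT"):
        height, letters = column_heights(col)
        print(col, round(height, 3), letters)
```

A fully conserved DNA column reaches the maximum of 2 bits and shows one tall letter, while a column with all four bases in equal proportion has a height of 0 bits.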
18. Structural prediction tools
3-D models, visualization, chemical interactions, structural annotation, modification after mutation,
and comparison between wild-type and mutant amino acids can be obtained with Swiss-PdbViewer.
CPHmodels-3.3 is a protein homology modeling server. Template alignment is based on profile
alignment guided by secondary structure and exposure predictions.
Homology modeling, also known as comparative modeling of protein, refers to constructing an
atomic-resolution model of the "target" protein from its amino acid sequence and an experimental
three-dimensional structure of a related homologous protein (the "template").
19. Heterozygous mutations in GJB6 and GJB2
The heterozygous mutations A88V and V27I (Val27Ile), in GJB6 and GJB2, were found to cause
Clouston syndrome (CS), photophobia, and mild sensorineural hearing loss (SNHL), respectively.
Figure: Swiss-PdbViewer.
20. Structural prediction tools (continued)
MEMSAT-SVM: improved transmembrane protein topology prediction using SVMs. This
method is capable of differentiating signal peptides from transmembrane helices.
MEMEMBED: prediction of membrane protein orientation.
Related prediction tools include PSIPRED v3.3 (secondary structure prediction), DISOPRED3
(disorder prediction), pGenTHREADER (profile-based fold recognition), MEMSAT3 and
MEMSAT-SVM (membrane helix prediction), BioSerf v2.0 (automated homology modelling),
DomPred (protein domain prediction), and FFPred 3 (eukaryotic function prediction).
22. Phylogenetic analysis
Molecular phylogenetics is a fundamental aspect of bioinformatics. Biological sequence
analysis is founded on solid evolutionary principles. The tree branching patterns
representing the evolutionary divergence are referred to as phylogeny.
Similarities and divergence among related biological sequences revealed by sequence
alignment often have to be rationalized and visualized in the context of phylogenetic trees.
A gene phylogeny (phylogeny inferred from a gene or protein sequence) only describes
the evolution of that particular gene or encoded protein.
23. Phylogenetic trees and taxonomy
A phylogenetic tree can be either rooted or unrooted. An unrooted phylogenetic tree does
not assume knowledge of a common ancestor, but only positions the taxa to show their
relative relationships.
In a rooted tree, all the sequences under study have a common ancestor or root node from
which a unique evolutionary path leads to all other nodes.
The topology of branches in a tree defines the relationships between the taxa. The trees
can be drawn in different ways, such as a cladogram (which shows only the relative
ordering of taxa) or a phylogram (in which branch lengths represent evolutionary divergence).
24. Phylogenetic tools
In bioinformatics, many computational tools are used for phylogenetic analysis.
The Jukes-Cantor model and the Kimura model are basic nucleotide substitution models,
used to estimate evolutionary distances between sequences.
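As a sketch of how the Jukes-Cantor and Kimura models are applied, the code below estimates the evolutionary distance between two aligned DNA sequences. It assumes the standard formulas: for Jukes-Cantor, d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites; for the Kimura two-parameter model, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. Columns containing a gap are skipped, and the function names and example sequences are illustrative only.

```python
import math

# Transitions are purine-purine (A<->G) or pyrimidine-pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def count_differences(seq1, seq2):
    """Return (compared sites, transitions, transversions) for aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == "-" or y == "-":
            continue  # ignore alignment columns that contain a gap
        sites += 1
        if x == y:
            continue
        if (x, y) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    return sites, transitions, transversions


def jukes_cantor_distance(seq1, seq2):
    """Jukes-Cantor distance: all substitutions are assumed equally likely."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = (ts + tv) / sites
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_distance(seq1, seq2):
    """Kimura two-parameter distance: transitions and transversions differ in rate."""
    sites, ts, tv = count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


if __name__ == "__main__":
    a, b = "ACGTACGTAC", "ACGTACGTAT"
    print(jukes_cantor_distance(a, b), kimura_distance(a, b))
```

Both corrected distances are slightly larger than the observed proportion of differences (0.1 in the example) because they account for multiple substitutions at the same site. The formulas are undefined for very divergent sequences, where the argument of the logarithm is zero or negative.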
Phylogeny.fr is a simple to use web service dedicated to reconstructing and analyzing
phylogenetic relationships between molecular sequences.
It includes multiple alignment (MUSCLE, T-Coffee, ClustalW, ProbCons), phylogeny
(PhyML, MrBayes, TNT, BioNJ), tree viewer (Drawgram, Drawtree, ATV) and utility
programs.
26. Protein structure visualization, comparison, and classification
The main feature of computer visualization programs is interactivity, which
allows users to visually manipulate the structural images through a graphical user
interface.
At the touch of a mouse button, a user can move, rotate, and zoom an atomic
model on a computer screen in real time, or examine any portion of the structure
in great detail, as well as draw it in various forms in different colors.
27. Molecular graphics
Figure: molecular graphics generated by (A) RasMol, (B) Molscript, (C) Ribbons, and (D) GRASP.
Figure: molecular structure visualization forms: (A) wireframes, (B) balls and sticks,
(C) space-filling spheres, and (D) ribbons.
Results interpreted by Protein Data Bank Japan.
28. Genomics and Proteomics
Genomics is the study of genomes. Genomic studies are characterized by simultaneous
analysis of a large number of genes using automated data gathering tools. It ranges from
“genome mapping, sequencing, and functional genomic analysis to comparative
genomic analysis”.
Structural genomics refers to the initial phase of genome analysis, which includes
“construction of genetic and physical maps of a genome, identification of genes,
annotation of gene features, and comparison of genome structures”.
29. Genome Mapping
It is the process of identifying relative locations of genes,
mutations or traits on a chromosome. A low-resolution
approach to mapping genomes is to describe the order and
relative distances of genetic markers on a chromosome.
Genetic markers are identifiable portions of a
chromosome whose inheritance patterns can be
followed.
30. Genome Mapping (continued)
Cytological maps refer to the light and dark bands that can be
visualized on stained chromosomes under a microscope.
Genetic maps identify the relative positions of genetic markers on a
chromosome, based on how frequently the markers are inherited
together.
Physical maps give the locations of identifiable landmarks on a
chromosome, with distances measured in physical units (kb or Mb).
They are made by the chromosome walking technique, which uses a
number of radiolabeled probes hybridized to a library of cloned DNA
fragments. When cloned fragments are found to overlap because they
share common probes, an order is established.
31. Genome Sequencing
DNA sequencing can be considered as a type of physical map describing a genome at the single base-
pair level. There are two major strategies for whole genome sequencing: the shotgun approach and
the hierarchical approach.
DNA sequencing is now routinely carried out using the Sanger method. This involves the use of DNA
polymerases to synthesize DNA chains of varying lengths. The DNA synthesis is stopped by adding
dideoxynucleotides.
The dideoxynucleotides, which are labeled with fluorescent dyes, terminate DNA synthesis at
positions corresponding to each of the four bases, resulting in nested fragments that vary in length by a single base.
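The idea of nested fragments can be illustrated with a highly simplified model. The sketch assumes that chain termination yields one fragment ending at every position of the newly synthesized strand, and that the dye on each fragment identifies its last base. Sorting the fragments by length, as electrophoresis does, and reading the terminal base of each one then recovers the sequence. Real data involve signal processing and base calling that are not modeled here.

```python
def chain_terminated_fragments(synthesized):
    """Return every prefix of the synthesized strand, one per termination site."""
    return [synthesized[:end] for end in range(1, len(synthesized) + 1)]


def read_sequence(fragments):
    """Order fragments by length and read the dye-labeled terminal base of each."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


if __name__ == "__main__":
    fragments = chain_terminated_fragments("GATTACA")
    print(fragments)
    # The fragments come out of the reaction unordered; sorting restores the order.
    print(read_sequence(list(reversed(fragments))))
```

Because consecutive fragments differ in length by exactly one base, each position in the sorted list corresponds to one position in the sequence.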
32. Genome Annotation
The genome annotation process provides comments for the features of genes with the help
of gene prediction and functional assignment. Gene annotation of the human genome
employs a combination of theoretical prediction and experimental verification.
In bioinformatics, GeneQuiz (http://jura.ebi.ac.uk:8765/ext-genequiz/) is a web server
used for protein sequence annotation. The predictions are verified by BLAST searches
against a sequence database and then further compared with experimentally determined
cDNA and expressed sequence tag (EST) sequences. All predictions are manually checked
by human curators.
34. What is Gene Ontology?
A problem arises when using existing literature because the description of a gene function
uses natural language, which is often ambiguous and imprecise.
Researchers working on different organisms tend to apply different terms to the same type
of genes or proteins. Alternatively, the same terminology used in different organisms may
actually refer to different genes or proteins. Therefore, there is a need to standardize
protein functional descriptions.
This demand has spurred the development of the gene ontology, which uses a limited
vocabulary to describe molecular functions, biological processes, and cellular
components.
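The value of a controlled vocabulary can be shown with a toy lookup. The three Gene Ontology identifiers below are real terms, one from each of the three categories, but the table of free-text synonyms is hypothetical and written only for this example. The sketch maps differently worded descriptions onto the same standardized term.

```python
# A tiny excerpt of Gene Ontology terms: identifier -> (name, category).
GO_TERMS = {
    "GO:0003677": ("DNA binding", "molecular_function"),
    "GO:0006355": ("regulation of DNA-templated transcription", "biological_process"),
    "GO:0005634": ("nucleus", "cellular_component"),
}

# Hypothetical free-text descriptions as they might appear in the literature.
SYNONYMS = {
    "binds dna": "GO:0003677",
    "dna-binding activity": "GO:0003677",
    "transcriptional regulation": "GO:0006355",
    "nuclear localization": "GO:0005634",
    "found in the nucleus": "GO:0005634",
}


def standardize(description):
    """Map a free-text description to a GO identifier, or None if unknown."""
    return SYNONYMS.get(description.strip().lower())


def describe(go_id):
    """Return the (name, category) pair for a GO identifier."""
    return GO_TERMS[go_id]


if __name__ == "__main__":
    for text in ("Binds DNA", "DNA-binding activity", "found in the nucleus"):
        go_id = standardize(text)
        print(text, "->", go_id, describe(go_id))
```

Two differently worded descriptions resolve to the same identifier, so annotations from different research groups become directly comparable.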
35. Online Genomics
MUMmer (Maximal Unique Match, www.tigr.org/tigr-scripts/CMR2/webmum/mumplot) is a free UNIX program from
TIGR for alignment of two entire genome sequences and comparison of the locations of orthologues.
BLASTZ (http://bio.cse.psu.edu/) is a UNIX program modified from BLAST to do pairwise alignment of very large
genomic DNA sequences.
LAGAN (Limited Area Global Alignment of Nucleotides; http://lagan.stanford.edu) is a web-based program designed
for pairwise alignment of large genomes.
PipMaker (http://bio.cse.psu.edu/cgi-bin/pipmaker) is a web server using the BLASTZ heuristic method to find similar
regions in two DNA sequences.
MAVID (http://baboon.math.berkeley.edu/mavid/) is a web-based program for aligning multiple large DNA sequences.
36. Relational Sciences
Research in biotechnology, especially that involving sequence data management and drug design, has progressed rapidly
owing to the development of bioinformatics.
The Bioinformatics major offers a rigorous, interdisciplinary training in the new and rapidly evolving field with a strong focus
on chemistry and biochemistry.
In biophysics, high-level computational techniques and computer modeling are used to address biological problems and to
model molecular aspects of living cells.
Bioinformatics provides a comprehensive framework for connecting and integrating information derived from mathematical and
statistical methods and applying it to the understanding of biological sequences, structures, and networks.
The methodologies of bioinformatics for large-scale data analysis are providing exciting opportunities to
understand microbial communities.