Bioinformatics

1. Introduction: Importance and Scope

IMPORTANCE
• Bioinformatics is an interdisciplinary subject in which three subjects,
Biology, Computer Science and Information Technology, combine to form
a new discipline.
OR
• Bioinformatics is a branch of biology that deals with the fast,
accurate and logical analysis of biological data and information for
interpretation and prediction, making use of computational
techniques. (Margaret Dayhoff)
DEFINITION
• Bioinformatics, n. The science of information and information flow
in biological systems, esp. of the use of computational methods in
genetics and genomics. (Oxford English Dictionary)
• "The mathematical, statistical and computing methods that aim to solve
biological problems using DNA and amino acid sequences and related
information." -- Fredj Tekaia

SCOPE
1) Better documentation: large quantities of data can be stored, and
data can be added, documented and deleted.
a) Design and discovery of drugs, considering the genomic structure
of pathogens and the chemical structure of drugs.
b) Study of the important biomolecules, protein and nucleic acid.
PROTEIN: structural and functional unit.
NUCLEIC ACID: hereditary material.
c) Comparison based on the already available details of proteins
and nucleic acids.
2) Information is very easy to search and access.
3) Fast, accurate and logical analysis.
4) Interpretation and prediction.

Applications
1) Comparison
• Comparison of nucleic acid and protein sequences.
• It reveals the similarities and differences between protein sequences
and between nucleic acid sequences.
• Two types of analysis are possible:
1) Structural analysis: gives structural details
2) Functional analysis: gives functional details
• Molecular-level classification of organisms is possible using bioinformatic tools.
• Organisms are classified by comparing the similarities and differences of their protein
and nucleic acid sequences, which reveals the relationships among them.
• Classical taxonomy relies only on morphological and enzymatic analysis and comparison;
accurate classification requires analysis at the molecular level.
• Comparison of protein and nucleic acid sequences helps in the
Classification of proteins
Classification of nucleic acids
Classification of individuals
Study of evolutionary relationships between organisms
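
To make the idea of sequence comparison concrete, the sketch below computes the percent identity of two sequences that are assumed to be already aligned and of equal length. The function name `percent_identity` and the use of "-" as the gap symbol are choices made for this illustration, not part of any particular tool.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which both sequences carry the same residue.

    Assumes the two sequences are already aligned (equal length, '-' for gaps).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        return 0.0
    # Count positions where the residues agree; a gap never counts as a match.
    matches = sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a == b and a != "-"
    )
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))  # 3 of 4 positions agree
```

A higher percent identity suggests a closer relationship between the two sequences, which is the basis of the molecular classification described above.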

2) Gene finding
• Bioinformatics makes gene finding easy.
• mRNA, a nucleic acid, is the expression product of genes.
• Determining nucleic acid sequences helps to identify the gene
responsible for a particular character. Eg: the gene responsible for yield
improvement
• Gene finding has applications in crop improvement, such as
resistance to insects, disease, drought and salinity, and higher yield.
• In the agricultural and medical fields it is useful for comparing a normal
individual with a diseased one.
• In the medical field it is used to find the gene responsible for a genetic
disorder, by comparing normal with diseased individuals, and to rectify
it at the embryo or patient level.
# Embryo therapy: rectification at the embryo level or in the sperm/egg
# Patient therapy: rectification in particular cells or nucleic acids
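
As a rough illustration of computational gene finding, the sketch below scans the three forward reading frames of a DNA string for open reading frames (a start codon ATG followed in frame by a stop codon). Real gene finders use far richer models; the function name `find_orfs` and the `min_codons` cut-off are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, sequence) for each ORF on the forward strand.

    An ORF here starts at ATG and ends at the first in-frame stop codon;
    `min_codons` is the minimum number of codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # close it and keep scanning this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTGGGTAACC"):
        print(orf)
```

The reverse strand is ignored here for brevity; a fuller sketch would also scan the reverse complement.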

3) Protein structure prediction
• A protein's structure is predicted by comparison with a protein structure database.
• Knowing the protein structure helps to work out its activities and its
influence on the physiological and metabolic pathways of an organism, and
hence on the growth of the organism.
• Disease pathways can be worked out by identifying the defective protein and
the defective gene.
• Identifying the protein-coding gene helps in curing genetic disorders.
• NMR and X-ray diffraction techniques are used to determine
protein structure, but they are very expensive and time-consuming methods.
• Instead of these two methods, bioinformatics can be applied; it is easy, less
expensive and time-saving.
• Very little time is required for structure prediction.
• Novel proteins can be discovered using bioinformatics instead of NMR
and X-ray diffraction; such proteins are used in several fields, such as drug
discovery and pharmaceuticals.
• Knowing the protein structure, we can synthesise biologically valuable
synthetic enzymes.

4) Evolutionary relationship study
• By structural genomics, functional genomics and comparative
genomics.
5) Construction of biological databases
• The construction of databases is part of better
documentation.
• Different types of databases exist, depending on the type and kind
of information they hold.
DATABASE: an area or space where information is stored in
electronic format. Databases differ in the information they
contain (information about proteins or nucleic
acids). Eg: EMBL, GenBank
6) Total genomic structural study of an organism
• Helps in species identification.

7) Used in environmental clean-up programmes
By gene finding: scope for bioremediation. Eg: in oil spills we
use Pseudomonas putida to reduce the effect of the
hydrocarbons in the oil.
Plasmid genes degrade the hydrocarbons, leading to degradation of the oil.
Organisms useful for bioremediation can be improved and modified.
8) Creation of bioweapons
By gene finding, disease-causing microorganisms can be
identified, and in the near future these could be
used as weapons.

a) Nucleic Acid Databases
EMBL, GenBank – Structure of
GenBank entries. Specialized
genomic resources. UniGene

EMBL
• Nucleotide sequence database
• It is developed by EBI (European Bioinformatics Institute, in the UK)
• European Molecular Biology Laboratory (EMBL)
• It collects information from different sources such as
* Genome sequencing projects
* Scientific literature
* Direct author submission
• It exchanges information with GenBank and DDBJ, so each holds a
comprehensive collection of information.
• It grows very fast, doubling its information every 9-10 months.
• It is divided into many subdivisions.
• The Laboratory operates from six sites: the main laboratory in Heidelberg, and outstations
in Hinxton (the European Bioinformatics Institute (EBI), in
England), Grenoble (France), Hamburg (Germany), Rome (Italy) and Barcelona (Spain).
• EMBL groups and laboratories perform basic research in molecular biology and molecular medicine
as well as training for scientists, students and visitors.
• Information is accessed through the SRS system. SRS: SEQUENCE RETRIEVAL SYSTEM
• The first systematic genetic analysis of embryonic development in the fruit fly was conducted
at EMBL by Christiane Nüsslein-Volhard and Eric Wieschaus, for which they were awarded
the Nobel Prize in Physiology or Medicine in 1995.
• In the early 1980s, Jacques Dubochet and his team at EMBL developed cryogenic electron
microscopy for biological structures. It was rewarded with the 2017 Nobel Prize in Chemistry.
• URL address (Uniform Resource Locator): http://www.ebi.ac.uk/embl/

GENBANK
• It is a primary nucleotide sequence database.
• Name: GenBank
• Developed by NCBI (National Center for Biotechnology Information)
• Few restrictions on access
• AIM: to support the research activity of the scientific and research community
by providing information without restrictions, except for copyrighted
and patented sequences.
• Growth rate: according to NCBI, the number of bases has doubled approximately every 18 months.
• The information is divided into 17 divisions so that it can be retrieved
conveniently and efficiently.
• 2 retrieval systems:
1) Entrez integrated retrieval system: it links the
nucleotide sequence database with the protein sequence database.
2) MEDLINE facility: gives access to the abstracts of the originally
published papers related to the nucleotide sequences.
• http://www.ncbi.nlm.nih.gov/genbank

GenBank incorporates information from
# publicly available sources
# direct author submissions (the primary source)
# large-scale sequencing projects
To help ensure comprehensive coverage, the resource
exchanges data with both the EMBL data library and DDBJ.

Structural Entities
The Structure of GenBank Entries
A GenBank release includes the sequence files, indices created
on various database fields and information derived from the
database.
GenBank was formerly made available on CD-ROM, a convenient and
relatively inexpensive mechanism for widespread distribution.
As the database grew, a large number of CDs was required, which
became difficult to handle for both producers and users.
Today GenBank is available via FTP.
The most commonly used file is the sequence entry file, which contains the
sequence itself and descriptive information relating to it.
Each entry consists of a number of keywords, relevant associated
sub-keywords and an optional feature table.

The structure of GenBank entries consists of 13 structural
components:
1) LOCUS
2) DEFINITION
3) ACCESSION NUMBER
4) VERSION
5) KEYWORDS
6) SOURCE
7) ORGANISM
8) REFERENCE
9) AUTHOR
10) TITLE
11) JOURNAL
12) PUBMED NO
13) REMARK/COMMENT

1) LOCUS: an entry number that identifies the nucleotide sequence, together with the
type of sequence and the date of entry.
[NM_000555 - mRNA - Tuesday, 21.7.2018]
(entry no.) (type of sequence) (day) (day.month.year)
2) DEFINITION: a brief description of the sequence, including the scientific name of the
source organism and the expression product.
Eg for the Bt gene: Bacillus thuringiensis, mRNA, δ-endotoxin.
3) ACCESSION NUMBER: normally similar to the entry number.
[NM_000555]
4) VERSION: when the information is updated, the entry number is written with the version
number, along with the gene information (GI) identifier.
[NM_000555.5 GI: 12345]
5) KEYWORDS: the keywords of the work; if there are no keywords, a dot is entered.
Eg: Insect resistance.
6) SOURCE: the common name of the source organism.
Source organism: Bacteria
7) ORGANISM: the scientific name of the source organism.
Scientific name of the source organism: Bacillus thuringiensis

8) REFERENCE: the reference to the published paper related to
the nucleotide sequence of interest.
9) AUTHOR: the names of the authors, in the same
order as in the published paper.
10) TITLE: the title of the paper.
11) JOURNAL: the name of the journal in which the paper
was published.
12) PUBMED NO: the number that gives access to the
archived published paper within PubMed (an archive of
scientific journal literature).
13) REMARK/COMMENT: the biological
importance, expression, changes or source organism can be entered as a
comment.
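
The keyword layout described above lends itself to simple parsing. The sketch below reads the header keywords of a GenBank-style flat-file record into a dictionary. The record in `SAMPLE_RECORD` is a made-up illustration built from the example values in these notes (it is not a real database entry), and `parse_genbank_header` is a hypothetical helper, not part of any NCBI tool. It assumes that keywords are upper-case words indented by at most three spaces and that continuation lines are indented further.

```python
import re

# Hypothetical record, for illustration only (not a real GenBank entry).
SAMPLE_RECORD = """\
LOCUS       NM_000555 1500 bp mRNA linear BCT 21-JUL-2018
DEFINITION  Bacillus thuringiensis endotoxin gene, mRNA.
ACCESSION   NM_000555
VERSION     NM_000555.5
KEYWORDS    .
SOURCE      Bacillus thuringiensis
  ORGANISM  Bacillus thuringiensis
            Bacteria; Firmicutes.
REFERENCE   1
  AUTHORS   Doe,J. and Roe,R.
  TITLE     A hypothetical title
  JOURNAL   Unpublished
   PUBMED   12345
COMMENT     Illustrative record only.
"""

KEYWORD_LINE = re.compile(r"^ {0,3}([A-Z]+) +(.*)$")


def parse_genbank_header(text: str) -> dict:
    """Map each header keyword (LOCUS, DEFINITION, ...) to its text value."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # the header ends where the feature table or sequence begins
        m = KEYWORD_LINE.match(line)
        if m:
            key, value = m.group(1), m.group(2).strip()
            fields[key] = value
        elif key is not None:
            # Deeply indented line: continuation of the previous keyword.
            fields[key] += " " + line.strip()
    return fields


if __name__ == "__main__":
    for name, value in parse_genbank_header(SAMPLE_RECORD).items():
        print(f"{name}: {value}")
```

A repeated keyword (for example a second REFERENCE) would overwrite the first in this simplified version; a fuller parser would collect such fields in lists.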

Specialized genomic resources
• The purpose of specialized resources is to focus on
species-specific genomics and on particular sequencing techniques.
The particular aim of such a database is to give an integrated view
of a particular biological system.
a) UniGene
* The collection represents genes from many organisms, with
each cluster relating to a unique gene and including related
information corresponding to that gene.
* A valuable role of UniGene is in gene discovery.
* UniGene is also used for gene mapping projects and large-scale
gene expression analysis.

b) TDB: The TIGR Database
* These databases contain DNA and protein sequences, gene
expression data, protein family information etc.
* They also hold data such as the taxonomic range of plants and humans and the
roles of cellular components.
c) SGD (Saccharomyces Genome Database)
* SGD is an online data resource that contains information on the
molecular biology and genetics of S. cerevisiae (budding yeast).
* The database provides internet access to the genome, its genes
and their products etc.
* SGD helps research by bringing together functions and
sequence similarity search tools.
* Genetic maps illustrated with dynamically created
graphical displays make the database user friendly.

UniGene
It is a specialized genomic resource.
Specialized genomic resources are databases that tend to be linked, to some
extent, with the primary DNA databases, from which they
may derive their data and into which their results are usually
fed.
Purpose of specialized genomic resources: to focus on
1) species-specific genomics
2) particular sequencing techniques
The primary goal of the Human Genome Project is to determine the
complete sequence of the human genome (3 billion base pairs).
3% of the genome encodes protein.
The biological significance of the remainder is unknown.

A transcript map is a vital resource for flagging those parts of
the genome that are actually expressed.
UniGene attempts to provide a transcript map by utilising
sets of non-redundant gene-oriented clusters derived from
GenBank sequences.
The collection represents genes from many organisms, with each
cluster relating to a unique gene and including related
information, such as the tissue types in which the gene is expressed,
map location etc.

b) Protein Sequence Databases
PIR
SWISS-PROT
TrEMBL
Composite Protein Databases
NRDB
OWL
Secondary Databases
PROSITE
PRINTS
BLOCKS
IDENTIFY

SWISS-PROT
• Protein sequence database
• Switzerland-based database.
• SWISS-PROT is an annotated protein sequence database established in
1986 and maintained collaboratively, since 1987, by the Department of
Medical Biochemistry of the University of Geneva and the EMBL Data
Library.
• It is a curated protein sequence database, which strives to provide a high
level of annotation (such as the description of the function of a protein,
its domain structure, post-translational modifications, variants, source and
organisms), a minimal level of redundancy, and a high level of integration
with other databases.
• SWISS-PROT contains information about the name and origin of the
protein, protein attributes, general information, ontologies, sequence
annotation, amino acid sequence, bibliographic references,
cross-references with sequence, structure and interaction databases, and entry
information.

• It is now maintained collaboratively by the Swiss Institute of
Bioinformatics (SIB) and the European Bioinformatics
Institute (EBI).
• The SWISS-PROT group is headed by Rolf Apweiler.
• It contains non-redundant sequence entries, and the
information is thoroughly reviewed and annotated.
• It provides protein sequences to students, researchers and
related industries such as the pharmaceutical industry.
• SWISS-PROT aims to be minimally redundant and is
interlinked with many other resources.
• It is linked with other databases, such as EMBL and TrEMBL.

TrEMBL
• It is a primary protein sequence database
• Translated EMBL
• A protein sequence database of translated nucleotide sequences.
• Created in 1996 as a computer-annotated supplement to SWISS-PROT
• Its entries are annotated by computer rather than manually.
• The database is constructed by translating each nucleotide
sequence available in EMBL into a protein sequence
using computational techniques.
• The TrEMBL sequence database contains the translations of all
coding sequences (CDS) present in the DDBJ/EMBL/GenBank
Nucleotide Sequence Database and also protein sequences
extracted from the literature or submitted to SWISS-PROT, which
are not yet integrated into SWISS-PROT.
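
The translation step that TrEMBL applies to each coding sequence can be sketched as follows. This is a minimal illustration using the standard genetic code, not the actual TrEMBL pipeline; the helper name `translate` is an assumption, and the codon table is built from the standard code written in TCAG order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence (DNA, reading frame 0) up to the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # 'X' for codons with unknown bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGAAATTTGGGTAA"))
```

Codons that contain an ambiguous base such as N are reported as X here, which is one simple way of handling incomplete data.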

TrEMBL consists of two divisions, SP-TrEMBL and REM-TrEMBL:
SP-TrEMBL
• It is a temporary storage area
for sequences that have not yet
been manually annotated. Once
fully described, its entries will
eventually be incorporated
into SWISS-PROT.
• TrEMBL is developed by EBI
REM-TrEMBL
• It contains completely explained
and fully annotated sequences.
• Contains sequences that are not
destined to be included in SWISS-PROT
• Eg:
# immunoglobulins & T-cell
receptors
# fragments of fewer than eight
amino acids
# synthetic sequences
# patented sequences

PIR
• Primary protein sequence database.
• Protein Information Resource
• Developed by Margaret Dayhoff in the early 1960s as a collection of
sequences for investigating evolutionary relationships among
proteins.
• Developed at the National Biomedical Research Foundation
(NBRF)
• The database is split into 4 distinct sections, based on the
level of information.
• PIR-1, PIR-2, PIR-3, PIR-4
• They differ in terms of
# quality of data
# level of annotation provided.

1) PIR-1
• Contains fully classified and annotated entries.
2) PIR-2
• Includes preliminary entries, which have not been
thoroughly reviewed and may contain redundancy
3) PIR-3
• Contains unverified entries, which have not been reviewed.
4) PIR-4
• Contains protein sequences that are not genetically
encoded and not produced on ribosomes, i.e. they are
synthetic protein sequences.

Composite Protein Databases
• These are amalgamations or compilations of the contents
of different primary databases.
• They make searching easy and efficient for the searcher.
• They render sequence searching much more efficient, because
they obviate the need to interrogate multiple resources
1) NRDB
2) OWL

NRDB - Non-Redundant DataBase
• It is built locally at NCBI
• Combination of 6 primary DBs
1. SWISS-PROT
2. PDB
3. PIR
4. GenPept
5. GenPept update
6. SWISS-PROT update
• Intended to be non-redundant and error-free
• Strictly speaking, however, redundancy and errors can still occur:
when redundant, erroneous or incorrect sequences are present in any
component DB, especially SWISS-PROT, they are incorporated into NRDB
as such.
• It makes searching more efficient by avoiding the need to search many DBs for
related information.

OWL
• A composite protein sequence database; not to be confused with the
Web Ontology Language, which shares the acronym
• Compilation of 4 primary DBs
1. GenBank (translation)
2. SWISS-PROT
3. NRL-3D
4. PIR
• It makes searching more efficient by obviating the need to search many DBs for
related information
• If there is any redundancy in GenBank, it is carried over into OWL during
amalgamation.
• Developed at the University of Leeds (UK) in association with the Daresbury Laboratory in
Warrington (1994)
• The sources are assigned a priority on the basis of their level of annotation and sequence
validation
• SWISS-PROT has the highest priority
• OWL is only released every 6-8 weeks.

Secondary Databases
PROSITE
PRINTS
BLOCKS
IDENTIFY
• They contain the fruits of analyses of the sequences in the primary
sources
• Put simply, secondary data are derived from primary data
• There are several different primary databases and a variety of
ways of analysing protein sequences, so there are several
secondary databases.

PROSITE
• PROSITE was the first secondary DB to be developed
• It derives its information from the primary database SWISS-PROT
• Produced and maintained by SIB
• Release date: 1988, by Amos Bairoch
• URL address: http://www.prosite.expasy.org
• It categorises protein sequences into families.
• Proteins are grouped into families on the basis of the single most
conserved motif.
• Motif: a stretch of amino acids (10-20 residues) that is
responsible for protein function and preserves its 3D structure.
• Such motifs usually encode key biological functions.
• Eg: enzyme active sites, ligand or metal binding sites
• A motif represents a characteristic feature or site of each family.
• These regions act as signatures of a particular protein family and help to
identify new members of the family
• PROSITE is developed by a largely manual process of seeking the patterns
that best fit particular families and functions.
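
PROSITE motifs are written in a compact pattern syntax, and scanning a sequence for a motif can be sketched by converting the pattern into a regular expression. The converter below handles only the common elements of that syntax (`-` separators, `x` for any residue, `[..]` allowed residues, `{..}` forbidden residues, `(n)` or `(n,m)` repeats, `<` and `>` anchors); the function names are hypothetical, and the example pattern is the N-glycosylation site `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into Python regular expression syntax."""
    regex = pattern.rstrip(".")          # PROSITE patterns end with a period
    regex = regex.replace("-", "")       # element separators are not needed
    regex = regex.replace("x", ".")      # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
    regex = regex.replace("<", "^").replace(">", "$")   # sequence-end anchors
    return regex


def find_motif(pattern: str, protein: str):
    """Return (start, matched_text) for each non-overlapping motif occurrence."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(protein.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motif("N-{P}-[ST]-{P}.", "AANGSAA"))
```

Every sequence that matches the pattern is a candidate member of the family the motif characterises, which is how a single conserved motif can act as a family signature.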

• PROSITE entries are held in two different files
1) The first lists the pattern and all its matches in the current version of SWISS-PROT
2) The documentation file provides:
# details of the characterised family
# a description of the biological role of the chosen motif
# a supporting bibliography
SIGNIFICANCE
• To find families based on motifs, i.e. sequences that carry the same motif in the
same region are considered a single family.
• Fast functional characterisation and annotation of protein sequences.
• Identifies possible functions of newly discovered proteins and analyses proteins
for previously undetermined activities
• Offers tools for protein sequence analysis and motif detection
• It is part of the ExPASy proteomics analysis server
APPLICATION
• Classification of proteins is possible based on the most highly conserved motif
• A particular motif identifies the characteristic features of the proteins
that carry it.
• Eg: the structural and functional details of those proteins

PRINTS
• Derives its information from OWL; in future it will collect
information from SP-TrEMBL and SWISS-PROT
• The process of deriving information from OWL is called iterative
database scanning.
• Contributed by SIB
• In 1999 it was maintained in the Department of Biochemistry and
Molecular Biology at University College London (UCL).
• http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/protein2frm.html
• Here multiple motifs are considered instead of a single common
motif.
• This helps to find more closely similar sequences, so clearer
information is available.
• More accurate analysis is possible because it is based on multiple motifs
shared by the sequences.

BLOCKS
• A database based on multiple motifs
• Ungapped multiple alignments of motifs
• The database contains information on blocks
• Highly conserved multiple motifs are aligned without any gaps
• Developed by Henikoff (1998)
• Automatically derived database
• The database is constructed using the automated PROTOMAT system.
• The blocks, ultimately encoded as ungapped local alignments, are calibrated against
SWISS-PROT to obtain a measure of the likelihood of a chance match
• Two scores are noted for each block:
# the first denotes the level at which 99.5 percent of matches are true
negatives.
# the second is the median value of the true positive scores.
• The median standardised score for known true positive matches is
termed the strength.
• Because the database is derived by fully automatic methods, the blocks
are not annotated, but links are made to the corresponding PROSITE
family documentation file.

• Because its information is derived from the secondary
databases PRINTS and PROSITE, it can also be called a tertiary
database.
• It is based on the protein families contained in PROSITE, and is maintained at the Fred
Hutchinson Cancer Research Center (FHCRC).
• The motifs, or BLOCKS, are created by automatically detecting the
most highly conserved regions of each protein family.
• The blocks are ultimately encoded as ungapped local or
multiple alignments.
• Structure of BLOCKS entries:
• Each block is identified by a general code (ID) line and an
accession number.
• The ID line indicates the type of discriminator to expect in the file.
• The AC line indicates the minimum and maximum distance of the
block from its preceding neighbour.
• The DE line contains a description or title of the family.
• The BL line indicates the diagnostic power (amino acid triplet, number
of sequences the block contains)

IDENTIFY
• Another automatically derived tertiary source
• Derived from BLOCKS and PRINTS
• Developed in the Department of Biochemistry at Stanford
University by Nevill-Manning et al. (1998)
• Constructed on the basis of eMOTIF
• eMOTIF: a method based on the similarities between highly conserved
motif sequences.
• The database is constructed on the basis of generalised
expressions of the similarities between highly conserved motif
sequences.
• It is designed to be more flexible than exact regular expression
matching.
• IDENTIFY and eMOTIF are accessible via the protein function web server of
the Biochemistry Department at Stanford. Sets of amino acids and their
properties are used in eMOTIF.

Structure Classification Databases
• Many proteins share structural similarities, reflecting
common evolutionary origins
1) SCOP
2) CATH

SCOP
• Structural Classification Of Proteins
• It is maintained at the MRC Laboratory of Molecular Biology
and the Centre for Protein Engineering.
• It describes the structural and evolutionary relationships
between proteins of known structure (1995).
• It is helpful at both the multi-domain level and the individual
domain level.
• It is constructed using a combination of manual inspection
and automated methods.
• Because the structural information is checked by both
automatic and manual methods, the results are
more accurate.

SCOP Classification
• Proteins are classified in a hierarchical fashion to reflect their structural and
evolutionary relationships.
• Protein structures are assigned in a hierarchical order at three levels:
1) Family
2) Superfamily
3) Fold
• Family
Proteins are clustered into families with a clear evolutionary relationship if they
have a sequence identity of more than 30 percent.
• Superfamily
Proteins are placed in superfamilies when, in spite of low sequence identity,
their structural and functional characteristics suggest a common
evolutionary origin.
• Fold
Proteins are classified as having a common fold if they have the same major secondary
structures in the same arrangement and with the same topology.
• SCOP is accessible for keyword searches via the MRC Laboratory web server
• http://www.bioinf.man.ac.uk/dbbrowser/bioactivity/structurefrm.html

CATH
• Class, Architecture, Topology, Homology
• It is a hierarchical classification of protein structures maintained at University
College London (UCL) (1997).
• The resource is largely derived using automatic methods, but manual inspection
is necessary where automatic methods fail.
• Developed by UCL's Biomolecular Structure and Modelling Unit. Used
for the classification of protein structures. There are five levels within the hierarchy.
A) CLASS
Is derived from the gross secondary structure content and packing of the protein.
Four classes of domain are recognised:
Class 1: mainly alpha helix
Class 2: mainly beta sheet
Class 3: alpha-beta, which includes both alternating alpha/beta and
alpha + beta structures
Class 4: domains with a very low content of secondary
structural elements

B) ARCHITECTURE
• Describes the gross arrangement of secondary structures, ignoring their
connectivities.
C) TOPOLOGY
• Describes both the overall shape and the connectivity of the secondary structures
of the protein
D) HOMOLOGY
• Groups domains that share more than 35 percent sequence identity and share a common
ancestor (homologous). Similarities are first identified by sequence
comparison and then by a structure comparison algorithm
E) SEQUENCE
# Final level in the hierarchy.
# Structures within homology groups are further clustered on the basis of
sequence identity.
# Domains have sequence identities of more than 35%, indicating highly
similar structures and functions
CATH is accessible for keyword searches via the web server of UCL's Biomolecular Structure and
Modelling Unit.

A) Sequence Data Base Searching
EST searches
Different approaches to EST analysis
Merck/IMAGE
Incyte
TIGR
EGAD
EST analytical tools
Sequence similarity
Sequence assembly and Sequence clustering

EST searches
• EST: Expressed Sequence Tag.
• EST data are held in the EST database,
which maintains its own format and identification number
system.
• ESTs are also called gene transcripts.
• An expressed sequence tag is a short
nucleotide sequence produced from cDNA.
• mRNA - reverse transcriptase enzyme - single-stranded DNA (cDNA).
• A typical EST is between 200 and 500 bases in length; with
modern technical advances, the theoretical length
resulting from a single run is 1000 bases or more.
• ESTs are partial gene transcripts, and they are
noisy sequences that, as a result of sequencing errors, may not only
contain ambiguous bases but also be missing bases.

• In analysing ESTs, the following points should be kept in mind:
# The EST alphabet has five characters: A, C, G, T and N.
# An EST will be a subsequence of some other sequence in the database.
# An EST may not represent part of the CDS of any gene.
# EST production is highly automated, and the results are often
contaminated with ambiguous or missing bases. This causes
difficulties in sequence interpretation.
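
A first quality check on an EST follows directly from these points: confirm that it uses only the five-letter alphabet and measure how much of it is ambiguous. The sketch below is a minimal illustration; the helper name `est_summary` and the returned dictionary layout are assumptions made for this example.

```python
EST_ALPHABET = set("ACGTN")  # four bases plus N for an ambiguous base call


def est_summary(seq: str) -> dict:
    """Report length, characters outside ACGTN, and the fraction of N bases."""
    seq = seq.upper()
    invalid = sorted(set(seq) - EST_ALPHABET)
    n_fraction = seq.count("N") / len(seq) if seq else 0.0
    return {"length": len(seq), "invalid": invalid, "n_fraction": n_fraction}


if __name__ == "__main__":
    print(est_summary("ACGTNNAC"))
```

An EST with a high fraction of N bases is a candidate for exclusion or trimming before similarity searching.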
Uses
• Identification of a particular gene
• Mapping of genes within a genome using a small stretch of
sequence
• Identification of species
• Libraries of ESTs have been developed for academic analysis
or commercial exploitation

Different approaches to EST analysis
• These are the sources that provide EST information.
• Various approaches to establishing libraries of ESTs for
academic or commercial exploitation have been developed.
• Much of the publicly available data is collected together
in the EST sections of the EMBL data library and
GenBank (dbEST)
• Merck/IMAGE
• Incyte
• TIGR
• EGAD

Merck/IMAGE
• It is a research project run by Washington University and
funded by Merck and Co.
• In 1994, Merck and Co. funded a research project based at
Washington University to sequence 300,000 ESTs from a variety
of normalised libraries.
AIM:
• To produce 300,000 ESTs from cDNA libraries.
• For many years Merck has sponsored the production of a drug
index.
Approaches of the source
• To support academic analysis
• Commercialisation of EST information for drug production
• The EST collection is known as the Merck Gene Index. As of May 1997,
484,421 ESTs had been submitted by the project to dbEST

Incyte
• It is a pharmaceutical company
• Incyte Pharmaceuticals Inc.
• It produces a database, LifeSeq, that emphasises the quantitative
information derived by sequencing standard cDNA libraries.
AIM
• To collect and provide information on the relative copy numbers of genes
in healthy and diseased tissue.
• To facilitate the elucidation of potential therapeutic targets.
APPROACH
• Commercialisation of genomic information on the ESTs of
healthy and diseased cells, which is then used to find therapeutic targets.
• Commercial development of drugs
• In April 1998, the size of LifeSeq was 2.5 million ESTs,
representing 8000 to 12000 different genes.

TIGR
• The Institute for Genomic Research.
• It is a not-for-profit research institute.
• It stands purely for academic purposes.
• It is a research organisation with interests in the structural, functional and
comparative analysis of genomes and gene products.
• The range of organisms covered includes viruses, eubacteria (including pathogenic
bacteria), archaebacteria and eukaryotes (plants and animals)
AIM
• Preparation of the Human Gene Index (HGI).
• This index integrates results from human genome research projects
around the world, including those from dbEST and GenBank.
• To create a non-redundant view of all human genes and information on
their expression patterns, cellular roles, functions and evolutionary
relationships.
• Data in HGI are freely available.
• TIGR combines more than 100,000 ESTs that it sequenced from over 300 cDNA libraries
with data from dbEST and non-redundant human transcript information,
using the technique of sequence assembly, to generate Tentative Human
Consensus (THC) sequences.

EGAD
• Expressed Gene Anatomy Database
• It is a database providing information on ESTs

EST Analytical Tools
There are many tools available for the analysis of ESTs:
• Commercially available tool = Incyte LifeTools
• Publicly available tools = 3 types
1) Sequence Similarity Search Tools
2) Sequence Assembly Tools
3) Sequence Clustering Tools

1) Sequence Similarity Search Tools
• We consider these tools as they relate to ESTs.
• Given a query EST, the tool identifies the database sequences
that show sequence similarity with it, by comparing it with all
the sequences in the database.
• Eg: BLAST tools
BLASTP
BLASTN
BLASTX
TBLASTN

2) Sequence Assembly Tools
• When a database search reveals several ESTs matching
a probe sequence, the ESTs must normally be aligned
with each other to reveal the consensus sequence.
• These tools are used when several EST sequences
show similarity to a probe sequence.
• In this situation, the tool aligns and merges the
different sequence fragments to reconstruct the original
mRNA.
• Examples: Phrap, Staden assembler, TIGR Assembler
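
The align-and-merge idea behind assembly can be sketched with a greedy overlap merge: repeatedly join the two fragments with the longest exact suffix-prefix overlap. This is a toy illustration that assumes error-free fragments from one strand; it is not how Phrap or the other assemblers named above are implemented, and `overlap`, `greedy_assemble` and `min_len` are names chosen for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len: int = 3):
    """Merge fragments pairwise, always taking the largest overlap first."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "ACCTTAG"]))
```

Real ESTs contain sequencing errors, so practical assemblers tolerate mismatches and gaps in the overlaps and build a consensus rather than requiring exact matches.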

3) Sequence Clustering Tools
• These are programs that take a large set of sequences and
divide them into subsets, or clusters, based on the extent of shared
sequence within a defined minimum overlap region.
• These tools can analyse a large set of sequences
and group, or cluster, the sequences that share the
most similar regions.
• A reliable and effective mechanism for clustering ESTs reduces
redundancy in the database and saves database search time and
analysis effort.
• Examples:
Web EST clustering tools
USEARCH
CD-HIT

Sequence similarity searching tools
 These are software tools used for searching, assessing, analysing, interpreting and
predicting the information contained in databases.
 There are two types:
1) Pairwise sequence alignment and similarity searching tools
# A pair of sequences is involved.
# One is the query sequence and the other is the template.
# Query: the sequence to be studied.
# Template: the sequence found in the database.
Eg: BLAST, FASTA
2) Multiple sequence alignment and similarity search tools, or
homology searching tools
# More than two sequences are involved.
# A set of sequences can be compared and aligned together.
Eg: CLUSTAL, MODELLER
PSI-BLAST
# Position-Specific Iterated BLAST
# It is a hybrid of pairwise sequence alignment and multiple sequence similarity search
tools.

 Sequences are aligned to find regions of high density or
strong similarity.
 According to how much of the sequence length is aligned,
sequence alignments are of two types:
1) Local sequence alignment: alignment that selects only the
regions which exhibit strong similarity.
Eg: BLAST, FASTA, PSI-BLAST
2) Global sequence alignment: alignment that considers the
entire length of the sequences.
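
The difference between the two types can be seen in a short dynamic-programming sketch in Python. The scoring values (match +1, mismatch -1, gap -1) and the function name `alignment_score` are assumptions chosen for illustration; they are not the parameters used by BLAST or FASTA, which rely on faster heuristics.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b: global by default, local if local=True."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]


print(alignment_score("ACGT", "AGT"))                        # global
print(alignment_score("TTTACGTTT", "GGACGGG", local=True))   # local
```

The local variant reports only the best-matching region (here the shared "ACG"), while the global variant scores the sequences end to end.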

Functional Analysis Tool
• Applicable to protein as well as nucleotide sequences.
• Used for functional analysis.
• Used to study the similarities of sequences based on their
function.
• GOFFA:
# Gene Ontology for Functional Analysis
# Used for identification of functional elements in the
genome and for related functional analysis of genes
and genomes
• ErmineJ:
# Used for genome analysis
# and also for functional analysis related to gene
expression
• InterProScan:
# Used for the functional analysis of proteins

Structural Analysis Tool
 Structural analysis of nucleotides and proteins.
Eg:
 SWISS PROT
 PDB viewer
 RasMol

Statistical Analysis Tool
 Statistical analysis of the values of similarity and
differences.
Eg:
 Statistica
 MATLAB
 Perl

B) Pair-Wise Sequence Alignment
Technique
 Comparison of sequences and sub sequences
 Identity and similarity
 Substitution matrices
 PAM
 BLOSUM
 DOTPLOT
 BLAST
 FASTA

Substitution matrices
( BLOSUM & PAM)
 When two sequences are compared and both have leucine at the
position being compared,
 the residue-to-residue match (leucine-leucine) is given an
alignment score of 1.
 Because of mutation or evolutionary change, however, an amino
acid can be replaced by another, causing a mismatch.
 Such a mismatch can be accepted as a match if it does not
change the basic structure or function of the protein.
 These accepted matches are identified by deeper analysis.
 They are used in the study of evolutionary relationships.
 When an amino acid changes, its chemical nature is considered. If
the nature remains the same on deeper analysis, the pair should be
treated as a match and scored accordingly in a matrix. Matrices
built in this way are called substitution matrices.
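
The idea of scoring accepted mismatches can be sketched in Python as follows. The handful of scores shown are quoted from BLOSUM62 for illustration only and should be checked against a full published matrix; the names `SCORES`, `pair_score` and `alignment_score` are hypothetical.

```python
# A few BLOSUM62-style entries (illustrative; verify against a published matrix).
SCORES = {
    ("L", "L"): 4,
    ("I", "I"): 4,
    ("D", "D"): 6,
    ("L", "I"): 2,    # conservative change: both residues are hydrophobic
    ("L", "D"): -4,   # non-conservative change: hydrophobic vs acidic
}


def pair_score(a, b):
    """Look up a residue pair; the matrix is symmetric."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the pair scores over an ungapped alignment of equal length."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(alignment_score("LLD", "LID"))  # leucine to isoleucine is tolerated
print(alignment_score("LLD", "LDD"))  # leucine to aspartate is penalised
```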

BLOSUM Model
 It is a family of substitution matrices.
 BLOSUM stands for BLOCKS amino acid substitution matrices.
 It was proposed to overcome the problem of aligning distantly
related sequences with the earlier substitution matrices.
 It was proposed by Steven Henikoff and Jorja G. Henikoff in
1992. The information is derived from the amino acid patterns in
the conserved regions (blocks) of distantly related protein
sequences available in the BLOCKS database, hence the name
BLOCKS SUBSTITUTION MATRIX.
 BLOSUM matrices are based on a much larger data set.
 They represent distant relationships more explicitly. Closely related
sequences are clustered together and treated as a single
sequence.

Each cluster contains sequences whose sequence identity is
higher than a cutoff called the clustering percentage.
Changing the clustering percentage leads to a family of
matrices.
Three commonly used versions are:
BLOSUM 30 - clustered at 30 percent identity; suited to
distantly related sequences
BLOSUM 62 - clustered at 62 percent identity; suited to
moderately related sequences
BLOSUM 90 - clustered at 90 percent identity; suited to
closely related sequences
It helps to detect a diverse range of relationships
(close and distant).

PAM
 (Point Accepted Mutation, or Dayhoff PAM model)
 Also known as the Dayhoff amino acid substitution matrix.
 It was derived by M. O. Dayhoff in 1978.
 Substitutions of amino acids are observed in homologous protein
sequences during evolution; these substitutions do not
significantly change the function of the protein.
 These substitutions are accepted by natural selection.
 They are therefore known as accepted point mutations, or point
accepted mutations (PAM).
 To prepare PAM matrices, the substitutions that occur in
alignments between closely similar sequences are counted and then
used to generate a 20×20 mutation probability matrix P representing
all amino acid changes.

 Each element Pij of the matrix represents the probability of
replacement of amino acid j by amino acid i over a fixed
evolutionary period.
 PAM 1 is the unit of evolutionary divergence in which
one percent of amino acids have been changed.
 The model has limited value.
 It is applied to the alignment and comparison of highly
similar sequences.
 It is mainly used for closely related sequence comparison.
 It does not represent distantly related sequences well;
BLOSUM was later proposed to overcome this.
 Used in evolutionary studies.
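
The relation between the PAM 1 probabilities and matrices for longer evolutionary periods can be written compactly. This is a sketch of the usual Dayhoff construction; the symbols $f_i$ (background frequency of amino acid $i$) and $S_{ij}$ (log-odds score) are introduced here for illustration and do not appear in the notes above.

```latex
P^{(n)} = \left(P^{(1)}\right)^{n},
\qquad
S^{(n)}_{ij} = 10 \,\log_{10} \frac{P^{(n)}_{ij}}{f_i}
```

For example, PAM 250 corresponds to n = 250. A positive score means that the replacement is observed more often than expected by chance, and a negative score means it is observed less often.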

DOT PLOT Analysis
 It is a pairwise sequence alignment technique.
 It is a very simple and basic pairwise sequence analysis technique.
 It is a manual, graphical method of sequence analysis.
 Within a plot, two identical sequences give a characteristic unbroken diagonal.
 It is the most basic method of comparing two sequences: a visual
approach known as the dot plot.
 It was first described by A. J. Gibbs and G. A. McIntyre in 1970.
 It is a graphical method for comparing two sequences to identify the
regions of similarity or dissimilarity, depicted by the presence or absence
of a dot on the plot, hence the name dot plot.
 To construct a dot plot of sequence A and sequence B, the first
sequence is placed along the top of the plot (x-axis) and the second
sequence along the left side (y-axis) of the plot.
 A dot is placed on the plot wherever a character Ai of sequence A
is identical to a character Bj of sequence B.

 A region of consecutive identical characters between both
sequences forms a diagonal line in the plot space.
 When long similar sequences are compared, such plots
become crowded or noisy. To overcome this, the sliding
window concept is used.
 From the dot plot, the alignment score is calculated.
Uses
 Used for improvise logical sequence analysis.
 Useful for comparison of protein sequences.
 The plot is characterized by some apparently random dots
(noise) and by diagonal lines, which indicate regions of
greater similarity between the two sequences.
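
The construction described above, including the sliding window used to suppress noise, can be sketched in a few lines of Python. The function names `dot_matrix` and `render` and the `window` parameter are hypothetical; this is a minimal illustration rather than the procedure of any particular dot plot program.

```python
def dot_matrix(seq_a, seq_b, window=1):
    """Rows of booleans: row i belongs to seq_b[i], column j to seq_a[j]."""
    rows = []
    for i in range(len(seq_b) - window + 1):
        row = []
        for j in range(len(seq_a) - window + 1):
            # A dot is placed only if the whole window is identical.
            row.append(seq_b[i:i + window] == seq_a[j:j + window])
        rows.append(row)
    return rows


def render(seq_a, seq_b, window=1):
    """Draw the plot as text: seq_a along the top, seq_b down the left side."""
    lines = ["  " + seq_a]
    for i, row in enumerate(dot_matrix(seq_a, seq_b, window)):
        lines.append(seq_b[i] + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("GATTACA", "GATTACA"))
print(render("GATTACA", "GATTACA", window=3))
```

With `window=1` the repeated A and T characters produce scattered off-diagonal dots; with `window=3` only the main diagonal remains.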

BLAST
 Basic Local Alignment Search Tool
 Pairwise sequence alignment tool.
 Developed and maintained by NCBI.
 It is a tool specialised in local sequence alignment instead of
whole (global) sequence alignment.
 The tool is based on an explicit statistical theory
(Altschul et al., 1990).
 The original algorithm gives ungapped alignments of local regions.
 Can be used to align both protein and nucleotide sequences, but it
gives better alignments for protein sequences.
 Very fast searching tool.
 It can search a database containing millions of sequences
within seconds, in a pairwise manner.

Use
 Constructs a pairwise sequence alignment by comparison between two
sequences.
 Best tool for finding the single best-matching sequence in the corresponding
database.
 To find sequences structurally similar to the query sequence, including 3D
structure.
 Used in the interpretation and prediction of structural information.
 Interpretation and prediction of functional information.
Steps
 Selection of the local regions that show the best similarity.
 Extension of the search on both sides of each selected region to obtain
maximum similarity.
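
The two steps above (select a short matching region, then extend it in both directions) can be sketched in Python. This is a simplified illustration, assuming exact word matches as seeds and ungapped extension with a fixed drop-off; the names `extend` and `seed_and_extend` and the parameters `word` and `drop` are hypothetical and do not reproduce the actual BLAST implementation.

```python
def extend(query, subject, qi, sj, step, match, mismatch, drop):
    """Extend without gaps in one direction; return (best gain, residues used)."""
    score = best = best_len = length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > drop:
            break  # score fell too far below the best seen: stop extending
        qi += step
        sj += step
    return best, best_len


def seed_and_extend(query, subject, word=3, match=1, mismatch=-1, drop=2):
    """Return (score, query_start, subject_start, length) of the best local hit."""
    index = {}
    for j in range(len(subject) - word + 1):
        index.setdefault(subject[j:j + word], []).append(j)
    best_hit = None
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):       # step 1: seed
            right, r_len = extend(query, subject, i + word, j + word, 1,
                                  match, mismatch, drop)  # step 2: extend
            left, l_len = extend(query, subject, i - 1, j - 1, -1,
                                 match, mismatch, drop)
            hit = (word * match + left + right, i - l_len, j - l_len,
                   word + l_len + r_len)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit


print(seed_and_extend("TTACGTACGG", "CCACGTACGA"))
```

Because only regions containing an exact seed word are examined, the search is fast, but a similar region with no exact seed would be missed.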
Demerits
 At a time, a query sequence can be compared with only a single sequence.
 Sensitivity in selecting sequences: it may sometimes lose sensitivity
in selecting the best matches from databases, because in trying to
maintain its speed the tool may miss certain matches that are better
than the one selected.


1) BLASTP
Searches a protein sequence database (P.S.D.B) for the best-matching
protein sequences for a protein query sequence.
2) BLASTN
Searches a nucleotide sequence database (N.S.D.B) for the best-matching
nucleotide sequences (N.S) for a nucleotide query sequence.
3) TBLASTN
Query sequence = protein sequence.
The given nucleotide sequence database is translated into protein
sequences, and the query is compared with the translated nucleotide sequences.
4) BLASTX
Query sequence = nucleotide sequence.
The search is made within a protein sequence database: the nucleotide
query is translated into protein sequences in all six reading frames, and
the translations are compared with the protein sequences in the database.
5) TBLASTX
Both the nucleotide query and the nucleotide sequence database are
translated into protein sequences, and the search is then carried out.
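
The translation step used by the translated searches can be illustrated with a short Python sketch of six-frame translation. It assumes the standard genetic code and upper-case A, C, G, T input only; the names `translate` and `six_frames` are hypothetical and this is not the code used inside BLAST.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frames(dna):
    """Frames +1..+3 on the given strand and -1..-3 on the reverse strand."""
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse_complement(dna)[offset:])
    return frames


print(six_frames("ATGGCC"))
```

Each of the six translations can then be compared with protein sequences in the same way as an ordinary protein query.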

FASTA
 The name stands for FAST-All.
 It is a sequence alignment tool.
 Developed by Lipman and Pearson in 1985.
 The FASTA format is a text-based format for representing either nucleotide
sequences or amino acid (protein) sequences, in which nucleotides or amino
acids are represented using single-letter codes.
 The format also allows for sequence names and comments to precede the
sequences.
 The format originates from the FASTA software package, but has now become a
near universal standard in the field of bioinformatics.
 The simplicity of FASTA format makes it easy to manipulate and parse
sequences using text-processing tools and scripting languages like the R
programming language, Python, Ruby, and Perl.
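
As an illustration of how easily the format can be parsed, here is a minimal Python sketch. The function name `parse_fasta` and the example records are hypothetical; production code would normally use an established library rather than this hand-written parser.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2
MKV
"""
for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```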
Comparison with BLAST:
 It gives better results for nucleotides, but can be used for both protein and nucleotide sequences.
 It can provide better results than BLASTN, but not better than BLASTP.
 It is more sensitive than BLAST in selecting the best matches: fewer sequences
are missed while searching than with BLAST.

Different forms of FASTA:
1) FASTA3
The standard program, used to search a protein or nucleotide sequence
database with a query of the same type.
2) FASTS3
Used to compare linked peptides against a protein sequence
database.
3) FASTF3
Used to compare mixed peptides against a protein sequence
database.
4) FASTX3 / FASTY3
Used to search a protein sequence database with a
translated nucleotide query sequence.
5) TFASTX3 / TFASTY3
Used to search a translated nucleotide sequence database
with a protein query sequence.

C) Multiple Alignment Technique
 Objective, manual, simultaneous and progressive
methods
 Databases of multiple alignments
 PSI-BLAST
 CLUSTAL-W

Multiple Sequence Alignment
 More than two sequences are involved.
 A set of sequences can be compared at one time, and aligned as well.
 There are two types of alignment:
 Simultaneous Multiple Sequence Alignment and Progressive Multiple
Sequence Alignment.
1) Simultaneous Multiple Sequence Alignment
 Alignment of all sequences occurs at one time, that is, simultaneously.
 There is no hierarchical or orderly arrangement of the sequences.
 The sequences nevertheless share similarity.
Advantage
 Very fast, very quick alignment.
Disadvantage
 An orderly arrangement of sequences based on similarity cannot be expected.
 Study of evolutionary relationships is not possible.

2) Progressive Multiple Sequence Alignment
 A hierarchical, clear-cut orderly arrangement of sequences
can be seen.
 Sequence alignment occurs progressively, step by step; it is
a slightly time-consuming process.
 The best-matching, most similar sequence is arranged next
after the query sequence.
Advantage
 Sequences are arranged in a hierarchical fashion.
 Study of evolutionary relationships is possible.
Disadvantage
 Comparatively slow and slightly time-consuming process.

PSI-BLAST
 PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search
Tool) derives a position-specific scoring matrix (PSSM) or profile from
the multiple sequence alignment of sequences detected above a given
score threshold using protein–protein BLAST.
 This PSSM is used to further search the database for new matches, and is
updated for subsequent iterations with these newly detected sequences.
 Thus, PSI-BLAST provides a means of detecting distant relationships
between proteins.
 PSI-BLAST is most conveniently used on the internet with the help of
the graphical user interface provided by the PSI-BLAST search page on
National Center for Biotechnology Information (NCBI) website
(http://www.ncbi.nlm.nih.gov/BLAST/).
 The PSI-BLAST page may be customized by the user in terms of
automated or semiautomated or “two-page formatting” and other
parameters modified as desired.
 This page can then be saved as permanent internet bookmark for
repeated use on future occasions.
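
How a position-specific scoring matrix is derived from a multiple alignment can be sketched in Python. This is a simplified illustration, assuming a uniform background frequency and a flat pseudocount; the names `build_pssm` and `score` are hypothetical, and PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one dict of log2-odds scores per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for letter in alphabet:
            freq = (column.count(letter) + pseudocount) / total
            scores[letter] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, sequence):
    """Score an ungapped sequence of the same length as the profile."""
    return sum(col[letter] for col, letter in zip(pssm, sequence))


# Hypothetical three-sequence alignment from a first search round
profile = build_pssm(["MKV", "MKI", "MRV"])
print(round(score(profile, "MKV"), 2), round(score(profile, "AAA"), 2))
```

In an iterative search, sequences scoring above the threshold would be added to the alignment and the matrix rebuilt before the next round.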

 It is a hybrid tool.
 It is a relatively recent approach.
 It combines elements of both pairwise and multiple sequence alignment
methods.
 It was proposed by Altschul et al. in 1997.
 It is a hybrid of pairwise sequence alignment and multiple sequence
alignment and similarity searching tools.
 It can align sequences via progressive sequence alignment.
 Residue-to-residue similarity is searched by comparing the sequences only,
as in a dot plot.
 Where similarity is present, a dot is placed as a graphical
representation.
 The similarity is then calculated.
 For example, 5 positions out of 7 are similar.
 Used mainly for protein sequence comparison.

 Here, sequences are aligned pairwise, but with repeated BLAST searches in order to
retrieve more and more related sequences.
 So it acts as a pairwise tool while also resembling a multiple sequence alignment.
 The results therefore contain sequences of maximum, intermediate and least similarity.
Advantages
 Increases the search range of BLAST.
 Fast to run.
 Provides sequences with a diverse range of sequence similarity, like a multiple sequence alignment.
 Searches are more sensitive and selective, able to detect weak but meaningful
similarities.
 Running the program iteratively increases search sensitivity.
Disadvantages
 Deriving diagnostic family motifs can be very time consuming and demands
a level of understanding beyond general use.
 Automated iterative searches may degenerate and lead to profile dilution.

CLUSTAL
 3 forms:
1) CLUSTAL X
2) CLUSTAL W
3) CLUSTAL ω
 CLUSTAL X&W:
Protein sequence as well as nucleotide sequence alignment possible
 CLUSTAL ω:
Can only align the protein sequence
 CLUSTAL X:
 In CLUSTAL X the controlling interface is a graphical user interface.
 Menu-based operations and graphical representations
are used for handling it.
 CLUSTAL W and CLUSTAL ω:
 Command line interface.
 The interface is controlled using text commands.

Clustal W
 Clustal W like the other Clustal tools is used for aligning
multiple nucleotide or protein sequences in an efficient
manner.
 It uses progressive alignment methods, which align the most
similar sequences first and work their way down to the least
similar sequences until a global alignment is created.
 Clustal W is a matrix-based algorithm, whereas tools like
T-Coffee and Dialign are consistency-based.
 ClustalW has a fairly efficient algorithm that competes well
against other software.
 This program requires three or more sequences in order to
calculate a global alignment, for pairwise sequence alignment
(2 sequences) use tools similar to EMBOSS, LALIGN

 Multiple sequence alignment tool.
 Progressive multiple sequence alignment is possible.
 Written in the C++ programming language.
 It can run on almost all platforms, such as Unix, Linux, Macintosh and Windows.
 Developed by Julie Thompson and Toby Gibson.
 Developed and maintained by EBI.
 The user interface is a command line, operated by writing text commands.
 Because of progressive multiple sequence alignment, comparison is very easy
owing to the orderly arrangement.
Application
 Very easy to compare sequences due to progressive sequence alignment
 Very useful for the classification of both protein and nucleotide
sequences.
 Application in predicting structural and functional features of both
nucleotide as well as protein sequences.
 This is the best tool for evolutionary relationships study .

4.Protein Structure Prediction
A)Secondary structure prediction
1) Chou-Fasman method
2) JPred prediction method

Secondary structure prediction
 Commonly, two experimental methods are used for protein structure
determination:
1) X-ray diffraction technique
2) Nuclear magnetic resonance technique
 Both are very expensive, labour-intensive and time-consuming
processes.
 To overcome these issues, bioinformatics tools are
used.
 Less time-consuming and very fast method.
 Skilled labour is not required.
 Cheapest method when compared with the above two.

Chou-Fasman Method
 The Chou-Fasman method is an empirical technique for the prediction
of secondary structures in proteins.
 Developed by Peter Y. Chou and Gerald D. Fasman.
 The method is based on analysis of the relative frequencies of each
amino acid in alpha helices, beta sheets and turns, based on known
protein structures solved with X-ray crystallography.
 From these frequencies a set of probability parameters was
derived for the appearance of each amino acid in each secondary
structure type, and these parameters are used to predict the
probability that a given sequence of amino acids would form a
helix, a beta strand, or a turn in a protein.
 Significantly less accurate than modern machine learning
based techniques.
 About 50 to 60 percent accurate in identifying correct secondary
structures.

Definition
 It is a statistical procedure in which the amino acids of a given
sequence and their frequencies are compared with the
probability and propensity values of the amino acids given by
Chou and Fasman, in order to fit the given protein to a
particular secondary structure.
Probability table
Lists which amino acids, and how many of each, are present in
each secondary structure of proteins, based on known sequences.
Propensity value
 It is the value that indicates the tendency of a particular amino acid
towards a particular secondary structure.
 The propensity value of an amino acid generally depends on its
chemical properties and its R group:
# Alpha helix: 4 helix makers + 2 helix breakers (a window of 6 residues)
# Beta sheet: 3 sheet makers + 2 sheet breakers (a window of 5 residues)

Steps
 Scan through the given polypeptide chain
 to find out which amino acids are present in the
given strand,
 and how many of each.
 Compare these with the probability and propensity
values given by Chou and Fasman.
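
The scan-and-compare steps can be sketched in Python for the helix case. This is a deliberately simplified illustration: it averages helix propensities over a six-residue window instead of applying the full nucleation and extension rules. The propensity values are as commonly tabulated for the Chou-Fasman method and should be verified against the published table; the names `HELIX_PROPENSITY` and `helix_windows` and the 1.03 threshold are assumptions.

```python
# Helix propensities P(a), as commonly tabulated (verify before real use).
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def helix_windows(sequence, window=6, threshold=1.03):
    """Start positions of windows whose mean helix propensity reaches threshold."""
    starts = []
    for i in range(len(sequence) - window + 1):
        segment = sequence[i:i + window]
        mean = sum(HELIX_PROPENSITY[aa] for aa in segment) / window
        if mean >= threshold:
            starts.append(i)
    return starts


# Hypothetical peptide: helix-favouring start, Gly/Pro-rich end
print(helix_windows("AEAMLKGPGNPS"))
```

The same scan, with sheet propensities and a five-residue window, would give candidate beta-strand regions.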

J Pred prediction method
 A protein secondary structure prediction server.
 Fully automatic method.
 It has been in operation since approximately 19
 JPred incorporates the Jnet algorithm in order to make more
accurate predictions.
 Combination of 6 independent protein structure prediction
methods:
1) ZPRED
2) MULPRED
3) DSC
4) PHD
5) NNSSP
6) PREDATOR

 All 6 methods predict independently.
 A data set of 396 domains supports the secondary structure
information.
 The results of the 6 methods are evaluated against the 396-domain
data set to obtain the final structural information.
 Instead of all 6 methods, a combination that leaves out the
ZPRED and MULPRED methods gives more accurate results.
 The combination of 4 methods gives an accuracy of 72.9 percent.
 It is a secondary structure prediction method; here a
combination of 6 different independent methods is used.
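
Combining independent predictions can be illustrated with a simple majority vote in Python. This is only a sketch of the general idea: the prediction strings are hypothetical (H = helix, E = strand, C = coil), the function name `consensus` is an assumption, and JPred's actual combination scheme is more elaborate than a plain vote.

```python
from collections import Counter


def consensus(predictions):
    """Majority vote per residue over equal-length prediction strings."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*predictions)
    )


# Hypothetical output of three predictors for a four-residue peptide
print(consensus(["HHHC", "HHEC", "HEEC"]))
```

Positions where the methods disagree are exactly where a consensus is most useful, since errors made by one method are often outvoted by the others.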

Tertiary Structure Prediction
Comparative modelling-
 MODELLER
 RasMol

Comparative modelling
 Comparative modelling / homology modelling
 It predicts the 3D structure of proteins.
 It uses experimentally determined protein structures as
models (templates).
 The method predicts the structure of another protein that
exhibits amino acid sequence similarity to the template protein.
 Evolutionarily related proteins have similar sequences and
structures.
 These similarities are very high in the core regions. The
sequence similarity should be greater than 35 percent.
Steps
1) Selection of template sequences
 Select the template from a protein sequence database.
 The template should show maximum sequence similarity
or homology.
2) Preparation of sequence alignment
 Alignment of the two sequences for homology determination.
3) Construction of the 3D model
 It is built from the coordinates of the template.
 Length, height and width are considered when comparing the template
with the query sequence on the basis of the template coordinates.
4) Evaluation of the constructed model
 It is evaluated against known 3D models.
 The method is comparatively accurate.
 The accuracy depends on the sequence alignment.
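
Template selection by sequence similarity can be sketched in Python. The sketch assumes the candidate templates have already been aligned to the query (gaps shown as "-") and uses the 35 percent figure from the notes as the cutoff; the names `percent_identity` and `best_template` and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over the aligned, non-gap columns."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def best_template(query, templates, cutoff=35.0):
    """templates maps a name to a sequence already aligned to the query."""
    scored = {name: percent_identity(query, seq) for name, seq in templates.items()}
    name = max(scored, key=scored.get)
    return (name, scored[name]) if scored[name] > cutoff else None


candidates = {"t1": "ACDEYGHIKV", "t2": "MMMMFMMMMM"}
print(best_template("ACDEFGHIKL", candidates))
```

A result of `None` means that no candidate passes the cutoff, in which case comparative modelling is unlikely to give a reliable model.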

 Homologous models are identified, and the extent of their
sequence similarity with one another and with the unknown is
determined.
 The sequence database search tools BLAST and FASTA are
used to search for related structures.
 Sequences are aligned together with the help of an MSA tool
called Clustal W.
 Structurally conserved and variable regions are identified.
 Coordinates of the core residues of the unknown structure, and
those of the non-core residues, are generated.
 The side chains and their conformations are built.
 The unknown structure is refined and evaluated.
 Various software packages are used: WHAT, RASMOL,
MODELLER.
 It exploits evolutionarily related proteins.

MODELLER
 Used for 3D structure prediction.
 It is written in Fortran 90.
 It is software used in homology or knowledge-based modelling.
 It was developed by Andrej Sali at the University of California, San
Francisco.
 The ModWeb comparative protein structure modelling web server is
based on MODELLER.
 It has limited incorporation of ab initio methods.
 It is a computer program used in producing homology models of protein
tertiary as well as quaternary structures.
 It is freely available for academic use.
 Graphical user interface and commercial versions are provided separately.
 Computer program.
 Used for sequence database searching.
 Used for protein structural comparison.
 Used for sequence clustering.

4 important steps
1) Selection of template sequence
 Select a template sequence from a protein sequence database; the
template should exhibit maximum homology with the sequence under
study.
2) Preparation of sequence alignment
 Prepare a sequence alignment between the sequence to be
analysed and the template sequence.
3) Construction of 3D model
 Construct the 3D model from the coordinates of the template, using a
technique called satisfaction of spatial restraints.
 Using certain geometrical criteria (length, breadth, height), the
template is compared with the query sequence on the basis of the
template coordinates, covering loops, folding, side chains, etc.
4) Evaluation of model constructed
 About 90% accuracy can be expected when the sequence alignment provided
is highly accurate.

RASMOL
 Molecular visualisation software.
 Molecular structural analysis of proteins, nucleic acids and
other similar molecules is possible.
 Used for visualising molecular structure.
 Used mainly for structural analysis.
 Example: pollen grains, detailed molecular structure study.
 Zooming facility for the molecular structure, up to the full size of
the monitor.
 Rotating facility in any 3D direction (x, y, z), for example by 180 degrees
or 120 degrees.
 Peripheral analysis is possible.
 Different colouring schemes are available for highlighting particular parts.
 The entire structure can be viewed, and detailed study is possible
using RASMOL.

Advantage
 Detailed study of structure is possible using RASMOL.
 Molecular visualisation software.
 Very good for detailed molecular analysis of
molecules such as nucleotides or proteins.
1) Group colouring scheme
2) Shapely colouring scheme
3) Amino colouring scheme

5.Emerging Areas of Bioinformatics
1) DNA microarrays
2) Functional genomics
3) Comparative genomics
4) Pharmacogenomics
5) Chemoinformatics
6) Medical informatics

DNA Microarrays
 It is a genetic analysis technique.
 Used for the analysis of nucleic acids.
 In this technique, hundreds to thousands of microscopic dots of
DNA are spotted on a small glass plate in an orderly fashion.
 The location of each DNA dot, its structural details, functional details and
expression product information are available
and stored in a computer program.
 All information on the spotted DNA is available from the computer;
genetic analysis is carried out using this information.
 Started in the 1990s.
 Also called DNA chips, gene chips, DNA arrays, gene arrays and
biochips.
 The principle is hybridization between complementary nucleotide sequences.

Procedure
 mRNA from a normally expressing cell is collected and
applied to the microarray to obtain the rate of gene expression.
 Collect the mRNA and prepare the DNA microarray.
 Radiolabel the cDNA (100 nos.) made from the mRNA; this is
used as the probe.
 Introduce the probe into the DNA microarray.
 The radiolabelled cDNA hybridizes with the dots of the DNA microarray;
the signal at each dot indicates the amount of hybridization.
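
Once a hybridization signal has been measured for each dot, two samples can be compared gene by gene. The Python sketch below computes a log2 ratio between a test sample and a reference sample; the gene names and intensity values are hypothetical, the function name `log2_ratios` is an assumption, and real analyses also include background correction and normalisation.

```python
import math


def log2_ratios(test, reference):
    """log2(test / reference) for every gene measured in both samples."""
    return {
        gene: math.log2(test[gene] / reference[gene])
        for gene in test
        if gene in reference and test[gene] > 0 and reference[gene] > 0
    }


# Hypothetical spot intensities for a diseased and a normal cell sample
diseased = {"geneA": 400.0, "geneB": 50.0}
normal = {"geneA": 100.0, "geneB": 100.0}
print(log2_ratios(diseased, normal))
```

A positive value indicates higher expression in the test sample, a negative value lower expression, and a value near zero no change.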

Application
 Gene expression study
1) For comparison of gene expression in similar cell types (diseased cells and normal
cells)
2) For comparison of gene expression in different cell types (different cells of
different individuals)
 Identification of tissue-specific genes
 Discovery of drugs
 Diagnostics and genetic mapping
 Study of protein-protein interactions
 Functional genomics
 DNA sequencing
 Agricultural biotechnology
 Study of gene expression in plants
 DNA polymorphism
 Detection of pathogens
 Gene finding
 Analysis of 100-1000 genes at a time
 Gene mapping

Functional genomics
 Study of the functions of genes.
 Example: the role of genes in growth under a given physiological and biochemical
environment.
 Inactivity of genes and its reasons.
 Genes may be inactivated by the action of other genes; the expression of a gene may fail
due to suppression by another gene, which is then the causative reason.
 Development and application of genomic analysis techniques.
 Identification of the genes involved in disease.
1) Positional cloning technique
2) Genome sequencing techniques
 Examples:
# Mirring Shotgun method
# Enzymatic method
# Chemical method
 These are developed on the basis of functional genomics
 and give information about the structure and function of genes.
3) Gene expression profiling technique: comparison of similar cell types that differ in
 gene expression due to mutation
 So it is used to find out the expression.
4) Knockout technique

Comparative genomics
 Compares structural and functional details and, based on the similarities and
differences, finds out the relationship.
 Gene finding
 Classification of nucleotide sequences
 Finding out evolutionary relationships; comparison of gene expression
 Analysis of protein sets from completely sequenced genomes
 For better understanding of the genomes and biology of the respective
organisms
 Example: Methanococcus, Mycoplasma, E. coli and Bacillus subtilis are fully
sequenced.
 Example: genes involved in the ripening of green mangoes to yellow mangoes.
 Here the genome of mango is compared with the annotated genome of a similar
species to identify the genes and the functions they perform.
 Databases used for comparative genomics:
A. PEDANT: gives information about proteins and enzymes
B. KEGG: a comprehensive set of metabolic pathways of genomes
C. MBGD: Microbial Genome Database; search for microbial genomes
D. WIT: metabolic reconstruction of completely sequenced genomes

Pharmacogenomics
 It is the study of the role of the genome in drug response.
 Its name reflects its combination of pharmacology and genomics.
 Pharmacogenomics analyses how the genetic makeup of an
individual affects his or her response to drugs.
 It deals with the influence of acquired and inherited genetic
variation on drug response in patients by correlating gene
expression or single nucleotide polymorphisms with
pharmacokinetics and pharmacodynamics.
 Pharmacogenomics aims to develop rational means to optimise
drug therapy
 with respect to the patient's genotype, to ensure maximum efficiency
with minimal adverse effects.
 Genomic research will allow drugmakers to tailor a therapy to the
individual's specific needs.

 It is described as a marriage between functional genomics and
molecular pharmacology.
 A new journal, Pharmacogenomics, was started by the Nature group
of journals.
 It covers the entire spectrum of genes that determine response and
sensitivity to individual drugs.
 Example: Human Genome Project
 Pharmacogenetics is the narrower spectrum of inherited differences
in drug metabolism and disposition.
 The terms pharmacogenomics and pharmacogenetics are often used interchangeably.
 It provides tools to classify the heterogeneity of disease and individual
response to medicine.
 It is a fascinating area of biotechnology research.
 Example: diagnosis, mechanism of disease and response of
patients to medicine

 2 approaches to pharmacogenomics:
1) Candidate gene approach
2) Linkage disequilibrium approach
 At the industrial level, it is used to understand variability in
clinical trials
 Disturb differential side effects
 Inconsistency in disease models

Chemoinformatics
 Also known as cheminformatics, chemioinformatics
and chemical informatics.
 It is the use of computer and informational techniques
applied to a range of problems in the field of chemistry.
Application
 In pharmaceutical companies and academic settings, in the
process of drug discovery.
 These methods can also be used in chemical and allied
industries in various other forms.

Medical informatics
 Also called health informatics
 or clinical informatics.
 It is information engineering applied to the field of healthcare,
essentially the management and use of patient healthcare information.
 It is a multidisciplinary field that uses health information technology to
improve health care via any combination of higher quality, higher
efficiency and new opportunities.
 Used in gene therapy
 Neurological and metabolic disorders
 Cystic fibrosis
 Infectious diseases
 More efficient patient care
 Cardiovascular diseases, cancer gene therapy, human gene therapy
