Bioinformatics
Bioinformatics
 The analysis of biological information using
computers and statistical techniques.
 The science of developing and utilizing computer
databases and algorithms to accelerate and
enhance biological research
 Marriage of Computer Science and Biology
 Organize, store, analyze, visualize genomic data
 Utilizes methods from Computer Science,
Mathematics, Statistics and Biology
 Margaret Oakley Dayhoff is the father & mother of
Bioinformatics
 At the convergence of two revolutions: the ultra-fast
growth of biological data, and the information
revolution
Bioinformatics
Example 1
 Compare proteins with similar sequences (for
instance –kinases) and understand what the
similarities and differences mean.
Example 2
 Look at the genome and predict where genes are
(promoters; transcription binding sites; introns;
exons)
 Predict the 3-dimensional structure of a protein from
its primary sequence
Example 3
 Correlate between gene expression and disease
Example 4
A gene chip – quantifying gene
expression in different tissues
under different conditions
May be used for personalized
medicine
Genomics
Study of sequences, gene organization &
mutations at the DNA level
the study of information flow within a cell
The Human Genome Project
3 billion bases 30,000 genes
http://www.genome.gov/
Goals Established for the Human Genome
Project
 Began in 1990
 Identify all of the genes in human DNA.
 Determine the sequence of the 3 billion chemical
nucleotide bases that make up human DNA.
 Store this information in data bases.
 Develop faster, more efficient sequencing
technologies.
 Develop tools for data analysis.
 Address the ethical, legal, and social issues (ELSI)
that may arise form the project.
Published
•The International Human Genome Sequencing Consortium
published their results in Nature, 409 (6822): 860-921,
2001.”Initial Sequencing and Analysis of the Human
Genome”
•Celera Genomics published their results in Science, Vol
291(5507): 1304-1351, 2001.“The Sequence of the Human
Genome”
• How to characterize new diseases?
• What new treatments can be discovered?
• How do we treat individual patients? Tailoring
treatments?
Impact of Genomics on
Medicine
Implications for Biomedicine
• Physicians will use genetic information to
diagnose and treat disease.
• Virtually all medical conditions have a
genetic component
• Faster drug development research:
(pharmacogenomics)
• Individualized drugs
• All Biologists/Doctors will use gene sequence
information in their daily work
Proteomics
• Uses information determined by biochemical/crystal
structure methods
• Visualization of protein structure
• Make protein-protein comparisons
• Used to determine:
conformation/folding
antibody binding sites
protein-protein interactions
computer aided drug design
Transcriptomics
The term can be applied to the total set of transcripts in a given
organism, or to the specific subset of transcripts present in a
particular cell type
The transcriptome reflects the genes that are being actively
expressed at any given time
Eg. DNA microarray technology
Database
 A collection of data that needs to be:
 Structured
 Searchable
 Updated (periodically)
 Cross referenced
 Challenge:
 To change “meaningless” data into useful information that
can be accessed and analysed the best way possible.
Biological databases
 Like any other database
 Data organization for optimal analysis
 Data is of different types
 Raw data (DNA, RNA, protein sequences)
 Curated data (DNA, RNA and protein annotated
sequences and structures, expression data)
Raw Biological data
Nucleic Acids (DNA)
Raw Biological data
Amino acid residues (proteins)
Curated Biological Data
DNA, nucleotide sequences
Gene boundaries, topology Gene structure
Introns, exons, ORFs, splicing
Expression data Mass spectometry
Curated Biological data
3D Structures, folds
Primary Seq db’s is collaborative- data exchanged daily
Databases 1: nucleotide sequence
 The main DNA sequence db are
EMBL (Europe)/GenBank (USA) /DDBJ (Japan)
 There are also specialized databases for the different types
of RNAs (i.e. tRNA, rRNA, mi RNA…)
 3D structure (DNA and RNA)
 Others: Aberrant splicing db; Eucaryotic promoter db
(EPD); RNA editing sites, Multimedia Telomere Resource
……
EMBL/GenBank/DDJB
 These 3 db contain mainly the same informations
within 2-3 days (few differences in the format and
syntax)
 Serve as archives containing all sequences (single
genes, ESTs, complete genomes, etc.) derived from:
 Genome projects and sequencing centers
 Individual scientists
 Patent offices (i.e. European Patent Office, EPO)
 Non-confidential data are exchanged daily
 Currently: 8.3 x106 sequences, over 9.7 x109 bp;
 Sequences from > 50’000 different species;
GenBank entry: example
LOCUS HSERPG 3398 bp DNA PRI 22-JUN-1993
DEFINITION Human gene for erythropoietin.
ACCESSION X02158
VERSION X02158.1 GI:31224
KEYWORDS erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE human.
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 3398)
AUTHORS Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
Kawakita,M., Shimizu,T. and Miyake,T.
TITLE Isolation and characterization of genomic and cDNA clones of human
erythropoietin
JOURNAL Nature 313 (6005), 806-810 (1985)
MEDLINE 85137899
COMMENT Data kindly reviewed (24-FEB-1986) by K. Jacobs.
FEATURES Location/Qualifiers
source 1..3398
/organism="Homo sapiens"
/db_xref="taxon:9606"
mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
exon 397..627
/number=1
sig_peptide join(615..627,1194..1261)
CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
/codon_start=1
/product="erythropoietin"
/protein_id="CAA26095.1"
/db_xref="GI:312304"
/db_xref="SWISS-PROT:P01588"
/translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL
EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL
RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI
…
GenBank entry (cont.)
TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
intron 628..1193
/number=1
exon 1194..1339
/number=2
mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2760)
/product="erythropoietin"
intron 1340..1595
/number=2
exon 1596..1682
/number=3
intron 1683..2293
/number=3
exon 2294..2473
/number=4
intron 2474..2607
/number=4
exon 2608..3327
/note="3' untranslated region"
/number=5
BASE COUNT 698 a 1034 c 991 g 675 t
ORIGIN
1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg
181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc
241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc
Database 2: protein sequences
 SWISS-PROT: created in 1986 (A.Bairoch) http://www.expasy.org/sprot/
 TrEMBL: created in 1996; complement to SWISS-PROT; derived from
EMBL CDS translations (« proteomic » version of EMBL)
 PIR-PSD: Protein Information Resources http://pir.georgetown.edu/
 Genpept: « proteomic » version of GenBank
 Many specialized protein databases for specific families or groups of
proteins.
 Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM
receptors), IMGT (immune system) YPD (Yeast) etc.
SWISS-PROT
 Collaboration between the SIB (CH) and EMBL/EBI (UK)
 Fully annotated (manually), non-redundant, cross-
referenced, documented protein sequence database.
 ~113 ’000 sequences from more than 6’800 different
species; 70 ’000 references (publications); 550 ’000
cross-references (databases); ~200 Mb of annotations.
 Weekly releases; available from about 50 servers across
the world, the main source being ExPASy
TrEMBL (Translation of EMBL)
 It is impossible to cope with the quantity of newly
generated data AND to maintain the high quality of
SWISS-PROT -> TrEMBL, created in 1996.
 TrEMBL is automatically generated (from annotated EMBL
coding sequences (CDS)) and annotated using software
tools.
 Contains all what is not in SWISS-PROT.
SWISS-PROT + TrEMBL = all known protein sequences.
 Well-structured SWISS-PROT-like resource.
a Swiss-Prot entry…
overview
sequence
Accession
number
Entry name
TrEMBL: example
Original TrEMBL entry which has been integrated into the SWISS-PROT
EPO_HUMAN entry and thus which is not found in TrEMBL anymore.
PDB
 Protein Data Bank, managed by RCSB
 Currently there are ~13’000 structures for about 4’000
different molecules, but far less protein family !
 There are also databases that contain data derived
from PDB. Examples: HSSP (homology-derived
secondary structure of proteins), SWISS-3DIMAGE
(images)…
Restriction enzyme
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
Species-oriented databases
 SGD: Saccharomyces Genome Database; molecular
biology and genetics of the yeast Saccharomyces
cerevisiae (baker's or budding yeast)
 FlyBase: A Database of the Drosophila Genome
 MGI: Mouse Genome Informatics
 RGD: Rat Genome Database
 WormBase - database for the nematode
Caenorhabditis elegans
 The Arabidopsis Information Resource (TAIR) -
database for the brassica family plant Arabidopsis
thaliana
Mouse Genome Informatics
Sequence Alignment
Tools
Database Searching:
BLAST:
NCBI, Web Interface: http://www.ncbi.nlm.nih.gov/BLAST/
WuBLAST http://blast.wustl.edu
FASTA: http://www.ebi.ac.uk/fasta3/
Smith-Waterman
Par-Align: http://dna.uio.no/search/
Multiple Sequence Alignment:
CLUSTALW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
DiAlign, Web Interface: http://genomatix.gsf.de/cgi-bin/dialign/dialign.pl
MSA:http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html
Web Interface: http://bioweb.pasteur.fr/seqanal/interfaces/msa-simple.html
Uses of BLAST
Query a database for sequences similar to an
input sequence
Uses of BLAST
 Identify previously characterized sequences
Query a database for sequences similar to an
input sequence
Uses of BLAST
 Identify previously characterized sequences.
 Find phylogenetically related sequences.
Query a database for sequences similar to an
input sequence
Uses of BLAST
 Identify previously characterized sequences.
 Find phylogenetically related sequences.
 Identify possible functions based on similarities
to known sequences.
Query a database for sequences similar to an
input sequence
Sequence Homology Software
 NCBI-BLAST
 Run by the National Center for Biotechnology
Information
 BLAST uses a heuristic algorithm based on the
Smith-Waterman algorithm
 Algorithm searches database for a small string
within the query (default 11 for nucleotide searches),
then when it detects a match, searches for shared
nucleotides at each end of the seed to extend the
match
 Gaps are taken into account, then the matches are
presented in order of statistical significance
 http://www.ncbi.nlm.nih.gov/BLAST/
Different Types of BLAST
 Nucleotide-nucleotide BLAST (BLASTN):
 Basic nucleutide sequence searches
 The BLAST that you used for your sequences
 Protein-protein BLAST (BLASTP):
 Similar technology used to search amino acid
sequences
 Position-Specific Iterative BLAST (PSI-
BLAST):
 A more advance protein BLAST useful for analyzing
relationships between divergently evolved proteins.
Different Types of BLAST
 BLASTX and tBLASTN variants:
 Use six-frame translation for proteins and
nucleotides, respectively, in the search
 MegaBLAST:
 Used for BLASTing several sequences at once to
cut down on processing load and server reporting-
time
Types of BLAST:
Graphic courtesy of Joel Graber.
Interpreting BLAST Results
 Query Coverage
 The percent of the query sequence matched by the
database entry
 Max Ident
 The percent identity, i.e. the percent that the genes
match up within the limits of the full match (e.g.
deletions or additions reduce this value)
FASTA
 Another method for local sequence alignment.
 Maintained by Dr. William Pearson at the
University of Virginia.
 http://www.infobiogen.fr/doc/Fasta/docfasta.html
FASTA Versions
FASTA-nucleotide or protein sequence searching
FASTx/ FASTy compares a translated DNA query sequence
to a protein sequence database (forward or
backward translation of the query)
tFASTx/ tFASTy -compares protein query sequence to
DNA sequence database that has been translated into three
forward and three reverse reading frames
Multiple Sequence Alignment
Software
 Clustal (free)
 ClustalX – Software
 ClustalW – Web
 DNAStar ($$$)
 Functionality is similar, but difference is in
interface, tools, and speed of algorithms
 http://www.ebi.ac.uk/clustalw/
Multiple Sequence Alignment
(MSA)
• Multiple sequence alignment (MSA) can be seen as a
generalization of Pairwise Sequence Alignment - instead
of aligning two sequences, n sequences are aligned
simultaneously, where n is > 2
• Definition: A multiple sequence alignment is an alignment
of n > 2 sequences obtained by inserting gaps (“-”) into
sequences
FastA Format
 A sequence in FASTA format begins with a single-
line description, followed by lines of sequence
data.
 The description line is distinguished from the
sequence data by a greater-than (">") symbol in
the first column.
 It is recommended that all lines of text be shorter
than 80 characters in length.
Other Completed Genomes
 Haemophilus influenzae
 Escherichia coli
 Bacillus subtilus
 Helicobacter pylori
 Borrelia burgdorferi
 Streptococcus pneumoniae
 Saccharomyces cerevisiae
 Caenorhabditis elegans
 Arabidopsis thaliana
 Archaeoglobus fulgidus
 Methanobacterium thermoautotrophicum
 Methanococcus jannaschii
 Mycoplasma pneumoniae
 Mycoplasm genitaliu
 Rickettsia prowazekii
 Mycobacterium tuberculosis
 Treponema pallidum
 Staphylococcus aureus
 And more!
Completed Plant Genomes
 Arabidopsis thaliana
Completed Insect Genomes
 Drosophila melanogaster
Completed Rodent Genomes
 Mus musculus
Protien- Three Structure Levels
Beta
Sheet
Helix
Loop
PDB ID: 12as
Primary structure: sequence of amino
acids
– e.g., DRVYIHPF
Secondary structure: local folding
patterns
– e.g., alpha-helix, beta-sheet, loop
Tertiary structure: complete 3D fold
Different Levels of Protein Structure Prediction
Protein Structure Prediction
 Regular Secondary Structure Prediction (-helix -sheet)
 APSSP2: Highly accurate method for secondary structure prediction
 Combines memory based reasoning ( MBR) and ANN methods
 Irregular secondary structure prediction methods (Tight turns)
 Betatpred: Consensus method for -turns prediction
 Statistical methods combined
 Kaur and Raghava (2001) Bioinformatics
 Bteval : Benchmarking of -turns prediction
 Kaur and Raghava (2002) J. Bioinformatics and Computational Biology, 1:495:504
 BetaTpred2: Highly accurate method for predicting -turns (ANN)
 Multiple alignment and secondary structure information
 Kaur and Raghava (2003) Protein Sci 12:627-34
 BetaTurns: Prediction of -turn types in proteins
 Evolutionary information
 Kaur and Raghava (2004) Bioinformatics 20:2751-8.
 AlphaPred: Prediction of -turns in proteins
 Kaur and Raghava (2004) Proteins: Structure, Function, and Genetics 55:83-90
 GammaPred: Prediction of -turns in proteins
 Kaur and Raghava (2004) Protein Science; 12:923-929.
Protein Structure Prediction
 BhairPred: Prediction of Super secondary structure prediction
 Prediction of Beta Hairpins
 Secondary structure and surface accessibility used as input
 Manish et al. (2005) Nucleic Acids Research
 TBBpred: Prediction of outer membrane proteins
 Prediction of trans membrane beta barrel proteins
 Prediction of beta barrel regions
 Application of ANN and SVM + Evolutionary information
 Natt et al. (2004) Proteins: 56:11-8
 ARNHpred: Analysis and prediction side chain, backbone
interactions
 Prediction of aromatic NH interactions
 Kaur and Raghava (2004) FEBS Letters 564:47-57 .
 SARpred: Prediction of surface accessibility
 Multiple alignment (PSIBLAST) and Secondary structure information
 ANN: Two layered network (sequence-structure-structure)
 Garg et al., (2005) Proteins
 PepStr: Prediction of tertiary structure of Bioactive peptides
Homology Modeling
The Best Match
DRVYIHPFADRVYIHPFAQuery
Sequence:
Protein sequence
classification database
• PSI-BLAST
• HMM
• Smith-Waterman algorithm
Phylogenetic analysis
 Evolution studies
 Systematic biology
 Medical research and epidemiology
 Ecology
Terminology
Rooted tree Unrooted tree
Advantages of molecular
phylogenetic analysis
 Analogous features (share common function, but
NOT common ancestry) can be misleading
 DNA sequences more simple to model, we only
have the four states A, C, G, T
 DNA samples for sequence analysis easy to
prepare
 Softwares used :
 Phylip
 Paup
Modern drug discovery process
Target
identification
Target
validation
Lead
identification
Lead
optimization
Preclinical
phase
Drug
discovery
2-5 years
• Drug discovery is an expensive process involving high R & D cost and
extensive clinical testing
• A typical development time is estimated to be 10-15 years.
6-9 years
Design
 Comparison of Sequences: Identify targets
 Homology modelling: active site prediction
 Systems Biology: Identify targets
 Databases: Manage information
 In silico screening (Ligand based, receptor
based): Iterative steps of Molecular docking.
 Pharmacogenomic databases: assist safety
related issues
Molecular Docking
RL
• Docking is the computational determination of binding affinity
between molecules (protein structure and ligand).
• Given a protein and a ligand find out the binding free energy of
the complex formed by docking them.
L
R
Molecular Docking: classification
 Docking or Computer aided drug designing can be
broadly classified
 Receptor based methods- make use of the structure of the target
protein.
 Ligand based methods- based on the known inhibitors
Receptor based methods
 Uses the 3D structure of the target receptor to search
for the potential candidate compounds that can
modulate the target function.
 These involve molecular docking of each compound
in the chemical database into the binding site of the
target and predicting the electrostatic fit between
them.
 The compounds are ranked using an appropriate
scoring function such that the scores correlate with
the binding affinity.
 Receptor based method has been successfully
applied in many targets
Ligand based strategy
 In the absence of the structural information of the
target, ligand based method make use of the
information provided by known inhibitors for the target
receptor.
 Structures similar to the known inhibitors are identified
from chemical databases.
Ligand based strategy
Search for similar compounds
database known actives structures found
Example: use of Bioinformatics in
Drug discovery
Identification of novel drug targets against human
malaria
Docking Software :
Dock
Autodock
Gold
Top 100 solutions
Out of top 40 only 10 compounds available for purchase
Drug-like compound
library (1,000,00)
Molecular
docking
Ligand docked into protein’s
active site
Insilico identification of novel inhibitors against PfClpQ , a novel
drug target of P.falciparum by high throughput docking
PfclpQ
Rasmol
RasMol2 is a molecular graphics program intended for the visualization of proteins,
nucleic acids and small molecules
The program is aimed at display, teaching and generation of publication quality images
The program reads in molecular co-ordinate files and interactively displays the molecule
the screen in a variety of representations and color schemes
Supported input file formats include Brookhaven Protein Databank (PDB),
Sybyl Mol2 formats, Molecular Design Limited's (MDL) Mol file format,
Minnesota Supercomputer Centre's (MSC)
XYZ (XMol) format CHARMm format files
Stephen James
Department of Bioinformatics
Mar Athanasios College for Advanced Studies(MACFAST)
Email : stephen@macfast.org
Phone: 9746935363

Bioinformatics final

  • 1.
  • 2.
    Bioinformatics  The analysisof biological information using computers and statistical techniques.  The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research  Marriage of Computer Science and Biology  Organize, store, analyze, visualize genomic data  Utilizes methods from Computer Science, Mathematics, Statistics and Biology  Margaret Oakley Dayhoff is the father & mother of Bioinformatics
  • 3.
     At theconvergence of two revolutions: the ultra-fast growth of biological data, and the information revolution Bioinformatics
  • 4.
    Example 1  Compareproteins with similar sequences (for instance –kinases) and understand what the similarities and differences mean.
  • 5.
    Example 2  Lookat the genome and predict where genes are (promoters; transcription binding sites; introns; exons)
  • 6.
     Predict the3-dimensional structure of a protein from its primary sequence Example 3
  • 7.
     Correlate betweengene expression and disease Example 4 A gene chip – quantifying gene expression in different tissues under different conditions May be used for personalized medicine
  • 8.
    Genomics Study of sequences,gene organization & mutations at the DNA level the study of information flow within a cell
  • 9.
    The Human GenomeProject 3 billion bases 30,000 genes http://www.genome.gov/
  • 10.
    Goals Established forthe Human Genome Project  Began in 1990  Identify all of the genes in human DNA.  Determine the sequence of the 3 billion chemical nucleotide bases that make up human DNA.  Store this information in data bases.  Develop faster, more efficient sequencing technologies.  Develop tools for data analysis.  Address the ethical, legal, and social issues (ELSI) that may arise form the project.
  • 11.
    Published •The International HumanGenome Sequencing Consortium published their results in Nature, 409 (6822): 860-921, 2001.”Initial Sequencing and Analysis of the Human Genome” •Celera Genomics published their results in Science, Vol 291(5507): 1304-1351, 2001.“The Sequence of the Human Genome”
  • 12.
    • How tocharacterize new diseases? • What new treatments can be discovered? • How do we treat individual patients? Tailoring treatments? Impact of Genomics on Medicine
  • 13.
    Implications for Biomedicine •Physicians will use genetic information to diagnose and treat disease. • Virtually all medical conditions have a genetic component • Faster drug development research: (pharmacogenomics) • Individualized drugs • All Biologists/Doctors will use gene sequence information in their daily work
  • 14.
    Proteomics • Uses informationdetermined by biochemical/crystal structure methods • Visualization of protein structure • Make protein-protein comparisons • Used to determine: conformation/folding antibody binding sites protein-protein interactions computer aided drug design
  • 15.
    Transcriptomics The term canbe applied to the total set of transcripts in a given organism, or to the specific subset of transcripts present in a particular cell type The transcriptome reflects the genes that are being actively expressed at any given time Eg. DNA microarray technology
  • 16.
    Database  A collectionof data that needs to be:  Structured  Searchable  Updated (periodically)  Cross referenced  Challenge:  To change “meaningless” data into useful information that can be accessed and analysed the best way possible.
  • 17.
    Biological databases  Likeany other database  Data organization for optimal analysis  Data is of different types  Raw data (DNA, RNA, protein sequences)  Curated data (DNA, RNA and protein annotated sequences and structures, expression data)
  • 18.
  • 19.
    Raw Biological data Aminoacid residues (proteins)
  • 20.
    Curated Biological Data DNA,nucleotide sequences Gene boundaries, topology Gene structure Introns, exons, ORFs, splicing Expression data Mass spectometry
  • 21.
    Curated Biological data 3DStructures, folds
  • 23.
    Primary Seq db’sis collaborative- data exchanged daily
  • 24.
    Databases 1: nucleotidesequence  The main DNA sequence db are EMBL (Europe)/GenBank (USA) /DDBJ (Japan)  There are also specialized databases for the different types of RNAs (i.e. tRNA, rRNA, mi RNA…)  3D structure (DNA and RNA)  Others: Aberrant splicing db; Eucaryotic promoter db (EPD); RNA editing sites, Multimedia Telomere Resource ……
  • 25.
    EMBL/GenBank/DDJB  These 3db contain mainly the same informations within 2-3 days (few differences in the format and syntax)  Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:  Genome projects and sequencing centers  Individual scientists  Patent offices (i.e. European Patent Office, EPO)  Non-confidential data are exchanged daily  Currently: 8.3 x106 sequences, over 9.7 x109 bp;  Sequences from > 50’000 different species;
  • 26.
    GenBank entry: example LOCUSHSERPG 3398 bp DNA PRI 22-JUN-1993 DEFINITION Human gene for erythropoietin. ACCESSION X02158 VERSION X02158.1 GI:31224 KEYWORDS erythropoietin; glycoprotein hormone; hormone; signal peptide. SOURCE human. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 3398) AUTHORS Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J., Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F., Kawakita,M., Shimizu,T. and Miyake,T. TITLE Isolation and characterization of genomic and cDNA clones of human erythropoietin JOURNAL Nature 313 (6005), 806-810 (1985) MEDLINE 85137899 COMMENT Data kindly reviewed (24-FEB-1986) by K. Jacobs. FEATURES Location/Qualifiers source 1..3398 /organism="Homo sapiens" /db_xref="taxon:9606" mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327) exon 397..627 /number=1 sig_peptide join(615..627,1194..1261) CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763) /codon_start=1 /product="erythropoietin" /protein_id="CAA26095.1" /db_xref="GI:312304" /db_xref="SWISS-PROT:P01588" /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI …
  • 27.
    GenBank entry (cont.) TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR" intron628..1193 /number=1 exon 1194..1339 /number=2 mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2760) /product="erythropoietin" intron 1340..1595 /number=2 exon 1596..1682 /number=3 intron 1683..2293 /number=3 exon 2294..2473 /number=4 intron 2474..2607 /number=4 exon 2608..3327 /note="3' untranslated region" /number=5 BASE COUNT 698 a 1034 c 991 g 675 t ORIGIN 1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg 181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc 241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc
  • 28.
    Database 2: proteinsequences  SWISS-PROT: created in 1986 (A.Bairoch) http://www.expasy.org/sprot/  TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations (« proteomic » version of EMBL)  PIR-PSD: Protein Information Resources http://pir.georgetown.edu/  Genpept: « proteomic » version of GenBank  Many specialized protein databases for specific families or groups of proteins.  Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system) YPD (Yeast) etc.
  • 29.
    SWISS-PROT  Collaboration betweenthe SIB (CH) and EMBL/EBI (UK)  Fully annotated (manually), non-redundant, cross- referenced, documented protein sequence database.  ~113 ’000 sequences from more than 6’800 different species; 70 ’000 references (publications); 550 ’000 cross-references (databases); ~200 Mb of annotations.  Weekly releases; available from about 50 servers across the world, the main source being ExPASy
  • 30.
    TrEMBL (Translation ofEMBL)  It is impossible to cope with the quantity of newly generated data AND to maintain the high quality of SWISS-PROT -> TrEMBL, created in 1996.  TrEMBL is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools.  Contains all what is not in SWISS-PROT. SWISS-PROT + TrEMBL = all known protein sequences.  Well-structured SWISS-PROT-like resource.
  • 31.
  • 32.
    TrEMBL: example Original TrEMBLentry which has been integrated into the SWISS-PROT EPO_HUMAN entry and thus which is not found in TrEMBL anymore.
  • 34.
    PDB  Protein DataBank, managed by RCSB  Currently there are ~13’000 structures for about 4’000 different molecules, but far less protein family !  There are also databases that contain data derived from PDB. Examples: HSSP (homology-derived secondary structure of proteins), SWISS-3DIMAGE (images)… Restriction enzyme
  • 35.
    PDB: example HEADER LYASE(OXO-ACID)01-OCT-91 12CA 12CA 2 COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3 COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4 SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5 AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6 REVDAT 1 15-OCT-92 12CA 0 12CA 7 JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8 JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9 JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10 JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11 JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12 JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13 REMARK 1 12CA 14 REMARK 2 12CA 15 REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16 REMARK 3 12CA 17 REMARK 3 REFINEMENT. 12CA 18 REMARK 3 PROGRAM PROLSQ 12CA 19 REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20 REMARK 3 R VALUE 0.170 12CA 21 REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22 REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23 REMARK 4 12CA 24 REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25 REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26 REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27 ………
  • 36.
    ATOM 1 NTRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89 ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90 ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91 ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92 ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93 ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94 ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95 ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96 ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97 ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98 ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99 ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100 ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101 ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102 …….
  • 39.
    Species-oriented databases  SGD:Saccharomyces Genome Database; molecular biology and genetics of the yeast Saccharomyces cerevisiae (baker's or budding yeast)  FlyBase: A Database of the Drosophila Genome  MGI: Mouse Genome Informatics  RGD: Rat Genome Database  WormBase - database for the nematode Caenorhabditis elegans  The Arabidopsis Information Resource (TAIR) - database for the brassica family plant Arabidopsis thaliana Mouse Genome Informatics
  • 41.
    Sequence Alignment Tools Database Searching: BLAST: NCBI,Web Interface: http://www.ncbi.nlm.nih.gov/BLAST/ WuBLAST http://blast.wustl.edu FASTA: http://www.ebi.ac.uk/fasta3/ Smith-Waterman Par-Align: http://dna.uio.no/search/ Multiple Sequence Alignment: CLUSTALW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html DiAlign, Web Interface: http://genomatix.gsf.de/cgi-bin/dialign/dialign.pl MSA:http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html Web Interface: http://bioweb.pasteur.fr/seqanal/interfaces/msa-simple.html
  • 43.
    Uses of BLAST Querya database for sequences similar to an input sequence
  • 44.
    Uses of BLAST Identify previously characterized sequences Query a database for sequences similar to an input sequence
  • 45.
    Uses of BLAST Identify previously characterized sequences.  Find phylogenetically related sequences. Query a database for sequences similar to an input sequence
  • 46.
    Uses of BLAST Identify previously characterized sequences.  Find phylogenetically related sequences.  Identify possible functions based on similarities to known sequences. Query a database for sequences similar to an input sequence
  • 47.
    Sequence Homology Software NCBI-BLAST  Run by the National Center for Biotechnology Information  BLAST uses a heuristic algorithm based on the Smith-Waterman algorithm  Algorithm searches database for a small string within the query (default 11 for nucleotide searches), then when it detects a match, searches for shared nucleotides at each end of the seed to extend the match  Gaps are taken into account, then the matches are presented in order of statistical significance  http://www.ncbi.nlm.nih.gov/BLAST/
  • 48.
    Different Types ofBLAST  Nucleotide-nucleotide BLAST (BLASTN):  Basic nucleutide sequence searches  The BLAST that you used for your sequences  Protein-protein BLAST (BLASTP):  Similar technology used to search amino acid sequences  Position-Specific Iterative BLAST (PSI- BLAST):  A more advance protein BLAST useful for analyzing relationships between divergently evolved proteins.
  • 49.
    Different Types ofBLAST  BLASTX and tBLASTN variants:  Use six-frame translation for proteins and nucleotides, respectively, in the search  MegaBLAST:  Used for BLASTing several sequences at once to cut down on processing load and server reporting- time
  • 50.
    Types of BLAST: Graphiccourtesy of Joel Graber.
  • 51.
    Interpreting BLAST Results Query Coverage  The percent of the query sequence matched by the database entry  Max Ident  The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g. deletions or additions reduce this value)
  • 52.
    FASTA  Another methodfor local sequence alignment.  Maintained by Dr. William Pearson at the University of Virginia.  http://www.infobiogen.fr/doc/Fasta/docfasta.html
  • 53.
    FASTA Versions FASTA-nucleotide orprotein sequence searching FASTx/ FASTy compares a translated DNA query sequence to a protein sequence database (forward or backward translation of the query) tFASTx/ tFASTy -compares protein query sequence to DNA sequence database that has been translated into three forward and three reverse reading frames
  • 54.
    Multiple Sequence Alignment Software Clustal (free)  ClustalX – Software  ClustalW – Web  DNAStar ($$$)  Functionality is similar, but difference is in interface, tools, and speed of algorithms  http://www.ebi.ac.uk/clustalw/
  • 55.
    Multiple Sequence Alignment (MSA) •Multiple sequence alignment (MSA) can be seen as a generalization of Pairwise Sequence Alignment - instead of aligning two sequences, n sequences are aligned simultaneously, where n is > 2 • Definition: A multiple sequence alignment is an alignment of n > 2 sequences obtained by inserting gaps (“-”) into sequences
  • 56.
    FastA Format  Asequence in FASTA format begins with a single- line description, followed by lines of sequence data.  The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.  It is recommended that all lines of text be shorter than 80 characters in length.
  • 57.
    Other Completed Genomes Haemophilus influenzae  Escherichia coli  Bacillus subtilus  Helicobacter pylori  Borrelia burgdorferi  Streptococcus pneumoniae  Saccharomyces cerevisiae
  • 58.
     Caenorhabditis elegans Arabidopsis thaliana  Archaeoglobus fulgidus  Methanobacterium thermoautotrophicum  Methanococcus jannaschii  Mycoplasma pneumoniae  Mycoplasm genitaliu  Rickettsia prowazekii  Mycobacterium tuberculosis
  • 59.
     Treponema pallidum Staphylococcus aureus  And more!
  • 60.
    Completed Plant Genomes Arabidopsis thaliana Completed Insect Genomes  Drosophila melanogaster Completed Rodent Genomes  Mus musculus
  • 61.
    Protien- Three StructureLevels Beta Sheet Helix Loop PDB ID: 12as Primary structure: sequence of amino acids – e.g., DRVYIHPF Secondary structure: local folding patterns – e.g., alpha-helix, beta-sheet, loop Tertiary structure: complete 3D fold
  • 62.
    Different Levels ofProtein Structure Prediction
  • 63.
    Protein Structure Prediction Regular Secondary Structure Prediction (-helix -sheet)  APSSP2: Highly accurate method for secondary structure prediction  Combines memory based reasoning ( MBR) and ANN methods  Irregular secondary structure prediction methods (Tight turns)  Betatpred: Consensus method for -turns prediction  Statistical methods combined  Kaur and Raghava (2001) Bioinformatics  Bteval : Benchmarking of -turns prediction  Kaur and Raghava (2002) J. Bioinformatics and Computational Biology, 1:495:504  BetaTpred2: Highly accurate method for predicting -turns (ANN)  Multiple alignment and secondary structure information  Kaur and Raghava (2003) Protein Sci 12:627-34  BetaTurns: Prediction of -turn types in proteins  Evolutionary information  Kaur and Raghava (2004) Bioinformatics 20:2751-8.  AlphaPred: Prediction of -turns in proteins  Kaur and Raghava (2004) Proteins: Structure, Function, and Genetics 55:83-90  GammaPred: Prediction of -turns in proteins  Kaur and Raghava (2004) Protein Science; 12:923-929.
  • 64.
    Protein Structure Prediction BhairPred: Prediction of Super secondary structure prediction  Prediction of Beta Hairpins  Secondary structure and surface accessibility used as input  Manish et al. (2005) Nucleic Acids Research  TBBpred: Prediction of outer membrane proteins  Prediction of trans membrane beta barrel proteins  Prediction of beta barrel regions  Application of ANN and SVM + Evolutionary information  Natt et al. (2004) Proteins: 56:11-8  ARNHpred: Analysis and prediction side chain, backbone interactions  Prediction of aromatic NH interactions  Kaur and Raghava (2004) FEBS Letters 564:47-57 .  SARpred: Prediction of surface accessibility  Multiple alignment (PSIBLAST) and Secondary structure information  ANN: Two layered network (sequence-structure-structure)  Garg et al., (2005) Proteins  PepStr: Prediction of tertiary structure of Bioactive peptides
  • 65.
    Homology Modeling The BestMatch DRVYIHPFADRVYIHPFAQuery Sequence: Protein sequence classification database • PSI-BLAST • HMM • Smith-Waterman algorithm
  • 66.
    Phylogenetic analysis  Evolutionstudies  Systematic biology  Medical research and epidemiology  Ecology
  • 68.
  • 69.
    Advantages of molecular phylogeneticanalysis  Analogous features (share common function, but NOT common ancestry) can be misleading  DNA sequences more simple to model, we only have the four states A, C, G, T  DNA samples for sequence analysis easy to prepare  Softwares used :  Phylip  Paup
  • 70.
    Modern drug discoveryprocess Target identification Target validation Lead identification Lead optimization Preclinical phase Drug discovery 2-5 years • Drug discovery is an expensive process involving high R & D cost and extensive clinical testing • A typical development time is estimated to be 10-15 years. 6-9 years
  • 71.
    Design  Comparison ofSequences: Identify targets  Homology modelling: active site prediction  Systems Biology: Identify targets  Databases: Manage information  In silico screening (Ligand based, receptor based): Iterative steps of Molecular docking.  Pharmacogenomic databases: assist safety related issues
  • 72.
    Molecular Docking RL • Dockingis the computational determination of binding affinity between molecules (protein structure and ligand). • Given a protein and a ligand find out the binding free energy of the complex formed by docking them. L R
  • 73.
    Molecular Docking: classification Docking or Computer aided drug designing can be broadly classified  Receptor based methods- make use of the structure of the target protein.  Ligand based methods- based on the known inhibitors
  • 74.
    Receptor based methods Uses the 3D structure of the target receptor to search for the potential candidate compounds that can modulate the target function.  These involve molecular docking of each compound in the chemical database into the binding site of the target and predicting the electrostatic fit between them.  The compounds are ranked using an appropriate scoring function such that the scores correlate with the binding affinity.  Receptor based method has been successfully applied in many targets
  • 75.
    Ligand based strategy In the absence of the structural information of the target, ligand based method make use of the information provided by known inhibitors for the target receptor.  Structures similar to the known inhibitors are identified from chemical databases.
  • 76.
    Ligand based strategy Searchfor similar compounds database known actives structures found
  • 77.
    Example: use ofBioinformatics in Drug discovery Identification of novel drug targets against human malaria Docking Software : Dock Autodock Gold
  • 78.
    Top 100 solutions Outof top 40 only 10 compounds available for purchase Drug-like compound library (1,000,00) Molecular docking Ligand docked into protein’s active site Insilico identification of novel inhibitors against PfClpQ , a novel drug target of P.falciparum by high throughput docking PfclpQ
  • 79.
    Rasmol RasMol2 is amolecular graphics program intended for the visualization of proteins, nucleic acids and small molecules The program is aimed at display, teaching and generation of publication quality images The program reads in molecular co-ordinate files and interactively displays the molecule the screen in a variety of representations and color schemes Supported input file formats include Brookhaven Protein Databank (PDB), Sybyl Mol2 formats, Molecular Design Limited's (MDL) Mol file format, Minnesota Supercomputer Centre's (MSC) XYZ (XMol) format CHARMm format files
  • 81.
    Stephen James Department ofBioinformatics Mar Athanasios College for Advanced Studies(MACFAST) Email : stephen@macfast.org Phone: 9746935363