M.Alroy Mascrenghe 1
Introduction

2000
• A major event happened that was to change the course of human history
• It was a joint British and American effort
• Nothing to do with Iraq!
• It was a race: who would finish first?
• A race test: not whether they have taken drugs, but whether they can produce them!
• A working draft of the human genome sequence was announced

A Situ… somewhere in the near future
• A virus (not the ‘I love you’ virus) creates an epidemic
• Geneticists and bioinformaticians roll up their sleeves
• The genetic material of the virus is compared with the existing base of known genetic material of other viruses
• Because the characteristics of those other viruses are known, close matches suggest how the new virus behaves
• From the genetic material, computer programs derive the proteins necessary for the survival of the virus
• Once the protein (sequence and structure) is known, medicines can be designed

What is bioinformatics?
• The marriage between computer science and molecular biology
  ◦ The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists
• ‘Information technology applied to the management and analysis of biological data’
  ◦ Storage and analysis are two of the important functions; bioinformaticians build tools for each

[Diagram: Bioinformatics in relation to Biology, Chemistry, Statistics, and Computer Science]

What is bioinformatics? (continued)
• This is the age of information technology
• However, storing information is nothing new
• Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells
• ‘Bioinformatics tries to determine what info is biologically important’

Basics of Molecular Biology

DNA & Genes
• DNA is where genetic information is stored
• Traits such as blonde hair and blue eyes are inherited through it
• Gene: the basic unit of heredity
  ◦ Genes underlie characteristics, e.g. a gene for blonde hair
• Genes contain their information as a sequence of nucleotides
• Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them separately
• Genes are made up of nucleotides

Nucleotide (nt)
• Each nt is made up of
  ◦ Sugar
  ◦ Phosphate group
  ◦ Base
• The base it contains is the only difference between one nt and another
• There are 4 different bases
• G(uanine), A(denine), T(hymine), C(ytosine)
• The information is in the order of the nucleotides; the order is the info
• Genes can be many thousands of nt long
• The complete set of genetic instructions is called the genome

Chromosomes
• DNA strings make up chromosomes
• Analogy
  ◦ Letters: nt
  ◦ Sentences: genes
  ◦ Individual volumes of the Encyclopaedia Britannica: chromosomes
  ◦ All volumes together: genome

Double Helix
• DNA is a double helix
• Each strand carries complementary information
• Each base in one strand is bonded to one particular base in the other strand
  ◦ G - C
  ◦ A - T
• For example:
  ◦ AATGC (one strand)
  ◦ TTACG (other strand)
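The pairing rule above is mechanical enough to sketch in a few lines. This is a minimal illustration only; the function names `complement` and `reverse_complement` are made up for this example and are not from the slides.

```python
# Watson-Crick pairing as stated on the slide: G pairs with C, A pairs with T.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the partner base for every position, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("AATGC"))  # TTACG, the slide's example
```

The second function is included because the two strands of a helix run in opposite directions, so tools usually report the partner strand reversed.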

Proteins
• Proteins are a very important biological feature
• Amino acids make up proteins
• There are 20 different amino acids
• The function of a protein depends on the order of its amino acids

Proteins (continued)
• The information required to assemble amino acids (aa) into proteins is stored in DNA
• DNA sequence determines amino acid sequence
• Amino acid sequence determines protein structure
• Protein structure determines protein function
• A substance called RNA carries the information stored in the DNA, which in turn is used to make proteins
• Storage: DNA
• Information transfer: RNA
• RNA is the messenger boy!

Central dogma
DNA → (transcription, by RNA polymerase) → RNA → (translation, by ribosomes) → Protein

Proteins (continued)
• With 20 amino acids to encode, one nt per aa is not enough (only 4 possibilities), and neither are pairs of nt (4 × 4 = 16)
• So protein information is carried in triplet codes called codons (4 × 4 × 4 = 64 combinations)
• The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T)
• AUG is used as the start codon, and also codes for methionine
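The codon rules can be tried out with a small sketch. It builds the standard genetic code table from a compact 64-letter string (one-letter amino acid codes, `*` for stop). As a simplifying assumption, "transcription" here just rewrites the coding strand with U in place of T. The function names are illustrative, not from the slides.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G at each position.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna: str) -> str:
    """Simplified transcription: copy the coding strand, writing U instead of T."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate from the first AUG until a stop codon (or the end of the RNA)."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":  # UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCTTGGTAA")))  # MAW (Met-Ala-Trp)
```

The table has 64 entries, three of which are stop codons, which matches the counting argument on the slide.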

Protein Structure
• Protein structures show wide variety, unlike DNA, whose structure is uniform
• X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to determine the structure
• Structure is related to function, or rather structure determines function
• Although proteins are created as a linear chain of aa, they fold into a 3-D structure
• If you stretch them and let go, they return to this structure: the native structure of the protein
• Only in its native structure does a protein function well
• Even after translation is over, a protein goes through some changes to its structure

Gene Expression
• Gene expression: the process of transcribing DNA and translating RNA to make a protein
• Where do the genes begin in a chromosome?
• How does RNA polymerase identify the beginning of a gene in order to make a protein?
• A single nt cannot mark the beginning of a gene, because each nt occurs too frequently
• But a particular combination of nucleotides can
• Promoter sequences: the order of nt that marks the beginning of a gene
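A short sketch can make the "single nt versus combination" point concrete. The motif `TATAAA` (the TATA-box consensus) is used purely as an illustration, and the sequence is invented for the example.

```python
def find_motif(sequence: str, motif: str) -> list[int]:
    """Return every start position of motif in sequence (overlaps allowed)."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


sequence = "GGTATAAAGGCATGC"  # made-up example sequence
print(len(find_motif(sequence, "A")))  # a single base occurs many times
print(find_motif(sequence, "TATAAA"))  # a six-base motif is far more specific
```

In random sequence with equal base frequencies, a given base is expected at one position in 4, while a given six-base motif is expected at roughly one position in 4^6 = 4096, which is why a combination can serve as a signal where a single nt cannot.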

Bioinformatics Techniques

Prediction and Pattern Recognition
• The two main areas of bioinformatics are:
• Pattern recognition
  ◦ ‘A particular sequence or structure has been seen before’, and a particular characteristic can be associated with it
• Prediction
  ◦ From a sequence (what we know) we can predict the structure and function (what we don’t know)

Dot plots
• A simple way of evaluating the similarity between two sequences
• In a graph, one sequence runs along one axis and the other sequence along the other axis
• Wherever the two sequences match, the graph is marked
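A dot plot as described can be sketched as a grid of characters. This is a minimal version without the window filtering that practical tools add; the function name `dot_plot` is an illustrative choice.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """One row per character of seq_a, one column per character of seq_b.

    A cell is '*' where the two characters match and '.' otherwise.
    """
    return ["".join("*" if a == b else "." for b in seq_b) for a in seq_a]


for row in dot_plot("TTACTATA", "TAGATA"):
    print(row)
```

Runs of `*` along a diagonal indicate stretches where the two sequences agree, which is what the eye looks for in such a plot.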

Alignments
• An alignment pairs up the characters of two or more sequences to show their similarity
• E.g.
  ◦ TTACTATA
  ◦ TAGATA
• There are many ways to align these two sequences, for example by sliding the shorter one along the longer:
  ◦ 1.
      TTACTATA
      TAGATA
  ◦ 2.
      TTACTATA
       TAGATA
  ◦ 3.
      TTACTATA
        TAGATA
• So which one do we choose, and on what basis?
• The solution is to assign a match score and a mismatch score
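One simple way to apply match and mismatch scores is to slide the shorter sequence along the longer one without gaps and score each position. The scores of +1 and -1 below are assumed values for illustration, and the function names are hypothetical.

```python
def score_at_offset(long_seq: str, short_seq: str, offset: int,
                    match: int = 1, mismatch: int = -1) -> int:
    """Score an ungapped alignment with short_seq placed at the given offset."""
    window = long_seq[offset:offset + len(short_seq)]
    return sum(match if a == b else mismatch for a, b in zip(window, short_seq))


def best_offset(long_seq: str, short_seq: str) -> int:
    """Return the first offset with the highest ungapped score."""
    offsets = range(len(long_seq) - len(short_seq) + 1)
    return max(offsets, key=lambda k: score_at_offset(long_seq, short_seq, k))


for k in range(3):
    print(k, score_at_offset("TTACTATA", "TAGATA", k))
```

For these two sequences, offsets 0 and 2 tie on score, which illustrates why scoring alone is often not enough and gaps need to be considered as well.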

Gaps
• Introduce gaps, and a penalty score for gaps
  ◦ TTACTATA
  ◦ T_A_GATA
• In gap scoring, a single indel two characters long is preferred to two indels each one character long
• However, not all gaps are bad
  ◦ TTGCAATCT
  ◦ CAA
  ◦ How do we align them?
  ◦ ---CAA---
  ◦ These end gaps are not biologically significant
  ◦ Semi-global alignments do not penalise them
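The preference for one long indel over several short ones is commonly obtained with an affine penalty: a larger cost to open a gap and a smaller cost to extend it. The values -5 and -1 below are assumptions chosen for illustration, not figures from the slides.

```python
from itertools import groupby


def gap_penalty(length: int, gap_open: int = -5, gap_extend: int = -1) -> int:
    """Affine penalty: pay gap_open once, then gap_extend per extra position."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def gap_runs(aligned: str, gap_char: str = "_") -> list[int]:
    """Lengths of each consecutive run of gap characters in an aligned row."""
    return [len(list(run))
            for is_gap, run in groupby(aligned, key=lambda c: c == gap_char)
            if is_gap]


def total_gap_penalty(aligned: str) -> int:
    return sum(gap_penalty(n) for n in gap_runs(aligned))


print(total_gap_penalty("T_A_GATA"))  # two separate one-character gaps: -10
print(total_gap_penalty("T__AGATA"))  # one two-character gap: -6
```

A semi-global variant would simply skip runs that touch either end of the row before summing.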

Scoring Matrix
• For DNA/protein sequence alignment we create a matrix that gives a score for every pair of characters, for example:
• A aligned with A scores 1
• A aligned with T scores -5
• A aligned with C scores -1

Dynamic Programming
• As the query sequences get longer, and the difference in length between the two sequences grows, more gaps have to be inserted in various places
• We cannot perform an exhaustive search
• Combinatorial explosion occurs: there are too many combinations to search
• Dynamic programming avoids this by building the best alignment from the best alignments of shorter prefixes, computing each sub-result only once
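As a concrete sketch of dynamic programming for alignment, the function below fills a table of best scores for every pair of prefixes (a Needleman-Wunsch-style global alignment). The scores of +1, -1 and -2 are assumed values, a simple linear gap penalty is used, and only the score is returned rather than the alignment itself.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the best score for aligning a[:i] with b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,       # gap in b
                table[i][j - 1] + gap,       # gap in a
            )
    return table[-1][-1]


print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(a) + 1) × (len(b) + 1) cells and each is filled once, so the work grows with the product of the sequence lengths instead of exploding combinatorially.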

Databases
• Sequence information is stored in databases
• So that it can be manipulated easily
• The databases (next slide) are located in different places
• They exchange information daily so that they stay up to date and in sync
• Primary db: sequence data

Major Primary DB
Nucleic acid      Protein
EMBL (Europe)     PIR (Protein Information Resource)
GenBank (USA)     MIPS
DDBJ (Japan)      SWISS-PROT (University of Geneva, now with EBI)
                  TrEMBL (a supplement to SWISS-PROT)
                  NRL-3D

Composite DB
• With so many databases, which one should we search? Each is good in some aspects and weak in others
• A composite db is the answer: it draws on several databases for its base data
• Searching a composite db is indexed and streamlined, so that the same stored sequence is not searched twice in different databases

Composite DB (continued)
• OWL has these as its primary databases:
  ◦ SWISS-PROT (top priority)
  ◦ PIR
  ◦ GenBank
  ◦ NRL-3D
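The idea of a non-redundant composite with a priority order can be sketched as follows. This is not how OWL is actually built; the entry IDs and sequences are toy data invented for the example, and `build_composite` is a hypothetical name.

```python
def build_composite(sources: list[tuple[str, dict[str, str]]]) -> dict[str, tuple[str, str]]:
    """Merge several databases, keeping each distinct sequence only once.

    sources is given in priority order; when the same sequence appears in
    more than one database, the entry from the earlier source is kept.
    Returns {sequence: (database name, entry id)}.
    """
    composite: dict[str, tuple[str, str]] = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            composite.setdefault(sequence, (db_name, entry_id))
    return composite


# Toy entries (hypothetical IDs and sequences).
swiss_prot = {"SP_0001": "MAWK", "SP_0002": "MKLV"}
pir = {"PIR_A": "MAWK", "PIR_B": "MGGS"}

composite = build_composite([("SWISS-PROT", swiss_prot), ("PIR", pir)])
print(len(composite))     # 3 distinct sequences instead of 4 entries
print(composite["MAWK"])  # ('SWISS-PROT', 'SP_0001')
```

A search then runs once over the merged set, so a sequence stored in two source databases is compared only a single time.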

Secondary db
• Store secondary structure info or results of searches of the primary db

Database     Primary source
PROSITE      SWISS-PROT
PRINTS       OWL

Database Searches
• Many genes have already been sequenced and identified, so we know what they do
• Their sequences are stored in databases
• So if we find a new gene in the human genome, we compare it with the known genes stored in the databases
• Since the databases hold a very large number of sequences, we cannot do a full sequence alignment against each and every one
• So heuristics must be used
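One common heuristic is to index short words (k-mers) and fully align only those database sequences that share a word with the query. The sketch below shows just that filtering step; it is in the spirit of word-based search tools but is not any particular tool's algorithm. The database entries and function names are invented for illustration.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int = 3) -> dict[str, set[str]]:
    """Map every k-letter word to the names of the sequences that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def candidate_hits(query: str, index: dict[str, set[str]], k: int = 3) -> list[str]:
    """Names of sequences sharing at least one word with the query, best first."""
    shared = defaultdict(int)
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in words:
        for name in index.get(word, ()):
            shared[name] += 1
    return sorted(shared, key=lambda name: (-shared[name], name))


database = {"seq1": "TTACTATA", "seq2": "GGGGCCCC"}  # toy database
index = build_kmer_index(database)
print(candidate_hits("TAGATA", index))  # ['seq1']
```

Only the candidates returned here would go on to a full alignment, which trades a small risk of missing weak matches for a large saving in time.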

Areas in Bioinformatics

Genomics
• In a multicellular organism, each cell type expresses its genes in a different way, although every cell has the same genetic content
• I.e. all the information for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them

Genomics - Finding Genes
• A gene in sequence data is a needle in a haystack
• However, whereas a needle looks different from the hay, genes do not look different from the rest of the sequence data
• In a whole array of nt, we try to find a set of nt and mark its borders as a gene
• This is one of the challenges of bioinformatics
• Neural networks and dynamic programming are being employed
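The simplest form of border marking is an open reading frame (ORF) scan: look for a start codon followed, in the same frame, by a stop codon. This is only a naive baseline, far cruder than the neural-network and dynamic-programming gene finders mentioned above. It scans the forward strand only, and `min_codons` is an assumed threshold.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the codons from ATG up to, but not including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGGCTTGGTAACC"  # made-up example
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

Real genes, especially in organisms with introns, do not follow this simple pattern, which is part of why gene finding remains hard.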

Organism       Genome size (Mb)   Gene number   Web site
Yeast          13.5               6,241         http://genome-www.stanford.edu/Saccharomyces
Fruit flies    180                13,601        http://flybase.bio.indiana.edu
Homo sapiens   3,000              45,000        http://www.ncbi.nlm.nih.gov/genome/
(1 Mb = 1,000,000 bp)

Proteomics
• The proteome is the sum total of an organism's proteins
• Proteomics is more difficult than genomics
  ◦ Building blocks: 4 nucleotides versus 20 amino acids
  ◦ Chemical makeup: simple for DNA, complex for proteins
  ◦ DNA can be duplicated; proteins cannot
• We are entering the ‘post genome era’
• Meaning much has been done with genes, not that the work on them is over

Proteomics (continued)
• The level of an RNA and the level of the protein it codes for are usually very different
• After translation, proteins change
  ◦ So the aa sequence does not tell us anything about post-translational changes
• Proteins are not active until they are combined into a larger complex or moved to a relevant location inside or outside the cell
• So the aa sequence only hints at these things
• Proteins must also be handled more carefully in the lab, as they tend to change on contact with an inappropriate material

Protein Structure Prediction
• One of the biggest challenges of bioinformatics, and especially of biochemistry
• There is as yet no algorithm that consistently predicts the structure of proteins

Structure Prediction Methods
• Comparative modelling
  ◦ The target protein's structure is modelled by comparison with related proteins
  ◦ Proteins with similar sequences are searched for known structures

Phylogenetics
• The taxonomic system reflects evolutionary relationships
• Phylogenetic trees depict evolutionary relationships as a picture/graph
• Rooted trees: there is a single common ancestor
• Unrooted trees: show only the relationships
• Phylogenetic tree reconstruction algorithms are also an area of research
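Distance-based tree reconstruction methods start from pairwise distances between aligned sequences. As a minimal sketch, the code below computes the proportion of differing sites and picks the closest pair, which is the pair such methods would typically join first. The species labels and sequences are toy data invented for the example.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(aligned: dict[str, str]) -> tuple[str, str]:
    """Return the two names whose sequences have the smallest distance."""
    return min(
        combinations(sorted(aligned), 2),
        key=lambda pair: p_distance(aligned[pair[0]], aligned[pair[1]]),
    )


# Toy aligned sequences (not real data).
aligned = {"human": "AATGC", "chimp": "AATGG", "yeast": "TTAGC"}
print(closest_pair(aligned))  # ('chimp', 'human')
```

A full method would merge the closest pair into a cluster, recompute distances, and repeat until a single tree remains.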

Applications

Medical Implications
• Pharmacogenomics
  ◦ Not all drugs work on all patients; some good drugs cause death in some patients
  ◦ So by doing a gene analysis before treatment, the offending drugs can be avoided
  ◦ Also, drugs that are fatal to most people can be used on the minority whose genes are well suited to them. Volunteers wanted!
  ◦ Customized treatment
• Gene therapy
  ◦ Replace or supply the defective or missing gene
  ◦ E.g. insulin, or Factor VIII for haemophilia
• Bioweapons (??)

Diagnosis of Disease
• Identifying the genes that cause a disease helps detect it at an early stage, e.g. Huntington's disease:
  ◦ Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
  ◦ Death in 10-15 years
  ◦ The gene responsible for the disease has been identified
  ◦ It contains excessively repeated sections of CAG
  ◦ So once the analysis is done, the couple can be counselled
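Counting repeated CAG sections is a simple pattern-matching task. The sketch below finds the longest uninterrupted run of CAG in a DNA string. The input is a made-up sequence, and no clinical threshold is implied.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Return the number of repeats in the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


print(longest_cag_run("TTCAGCAGCAGTTCAGAA"))  # 3
```

A diagnostic pipeline would apply such a count to the relevant region of the gene only, not to an arbitrary sequence.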

Drug Design
• Developing a drug can take up to 15 years and cost $700 million
• One of the goals of bioinformatics is to reduce the time and cost involved
• The process
  ◦ Discovery
    ▪ Computational methods can improve this
  ◦ Testing

Discovery
Target identification
  ◦ Identify the molecule on which the germ relies for its survival
  ◦ Then develop another molecule, i.e. a drug, which will bind to the target
  ◦ So the germ will not be able to interact with the target
  ◦ Proteins are the most common targets

Discovery (continued)
• For example, HIV produces HIV protease, a protein which in turn cuts other proteins
• HIV protease has an active site where it binds to other molecules
• So an HIV drug will go and bind to that active site
  ◦ Easier said than done!

Discovery (continued)
• Lead compounds are the molecules that bind to the target protein’s active site
• Traditionally, finding them has been a matter of trial and error
• Now this is moving into the realm of computers

Related Computer Technology

PERL
• Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings
• The de facto CGI language
• It started out as a scripting language but has become a fully fledged language
• It has everything now, even web service support
• http://bio.perl.org

The place of XML & Web Services
• Various markup languages (Gene Markup Language etc.) are being created to represent sequence/gene data
• Web services: program-to-program interaction, making the web application-centric as opposed to human-centric
• So they have to be platform and language independent
• Protocols like SOAP help in this regard
• Bioinformatics uses various databases, on different platforms, in different languages, etc.
• So web services help achieve platform independence and program interaction
• Since sequence databases come in various formats and on various platforms, SOAP helps here too

The place of GRID
• GRID: the new kid on the block
• Using many computers to fulfil a single computational task
• Bioinformatics is an ideal application, as it has to deal with large amounts of data in alignments and searches
• E-science initiative in the UK
• Oracle 10g: billed as the world's first grid database

Databases and Mining
• Many of the sequence databases are publicly available
• As databases are involved, various data mining techniques are used to pull the data out
• As there is a lot of literature (articles etc.) in this area, data mining on the literature, rather than on the sequence data, has also become a PhD topic for many

European Molecular Biology Network (EMBnet)
• A central system for sharing, training and centralizing up-to-date bio info
• Some of the EMBnet sites are:
• SEQNET
  ◦ http://www.seqnet.dl.ac.uk
• UCL
  ◦ http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
• EBI (European Bioinformatics Institute)
  ◦ www.ebi.ac.uk

References
• Dan E. Krane and Michael L. Raymer, Basic Concepts of Bioinformatics
• Arthur M. Lesk, Intro to Bioinformatics
• T.K. Attwood and D.J. Parry-Smith, Intro to Bioinformatics
• Dr Patrick Dixon, The Genetic Revolution
• Prof David Gilbert’s site, http://www.brc.dcs.gla.ac.uk/~drg/

Thank You!
