This document discusses the sequencing of the human genome and the role of bioinformatics. It notes that in 2000 the human genome was sequenced through a joint British and American effort, a major event that changed the course of human history. It then describes how bioinformatics uses computational techniques to manage and analyze biological data, making it possible, for example, to compare the genetic material of viruses in order to design medicines. Overall, the document provides a high-level overview of the sequencing of the human genome and an introduction to the field of bioinformatics.
Introduction of the speaker, M.Alroy Mascrenghe, discussing the presentation.
In 2000, a pivotal event occurred as the Human Genome was sequenced through a UK and US collaborative race.
Geneticists compare viral genetic material to develop medicines through protein identification from genetic sequences.
Integration of computer science and molecular biology for biological data management and analysis.
Covers DNA, nucleotides, and chromosomes, explaining genetic information storage and heredity.
Details on proteins, their structure, formation from amino acids, and the central dogma of molecular biology involving DNA to RNA to protein.
Techniques in bioinformatics including pattern recognition, sequence alignment, and scoring methods.
Discussion on various primary and composite databases for storing genetic sequences and the importance of organizing bioinformatics data.
Insights into genomics, gene variations across cell types, and complexities involving the proteome.
The challenges associated with predicting protein structures using various methods including comparative modeling.
Exploration of medical implications, disease diagnosis through genetics, and drug design improvements.
Process of target identification in drug discovery and how bioinformatics is improving traditional methods.
The use of programming languages like PERL, XML for data representation, GRID systems for computation, and network resources for sharing bioinformatics information.
Summarization, references used in the presentation, and a thank you note from the speaker.
M.Alroy Mascrenghe 2
2000
n A Major event happened that was to
change the course of human history
n It was a joint British and American
effort
n nothing to do with IRAQ!
n It was a race – who will complete first
n Race Test – not whether they have
taken drugs but whether they can
produce them!
n Human genome was sequenced
A Situ… somewhere in the near future
n A virus – not the ‘I love you’ virus – creates an epidemic
n Geneticists and bioinformaticians roll up their sleeves
n The genetic material of the virus is compared with the existing base of known genetic material of other viruses
n As the characteristics of the other viruses are known, close matches hint at how the new virus behaves
n From the genetic material, computer programs will derive the proteins necessary for the survival of the virus
n When the protein (sequence and structure) is known, medicines can be designed
What is
n The marriage between computer science and molecular biology
l The algorithms and techniques of computer science are being used to solve the problems faced by molecular biologists
n ‘Information technology applied to
the management and analysis of
biological data’
l Storage and Analysis are two of the
important functions – bioinformaticians
build tools for each
What is..
n This is the age of Information Technology
n However, storing info is nothing new
n Information equal in volume to the Britannica Encyclopedia is stored in each of our cells
n ‘Bioinformatics tries to determine
what info is biologically important’
DNA & Genes
n DNA is where the genetic information is stored
n Traits such as blonde hair and blue eyes are inherited through it
n Gene - The basic unit of heredity
l There are genes for characteristics, e.g. a gene for blond hair
n Genes contain the information as a
sequence of nucleotides
n Genes are abstract concepts – like
longitude and latitudes in the sense that
you cannot see them separately
n Genes are made up of nucleotides
Nucleotide (nt)
n Each nt is made up of
l Sugar
l Phosphate group
l Base
n The base is the only difference between one nt and another
n There are 4 different bases
n G(uanine), A(denine), T(hymine), C(ytosine)
n The information lies in the order of the nucleotides
n Genes can be many thousands of nt long
n The complete set of genetic instructions is called the genome
Chromosomes
n Long strands of DNA make up chromosomes
n Analogy
l Letters – nt
l Sentences – genes
l Individual volumes of Britannica encyclopedia – chromosomes
l All volumes together – Genome
Double Helix
n The DNA is a double helix
n Each strand has complementary information
n Each particular base in one strand is bonded with a particular base in the other strand
l G - C
l A - T
n For example -
l AATGC one strand
l TTACG other strand
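The pairing rule on this slide can be sketched in a few lines of Python. This is only an illustration; the names `PAIRS` and `complement` are made up for this example and do not come from the presentation.

```python
# Base pairing from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


print(complement("AATGC"))  # TTACG
```

Applying the function twice gives back the original strand, which is what "complementary information" means in practice.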
Proteins
n Proteins are a very important biological feature
n Amino acids make up the proteins
n There are 20 different amino acids
n The function of a protein is dependent on the order of the amino acids
Proteins…
n The information required to assemble amino acids (aa) into proteins is stored in DNA
n DNA sequence determines amino acid sequence
n Amino acid sequence determines protein structure
n Protein structure determines protein function
n A substance called RNA carries the info stored in the DNA, which in turn is used to make proteins
n Storage – DNA
n Information Transfer – RNA
n RNA is the messenger boy!
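The DNA to RNA step can be sketched as a simple string operation. This is a minimal sketch that assumes the input is the coding strand, so the RNA message reads the same with U in place of T; the function name `transcribe` is invented for this example.

```python
def transcribe(coding_strand: str) -> str:
    """Write the RNA message for a DNA coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```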
Proteins…..
n Since there are 20 amino acids to encode, one nt cannot correspond to one aa (only 4 possibilities), and pairs of nt are not enough either (only 16)
n So protein information is carried in triplet codes, called codons (64 possibilities)
n The codons that do not correspond to an amino acid are stop codons – UAA, UAG, UGA (RNA has U instead of T)
n AUG is used as the start codon as well as to code for methionine
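The codon idea can be sketched as follows. The codon table here is deliberately partial (five entries of the standard genetic code, chosen for the example), unknown codons are shown as "???", and the function name `translate` is an assumption of this sketch rather than anything from the slides.

```python
# Why triplets: 4 bases give 4, 16 or 64 codes for lengths 1, 2 and 3.
print(4 ** 1, 4 ** 2, 4 ** 3)  # 4 16 64

STOP_CODONS = {"UAA", "UAG", "UGA"}
# Deliberately partial table; a real one has an entry for all 64 codons.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GCC": "Ala", "GGC": "Gly", "AAA": "Lys"}


def translate(rna: str) -> list[str]:
    """Read codons from the first AUG until a stop codon is met."""
    start = rna.find("AUG")
    if start == -1:
        return []  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break
        protein.append(CODON_TABLE.get(codon, "???"))
    return protein


print(translate("GGAUGGCCUUUAAAUAAGG"))  # ['Met', 'Ala', 'Phe', 'Lys']
```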
Protein Structure
n Shows a wide variety as opposed to the
DNA whose structure is uniform
n X-ray crystallography or Nuclear Magnetic
Resonance (NMR) is used to figure out the
structure
n Structure is related to the function or rather
structure determines the function
n Although proteins are created as a linear chain of aa, they fold into a 3D structure.
n If you stretch them and let go, they return to this structure – this is the native structure of a protein
n Proteins function well only in their native structure
n Even after translation is over, a protein goes through some changes to its structure
Gene Expression
n Gene Expression – the process of transcribing DNA into RNA and translating the RNA to make protein
n Where do the genes begin in a chromosome?
n How does the transcription machinery identify the beginning of a gene to make a protein?
n A single nt cannot be taken to point out the beginning of a gene, as each nt occurs frequently
n But a particular combination of nucleotides can
n Promoter sequences – the order of nt which marks the beginning of a gene
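Searching for a promoter-like pattern can be sketched with a regular expression. The pattern `TATAAA` (a TATA-box-like consensus) is an assumption chosen for illustration, as the slides do not name a specific promoter sequence, and `find_promoters` is a hypothetical helper.

```python
import re

# Assumption: "TATAAA" stands in for a promoter pattern; real promoters vary.
PROMOTER_PATTERN = "TATAAA"


def find_promoters(dna: str, pattern: str = PROMOTER_PATTERN) -> list[int]:
    """Return 0-based start positions of every match, overlapping ones included."""
    lookahead = f"(?={re.escape(pattern)})"
    return [m.start() for m in re.finditer(lookahead, dna.upper())]


dna = "GGCTATAAAGGCATGCCTATAAATT"
print(dna.count("A"))       # a single nt occurs far too often to mark anything
print(find_promoters(dna))  # [3, 17]
```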
Prediction and Pattern Recognition
n The two main areas of bioinformatics
are
n Pattern recognition
l ‘A particular sequence or structure has
been seen before’ and that a particular
characteristic can be associated with it
n Prediction
l From a sequence (what we know) we
can predict the structure and function
(what we don’t know)
Dotplots….
n A simple way of evaluating similarity between two sequences
n In a graph, one sequence runs along one axis and the other sequence along the other axis
n Wherever the two sequences match, the graph is marked
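A text-only dotplot can be sketched like this. The two sequences are borrowed from the Alignments slide; the function name `dotplot` and the use of `*` and `.` as marks are choices made for this example.

```python
def dotplot(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per character of seq_y; '*' marks a match with seq_x."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


print("  TTACTATA")
for base, row in zip("TAGATA", dotplot("TTACTATA", "TAGATA")):
    print(base, row)
```

Runs of `*` along a diagonal show stretches where the two sequences agree.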
Alignments
n Matching up the characters of two or more sequences to measure their similarity
n Eg.
l TTACTATA
l TAGATA
n There are many ways to align the above two sequences
l 1.
n TTACTATA
n TAGATA
l 2.
n TTACTATA
n TAGATA
l 3.
n TTACTATA
n TAGATA
n So which one do we choose and on what basis?
n The solution is to assign a match score and a mismatch score, and to prefer the alignment with the best total
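Scoring candidate alignments can be sketched by sliding the shorter sequence along the longer one without gaps. The values +1 and -1 are illustrative, not taken from the slides, and the function names are hypothetical.

```python
MATCH, MISMATCH = 1, -1  # illustrative values only


def score_at_offset(long_seq: str, short_seq: str, offset: int) -> int:
    """Score short_seq placed under long_seq, shifted right by offset, no gaps."""
    window = long_seq[offset:offset + len(short_seq)]
    return sum(MATCH if a == b else MISMATCH for a, b in zip(window, short_seq))


def best_offset(long_seq: str, short_seq: str) -> int:
    """Return the first offset that reaches the highest score."""
    offsets = range(len(long_seq) - len(short_seq) + 1)
    return max(offsets, key=lambda k: score_at_offset(long_seq, short_seq, k))


scores = {k: score_at_offset("TTACTATA", "TAGATA", k) for k in range(3)}
print(scores)  # {0: 0, 1: -2, 2: 0}
```

Two placements tie here, which is one reason gaps are introduced next.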
Gaps
n Introduce gaps and a penalty
score for gaps
n TTACTATA
n T_A_GATA
n In gap scoring, a single indel (insertion or deletion) which is two characters long is preferred to two indels which are each one character long
n However not all gaps are bad
l TTGCAATCT
l CAA
l How do we align?
l ---CAA---
l These end gaps are not biologically significant
l Semi-global alignments leave them unpenalised
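One common way to make a single long gap cheaper than several short ones is an affine penalty: a large cost to open a gap and a small cost to extend it. The sketch below assumes penalties of -5 and -1, which are hypothetical values, and treats both `_` and `-` as gap characters because the slide uses both.

```python
import re

GAP_OPEN, GAP_EXTEND = -5, -1  # hypothetical penalties


def gap_penalty(length: int) -> int:
    """Affine penalty: opening a gap costs more than extending it."""
    return 0 if length == 0 else GAP_OPEN + GAP_EXTEND * (length - 1)


def total_gap_penalty(aligned: str, free_end_gaps: bool = False) -> int:
    """Sum the penalty of every gap run; optionally ignore leading/trailing gaps."""
    if free_end_gaps:  # semi-global behaviour
        aligned = aligned.strip("-_")
    return sum(gap_penalty(len(run)) for run in re.findall(r"[-_]+", aligned))


print(total_gap_penalty("T_A_GATA"))   # -10: two indels of length one
print(total_gap_penalty("TA__GATA"))   # -6: one indel of length two
print(total_gap_penalty("---CAA---", free_end_gaps=True))  # 0
```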
Scoring Matrix
n For DNA/protein sequence alignment we create a matrix of scores for every pair of characters
n If A is aligned with A, the score is 1
n If A is aligned with T, the score is -5
n If A is aligned with C, the score is -1
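A scoring matrix can be sketched as a dictionary keyed by pairs of bases. Only the three values for A/A, A/T and A/C come from the slide; every other entry is filled with an assumed default (+1 for a match, -1 for a mismatch), and the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# These three scores are from the slide; all other pairs use assumed defaults.
KNOWN = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}


def build_matrix(default_match: int = 1, default_mismatch: int = -1) -> dict:
    """Build a symmetric score lookup for every ordered pair of bases."""
    matrix = {}
    for x, y in product(BASES, repeat=2):
        score = KNOWN.get((x, y), KNOWN.get((y, x)))
        if score is None:
            score = default_match if x == y else default_mismatch
        matrix[(x, y)] = score
    return matrix


def score_columns(seq_a: str, seq_b: str, matrix: dict) -> int:
    """Add up the matrix score of each aligned column (no gaps)."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


matrix = build_matrix()
print(score_columns("AAT", "ATC", matrix))  # 1 + (-5) + (-1) = -5
```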
Dynamic Programming
n As the length of the query sequences increases, and the difference in length between the two sequences also increases, more gaps have to be inserted in various places
n We cannot perform an exhaustive search
n Combinatorial explosion occurs – too many combinations to search
n Dynamic programming avoids this by building the best alignment from the best alignments of shorter pieces, so each sub-problem is solved only once
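As a concrete instance, the sketch below fills a Needleman-Wunsch style table for global alignment. The slides only say "dynamic programming", so the choice of this particular algorithm, the linear gap penalty and the score values are all assumptions of the example. It returns the best score only, not the alignment itself.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    # Each cell reuses the three neighbouring sub-problems already solved.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                              table[i - 1][j] + gap,       # gap in b
                              table[i][j - 1] + gap)       # gap in a
    return table[-1][-1]


print(global_alignment_score("AATGC", "AATGC"))  # 5
print(global_alignment_score("AC", "AGC"))       # 0
print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(a)+1) × (len(b)+1) cells, so the work grows with the product of the lengths instead of exploding combinatorially.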
Databases
n Sequence info is stored in databases
n So that it can be manipulated easily
n The databases (next slide) are located at different places
n They exchange info on a daily basis so that they are up-to-date and in sync
n Primary db – sequence data
Major Primary DB
Nucleic Acid     Protein
EMBL (Europe)    PIR - Protein Information Resource
GenBank (USA)    MIPS
DDBJ (Japan)     SWISS-PROT (University of Geneva, now with EBI)
                 TrEMBL (a supplement to SWISS-PROT)
                 NRL-3D
Composite DB
n As there are many db, which one to search? Some are good in some aspects and weak in others
n A composite db is the answer – it has several db for its base data
n Searching these db is indexed and streamlined so that the same stored sequence is not searched twice in different db
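The "not searched twice" idea can be sketched by merging several sources into one non-redundant collection keyed by sequence. The records and identifiers below are made up, and `build_composite` is a hypothetical helper; real composite databases apply more elaborate redundancy rules.

```python
def build_composite(*databases: dict[str, str]) -> dict[str, list[str]]:
    """Map each distinct sequence to the identifiers of the records holding it."""
    composite: dict[str, list[str]] = {}
    for db in databases:
        for identifier, sequence in db.items():
            composite.setdefault(sequence, []).append(identifier)
    return composite


# Made-up records standing in for two source databases.
db_a = {"a1": "MKT", "a2": "GAV"}
db_b = {"b1": "MKT"}
composite = build_composite(db_a, db_b)
print(len(composite))    # 2 distinct sequences from 3 records
print(composite["MKT"])  # ['a1', 'b1']
```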
Secondary db
n Store secondary structure info or results of searches of the primary db
Secondary DB    Primary Source
PROSITE         SWISS-PROT
PRINTS          OWL
Database Searches
n We have sequenced and identified genes, so we know what they do
n The sequences are stored in databases
n So if we find a new gene in the human genome we compare it with the already found genes which are stored in the databases
n Since the databases hold a large number of sequences, we cannot run a full sequence alignment against each and every one
n So heuristics must be used to narrow the search
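One such heuristic, in the spirit of seed-based search tools, is to index short words (k-mers) of every stored sequence and align the query only against entries that share at least one word with it. The database contents, the word length of 3 and the function names below are assumptions made for the sketch.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-letter word to the names of the sequences containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query: str, index: dict[str, set[str]], k: int) -> dict[str, int]:
    """Count, per stored sequence, how many distinct query words it shares."""
    hits: dict[str, int] = defaultdict(int)
    for kmer in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for name in index.get(kmer, ()):
            hits[name] += 1
    return dict(hits)


database = {"geneA": "TTACTATA", "geneB": "GGGGCCCC"}  # made-up entries
index = build_kmer_index(database, k=3)
print(candidate_hits("TAGATA", index, k=3))  # {'geneA': 1}
```

Only the candidates returned here would then go through a full alignment.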
Genomics
n In a multicellular organism, each cell type carries out gene expression in a different way, although each cell has the same content as far as the genetic information is concerned
n i.e. all the information for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them
Genomics - Finding Genes
n Gene in sequence data – needle in a haystack
n However, whereas a needle looks different from the hay, genes do not look different from the rest of the sequence data
n In a whole array of nt we try to find a set of nt and mark its borders as a gene
n This is one of the challenges of bioinformatics
n Neural networks and dynamic programming are being employed
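A very naive way to mark candidate gene borders is to look for open reading frames: a start codon followed, in the same frame, by a stop codon. This sketch scans only the given strand and ignores everything that makes real gene finding hard; it is not the neural-network or dynamic-programming approach the slide mentions, and `find_orfs` and `min_codons` are hypothetical names.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) slices of ATG...stop stretches in the three frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # remember where the frame opens
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return sorted(orfs)


dna = "CCATGGCCTTTTAAGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])  # 2 14 ATGGCCTTTTAA
```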
Organism        Genome Size (Mb; 1 Mb = 1,000,000 bp)    Gene Number    Web Site
Yeast           13.5                                      6,241          http://genome-www.stanford.edu/Saccharomyces
Fruit Flies     180                                       13,601         http://flybase.bio.indiana.edu
Homo Sapiens    3,000                                     45,000         http://www.ncbi.nlm.nih.gov/genome/
Proteomics
n Proteome is the sum total of an organism's proteins
n More difficult than genomics
l DNA has 4 building blocks, proteins have 20
l DNA has a simple chemical makeup, proteins a complex one
l DNA can be duplicated, proteins can't
n We are entering into the ‘post genome era’
n Meaning much has been done with the genes – not that it is over
Proteomics…..
n The relationship between an RNA and the protein it codes for is usually far from direct
n After translation proteins do change
l So the aa sequence does not tell anything about the post translation changes
n Proteins are not active until they are combined into a larger complex or moved to a relevant location inside or outside the cell
n So the aa sequence only hints at these things
n Also proteins must be handled more carefully in labs as they tend to change when in touch with an inappropriate material
Protein Structure Prediction
n One of the biggest challenges of bioinformatics and esp. biochemistry
n There is no algorithm yet that consistently predicts the structure of proteins
Structure Prediction methods
n Comparative Modeling
l The target protein's structure is inferred by comparison with related proteins
l Proteins with similar sequences are searched for known structures
Phylogenetics
n The taxonomical system reflects evolutionary relationships
n Phylogenetic trees reflect the evolutionary relationships as a picture/graph
n Rooted trees have a single common ancestor
n Unrooted trees just show the relationships
n Phylogenetic tree reconstruction algorithms are also an area of research
Medical Implications
n Pharmacogenomics
l Not all drugs work on all patients; some good drugs cause death in some patients
l So by doing a gene analysis before the treatment the harmful drugs can be avoided
l Also drugs which cause death to most can be used on a minority to whose genes that drug is well suited – volunteers wanted!
l Customized treatment
n Gene Therapy
l Replace or supply the defective or missing gene
l E.g. Insulin, and Factor VIII for Haemophilia
n BioWeapons (??)
Diagnosis of Disease
n Diagnosis of disease
l Identification of the genes which cause a disease will help detect it at an early stage, e.g. Huntington's disease
n Symptoms – uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
n Death in 10-15 years
n The gene responsible for the disease has been identified
n It contains excessively repeated sections of CAG
n So once analyzed, the couple can be counseled
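Counting the repeated CAG sections can be sketched with a regular expression. The sequence below is made up, the function name `longest_cag_run` is hypothetical, and no clinical threshold is implied; interpreting a repeat count is a medical question outside this sketch.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Return the largest number of consecutive CAG repeats in the sequence."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


print(longest_cag_run("TTCAGCAGCAGTTCAGAA"))  # 3
print(longest_cag_run("TTTT"))                # 0
```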
Drug Design
n Can take up to 15 years and $700 million
n One of the goals of bioinformatics is to reduce the time and cost involved
n The process
l Discovery
n Computational methods can improve this
l Testing
Discovery
Target identification
l Identifying the molecule on which the germ relies for its survival
l Then we develop another molecule, i.e. a drug, which will bind to the target
l So the germ will not be able to interact with the target
l Proteins are the most common targets
Discovery…
n For example, HIV produces HIV protease, which is a protein and which in turn cuts other proteins
n This HIV protease has an active site where it binds to other molecules
n So an HIV drug will go and bind to that active site
l Easier said than done!
Discovery…
n Lead compounds are the
molecules that go and bind to
the target protein’s active site
n Traditionally this has been a trial
and error method
n Now this is being moved into the
realm of computers
PERL
n Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings
n The default CGI language
n It started out as a scripting language but has become a fully fledged language
n It has everything now, even web service support
n http://bio.perl.org
The place of XML & Web Services
n Various markup languages are being created – Gene Markup language etc – to represent sequence/gene data
n Web Services – program to program interaction, making the web application centric as opposed to human centric
n So this has to be platform and language independent
n Protocols like SOAP help in this regard
n In bioinformatics various databases, platforms, languages etc are being used
n So web services help achieve platform independence and program interaction
n Since sequence databases are in various formats and on various platforms, SOAP also helps in this regard
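How sequence data might travel between programs as XML can be sketched with Python's standard library. The element and attribute names (`sequence`, `residues`, `id`, `organism`) are invented for this example and do not follow any real gene markup schema; the functions are hypothetical helpers.

```python
import xml.etree.ElementTree as ET


def sequence_to_xml(identifier: str, organism: str, residues: str) -> str:
    """Serialise one sequence record as a small XML document (made-up tags)."""
    record = ET.Element("sequence", attrib={"id": identifier, "organism": organism})
    ET.SubElement(record, "residues").text = residues
    return ET.tostring(record, encoding="unicode")


def xml_to_residues(document: str) -> str:
    """Read the residues back, as a receiving program on any platform could."""
    return ET.fromstring(document).findtext("residues")


document = sequence_to_xml("demo1", "Homo sapiens", "AATGC")
print(document)
print(xml_to_residues(document))  # AATGC
```

Because the record is plain text with a declared structure, the sender and receiver need not share a platform or a programming language.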
The place of GRID
n GRID - new kid on the block
n Using many computers to fulfill a single computational task
n Bioinformatics is an ideal application as it has to deal with a large amount of data in alignment and searches
n E-science initiative in the UK
n ORACLE 10g – the world's first GRID database
Databases and Mining
n Many of the sequence databases are publicly available
n As there is a DB involved, various data mining techniques are used to pull the data out
n As there is a lot of literature – articles etc – in this area, data mining on the literature – not on the sequence data – has also become a PhD topic for many
European Molecular Biology Network (EMBnet)
n A central system for sharing, training and centralizing up to date bio info
n Some of the EMBnet sites are:
n SEQNET
l http://www.seqnet.dl.ac.uk
n UCL
l http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
n EBI – European Bioinformatics Institute
l www.ebi.ac.uk
References
n Dan E. Krane and Michael L. Raymer
l Basic Concepts of Bioinformatics
n Arthur M Lesk
l Intro to Bioinformatics
n T.K. Attwood & D. J. Parry-Smith
l Intro to Bioinformatics
n Dr Patrick Dixon
l The Genetic Revolution
n Prof David Gilbert’s Site
l http://www.brc.dcs.gla.ac.uk/~drg/