Introduction to Genomics,
Proteomics
Genomics and Proteomics
• The field of genomics deals with the DNA sequence,
organization, function, and evolution of genomes
• Proteomics aims to identify all the proteins in a cell or
organism including any posttranslationally modified
forms, as well as their cellular localization, functions,
and interactions
• Genomics was made possible by the invention of
techniques of recombinant DNA, also known as gene
cloning or genetic engineering
Genetic Engineering
• In genetic engineering, the immediate goal of an
experiment is to insert a particular fragment of
chromosomal DNA into a plasmid or a viral DNA
molecule
• This is accomplished by breaking DNA molecules at
specific sites and isolating particular DNA fragments
• DNA fragments are usually obtained by the treatment of
DNA samples with restriction enzymes
• Cloning from mRNA molecules depends on an unusual
polymerase, reverse transcriptase, which can use a
single-stranded RNA molecule as a template and
synthesize a complementary DNA (cDNA)
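At the sequence level, the reverse transcriptase step can be sketched as below. This is a minimal illustration only: the function name and the example mRNA are made up here, and real first-strand synthesis also involves primers and enzyme behaviour that a string operation does not model.

```python
# DNA base that pairs with each RNA base.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna: str) -> str:
    """Return the cDNA strand complementary to an mRNA, written 5'->3'."""
    # The template is read 3'->5', so the complement is built from the reversed mRNA.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))

# Hypothetical nine-base mRNA, used only to show the call.
print(first_strand_cdna("AUGGCCUAA"))
```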
cDNA Cloning
• The resulting full-length cDNA contains a coding
sequence for the protein of interest that is
uninterrupted by introns
• If the DNA sequence at both ends of the cDNA is
known, so that appropriate primers can be designed,
the cDNA produced by reverse transcriptase can be
amplified by reverse transcriptase PCR (RT-PCR)
Bioinformatics
• Rapid automated DNA sequencing was instrumental in the success of
the Human Genome Project, an international effort begun in 1990 to
sequence the human genome and that of a number of organisms
• However, a genomic sequence is like a book using an alphabet of
only four letters, without spaces or punctuation. Identifying genes and
their functions is a major challenge
• The annotation of genomic sequences at this level is one aspect of
bioinformatics, defined broadly as the use of computers in the
interpretation and management of biological data
THE “POST-GENOMICS” ERA
Goal: to understand the living cell
• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics
What’s next?
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
Annotation
Annotation
Identify the genes within a
given sequence of DNA
Identify the sites
Which regulate the gene
Predict the function
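One very simple way to start identifying candidate genes is to scan for open reading frames. The sketch below is an assumption-laden toy: it only looks at the forward strand, treats ATG as the only start codon, and uses a made-up helper name and test sequence. Real gene finders use far richer models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon.
    min_codons counts the codons before the stop, start codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Hypothetical sequence with one short ORF in frame 2.
example = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```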
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT
.................................
.............. TGAAAAACGTA
TF binding site
promoter
Ribosome binding Site
ORF=Open Reading Frame
CDS=Coding Sequence
Transcription Start Site
Comparative genomics
Human ATAGCGGGGGGATGCGGGCCCTATACCC
Chimp ATAGGGG--GGATGCGGGCCCTATACCC
Mouse ATAGCG---GGATGCGGCGC-TATACCA
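A comparison like this is often summarised by counting identical columns. The sketch below does that for the three rows on the slide, with the spacing inside the gaps removed; the function name is an assumption, and counting gap columns as non-identical is only one of several common conventions.

```python
def identical_columns(row_a: str, row_b: str) -> int:
    """Count alignment columns where both rows carry the same base (gaps never count)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")

# Rows from the slide, with the spaces around the gap characters removed.
human = "ATAGCGGGGGGATGCGGGCCCTATACCC"
chimp = "ATAGGGG--GGATGCGGGCCCTATACCC"
mouse = "ATAGCG---GGATGCGGCGC-TATACCA"

for name, row in (("chimp", chimp), ("mouse", mouse)):
    same = identical_columns(human, row)
    print(f"human vs {name}: {same}/{len(human)} identical columns")
```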
Structural genomics
Functional Genomics
• Genomic sequencing has made possible a new approach to genetics called functional
genomics, which focuses on genome-wide patterns of gene expression and the
mechanisms by which gene expression is coordinated
• DNA microarray (or chip) - a flat surface about the size of a postage stamp with up to
100,000 distinct spots, each containing a different immobilized DNA sequence suitable
for hybridization with DNA or RNA isolated from cells growing under different conditions
• DNA microarrays are used to estimate the relative level of gene expression of each gene
in the genome
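Relative expression from a two-sample hybridization is commonly reported as a log2 ratio of spot intensities. The sketch below assumes background-corrected intensities are already available; the gene names and numbers are invented for illustration, and normalisation steps used in practice are left out.

```python
import math

def log2_ratio(sample_intensity: float, reference_intensity: float) -> float:
    """log2 of the sample/reference signal for one spot; 0 means no change."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(sample_intensity / reference_intensity)

# Hypothetical spot intensities: gene -> (treated, control).
spots = {"geneA": (1200.0, 300.0), "geneB": (150.0, 600.0), "geneC": (500.0, 500.0)}
for gene, (treated, control) in spots.items():
    print(gene, round(log2_ratio(treated, control), 2))
```

Positive values indicate higher signal in the treated sample, negative values lower signal.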
Assigning the structures of all proteins
Protein-ligand complexes
Functional sites
Fold
Evolutionary relationship
Shape and electrostatics
Active sites
protein complexes
Biologic processes
Origin of “Genomics”: 1987
“For the newly developing discipline of [genome] mapping/sequencing
(including the analysis of the information), we have adopted the term
GENOMICS… The new discipline is born from a marriage of molecular and cell
biology with classical genetics and is fostered by computational science.”
- McKusick and Ruddle, A new discipline, a new name, a new journal,
Genomics, Vol. 1, No. 1. (September 1987), pp. 1-2
What is genomics?
“Genomics is a discipline in genetics that applies
recombinant DNA, DNA sequencing methods, and
bioinformatics to sequence, assemble, and analyze the
function and structure of genomes (the complete set of
DNA within a single cell of an organism).”
-
What is genomics?
“Research of single genes does not fall into the definition
of genomics unless the aim of this genetic, pathway, and
functional information analysis is to elucidate its effect on,
place in, and response to the entire genome's networks.”
-
Central Dogma of Biology
http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm
What can genomics tell us?
DNA Sequence
Gene Sequence
Protein/Gene Function
Protein Sequence
Regulatory Sequence
Gene Expression
DNA Variation
Human disease
How do we study genomics?
1. Isolate nucleic acid molecules from biological samples
2. Determine nucleotide sequence using biochemical
techniques
3. Digitize nucleotide sequence
4. Examine digital sequences to identify patterns with
algorithms and statistics
5. Relate patterns to biological observations by:
○ Comparing patterns detected across many samples
○ Manipulating a system to see how patterns change
Sequencing Techniques
Sanger sequencing – fluorescent-labeled DNA fragments
Sequencing by synthesis, NGS
Adapted from http://web.uri.edu/gsc/next-generation-sequencing/
NUS-KI Course on Bioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
2- Next Generation
Sequencing (NGS)
Sequence Analysis
● Assembly – putting short sequences together to
reconstruct a longer, source sequence
● Mapping – locating where one short sequence is found
in a longer sequence
● Pattern recognition – looking for specific patterns
within sequences that have special meaning
In each of these cases, sequences are aligned to one
another
Sequence Alignment
● Provides a measure of relatedness
● Alignment quantified by similarity (% identity)
● Useful for any sequential data type:
○ DNA/RNA
○ Amino acids
○ Protein secondary structure
● High sequence similarity might imply:
○ Common evolutionary history
○ Similar biological function
What Alignments Can Tell Us
● Homology - Orthologs, Paralogs
● Genomic identity/origin of a
sequence/individual
● Genome/gene structure
○ Genic structure (exons, introns, etc)
○ RNA 2D structure
○ Chromosome rearrangements/3D structure
DNA Sequence Alignment Example
Sequence 1  ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
Sequence 2  ATACCATAAGCGAG
Alignment 1 ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
            --------------ATACCA-TAAGCGAG----
Alignment 2 ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
            ATAC-CA--------------TAAGCGAG----
Alignment 3 ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
            ATAC-CA-TA--AG---C--G--AG--------
Column types: match, gap, mismatch
Scoring/Substitution Matrices
● Given alignment, how “good” is it?
● Higher score = better alignment
● Implicitly represent evolutionary patterns
A C G T -
A 2 -3 -1 -3 -3
C -3 2 -3 -1 -3
G -1 -3 2 -3 -3
T -3 -1 -3 2 -3
- -3 -3 -3 -3 NA
ATACCAGTAAGGGAG
ATACCA-TAAGAGAG
Score = 22
ATACCAGTAAGG-GAG
ATACCA-TAAG-AGAG
Score = 17
ATACCA-GTAAGGGAG
A-TACCATAAGAGAG-
Score = -20
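Scoring an alignment with a matrix like this is a column-by-column sum. The sketch below copies the substitution scores from the slide and can be used to check the scores quoted for the three example alignments; the function and constant names are assumptions made for this example.

```python
BASES = "ACGT"
# Substitution scores copied from the slide; rows and columns in A, C, G, T order.
MATRIX = [
    [2, -3, -1, -3],
    [-3, 2, -3, -1],
    [-1, -3, 2, -3],
    [-3, -1, -3, 2],
]
GAP = -3  # score for a base aligned to '-'

def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the per-column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            total += GAP
        else:
            total += MATRIX[BASES.index(a)][BASES.index(b)]
    return total

examples = [
    ("ATACCAGTAAGGGAG", "ATACCA-TAAGAGAG"),
    ("ATACCAGTAAGG-GAG", "ATACCA-TAAG-AGAG"),
    ("ATACCA-GTAAGGGAG", "A-TACCATAAGAGAG-"),
]
for top, bottom in examples:
    print(top, bottom, alignment_score(top, bottom), sep="\n")
```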
Sequence Alignment Algorithms
● Global alignments - beginning and end of
both sequences must align
● Local alignments - one sequence may align
anywhere within the other
● Multiplicity:
○ Pairwise alignments (2 sequences)
○ Multiple sequence alignment (3+ sequences)
Global Alignment
Both sequences are aligned from end to end
AAANTAIYYDPNPDMP
A--NTAI-YDPN--M-
AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM
A-----N------T-----------AI-YD--P------------N----M
Interior sequences are aligned as well as possible
However, sequences of vastly different length can produce
meaningless alignments
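A common way to compute a global alignment is Needleman-Wunsch dynamic programming. The slide does not name an algorithm, so this is a sketch of one standard choice, with a simple match/mismatch/gap scheme chosen as an assumption rather than a substitution matrix.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of a aligned entirely to gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned entirely to gaps
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner so both ends are included.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

# The two short peptide strings from the slide.
print(*needleman_wunsch("AAANTAIYYDPNPDMP", "ANTAIYDPNM"), sep="\n")
```

Because ties in the traceback are broken in a fixed order, the printed alignment is one of possibly several with the same optimal score.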
Local Alignment
Alignment may begin and end at any position
AAANTAIYYDPNPDMP
-AANTAI-YDPN--M-
AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM
---------------------AANTAI-YDPN--M----------------
Local alignment may produce better alignments when
sequence lengths differ greatly
LOCAL ALIGNMENT
SMITH-WATERMAN
•BEST SCORE FOR ALIGNING PART OF SEQUENCES
• OFTEN BEATS GLOBAL ALIGNMENT SCORE
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Global Alignment
Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
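The Smith-Waterman idea can be sketched in a few lines: the matrix is filled like a global alignment, but cell scores are floored at zero and the traceback starts at the best-scoring cell. The scoring values below (match 2, mismatch -1, gap -2) are assumptions for illustration, not values from the slides.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment: return (score, aligned_a, aligned_b) for the best-scoring region."""
    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Flooring at 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# The two sequences shown on the slide, with the gap characters removed.
print(*smith_waterman("CATATTGCAGTGGTCCCGCGTCAGGCT", "TAAATTGCGTGGTCGCACTGCACGCT"), sep="\n")
```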
GLOBAL VS. LOCAL ALIGNMENT
DOROTHY
DOROTHY
HODGKIN
HODGKIN
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Local alignment:
Multiple Sequence Alignment
Like pairwise alignment, but with N sequences
Sequence consensus among many species suggests
evolutionary pressure
Alignment Examples
Example: Genome Assembly
If your genome were a book that had
its sentences chopped into
fragments, assembly would be analogous to
reconstructing all the sentences.
We need multiple copies of each book
(genome) to arrive at a consensus text
(DNA sequence) of the original
Great explanation of DNA sequence assembly: http://gcat.davidson.edu/phast/
Example: Genome Assembly
An error?
A polymorphism?
A different allele?
Incorrect alignment?
Greedy approach: take most frequent nucleotide at each aligned position
Great explanation of DNA sequence assembly: http://gcat.davidson.edu/phast/
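The greedy consensus rule can be sketched as a per-column majority vote. The reads below are invented and assumed to be already aligned and padded with '-' where they do not cover a position; reporting 'N' for uncovered columns is a choice made for this example.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority base per column; '-' marks positions a read does not cover."""
    width = max(len(read) for read in aligned_reads)
    result = []
    for col in range(width):
        bases = [r[col] for r in aligned_reads if col < len(r) and r[col] != "-"]
        # With no coverage at all, report N rather than guessing a base.
        result.append(Counter(bases).most_common(1)[0][0] if bases else "N")
    return "".join(result)

# Hypothetical aligned reads; the second read disagrees at the fourth column.
reads = ["ACGTAC--", "ACGAACGT", "--GTACGT"]
print(consensus(reads))
```

A majority vote cannot tell a sequencing error from a real polymorphism, which is the point of the questions on this slide.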
Example: Exon Microarray Probes
● Microarray probes are short single-stranded DNA
sequences from a reference genome
● Exon Microarrays have probes only from exons
● Exon probes must map to the correct exon, BUT
● Probes must NOT map anywhere else, they must be
unique in the genome
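The uniqueness requirement can be checked, in its simplest form, by counting exact matches of a probe on both strands. This sketch uses a toy genome string and hypothetical function names; real probe design also has to consider near-matches that could cross-hybridize, which exact matching ignores.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_occurrences(genome: str, probe: str) -> int:
    """Count exact matches of probe in genome, including overlapping ones."""
    count, start = 0, genome.find(probe)
    while start != -1:
        count += 1
        start = genome.find(probe, start + 1)
    return count

def is_unique_probe(genome: str, probe: str) -> bool:
    """True if the probe matches exactly once when both strands are considered."""
    hits = count_occurrences(genome, probe)
    rc = reverse_complement(probe)
    if rc != probe:  # a palindromic probe would otherwise be counted twice
        hits += count_occurrences(genome, rc)
    return hits == 1

# Toy genome in which GATTACA occurs twice.
genome = "GATTACAGGCCTTGATTACA"
for probe in ("CAGGCC", "GATTACA"):
    print(probe, is_unique_probe(genome, probe))
```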
Example: mRNA-Seq Analysis
1. Start with a pool of mRNA molecules
2. Millions of DNA sequences 30-150 nucleotides long
3. Find all locations where sequences map in genome
4. Count the number of sequences that map to individual regions (e.g. genes)
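The counting step can be sketched as follows, assuming the reads have already been mapped and only their start coordinates are kept. The gene coordinates and read positions are invented, and assigning a read by its start position alone is a simplification.

```python
from collections import Counter

# Hypothetical gene coordinates: name -> (start, end), 0-based, end exclusive.
GENES = {"geneA": (100, 500), "geneB": (800, 1200)}

def count_reads_per_gene(read_starts, genes=GENES):
    """Count mapped reads whose start coordinate falls inside each gene."""
    counts = Counter({name: 0 for name in genes})
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return dict(counts)

# Hypothetical mapped read start positions; 500 and 50 fall outside both genes.
print(count_reads_per_gene([120, 130, 499, 500, 900, 50]))
```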
Example: DNA Binding Site Discovery
Identify genomic regions where a particular
TF is bound across the entire genome
By extracting and aligning the DNA
sequence corresponding to these binding
events, we can identify which DNA
sequences this TF tends to bind
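Once bound sequences are extracted and aligned, the binding preference is often summarised as a position frequency matrix. The sketch below assumes equal-length, already aligned sites; the four example sites are invented for illustration and do not come from any dataset.

```python
def position_frequency_matrix(sites):
    """Per-position base counts for equal-length, already aligned binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site):
            pfm[base][pos] += 1
    return pfm

# Hypothetical aligned binding sites.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
for base, counts in position_frequency_matrix(sites).items():
    print(base, counts)
```

Columns where one base dominates are the positions the factor appears to care about most.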
Human Genomics and
Gene Expression
The Human Genome Project
● Planning begins 1984, launched 1990,
“completed” 2001, “finished” 2004
● Championed by Dr. Charles DeLisi
● Overview of the Human Genome Project:
http://www.genome.gov/12011238
Human Genome Composition
● Key findings:
○ ~20k genes
○ More segmental duplications than expected
○ Fewer than 7% of protein families are vertebrate-specific
○ ~3% of sequence codes for protein coding genes
○ >85% of the genome is transcribed
○ Repetitive elements may comprise >66% of genome
How The Genome Was Determined
International Human Genome Sequencing Consortium
● Fragment DNA with restriction enzymes
● Ligate fragments into bacterial artificial
chromosomes (BACs)
● Amplify BACs with tagged DNA fragments
● Fragment isolated BAC vectors
● Sequence via Sanger-style sequencing to 4x coverage
● Finished draft genome in ~10 years
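Coverage is usually computed as total sequenced bases divided by genome size. The sketch below also gives the simple Poisson estimate of the fraction of bases left unsequenced at a given coverage. The read count, read length and genome size are round illustrative numbers, not figures from the project.

```python
import math

def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size

def expected_fraction_uncovered(cov: float) -> float:
    """Poisson estimate of the fraction of bases with zero reads: exp(-coverage)."""
    return math.exp(-cov)

# Illustrative assumption: 24 million reads of 500 bases over a 3 Gb genome.
c = coverage(24_000_000, 500, 3_000_000_000)
print(c, round(expected_fraction_uncovered(c), 4))
```

The Poisson estimate assumes reads land uniformly at random, which real libraries only approximate.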
How The Genome Was Determined
Celera Genomics: shotgun sequencing
● Used public BAC contigs from the
Human Genome Project and their own
● Much shorter DNA reads, assembled
later in silico using the HGP BAC clones
as a scaffold
● Finished draft genome in ~3 years
The Genome Is All About Genes
● Genic sequences
● What do our genes do?
● How are genes controlled?
● What genes are different between humans?
● How are genes associated with disease?
Gene Expression
Gene Expression
“Gene expression is the process by which
information from a gene is used in the
synthesis of a functional gene product.”
- Wikipedia
But What Is A Gene?
● A specific DNA sequence
● A fundamental unit of inheritance
● A molecule created by transcription of an
RNA product (then translated into a protein)
which has a function
● A “gene” is an abstract concept
But What Is A Gene?
● DNA?
● RNA?
● Protein?
● Informational molecule?
● Functional molecule?
Yes, all of them
What Is Gene Expression?
● Active mRNA transcription?
● mRNA abundance?
● mRNA translation?
● RNA function?
● Protein abundance?
● Protein function?
Yes, all of them
The Gene Expression Landscape
● mRNA - protein coding genes
● Functional non-coding RNA (ncRNA) biotypes:
○ microRNA (miRNA)/small interfering RNA (siRNA)
○ Long (intergenic) non-coding RNA (lncRNA/lincRNA)
○ Ribosomal RNA (rRNA)
○ Transfer RNA (tRNA)
○ Many more (30+)
● Antisense: transcript initiated from TSS
in opposite direction of primary gene
● Pseudogenes
How We Measure Gene Expression
● mRNA transcription/translation
○ Fluorescent tagging + microscopy
○ ribosomal capture
● mRNA abundance
○ Northern blots
○ Quantitative polymerase chain reaction (qPCR)
○ Microarrays
○ High-throughput sequencing
How We Measure Gene Expression
● Protein abundance
○ Western blots
○ Fluorescent tagging + microscopy
○ Mass spectrometry
○ Protein arrays
● mRNA/Protein localization
○ Fluorescent tagging + microscopy
mRNA Measurement Considerations
● Most mRNA quantification techniques
measure steady state abundance
● mRNA measurements are snapshots
○ Measure large populations of cells to quantify
“average” abundance
● Poor concordance between mRNA and
corresponding protein abundance
The Holy Grail of bioinformatics
...to be able to understand the words in a sequence sentence
that form a particular protein structure
In silico function prediction
…a reality check
• What is the function of this
structure?
• What is the function of this sequence?
• What is the function of this motif?
– the fold provides a scaffold, which can be
decorated in different ways by different
sequences to confer different functions -
knowing the fold & function allows us to
rationalise how the structure effects its
function at the molecular level
How Is It Possible?
◼ The structure of a protein is uniquely determined by
its amino acid sequence
(but sequence is sometimes not enough):
◼ prions
◼ pH, ions, cofactors, chaperones
◼ Structure is conserved much longer than sequence in
evolution.
◼ Structure > Function >> Sequence
How Often Can We Do It?
◼ There are currently ~47000 structures in the PDB (but
only ~4000 if you include only ones that are not more
than 30% identical and have a resolution better than
3.0 Å).
◼ An estimated 25% of all sequences can be modeled
and structural information can be obtained for ~50%.
Protein Basics:
Proteins are macromolecules
Amino acids are the basic building blocks of proteins
Amino Acids are classified by properties: polar, nonpolar,
and charged (ionic)
Polypeptides are constructed by condensation reactions
with amino acids
Four Levels of
Protein Structure
Four Levels of Protein Structure
Different Levels of Protein Structure
Protein function depends on
specific conformation (shape)
There are four levels of protein
structure.
The primary structure is the
linear sequence of amino acids.
What determines this sequence?
Where in the cell are amino acids
joined this way?
Four Levels of Protein
Structure
◼ Primary Structure:
Linear Sequence of Amino Acids
Each amino acid has a
central carbon linked to
---hydrogen (H)
---amino group (NH2)
---acid group (COOH)
---unique group (R)
The carboxyl group of one amino acid is linked
to the amino group of the next amino acid.
Amino acids are linked together by covalent peptide bonds
(Fig. 4-1)
Proteins are made up of a polypeptide backbone with
attached side chains
(Fig. 4-2)
Schematic amino acid R groups
A Ala
C Cys
D Asp
E Glu
F Phe*
G Gly
H His*
I Ile*
K Lys*
L Leu*
M Met*
N Asn
P Pro
Q Gln
R Arg*
S Ser
T Thr*
V Val*
W Trp*
Y Tyr
[Figure legend: atom types C, N, O, S]
The secondary structure of a
protein depends on hydrogen
bonding between C=O and N-H
groups.
Alpha helix, beta sheets, turns
and loops
Four Levels of Protein Structure
◼ Secondary Structure:
Polypeptide folding into α helix, β sheet, or
random coil (H bonds involved)
Secondary structure of proteins – α helix
A hydrogen bond forms between the N-H of each peptide bond and the C=O of the peptide bond
four residues away in the same chain. R groups are not involved.
(e.g. in the protein α-keratin - abundant in skin, hair, nails and horns)
[Fig. 4-10, p. 128]
Secondary structure of proteins – β sheet
Polypeptide chains are held together by H bonds between N-H group of one polypeptide chain
and C=O group of the other chain
(e.g. in the protein fibroin - abundant in silk) [Fig. 4-10, p. 128]
α helices can wrap around one another by interactions between their
hydrophobic side chains to form a stable coiled-coil. [Fig. 4-16]
e.g. α-keratin in the skin and myosin in muscles
Tertiary structure is determined by the interactions
between the side chains (R groups)
List these types of
interactions and
which ones are
weak or strong
Four Levels of Protein Structure
◼ Tertiary Structure:
Three dimensional folded structure due to
attractions and repulsions between R
groups
All but peptide bonds are
involved in tertiary structure.
Tertiary structure of proteins
• 3D conformation or shape
• Depends on the properties of the R groups of amino acid
residues
• Fold spontaneously or with the help of molecular
chaperones
• Stabilized by covalent and non-covalent bonds
Noncovalent bonds help protein folding (Fig. 4-4)
Also review Panel 2-7 (pp. 78,79) on noncovalent bonds
Covalent disulfide bonds between adjacent cysteine side chains
help stabilize a favored protein conformation [Fig. 4-29]
Quaternary structure is the overall protein structure
resulting from combinations of polypeptide subunits
Four Levels of Protein Structure
◼ Quaternary structure:
Association of two or more protein chains
e.g. Hemoglobin is composed of
4 protein chains
2 are called alpha hemoglobin
2 are called beta hemoglobin
Quaternary structure of proteins:
hemoglobin, a protein in red blood cells, has
four subunits (two copies each of α- and β-globin,
each containing a heme molecule) [Fig. 4-23].
Bioinformatics how to …
use publicly available free tools to
predict protein structure
Learning Objectives
After this lesson you should be able to:
◼ Explain the individual steps involved in calculating a protein structure
prediction.
◼ Identify suitable templates for modelling.
◼ Outline the principles behind protein structure prediction methods.
◼ Describe the differences between homology modelling and ab initio
structure prediction.
◼ Describe the major pitfalls in protein modelling.
Protein Bioinformatics: Protein sequence
analysis
➢ Helps to characterize protein sequences in silico and allows
prediction of protein structure and function
➢ Statistically significant BLAST hits usually signify sequence
homology
➢ Homologous sequences may or may not have the same function
but would always (very few exceptions) have the same structural
fold
➢ Protein sequence analysis allows protein classification
Development of protein sequence databases
➢ Atlas of protein sequence and structure – Dayhoff (1966) first
sequence database (pre-bioinformatics). Currently known as
Protein Information Resource (PIR)
➢ Protein data bank (PDB) – structural database (1972) remains
most widely used database of structures
➢ UniProt – The United Protein Databases (UniProt, 2003) is a
central database of protein sequence and function created by
joining the forces of the SWISS-PROT, TrEMBL and PIR protein
database activities
The Protein Data Bank (PDB) is a repository for the 3-D
structural data of large biological molecules, such as
proteins and nucleic acids.
Obtained by X-ray crystallography or NMR spectroscopy.
Submitted by biologists and biochemists from around the
world.
Protein sequence analysis overview
➢ Protein databases
⚫ PIR and UniProt
➢ Searching databases
⚫ Peptide search, BLAST search, Text search
➢ Information retrieval and analysis
⚫ Protein records at UniProt and PIR
⚫ Multiple sequence alignment
⚫ Secondary structure prediction
⚫ Homology modeling
Universal Protein Knowledgebase
(UniProt)
PIR (Protein Information Resource) has recently joined forces with EBI (European
Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to establish the
UniProt
[Figure: UniProt overview. Source databases: Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, EnsEMBL, PDB, patent data, other data. Components: UniProt Archive; UniProt NREF (clustering at 100, 90, 50%); UniProt Knowledgebase (classification, automated annotation, literature-based annotation).]
http://www.uniprot.org/
Peptide Search

proteome.pdf

  • 1.
  • 2.
    2 Genomics and Proteomics •The field of genomics deals with the DNA sequence, organization, function, and evolution of genomes • Proteomics aims to identify all the proteins in a cell or organism including any posttranslationally modified forms, as well as their cellular localization, functions, and interactions • Genomics was made possible by the invention of techniques of recombinant DNA, also known as gene cloning or genetic engineering
  • 3.
    3 Genetic Engineering • Ingenetic engineering, the immediate goal of an experiment is to insert a particular fragment of chromosomal DNA into a plasmid or a viral DNA molecule • This is accomplished by breaking DNA molecules at specific sites and isolating particular DNA fragments • DNA fragments are usually obtained by the treatment of DNA samples with restriction enzymes • Cloning from mRNA molecules depends on an unusual polymerase, reverse transcriptase, which can use a single-stranded RNA molecule as a template and synthesize a complementary DNA (cDNA)
  • 4.
    4 cDNA Cloning • Theresulting full-length cDNA contains an uninterrupted by introns coding sequence for the protein of interest • If DNA sequence is known at both ends of the cDNA for design of appropriate primers, amplification of the cDNA produced by reverse transcriptase is possible by reverse transcriptase PCR (RT-PCR)
  • 5.
    5 Bioinformatics • Rapid automatedDNA sequencing was instrumental in the success of the Human Genome Project, an international effort begun in 1990 to sequence the human genome and that of a number of organisms • However, a genomic sequence is like a book using an alphabet of only four letters, without spaces or punctuation. Identifying genes and their functions is a major challenge • The annotation of genomic sequences at this level is one aspect of bioinformatics, defined broadly as the use of computers in the interpretation and management of biological data
  • 6.
    THE “POST-GENOMICS” ERA 6 Goal: tounderstand the living cell Annotation Comparative genomics Structural genomics Functional genomics What’s Next ?
  • 7.
  • 8.
    8 Annotation Identify the geneswithin a given sequence of DNA Identify the sites Which regulate the gene Predict the function
  • 9.
    9 CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGACAA TTG GTT TCT TCT CTG AAT ................................. .............. TGAAAAACGTA TF binding site promoter Ribosome binding Site ORF=Open Reading Frame CDS=Coding Sequence Transcription Start Site
  • 10.
    10 Comparative genomics Human ATAGCGGGGGGATGCGGGCCCTATACCC Chimp ATAGGGG- - GGATGCGGGCCCTATACCC Mouse ATAGCG - - - GGATGCGGCGC -TATACCA
  • 11.
  • 12.
    12 Functional Genomics • Genomicsequencing has made possible a new approach to genetics called functional genomics, which focuses on genome-wide patterns of gene expression and the mechanisms by which gene expression is coordinated • DNA microarray (or chip) - a flat surface about the size of a postage stamp with up to 100,000 distinct spots, each containing a different immobilized DNA sequence suitable for hybridization with DNA or RNA isolated from cells growing under different conditions • DNA microarrays are used to estimate the relative level of gene expression of each gene in the genome
  • 13.
  • 14.
    14 Assigning the structuresof all proteins Protein-ligand complexes Functional sites fold Evolutionary relationship Shape and electrostatics Active sites protein complexes Biologic processes
  • 15.
    Origin of “Genomics”:1987 “For the newly developing discipline of [genome] mapping/sequencing (including the analysis of the information), we have adopted the term GENOMICS… The new discipline is born from a marriage of molecular and cell biology with classical genetics and is fostered by computational science.” - McKusick and Ruddle, A new discipline, a new name, a new journal, Genomics, Vol. 1, No. 1. (September 1987), pp. 1-2
  • 16.
    What is genomics? “Genomicsis a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes (the complete set of DNA within a single cell of an organism).” -
  • 17.
    What is genomics? “Researchof single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.” -
  • 18.
    Central Dogma ofBiology http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm
  • 19.
    What can genomicstell us? DNA Sequence Gene Sequence Protein/Gene Function Protein Sequence Regulatory Sequence Gene Expression DNA Variation Human disease
  • 20.
    How do westudy genomics? 1. Isolate nucleic acid molecules from biological samples 2. Determine nucleotide sequence using biochemical techniques 3. Digitize nucleotide sequence 4. Examine digital sequences to identify patterns with algorithms and statistics 5. Relate patterns to biological observations by: 6. Comparing patterns detected across many samples 7. Manipulating a system to see how patterns change
  • 21.
    Sequencing Techniques Sanger sequencing– fluorescent-labeled DNA fragments Sequencing by synthesis, NGS Adapted from http://web.uri.edu/gsc/next-generation-sequencing/
  • 22.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 23.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 24.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 25.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 26.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong 2- Next Generation Sequencing (NGS)
  • 27.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 28.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 29.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 30.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 31.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 32.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 33.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 34.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 35.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 36.
    NUS-KI Course onBioinformatics, Nov 2005 Copyright 2005 © Limsoon Wong
  • 37.
    Sequence Analysis ● Assembly– putting short sequences together to reconstruct a longer, source sequence ● Mapping – locating where one short sequence is found in a longer sequence ● Pattern recognition – looking for specific patterns within sequences that have special meaning In each of these cases, sequences are aligned to one another
  • 38.
    Sequence Alignment ● Providesa measure of relatedness ● Alignment quantified by similarity (% identity) ● Useful for any sequential data type: ○ DNA/RNA ○ Amino acids ○ Protein secondary structure ● High sequence similarity might imply: ○ Common evolutionary history ○ Similar biological function
  • 39.
    What Alignments CanTell Us ● Homology - Orthologs, Paralogs ● Genomic identity/origin of a sequence/individual ● Genome/gene structure ○ Genic structure (exons, introns, etc) ○ RNA 2D structure ○ Chromosome rearrangements/3D structure
  • 40.
    DNA Sequence AlignmentExample Sequence 1 Sequence 2 ATACACAGTAGGAGATACCAGTAAGGGAGGGGG ATACCATAAGCGAG Alignment 1 ATACACAGTAGGAGATACCAGTAAGGGAGGGGG --------------ATACCA-TAAGCGAG---- Alignment 2 ATACACAGTAGGAGATACCAGTAAGGGAGGGGG ATAC-CA--------------TAAGCGAG---- Alignment 3 ATACACAGTAGGAGATACCAGTAAGGGAGGGGG ATAC-CA-TA--AG---C--G--AG-------- Match Gap Mismatch
  • 41.
    Scoring/Substitution Matrices ● Givenalignment, how “good” is it? ● Higher score = better alignment ● Implicitly represent evolutionary patterns A C G T - A 2 -3 -1 -3 -3 C -3 2 -3 -1 -3 G -1 -3 2 -3 -3 T -3 -1 -3 2 -3 - -3 -3 -3 -3 NA ATACCAGTAAGGGAG ATACCA-TAAGAGAG Score = 22 ATACCAGTAAGG-GAG ATACCA-TAAG-AGAG Score = 19 ATACCA-GTAAGGGAG A-TACCATAAGAGAG- Score = -20
  • 42.
    Sequence Alignment Algorithms ●Global alignments - beginning and end of both sequences must align ● Local alignments - one sequence may align anywhere within the other ● Multiplicity: ○ Pairwise alignments (2 sequences) ○ Multiple sequence alignment (3+ sequences)
  • 43.
    Global Alignment Both sequencesare aligned from end to end AAANTAIYYDPNPDMP A-- NTAI-YDPN--M- AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM A-----N------T-----------AI-YD--P------------N----M Interior sequences are aligned as well as possible However, sequences of vastly different length can produce meaningless alignments
  • 44.
    Local Alignment Alignment maybegin and end at any position AAANTAIYYDPNPDMP - AANTAI-YDPN--M- AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM ---------------------AANTAI-YDPN--M---------------- Local alignment may produce better alignments when sequence lengths differ greatly
  • 45.
    LOCAL ALIGNMENT SMITH-WATERMAN •BEST SCOREFOR ALIGNING PART OF SEQUENCES • OFTEN BEATS GLOBAL ALIGNMENT SCORE 45 ATTGCAGTG-TCGAGCGTCAGGCT ATTGCGTCGATCGCAC-GCACGCT Global Alignment Local Alignment CATATTGCAGTGGTCCCGCGTCAGGCT TAAATTGCGT-GGTCGCACTGCACGCT
  • 46.
    GLOBAL VS. LOCALALIGNMENT 46 DOROTHY DOROTHY HODGKIN HODGKIN Global alignment: DOROTHY--------HODGKIN DOROTHYCROWFOOTHODGKIN Local alignment:
  • 47.
    Like pairwise alignment,but with N sequences Sequence consensus among many species suggests evolutionary pressure Multiple Sequence Alignment
  • 48.
  • 49.
    Example: Genome Assembly Weneed multiple copies of each book (genome) to arrive at a consensus text (DNA sequence) of the original If your genome was a book that had its sentences chopped into fragments, assembly is analogous to reconstructing all the sentences. Great explanation of DNA sequence assembly: http://gcat.davidson.edu/phast/
  • 50.
    Example: Genome Assembly When aligned reads disagree at a position: An error? A polymorphism? A different allele? Incorrect alignment? Greedy approach: take the most frequent nucleotide at each aligned position
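A minimal sketch of the greedy consensus rule, assuming the reads are already aligned and padded with "-" where a read does not cover a position. The reads, the alphabetical tie-break, and the "N" placeholder for uncovered columns are all illustrative choices.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Most frequent non-gap base in each column of the aligned reads."""
    width = max(len(read) for read in aligned_reads)
    result = []
    for col in range(width):
        counts = Counter(
            read[col]
            for read in aligned_reads
            if col < len(read) and read[col] != gap
        )
        if not counts:
            result.append("N")  # no read covers this column
            continue
        top = max(counts.values())
        # Ties are broken alphabetically so the output is deterministic.
        result.append(min(base for base, n in counts.items() if n == top))
    return "".join(result)


reads = [
    "ATGCC-",
    "ATGAC-",  # disagrees with the other reads at the fourth position
    "-TGCCA",
    "--GCCA",
]
print(consensus(reads))  # ATGCCA
```

The minority "A" in the second read is simply outvoted; the greedy rule cannot tell whether it was a sequencing error or a real variant.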
  • 51.
    Example: Exon Microarray Probes ● Microarray probes are short single-stranded DNA sequences from a reference genome ● Exon Microarrays have probes only from exons ● Exon probes must map to the correct exon, BUT ● Probes must NOT map anywhere else, they must be unique in the genome
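A minimal sketch of the uniqueness requirement, assuming exact string matching on both strands of a toy genome. Real probe design also has to consider near matches, which this sketch ignores; the genome string and function names are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def count_occurrences(genome, probe):
    """Count exact matches of `probe`, including overlapping ones."""
    count = 0
    start = genome.find(probe)
    while start != -1:
        count += 1
        start = genome.find(probe, start + 1)
    return count


def is_unique_probe(genome, probe):
    """True if `probe` matches exactly one place, counting both strands."""
    hits = count_occurrences(genome, probe)
    rc = reverse_complement(probe)
    if rc != probe:  # a palindromic probe would otherwise be counted twice
        hits += count_occurrences(genome, rc)
    return hits == 1


genome = "AAACCCGGGTTTACGAT"  # toy sequence
print(is_unique_probe(genome, "ACGAT"))  # True
print(is_unique_probe(genome, "AAACC"))  # False: GGTTT also occurs
```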
  • 52.
    Example: mRNA-Seq Analysis Start with a pool of mRNA molecules. Sequencing yields millions of DNA sequences 30-150 nucleotides long. Find all locations where sequences map in the genome. Count the number of sequences that map to individual regions (e.g. genes)
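A minimal sketch of the counting step, assuming the reads have already been mapped and each is summarized by a chromosome and a start position. The gene names and coordinates are hypothetical, and intervals are half-open (start included, end excluded).

```python
def count_reads_per_gene(read_positions, genes):
    """Count mapped reads whose position falls inside each gene interval."""
    counts = {name: 0 for name in genes}
    for chrom, pos in read_positions:
        for name, (gene_chrom, start, end) in genes.items():
            if chrom == gene_chrom and start <= pos < end:
                counts[name] += 1
    return counts


# Hypothetical annotation: gene name -> (chromosome, start, end)
genes = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 500, 800),
}
reads = [("chr1", 150), ("chr1", 199), ("chr1", 200), ("chr1", 650), ("chr2", 150)]
print(count_reads_per_gene(reads, genes))  # {'geneA': 2, 'geneB': 1}
```

The nested loop is fine for a toy example; with millions of reads an interval index would be the usual choice.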
  • 53.
    Example: DNA Binding Site Discovery Identify genomic regions where a particular transcription factor (TF) is bound across the entire genome. By extracting and aligning the DNA sequence corresponding to these binding events, we can identify which DNA sequences this TF tends to bind
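A minimal sketch of summarizing aligned binding sites as a position frequency matrix: for each position, count how often each base occurs. The four sites below are made up for illustration, and the sites are assumed to be already aligned and of equal length.

```python
def position_frequency_matrix(sites):
    """Per-position base counts for equal-length, pre-aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pfm = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm


sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]  # hypothetical sites
pfm = position_frequency_matrix(sites)
for base in "ACGT":
    print(base, pfm[base])
# A [0, 0, 4, 0, 0, 0, 4]
# C [0, 0, 0, 3, 0, 4, 0]
# G [0, 3, 0, 1, 0, 0, 0]
# T [4, 1, 0, 0, 4, 0, 0]
```

Columns dominated by one base are the positions the factor appears to care about most.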
  • 54.
  • 55.
    The Human Genome Project ● Planning begins 1984, launched 1990, “completed” 2001, “finished” 2004 ● Championed by Dr. Charles DeLisi ● Overview of the Human Genome Project: http://www.genome.gov/12011238
  • 56.
    Human Genome Composition ● Key findings: ○ ~20k genes ○ More segmental duplications than expected ○ Fewer than 7% of protein families are vertebrate specific ○ ~3% of the sequence codes for protein-coding genes ○ >85% of the genome is transcribed ○ Repetitive elements may comprise >66% of the genome
  • 57.
    How The Genome Was Determined International Human Genome Sequencing Consortium ● Fragment DNA with restriction enzymes ● Ligate fragments into bacterial artificial chromosomes (BACs) ● Amplify BACs with tagged DNA fragments ● Fragment isolated BAC vectors ● Sequence via Sanger-style sequencing to 4x coverage ● Finished draft genome in ~10 years
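A minimal sketch of what "4x coverage" means numerically: coverage is total sequenced bases divided by genome size. The second function uses the idealized Lander-Waterman assumption that reads land uniformly at random, which is not stated on the slide, and the read count and read length are hypothetical round numbers.

```python
import math


def coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(cov):
    """Fraction of bases hit by no read, assuming uniformly random reads."""
    return math.exp(-cov)


# Hypothetical numbers: 24 million reads of 500 bases on a 3 Gb genome.
cov = coverage(24_000_000, 500, 3_000_000_000)
print(cov)                                         # 4.0
print(round(expected_uncovered_fraction(cov), 4))  # 0.0183
```

Under this model, about 2% of bases would still be missed at 4x, which is one reason a draft is not a finished genome.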
  • 58.
    How The Genome Was Determined Celera Genomics: shotgun sequencing ● Used public BAC contigs from the Human Genome Project and their own ● Much shorter DNA reads, assembled later in silico using the HGP BAC clones as a scaffold ● Finished draft genome in ~3 years
  • 59.
    The Genome Is All About Genes ● Genic sequences ● What do our genes do? ● How are genes controlled? ● What genes are different between humans? ● How are genes associated with disease? Gene Expression
  • 60.
    Gene Expression “Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product.” - Wikipedia
  • 61.
    But What Is A Gene? ● A specific DNA sequence ● A fundamental unit of inheritance ● A functional molecule: an RNA product created by transcription (and, for many genes, then translated into a protein) ● A “gene” is an abstract concept
  • 62.
    But What Is A Gene? ● DNA? ● RNA? ● Protein? ● Informational molecule? ● Functional molecule? Yes, all of them
  • 63.
    What Is Gene Expression? ● Active mRNA transcription? ● mRNA abundance? ● mRNA translation? ● RNA function? ● Protein abundance? ● Protein function? Yes, all of them
  • 64.
    The Gene Expression Landscape ● mRNA - protein coding genes ● Functional non-coding RNA (ncRNA) biotypes: ○ microRNA (miRNA)/small interfering RNA (siRNA) ○ Long (intergenic) non-coding RNA (lncRNA/lincRNA) ○ Ribosomal RNA (rRNA) ○ Transfer RNA (tRNA) ○ Many more (30+) ● Antisense: transcript initiated from the transcription start site (TSS) in the opposite direction of the primary gene ● Pseudogenes
  • 65.
    How We Measure Gene Expression ● mRNA transcription/translation ○ Fluorescent tagging + microscopy ○ Ribosomal capture ● mRNA abundance ○ Northern blots ○ Quantitative polymerase chain reaction (qPCR) ○ Microarrays ○ High-throughput sequencing
  • 66.
    How We Measure Gene Expression ● Protein abundance ○ Western blots ○ Fluorescent tagging + microscopy ○ Mass spectrometry ○ Protein arrays ● mRNA/Protein localization ○ Fluorescent tagging + microscopy
  • 67.
    mRNA Measurement Considerations ● Most mRNA quantification techniques measure steady state abundance ● mRNA measurements are snapshots ○ Measure large populations of cells to quantify “average” abundance ● Poor concordance between mRNA and corresponding protein abundance
  • 68.
    The Holy Grail of bioinformatics ...to be able to understand the words in a sequence sentence that form a particular protein structure
  • 69.
    In silico function prediction … a reality check • What is the function of this structure? • What is the function of this sequence? • What is the function of this motif? – the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions – knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level
  • 70.
    How Is It Possible? ◼ The structure of a protein is uniquely determined by its amino acid sequence (but sequence is sometimes not enough): ◼ prions ◼ pH, ions, cofactors, chaperones ◼ Structure is conserved much longer than sequence in evolution. ◼ Structure > Function >> Sequence
  • 71.
    How Often Can We Do It? ◼ There are currently ~47000 structures in the PDB (but only ~4000 if you include only ones that are not more than 30% identical and have a resolution better than 3.0 Å). ◼ An estimated 25% of all sequences can be modeled and structural information can be obtained for ~50%.
  • 72.
    Protein Basics: Proteins are macromolecules. Amino acids are the basic building blocks of proteins
  • 73.
    Amino Acids are classified by properties: polar, nonpolar, and charged (ionic)
  • 74.
    Polypeptides are constructed by condensation reactions between amino acids
  • 75.
  • 76.
    Four Levels of Protein Structure
  • 77.
    Different Levels of Protein Structure
  • 78.
    Protein function depends on specific conformation (shape). There are four levels of protein structure. The primary structure is the linear sequence of amino acids. What determines this sequence? Where in the cell are amino acids joined this way?
  • 79.
    Four Levels of Protein Structure ◼ Primary Structure: Linear Sequence of Amino Acids [Diagram: amino acid, H2N-CH(R)-COOH] Each amino acid has a central carbon linked to: a hydrogen (H), an amino group (NH2), an acid group (COOH), and a unique group (R)
  • 80.
    The carboxyl group of one amino acid is linked to the amino group of the next amino acid.
  • 81.
    Amino acids are linked together by covalent peptide bonds (Fig. 4-1)
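A minimal sketch of the bookkeeping behind condensation: each peptide bond releases one water molecule, so a chain of n amino acids weighs the sum of the free amino acids minus (n − 1) waters. The masses are approximate average molar masses in g/mol for three amino acids only, included for illustration.

```python
WATER = 18.015  # g/mol, lost once per peptide bond formed

# Approximate molar masses of the free amino acids (g/mol).
FREE_MASS = {"G": 75.07, "A": 89.09, "S": 105.09}


def peptide_mass(sequence):
    """Approximate molar mass of a linear peptide built by condensation."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    bonds = len(sequence) - 1  # one peptide bond between each adjacent pair
    return sum(FREE_MASS[aa] for aa in sequence) - bonds * WATER


print(round(peptide_mass("GA"), 2))   # dipeptide Gly-Ala, one water lost
print(round(peptide_mass("GAS"), 2))  # tripeptide, two waters lost
```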
  • 82.
    Proteins are made up of a polypeptide backbone with attached side chains (Fig. 4-2)
  • 83.
    Schematic amino acid R groups: A Ala, C Cys, D Asp, E Glu, F Phe*, G Gly, H His*, I Ile*, K Lys*, L Leu*, M Met*, N Asn, P Pro, Q Gln, R Arg*, S Ser, T Thr*, V Val*, W Trp*, Y Tyr [Legend: atom symbols for C, N, O, S]
  • 84.
    The secondary structure of a protein depends on hydrogen bonding between C=O and N-H groups. Alpha helix, beta sheets, turns and loops
  • 85.
    Four Levels of Protein Structure ◼ Secondary Structure: Polypeptide folding into α helix, β sheet, or random coil (H bonds involved) [Diagram: hydrogen bonds between backbone C=O and N-H groups]
  • 86.
    Secondary structure of proteins - α helix H bond between the N-H of every peptide bond and the C=O of the peptide bond four amino acids away in the same chain. R groups are not involved. (e.g. in the protein α-keratin - abundant in skin, hair, nails and horns) [Fig. 4-10, p. 128]
  • 87.
    Secondary structure of proteins – β sheet Polypeptide chains are held together by H bonds between the N-H group of one polypeptide chain and the C=O group of the other chain (e.g. in the protein fibroin - abundant in silk) [Fig. 4-10, p. 128]
  • 88.
    α helices can wrap around one another by interactions between their hydrophobic side chains to form a stable coiled-coil. [Fig. 4-16] e.g. α-keratin in the skin and myosin in muscles
  • 89.
    Tertiary structure is determined by the interactions between the side chains (R groups). List these types of interactions and which ones are weak or strong
  • 90.
    Four Levels of Protein Structure ◼ Tertiary Structure: Three dimensional folded structure due to attractions and repulsions between R groups. All but peptide bonds are involved in tertiary structure.
  • 91.
    Tertiary structure of proteins • 3D conformation or shape • Depends on the properties of the R groups of amino acid residues • Fold spontaneously or with the help of molecular chaperones • Stabilized by covalent and non-covalent bonds
  • 92.
    Noncovalent bonds help protein folding (Fig. 4-4). Also review Panel 2-7 (pp. 78, 79) on noncovalent bonds
  • 93.
    Covalent disulfide bonds between adjacent cysteine side chains help stabilize a favored protein conformation [Fig. 4-29]
  • 94.
    Quaternary structure is the overall protein structure resulting from combinations of polypeptide subunits
  • 95.
    Four Levels of Protein Structure ◼ Quaternary structure: Association of two or more protein chains, e.g. hemoglobin is composed of 4 protein chains: 2 are called alpha hemoglobin, 2 are called beta hemoglobin
  • 96.
    Quaternary structure of proteins: hemoglobin, a protein in red blood cells, has four subunits (two copies each of α- and β-globins), each containing a heme molecule [Fig. 4-23].
  • 97.
    Bioinformatics how to… use publicly available free tools to predict protein structure
  • 98.
    Learning Objectives After this lesson you should be able to: ◼ Explain the individual steps involved in calculating a protein structure prediction. ◼ Identify suitable templates for modelling. ◼ Outline the principles behind protein structure prediction methods. ◼ Describe the differences between homology modelling and ab initio structure prediction. ◼ Describe the major pitfalls in protein modelling.
  • 99.
    Protein Bioinformatics: Protein sequence analysis ➢ Helps to characterize protein sequences in silico and allows prediction of protein structure and function ➢ Statistically significant BLAST hits usually signify sequence homology ➢ Homologous sequences may or may not have the same function but, with very few exceptions, have the same structural fold ➢ Protein sequence analysis allows protein classification
  • 100.
    Development of protein sequence databases ➢ Atlas of Protein Sequence and Structure – Dayhoff (1966), first sequence database (pre-bioinformatics). Currently known as the Protein Information Resource (PIR) ➢ Protein Data Bank (PDB) – structural database (1972), remains the most widely used database of structures ➢ UniProt – The United Protein Databases (UniProt, 2003) is a central database of protein sequence and function created by joining the forces of the SWISS-PROT, TrEMBL and PIR protein database activities
  • 101.
    The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. The data are obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world.
  • 102.
    Protein sequence analysis overview ➢ Protein databases ⚫ PIR and UniProt ➢ Searching databases ⚫ Peptide search, BLAST search, Text search ➢ Information retrieval and analysis ⚫ Protein records at UniProt and PIR ⚫ Multiple sequence alignment ⚫ Secondary structure prediction ⚫ Homology modeling
  • 103.
    Universal Protein Knowledgebase (UniProt) PIR (Protein Information Resource) has recently joined forces with EBI (European Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to establish UniProt. [Figure labels: Swiss-Prot, PIR-PSD, TrEMBL, RefSeq, GenBank/EMBL/DDBJ, EnsEMBL, PDB, Patent Data, Other Data; UniProt Archive; Clustering at 100, 90, 50%; UniProt NREF; Literature-Based Annotation; Automated Annotation; Classification; UniProt Knowledgebase] http://www.uniprot.org/
  • 104.