Introduction to
Bioinformatics
Esa Pitkänen
esa.pitkanen@cs.helsinki.fi
Autumn 2008, I period
www.cs.helsinki.fi/mbi/courses/08-09/itb
582606 Introduction to Bioinformatics, Autumn 2008
Introduction to
Bioinformatics
Lecture 1:
Administrative issues
MBI Programme, Bioinformatics courses
What is bioinformatics?
Molecular biology primer
3
How to enrol for the course?
p Use the registration system of the Computer
Science department: https://ilmo.cs.helsinki.fi
n You need your user account at the IT department (“cc
account”)
p If you cannot register yet, don’t worry: attend the
lectures and exercises; just register when you are
able to do so
4
Teachers
p Esa Pitkänen, Department of Computer Science,
University of Helsinki
p Elja Arjas, Department of Mathematics and
Statistics, University of Helsinki
p Sami Kaski, Department of Information and
Computer Science, Helsinki University of
Technology
p Lauri Eronen, Department of Computer Science,
University of Helsinki (exercises)
5
Lectures and exercises
p Lectures: Tuesday and Friday 14.15-16.00
Exactum C221
p Exercises: Tuesday 16.15-18.00 Exactum
C221
n First exercise session on Tue 9 September
6
Status & Prerequisites
p Advanced level course at the Department
of Computer Science, U. Helsinki
p 4 credits
p Prerequisites:
n Basic mathematics skills (probability calculus,
basic statistics)
n Familiarity with computers
n Basic programming skills recommended
n No biology background required
7
Course contents
p What is bioinformatics?
p Molecular biology primer
p Biological words
p Sequence assembly
p Sequence alignment
p Fast sequence alignment using FASTA and BLAST
p Genome rearrangements
p Motif finding (tentative)
p Phylogenetic trees
p Gene expression analysis
8
How to pass the course?
p Recommended method:
n Attend the lectures (not obligatory though)
n Do the exercises
n Take the course exam
p Or:
n Take a separate exam
9
How to pass the course?
p Exercises give you max. 12 points
n 0% completed assignments gives you 0 points,
80% gives 12 points, the rest by linear
interpolation
n “A completed assignment” means that
p You are willing to present your solution in the
exercise session and
p You return notes by e-mail to Lauri Eronen (see
course web page for contact info) describing the main
phases you took to solve the assignment
n Return notes at latest on Tuesdays 16.15
p Course exam gives you max. 48 points
10
How to pass the course?
p Grading: on the scale 0-5
n To get the lowest passing grade 1, you need to get at
least 30 points out of 60 maximum
p Course exam: Wed 15 October 16.00-19.00
Exactum A111
p See course web page for separate exams
p Note: if you take the first separate exam, the
best of the following options will be considered:
n Exam gives you 48 points, exercises 12 points
n Exam gives you 60 points
p In second and subsequent separate exams, only
the 60 point option is in use
11
Literature
p Deonier, Tavaré,
Waterman: Computational
Genome Analysis, an
Introduction. Springer,
2005
p Jones, Pevzner: An
Introduction to
Bioinformatics Algorithms.
MIT Press, 2004
p Slides for some lectures
will be available on the
course web page
12
Additional literature
p Gusfield: Algorithms on
strings, trees and
sequences
p Griffiths et al: Introduction
to genetic analysis
p Alberts et al.: Molecular
biology of the cell
p Lodish et al.: Molecular cell
biology
p Check the course web site
13
Questions about administrative &
practical stuff?
14
Master's Degree Programme in
Bioinformatics (MBI)
p Two-year MSc programme
p Admission for 2009-2010 in January 2009
n You need to have your Bachelor’s degree ready by
August 2009
www.cs.helsinki.fi/mbi
15
MBI programme organizers
Department of Computer Science,
Department of Mathematics and Statistics
Faculty of Science, Kumpula Campus, HY
Laboratory of Computer and
Information Science, Laboratory of
CS and Engineering,TKK
Faculty of Medicine, Meilahti Campus, HY
Faculty of Biosciences
Faculty of Agriculture and Forestry
Viikki Campus, HY
16
TKK, Otaniemi
HY, Meilahti
HY, Kumpula
HY, Viikki
Four MBI campuses
17
MBI highlights
p You can take courses from both HY and
TKK
p Two biology courses tailored specifically
for MBI
p Bioinformatics is a new exciting field, with
a high demand for experts in job market
p Go to www.cs.helsinki.fi/mbi/careers to
find out what a bioinformatician could do
for living
18
Admission
p Admission requirements
n Bachelor’s degree in a suitable field (e.g., computer
science, mathematics, statistics, biology or medicine)
n At least 60 ECTS credits in total in computer science,
mathematics and statistics
n Proficiency in English (standardized language test:
TOEFL, IELTS)
p Admission period opens in late Autumn 2009 and
closes in 2 February 2009
p Details on admission will be posted in
www.cs.helsinki.fi/mbi during this autumn
19
Bioinformatics courses in Helsinki
region: 1st period
p Computational genomics (4-7 credits, TKK)
p Seminar: Neuroinformatics (3 credits, Kumpula)
p Seminar: Machine Learning in Bioinformatics (3
credits, Kumpula)
p Signal processing in neuroinformatics (5 credits,
TKK)
20
A good biology course for computer
scientists and mathematicians?
p Biology for methodological scientists (8 credits, Meilahti)
n Course organized by the Faculties of Bioscience and Medicine
for the MBI programme
n Introduction to basic concepts of microarrays, medical genetics
and developmental biology
n Study group + book exam in I period (2 cr)
n Three lectured modules, 2 cr each
n Each module has an individual registration so you can
participate even if you missed the first module
n www.cs.helsinki.fi/mbi/courses/08-09/bfms/
21
Bioinformatics courses in Helsinki
region: 2nd period
p Bayesian paradigm in genetic bioinformatics (6
credits, Kumpula)
p Biological Sequence Analysis (6 credits, Kumpula)
p Modeling of biological networks (5-7 credits, TKK)
p Statistical methods in genetics (6-8 credits,
Kumpula)
22
Bioinformatics courses in Helsinki
region: 3rd period
p Evolution and the theory of games (5 credits, Kumpula)
p Genome-wide association mapping (6-8 credits, Kumpula)
p High-Throughput Bioinformatics (5-7 credits, TKK)
p Image Analysis in Neuroinformatics (5 credits, TKK)
p Practical Course in Biodatabases (4-5 credits, Kumpula)
p Seminar: Computational systems biology (3 credits,
Kumpula)
p Spatial models in ecology and evolution (8 credits,
Kumpula)
p Special course in bioinformatics I (3-7 credits, TKK)
23
Bioinformatics courses in Helsinki region:
4th period
p Metabolic Modeling (4 credits, Kumpula)
p Phylogenetic data analyses (6-8 credits,
Kumpula)
24
1. What is bioinformatics?
25
What is bioinformatics?
p Bioinformatics, n. The science of information
and information flow in biological systems,
esp. of the use of computational methods in
genetics and genomics. (Oxford English
Dictionary)
p "The mathematical, statistical and computing
methods that aim to solve biological problems
using DNA and amino acid sequences and
related information." -- Fredj Tekaia
26
What is bioinformatics?
p "I do not think all biological computing is
bioinformatics, e.g. mathematical modelling is
not bioinformatics, even when connected with
biology-related problems. In my opinion,
bioinformatics has to do with management and
the subsequent use of biological information,
particular genetic information."
-- Richard Durbin
27
What is not bioinformatics?
p Biologically-inspired computation, e.g., genetic algorithms
and neural networks
p However, application of neural networks to solve some
biological problem, could be called bioinformatics
p What about DNA computing?
http://www.wisdom.weizmann.ac.il/~lbn/new_pages/Visual_Presentation.html
28
Computational biology
p Application of computing to biology (broad
definition)
p Often used interchangeably with bioinformatics
p Or: Biology that is done with computational
means
29
Biometry & biophysics
p Biometry: the statistical analysis of biological
data
n Sometimes also the field of identification of individuals
using biological traits (a more recent definition)
p Biophysics: "an interdisciplinary field which
applies techniques from the physical sciences
to understanding biological structure and
function" -- British Biophysical Society
30
Mathematical biology
p Mathematical biology
“tackles biological
problems, but the methods
it uses to tackle them need
not be numerical and need
not be implemented in
software or hardware.”
-- Damian Counsell
Alan Turing
31
Turing on biological complexity
p “It must be admitted that the biological examples which
it has been possible to give in the present paper are very
limited.
This can be ascribed quite simply to the fact that
biological phenomena are usually very complicated.
Taking this in combination with the relatively elementary
mathematics used in this paper one could hardly expect to
find that many observed biological phenomena would be
covered.
It is thought, however, that the imaginary biological
systems which have been treated, and the principles which
have been discussed, should be of some help in
interpreting real biological forms.”
– Alan Turing, The Chemical Basis of Morphogenesis, 1952
32
Related concepts
p Systems biology
n “Biology of networks”
n Integrating different levels
of information to
understand how biological
systems work
p Computational systems biology
Overview of metabolic pathways in
KEGG database, www.genome.jp/kegg/
33
Why is bioinformatics important?
p New measurement techniques produce
huge quantities of biological data
n Advanced data analysis methods are needed to
make sense of the data
n Typical data sources produce noisy data with a
lot of missing values
p Paradigm shift in biology to utilise
bioinformatics in research
34
Bioinformatician’s skill set
p Statistics, data analysis methods
n Lots of data
n High noise levels, missing values
n #attributes >> #data points
p Programming languages
n Scripting languages: Python, Perl, Ruby, …
n Extensive use of text file formats: need
parsers
n Integration of both data and tools
p Data structures, databases
35
Bioinformatician’s skill set
p Modelling
n Discrete vs continuous domains
n -> Systems biology
p Scientific computation packages
n R, Matlab/Octave, …
p Communication skills!
36
Communication skills: case 1
Biologist presents a problem
to computer scientists /
mathematicians
?
”I am interested in finding what affects the
regulation gene x during condition y and how
that relates to the organism’s phenotype.”
”Define input and output of the problem.”
37
Communication skills: case 2
Bioinformatician is a part
of a group that consists
mostly of biologists.
38
Communication skills: case 2
...biologist/bioinformatician ratio is important!
39
Communication skills: case 3
A group of
bioinformaticians
offers their services to
more than one group
40
Bioinformatician’s skill set
p How much biology you should know?
41
Computer Science
• Programming
• Databases
• Algorithmics
Mathematics and
statistics
• Calculus
• Probability calculus
• Linear algebra
Biology & Medicine
• Basics in molecular and
cell biology
• Measurement techniques
Bioinformatics
• Biological sequence analysis
• Biological databases
• Analysis of gene expression
• Modeling protein structure and
function
• Gene, protein and metabolic
networks
• …
Bioinformatician’s skill set
Prof. Juho Rousu, 2006
Where would you be in this triangle?
42
A problem involving bioinformatics?
- ”I found a fruit fly that is immune to all diseases!”
- ”It was one of these”
Pertti Jarla, http://www.hs.fi/fingerpori/
43
Molecular biology primer
Molecular Biology Primer by Angela Brooks, Raymond Brown,
Calvin Chen, Mike Daly, Hoa Dinh, Erinn Hama, Robert Hinman,
Julio Ng, Michael Sneddon, Hoa Troung, Jerry Wang, Che Fung
Yung
Edited for Introduction to Bioinformatics (Autumn 2007, Summer
2008, Autumn 2008) by Esa Pitkänen
44
Molecular biology primer
p Part 1: What is life made of?
p Part 2: Where does the variation in
genomes come from?
45
Life begins with Cell
p A cell is a smallest structural unit of an
organism that is capable of independent
functioning
p All cells have some common features
46
Cells
p Fundamental working units of every living system.
p Every organism is composed of one of two radically different types of
cells:
n prokaryotic cells or
n eukaryotic cells.
p Prokaryotes and Eukaryotes are descended from the same
primitive cell.
n All prokaryotic and eukaryotic cells are the result of a total of 3.5
billion years of evolution.
47
Two types of cells: Prokaryotes and
Eukaryotes
48
Prokaryotes and Eukaryotes
p According to the most
recent evidence, there
are three main
branches to the tree of
life
p Prokaryotes include
Archaea (“ancient
ones”) and bacteria
p Eukaryotes are
kingdom Eukarya and
includes plants,
animals, fungi and
certain algae Lecture: Phylogenetic trees
49
All Cells have common Cycles
p Born, eat, replicate, and die
50
Common features of organisms
p Chemical energy is stored in ATP
p Genetic information is encoded by DNA
p Information is transcribed into RNA
p There is a common triplet genetic code
p Translation into proteins involves ribosomes
p Shared metabolic pathways
p Similar proteins among diverse groups of
organisms
51
All Life depends on 3 critical molecules
p DNAs (Deoxyribonucleic acid)
n Hold information on how cell works
p RNAs (Ribonucleic acid)
n Act to transfer short pieces of information to different
parts of cell
n Provide templates to synthesize into protein
p Proteins
n Form enzymes that send signals to other cells and
regulate gene activity
n Form body’s major components (e.g. hair, skin, etc.)
n “Workhorses” of the cell
52
DNA: The Code of Life
p The structure and the four genomic letters code for all living
organisms
p Adenine, Guanine, Thymine, and Cytosine which pair A-T and C-G
on complimentary strands.
Lecture: Genome sequencing
and assembly
53
Discovery of the structure of DNA
p 1952-1953 James D. Watson and Francis H. C. Crick
deduced the double helical structure of DNA from X-ray
diffraction images by Rosalind Franklin and data on amounts
of nucleotides in DNA
James Watson and
Francis Crick
Rosalind
Franklin
”Photo 51”
54
DNA, continued
p DNA has a double helix
structure which is
composed of
n sugar molecule
n phosphate group
n and a base (A,C,G,T)
p By convention, we read
DNA strings in direction of
transcription: from 5’ end
to 3’ end
5’ ATTTAGGCC 3’
3’ TAAATCCGG 5’
55
DNA is contained in chromosomes
http://en.wikipedia.org/wiki/Image:Chromatin_Structures.png
p In eukaryotes, DNA is packed into chromatids
n In metaphase, the “X” structure consists of two identical
chromatids
p In prokaryotes, DNA is usually contained in a single,
circular chromosome
56
Human chromosomes
p Somatic cells in humans
have 2 pairs of 22
chromosomes + XX
(female) or XY (male) =
total of 46 chromosomes
p Germline cells have 22
chromosomes + either X or
Y = total of 23
chromosomes
Karyogram of human male using Giemsa staining
(http://en.wikipedia.org/wiki/Karyotype)
57
Length of DNA and number of chromosomes
Organism #base pairs #chromosomes (germline)
Prokayotic
Escherichia coli (bacterium) 4x106 1
Eukaryotic
Saccharomyces cerevisia (yeast) 1.35x107 17
Drosophila melanogaster (insect) 1.65x108 4
Homo sapiens (human) 2.9x109 23
Zea mays (corn / maize) 5.0x109 10
58
1 atgagccaag ttccgaacaa ggattcgcgg ggaggataga tcagcgcccg agaggggtga
61 gtcggtaaag agcattggaa cgtcggagat acaactccca agaaggaaaa aagagaaagc
121 aagaagcgga tgaatttccc cataacgcca gtgaaactct aggaagggga aagagggaag
181 gtggaagaga aggaggcggg cctcccgatc cgaggggccc ggcggccaag tttggaggac
241 actccggccc gaagggttga gagtacccca gagggaggaa gccacacgga gtagaacaga
301 gaaatcacct ccagaggacc ccttcagcga acagagagcg catcgcgaga gggagtagac
361 catagcgata ggaggggatg ctaggagttg ggggagaccg aagcgaggag gaaagcaaag
421 agagcagcgg ggctagcagg tgggtgttcc gccccccgag aggggacgag tgaggcttat
481 cccggggaac tcgacttatc gtccccacat agcagactcc cggaccccct ttcaaagtga
541 ccgagggggg tgactttgaa cattggggac cagtggagcc atgggatgct cctcccgatt
601 ccgcccaagc tccttccccc caagggtcgc ccaggaatgg cgggacccca ctctgcaggg
661 tccgcgttcc atcctttctt acctgatggc cggcatggtc ccagcctcct cgctggcgcc
721 ggctgggcaa cattccgagg ggaccgtccc ctcggtaatg gcgaatggga cccacaaatc
781 tctctagctt cccagagaga agcgagagaa aagtggctct cccttagcca tccgagtgga
841 cgtgcgtcct ccttcggatg cccaggtcgg accgcgagga ggtggagatg ccatgccgac
901 ccgaagagga aagaaggacg cgagacgcaa acctgcgagt ggaaacccgc tttattcact
961 ggggtcgaca actctgggga gaggagggag ggtcggctgg gaagagtata tcctatggga
1021 atccctggct tccccttatg tccagtccct ccccggtccg agtaaagggg gactccggga
1081 ctccttgcat gctggggacg aagccgcccc cgggcgctcc cctcgttcca ccttcgaggg
1141 ggttcacacc cccaacctgc gggccggcta ttcttctttc ccttctctcg tcttcctcgg
1201 tcaacctcct aagttcctct tcctcctcct tgctgaggtt ctttcccccc gccgatagct
1261 gctttctctt gttctcgagg gccttccttc gtcggtgatc ctgcctctcc ttgtcggtga
1321 atcctcccct ggaaggcctc ttcctaggtc cggagtctac ttccatctgg tccgttcggg
1381 ccctcttcgc cgggggagcc ccctctccat ccttatcttt ctttccgaga attcctttga
1441 tgtttcccag ccagggatgt tcatcctcaa gtttcttgat tttcttctta accttccgga
1501 ggtctctctc gagttcctct aacttctttc ttccgctcac ccactgctcg agaacctctt
1561 ctctcccccc gcggtttttc cttccttcgg gccggctcat cttcgactag aggcgacggt
1621 cctcagtact cttactcttt tctgtaaaga ggagactgct ggccctgtcg cccaagttcg
1681 ag
Hepatitis delta virus, complete genome
59
RNA
p RNA is similar to DNA chemically. It is usually only a
single strand. T(hyamine) is replaced by U(racil)
p Several types of RNA exist for different functions in
the cell.
http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif
tRNA linear and 3D view:
60
DNA, RNA, and the Flow of
Information
Translation
Transcription
Replication
”The central dogma”
Is this true?
Denis Noble: The principles of Systems Biology illustrated using the virtual heart
http://velblod.videolectures.net/2007/pascal/eccs07_dresden/noble_denis/eccs07_noble_psb_01.ppt
61
Proteins
p Proteins are polypeptides
(strings of amino acid
residues)
p Represented using strings
of letters from an alphabet
of 20: AEGLV…WKKLAG
p Typical length 50…1000
residues
Urease enzyme from Helicobacter pylori
62
Amino acids
http://upload.wikimedia.org/wikipedia/commons/c/c5/Amino_acids_2.png
63
How DNA/RNA codes for protein?
p DNA alphabet contains four
letters but must specify
protein, or polypeptide
sequence of 20 letters.
p Dinucleotides are not
enough: 42 = 16 possible
dinucleotides
p Trinucleotides (triplets)
allow 43 = 64 possible
trinucleotides
p Triplets are also called
codons
64
How DNA/RNA codes for protein?
p Three of the possible
triplets specify ”stop
translation”
p Translation usually starts
at triplet AUG (this codes
for methionine)
p Most amino acids may be
specified by more than
triplet
p How to find a gene? Look
for start and stop codons
(not that easy though)
65
Proteins: Workhorses of the Cell
p 20 different amino acids
n different chemical properties cause the protein chains to fold
up into specific three-dimensional structures that define their
particular functions in the cell.
p Proteins do all essential work for the cell
n build cellular structures
n digest nutrients
n execute metabolic functions
n mediate information flow within a cell and among cellular
communities.
p Proteins work together with other proteins or nucleic acids
as "molecular machines"
n structures that fit together and function in highly specific, lock-
and-key ways.
Lecture 8: Proteomics
66
Genes
p “A gene is a union of genomic sequences encoding a
coherent set of potentially overlapping functional products”
--Gerstein et al.
p A DNA segment whose information is expressed either as
an RNA molecule or protein
5’ 3’
3’ 5’
… a t g a g t g g a …
… t a c t c a c c t …
augagugga ...
(transcription)
(translation)
MSG …
(folding)
http://fold.it
67
FoldIt: Protein folding game
http://fold.it
68
Genes & alleles
p A gene can have different variants
p The variants of the same gene are called
alleles
5’
3’
… a t g a g t g g a …
… t a c t c a c c t …
augagugga ...
MSG …
5’
3’
… a t g a g t c g a …
… t a c t c a g c t …
augagucga ...
MSR …
69
Genes can be found on both strands
3’
5’
5’
3’
70
Exons and introns & splicing
3’
5’
5’
3’
Introns are removed from RNA after transcription
Exons
Exons are joined:
This process is called splicing
71
Alternative splicing
A 3’
5’
5’
3’
B C
Different splice variants may be generated
A B C
B C
A C
…
72
Where does the variation in genomes come
from?
p Prokaryotes are typically
haploid: they have a single
(circular) chromosome
p DNA is usually inherited
vertically (parent to
daughter)
p Inheritance is clonal
n Descendants are faithful
copies of an ancestral DNA
n Variation is introduced via
mutations, transposable
elements, and horizontal
transfer of DNA
Chromosome map of S. dysenteriae, the nine rings
describe different properties of the genome
http://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm
73
Causes of variation
p Mistakes in DNA replication
p Environmental agents (radiation, chemical
agents)
p Transposable elements (transposons)
n A part of DNA is moved or copied to another location in
genome
p Horizontal transfer of DNA
n Organism obtains genetic material from another
organism that is not its parent
n Utilized in genetic engineering
74
Biological string manipulation
p Point mutation: substitution of a base
n …ACGGCT… => …ACGCCT…
p Deletion: removal of one or more contiguous
bases (substring)
n …TTGATCA… => …TTTCA…
p Insertion: insertion of a substring
n …GGCTAG… => …GGTCAACTAG…
Lecture: Sequence alignment
Lecture: Genome rearrangements
75
Meiosis
p Sexual organisms are usually
diploid
n Germline cells (gametes)
contain N chromosomes
n Somatic (body) cells have 2N
chromosomes
p Meiosis: reduction of
chromosome number from
2N to N during reproductive
cycle
n One chromosome doubling is
followed by two cell divisions
Major events in meiosis
http://en.wikipedia.org/wiki/Meiosis
http://www.ncbi.nlm.nih.gov/About/Primer
76
Recombination and variation
p Recap: Allele is a viable DNA
coding occupying a given locus
(position in the genome)
p In recombination, alleles from
parents become suffled in
offspring individuals via
chromosomal crossover over
p Allele combinations in
offspring are usually different
from combinations found in
parents
p Recombination errors lead into
additional variations
Chromosomal crossover as described by
T. H. Morgan in 1916
77
Mitosis
http://en.wikipedia.org/wiki/Image:Major_events_in_mitosis.svg
p Mitosis: growth and development of the organism
n One chromosome doubling is followed by one cell
division
78
Recombination frequency and linked genes
p Genetic marker: some DNA sequence of interest
(e.g., gene or a part of a gene)
p Recombination is more likely to separate two
distant markers than two close ones
p Linked markers: ”tend” to be inherited together
p Marker distances measured in centimorgans: 1
centimorgan corresponds to 1% chance that two
markers are separated in recombination
79
Biological databases
p Exponential growth of
biological data
n New measurement
techniques
n Before we are able to use
the data, we need to store
it efficiently -> biological
databases
n Published data is
submitted to databases
p General vs specialised
databases
p This topic is discussed
extensively in Practical
course in biodatabases (III
period)
80
10 most important biodatabases… according
to ”Bioinformatics for dummies”
p GenBank/DDJB/EMBL www.ncbi.nlm.nih.gov Nucleotide sequences
p Ensembl www.ensembl.org Human/mouse genome
p PubMed www.ncbi.nlm.nih.gov Literature references
p NR www.ncbi.nlm.nih.gov Protein sequences
p UniProt www.expasy.org Protein sequences
p InterPro www.ebi.ac.uk Protein domains
p OMIM www.ncbi.nlm.nih.gov Genetic diseases
p Enzymes www.expasy.org Enzymes
p PDB www.rcsb.org/pdb/ Protein structures
p KEGG www.genome.ad.jp Metabolic pathways
Sophia Kossida, Introduction to Bioinformatics, Summer 2008
81
FASTA format
p A simple format for DNA and protein sequence
data is FASTA
>Hepatitis delta virus, complete genome
atgagccaagttccgaacaaggattcgcggggaggatagatcagcgcccgagaggggtga
gtcggtaaagagcattggaacgtcggagatacaactcccaagaaggaaaaaagagaaagc
aagaagcggatgaatttccccataacgccagtgaaactctaggaaggggaaagagggaag
gtggaagagaaggaggcgggcctcccgatccgaggggcccggcggccaagtttggaggac
actccggcccgaagggttgagagtaccccagagggaggaagccacacggagtagaacaga
gaaatcacctccagaggaccccttcagcgaacagagagcgcatcgcgagagggagtagac
catagcgataggaggggatgctaggagttgggggagaccgaagcgaggaggaaagcaaag
agagcagcggggctagcaggtgggtgttccgccccccgagaggggacgagtgaggcttat
cccggggaactcgacttatcgtccccacatagcagactcccggaccccctttcaaagtga
…
Header line,
begins with >
Introduction to
Bioinformatics
Biological words
83
Recap
p DNA codes information with alphabet of 4
letters: A, C, G, T
p In proteins, the alphabet size is 20
p DNA -> RNA -> Protein (genetic code)
n Three DNA bases (triplet, codon) code for one
amino acid
n Redundancy, start and stop codons
84
1 atgagccaag ttccgaacaa ggattcgcgg ggaggataga tcagcgcccg agaggggtga
61 gtcggtaaag agcattggaa cgtcggagat acaactccca agaaggaaaa aagagaaagc
121 aagaagcgga tgaatttccc cataacgcca gtgaaactct aggaagggga aagagggaag
181 gtggaagaga aggaggcggg cctcccgatc cgaggggccc ggcggccaag tttggaggac
241 actccggccc gaagggttga gagtacccca gagggaggaa gccacacgga gtagaacaga
301 gaaatcacct ccagaggacc ccttcagcga acagagagcg catcgcgaga gggagtagac
361 catagcgata ggaggggatg ctaggagttg ggggagaccg aagcgaggag gaaagcaaag
421 agagcagcgg ggctagcagg tgggtgttcc gccccccgag aggggacgag tgaggcttat
481 cccggggaac tcgacttatc gtccccacat agcagactcc cggaccccct ttcaaagtga
541 ccgagggggg tgactttgaa cattggggac cagtggagcc atgggatgct cctcccgatt
601 ccgcccaagc tccttccccc caagggtcgc ccaggaatgg cgggacccca ctctgcaggg
661 tccgcgttcc atcctttctt acctgatggc cggcatggtc ccagcctcct cgctggcgcc
721 ggctgggcaa cattccgagg ggaccgtccc ctcggtaatg gcgaatggga cccacaaatc
781 tctctagctt cccagagaga agcgagagaa aagtggctct cccttagcca tccgagtgga
841 cgtgcgtcct ccttcggatg cccaggtcgg accgcgagga ggtggagatg ccatgccgac
901 ccgaagagga aagaaggacg cgagacgcaa acctgcgagt ggaaacccgc tttattcact
961 ggggtcgaca actctgggga gaggagggag ggtcggctgg gaagagtata tcctatggga
1021 atccctggct tccccttatg tccagtccct ccccggtccg agtaaagggg gactccggga
1081 ctccttgcat gctggggacg aagccgcccc cgggcgctcc cctcgttcca ccttcgaggg
1141 ggttcacacc cccaacctgc gggccggcta ttcttctttc ccttctctcg tcttcctcgg
1201 tcaacctcct aagttcctct tcctcctcct tgctgaggtt ctttcccccc gccgatagct
1261 gctttctctt gttctcgagg gccttccttc gtcggtgatc ctgcctctcc ttgtcggtga
1321 atcctcccct ggaaggcctc ttcctaggtc cggagtctac ttccatctgg tccgttcggg
1381 ccctcttcgc cgggggagcc ccctctccat ccttatcttt ctttccgaga attcctttga
1441 tgtttcccag ccagggatgt tcatcctcaa gtttcttgat tttcttctta accttccgga
1501 ggtctctctc gagttcctct aacttctttc ttccgctcac ccactgctcg agaacctctt
1561 ctctcccccc gcggtttttc cttccttcgg gccggctcat cttcgactag aggcgacggt
1621 cctcagtact cttactcttt tctgtaaaga ggagactgct ggccctgtcg cccaagttcg
1681 ag
Given a DNA sequence, we might ask a number of questions
What sort of statistics should be used to describe the sequence?
What sort of organism did this sequence come from?
Does the description of this sequence differ from
the description of other DNA in the organism?
What sort of sequence is this? What does it do?
85
Biological words
p We can try to answer questions like these
by considering the words in a sequence
p A k-word (or a k-tuple) is a string of length
k drawn from some alphabet
p A DNA k-word is a string of length k that
consists of letters A, C, G, T
n 1-words: individual nucleotides (bases)
n 2-words: dinucleotides (AA, AC, AG, AT, CA, ...)
n 3-words: codons (AAA, AAC, …)
n 4-words and beyond
86
1-words: base composition
p Typically DNA exists as duplex molecule
(two complementary strands)
5’-GGATCGAAGCTAAGGGCT-3’
3’-CCTAGCTTCGATTCCCGA-5’
Top strand: 7 G, 3 C, 5 A, 3 T
Bottom strand: 3 G, 7 C, 3 A, 5 T
Duplex molecule: 10 G, 10 C, 8 A, 8 T
Base frequencies: 10/36 10/36 8/36 8/36
fr(G + C) = 20/36, fr(A + T) = 1 – fr(G + C) = 16/36
These are something
we can determine
experimentally.
87
G+C content
p fr(G + C), or G+C content is a simple
statistics for describing genomes
p Notice that one value is enough
characterise fr(A), fr(C), fr(G) and fr(T)
for duplex DNA
p Is G+C content (= base composition) able
to tell the difference between genomes of
different organisms?
n Simple computational experiment, if we have
the genome sequences under study
(-> exercises)
88
G+C content and genome sizes (in
megabasepairs, Mb) for various organisms
p Mycoplasma genitalium 31.6% 0.585
p Escherichia coli K-12 50.7% 4.693
p Pseudomonas aeruginosa PAO1 66.4% 6.264
p Pyrococcus abyssi 44.6% 1.765
p Thermoplasma volcanium 39.9% 1.585
p Caenorhabditis elegans 36% 97
p Arabidopsis thaliana 35% 125
p Homo sapiens 41% 3080
89
Base frequencies in duplex molecules
p Consider a DNA sequence generated randomly,
with probability of each letter being independent
of position in sequence
p You could expect to find a uniform distribution of
bases in genomes…
p This is not, however, the case in genomes,
especially in prokaryotes
n This phenomena is called GC skew
5’-...GGATCGAAGCTAAGGGCT...-3’
3’-...CCTAGCTTCGATTCCCGA...-5’
90
DNA replication fork
p When DNA is replicated,
the molecule takes the
replication fork form
p New complementary
DNA is synthesised at
both strands of the
”fork”
p New strand in 5’-3’
direction corresponding
to replication fork
movement is called
leading strand and the
other lagging strand
Leading strand
Lagging strand
Replication fork
Replication fork movement
91
DNA replication fork
p This process has
specific starting
points in genome
(origins of
replication)
p Observation:
Leading strands
have an excess of G
over C
p This can be
described by GC
skew statistics Lagging strand
Replication fork
Leading strand
Replication fork movement
92
GC skew
p GC skew is defined as (#G - #C) / (#G +
#C)
p It is calculated at successive positions in
intervals (windows) of specific width
5’-...GGATCGAAGCTAAGGGCT...-3’
3’-...CCTAGCTTCGATTCCCGA...-5’
(3 – 2) / (3 + 2) = 1/5
(4 – 2) / (4 + 2) = 1/3
93
p G-C content &
GC skew
statistics can be
displayed with a
circular genome
map
Chromosome map of S. dysenteriae, the nine rings
describe different properties of the genome
http://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm
G-C content & GC skew
G+C content
GC skew
(10kb window size)
94
GC skew
p GC skew
often
changes sign
at origins
and termini
of replication
G+C content
GC skew
(10kb window size)
Nie et al., BMC Genomics, 2006
95
2-words: dinucleotides
p Let’s consider a sequence L1,L2,...,Ln
where each letter Li is drawn from the
DNA alphabet {A, C, G, T}
p We have 16 possible dinucleotides lili+1:
AA, AC, AG, ..., TG, TT.
96
i.i.d. model for nucleotides
p Assume that bases
n occur independently of each other
n bases at each position are identically
distributed
p Probability of the base A, C, G, T occuring
is pA, pC, pG, pT, respectively
n For example, we could use pA=pC=pG=pT=0.25
or estimate the values from known genome
data
p Probability of lili+1 is then PliPli+1
n For example, P(TG) = pT pG
97
2-words: is what we see surprising?
p We can test whether a sequence is ”unexpected”,
for example, with a 2 test
p Test statistic for a particular dinucleotide r1r2 is
2 = (O – E)2 / E where
n O is the observed number of dinucleotide r1r2
n E is the expected number of dinucleotide r1r2
n E = (n – 1)pr1pr2 under i.i.d. model
p Basic idea: high values of 2 indicate deviation
from the model
n Actual procedure is more detailed -> basic statistics
courses
98
Refining the i.i.d. model
p i.i.d. model describes some organisms well
(see Deonier’s book) but fails to
characterise many others
p We can refine the model by having the
DNA letter at some position depend on
letters at preceding positions
…TCGTGACGCCG ?
Sequence context to
consider
99
First-order Markov chains
p Lets assume that in sequence X the letter at
position t, Xt, depends only on the previous letter
Xt-1 (first-order markov chain)
p Probability of letter j occuring at position t given
Xt-1 = i: pij = P(Xt = j | Xt-1 = i)
p We consider homogeneous markov chains:
probability pij is independent of position t
…TCGTGACGCCG ?
Xt
Xt-1
100
Estimating pij
p We can estimate probabilities pij (”the probability
that j follows i”) from observed dinucleotide
frequencies
A C G T
A pAA pAC pAG pAT
C pCA pCC pCG pCT
G pGA pGC pGG pGT
T pTA pTC pTG pTT
Frequency
of dinucleotide AT
in sequence
…the values pAA, pAC, ..., pTG, pTT sum to 1
+ + + Base frequency
fr(C)
101
Estimating pij
p pij = P(Xt = j | Xt-1 = i) = P(Xt = j, Xt-1 = i)
P(Xt-1 = i)
Probability of transition i -> j
Dinucleotide frequency
Base frequency of nucleotide i,
fr(i)
A C G T
A 0.146 0.052 0.058 0.089
C 0.063 0.029 0.010 0.056
G 0.050 0.030 0.028 0.051
T 0.086 0.047 0.063 0.140
P(Xt = j, Xt-1 = i)
A C G T
A 0.423 0.151 0.168 0.258
C 0.399 0.184 0.063 0.354
G 0.314 0.189 0.176 0.321
T 0.258 0.138 0.187 0.415
P(Xt = j | Xt-1 = i)
0.052 / 0.345 0.151
102
Simulating a DNA sequence
p From a transition matrix, it is easy to generate a
DNA sequence of length n:
n First, choose the starting base randomly according to
the base frequency distribution
n Then, choose next base according to the distribution
P(xt | xt-1) until n bases have been chosen
A C G T
A 0.423 0.151 0.168 0.258
C 0.399 0.184 0.063 0.354
G 0.314 0.189 0.176 0.321
T 0.258 0.138 0.187 0.415
P(Xt = j | Xt-1 = i)
T T C T T C A
A
Look for R code in Deonier’s
book
103
Simulating a DNA sequence
ttcttcaaaataaggatagtgattcttattggcttaagggataacaatttagatcttttttcatgaatcatgtatgtcaacgttaaaagttgaactgcaataagttc
ttacacacgattgtttatctgcgtgcgaagcatttcactacatttgccgatgcagccaaaagtatttaacatttggtaaacaaattgacttaaatcgcgcacttaga
gtttgacgtttcatagttgatgcgtgtctaacaattacttttagttttttaaatgcgtttgtctacaatcattaatcagctctggaaaaacattaatgcatttaaac
cacaatggataattagttacttattttaaaattcacaaagtaattattcgaatagtgccctaagagagtactggggttaatggcaaagaaaattactgtagtgaaga
ttaagcctgttattatcacctgggtactctggtgaatgcacataagcaaatgctacttcagtgtcaaagcaaaaaaatttactgataggactaaaaaccctttattt
ttagaatttgtaaaaatgtgacctcttgcttataacatcatatttattgggtcgttctaggacactgtgattgccttctaactcttatttagcaaaaaattgtcata
gctttgaggtcagacaaacaagtgaatggaagacagaaaaagctcagcctagaattagcatgttttgagtggggaattacttggttaactaaagtgttcatgactgt
tcagcatatgattgttggtgagcactacaaagatagaagagttaaactaggtagtggtgatttcgctaacacagttttcatacaagttctattttctcaatggtttt
ggataagaaaacagcaaacaaatttagtattattttcctagtaaaaagcaaacatcaaggagaaattggaagctgcttgttcagtttgcattaaattaaaaatttat
ttgaagtattcgagcaatgttgacagtctgcgttcttcaaataagcagcaaatcccctcaaaattgggcaaaaacctaccctggcttctttttaaaaaaccaagaaa
agtcctatataagcaacaaatttcaaaccttttgttaaaaattctgctgctgaataaataggcattacagcaatgcaattaggtgcaaaaaaggccatcctctttct
ttttttgtacaattgttcaagcaactttgaatttgcagattttaacccactgtctatatgggacttcgaattaaattgactggtctgcatcacaaatttcaactgcc
caatgtaatcatattctagagtattaaaaatacaaaaagtacaattagttatgcccattggcctggcaatttatttactccactttccacgttttggggatatttta
acttgaatagttcacaatcaaaacataggaaggatctactgctaaaagcaaaagcgtattggaatgataaaaaactttgatgtttaaaaaactacaaccttaatgaa
ttaaagttgaaaaaatattcaaaaaaagaaattcagttcttggcgagtaatatttttgatgtttgagatcagggttacaaaataagtgcatgagattaactcttcaa
atataaactgatttaagtgtatttgctaataacattttcgaaaaggaatattatggtaagaattcataaaaatgtttaatactgatacaactttcttttatatcctc
catttggccagaatactgttgcacacaactaattggaaaaaaaatagaacgggtcaatctcagtgggaggagaagaaaaaagttggtgcaggaaatagtttctacta
acctggtataaaaacatcaagtaacattcaaattgcaaatgaaaactaaccgatctaagcattgattgatttttctcatgcctttcgcctagttttaataaacgcgc
cccaactctcatcttcggttcaaatgatctattgtatttatgcactaacgtgcttttatgttagcatttttcaccctgaagttccgagtcattggcgtcactcacaa
atgacattacaatttttctatgttttgttctgttgagtcaaagtgcatgcctacaattctttcttatatagaactagacaaaatagaaaaaggcacttttggagtct
gaatgtcccttagtttcaaaaaggaaattgttgaattttttgtggttagttaaattttgaacaaactagtatagtggtgacaaacgatcaccttgagtcggtgacta
taaaagaaaaaggagattaaaaatacctgcggtgccacattttttgttacgggcatttaaggtttgcatgtgttgagcaattgaaacctacaactcaataagtcatg
ttaagtcacttctttgaaaaaaaaaaagaccctttaagcaagctc
p Now we can quickly generate sequences of
arbitrary length...
104
Simulating a DNA sequence
aa 0.145 0.146
ac 0.050 0.052
ag 0.055 0.058
at 0.092 0.089
ca 0.065 0.063
cc 0.028 0.029
cg 0.011 0.010
ct 0.058 0.056
ga 0.048 0.050
gc 0.032 0.030
gg 0.029 0.028
gt 0.050 0.051
ta 0.084 0.086
tc 0.052 0.047
tg 0.064 0.063
tt 0.138 0.0140
Dinucleotide frequencies
Simulated Observed
n = 10000
105
Simulating a DNA sequence
ttcttcaaaataaggatagtgattcttattggcttaagggataacaatttagatcttttttcatgaatcatgtatgtcaacgttaaaagttgaactgcaataagttc
ttacacacgattgtttatctgcgtgcgaagcatttcactacatttgccgatgcagccaaaagtatttaacatttggtaaacaaattgacttaaatcgcgcacttaga
gtttgacgtttcatagttgatgcgtgtctaacaattacttttagttttttaaatgcgtttgtctacaatcattaatcagctctggaaaaacattaatgcatttaaac
cacaatggataattagttacttattttaaaattcacaaagtaattattcgaatagtgccctaagagagtactggggttaatggcaaagaaaattactgtagtgaaga
ttaagcctgttattatcacctgggtactctggtgaatgcacataagcaaatgctacttcagtgtcaaagcaaaaaaatttactgataggactaaaaaccctttattt
ttagaatttgtaaaaatgtgacctcttgcttataacatcatatttattgggtcgttctaggacactgtgattgccttctaactcttatttagcaaaaaattgtcata
gctttgaggtcagacaaacaagtgaatggaagacagaaaaagctcagcctagaattagcatgttttgagtggggaattacttggttaactaaagtgttcatgactgt
tcagcatatgattgttggtgagcactacaaagatagaagagttaaactaggtagtggtgatttcgctaacacagttttcatacaagttctattttctcaatggtttt
ggataagaaaacagcaaacaaatttagtattattttcctagtaaaaagcaaacatcaaggagaaattggaagctgcttgttcagtttgcattaaattaaaaatttat
ttgaagtattcgagcaatgttgacagtctgcgttcttcaaataagcagcaaatcccctcaaaattgggcaaaaacctaccctggcttctttttaaaaaaccaagaaa
agtcctatataagcaacaaatttcaaaccttttgttaaaaattctgctgctgaataaataggcattacagcaatgcaattaggtgcaaaaaaggccatcctctttct
ttttttgtacaattgttcaagcaactttgaatttgcagattttaacccactgtctatatgggacttcgaattaaattgactggtctgcatcacaaatttcaactgcc
caatgtaatcatattctagagtattaaaaatacaaaaagtacaattagttatgcccattggcctggcaatttatttactccactttccacgttttggggatatttta
acttgaatagttcacaatcaaaacataggaaggatctactgctaaaagcaaaagcgtattggaatgataaaaaactttgatgtttaaaaaactacaaccttaatgaa
ttaaagttgaaaaaatattcaaaaaaagaaattcagttcttggcgagtaatatttttgatgtttgagatcagggttacaaaataagtgcatgagattaactcttcaa
atataaactgatttaagtgtatttgctaataacattttcgaaaaggaatattatggtaagaattcataaaaatgtttaatactgatacaactttcttttatatcctc
catttggccagaatactgttgcacacaactaattggaaaaaaaatagaacgggtcaatctcagtgggaggagaagaaaaaagttggtgcaggaaatagtttctacta
acctggtataaaaacatcaagtaacattcaaattgcaaatgaaaactaaccgatctaagcattgattgatttttctcatgcctttcgcctagttttaataaacgcgc
cccaactctcatcttcggttcaaatgatctattgtatttatgcactaacgtgcttttatgttagcatttttcaccctgaagttccgagtcattggcgtcactcacaa
atgacattacaatttttctatgttttgttctgttgagtcaaagtgcatgcctacaattctttcttatatagaactagacaaaatagaaaaaggcacttttggagtct
gaatgtcccttagtttcaaaaaggaaattgttgaattttttgtggttagttaaattttgaacaaactagtatagtggtgacaaacgatcaccttgagtcggtgacta
taaaagaaaaaggagattaaaaatacctgcggtgccacattttttgttacgggcatttaaggtttgcatgtgttgagcaattgaaacctacaactcaataagtcatg
ttaagtcacttctttgaaaaaaaaaaagaccctttaagcaagctc
p The model is able to generate correct proportions
of 1- and 2-words in genomes...
p ...but fails with k=3 and beyond.
106
3-words: codons
p We can extend the previous method to 3-
words
p k=3 is an important case in study of DNA
sequences because of genetic code
5’ 3’
3’ 5’
… a t g a g t g g a …
… t a c t c a c c t …
a u g a g u g g a ...
M S G …
107
3-word probabilities
p Let’s again assume a sequence L of
independent bases
p Probability of 3-word r1r2r3 at position
i,i+1,i+2 in sequence L is
P(Li = r1, Li+1 = r2, Li+2 = r3) =
P(Li = r1)P(Li+1 = r2)P(Li+2 = r3)
108
3-words in Escherichia coli genome
AAA 108924 0.02348 0.01492
AAC 82582 0.01780 0.01541
AAG 63369 0.01366 0.01537
AAT 82995 0.01789 0.01490
ACA 58637 0.01264 0.01541
ACC 74897 0.01614 0.01591
ACG 73263 0.01579 0.01588
ACT 49865 0.01075 0.01539
AGA 56621 0.01220 0.01537
AGC 80860 0.01743 0.01588
AGG 50624 0.01091 0.01584
AGT 49772 0.01073 0.01536
ATA 63697 0.01373 0.01490
ATC 86486 0.01864 0.01539
ATG 76238 0.01643 0.01536
ATT 83398 0.01797 0.01489
CAA 76614 0.01651 0.01541
CAC 66751 0.01439 0.01591
CAG 104799 0.02259 0.01588
CAT 76985 0.01659 0.01539
CCA 86436 0.01863 0.01591
CCC 47775 0.01030 0.01643
CCG 87036 0.01876 0.01640
CCT 50426 0.01087 0.01589
CGA 70938 0.01529 0.01588
CGC 115695 0.02494 0.01640
CGG 86877 0.01872 0.01636
CGT 73160 0.01577 0.01586
CTA 26764 0.00577 0.01539
CTC 42733 0.00921 0.01589
CTG 102909 0.02218 0.01586
CTT 63655 0.01372 0.01537
Word Count Observed Expected Word Count Observed Expected
109
3-words in Escherichia coli genome
GAA 83494 0.01800 0.01537
GAC 54737 0.01180 0.01588
GAG 42465 0.00915 0.01584
GAT 86551 0.01865 0.01536
GCA 96028 0.02070 0.01588
GCC 92973 0.02004 0.01640
GCG 114632 0.02471 0.01636
GCT 80298 0.01731 0.01586
GGA 56197 0.01211 0.01584
GGC 92144 0.01986 0.01636
GGG 47495 0.01024 0.01632
GGT 74301 0.01601 0.01582
GTA 52672 0.01135 0.01536
GTC 54221 0.01169 0.01586
GTG 66117 0.01425 0.01582
GTT 82598 0.01780 0.01534
TAA 68838 0.01484 0.01490
TAC 52592 0.01134 0.01539
TAG 27243 0.00587 0.01536
TAT 63288 0.01364 0.01489
TCA 84048 0.01812 0.01539
TCC 56028 0.01208 0.01589
TCG 71739 0.01546 0.01586
TCT 55472 0.01196 0.01537
TGA 83491 0.01800 0.01536
TGC 95232 0.02053 0.01586
TGG 85141 0.01835 0.01582
TGT 58375 0.01258 0.01534
TTA 68828 0.01483 0.01489
TTC 83848 0.01807 0.01537
TTG 76975 0.01659 0.01534
TTT 109831 0.02367 0.01487
Word Count Observed Expected Word Count Observed Expected
110
2nd order Markov Chains
p Markov chains readily generalise to higher orders
p In 2nd order markov chain, position t depends on
positions t-1 and t-2
p Transition matrix:
p Probabilistic models for DNA and amino acid
sequences will be discussed in Biological
sequence analysis course (II period)
A C G T
AA
AC
AG
AT
CA
...
111
Codon Adaptation Index (CAI)
p Observation: cells prefer certain codons
over others in highly expressed genes
n Gene expression: DNA is transcribed into RNA
(and possibly translated into protein)
Phe TTT 0.493 0.551 0.291
TTC 0.507 0.449 0.709
Ala GCT 0.246 0.145 0.275
GCC 0.254 0.276 0.164
GCA 0.246 0.196 0.240
GCG 0.254 0.382 0.323
Asn AAT 0.493 0.409 0.172
AAC 0.507 0.591 0.828
Amino
acid Codon Predicted Gene class I Gene class II
Highly
expressed
Moderately
expressed
Codon frequencies for some genes in E. coli
112
Codon Adaptation Index (CAI)
p CAI is a statistic used to compare the
distribution of codons observed with the
preferred codons for highly expressed
genes
113
Codon Adaptation Index (CAI)
p Consider an amino acid sequence X = x1x2...xn
p Let pk be the probability that codon k is used in
highly expressed genes
p Let qk be the highest probability that a codon
coding for the same amino acid as codon k has
n For example, if codon k is ”GCC”, the
corresponding amino acid is Alanine (see genetic
code table; also GCT, GCA, GCG code for Alanine)
n Assume that pGCC = 0.164, pGCT = 0.275, pGCA =
0.240, pGCG = 0.323
n Now qGCC = qGCT = qGCA = qGCG = 0.323
114
Codon Adaptation Index (CAI)
p CAI is defined as
p CAI can be given also in log-odds form:
log(CAI) = (1/n) log(pk / qk)
CAI = ( pk / qk )
k=1
n 1/n
k=1
n
115
CAI: example with an E. coli gene
M A L T K A E M S E Y L …
ATG GCG CTT ACA AAA GCT GAA ATG TCA GAA TAT CTG
1.00 0.47 0.02 0.45 0.80 0.47 0.79 1.00 0.43 0.79 0.19 0.02
0.06 0.02 0.47 0.20 0.06 0.21 0.32 0.21 0.81 0.02
0.28 0.04 0.04 0.28 0.03 0.04
0.20 0.03 0.05 0.20 0.01 0.03
0.01 0.04 0.01
0.89 0.18 0.89
ATG GCT TTA ACT AAA GCT GAA ATG TCT GAA TAT TTA
GCC TTG ACC AAG GCC GAG TCC GAG TAC TTG
GCA CTT ACA GCA TCA CTT
GCG CTC ACG GCG TCG CTC
CTA AGT CTA
CTG AGC CTG
1.00 0.20 0.04 0.04 0.80 0.47 0.79 1.00 0.03 0.79 0.19 0.89…
1.00 0.47 0.89 0.47 0.80 0.47 0.79 1.00 0.43 0.79 0.81 0.89
1/n
qk
pk
116
CAI: properties
p CAI = 1.0 : each codon was the most frequently
used codon in highly expressed genes
p Log-odds used to avoid numerical problems
n What happens if you multiply many values <1.0
together?
p In a sample of E.coli genes, CAI ranged from 0.2
to 0.85
p CAI correlates with mRNA levels: can be used to
predict high expression levels
117
Biological words: summary
p Simple 1-, 2- and 3-word models can
describe interesting properties of DNA
sequences
n GC skew can identify DNA replication origins
n It can also reveal genome rearrangement
events and lateral transfer of DNA
n GC content can be used to locate genes:
human genes are comparably GC-rich
n CAI predicts high gene expression levels
118
Biological words: summary
p k=3 models can help to identify correct
reading frames
n Reading frame starts from a start codon and
stops in a stop codon
n Consider what happens to translation when a
single extra base is introduced in a reading
frame
p Also word models for k > 3 have their
uses
119
Next lecture
p Genome sequencing & assembly – where
do we get sequence data?
120
Note on programming languages
p Working with probability distributions is
straightforward with R, for example
n Deonier’s book contains many computational
examples
n You can use R in CS Linux systems
p Python works too!
121
#!/usr/bin/env python
import sys, random
n = int(sys.argv[1])
tm = {'a' : {'a' : 0.423, 'c' : 0.151, 'g' : 0.168, 't' : 0.258},
'c' : {'a' : 0.399, 'c' : 0.184, 'g' : 0.063, 't' : 0.354},
'g' : {'a' : 0.314, 'c' : 0.189, 'g' : 0.176, 't' : 0.321},
't' : {'a' : 0.258, 'c' : 0.138, 'g' : 0.187, 't' : 0.415}}
pi = {'a' : 0.345, 'c' : 0.158, 'g' : 0.159, 't' : 0.337}
def choose(dist):
r = random.random()
sum = 0.0
keys = dist.keys()
for k in keys:
sum += dist[k]
if sum > r:
return k
return keys[-1]
c = choose(pi)
for i in range(n - 1):
sys.stdout.write(c)
c = choose(tm[c])
sys.stdout.write(c)
sys.stdout.write("n")
Example Python code for generating
DNA sequences with first-order
Markov chains.
Function choose(), returns a key (here ’a’, ’c’, ’g’ or
’t’) of the dictionary ’dist’ chosen randomly
according to probabilities in dictionary values.
Choose the first letter, then choose
next letter according to P(xt | xt-1).
Transition matrix
tm and initial
distribution pi.
Initialisation: use packages ’sys’ and ’random’,
read sequence length from input.
Introduction to
Bioinformatics
Genome sequencing & assembly
123
Genome sequencing & assembly
p DNA sequencing
n How do we obtain DNA sequence information from
organisms?
p Genome assembly
n What is needed to put together DNA sequence
information from sequencing?
p First statement of sequence assembly problem
(according to G. Myers):
n Peltola, Söderlund, Tarhio, Ukkonen: Algorithms for
some string matching problems arising in molecular
genetics. Proc. 9th IFIP World Computer Congress, 1983
124
?
Recovery of shredded newspaper
125
DNA sequencing
p DNA sequencing: resolving a nucleotide
sequence (whole-genome or less)
p Many different methods developed
n Maxam-Gilbert method (1977)
n Sanger method (1977)
n High-throughput methods
126
Sanger sequencing: sequencing by
synthesis
p A sequencing technique developed by Fred
Sanger
p Also called dideoxy sequencing
127
http://en.wikipedia.org/wiki/DNA_polymerase
DNA polymerase
p A DNA polymerase is an
enzyme that catalyzes
DNA synthesis
p DNA polymerase needs
a primer
n Synthesis proceeds
always in 5’->3’ direction
128
Dideoxy sequencing
p In Sanger sequencing, chain-terminating
dideoxynucleoside triphosphates (ddXTPs)
are employed
n ddATP, ddCTP, ddGTP, ddTTP lack the 3’-OH
tail of dXTPs
p A mixture of dXTPs with small amount of
ddXTPs is given to DNA polymerase with
DNA template and primer
p ddXTPs are given fluorescent labels
129
Dideoxy sequencing
p When DNA polymerase encounters a
ddXTP, the synthesis cannot proceed
p The process yields copied sequences of
different lengths
p Each sequence is terminated by a labeled
ddXTP
130
Determining the sequence
p Sequences are sorted
according to length by
capillary
electrophoresis
p Fluorescent signals
corresponding to
labels are registered
p Base calling:
identifying which base
corresponds to each
position in a read
n Non-trivial problem!
Output sequences from
base calling are called reads
131
Reads are short!
p Modern Sanger sequencers can produce
quality reads up to ~750 bases1
n Instruments provide you with a quality file for
bases in reads, in addition to actual sequence
data
p Compare the read length against the size
of the human genome (2.9x109 bases)
p Reads have to be assembled!
1 Nature Methods - 5, 16 - 18 (2008)
132
Problems with sequencing
p Sanger sequencing error rate per base
varies from 1% to 3%1
p Repeats in DNA
n For example, ~300 base Alu sequence
repeated is over million times in human
genome
n Repeats occur in different scales
p What happens if repeat length is longer
than read length?
n We will get back to this problem later
1 Jones, Pevzner (2004)
133
Shortest superstring problem
p Find the shortest string that ”explains” the
reads
p Given a set of strings (reads), find a
shortest string that contains all of them
134
Example: Shortest superstring
Set of strings: {000, 001, 010, 011, 100, 101, 110, 111}
Concetenation of strings: 000001010011100101110111
010
110
011
000
Shortest superstring: 0001110100
001
111
101
100
135
Shortest superstrings: issues
p NP-complete problem: unlike to have an
efficient (exact) algorithm
p Reads may be from either strand of DNA
p Is the shortest string necessarily the
correct assembly?
p What about errors in reads?
p Low coverage -> gaps in assembly
n Coverage: average number of times each base
occurs in the set of reads (e.g., 5x coverage)
136
Sequence assembly and combination
locks
p What is common with sequence assembly
and opening keypad locks?
137
Whole-genome shotgun sequence
p Whole-genome shotgun sequence
assembly starts with a large sample of
genomic DNA
1. Sample is randomly partitioned into inserts of
length > 500 bases
2. Inserts are multiplied by cloning them into a
vector which is used to infect bacteria
3. DNA is collected from bacteria and sequenced
4. Reads are assembled
138
Assembly of reads with Overlap-Layout-
Consensus algorithm
p Overlap
n Finding potentially overlapping reads
p Layout
n Finding the order of reads along DNA
p Consensus (Multiple alignment)
n Deriving the DNA sequence from the layout
p Next, the method is described at a very
abstract level, skipping a lot of details
Kececioglu, J.D. and E.W. Myers. 1995. Combinatorial algorithms for
DNA sequence assembly. Algorithmica 13: 7-51.
139
Finding overlaps
p First, pairwise overlap
alignment of reads is
resolved
p Reads can be from
either DNA strand:
The reverse
complement r* of
each read r has to be
considered
acggagtcc
agtccgcgctt
5’ 3’
3’ 5’
… a t g a g t g g a …
… t a c t c a c c t …
r1
r2
r1: tgagt, r1
*: actca
r2: tccac, r2
*: gtgga
140
Example sequence to assemble
p 20 reads:
5’ – CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCG
TGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT - 3’
# Read Read*
1 CATCGTCA TCACGATG
2 CGGTGAAG CTTCACCG
3 TATGCGCA TGCGCATA
4 GACGAGTC GACTCGTC
5 CTGACAAA TTTGTCAG
6 ATGCGCAT ATGCGCAT
7 ATGCGGTC GACCGCAT
8 CTGCGTGA TCACGCAG
9 GCGTGACG CGTCACGC
10 GTCGGTGA TCACCGAC
# Read Read*
11 GGTCGGTG CACCGACC
12 ATCGTGAT ATCACGAT
13 GCGCTGCG CGCAGCGC
14 GCATCGTG CACGATGC
15 AGCGCGCT AGCGCGCT
16 GAAGTTGT ACAACTTC
17 AGTGAAAC GTTTCACT
18 ACGCGATG CATCGCGT
19 GCGCATCG CGATGCGC
20 AAGTGAAA TTTCACTT
141
Finding overlaps
p Overlap between two reads
can be found with a
dynamic programming
algorithm
n Errors can be taken into
account
p Dynamic programming will
be discussed more on next
lecture
p Overlap scores stored into
the overlap matrix
n Entries (i, j) below the
diagonal denote overlap of
read ri and rj
*
1 CATCGTCA
6 ATGCGCAT
12 ATCGTGAT
Overlap(1, 6) = 3
Overlap(1, 12) = 7
1
6 12
3 7
142
Finding layout & consensus
p Method extends the
assembly greedily by
choosing the best
overlaps
p Both orientations are
considered
p Sequence is extended
as far as possible
7* GACCGCAT
6=6* ATGCGCAT
14 GCATCGTG
1 CATCGTGA
12 ATCGTGAT
19 GCGCATCG
13* CGCAGCGC
---------------------
CGCATCGTGAT
Ambiguous bases
Consensus sequence
143
Finding layout & consensus
p We move on to next best
overlaps and extend the
sequence from there
p The method stops when
there are no more overlaps
to consider
p A number of contigs is
produced
p Contig stands for
contiguous sequence,
resulting from merging
reads
2 CGGTGAAG
10 GTCGGTGA
11 GGTCGGTG
7 ATGCGGTC
---------------------
ATGCGGTCGGTGAAG
144
Whole-genome shotgun sequencing:
summary
p Ordering of the reads is initially unknown
p Overlaps resolved by aligning the reads
p In a 3x109 bp genome with 500 bp reads and 5x
coverage, there are ~107 reads and ~107(107-1)/2
= ~5x1013 pairwise sequence comparisons
… …
Original genome sequence
Reads
Non-overlapping
read
Overlapping reads
=> Contig
145
Repeats in DNA and genome assembly
Pop, Salzberg, Shumway (2002)
Two instances of the same repeat
146
Repeats in DNA cause problems in
sequence assembly
p Recap: if repeat length exceeds read
length, we might not get the correct
assembly
p This is a problem especially in eukaryotes
n ~3.1% of genome consists of repeats in
Drosophila, ~45% in human
p Possible solutions
1. Increase read length – feasible?
2. Divide genome into smaller parts, with known
order, and sequence parts individually
147
”Divide and conquer” sequencing
approaches: BAC-by-BAC
Whole-genome shotgun sequencing
Divide-and-conquer
Genome
Genome
BAC library
148
BAC-by-BAC sequencing
p Each BAC (Bacterial Artificial
Chromosome) is about 150 kbp
p Covering the human genome requires
~30000 BACs
p BACs shotgun-sequenced separately
n Number of repeats in each BAC is
significantly smaller than in the whole
genome...
n ...needs much more manual work compared
to whole-genome shotgun sequencing
149
Hybrid method
p Divide-and-conquer and whole-genome
shotgun approaches can be combined
n Obtain high coverage from whole-genome
shotgun sequencing for short contigs
n Generate of a set of BAC contigs with low
coverage
n Use BAC contigs to ”bin” short contigs to
correct places
p This approach was used to sequence the
brown Norway rat genome in 2004
150
Paired end sequencing
p Paired end (or mate-pair) sequencing is
technique where
n both ends of an insert are sequenced
n For each insert, we get two reads
n We know the distance between reads, and that
they are in opposite orientation
n Typically read length < insert length
k
Read 1 Read 2
151
Paired end sequencing
p The key idea of paired end sequencing:
n Both reads from an insert are unlikely to be in repeat
regions
n If we know where the first read is, we know also
second’s location
p This technique helps to WGSS higher organisms
k
Read 1 Read 2
Repeat region
152
First whole-genome shotgun sequencing
project: Drosophila melanogaster
p Fruit fly is a common
model organism in
biological studies
p Whole-genome
assembly reported in
Eugene Myers, et al.,
A Whole-Genome
Assembly of
Drosophila, Science
24, 2000
p Genome size 120 Mbp
http://en.wikipedia.org/wiki/Drosophila_melanogaster
153
Sequencing of the Human Genome
p The (draft) human
genome was published
in 2001
p Two efforts:
n Human Genome Project
(public consortium)
n Celera (private
company)
p HGP: BAC-by-BAC
approach
p Celera: whole-genome
shotgun sequencing
HGP: Nature 15 February 2001
Vol 409 Number 6822
Celera: Science 16 February 2001
Vol 291, Issue 5507
154
Genome assembly software
p phrap (Phil’s revised assembly program)
p AMOS (A Modular, Open-Source whole-
genome assembler)
p CAP3 / PCAP
p TIGR assembler
155
Next generation sequencing techniques
p Sanger sequencing is the prominent first-
generation sequencing method
p Many new sequencing methods are
emerging
p See Lars Paulin’s slides (course web page)
for details
156
Next-gen sequencing: 454
p Genome Sequencer FLX (454 Life Science
/ Roche)
n >100 Mb / 7.5 h run
n Read length 250-300 bp
n >99.5% accuracy / base in a single run
n >99.99% accuracy / base in consensus
157
Next-gen sequencing: Illumina Solexa
p Illumina / Solexa Genome Analyzer
n Read length 35 - 50 bp
n 1-2 Gb / 3-6 day run
n > 98.5% accuracy / base in a single run
n 99.99% accuracy / consensus with 3x
coverage
158
Next-gen sequencing: SOLiD
p SOLiD
n Read length 25-30 bp
n 1-2 Gb / 5-10 day run
n >99.94% accuracy / base
n >99.999% accuracy / consensus with 15x
coverage
159
Next-gen sequencing: Helicos
p Helicos: Single Molecule Sequencer
n No amplification of sequences needed
n Read length up to 55 bp
p Accuracy does not decrease when read length is
increased
p Instead, throughput goes down
n 25-90 Mb / h
n >2 Gb / day
160
Next-gen sequencing: Pacific
Biosciences
p Pacific Biosciences
n Single-Molecule Real-Time (SMRT) DNA
sequencing technology
n Read length “thousands of nucleotides”
p Should overcome most problems with repeats
n Throughput estimate: 100 Gb / hour
n First instruments in 2010?
161
Exercise groups
p 1st group: Tuesdays 16.15-18.00 C221
p 2nd group: Wednesday 14.15-16.00 B120
p You can choose freely which group you
want to attend to
p Send exercise notes before the 1st group
starts (Tue 16.15), even if you go to the
2nd group
Introduction to
Bioinformatics
Lecture 3:
Sequence alignment
163
Sequence alignment
p The biological problem
p Global alignment
p Local alignment
p Multiple alignment
164
Background: comparative genomics
p Basic question in biology: what properties
are shared among organisms?
p Genome sequencing allows comparison of
organisms at DNA and protein levels
p Comparisons can be used to
n Find evolutionary relationships between
organisms
n Identify functionally conserved sequences
n Identify corresponding genes in human and
model organisms: develop models for human
diseases
165
Homologs
p Two genes (sequences in
general) gB and gC
evolved from the same
ancestor gene gA are
called homologs
p Homologs usually exhibit
conserved functions
p Close evolutionary
relationship => expect a
high number of homologs
gB = agtgccgttaaagttgtacgtc
gC = ctgactgtttgtggttc
gA = agtgtccgttaagtgcgttc
166
p We expect homologs to be ”similar” to each other
p Intuitively, similarity of two sequences refers to
the degree of match between corresponding
positions in sequence
p What about sequences that differ in length?
Sequence similarity
agtgccgttaaagttgtacgtc
ctgactgtttgtggttc
167
Similarity vs homology
p Sequence similarity is not sequence
homology
n If the two sequences gB and gC have accumulated
enough mutations, the similarity between them is likely
to be low
Homology is more difficult to detect over greater
evolutionary distances.
0 agtgtccgttaagtgcgttc
1 agtgtccgttatagtgcgttc
2 agtgtccgcttatagtgcgttc
4 agtgtccgcttaagggcgttc
8 agtgtccgcttcaaggggcgt
16 gggccgttcatgggggt
32 gcagggcgtcactgagggct
64 acagtccgttcgggctattg
128 cagagcactaccgc
256 cacgagtaagatatagct
512 taatcgtgata
1024 acccttatctacttcctggagtt
2048 agcgacctgcccaa
4096 caaac
#mutations #mutations
168
Similarity vs homology (2)
p Sequence similarity can occur by chance
n Similarity does not imply homology
p Consider comparing two short sequences
against each other
169
Orthologs and paralogs
p We distinguish between two types of homology
n Orthologs: homologs from two different species,
separated by a speciation event
n Paralogs: homologs within a species, separated by a
gene duplication event
gA
gB gC
Organism B Organism C
gA
gA gA’
gB gC
Organism A
Gene duplication event
Orthologs
Paralogs
170
Orthologs and paralogs (2)
p Orthologs typically retain the original function
p In paralogs, one copy is free to mutate and
acquire new function (no selective pressure)
gA
gB gC
Organism B Organism C
gA
gA gA’
gB gC
Organism A
171
Paralogy example: hemoglobin
p Hemoglobin is a protein
complex which transports
oxygen
p In humans, hemoglobin
consists of four protein
subunits and four non-
protein heme groups
Hemoglobin A,
www.rcsb.org/pdb/explore.do?structureId=1GZX
Sickle cell diseases
are caused by mutations
in hemoglobin genes
http://en.wikipedia.org/wiki/Image:Sicklecells.jpg
172
Paralogy example: hemoglobin
p In adults, three types are
normally present
n Hemoglobin A: 2 alpha and
2 beta subunits
n Hemoglobin A2: 2 alpha
and 2 delta subunits
n Hemoglobin F: 2 alpha and
2 gamma subunits
p Each type of subunit
(alpha, beta, gamma,
delta) is encoded by a
separate gene
Hemoglobin A,
www.rcsb.org/pdb/explore.do?structureId=1GZX
173
Paralogy example: hemoglobin
p The subunit genes are
paralogs of each other, i.e.,
they have a common ancestor
gene
p Exercise: hemoglobin human
paralogs in NCBI sequence
databases
http://www.ncbi.nlm.nih.gov/sites/entre
z?db=Nucleotide
n Find human hemoglobin alpha, beta,
gamma and delta
n Compare sequences
Hemoglobin A,
www.rcsb.org/pdb/explore.do?structureId=1GZX
174
Orthology example: insulin
p The genes coding for insulin in human
(Homo sapiens) and mouse (Mus
musculus) are orthologs:
n They have a common ancestor gene in the
ancestor species of human and mouse
n Exercise: find insulin orthologs from human
and mouse in NCBI sequence databases
175
Sequence alignment
p Alignment specifies which positions in two
sequences match
acgtctag
|||||
-actctag
5 matches
2 mismatches
1 not aligned
acgtctag
||
actctag-
2 matches
5 mismatches
1 not aligned
acgtctag
|| |||||
ac-tctag
7 matches
0 mismatches
1 not aligned
176
Sequence alignment
p Maximum alignment length is the total length of
the two sequences
acgtctag-------
--------actctag
0 matches
0 mismatches
15 not aligned
-------acgtctag
actctag--------
0 matches
0 mismatches
15 not aligned
177
Mutations: Insertions, deletions and
substitutions
p Insertions and/or deletions are called
indels
n We can’t tell whether the ancestor sequence
had a base or not at indel position!
acgtctag
|||||
-actctag
Indel: insertion or
deletion of a base
with respect to the
ancestor sequence
Mismatch: substitution
(point mutation) of
a single base
178
Problems
p What sorts of alignments should be considered?
p How to score alignments?
p How to find optimal or good scoring alignments?
p How to evaluate the statistical significance of
scores?
In this course, we discuss each of these problems
briefly.
179
Sequence Alignment (chapter 6)
p The biological problem
p Global alignment
p Local alignment
p Multiple alignment
180
Global alignment
p Problem: find optimal scoring alignment between
two sequences (Needleman & Wunsch 1970)
p Every position in both sequences is included in
the alignment
p We give score for each position in alignment
n Identity (match) +1
n Substitution (mismatch) -µ
n Indel -
p Total score: sum of position scores
181
Scoring: Toy example
p Consider two sequences
with characters drawn
from the English
language alphabet:
WHAT, WHY
WHAT
||
WH-Y
S(WHAT/WH-Y) = 1 + 1 – – µ
WHAT
-WHY
S(WHAT/-WHY) = – – µ – µ – µ
182
Dynamic programming
p How to find the optimal alignment?
p We use previous solutions for optimal
alignments of smaller subsequences
p This general approach is known as
dynamic programming
183
Introduction to dynamic programming:
the money change problem
p Suppose you buy a pen for 4.23€ and pay for
with a 5€ note
p You get 77 cents in change – what coins is the
cashier going to give you if he or she tries to
minimise the number of coins?
p The usual algorithm: start with largest coin
(denominator), proceed to smaller coins until no
change is left:
n 50, 20, 5 and 2 cents
p This greedy algorithm is incorrect, in the sense
that it does not always give you the correct
answer
184
The money change problem
p How else to compute the
change?
p We could consider all possible
ways to reduce the amount of
change
p Suppose we have 77 cents
change, and the following
coins: 50, 20, 5 cents
p We can compute the change
with recursion
n C(n) = min { C(n – 50) + 1,
C(n – 20) + 1, C(n – 5) + 1 }
p Figure shows the recursion
tree for the example
77
72
27 57
7 22 7 37 52 22 52 67
…
50 20 5
p Many values are computed more
than once!
p This leads to a correct but very
inefficient algorithm
185
The money change problem
p We can speed the computation up by
solving the change problem for all i n
n Example: solve the problem for 9 cents with
available coins being 1, 2 and 5 cents
p Solve the problem in steps, first for 1
cent, then 2 cents, and so on
p In each step, utilise the solutions from the
previous steps
186
The money change problem
0 1 2 3 4 5 6 7 8 9
Amount of
change left
p Algorithm runs in time proportional to Md, where M is the
amount of change and d is the number of coin types
p The same technique of storing solutions of subproblems can
be utilised in aligning sequences
Number of
coins used
0 1 1 2 2 1 2 2 3 3
187
Representing alignments and scores
Y
H
W
-
T
A
H
W
-
WHAT
||
WH-Y
Alignments can be
represented in the
following tabular form.
Each alignment
corresponds to a path
through the table.
188
Representing alignments and scores
Y
H
W
-
T
A
H
W
-
WHAT---
----WHY
WH-AT
||
WHY--
189
Y
H
W
-
T
A
H
W
-
WHAT
||
WH-Y
Global alignment
score S3,4 = 2- -µ
2- -µ
2-
2
1
0
Representing alignments and scores
190
Filling the alignment matrix
Y
H
W
-
T
A
H
W
-
Case 1
Case 2
Case 3
Consider the alignment process
at shaded square.
Case 1. Align H against H
(match)
Case 2. Align H in WHY against
– (indel) in WHAT
Case 3. Align H in WHAT
against – (indel) in WHY
191
Filling the alignment matrix (2)
Y
H
W
-
T
A
H
W
-
Case 1
Case 2
Case 3
Scoring the alternatives.
Case 1. S2,2 = S1,1 + s(2, 2)
Case 2. S2,2 = S1,2
Case 3. S2,2 = S2,1
s(i, j) = 1 for matching positions,
s(i, j) = - µ for substitutions.
Choose the case (path) that
yields the maximum score.
Keep track of path choices.
192
Global alignment: formal
development
A = a1a2a3…an,
B = b1b2b3…bm
a3
a2
a1
-
b4
b3
b2
b1
-
3
2
1
0
4
3
2
1
0
b1 b2 b3 b4 -
- -
a1 a2 a3
l Any alignment can be written
as a unique path through the
matrix
l Score for aligning A and B up
to positions i and j:
Si,j = S(a1a2a3…ai, b1b2b3…bj)
193
Scoring partial alignments
p Alignment of A = a1a2a3…ai with B = b1b2b3…bj
can be end in three possible ways
n Case 1: (a1a2…ai-1) ai
(b1b2…bj-1) bj
n Case 2: (a1a2…ai-1) ai
(b1b2…bj) -
n Case 3: (a1a2…ai) –
(b1b2…bj-1) bj
194
Scoring alignments
p Scores for each case:
n Case 1: (a1a2…ai-1) ai
(b1b2…bj-1) bj
n Case 2: (a1a2…ai-1) ai
(b1b2…bj) –
n Case 3: (a1a2…ai) –
(b1b2…bj-1) bj
s(ai, bj) = { -µ otherwise
+1 if ai = bj
s(ai, -) = s(-, bj) = -
195
Scoring alignments (2)
p First row and first column
correspond to initial alignment
against indels:
S(i, 0) = -i
S(0, j) = -j
p Optimal global alignment score
S(A, B) = Sn,m
a3
a2
a1
-
b4
b3
b2
b1
-
-3
3
-2
2
-
1
-4
-3
-2
-
0
0
4
3
2
1
0
196
Algorithm for global alignment
Input sequences A, B, n = |A|, m = |B|
Set Si,0 := - i for all i
Set S0,j := - j for all j
for i := 1 to n
for j := 1 to m
Si,j := max{Si-1,j – , Si-1,j-1 + s(ai,bj), Si,j-1 – }
end
end
Algorithm takes O(nm) time
197
Global alignment: example
?
-10
T
-8
G
-6
C
-4
T
-2
A
-10
-8
-6
-4
-2
0
-
G
T
G
G
T
-
µ = 1
= 2
198
Global alignment: example
?
-10
T
-8
G
-6
C
-4
T
-2
A
-10
-8
-6
-4
-2
0
-
G
T
G
G
T
-
µ = 1
= 2 -1 -3
199
Global alignment: example (2)
-2
0
-3
-4
-7
-10
T
-4
-3
-1
-2
-5
-8
G
-5
-5
-3
-2
-3
-6
C
-6
-4
-4
-2
-1
-4
T
-9
-7
-5
-3
-1
-2
A
-10
-8
-6
-4
-2
0
-
G
T
G
G
T
-
µ = 1
= 2
ATCGT-
| ||
-TGGTG
200
Sequence Alignment (chapter 6)
p The biological problem
p Global alignment
p Local alignment
p Multiple alignment
201
Local alignment: rationale
p Otherwise dissimilar proteins may have local regions of
similarity
-> Proteins may share a function
Human bone
morphogenic protein
receptor type II
precursor (left) has a
300 aa region that
resembles 291 aa
region in TGF-
receptor (right).
The shared function
here is protein kinase.
202
Local alignment: rationale
p Global alignment would be inadequate
p Problem: find the highest scoring local alignment
between two sequences
p Previous algorithm with minor modifications solves this
problem (Smith & Waterman 1981)
A
B
Regions of
similarity
203
From global to local alignment
p Modifications to the global alignment
algorithm
n Look for the highest-scoring path in the
alignment matrix (not necessarily through the
matrix), or in other words:
n Allow preceding and trailing indels without
penalty
204
Scoring local alignments
A = a1a2a3…an, B = b1b2b3…bm
Let I and J be intervals (substrings) of A and B, respectively:
Best local alignment score:
where S(I, J) is the alignment score for substrings I and J.
205
Allowing preceding and trailing
indels
p First row and column
initialised to zero:
Mi,0 = M0,j = 0
a3
a2
a1
-
b4
b3
b2
b1
-
0
3
0
2
0
1
0
0
0
0
0
0
4
3
2
1
0
b1 b2 b3
- - a1
206
Recursion for local alignment
p Mi,j = max {
Mi-1,j-1 + s(ai, bi),
Mi-1,j – ,
Mi,j-1 – ,
0
}
0
2
0
0
1
0
T
1
0
1
1
0
0
G
0
0
0
0
0
0
C
0
1
0
0
1
0
T
0
0
0
0
0
0
A
0
0
0
0
0
0
-
G
T
G
G
T
-
Allow alignment to
start anywhere in
sequences
207
Finding best local alignment
p Optimal score is the highest
value in the matrix
= maxi,j Mi,j
p Best local alignment can be
found by backtracking from the
highest value in M
p What is the best local
alignment in this example?
0
2
0
0
1
0
T
1
0
1
1
0
0
G
0
0
0
0
0
0
C
0
1
0
0
1
0
T
0
0
0
0
0
0
A
0
0
0
0
0
0
-
G
T
G
G
T
-
208
Local alignment: example
0
G
8
0
G
7
0
A
6
0
A
5
0
T
4
0
C
3
0
C
2
0
A
1
0
0
0
0
0
0
0
0
0
0
0
-
0
A
C
T
A
A
C
T
C
G
G
-
10
9
8
7
6
5
4
3
2
1
0
Mi,j = max {
Mi-1,j-1 + s(ai,
bi),
Mi-1,j ,
Mi,j-1 ,
0
}
0
Scoring (for example)
Match: +2
Mismatch: -1
Indel: -2
209
Local alignment: example
0
G
8
0
G
7
0
A
6
0
A
5
0
T
4
0
C
3
0
C
2
0
A
1
0
0
0
0
0
0
0
0
0
0
0
-
0
A
C
T
A
A
C
T
C
G
G
-
10
9
8
7
6
5
4
3
2
1
0
Mi,j = max {
Mi-1,j-1 + s(ai,
bi),
Mi-1,j ,
Mi,j-1 ,
0
}
0 0 0 0 0 2
Scoring (for example)
Match: +2
Mismatch: -1
Indel: -2
210
C T – A A
C T C A A
Local alignment: example
Scoring (for example)
Match: +2
Mismatch: -1
Indel: -2
Optimal local
alignment:
2
4
3
2
1
0
0
2
4
2
0
G
8
1
3
5
4
3
0
0
0
2
2
0
G
7
3
2
4
6
5
1
0
0
0
0
0
A
6
3
1
1
3
4
3
2
0
0
0
0
A
5
2
1
2
0
1
2
4
0
0
0
0
T
4
1
3
0
0
1
2
1
2
0
0
0
C
3
0
2
1
1
0
2
0
2
0
0
0
C
2
2
0
0
2
2
0
0
0
0
0
0
A
1
0
0
0
0
0
0
0
0
0
0
0
-
0
A
C
T
A
A
C
T
C
G
G
-
10
9
8
7
6
5
4
3
2
1
0
211
Multiple optimal alignments
Non-optimal, good-scoring alignments
2
4
3
2
1
0
0
2
4
2
0
G
8
1
3
5
4
3
0
0
0
2
2
0
G
7
3
2
4
6
5
1
0
0
0
0
0
A
6
3
1
1
3
4
3
2
0
0
0
0
A
5
2
1
2
0
1
2
4
0
0
0
0
T
4
1
3
0
0
1
2
1
2
0
0
0
C
3
0
2
1
1
0
2
0
2
0
0
0
C
2
2
0
0
2
2
0
0
0
0
0
0
A
1
0
0
0
0
0
0
0
0
0
0
0
-
0
A
C
T
A
A
C
T
C
G
G
-
10
9
8
7
6
5
4
3
2
1
0
How can you find
1. Optimal
alignments if
more than one
exist?
2. Non-optimal,
good-scoring
alignments?
212
Overlap alignment
p Overlap matrix used by Overlap-Layout-
Consensus algorithm can be computed with
dynamic programming
p Initialization: Oi,0 = O0,j = 0 for all i, j
p Recursion:
Oi,j = max {
Oi-1,j-1 + s(ai, bi),
Oi-1,j – ,
Oi,j-1 – ,
}
Best overlap: maximum value from rightmost
column and bottom row
213
Non-uniform mismatch penalties
p We used uniform penalty for mismatches:
s(’A’, ’C’) = s(’A’, ’G’) = … = s(’G’, ’T’) = µ
p Transition mutations (A->G, G->A, C->T, T->C) are
approximately twice as frequent than transversions (A->T,
T->A, A->C, G->T)
n use non-uniform mismatch
penalties collected into a
substitution matrix
1
-1
-0.5
-1
T
-1
1
-1
-0.5
G
-0.5
-1
1
-1
C
-1
-0.5
-1
1
A
T
G
C
A
214
Gaps in alignment
p Gap is a succession of indels in alignment
p Previous model scored a length k gap as
w(k) = -k
p Replication processes may produce longer
stretches of insertions or deletions
n In coding regions, insertions or deletions of
codons may preserve functionality
C T – - - A A
C T C G C A A
215
Gap open and extension penalties (2)
p We can design a score that allows the
penalty opening gap to be larger than
extending the gap:
w(k) = - – (k – 1)
p Gap open cost , Gap extension cost
p Alignment algorithms can be extended to
use w(k) (not discussed on this course)
216
Amino acid sequences
p We have discussed mainly DNA sequences
p Amino acid sequences can be aligned as
well
p However, the design of the substitution
matrix is more involved because of the
larger alphabet
p More on the topic in the course Biological
sequence analysis
217
Demonstration of the EBI web site
p European Bioinformatics Institute (EBI)
offers many biological databases and
bioinformatics tools at
http://www.ebi.ac.uk/
n Sequence alignment: Tools -> Sequence
Analysis -> Align
218
Sequence Alignment (chapter 6)
p The biological problem
p Global alignment
p Local alignment
p Multiple alignment
219
Multiple alignment
p Consider a set of n sequences
on the right
n Orthologous sequences from
different organisms
n Paralogs from multiple
duplications
p How can we study
relationships between these
sequences?
aggcgagctgcgagtgcta
cgttagattgacgctgac
ttccggctgcgac
gacacggcgaacgga
agtgtgcccgacgagcgaggac
gcgggctgtgagcgcta
aagcggcctgtgtgcccta
atgctgctgccagtgta
agtcgagccccgagtgc
agtccgagtcc
actcggtgc
220
Optimal alignment of three
sequences
p Alignment of A = a1a2…ai and B = b1b2…bj can
end either in (-, bj), (ai, bj) or (ai, -)
p 22 – 1 = 3 alternatives
p Alignment of A, B and C = c1c2…ck can end in 23 –
1 ways: (ai, -, -), (-, bj, -), (-, -, ck), (-, bj, ck),
(ai, -, ck), (ai, bj, -) or (ai, bj, ck)
p Solve the recursion using three-dimensional
dynamic programming matrix: O(n3) time and
space
p Generalizes to n sequences but impractical with
even a moderate number of sequences
221
Multiple alignment in practice
p In practice, real-world multiple alignment
problems are usually solved with heuristics
p Progressive multiple alignment
n Choose two sequences and align them
n Choose third sequence w.r.t. two previous sequences
and align the third against them
n Repeat until all sequences have been aligned
n Different options how to choose sequences and score
alignments
n Note the similarity to Overlap-Layout-Consensus
222
Multiple alignment in practice
p Profile-based progressive multiple
alignment: CLUSTALW
n Construct a distance matrix of all pairs of
sequences using dynamic programming
n Progressively align pairs in order of decreasing
similarity
n CLUSTALW uses various heuristics to
contribute to accuracy
223
Additional material
p R. Durbin, S. Eddy, A. Krogh, G.
Mitchison: Biological sequence analysis
p N. C. Jones, P. A. Pevzner: An introduction
to bioinformatics algorithms
p Course Biological sequence analysis in
period II, 2008
224
Rapid alignment methods: FASTA and
BLAST
p The biological problem
p Search strategies
p FASTA
p BLAST
225
The biological problem
p Global and local
alignment algoritms are
slow in practice
p Consider the scenario of
aligning a query
sequence against a large
database of sequences
n New sequence with
unknown function n NCBI GenBank size in January
2007 was 65 369 091 950
bases (61 132 599 sequences)
n Feb 2008: 85 759 586 764
bases (82 853 685 sequences)
226
Problem with large amount of sequences
p Exponential growth in both number and
total length of sequences
p Possible solution: Compare against model
organisms only
p With large amount of sequences, chances
are that matches occur by random
n Need for statistical analysis
227
Rapid alignment methods: FASTA and
BLAST
p The biological problem
p Search strategies
p FASTA
p BLAST
228
FASTA
p FASTA is a multistep algorithm for sequence
alignment (Wilbur and Lipman, 1983)
p The sequence file format used by the FASTA
software is widely used by other sequence
analysis software
p Main idea:
n Choose regions of the two sequences (query and
database) that look promising (have some degree of
similarity)
n Compute local alignment using dynamic programming in
these regions
229
FASTA outline
p FASTA algorithm has five steps:
n 1. Identify common k-words between I and J
n 2. Score diagonals with k-word matches,
identify 10 best diagonals
n 3. Rescore initial regions with a substitution
score matrix
n 4. Join initial regions using gaps, penalise for
gaps
n 5. Perform dynamic programming to find final
alignments
230
Search strategies
p How to speed up the computation?
n Find ways to limit the number of pairwise
comparisons
p Compare the sequences at word level to
find out common words
n Word means here a k-tuple (or a k-word), a
substring of length k
231
Analyzing the word content
p Example query string I: TGATGATGAAGACATCAG
p For k = 8, the set of k-words (substring of length
k) of I is
TGATGATG
GATGATGA
ATGATGAA
TGATGAAG
…
GACATCAG
232
Analyzing the word content
p There are n-k+1 k-words in a string of length n
p If at least one word of I is not found from
another string J, we know that I differs from J
p Need to consider statistical significance: I and J
might share words by chance only
p Let n=|I| and m=|J|
233
Word lists and comparison by content
p The k-words of I can be arranged into a table of
word occurences Lw(I)
p Consider the k-words when k=2 and
I=GCATCGGC:
GC, CA, AT, TC, CG, GG, GC
AT: 3
CA: 2
CG: 5
GC: 1, 7
GG: 6
TC: 4
Start indecies of k-word GC in I
Building Lw(I) takes O(n) time
234
Common k-words
p Number of common k-words in I and J can
be computed using Lw(I) and Lw(J)
p For each word w in I, there are |Lw(J)|
occurences in J
p Therefore I and J have
common words
p This can be computed in O(n + m + 4k)
time
n O(n + m) time to build the lists
n O(4k) time to calculate the sum (in DNA
strings)
235
Common k-words
p I = GCATCGGC
p J = CCATCGCCATCG
Lw(J)
AT: 3, 9
CA: 2, 8
CC: 1, 7
CG: 5, 11
GC: 6
TC: 4, 10
Lw(I)
AT: 3
CA: 2
CG: 5
GC: 1, 7
GG: 6
TC: 4
Common words
2
2
0
2
2
0
2
10 in total
236
Properties of the common word list
p Exact matches can be found using binary search
(e.g., where TCGT occurs in I?)
n O(log 4k) time
p For large k, the table size is too large to compute
the common word count in the previous fashion
p Instead, an approach based on merge sort can be
utilised (details skipped)
p The common k-word technique can be combined
with the local alignment algorithm to yield a rapid
alignment approach
237
FASTA outline
p FASTA algorithm has five steps:
n 1. Identify common k-words between I and J
n 2. Score diagonals with k-word matches,
identify 10 best diagonals
n 3. Rescore initial regions with a substitution
score matrix
n 4. Join initial regions using gaps, penalise for
gaps
n 5. Perform dynamic programming to find final
alignments
238
Dot matrix comparisons
p Word matches in two sequences I and J can be
represented as a dot matrix
p Dot matrix element (i, j) has ”a dot”, if the word
starting at position i in I is identical to the word
starting at position j in J
p The dot matrix can be plotted for various k
i
j
I = … ATCGGATCA …
J = … TGGTGTCGC …
i
j
239
k=1 k=4
k=8 k=16
Dot matrix (k=1,4,8,16)
for two DNA sequences
X85973.1 (1875 bp)
Y11931.1 (2013 bp)
240
k=1 k=4
k=8 k=16
Dot matrix
(k=1,4,8,16) for two
protein sequences
CAB51201.1 (531 aa)
CAA72681.1 (588 aa)
Shading indicates
now the match score
according to a
score matrix
(Blosum62 here)
241
Computing diagonal sums
p We would like to find high scoring diagonals of the dot
matrix
p Lets index diagonals by the offset, l = i - j
C C A T C G C C A T C G
G *
C * *
A * *
T * *
C * *
G
G *
C
k=2
I
J
Diagonal l = i – j = -6
242
Computing diagonal sums
p As an example, lets compute diagonal sums for
I = GCATCGGC, J = CCATCGCCATCG, k = 2
p 1. Construct k-word list Lw(J)
p 2. Diagonal sums Sl are computed into a table,
indexed with the offset and initialised to zero
l -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6
Sl 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
243
Computing diagonal sums
p 3. Go through k-words of I, look for matches in
Lw(J) and update diagonal sums
C C A T C G C C A T C G
G *
C * *
A * *
T * *
C * *
G
G *
C
I
J
For the first 2-word in I,
GC, LGC(J) = {6}.
We can then update
the sum of diagonal
l = i – j = 1 – 6 = -5 to
S-5 := S-5 + 1 = 0 + 1 = 1
244
Computing diagonal sums
p 3. Go through k-words of I, look for matches in
Lw(J) and update diagonal sums
C C A T C G C C A T C G
G *
C * *
A * *
T * *
C * *
G
G *
C
I
J
Next 2-word in I is CA,
for which LCA(J) = {2, 8}.
Two diagonal sums are
updated:
l = i – j = 2 – 2 = 0
S0 := S0 + 1 = 0 + 1 = 1
I = i – j = 2 – 8 = -6
S-6 := S-6 + 1 = 0 + 1 = 1
245
Computing diagonal sums
p 3. Go through k-words of I, look for matches in
Lw(J) and update diagonal sums
C C A T C G C C A T C G
G *
C * *
A * *
T * *
C * *
G
G *
C
I
J
Next 2-word in I is AT,
for which LAT(J) = {3, 9}.
Two diagonal sums are
updated:
l = i – j = 3 – 3 = 0
S0 := S0 + 1 = 1 + 1 = 2
I = i – j = 3 – 9 = -6
S-6 := S-6 + 1 = 1 + 1 = 2
246
Computing diagonal sums
After going through the k-words of I, the result is:
l -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6
Sl 0 0 0 0 4 1 0 0 0 0 4 1 0 0 0 0 0
C C A T C G C C A T C G
G *
C * *
A * *
T * *
C * *
G
G *
C
I
J
247
Algorithm for computing diagonal sum of scores
Sl := 0 for all 1 – m l n – 1
Compute Lw(J) for all words w
for i := 1 to n – k – 1 do
w := IiIi+1…Ii+k-1
for j Lw(J) do
l := i – j
Sl := Sl + 1
end
end
Match score is here 1
248
FASTA outline
p FASTA algorithm has five steps:
n 1. Identify common k-words between I and J
n 2. Score diagonals with k-word matches,
identify 10 best diagonals
n 3. Rescore initial regions with a substitution
score matrix
n 4. Join initial regions using gaps, penalise for
gaps
n 5. Perform dynamic programming to find final
alignments
249
Rescoring initial regions
p Each high-scoring diagonal chosen in the
previous step is rescored according to a score
matrix
p This is done to find subregions with identities
shorter than k
p Non-matching ends of the diagonal are trimmed
I: C C A T C G C C A T C G
J: C C A A C G C A A T C A
I’: C C A T C G C C A T C G
J’: A C A T C A A A T A A A
75% identity, no 4-word identities
33% identity, one 4-word identity
250
Joining diagonals
p Two offset diagonals can be joined with a gap, if
the resulting alignment has a higher score
p Separate gap open and extension are used
p Find the best-scoring combination of diagonals
High-scoring
diagonals
Two diagonals
joined by a gap
251
FASTA outline
p FASTA algorithm has five steps:
n 1. Identify common k-words between I and J
n 2. Score diagonals with k-word matches,
identify 10 best diagonals
n 3. Rescore initial regions with a substitution
score matrix
n 4. Join initial regions using gaps, penalise for
gaps
n 5. Perform dynamic programming to find final
alignments
252
Local alignment in the highest-scoring
region
p Last step of FASTA: perform local
alignment using dynamic
programming around the highest-
scoring
p Region to be aligned covers –w and
+w offset diagonal to the highest-
scoring diagonals
p With long sequences, this region is
typically very small compared to the
whole n x m matrix w
w
Dynamic programming matrix
M filled only for the green region
253
Properties of FASTA
p Fast compared to local alignment using dynamic
programming only
n Only a narrow region of the full matrix is aligned
p Increasing parameter k decreases the number of
hits:
n Increases specificity
n Decreases sensitivity
n Decreases running time
p FASTA can be very specific when identifying long
regions of low similarity
n Specific method does not find many incorrect results
n Sensitive method finds many of the correct results
254
Properties of FASTA
p FASTA looks for initial exact matches to
query sequence
n Two proteins can have very different amino
acid sequences and still be biologically similar
n This may lead into a lack of sensitivity with
diverged sequences
255
Demonstration of FASTA at EBI
p http://www.ebi.ac.uk/fasta/
p Note that parameter ktup in the software
corresponds to parameter k in lectures
256
Note on sequences and alignment
matrices in exercises
p Example solutions to alignment problems
will have sequences arranged like this:
n Perform global alignment of the sequences
p s = AGCTGCGTACT
p t = ATGAGCGTTA
AGCTGCGTACT
ATGAGCGTTA
So if you want to be able to
compare your solution easily
against the example, use this
convention.
257
Rapid alignment methods: FASTA and
BLAST
p The biological problem
p Search strategies
p FASTA
p BLAST
258
BLAST: Basic Local Alignment Search
Tool
p BLAST (Altschul et al., 1990) and its variants are
some of the most common sequence search tools
in use
p Roughly, the basic BLAST has three parts:
n 1. Find segment pairs between the query sequence and
a database sequence above score threshold (”seed hits”)
n 2. Extend seed hits into locally maximal segment pairs
n 3. Calculate p-values and a rank ordering of the local
alignments
p Gapped BLAST introduced in 1997 allows for gaps
in alignments
259
Finding seed hits
p First, we generate a set of neighborhood
sequences for given k, match score matrix and
threshold T
p Neighborhood sequences of a k-word w include
all strings of length k that, when aligned against
w, have the alignment score at least T
p For instance, let I = GCATCGGC, J =
CCATCGCCATCG and k = 5, match score be 1,
mismatch score be 0 and T = 4
260
Finding seed hits
p I = GCATCGGC, J = CCATCGCCATCG, k = 5,
match score 1, mismatch score 0, T = 4
p This allows for one mismatch in each k-word
p The neighborhood of the first k-word of I, GCATC,
is GCATC and the 15 sequences
A A C A A
CCATC,G GATC,GC GTC,GCA CC,GCAT G
T T T G T
261
Finding seed hits
p I = GCATCGGC has 4 k-words and thus 4x16 =
64 5-word patterns to locate in J
n Occurences of patterns in J are called seed hits
p Patterns can be found using exact search in time
proportional to the sum of pattern lengths +
length of J + number of matches (Aho-Corasick
algorithm)
n Methods for pattern matching are developed on course
58093 String processing algorithms
p Compare this approach to FASTA
262
Extending seed hits: original BLAST
p Initial seed hits are extended into
locally maximal segment pairs
or High-scoring Segment Pairs
(HSP)
p Extensions do not add gaps to the
alignment
p Sequence is extended until the
alignment score drops below the
maximum attained score minus a
threshold parameter value
p All statistically significant HSPs
reported
AACCGTTCATTA
| || || ||
TAGCGATCTTTT
Initial seed hit
Extension
Altschul, S.F., Gish, W., Miller, W., Myers, E. W. and
Lipman, D. J., J. Mol. Biol., 215, 403-410, 1990
263
Extending seed hits: gapped BLAST
p In a later version of BLAST, two
seed hits have to be found on the
same diagonal
n Hits have to be non-overlapping
n If the hits are closer than A
(additional parameter), then they
are joined into a HSP
p Threshold value T is lowered to
achieve comparable sensitivity
p If the resulting HSP achieves a
score at least Sg, a gapped
extension is triggered
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, and
Lipman DJ, Nucleic Acids Res. 1;25(17), 3389-402, 1997
264
Gapped extensions of HSPs
p Local alignment is performed
starting from the HSP
p Dynamic programming matrix
filled in ”forward” and
”backward” directions (see
figure)
p Skip cells where value would
be Xg below the best
alignment score found so far
Region potentially searched
by the alignment algorithm
HSP
Region searched with score
above cutoff parameter
265
Estimating the significance of results
p In general, we have a score S(D, X) = s for a
sequence X found in database D
p BLAST rank-orders the sequences found by p-
values
p The p-value for this hit is P(S(D, Y) s) where Y
is a random sequence
n Measures the amount of ”surprise” of finding sequence X
p A smaller p-value indicates more significant hit
n A p-value of 0.1 means that one-tenth of random
sequences would have as large score as our result
266
Estimating the significance of results
p In BLAST, p-values are computed roughly as
follows
p There are nm places to begin an optimal
alignment in the n x m alignment matrix
p Optimal alignment is preceded by a mismatch
and has t matching (identical) letters
n (Assume match score 1 and mismatch/indel score - )
p Let p = P(two random letters are equal)
p The probability of having a mismatch and then t
matches is (1-p)pt
267
Estimating the significance of results
p We model this event by a Poisson distribution
(why?) with mean = nm(1-p)pt
p P(there is local alignment t or longer)
1 – P(no such event)
= 1 – e- = 1 – exp(-nm(1-p)pt)
p An equation of the same form is used in Blast:
p E-value = P(S(D, Y) s) 1 – exp(-nm t) where
> 0 and 0 < < 1
p Parameters and are estimated from data
268
Scoring amino acid
alignments
p We need a way to compute the
score S(D, X) for aligning the
sequence X against database D
p Scoring DNA alignments was
discussed previously
p Constructing a scoring model for
amino acids is more challenging
n 20 different amino acids vs. 4
bases
p Figure shows the molecular
structures of the 20 amino acids
http://en.wikipedia.org/wiki/List_of_standard_amino_acids
269
Scoring amino acid
alignments
p Substitutions between chemically
similar amino acids are more
frequent than between dissimilar
amino acids
p We can check our scoring model
against this
http://en.wikipedia.org/wiki/List_of_standard_amino_acids
270
Score matrices
p Scores s = S(D, X) are obtained from score
matrices
p Let A = A1a2…an and B = b1b2…bn be sequences
of equal length (no gaps allowed to simplify
things)
p To obtain a score for alignment of A and B, where
ai is aligned against bi, we take the ratio of two
probabilities
n The probability of having A and B where the characters
match (match model M)
n The probability that A and B were chosen randomly
(random model R)
271
Score matrices: random model
p Under the random model, the probability
of having X and Y is
where qxi is the probability of occurence of
amino acid type xi
p Position where an amino acid occurs does
not affect its type
272
Score matrices: match model
p Let pab be the probability of having amino acids
of type a and b aligned against each other given
they have evolved from the same ancestor c
p The probability is
273
Score matrices: log-odds ratio score
p We obtain the score S by taking the ratio
of these two probabilities
and taking a logarithm of the ratio
274
Score matrices: log-odds ratio score
p The score S is obtained by summing over
character pair-specific scores:
p The probabilities qa and pab are extracted
from data
275
Calculating score matrices for amino
acids
p Probabilities qa are in
principle easy to obtain:
n Count relative frequencies of
every amino acid in a sequence
database
276
p To calculate pab we can use a
known pool of aligned sequences
p BLOCKS is a database of highly
conserved regions for proteins
p It lists multiply aligned, ungapped
and conserved protein segments
p Example from BLOCKS shows
genes related to human gene
associated with DNA-repair
defect xeroderma pigmentosum
Calculating score matrices for amino
acids
Block PR00851A
ID XRODRMPGMNTB; BLOCK
AC PR00851A; distance from previous block=(52,131)
DE Xeroderma pigmentosum group B protein signature
BL adapted; width=21; seqs=8; 99.5%=985; strength=1287
XPB_HUMAN|P19447 ( 74) RPLWVAPDGHIFLEAFSPVYK 54
XPB_MOUSE|P49135 ( 74) RPLWVAPDGHIFLEAFSPVYK 54
P91579 ( 80) RPLYLAPDGHIFLESFSPVYK 67
XPB_DROME|Q02870 ( 84) RPLWVAPNGHVFLESFSPVYK 79
RA25_YEAST|Q00578 ( 131) PLWISPSDGRIILESFSPLAE 100
Q38861 ( 52) RPLWACADGRIFLETFSPLYK 71
O13768 ( 90) PLWINPIDGRIILEAFSPLAE 100
O00835 ( 79) RPIWVCPDGHIFLETFSAIYK 86
http://blocks.fhcrc.org
277
BLOSUM matrix
p BLOSUM is a score matrix
for amino acid sequences
derived from BLOCKS data
p First, count pairwise
matches fx,y for every amino
acid type pair (x, y)
p For example, for column 3
and amino acids L and W,
we find 8 pairwise matches:
fL,W = fW,L = 8
RPLWVAPD
RPLWVAPR
RPLWVAPN
PLWISPSD
RPLWACAD
PLWINPID
RPIWVCPD
278
p Probability pab is obtained by
dividing fab with the total
number of pairs (note
difference with course book):
p We get probabilities qa by
RPLWVAPD
RPLWVAPR
RPLWVAPN
PLWISPSD
RPLWACAD
PLWINPID
RPIWVCPD
Creating a BLOSUM matrix
279
Creating a BLOSUM matrix
p The probabilities pab and qa can now be plugged
into
to get a 20 x 20 matrix of scores s(a, b).
p Next slide presents the BLOSUM62 matrix
n Values scaled by factor of 2 and rounded to integers
n Additional step required to take into account expected
evolutionary distance
n Described in Deonier’s book in more detail
280
BLOSUM62
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
281
Using BLOSUM62 matrix
MQLEANADTSV
| | |
LQEQAEAQGEM
= 2 + 5 – 3 – 4 + 4 + 0 + 4 + 0 – 2 + 0 +
1
= 7
282
Demonstration of BLAST at NCBI
p http://www.ncbi.nlm.nih.gov/BLAST/
Introduction to
Bioinformatics
Lecture 4:
Genome rearrangements
284
Why study genome rearrangements?
p Provide insight into evolution of species
p Fun algorithmic problem!
p Structure of this lecture:
n The biological phenomenon
n How to computationally model it?
n How to compute interesting things?
n Studying the phenomenon using existing tools
(continued in exercises)
285
Genome rearrangements as an
algorithmic problem
286
Background
p Genome sequencing enables us to
compare genomes of two or more
different species
n -> Comparative genomics
p Basic observation:
n Closely related species (such as human and
mouse) can be almost identical in terms of
genome contents...
n ...but the order of genomic segments can be
very different between species
287
Synteny blocks and segments
p Synteny – derived from Greek ’on the
same ribbon’ – means genomic segments
located on the same chromosome
n Genes, markers (any sequence)
p Synteny block (or syntenic block)
n A set of genes or markers that co-occur
together in two species
p Synteny segment (or syntenic segment)
n Syntenic block where the order of genes or
markers is preserved
288
Synteny blocks and segments
Chromosome i, species B
Chromosome j, species C
Synteny segment
Synteny block
Homologs
of the same
gene
289
Observations from sequencing
1. Large chromosome inversions and
translocations (we’ll get to these shortly)
are common
n ...Even between closely related species
2. Chromosome inversions are usually
symmetric around the origin of DNA
replication
3. Inversions are less common within
species...
290
What causes rearrangements?
p RecA, Recombinase A,
is a protein used to
repair chromosomal
damage
p It uses a duplicate
copy of the damaged
sequence as template
p Template is usually a
homologous sequence
on a sister
chromosome
Diarmaid Hughes: Evaluating genome dynamics: the constraints on
rearrangements within bacterial genomes, Genome Biology 2000, 1
291
Chromosomes: recap
p Linear chromosomes
n Eukaryotes (mostly)
p Circular chromosomes
n Prokaryotes (mostly)
n Mitochondria
chromatid
centromere
gene 1
gene 3
gene 2
Also double-stranded: genes can be
found on both strands (orientations)
292
What effects does RecA have on
genome?
p Repeated sequences cause RecA to fail to
choose correct recombination start
position
p This leads to
n Tandem duplications
n Translocations
n Inversions
Repeat 1 Repeat 2
RecA
?
Damaged sequence
293
Diarmaid Hughes: Evaluating genome dynamics: the constraints on
rearrangements within bacterial genomes, Genome Biology 2000, 1
X, Y, Z and W are repeats of
the same sequence.
a, b, c and d are sequences on genome
bounded by repeats.
In a tandem duplication example,
RecA recombines a sequence that
starts from Y instead of Z after Z.
This leads to duplication of segment Y-Z.
294
Diarmaid Hughes: Evaluating genome dynamics: the constraints on
rearrangements within bacterial genomes, Genome Biology 2000, 1
Recombination of two
repeat sequences in the
same chromosome can
lead to a fragment translocation
Here sequence d is translocated
295
Diarmaid Hughes: Evaluating genome dynamics: the constraints on
rearrangements within bacterial genomes, Genome Biology 2000, 1
Inversion happens when two
sequences of opposite orientations
are recombined.
296
Example: human vs mouse genome
p Human and mouse genomes share
thousands of homologous genes, but they
are
n Arranged in different order
n Located in different chromosomes
p Examples
n Human chromosome 6 contains elements from
six different mouse chromosomes
n Analysis of X chromosome indicates that
rearrangements have happened primarily
within chromosome
297
Jones & Pevzner, 2004
298
299
Representing genome rearrangments
p When comparing two genomes, we can
find homologous sequences in both using
BLAST, for example
p This gives us a map between sequences in
both genomes
300
Representing genome rearrangments
p We assign numbers 1,...,n to
the found homologous
sequences
p By convention, we number the
sequences in the first genome
by their order of appearance
in chromosomes
p If the homolog of i is in
reverse orientation, it receives
number –i (signed data)
p For example, consider human
vs mouse gene numbering on
the right
(il10)
9
(at3)
-6
(pdc)
8
(lamc1)
-7
(lamc1)
7
(pdc)
-8
(at3)
6
(il10)
-9
(pklr)
5
(pax3)
15
(gba)
4
(fn1)
14
(ngfb)
3
(cd28)
13
(nras)
2
(inpp1)
12
(gnat2)
1
Mouse
Human
List order corresponds to
physical order on chromosomes!
301
Permutations
p The basic data structure in the study of
genome rearrangements is permutation
p A permutation of a sequence of n numbers
is a reordering of the sequence
p For example, 4 1 3 2 5 is a permutation of
1 2 3 4 5
302
Genome rearrangement problem
p Given two genomes (set of markers), how
many
n duplications,
n inversions and
n translocations
do we need to do to transform the first
genome to the second?
Minimum number of operations?
What operations? Which order?
303
Genome rearrangement problem
6 1 2 3 4 5 1 2 3 4 5 6
#duplications?
#inversions?
#translocations?
304
Genome rearrangement problem
6 1 2 3 4 5 1 2 3 4 5 6
1 2 3 4 5 6
Keep in mind, that the two genomes
have been evolved from a common
ancestor genome!
Permutation
of 1,...,6
305
Genome rearrangements using reversals
(=inversions) only
p Lets consider a simpler problem where we just
study reversals with unsigned data
p A reversal p(i, j) reverses the order of the
segment i i+1 ... j-1 j (indexing starts from 1)
p For example, given permutation
6 1 2 3 4 5 and reversal p(3, 5) we get
permutation 6 1 4 3 2 5
...note that we do not care about exact positions on the genome
p(3, 5)
306
Reversal distance problem
p Find the shortest series of reversals that, given
a permutation , transforms it to the identity
permutation (1, 2, ..., n)
p This quantity is denoted by d( )
p Reversal distance for a pair of chromosomes:
n Find synteny blocks in both
n Number blocks in the first chromosome to identity
n Set to correspond matching of second chromosome’s
blocks against the first
n Find reversal distance
307
Reversal distance problem: discussion
p If we can find the minimal series of
reversals for some pair of genomes
n Is that what happened during evolution?
n If not, is it the correct number of reversals?
p In any case, reversal distance gives us a
measure of evolutionary distance between
the two genomes and species
308
Solving the problem by sorting
p Our first approach to solve the reversal
distance problem:
n Examine each position i of the permutation
n At each position, if i i, do a reversal such
that i = i
p This is a greedy approach: we try to
choose the best option at each step
309
Simple reversal sort: example
6 1 2 3 4 5 -> 1 6 2 3 4 5 -> 1 2 6 3 4 5 -> 1 2 3 4 6 5
-> 1 2 3 4 5 6
Reversal series: p(1,2), p(2,3), p(3,4), p(5,6)
Is d(6 1 2 3 4 5) then 4?
6 1 2 3 4 5 -> 5 4 3 2 1 6 -> 1 2 3 4 5 6
D(6 1 2 3 4 5) = 2
310
Pancake flipping problem
p No pancake made by
the chef is of the
same size
p Pancakes need to be
rearranged before
delivery
p Flipping operation:
take some from the
top and flip them over
p This corresponds to
always reversing the
sequence prefix
1 2 3 6 4 5 -> 6 3 2 1 4 5 ->
5 4 1 2 3 6 -> 3 2 1 4 5 6 ->
1 2 3 4 5 6
311
How good is simple reversal sort?
p Not so good actually
p It has to do at most n-1 reversals with
permutation of length n
p The algorithm can return a distance that is
as large as (n – 1)/2 times the correct
result d( )
n For example, if n = 1001, result can be as bad
as 500 x d( )
312
Estimating reversal distance by cycle
decomposition
p We can estimate d( ) by cycle
decomposition
p Lets represent permutation = 1 2 4 5 3
with the following graph
where edges correspond to adjacencies
(identity, permutation F)
1 2 4 5 3
0 6
313
Estimating reversal distance by cycle
decomposition
p Cycle decomposition: a set of cycles that
n have edges with alternating colors
n do not share edges with other cycles (=cycles
are edge disjoint)
1 2 4 5 3
0 6
1 2 4 5
314
Cycle decompositions
p Let c( ) the maximum number of alternating,
edge-disjoint cycles in the graph representation
of
p The following formula allows estimation of d( )
n d( ) n + 1 – c( ), where n is the permutation length
1 2 4 5 3
0 6
1 2 4 5
d( ) 5 + 1 – 4 = 2
Claim in Deonier: equality holds for ”most of the usual and
interesting biological systems.
315
Cycle decompositions
p Cycle decomposition is NP-complete
n We cannot solve the general problem exactly
for large instances
p However, with signed data the problem
becomes easy
n Before going into signed data, lets discuss
another algorithm for the general case
316
Computing reversals with breakpoints
p Lets investigate a better way to compute
reversal distance
p First, some concepts related to
permutation 1 2,,, n-1 n
n Breakpoint: two elements i and i+1 are a
breakpoint, if they are not consecutive
numbers
n Adjacency: if i and i+1 are consecutive, they
are called adjacency
317
Breakpoints and adjacencies
2 1 3 4 5 8 7 6
This permutation contains
four breakpoints begin-2, 13, 58, 6-end and
five adjacencies 21, 34, 45, 87, 76
Breakpoints
318
Breakpoints
p Each breakpoint in permutation needs to be
removed to get to the identity permutation (=our
target)
n Identity permutation does not contain any breakpoints
p First and last positions special cases
p Note that each reversal can remove at most two
breakpoints
p Denote the number of breakpoints by b( )
2 1 3 4 5 8 7 6 b( ) = 4
319
Breakpoint reversal sort
p Idea: try to remove as many breakpoints
as possible (max 2) in every step
1. While b( ) > 0
2. Choose reversal p that removes most breakpoints
3. Perform reversal p to
4. Output
5. return
320
Breakpoint removal: example
8 2 7 6 5 1 4 3 b( ) = 6
2 8 7 6 5 1 4 3 b( ) = 5
2 3 4 1 5 6 7 8 b( ) = 3
4 3 2 1 5 6 7 8 b( ) = 2
1 2 3 4 5 6 7 8 b( ) = 0
321
Breakpoint removal
p The previous algorithm needs refinement
to be correct
p Consider the following permutation:
1 5 6 7 2 3 4 8
p There is no reversal that decreases the
number of breakpoints!
p See Jones & Pevzner for detailed
description on this
322
Breakpoint removal
p Reversal can only decrease breakpoint
count if permutation contains decreasing
strips
1 5 6 7 2 3 4 8
1 5 6 7 4 3 2 8
1 2 3 4 7 6 5 8
Increasing strip
Decreasing strip
Strip: maximal segment without breakpoints
323
Improved breakpoint reversal sort
1. While b( ) > 0
2. If has a decreasing strip
3. Do reversal p that removes most BPs
4. Else
5. Reverse an increasing strip
6. Output
7. return
324
Is Improved BP removal enough?
p The algorithm works pretty well:
n It produces a result that is at most four times
worse than the optimal result
n ...is this good?
p We considered only reversals
p What about translocations & duplications?
325
Translocations via reversals
1 2 3 4 5 6 7 8
1 5 6 7 8 2 3 4
1 4 3 2 8 7 6 5
1 2 3 4 8 7 6 5
1 2 3 4 5 6 7 8
Translocation of 2,3,4
p(2,8)
p(2,4)
p(5,8)
326
Genome rearrangements with reversals
p With unsigned data, the problem of finding
minimum reversal distances is NP-
complete
n Why is this so if sorting is easy?
p An algorithm has been developed that
achieves 1.375-approximation
p However, reversal distance in signed data
can be computed quickly!
n It takes linear time w.r.t. the length of
permutation (Bader, Moret, Yan, 2001)
327
Cycle decomposition with signed data
p Consider the following two permutations
that include orientation of markers
n J: +1 +5 -2 +3 +4
n K: +1 -3 +2 +4 -5
p We modify this representation a bit to
include both endpoints of each marker:
n J’: 0 1a 1b 5a 5b 2b 2a 3a 3b 4a 4b 6
n K’: 0 1a 1b 3b 3a 2a 2b 4a 4b 5b 5a 6
328
Graph representation of J’ and K’
p Drawn online in lecture!
329
Multiple chromosomes
p In unichromosomal genomes, inversion
(reversal) is the most common operation
p In multichromosomal genomes,
inversions, translocations, fissions and
fusions are most common
330
Multiple chromosomes
p Lets represent multichromosomal genome
as a set of permutations, with $ denoting
the boundary of a chromosome:
5 9 $
1 3 2 8 $
7 6 4 $
This notation is frequently used in software
used to analyse genome rearrangements.
Chr 1
Chr 2
Chr 3
331
Multiple chromosomes
p Note that when dealing with multiple
chromosomes, you need to specify
numbering for elements on both genomes
332
Reversals & translocations
p Reversal p( , i, j)
p Translocation p( , , i, j)
i
j
Translocation
333
Fusions & fissions
p Fusion: merging of two chromosomes
p Fission: chromosome is split into two
chromosomes
p Both events can be represented with a
translocation
334
Fusion
p Fusion by translocation p( , , n+1, 1)
i = n + 1
j = 1
Fusion
335
Fission
p Fission by translocation p( , , i, 1)
i
Empty chromosome
Fission
336
Algorithms for general genomic distance
problem
p Hannenhalli, Pevzner: Transforming Men into
Mice (polynomial algorithm for genomic distance
problem), 36th Annual IEEE Symposium on
Foundations of Computer Science, 1995
337
Human & mouse revisited
p Human and mouse are separated by about
75-83 million years of evolutionary history
p Only a few hundred rearrangements have
happened after speciation from the
common ancestory
p Pevzner & Tesler identified in 2003 for 281
synteny blocks a rearrangement from
mouse to human with
n 149 inversions
n 93 translocations
n 9 fissions
338
Discussion
p Genome rearrangement events are very
rare compared to, e.g., point mutations
n We can study rearrangement events further
back in the evolutionary history
p Rearrangements are easier to detect in
comparison to many other genomic events
p We cannot detect homologs 100%
correctly so the input permutation can
contain errors
339
Discussion
p Genome rearrangement is to some degree
constrained by the number and size of
repeats in a genome
n Notice how the importance of genomic repeats
pops up once again
p Sequencing gives us (usually) signed data
so we can utilize faster algorithms
p What if there are more than one optimal
solution?
340
Two different genome rearrangement scenarios
giving the same result.
341
GRIMM demonstration
Glenn Tesler, GRIMM: genome rearrangements web server.
Bioinformatics, 2002,
342
GRIMM file format
# useful comment about first genome
# another useful comment about it
>name of first genome
1 -4 2 $ # chromosome 1
-3 5 6 # chromosome 2
>name of second genome
5 -3 $
6 $
2 -4 1 $
http://grimm.ucsd.edu/GRIMM/grimm_instr.html
GRIMM supports analysis of
one, two or more genomes
Introduction to
Bioinformatics
Phylogenetic trees
344
Inferring the Past: Phylogenetic
Trees
p The biological problem
p Parsimony and distance methods
p Models for mutations and estimation of
distances
p Maximum likelihood methods
345
Phylogeny
p We want to study ancestor-
descendant relationships, or
phylogeny, among groups of
organisms
p Groups are called taxa (singular:
taxon)
p Organisms are usually called
operational taxonomic units or
OTUs in the context of
phylogeny
346
Phylogenetic trees
p Leaves (external nodes)
~ species, observed
(OTUs)
p Internal nodes ~
ancestral
species/divergence
events, not observed
p Unrooted tree does not
specify ancestor-
descendant relationships
beyond the observation
”leaves are not
ancestors”
1
2
3
4
5
6
7
8
Unrooted tree with 5
leaves and 3 internal
nodes.
Is node 7 ancestor of node
6?
347
Phylogenetic trees
p Rooting a tree specifies all
ancestor-descendant
relationships in the tree
p Root is the ancestor to the
other species
p There are n-1 ways to root
a tree with n nodes
1
2
3
4
5
6
7
8
R1 R2
2 3 4 5
1
6
7
8
R1
2 3 4
5
1
6
7
8
R2
r
o
o
t
(
R
1
)
r
o
o
t
(
R
2
)
348
Questions
p Can we enumerate all possible
phylogenetic trees for n species (or
sequences?)
p How to score a phylogenetic tree with
respect to data?
p How to find the best phylogenetic tree
given data?
349
Finding the best phylogenetic tree:
naive method
p How can we find the phylogenetic tree
that best represents the data?
p Naive method: enumerate all possible
trees
p How many different trees are there of n
species?
p Denote this number by bn
350
Enumerating unordered trees
p Start with the only
unordered tree with 3
leaves (b3 = 1)
p Consider all ways to add
a leaf node to this tree
p Fourth node can be added
to 3 different branches
(edges), creating 1 new
internal branch
p Total number of branches
is n external and n – 3
internal branches
p Unrooted tree with n
leaves has 2n – 3 branches
1 2
3
1 2
3
4
1 2
3
4
1 2
3
4
351
Enumerating unordered trees
p Thus, we get the number of unrooted trees
bn = (2(n – 1) – 3)bn-1 = (2n – 5)bn-1
= (2n – 5) * (2n – 7) * …* 3 * 1
= (2n – 5)! / ((n-3)!2n-3), n > 2
p Number of rooted trees b’n is
b’n = (2n – 3)bn = (2n – 3)! / ((n-2)!2n-2),
n > 2
that is, the number of unrooted trees times the
number of branches in the trees
352
Number of possible rooted and
unrooted trees
8.20E+021
2.22E+020
20
4.95E+038
8.69E+036
30
34459425
2027025
10
2027025
135135
9
135135
10395
8
10395
954
7
945
105
6
105
15
5
15
3
4
3
1
3
b’n
Bn
n
353
Too many trees?
p We can’t construct and evaluate every
phylogenetic tree even for a smallish
number of species
p Better alternative is to
n Devise a way to evaluate an individual tree
against the data
n Guide the search using the evaluation criteria
to reduce the search space
354
Inferring the Past: Phylogenetic
Trees (chapter 12)
p The biological problem
p Parsimony and distance methods
p Models for mutations and estimation of
distances
p Maximum likelihood methods
355
Parsimony method
p The parsimony method finds the tree that
explains the observed sequences with a
minimal number of substitutions
p Method has two steps
n Compute smallest number of substitutions for
a given tree with a parsimony algorithm
n Search for the tree with the minimal number of
substitutions
356
Parsimony: an example
p Consider the following short sequences
1 ACTTT
2 ACATT
3 AACGT
4 AATGT
5 AATTT
p There are 105 possible rooted trees for 5
sequences
p Example: which of the following trees
explains the sequences with least number
of substitutions?
357
3
AACGT
4
AATGT
5
AATTT
2
ACATT
1
ACTTT
6 AATGT
7 AATTT
8 ACTTT
9 AATTT
T->C
T->G
T->A
A->C
This tree explains the sequences
with 4 substitutions
358
3
AACGT
4
AATGT
5
AATTT
2
ACATT
1
ACTTT
6 AATGT
7 AATTT
8 ACTTT
9 AATTT
T->C
T->G
T->A
A->C
3
AACGT
4
AATGT
5
AATTT
2
ACATT
1
ACTTT
6 ACCTT
C->T
7 AACGT
8 AATGT
9 AATTT
G->T
T->C
T->G
A->C
C->A
6
substitutions…
First tree is
more
parsimonious!
4
substitutions…
359
Computing parsimony
p Parsimony treats each site (position in a
sequence) independently
p Total parsimony cost is the sum of parsimony
costs (=required substitutions) of each site
p We can compute the minimal parsimony cost for
a given tree by
n First finding out possible assignments at each node,
starting from leaves and proceeding towards the root
n Then, starting from the root, assign a letter at each
node, proceeding towards leaves
360
Labelling tree nodes
p An unrooted tree with n leaves contains 2n-1
nodes altogether
p Assign the following labels to nodes in a rooted
tree
n leaf nodes: 1, 2, …, n
n internal nodes: n+1, n+2, …, 2n-1
n root node: 2n-1
p The label of a child node is always
smaller than the label of the
parent node
2 3 4 5
1
6
8
7
9
361
Parsimony algorithm: first phase
p Find out possible assignments at every node for each
site u independently. Denote site u in sequence i by
si,u.
For i := 1, …, n do
Fi := {si,u} % possible assignments at node i
Li := 0 % number of substitutions up to node i
For i := n+1, …, 2n-1 do
Let j and k be the children of node i
If Fj Fk =
then Li := Lj + Lk + 1, Fi := Fj Fk
else Li := Lj + Lk, Fi := Fj Fk
362
Parsimony algorithm: first phase
3
AACGT
4
AATGT
5
AATTT
2
ACATT
1
ACTTT
Choose u = 3 (for example, in general we do this for all sites)
F1 := {T}
L1 := 0
F2 := {A}
L2 := 0
F3 := {C}, L3 := 0
F4 := {T}, L4 := 0
F5 := {T}, L5 := 0
6
7
8
9
363
Parsimony algorithm: first phase
3
AACGT
4
AATGT
5
AATTT
2
ACATT
1
ACTTT
6 {C,T}
7 T
8 {A, T}
9 T
F8 := F1 F2 = {A, T}
L8 := L1 + L2 + 1 = 1
F6 := F3 F4 = {C, T}
L6 := L3 + L4 + 1 = 1
F7 := F5 F6 = {T}
L7 := L5 + L6 = 1
F9 := F7 F8 = {T}
L9 := L7 + L8 = 2
Parsimony cost for site 3 is 2
364
Parsimony algorithm: second phase
p Backtrack from the root and assign x Fi
at each node
p If we assigned y at parent of node i and y
Fi, then assign y
p Else assign x Fi by random
365
Parsimony algorithm: second phase
3
AACGT
4
AATGT
5
AATTT
2
ACATT
1
ACTTT
6 {C,T}
7 T
8 {A,T}
9 T
At node 6, the
algorithm assigns T
because T was
assigned to parent
node 7 and T F6.
T is assigned to node 8
for the same reason.
The other nodes have
only one possible letter
to assign
366
Parsimony algorithm
3
AACGT
4
AATGT
5
AATTT
2
ACATT
1
ACTTT
6 T
7 T
8 T
9 T
First and second phase are
repeated for each site in
the sequences,
summing the parsimony
costs at each site
367
Properties of parsimony algorithm
p Parsimony algorithm requires that the sequences
are of same length
n First align the sequences against each other and,
optionally, remove indels
n Then compute parsimony for the resulting sequences
n Indels (if present) considered as characters
p Is the most parsimonious tree the correct tree?
n Not necessarily but it explains the sequences with least
number of substitutions
n We can assume that the probability of having fewer
mutations is higher than having many mutations
368
Finding the most parsimonious tree
p Parsimony algorithm calculates the
parsimony cost for a given tree…
p …but we still have the problem of finding
the tree with the lowest cost
p Exhaustive search (enumerating all trees)
is in general impossible
p More efficient methods exist, for example
n Probabilistic search
n Branch and bound
369
Branch and bound in parsimony
p We can exploit the fact that adding edges
to a tree can only increase the parsimony
cost
1
AATGT
2
AATTT
3
AACGT
1
AATGT
2
AATTT
{T}
{T}
{C, T}
cost 0 cost 1
370
Branch and bound in parsimony
Branch and bound is
a general search
strategy where
p Each solution is
potentially generated
p Track is kept of the
best solution found
p If a partial solution
cannot achieve better
score, we abandon
the current search
path
In parsimony…
p Start from a tree with 1
sequence
p Add a sequence to the tree
and calculate parsimony
cost
p If the tree is complete,
check if found the best tree
so far
p If tree is not complete and
cost exceeds best tree
cost, do not continue
adding edges to this tree
371
Branch and bound example
1 1 2 3
1 2 3
1 2 4
…
Complete tree:
compute parsimony
cost
Example with 4 sequences
Partial tree:
Compute parsimony cost
and compare against best
so far;
Do not continue expansion
if above cost of the best tree
3
1 2 4
2
1 3 4
2
1 3 4
1 3
2
1 3
372
Distance methods
p The parsimony method works on sequence
(character string) data
p We can also build phylogenetic trees in a
more general setting
p Distance methods work on a set of
pairwise distances dij for the data
p Distances can be obtained from
phenotypes as well as from genotypes
(sequences)
373
Distances in a phylogenetic tree
p Distance matrix D = (dij)
gives pairwise distances
for leaves of the
phylogenetic tree
p In addition, the
phylogenetic tree will
now specify distances
between leaves and
internal nodes
n Denote these with dij as
well
2 3 4 5
1
6
7
8
Distance dij states how
far apart species i and j
are evolutionary (e.g.,
number of mismatches in
aligned sequences)
374
Distances in evolutionary context
p Distances dij in evolutionary context
satisfy the following conditions
n Symmetry: dij = dji for each i, j
n Distinguishability: dij 0 if and only if i j
n Triangle inequality: dij dik + dkj for each i, j, k
p Distances satisfying these conditions are called
metric
p In addition, evolutionary mechanisms may
impose additional constraints on the distances
additive and ultrametric distances
375
Additive trees
p A tree is called additive, if the distance
between any pair of leaves (i, j) is the
sum of the distances between the leaves
and a node k on the shortest path from i
to j in the tree
dij = dik + djk
p ”Follow the path from the leaf i to the leaf
j to find the exact distance dij between the
leaves.”
376
Additive trees: example
0
2
4
4
D
2
0
4
4
C
4
4
0
2
B
4
4
2
0
A
D
C
B
A
A
B
C
D
1
1
2 1
1
377
Ultrametric trees
p A rooted additive tree is called an ultrametric
tree, if the distances between any two leaves i
and j, and their common ancestor k are equal
dik = djk
p Edge length dij corresponds to the time elapsed
since divergence of i and j from the common
parent
p In other words, edge lengths are measured by a
molecular clock with a constant rate
378
Identifying ultrametric data
p We can identify distances to be ultrametric
by the three-point condition:
D corresponds to an ultrametric tree if
and only if for any three species i, j and
k, the distances satisfy dij max(dik, dkj)
p If we find out that the data is ultrametric, we can
utilise a simple algorithm to find the
corresponding tree
379
Ultrametric trees
9
8
7
5 4 3 2 1
6
Observation time
Time
380
Ultrametric trees
9
8
7
5 4 3 2 1
6
Observation time
Time
Only vertical segments of the
tree have correspondence to
some distance dij:
Horizontal segments act as
connectors.
d8,9
381
Ultrametric trees
9
8
7
5 4 3 2 1
6
Observation time
Time
dik = djk for any two leaves
i, j and any ancestor k of
i and j
382
Ultrametric trees
9
8
7
5 4 3 2 1
6
Observation time
Time
Three-point condition: there are
no leafs i, j for which dij > max(dik, djk)
for some leaf k.
383
UPGMA algorithm
p UPGMA (unweighted pair group method
using arithmetic averages) constructs a
phylogenetic tree via clustering
p The algorithm works by at the same time
n Merging two clusters
n Creating a new node on the tree
p The tree is built from leaves towards the
root
p UPGMA produces a ultrametric tree
384
Cluster distances
p Let distance dij between clusters Ci and Cj
be
that is, the average distance between
points (species) in the cluster.
385
UPGMA algorithm
p Initialisation
n Assign each point i to its own cluster Ci
n Define one leaf for each sequence, and place it at height
zero
p Iteration
n Find clusters i and j for which dij is minimal
n Define new cluster k by Ck = Ci Cj, and define dkl for all l
n Define a node k with children i and j. Place k at height dij/2
n Remove clusters i and j
p Termination:
n When only two clusters i and j remain, place root at height dij/2
386
1 2
3
4
5
387
1 2
3
4
5
1 2
6
388
1 2
3
4
5
1 2 4 5
6 7
389
1 2
3
4
5
1 2 4 5
6 7
8
3
390
1 2
3
4
5
1 2 4 5
6 7
8
3
9
391
UPGMA implementation
p In naive implementation, each iteration
takes O(n2) time with n sequences =>
algorithm takes O(n3) time
p The algorithm can be implemented to take
only O(n2) time (see Gronau & Moran,
2006, for a survey)
392
Problem solved?
p We now have a simple algorithm which finds a
ultrametric tree
n If the data is ultrametric, then there is exactly one
ultrametric tree corresponding to the data (we skip the
proof)
n The tree found is then the ”correct” solution to the
phylogeny problem, if the assumptions hold
p Unfortunately, the data is not ultrametric in
practice
n Measurement errors distort distances
n Basic assumption of a molecular clock does not hold
usually very well
393
Incorrect reconstruction of non-
ultrametric data by UPGMA
1
2 3
4
1 2 3
4
Tree which corresponds
to non-ultrametric
distances
Incorrect ultrametric reconstruction
by UPGMA algorithm
394
Checking for additivity
p How can we check if our data is additive?
p Let i, j, k and l be four distinct species
p Compute 3 sums: dij + dkl, dik + djl, dil
+ djk
395
Four-point condition
i
j l
k i
j l
k i
j l
k
dik
djl
dil
djk
dij dkl
p The sums are represented by the three figures
n Left and middle sum cover all edges, right sum does not
p Four-point condition: i, j, k and l satisfy the four-
point condition if two of the sums dij + dkl, dik +
djl, dil + djk are the same, and the third one is
smaller than these two
396
Checking for additivity
p An n x n matrix D is additive if and only if
the four point condition holds for every 4
distinct elements 1 i, j, k, l n
397
Finding an additive phylogenetic tree
p Additive trees can be found with, for example,
the neighbor joining method (Saitou & Nei, 1987)
p The neighbor joining method produces unrooted
trees, which have to be rooted by other means
n A common way to root the tree is to use an outgroup
n Outgroup is a species that is known to be more distantly
related to every other species than they are to each
other
n Root node candidate: position where the outgroup would
join the phylogenetic tree
p However, in real-world data, even additivity
usually does not hold very well
398
Neighbor joining algorithm
p Neighbor joining works in a similar fashion
to UPGMA
n Find clusters C1 and C2 that minimise a
function f(C1, C2)
n Join the two clusters C1 and C2 into a new
cluster C
n Add a node to the tree corresponding to C
n Assign distances to the new branches
p Differences in
n The choice of function f(C1, C2)
n How to assign the distances
399
Neighbor joining algorithm
p Recall that the distance dij for clusters Ci and Cj
was
p Let u(Ci) be the separation of cluster Ci from
other clusters defined by
where n is the number of clusters.
400
Neighbor joining algorithm
p Instead of trying to choose the clusters Ci
and Cj closest to each other, neighbor
joining at the same time
n Minimises the distance between clusters Ci and
Cj and
n Maximises the separation of both Ci and Cj
from other clusters
401
Neighbor joining algorithm
p Initialisation as in UPGMA
p Iteration
n Find clusters i and j for which dij – u(Ci) – u(Cj) is minimal
n Define new cluster k by Ck = Ci Cj, and define dkl for all l
n Define a node k with edges to i and j. Remove clusters i and j
n Assign length ½ dij + ½ (u(Ci) – u(Cj)) to the edge i -> k
n Assign length ½ dij + ½ (u(Cj) – u(Ci)) to the edge j -> k
p Termination:
n When only one cluster remains
402
Neighbor joining algorithm: example
a b c d
a 0 6 7 5
b 0 11 9
c 0 6
d 0
i u(i)
a (6+7+5)/2 = 9
b (6+11+9)/2 = 13
c (7+11+6)/2 = 12
d (5+9+6)/2 = 10
i,j dij – u(Ci) – u(Cj)
a,b 6 - 9 - 13 = -16
a,c 7 - 9 - 12 = -14
a,d 5 - 9 - 10 = -14
b,c 11 - 13 - 12 = -14
b,d 9 - 13 - 10 = -14
c,d 6 - 12 - 10 = -16
Choose either pair
to join
403
Neighbor joining algorithm: example
a b c d
a 0 6 7 5
b 0 11 9
c 0 6
d 0
i u(i)
a (6+7+5)/2 = 9
b (6+11+9)/2 = 13
c (7+11+6)/2 = 12
d (5+9+6)/2 = 10
i,j dij – u(Ci) – u(Cj)
a,b 6 - 9 - 13 = -16
a,c 7 - 9 - 12 = -14
a,d 5 - 9 - 10 = -14
b,c 11 - 13 - 12 = -14
b,d 9 - 13 - 10 = -14
c,d 6 - 12 - 10 = -16
a b c d
e
dae = ½ 6 + ½ (9 – 13) = 1
dbe = ½ 6 + ½ (13 – 9) = 5
dbe
dae
This is the first step only…
404
Inferring the Past: Phylogenetic
Trees (chapter 12)
p The biological problem
p Parsimony and distance methods
p Models for mutations and estimation of
distances
p Maximum likelihood methods
n These parts of the book is skipped on this
course (see slides of 2007 course for material
on these topics)
n No questions in exams on these topics!
405
Problems with tree-building
p Assumptions
n Sites evolve independently of one other
n (Sites evolve according to the same stochastic
model; not really covered this year)
n The tree is rooted
n The sequences are aligned
n Vertical inheritance
406
Additional material on phylogenetic
trees
p Durbin, Eddy, Krogh, Mitchison: Biological
sequence analysis
p Jones, Pevzner: An introduction to
bioinformatics algorithms
p Gusfield: Algorithms on strings, trees, and
sequences
p Course on phylogenetic analyses in Spring
2009
Introduction to
Bioinformatics
Wrap-up
408
Topics
Sequencing
-Sanger
-High-throughput
atgagccaag ttccgaacaa ggattcgcgg
gtcggtaaag agcattggaa cgtcggagat
aagaagcgga tgaatttccc cataacgcca
gtggaagaga aggaggcggg cctcccgatc
actccggccc gaagggttga gagtacccca
gaaatcacct ccagaggacc ccttcagcga
catagcgata ggaggggatg ctaggagttg
Biodatabases
BLAST query
gtggaagagaaggaggcgg…
gaagggttgagagtacccca...
ccagaggaccccttcagcga…
ggaggggatgctaggagttg…
Sequence
assembler
Contigs
=> Genomes
Reads
Similar
sequences = homologs?
Proteomics
Gene expression
analysis
DNA chips
Protein gels
MS/MS techniques
Comparative genomics
- Phylogenetics
- Genome rearrangements
- …
Systems
Biology
2010: Human genome
in 15 minutes!
atgagccaag
aagaagcgga
cagcggaaga
409
Exams
p Course exam Wednesday 15 October
16.00-19.00 Exactum A111
p Separate exams
n Tue 18 November 16.00-20.00 Exactum A111
n Fri 16 January 16.00-20.00 Exactum A111
n Tue 31 March 16.00-20.00 Exactum A111
p Check exam date and place before taking
the exam! (previous week or so)
410
Exam regulations
p If you are late more than 30 min, you
cannot take the exam
p You are not allowed to bring material such
as books or lecture notes to the exam
p Allowed stuff: blank paper (distributed in
the exam), pencils, pens, erasers,
calculators, snacks
p Bring your student card or other id!
411
Grading
p Grading: on the scale 0-5
n To get the lowest passing grade 1, you need to get at
least 30 points out of 60 maximum
p Course exam gives you maximum of 48 points
p Note: if you take the first separate exam, the
best of the following options will be considered:
n Exam gives you max 48 points, exercises max 12 points
n Exam gives you max 60 points
p In second and subsequent separate exams, only
the 60 point option is in use
412
Exercise points
p Max. marks: 31
p 80% of 31 ~= 24 marks -> 12 points
p 2 marks = 1 point
413
Topics covered by exams
p Exams cover everything presented in lectures
(exception: biological background not covered)
p Word distributions and occurrences (course book
chapters 2-3)
p Genome rearrangements (chapter 5)
p Sequence alignment (chapter 6)
p Rapid alignment methods: FASTA and BLAST
(chapter 7)
p Sequencing and sequence assembly (chapter 8)
414
Topics covered by exams
p Similarity, distance and clustering
(chapter 10)
p Expression data analysis (chapter 11)
p Phylogenetic trees (chapter 12)
p Systems biology: modelling biological
networks (no chapter in course book)
415
Bioinformatics courses in 2008
p Biological sequence analysis (II period,
Kumpula)
n Focus on probabilistic methods: Hidden Markov
Models, Profile HMMs, finding regulatory
elements, …
p Modeling of biological networks (20-
24.10., TKK)
n Biochemical network modelling and parameter
estimation in biochemical networks using
mechanistic differential equation models.
416
Bioinformatics courses in autumn 2008
p Bayesian paradigm in genetic
bioinformatics (II period, Kumpula)
n Applications of Bayesian approach in computer
programs and data analysis of
p genetic past,
p phylogenetics,
p coalescence,
p relatedness,
p haplotype structure,
p disease gene associations.
417
Bioinformatics courses in autumn 2008
p Statistical methods in genetics (II period,
Kumpula)
n Introduction to statistical methods in gene
mapping and genetic epidemiology.
n Basic concepts of linkage and association
analysis as well as some concepts of
population genetics will be covered.
418
Bioinformatics courses in Spring 2009
p Practical Course in Biodatabases (III
period, Kumpula)
p High-throughput bioinformatics (III-IV
periods, TKK)
p Phylogenetic data analyses (IV period,
Kumpula)
n Maximum likelihood methods, Bayesian
methods, program packages
p Metabolic modelling (IV period, Kumpula)
419
Genomes sequenced – all done?
p Sequencing is just the beginning
n What do genes and proteins do?
p Functional genomics
n How do they interact with other genes
and proteins?
p Systems biology
Two sides of the same question!
420
Bioinformatics (at least mathematical
biology) can exist outside molecular biology
The metapopulation capacity of a fragmented landscape
Ilkka Hanski and Otso Ovaskainen
Nature 404, 755-758(13 April 2000)
Melitaea cinxia, Glanville Fritillary butterfly
421
Metagenomics
p Metagenomics or environmental genomics
n ”At the last count 1.8 million species were known to
science. That sounds like a lot, but in truth it's no big
deal. We may have done a reasonable job of describing
the larger stuff, but the fact remains that an average
teaspoon of water, soil or ice contains millions of micro-
organisms that have never been counted or named. ”
-- Henry Nicholls
422
Omics
p Phenome
p Exposome
p Textome
p Receptorome
p Kinome
p Neurome
p Cytome
p Predictome
p Omeome
p Reactome
p Connectome
p Genome
p Transcriptome
p Metabolome
p Metallome
p Lipidome
p Glycome
p Interactome
p Spliceome
p ORFeome
p Speechome
p Mechanome
http://en.wikipedia.org/wiki/-omics
423
Take-home messages
p Don’t trust biodatabases blindly!
n Annotation errors tend to accumulate
p Consider
n Statistical significance
n Sensitivity
of your results
p Think about the whole ”bioinformatics workflow”:
n Biological phenomenon -> Modelling -> Computation ->
Validation of results
p Results from bioinformatics tools and methods
must be validated!
p Actively seek cooperation with experts
424
Bioinformatics journals
p Bioinformatics, http://bioinformatics.oupjournals.org/
p BMC Bioinformatics,
http://www.biomedcentral.com/bmcbioinformatics
p Journal of Bioinformatics and Computational Biology
(JBCB), http://www.worldscinet.com/jbcb/jbcb.shtml
p Journal of Computational Biology,
http://www.liebertpub.com/CMB/
p IEEE/ACM Transactions on Computational Biology and
Bioinformatics , http://www.computer.org/tcbb/
p PLoS Computational Biology, www.ploscompbiol.org
p In Silico Biology, http://www.bioinfo.de/isb/
p Nature, Science (bedtime reading)
425
Bioinformatics conferences
p ISMB, Intelligent Systems for
Molecular Biology (Toronto, July 2008)
p ICSB, International Conference on
Systems Biology (Göteborg, Sweden;
22-28 August)
p RECOMB, Research in Computational
Molecular Biology
p ECCB, European Conference on
Computational Biology
p WABI, Workshop on Algorithms in
Bioinformatics
p PSB, Pacific Symposium on
Biocomputing
January 5-9, 2009
The Big Island of Hawaii
426
Master’s degree in bioinformatics?
p You can apply to MBI during the
application period November ’08 – 2
February ’09
n Bachelor’s degree in suitable field
n At least 60 ECTS credits in CS or mathstat
n English language certificate
p Passing this course gives you the first 4
credits for Bioinformatics MSc!
427
Information session on MBI
p Wednesday 19.11. 13.00-15.00 Exactum
D122
p www.cs.helsinki.fi/mbi/events/info08
p Talks in Finnish
428
Mailing list for bioinformatics courses
and events
p MBI maintains a mailing list for
announcement on bioinformatics courses
and events
p Send email to bioinfo a t cs.helsinki.fi if
you want to subscribe to the list (you can
unsubscribe in the same way)
p List is moderated
429
Computer Science
Mathematics
and Statistics
Biology & Medicine
Bioinformatics
The aim of this course
Where would you be in this triangle?
Has your position shifted during the course?
430
Feedback
p Please give feedback on the course!
n https://ilmo.cs.helsinki.fi/kurssit/servlet/Valint
a?kieli=en
p Don’t worry about your grade – you can
give feedback anonymously
431
Thank you!
p I hope you enjoyed the course!
taivasalla.net
Halichoerus grypus, Gray seal or harmaahylje in Finnish

bioinformatics.pdf

  • 1.
    Introduction to Bioinformatics Esa Pitkänen esa.pitkanen@cs.helsinki.fi Autumn2008, I period www.cs.helsinki.fi/mbi/courses/08-09/itb 582606 Introduction to Bioinformatics, Autumn 2008
  • 2.
    Introduction to Bioinformatics Lecture 1: Administrativeissues MBI Programme, Bioinformatics courses What is bioinformatics? Molecular biology primer
  • 3.
    3 How to enrolfor the course? p Use the registration system of the Computer Science department: https://ilmo.cs.helsinki.fi n You need your user account at the IT department (“cc account”) p If you cannot register yet, don’t worry: attend the lectures and exercises; just register when you are able to do so
  • 4.
    4 Teachers p Esa Pitkänen,Department of Computer Science, University of Helsinki p Elja Arjas, Department of Mathematics and Statistics, University of Helsinki p Sami Kaski, Department of Information and Computer Science, Helsinki University of Technology p Lauri Eronen, Department of Computer Science, University of Helsinki (exercises)
  • 5.
    5 Lectures and exercises pLectures: Tuesday and Friday 14.15-16.00 Exactum C221 p Exercises: Tuesday 16.15-18.00 Exactum C221 n First exercise session on Tue 9 September
  • 6.
    6 Status & Prerequisites pAdvanced level course at the Department of Computer Science, U. Helsinki p 4 credits p Prerequisites: n Basic mathematics skills (probability calculus, basic statistics) n Familiarity with computers n Basic programming skills recommended n No biology background required
  • 7.
    7 Course contents p Whatis bioinformatics? p Molecular biology primer p Biological words p Sequence assembly p Sequence alignment p Fast sequence alignment using FASTA and BLAST p Genome rearrangements p Motif finding (tentative) p Phylogenetic trees p Gene expression analysis
  • 8.
    8 How to passthe course? p Recommended method: n Attend the lectures (not obligatory though) n Do the exercises n Take the course exam p Or: n Take a separate exam
  • 9.
    9 How to passthe course? p Exercises give you max. 12 points n 0% completed assignments gives you 0 points, 80% gives 12 points, the rest by linear interpolation n “A completed assignment” means that p You are willing to present your solution in the exercise session and p You return notes by e-mail to Lauri Eronen (see course web page for contact info) describing the main phases you took to solve the assignment n Return notes at latest on Tuesdays 16.15 p Course exam gives you max. 48 points
  • 10.
    10 How to passthe course? p Grading: on the scale 0-5 n To get the lowest passing grade 1, you need to get at least 30 points out of 60 maximum p Course exam: Wed 15 October 16.00-19.00 Exactum A111 p See course web page for separate exams p Note: if you take the first separate exam, the best of the following options will be considered: n Exam gives you 48 points, exercises 12 points n Exam gives you 60 points p In second and subsequent separate exams, only the 60 point option is in use
  • 11.
    11 Literature p Deonier, Tavaré, Waterman:Computational Genome Analysis, an Introduction. Springer, 2005 p Jones, Pevzner: An Introduction to Bioinformatics Algorithms. MIT Press, 2004 p Slides for some lectures will be available on the course web page
  • 12.
    12 Additional literature p Gusfield:Algorithms on strings, trees and sequences p Griffiths et al: Introduction to genetic analysis p Alberts et al.: Molecular biology of the cell p Lodish et al.: Molecular cell biology p Check the course web site
  • 13.
  • 14.
    14 Master's Degree Programmein Bioinformatics (MBI) p Two-year MSc programme p Admission for 2009-2010 in January 2009 n You need to have your Bachelor’s degree ready by August 2009 www.cs.helsinki.fi/mbi
  • 15.
    15 MBI programme organizers Departmentof Computer Science, Department of Mathematics and Statistics Faculty of Science, Kumpula Campus, HY Laboratory of Computer and Information Science, Laboratory of CS and Engineering,TKK Faculty of Medicine, Meilahti Campus, HY Faculty of Biosciences Faculty of Agriculture and Forestry Viikki Campus, HY
  • 16.
    16 TKK, Otaniemi HY, Meilahti HY,Kumpula HY, Viikki Four MBI campuses
  • 17.
    17 MBI highlights p Youcan take courses from both HY and TKK p Two biology courses tailored specifically for MBI p Bioinformatics is a new exciting field, with a high demand for experts in job market p Go to www.cs.helsinki.fi/mbi/careers to find out what a bioinformatician could do for living
  • 18.
    18 Admission p Admission requirements nBachelor’s degree in a suitable field (e.g., computer science, mathematics, statistics, biology or medicine) n At least 60 ECTS credits in total in computer science, mathematics and statistics n Proficiency in English (standardized language test: TOEFL, IELTS) p Admission period opens in late Autumn 2009 and closes in 2 February 2009 p Details on admission will be posted in www.cs.helsinki.fi/mbi during this autumn
  • 19.
    19 Bioinformatics courses inHelsinki region: 1st period p Computational genomics (4-7 credits, TKK) p Seminar: Neuroinformatics (3 credits, Kumpula) p Seminar: Machine Learning in Bioinformatics (3 credits, Kumpula) p Signal processing in neuroinformatics (5 credits, TKK)
  • 20.
    20 A good biologycourse for computer scientists and mathematicians? p Biology for methodological scientists (8 credits, Meilahti) n Course organized by the Faculties of Bioscience and Medicine for the MBI programme n Introduction to basic concepts of microarrays, medical genetics and developmental biology n Study group + book exam in I period (2 cr) n Three lectured modules, 2 cr each n Each module has an individual registration so you can participate even if you missed the first module n www.cs.helsinki.fi/mbi/courses/08-09/bfms/
  • 21.
    21 Bioinformatics courses inHelsinki region: 2nd period p Bayesian paradigm in genetic bioinformatics (6 credits, Kumpula) p Biological Sequence Analysis (6 credits, Kumpula) p Modeling of biological networks (5-7 credits, TKK) p Statistical methods in genetics (6-8 credits, Kumpula)
  • 22.
    22 Bioinformatics courses inHelsinki region: 3rd period p Evolution and the theory of games (5 credits, Kumpula) p Genome-wide association mapping (6-8 credits, Kumpula) p High-Throughput Bioinformatics (5-7 credits, TKK) p Image Analysis in Neuroinformatics (5 credits, TKK) p Practical Course in Biodatabases (4-5 credits, Kumpula) p Seminar: Computational systems biology (3 credits, Kumpula) p Spatial models in ecology and evolution (8 credits, Kumpula) p Special course in bioinformatics I (3-7 credits, TKK)
  • 23.
    23 Bioinformatics courses inHelsinki region: 4th period p Metabolic Modeling (4 credits, Kumpula) p Phylogenetic data analyses (6-8 credits, Kumpula)
  • 24.
    24 1. What isbioinformatics?
  • 25.
    25 What is bioinformatics? pBioinformatics, n. The science of information and information flow in biological systems, esp. of the use of computational methods in genetics and genomics. (Oxford English Dictionary) p "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." -- Fredj Tekaia
  • 26.
    26 What is bioinformatics? p"I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with management and the subsequent use of biological information, particular genetic information." -- Richard Durbin
  • 27.
    27 What is notbioinformatics? p Biologically-inspired computation, e.g., genetic algorithms and neural networks p However, application of neural networks to solve some biological problem, could be called bioinformatics p What about DNA computing? http://www.wisdom.weizmann.ac.il/~lbn/new_pages/Visual_Presentation.html
  • 28.
    28 Computational biology p Applicationof computing to biology (broad definition) p Often used interchangeably with bioinformatics p Or: Biology that is done with computational means
  • 29.
    29 Biometry & biophysics pBiometry: the statistical analysis of biological data n Sometimes also the field of identification of individuals using biological traits (a more recent definition) p Biophysics: "an interdisciplinary field which applies techniques from the physical sciences to understanding biological structure and function" -- British Biophysical Society
  • 30.
    30 Mathematical biology p Mathematicalbiology “tackles biological problems, but the methods it uses to tackle them need not be numerical and need not be implemented in software or hardware.” -- Damian Counsell Alan Turing
  • 31.
    31 Turing on biologicalcomplexity p “It must be admitted that the biological examples which it has been possible to give in the present paper are very limited. This can be ascribed quite simply to the fact that biological phenomena are usually very complicated. Taking this in combination with the relatively elementary mathematics used in this paper one could hardly expect to find that many observed biological phenomena would be covered. It is thought, however, that the imaginary biological systems which have been treated, and the principles which have been discussed, should be of some help in interpreting real biological forms.” – Alan Turing, The Chemical Basis of Morphogenesis, 1952
  • 32.
    32 Related concepts p Systemsbiology n “Biology of networks” n Integrating different levels of information to understand how biological systems work p Computational systems biology Overview of metabolic pathways in KEGG database, www.genome.jp/kegg/
  • 33.
    33 Why is bioinformaticsimportant? p New measurement techniques produce huge quantities of biological data n Advanced data analysis methods are needed to make sense of the data n Typical data sources produce noisy data with a lot of missing values p Paradigm shift in biology to utilise bioinformatics in research
  • 34.
    34 Bioinformatician’s skill set pStatistics, data analysis methods n Lots of data n High noise levels, missing values n #attributes >> #data points p Programming languages n Scripting languages: Python, Perl, Ruby, … n Extensive use of text file formats: need parsers n Integration of both data and tools p Data structures, databases
  • 35.
    35 Bioinformatician’s skill set pModelling n Discrete vs continuous domains n -> Systems biology p Scientific computation packages n R, Matlab/Octave, … p Communication skills!
  • 36.
    36 Communication skills: case1 Biologist presents a problem to computer scientists / mathematicians ? ”I am interested in finding what affects the regulation gene x during condition y and how that relates to the organism’s phenotype.” ”Define input and output of the problem.”
  • 37.
    37 Communication skills: case2 Bioinformatician is a part of a group that consists mostly of biologists.
  • 38.
    38 Communication skills: case2 ...biologist/bioinformatician ratio is important!
  • 39.
    39 Communication skills: case3 A group of bioinformaticians offers their services to more than one group
  • 40.
    40 Bioinformatician’s skill set pHow much biology you should know?
  • 41.
    41 Computer Science • Programming •Databases • Algorithmics Mathematics and statistics • Calculus • Probability calculus • Linear algebra Biology & Medicine • Basics in molecular and cell biology • Measurement techniques Bioinformatics • Biological sequence analysis • Biological databases • Analysis of gene expression • Modeling protein structure and function • Gene, protein and metabolic networks • … Bioinformatician’s skill set Prof. Juho Rousu, 2006 Where would you be in this triangle?
  • 42.
    42 A problem involvingbioinformatics? - ”I found a fruit fly that is immune to all diseases!” - ”It was one of these” Pertti Jarla, http://www.hs.fi/fingerpori/
  • 43.
    43 Molecular biology primer MolecularBiology Primer by Angela Brooks, Raymond Brown, Calvin Chen, Mike Daly, Hoa Dinh, Erinn Hama, Robert Hinman, Julio Ng, Michael Sneddon, Hoa Troung, Jerry Wang, Che Fung Yung Edited for Introduction to Bioinformatics (Autumn 2007, Summer 2008, Autumn 2008) by Esa Pitkänen
  • 44.
    44 Molecular biology primer pPart 1: What is life made of? p Part 2: Where does the variation in genomes come from?
  • 45.
    45 Life begins withCell p A cell is a smallest structural unit of an organism that is capable of independent functioning p All cells have some common features
  • 46.
    46 Cells p Fundamental workingunits of every living system. p Every organism is composed of one of two radically different types of cells: n prokaryotic cells or n eukaryotic cells. p Prokaryotes and Eukaryotes are descended from the same primitive cell. n All prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.
  • 47.
    47 Two types ofcells: Prokaryotes and Eukaryotes
  • 48.
    48 Prokaryotes and Eukaryotes pAccording to the most recent evidence, there are three main branches to the tree of life p Prokaryotes include Archaea (“ancient ones”) and bacteria p Eukaryotes are kingdom Eukarya and includes plants, animals, fungi and certain algae Lecture: Phylogenetic trees
  • 49.
    49 All Cells havecommon Cycles p Born, eat, replicate, and die
  • 50.
    50 Common features oforganisms p Chemical energy is stored in ATP p Genetic information is encoded by DNA p Information is transcribed into RNA p There is a common triplet genetic code p Translation into proteins involves ribosomes p Shared metabolic pathways p Similar proteins among diverse groups of organisms
  • 51.
    51 All Life dependson 3 critical molecules p DNAs (Deoxyribonucleic acid) n Hold information on how cell works p RNAs (Ribonucleic acid) n Act to transfer short pieces of information to different parts of cell n Provide templates to synthesize into protein p Proteins n Form enzymes that send signals to other cells and regulate gene activity n Form body’s major components (e.g. hair, skin, etc.) n “Workhorses” of the cell
  • 52.
    52 DNA: The Codeof Life p The structure and the four genomic letters code for all living organisms p Adenine, Guanine, Thymine, and Cytosine which pair A-T and C-G on complimentary strands. Lecture: Genome sequencing and assembly
  • 53.
    53 Discovery of thestructure of DNA p 1952-1953 James D. Watson and Francis H. C. Crick deduced the double helical structure of DNA from X-ray diffraction images by Rosalind Franklin and data on amounts of nucleotides in DNA James Watson and Francis Crick Rosalind Franklin ”Photo 51”
  • 54.
    54 DNA, continued p DNAhas a double helix structure which is composed of n sugar molecule n phosphate group n and a base (A,C,G,T) p By convention, we read DNA strings in direction of transcription: from 5’ end to 3’ end 5’ ATTTAGGCC 3’ 3’ TAAATCCGG 5’
  • 55.
    55 DNA is containedin chromosomes http://en.wikipedia.org/wiki/Image:Chromatin_Structures.png p In eukaryotes, DNA is packed into chromatids n In metaphase, the “X” structure consists of two identical chromatids p In prokaryotes, DNA is usually contained in a single, circular chromosome
  • 56.
    56 Human chromosomes p Somaticcells in humans have 2 pairs of 22 chromosomes + XX (female) or XY (male) = total of 46 chromosomes p Germline cells have 22 chromosomes + either X or Y = total of 23 chromosomes Karyogram of human male using Giemsa staining (http://en.wikipedia.org/wiki/Karyotype)
  • 57.
    57 Length of DNAand number of chromosomes Organism #base pairs #chromosomes (germline) Prokayotic Escherichia coli (bacterium) 4x106 1 Eukaryotic Saccharomyces cerevisia (yeast) 1.35x107 17 Drosophila melanogaster (insect) 1.65x108 4 Homo sapiens (human) 2.9x109 23 Zea mays (corn / maize) 5.0x109 10
  • 58.
    58 1 atgagccaag ttccgaacaaggattcgcgg ggaggataga tcagcgcccg agaggggtga 61 gtcggtaaag agcattggaa cgtcggagat acaactccca agaaggaaaa aagagaaagc 121 aagaagcgga tgaatttccc cataacgcca gtgaaactct aggaagggga aagagggaag 181 gtggaagaga aggaggcggg cctcccgatc cgaggggccc ggcggccaag tttggaggac 241 actccggccc gaagggttga gagtacccca gagggaggaa gccacacgga gtagaacaga 301 gaaatcacct ccagaggacc ccttcagcga acagagagcg catcgcgaga gggagtagac 361 catagcgata ggaggggatg ctaggagttg ggggagaccg aagcgaggag gaaagcaaag 421 agagcagcgg ggctagcagg tgggtgttcc gccccccgag aggggacgag tgaggcttat 481 cccggggaac tcgacttatc gtccccacat agcagactcc cggaccccct ttcaaagtga 541 ccgagggggg tgactttgaa cattggggac cagtggagcc atgggatgct cctcccgatt 601 ccgcccaagc tccttccccc caagggtcgc ccaggaatgg cgggacccca ctctgcaggg 661 tccgcgttcc atcctttctt acctgatggc cggcatggtc ccagcctcct cgctggcgcc 721 ggctgggcaa cattccgagg ggaccgtccc ctcggtaatg gcgaatggga cccacaaatc 781 tctctagctt cccagagaga agcgagagaa aagtggctct cccttagcca tccgagtgga 841 cgtgcgtcct ccttcggatg cccaggtcgg accgcgagga ggtggagatg ccatgccgac 901 ccgaagagga aagaaggacg cgagacgcaa acctgcgagt ggaaacccgc tttattcact 961 ggggtcgaca actctgggga gaggagggag ggtcggctgg gaagagtata tcctatggga 1021 atccctggct tccccttatg tccagtccct ccccggtccg agtaaagggg gactccggga 1081 ctccttgcat gctggggacg aagccgcccc cgggcgctcc cctcgttcca ccttcgaggg 1141 ggttcacacc cccaacctgc gggccggcta ttcttctttc ccttctctcg tcttcctcgg 1201 tcaacctcct aagttcctct tcctcctcct tgctgaggtt ctttcccccc gccgatagct 1261 gctttctctt gttctcgagg gccttccttc gtcggtgatc ctgcctctcc ttgtcggtga 1321 atcctcccct ggaaggcctc ttcctaggtc cggagtctac ttccatctgg tccgttcggg 1381 ccctcttcgc cgggggagcc ccctctccat ccttatcttt ctttccgaga attcctttga 1441 tgtttcccag ccagggatgt tcatcctcaa gtttcttgat tttcttctta accttccgga 1501 ggtctctctc gagttcctct aacttctttc ttccgctcac ccactgctcg agaacctctt 1561 ctctcccccc gcggtttttc cttccttcgg gccggctcat cttcgactag aggcgacggt 1621 cctcagtact cttactcttt tctgtaaaga ggagactgct ggccctgtcg cccaagttcg 1681 ag Hepatitis delta virus, complete genome
  • 59.
    59 RNA p RNA issimilar to DNA chemically. It is usually only a single strand. T(hyamine) is replaced by U(racil) p Several types of RNA exist for different functions in the cell. http://www.cgl.ucsf.edu/home/glasfeld/tutorial/trna/trna.gif tRNA linear and 3D view:
  • 60.
    60 DNA, RNA, andthe Flow of Information Translation Transcription Replication ”The central dogma” Is this true? Denis Noble: The principles of Systems Biology illustrated using the virtual heart http://velblod.videolectures.net/2007/pascal/eccs07_dresden/noble_denis/eccs07_noble_psb_01.ppt
  • 61.
    61 Proteins p Proteins arepolypeptides (strings of amino acid residues) p Represented using strings of letters from an alphabet of 20: AEGLV…WKKLAG p Typical length 50…1000 residues Urease enzyme from Helicobacter pylori
  • 62.
  • 63.
    63 How DNA/RNA codesfor protein? p DNA alphabet contains four letters but must specify protein, or polypeptide sequence of 20 letters. p Dinucleotides are not enough: 42 = 16 possible dinucleotides p Trinucleotides (triplets) allow 43 = 64 possible trinucleotides p Triplets are also called codons
  • 64.
    64 How DNA/RNA codesfor protein? p Three of the possible triplets specify ”stop translation” p Translation usually starts at triplet AUG (this codes for methionine) p Most amino acids may be specified by more than triplet p How to find a gene? Look for start and stop codons (not that easy though)
  • 65.
    65 Proteins: Workhorses ofthe Cell p 20 different amino acids n different chemical properties cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell. p Proteins do all essential work for the cell n build cellular structures n digest nutrients n execute metabolic functions n mediate information flow within a cell and among cellular communities. p Proteins work together with other proteins or nucleic acids as "molecular machines" n structures that fit together and function in highly specific, lock- and-key ways. Lecture 8: Proteomics
  • 66.
    66 Genes p “A geneis a union of genomic sequences encoding a coherent set of potentially overlapping functional products” --Gerstein et al. p A DNA segment whose information is expressed either as an RNA molecule or protein 5’ 3’ 3’ 5’ … a t g a g t g g a … … t a c t c a c c t … augagugga ... (transcription) (translation) MSG … (folding) http://fold.it
  • 67.
    67 FoldIt: Protein foldinggame http://fold.it
  • 68.
    68 Genes & alleles pA gene can have different variants p The variants of the same gene are called alleles 5’ 3’ … a t g a g t g g a … … t a c t c a c c t … augagugga ... MSG … 5’ 3’ … a t g a g t c g a … … t a c t c a g c t … augagucga ... MSR …
  • 69.
    69 Genes can befound on both strands 3’ 5’ 5’ 3’
  • 70.
    70 Exons and introns& splicing 3’ 5’ 5’ 3’ Introns are removed from RNA after transcription Exons Exons are joined: This process is called splicing
  • 71.
    71 Alternative splicing A 3’ 5’ 5’ 3’ BC Different splice variants may be generated A B C B C A C …
  • 72.
    72 Where does thevariation in genomes come from? p Prokaryotes are typically haploid: they have a single (circular) chromosome p DNA is usually inherited vertically (parent to daughter) p Inheritance is clonal n Descendants are faithful copies of an ancestral DNA n Variation is introduced via mutations, transposable elements, and horizontal transfer of DNA Chromosome map of S. dysenteriae, the nine rings describe different properties of the genome http://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm
  • 73.
    73 Causes of variation pMistakes in DNA replication p Environmental agents (radiation, chemical agents) p Transposable elements (transposons) n A part of DNA is moved or copied to another location in genome p Horizontal transfer of DNA n Organism obtains genetic material from another organism that is not its parent n Utilized in genetic engineering
  • 74.
    74 Biological string manipulation pPoint mutation: substitution of a base n …ACGGCT… => …ACGCCT… p Deletion: removal of one or more contiguous bases (substring) n …TTGATCA… => …TTTCA… p Insertion: insertion of a substring n …GGCTAG… => …GGTCAACTAG… Lecture: Sequence alignment Lecture: Genome rearrangements
  • 75.
    75 Meiosis p Sexual organismsare usually diploid n Germline cells (gametes) contain N chromosomes n Somatic (body) cells have 2N chromosomes p Meiosis: reduction of chromosome number from 2N to N during reproductive cycle n One chromosome doubling is followed by two cell divisions Major events in meiosis http://en.wikipedia.org/wiki/Meiosis http://www.ncbi.nlm.nih.gov/About/Primer
  • 76.
    76 Recombination and variation pRecap: Allele is a viable DNA coding occupying a given locus (position in the genome) p In recombination, alleles from parents become suffled in offspring individuals via chromosomal crossover over p Allele combinations in offspring are usually different from combinations found in parents p Recombination errors lead into additional variations Chromosomal crossover as described by T. H. Morgan in 1916
  • 77.
    77 Mitosis http://en.wikipedia.org/wiki/Image:Major_events_in_mitosis.svg p Mitosis: growthand development of the organism n One chromosome doubling is followed by one cell division
  • 78.
    78 Recombination frequency andlinked genes p Genetic marker: some DNA sequence of interest (e.g., gene or a part of a gene) p Recombination is more likely to separate two distant markers than two close ones p Linked markers: ”tend” to be inherited together p Marker distances measured in centimorgans: 1 centimorgan corresponds to 1% chance that two markers are separated in recombination
  • 79.
    79 Biological databases p Exponentialgrowth of biological data n New measurement techniques n Before we are able to use the data, we need to store it efficiently -> biological databases n Published data is submitted to databases p General vs specialised databases p This topic is discussed extensively in Practical course in biodatabases (III period)
  • 80.
    80 10 most importantbiodatabases… according to ”Bioinformatics for dummies” p GenBank/DDJB/EMBL www.ncbi.nlm.nih.gov Nucleotide sequences p Ensembl www.ensembl.org Human/mouse genome p PubMed www.ncbi.nlm.nih.gov Literature references p NR www.ncbi.nlm.nih.gov Protein sequences p UniProt www.expasy.org Protein sequences p InterPro www.ebi.ac.uk Protein domains p OMIM www.ncbi.nlm.nih.gov Genetic diseases p Enzymes www.expasy.org Enzymes p PDB www.rcsb.org/pdb/ Protein structures p KEGG www.genome.ad.jp Metabolic pathways Sophia Kossida, Introduction to Bioinformatics, Summer 2008
  • 81.
    81 FASTA format p Asimple format for DNA and protein sequence data is FASTA >Hepatitis delta virus, complete genome atgagccaagttccgaacaaggattcgcggggaggatagatcagcgcccgagaggggtga gtcggtaaagagcattggaacgtcggagatacaactcccaagaaggaaaaaagagaaagc aagaagcggatgaatttccccataacgccagtgaaactctaggaaggggaaagagggaag gtggaagagaaggaggcgggcctcccgatccgaggggcccggcggccaagtttggaggac actccggcccgaagggttgagagtaccccagagggaggaagccacacggagtagaacaga gaaatcacctccagaggaccccttcagcgaacagagagcgcatcgcgagagggagtagac catagcgataggaggggatgctaggagttgggggagaccgaagcgaggaggaaagcaaag agagcagcggggctagcaggtgggtgttccgccccccgagaggggacgagtgaggcttat cccggggaactcgacttatcgtccccacatagcagactcccggaccccctttcaaagtga … Header line, begins with >
  • 82.
  • 83.
    83 Recap p DNA codesinformation with alphabet of 4 letters: A, C, G, T p In proteins, the alphabet size is 20 p DNA -> RNA -> Protein (genetic code) n Three DNA bases (triplet, codon) code for one amino acid n Redundancy, start and stop codons
  • 84.
    84 1 atgagccaag ttccgaacaaggattcgcgg ggaggataga tcagcgcccg agaggggtga 61 gtcggtaaag agcattggaa cgtcggagat acaactccca agaaggaaaa aagagaaagc 121 aagaagcgga tgaatttccc cataacgcca gtgaaactct aggaagggga aagagggaag 181 gtggaagaga aggaggcggg cctcccgatc cgaggggccc ggcggccaag tttggaggac 241 actccggccc gaagggttga gagtacccca gagggaggaa gccacacgga gtagaacaga 301 gaaatcacct ccagaggacc ccttcagcga acagagagcg catcgcgaga gggagtagac 361 catagcgata ggaggggatg ctaggagttg ggggagaccg aagcgaggag gaaagcaaag 421 agagcagcgg ggctagcagg tgggtgttcc gccccccgag aggggacgag tgaggcttat 481 cccggggaac tcgacttatc gtccccacat agcagactcc cggaccccct ttcaaagtga 541 ccgagggggg tgactttgaa cattggggac cagtggagcc atgggatgct cctcccgatt 601 ccgcccaagc tccttccccc caagggtcgc ccaggaatgg cgggacccca ctctgcaggg 661 tccgcgttcc atcctttctt acctgatggc cggcatggtc ccagcctcct cgctggcgcc 721 ggctgggcaa cattccgagg ggaccgtccc ctcggtaatg gcgaatggga cccacaaatc 781 tctctagctt cccagagaga agcgagagaa aagtggctct cccttagcca tccgagtgga 841 cgtgcgtcct ccttcggatg cccaggtcgg accgcgagga ggtggagatg ccatgccgac 901 ccgaagagga aagaaggacg cgagacgcaa acctgcgagt ggaaacccgc tttattcact 961 ggggtcgaca actctgggga gaggagggag ggtcggctgg gaagagtata tcctatggga 1021 atccctggct tccccttatg tccagtccct ccccggtccg agtaaagggg gactccggga 1081 ctccttgcat gctggggacg aagccgcccc cgggcgctcc cctcgttcca ccttcgaggg 1141 ggttcacacc cccaacctgc gggccggcta ttcttctttc ccttctctcg tcttcctcgg 1201 tcaacctcct aagttcctct tcctcctcct tgctgaggtt ctttcccccc gccgatagct 1261 gctttctctt gttctcgagg gccttccttc gtcggtgatc ctgcctctcc ttgtcggtga 1321 atcctcccct ggaaggcctc ttcctaggtc cggagtctac ttccatctgg tccgttcggg 1381 ccctcttcgc cgggggagcc ccctctccat ccttatcttt ctttccgaga attcctttga 1441 tgtttcccag ccagggatgt tcatcctcaa gtttcttgat tttcttctta accttccgga 1501 ggtctctctc gagttcctct aacttctttc ttccgctcac ccactgctcg agaacctctt 1561 ctctcccccc gcggtttttc cttccttcgg gccggctcat cttcgactag aggcgacggt 1621 cctcagtact cttactcttt tctgtaaaga ggagactgct ggccctgtcg cccaagttcg 1681 ag Given a DNA sequence, we might ask a number of questions What sort of statistics should be used to describe the sequence? What sort of organism did this sequence come from? Does the description of this sequence differ from the description of other DNA in the organism? What sort of sequence is this? What does it do?
  • 85.
    85 Biological words p Wecan try to answer questions like these by considering the words in a sequence p A k-word (or a k-tuple) is a string of length k drawn from some alphabet p A DNA k-word is a string of length k that consists of letters A, C, G, T n 1-words: individual nucleotides (bases) n 2-words: dinucleotides (AA, AC, AG, AT, CA, ...) n 3-words: codons (AAA, AAC, …) n 4-words and beyond
  • 86.
    86 1-words: base composition pTypically DNA exists as duplex molecule (two complementary strands) 5’-GGATCGAAGCTAAGGGCT-3’ 3’-CCTAGCTTCGATTCCCGA-5’ Top strand: 7 G, 3 C, 5 A, 3 T Bottom strand: 3 G, 7 C, 3 A, 5 T Duplex molecule: 10 G, 10 C, 8 A, 8 T Base frequencies: 10/36 10/36 8/36 8/36 fr(G + C) = 20/36, fr(A + T) = 1 – fr(G + C) = 16/36 These are something we can determine experimentally.
  • 87.
    87 G+C content p fr(G+ C), or G+C content is a simple statistics for describing genomes p Notice that one value is enough characterise fr(A), fr(C), fr(G) and fr(T) for duplex DNA p Is G+C content (= base composition) able to tell the difference between genomes of different organisms? n Simple computational experiment, if we have the genome sequences under study (-> exercises)
  • 88.
    88 G+C content andgenome sizes (in megabasepairs, Mb) for various organisms p Mycoplasma genitalium 31.6% 0.585 p Escherichia coli K-12 50.7% 4.693 p Pseudomonas aeruginosa PAO1 66.4% 6.264 p Pyrococcus abyssi 44.6% 1.765 p Thermoplasma volcanium 39.9% 1.585 p Caenorhabditis elegans 36% 97 p Arabidopsis thaliana 35% 125 p Homo sapiens 41% 3080
  • 89.
    89 Base frequencies induplex molecules p Consider a DNA sequence generated randomly, with probability of each letter being independent of position in sequence p You could expect to find a uniform distribution of bases in genomes… p This is not, however, the case in genomes, especially in prokaryotes n This phenomena is called GC skew 5’-...GGATCGAAGCTAAGGGCT...-3’ 3’-...CCTAGCTTCGATTCCCGA...-5’
  • 90.
    90 DNA replication fork pWhen DNA is replicated, the molecule takes the replication fork form p New complementary DNA is synthesised at both strands of the ”fork” p New strand in 5’-3’ direction corresponding to replication fork movement is called leading strand and the other lagging strand Leading strand Lagging strand Replication fork Replication fork movement
  • 91.
    91 DNA replication fork pThis process has specific starting points in genome (origins of replication) p Observation: Leading strands have an excess of G over C p This can be described by GC skew statistics Lagging strand Replication fork Leading strand Replication fork movement
  • 92.
    92 GC skew p GCskew is defined as (#G - #C) / (#G + #C) p It is calculated at successive positions in intervals (windows) of specific width 5’-...GGATCGAAGCTAAGGGCT...-3’ 3’-...CCTAGCTTCGATTCCCGA...-5’ (3 – 2) / (3 + 2) = 1/5 (4 – 2) / (4 + 2) = 1/3
  • 93.
    93 p G-C content& GC skew statistics can be displayed with a circular genome map Chromosome map of S. dysenteriae, the nine rings describe different properties of the genome http://www.mgc.ac.cn/ShiBASE/circular_Sd197.htm G-C content & GC skew G+C content GC skew (10kb window size)
  • 94.
    94 GC skew p GCskew often changes sign at origins and termini of replication G+C content GC skew (10kb window size) Nie et al., BMC Genomics, 2006
  • 95.
    95 2-words: dinucleotides p Let’sconsider a sequence L1,L2,...,Ln where each letter Li is drawn from the DNA alphabet {A, C, G, T} p We have 16 possible dinucleotides lili+1: AA, AC, AG, ..., TG, TT.
  • 96.
    96 i.i.d. model fornucleotides p Assume that bases n occur independently of each other n bases at each position are identically distributed p Probability of the base A, C, G, T occuring is pA, pC, pG, pT, respectively n For example, we could use pA=pC=pG=pT=0.25 or estimate the values from known genome data p Probability of lili+1 is then PliPli+1 n For example, P(TG) = pT pG
  • 97.
    97 2-words: is whatwe see surprising? p We can test whether a sequence is ”unexpected”, for example, with a 2 test p Test statistic for a particular dinucleotide r1r2 is 2 = (O – E)2 / E where n O is the observed number of dinucleotide r1r2 n E is the expected number of dinucleotide r1r2 n E = (n – 1)pr1pr2 under i.i.d. model p Basic idea: high values of 2 indicate deviation from the model n Actual procedure is more detailed -> basic statistics courses
  • 98.
    98 Refining the i.i.d.model p i.i.d. model describes some organisms well (see Deonier’s book) but fails to characterise many others p We can refine the model by having the DNA letter at some position depend on letters at preceding positions …TCGTGACGCCG ? Sequence context to consider
  • 99.
    99 First-order Markov chains pLets assume that in sequence X the letter at position t, Xt, depends only on the previous letter Xt-1 (first-order markov chain) p Probability of letter j occuring at position t given Xt-1 = i: pij = P(Xt = j | Xt-1 = i) p We consider homogeneous markov chains: probability pij is independent of position t …TCGTGACGCCG ? Xt Xt-1
  • 100.
    100 Estimating pij p Wecan estimate probabilities pij (”the probability that j follows i”) from observed dinucleotide frequencies A C G T A pAA pAC pAG pAT C pCA pCC pCG pCT G pGA pGC pGG pGT T pTA pTC pTG pTT Frequency of dinucleotide AT in sequence …the values pAA, pAC, ..., pTG, pTT sum to 1 + + + Base frequency fr(C)
  • 101.
    101 Estimating pij p pij= P(Xt = j | Xt-1 = i) = P(Xt = j, Xt-1 = i) P(Xt-1 = i) Probability of transition i -> j Dinucleotide frequency Base frequency of nucleotide i, fr(i) A C G T A 0.146 0.052 0.058 0.089 C 0.063 0.029 0.010 0.056 G 0.050 0.030 0.028 0.051 T 0.086 0.047 0.063 0.140 P(Xt = j, Xt-1 = i) A C G T A 0.423 0.151 0.168 0.258 C 0.399 0.184 0.063 0.354 G 0.314 0.189 0.176 0.321 T 0.258 0.138 0.187 0.415 P(Xt = j | Xt-1 = i) 0.052 / 0.345 0.151
  • 102.
    102 Simulating a DNAsequence p From a transition matrix, it is easy to generate a DNA sequence of length n: n First, choose the starting base randomly according to the base frequency distribution n Then, choose next base according to the distribution P(xt | xt-1) until n bases have been chosen A C G T A 0.423 0.151 0.168 0.258 C 0.399 0.184 0.063 0.354 G 0.314 0.189 0.176 0.321 T 0.258 0.138 0.187 0.415 P(Xt = j | Xt-1 = i) T T C T T C A A Look for R code in Deonier’s book
  • 103.
    103 Simulating a DNAsequence ttcttcaaaataaggatagtgattcttattggcttaagggataacaatttagatcttttttcatgaatcatgtatgtcaacgttaaaagttgaactgcaataagttc ttacacacgattgtttatctgcgtgcgaagcatttcactacatttgccgatgcagccaaaagtatttaacatttggtaaacaaattgacttaaatcgcgcacttaga gtttgacgtttcatagttgatgcgtgtctaacaattacttttagttttttaaatgcgtttgtctacaatcattaatcagctctggaaaaacattaatgcatttaaac cacaatggataattagttacttattttaaaattcacaaagtaattattcgaatagtgccctaagagagtactggggttaatggcaaagaaaattactgtagtgaaga ttaagcctgttattatcacctgggtactctggtgaatgcacataagcaaatgctacttcagtgtcaaagcaaaaaaatttactgataggactaaaaaccctttattt ttagaatttgtaaaaatgtgacctcttgcttataacatcatatttattgggtcgttctaggacactgtgattgccttctaactcttatttagcaaaaaattgtcata gctttgaggtcagacaaacaagtgaatggaagacagaaaaagctcagcctagaattagcatgttttgagtggggaattacttggttaactaaagtgttcatgactgt tcagcatatgattgttggtgagcactacaaagatagaagagttaaactaggtagtggtgatttcgctaacacagttttcatacaagttctattttctcaatggtttt ggataagaaaacagcaaacaaatttagtattattttcctagtaaaaagcaaacatcaaggagaaattggaagctgcttgttcagtttgcattaaattaaaaatttat ttgaagtattcgagcaatgttgacagtctgcgttcttcaaataagcagcaaatcccctcaaaattgggcaaaaacctaccctggcttctttttaaaaaaccaagaaa agtcctatataagcaacaaatttcaaaccttttgttaaaaattctgctgctgaataaataggcattacagcaatgcaattaggtgcaaaaaaggccatcctctttct ttttttgtacaattgttcaagcaactttgaatttgcagattttaacccactgtctatatgggacttcgaattaaattgactggtctgcatcacaaatttcaactgcc caatgtaatcatattctagagtattaaaaatacaaaaagtacaattagttatgcccattggcctggcaatttatttactccactttccacgttttggggatatttta acttgaatagttcacaatcaaaacataggaaggatctactgctaaaagcaaaagcgtattggaatgataaaaaactttgatgtttaaaaaactacaaccttaatgaa ttaaagttgaaaaaatattcaaaaaaagaaattcagttcttggcgagtaatatttttgatgtttgagatcagggttacaaaataagtgcatgagattaactcttcaa atataaactgatttaagtgtatttgctaataacattttcgaaaaggaatattatggtaagaattcataaaaatgtttaatactgatacaactttcttttatatcctc catttggccagaatactgttgcacacaactaattggaaaaaaaatagaacgggtcaatctcagtgggaggagaagaaaaaagttggtgcaggaaatagtttctacta acctggtataaaaacatcaagtaacattcaaattgcaaatgaaaactaaccgatctaagcattgattgatttttctcatgcctttcgcctagttttaataaacgcgc cccaactctcatcttcggttcaaatgatctattgtatttatgcactaacgtgcttttatgttagcatttttcaccctgaagttccgagtcattggcgtcactcacaa atgacattacaatttttctatgttttgttctgttgagtcaaagtgcatgcctacaattctttcttatatagaactagacaaaatagaaaaaggcacttttggagtct gaatgtcccttagtttcaaaaaggaaattgttgaattttttgtggttagttaaattttgaacaaactagtatagtggtgacaaacgatcaccttgagtcggtgacta taaaagaaaaaggagattaaaaatacctgcggtgccacattttttgttacgggcatttaaggtttgcatgtgttgagcaattgaaacctacaactcaataagtcatg ttaagtcacttctttgaaaaaaaaaaagaccctttaagcaagctc p Now we can quickly generate sequences of arbitrary length...
  • 104.
    104 Simulating a DNAsequence aa 0.145 0.146 ac 0.050 0.052 ag 0.055 0.058 at 0.092 0.089 ca 0.065 0.063 cc 0.028 0.029 cg 0.011 0.010 ct 0.058 0.056 ga 0.048 0.050 gc 0.032 0.030 gg 0.029 0.028 gt 0.050 0.051 ta 0.084 0.086 tc 0.052 0.047 tg 0.064 0.063 tt 0.138 0.0140 Dinucleotide frequencies Simulated Observed n = 10000
  • 105.
    105 Simulating a DNAsequence ttcttcaaaataaggatagtgattcttattggcttaagggataacaatttagatcttttttcatgaatcatgtatgtcaacgttaaaagttgaactgcaataagttc ttacacacgattgtttatctgcgtgcgaagcatttcactacatttgccgatgcagccaaaagtatttaacatttggtaaacaaattgacttaaatcgcgcacttaga gtttgacgtttcatagttgatgcgtgtctaacaattacttttagttttttaaatgcgtttgtctacaatcattaatcagctctggaaaaacattaatgcatttaaac cacaatggataattagttacttattttaaaattcacaaagtaattattcgaatagtgccctaagagagtactggggttaatggcaaagaaaattactgtagtgaaga ttaagcctgttattatcacctgggtactctggtgaatgcacataagcaaatgctacttcagtgtcaaagcaaaaaaatttactgataggactaaaaaccctttattt ttagaatttgtaaaaatgtgacctcttgcttataacatcatatttattgggtcgttctaggacactgtgattgccttctaactcttatttagcaaaaaattgtcata gctttgaggtcagacaaacaagtgaatggaagacagaaaaagctcagcctagaattagcatgttttgagtggggaattacttggttaactaaagtgttcatgactgt tcagcatatgattgttggtgagcactacaaagatagaagagttaaactaggtagtggtgatttcgctaacacagttttcatacaagttctattttctcaatggtttt ggataagaaaacagcaaacaaatttagtattattttcctagtaaaaagcaaacatcaaggagaaattggaagctgcttgttcagtttgcattaaattaaaaatttat ttgaagtattcgagcaatgttgacagtctgcgttcttcaaataagcagcaaatcccctcaaaattgggcaaaaacctaccctggcttctttttaaaaaaccaagaaa agtcctatataagcaacaaatttcaaaccttttgttaaaaattctgctgctgaataaataggcattacagcaatgcaattaggtgcaaaaaaggccatcctctttct ttttttgtacaattgttcaagcaactttgaatttgcagattttaacccactgtctatatgggacttcgaattaaattgactggtctgcatcacaaatttcaactgcc caatgtaatcatattctagagtattaaaaatacaaaaagtacaattagttatgcccattggcctggcaatttatttactccactttccacgttttggggatatttta acttgaatagttcacaatcaaaacataggaaggatctactgctaaaagcaaaagcgtattggaatgataaaaaactttgatgtttaaaaaactacaaccttaatgaa ttaaagttgaaaaaatattcaaaaaaagaaattcagttcttggcgagtaatatttttgatgtttgagatcagggttacaaaataagtgcatgagattaactcttcaa atataaactgatttaagtgtatttgctaataacattttcgaaaaggaatattatggtaagaattcataaaaatgtttaatactgatacaactttcttttatatcctc catttggccagaatactgttgcacacaactaattggaaaaaaaatagaacgggtcaatctcagtgggaggagaagaaaaaagttggtgcaggaaatagtttctacta acctggtataaaaacatcaagtaacattcaaattgcaaatgaaaactaaccgatctaagcattgattgatttttctcatgcctttcgcctagttttaataaacgcgc cccaactctcatcttcggttcaaatgatctattgtatttatgcactaacgtgcttttatgttagcatttttcaccctgaagttccgagtcattggcgtcactcacaa atgacattacaatttttctatgttttgttctgttgagtcaaagtgcatgcctacaattctttcttatatagaactagacaaaatagaaaaaggcacttttggagtct gaatgtcccttagtttcaaaaaggaaattgttgaattttttgtggttagttaaattttgaacaaactagtatagtggtgacaaacgatcaccttgagtcggtgacta taaaagaaaaaggagattaaaaatacctgcggtgccacattttttgttacgggcatttaaggtttgcatgtgttgagcaattgaaacctacaactcaataagtcatg ttaagtcacttctttgaaaaaaaaaaagaccctttaagcaagctc p The model is able to generate correct proportions of 1- and 2-words in genomes... p ...but fails with k=3 and beyond.
  • 106.
    106 3-words: codons p Wecan extend the previous method to 3- words p k=3 is an important case in study of DNA sequences because of genetic code 5’ 3’ 3’ 5’ … a t g a g t g g a … … t a c t c a c c t … a u g a g u g g a ... M S G …
  • 107.
    107 3-word probabilities p Let’sagain assume a sequence L of independent bases p Probability of 3-word r1r2r3 at position i,i+1,i+2 in sequence L is P(Li = r1, Li+1 = r2, Li+2 = r3) = P(Li = r1)P(Li+1 = r2)P(Li+2 = r3)
  • 108.
    108 3-words in Escherichiacoli genome AAA 108924 0.02348 0.01492 AAC 82582 0.01780 0.01541 AAG 63369 0.01366 0.01537 AAT 82995 0.01789 0.01490 ACA 58637 0.01264 0.01541 ACC 74897 0.01614 0.01591 ACG 73263 0.01579 0.01588 ACT 49865 0.01075 0.01539 AGA 56621 0.01220 0.01537 AGC 80860 0.01743 0.01588 AGG 50624 0.01091 0.01584 AGT 49772 0.01073 0.01536 ATA 63697 0.01373 0.01490 ATC 86486 0.01864 0.01539 ATG 76238 0.01643 0.01536 ATT 83398 0.01797 0.01489 CAA 76614 0.01651 0.01541 CAC 66751 0.01439 0.01591 CAG 104799 0.02259 0.01588 CAT 76985 0.01659 0.01539 CCA 86436 0.01863 0.01591 CCC 47775 0.01030 0.01643 CCG 87036 0.01876 0.01640 CCT 50426 0.01087 0.01589 CGA 70938 0.01529 0.01588 CGC 115695 0.02494 0.01640 CGG 86877 0.01872 0.01636 CGT 73160 0.01577 0.01586 CTA 26764 0.00577 0.01539 CTC 42733 0.00921 0.01589 CTG 102909 0.02218 0.01586 CTT 63655 0.01372 0.01537 Word Count Observed Expected Word Count Observed Expected
  • 109.
    109 3-words in Escherichiacoli genome GAA 83494 0.01800 0.01537 GAC 54737 0.01180 0.01588 GAG 42465 0.00915 0.01584 GAT 86551 0.01865 0.01536 GCA 96028 0.02070 0.01588 GCC 92973 0.02004 0.01640 GCG 114632 0.02471 0.01636 GCT 80298 0.01731 0.01586 GGA 56197 0.01211 0.01584 GGC 92144 0.01986 0.01636 GGG 47495 0.01024 0.01632 GGT 74301 0.01601 0.01582 GTA 52672 0.01135 0.01536 GTC 54221 0.01169 0.01586 GTG 66117 0.01425 0.01582 GTT 82598 0.01780 0.01534 TAA 68838 0.01484 0.01490 TAC 52592 0.01134 0.01539 TAG 27243 0.00587 0.01536 TAT 63288 0.01364 0.01489 TCA 84048 0.01812 0.01539 TCC 56028 0.01208 0.01589 TCG 71739 0.01546 0.01586 TCT 55472 0.01196 0.01537 TGA 83491 0.01800 0.01536 TGC 95232 0.02053 0.01586 TGG 85141 0.01835 0.01582 TGT 58375 0.01258 0.01534 TTA 68828 0.01483 0.01489 TTC 83848 0.01807 0.01537 TTG 76975 0.01659 0.01534 TTT 109831 0.02367 0.01487 Word Count Observed Expected Word Count Observed Expected
  • 110.
    110 2nd order MarkovChains p Markov chains readily generalise to higher orders p In 2nd order markov chain, position t depends on positions t-1 and t-2 p Transition matrix: p Probabilistic models for DNA and amino acid sequences will be discussed in Biological sequence analysis course (II period) A C G T AA AC AG AT CA ...
  • 111.
    111 Codon Adaptation Index(CAI) p Observation: cells prefer certain codons over others in highly expressed genes n Gene expression: DNA is transcribed into RNA (and possibly translated into protein) Phe TTT 0.493 0.551 0.291 TTC 0.507 0.449 0.709 Ala GCT 0.246 0.145 0.275 GCC 0.254 0.276 0.164 GCA 0.246 0.196 0.240 GCG 0.254 0.382 0.323 Asn AAT 0.493 0.409 0.172 AAC 0.507 0.591 0.828 Amino acid Codon Predicted Gene class I Gene class II Highly expressed Moderately expressed Codon frequencies for some genes in E. coli
  • 112.
    112 Codon Adaptation Index(CAI) p CAI is a statistic used to compare the distribution of codons observed with the preferred codons for highly expressed genes
  • 113.
    113 Codon Adaptation Index(CAI) p Consider an amino acid sequence X = x1x2...xn p Let pk be the probability that codon k is used in highly expressed genes p Let qk be the highest probability that a codon coding for the same amino acid as codon k has n For example, if codon k is ”GCC”, the corresponding amino acid is Alanine (see genetic code table; also GCT, GCA, GCG code for Alanine) n Assume that pGCC = 0.164, pGCT = 0.275, pGCA = 0.240, pGCG = 0.323 n Now qGCC = qGCT = qGCA = qGCG = 0.323
  • 114.
    114 Codon Adaptation Index(CAI) p CAI is defined as p CAI can be given also in log-odds form: log(CAI) = (1/n) log(pk / qk) CAI = ( pk / qk ) k=1 n 1/n k=1 n
  • 115.
    115 CAI: example withan E. coli gene M A L T K A E M S E Y L … ATG GCG CTT ACA AAA GCT GAA ATG TCA GAA TAT CTG 1.00 0.47 0.02 0.45 0.80 0.47 0.79 1.00 0.43 0.79 0.19 0.02 0.06 0.02 0.47 0.20 0.06 0.21 0.32 0.21 0.81 0.02 0.28 0.04 0.04 0.28 0.03 0.04 0.20 0.03 0.05 0.20 0.01 0.03 0.01 0.04 0.01 0.89 0.18 0.89 ATG GCT TTA ACT AAA GCT GAA ATG TCT GAA TAT TTA GCC TTG ACC AAG GCC GAG TCC GAG TAC TTG GCA CTT ACA GCA TCA CTT GCG CTC ACG GCG TCG CTC CTA AGT CTA CTG AGC CTG 1.00 0.20 0.04 0.04 0.80 0.47 0.79 1.00 0.03 0.79 0.19 0.89… 1.00 0.47 0.89 0.47 0.80 0.47 0.79 1.00 0.43 0.79 0.81 0.89 1/n qk pk
  • 116.
    116 CAI: properties p CAI= 1.0 : each codon was the most frequently used codon in highly expressed genes p Log-odds used to avoid numerical problems n What happens if you multiply many values <1.0 together? p In a sample of E.coli genes, CAI ranged from 0.2 to 0.85 p CAI correlates with mRNA levels: can be used to predict high expression levels
  • 117.
    117 Biological words: summary pSimple 1-, 2- and 3-word models can describe interesting properties of DNA sequences n GC skew can identify DNA replication origins n It can also reveal genome rearrangement events and lateral transfer of DNA n GC content can be used to locate genes: human genes are comparably GC-rich n CAI predicts high gene expression levels
  • 118.
    118 Biological words: summary pk=3 models can help to identify correct reading frames n Reading frame starts from a start codon and stops in a stop codon n Consider what happens to translation when a single extra base is introduced in a reading frame p Also word models for k > 3 have their uses
  • 119.
    119 Next lecture p Genomesequencing & assembly – where do we get sequence data?
  • 120.
    120 Note on programminglanguages p Working with probability distributions is straightforward with R, for example n Deonier’s book contains many computational examples n You can use R in CS Linux systems p Python works too!
  • 121.
    121 #!/usr/bin/env python import sys,random n = int(sys.argv[1]) tm = {'a' : {'a' : 0.423, 'c' : 0.151, 'g' : 0.168, 't' : 0.258}, 'c' : {'a' : 0.399, 'c' : 0.184, 'g' : 0.063, 't' : 0.354}, 'g' : {'a' : 0.314, 'c' : 0.189, 'g' : 0.176, 't' : 0.321}, 't' : {'a' : 0.258, 'c' : 0.138, 'g' : 0.187, 't' : 0.415}} pi = {'a' : 0.345, 'c' : 0.158, 'g' : 0.159, 't' : 0.337} def choose(dist): r = random.random() sum = 0.0 keys = dist.keys() for k in keys: sum += dist[k] if sum > r: return k return keys[-1] c = choose(pi) for i in range(n - 1): sys.stdout.write(c) c = choose(tm[c]) sys.stdout.write(c) sys.stdout.write("n") Example Python code for generating DNA sequences with first-order Markov chains. Function choose(), returns a key (here ’a’, ’c’, ’g’ or ’t’) of the dictionary ’dist’ chosen randomly according to probabilities in dictionary values. Choose the first letter, then choose next letter according to P(xt | xt-1). Transition matrix tm and initial distribution pi. Initialisation: use packages ’sys’ and ’random’, read sequence length from input.
  • 122.
  • 123.
    123 Genome sequencing &assembly p DNA sequencing n How do we obtain DNA sequence information from organisms? p Genome assembly n What is needed to put together DNA sequence information from sequencing? p First statement of sequence assembly problem (according to G. Myers): n Peltola, Söderlund, Tarhio, Ukkonen: Algorithms for some string matching problems arising in molecular genetics. Proc. 9th IFIP World Computer Congress, 1983
  • 124.
  • 125.
    125 DNA sequencing p DNAsequencing: resolving a nucleotide sequence (whole-genome or less) p Many different methods developed n Maxam-Gilbert method (1977) n Sanger method (1977) n High-throughput methods
  • 126.
    126 Sanger sequencing: sequencingby synthesis p A sequencing technique developed by Fred Sanger p Also called dideoxy sequencing
  • 127.
    127 http://en.wikipedia.org/wiki/DNA_polymerase DNA polymerase p ADNA polymerase is an enzyme that catalyzes DNA synthesis p DNA polymerase needs a primer n Synthesis proceeds always in 5’->3’ direction
  • 128.
    128 Dideoxy sequencing p InSanger sequencing, chain-terminating dideoxynucleoside triphosphates (ddXTPs) are employed n ddATP, ddCTP, ddGTP, ddTTP lack the 3’-OH tail of dXTPs p A mixture of dXTPs with small amount of ddXTPs is given to DNA polymerase with DNA template and primer p ddXTPs are given fluorescent labels
  • 129.
    129 Dideoxy sequencing p WhenDNA polymerase encounters a ddXTP, the synthesis cannot proceed p The process yields copied sequences of different lengths p Each sequence is terminated by a labeled ddXTP
  • 130.
    130 Determining the sequence pSequences are sorted according to length by capillary electrophoresis p Fluorescent signals corresponding to labels are registered p Base calling: identifying which base corresponds to each position in a read n Non-trivial problem! Output sequences from base calling are called reads
  • 131.
    131 Reads are short! pModern Sanger sequencers can produce quality reads up to ~750 bases1 n Instruments provide you with a quality file for bases in reads, in addition to actual sequence data p Compare the read length against the size of the human genome (2.9x109 bases) p Reads have to be assembled! 1 Nature Methods - 5, 16 - 18 (2008)
  • 132.
    132 Problems with sequencing pSanger sequencing error rate per base varies from 1% to 3%1 p Repeats in DNA n For example, ~300 base Alu sequence repeated is over million times in human genome n Repeats occur in different scales p What happens if repeat length is longer than read length? n We will get back to this problem later 1 Jones, Pevzner (2004)
  • 133.
    133 Shortest superstring problem pFind the shortest string that ”explains” the reads p Given a set of strings (reads), find a shortest string that contains all of them
  • 134.
    134 Example: Shortest superstring Setof strings: {000, 001, 010, 011, 100, 101, 110, 111} Concetenation of strings: 000001010011100101110111 010 110 011 000 Shortest superstring: 0001110100 001 111 101 100
  • 135.
    135 Shortest superstrings: issues pNP-complete problem: unlike to have an efficient (exact) algorithm p Reads may be from either strand of DNA p Is the shortest string necessarily the correct assembly? p What about errors in reads? p Low coverage -> gaps in assembly n Coverage: average number of times each base occurs in the set of reads (e.g., 5x coverage)
  • 136.
    136 Sequence assembly andcombination locks p What is common with sequence assembly and opening keypad locks?
  • 137.
    137 Whole-genome shotgun sequence pWhole-genome shotgun sequence assembly starts with a large sample of genomic DNA 1. Sample is randomly partitioned into inserts of length > 500 bases 2. Inserts are multiplied by cloning them into a vector which is used to infect bacteria 3. DNA is collected from bacteria and sequenced 4. Reads are assembled
  • 138.
    138 Assembly of readswith Overlap-Layout- Consensus algorithm p Overlap n Finding potentially overlapping reads p Layout n Finding the order of reads along DNA p Consensus (Multiple alignment) n Deriving the DNA sequence from the layout p Next, the method is described at a very abstract level, skipping a lot of details Kececioglu, J.D. and E.W. Myers. 1995. Combinatorial algorithms for DNA sequence assembly. Algorithmica 13: 7-51.
  • 139.
    139 Finding overlaps p First,pairwise overlap alignment of reads is resolved p Reads can be from either DNA strand: The reverse complement r* of each read r has to be considered acggagtcc agtccgcgctt 5’ 3’ 3’ 5’ … a t g a g t g g a … … t a c t c a c c t … r1 r2 r1: tgagt, r1 *: actca r2: tccac, r2 *: gtgga
  • 140.
    140 Example sequence toassemble p 20 reads: 5’ – CAGCGCGCTGCGTGACGAGTCTGACAAAGACGGTATGCGCATCG TGATTGAAGTGAAACGCGATGCGGTCGGTCGGTGAAGTTGTGCT - 3’ # Read Read* 1 CATCGTCA TCACGATG 2 CGGTGAAG CTTCACCG 3 TATGCGCA TGCGCATA 4 GACGAGTC GACTCGTC 5 CTGACAAA TTTGTCAG 6 ATGCGCAT ATGCGCAT 7 ATGCGGTC GACCGCAT 8 CTGCGTGA TCACGCAG 9 GCGTGACG CGTCACGC 10 GTCGGTGA TCACCGAC # Read Read* 11 GGTCGGTG CACCGACC 12 ATCGTGAT ATCACGAT 13 GCGCTGCG CGCAGCGC 14 GCATCGTG CACGATGC 15 AGCGCGCT AGCGCGCT 16 GAAGTTGT ACAACTTC 17 AGTGAAAC GTTTCACT 18 ACGCGATG CATCGCGT 19 GCGCATCG CGATGCGC 20 AAGTGAAA TTTCACTT
  • 141.
    141 Finding overlaps p Overlapbetween two reads can be found with a dynamic programming algorithm n Errors can be taken into account p Dynamic programming will be discussed more on next lecture p Overlap scores stored into the overlap matrix n Entries (i, j) below the diagonal denote overlap of read ri and rj * 1 CATCGTCA 6 ATGCGCAT 12 ATCGTGAT Overlap(1, 6) = 3 Overlap(1, 12) = 7 1 6 12 3 7
  • 142.
    142 Finding layout &consensus p Method extends the assembly greedily by choosing the best overlaps p Both orientations are considered p Sequence is extended as far as possible 7* GACCGCAT 6=6* ATGCGCAT 14 GCATCGTG 1 CATCGTGA 12 ATCGTGAT 19 GCGCATCG 13* CGCAGCGC --------------------- CGCATCGTGAT Ambiguous bases Consensus sequence
  • 143.
    143 Finding layout &consensus p We move on to next best overlaps and extend the sequence from there p The method stops when there are no more overlaps to consider p A number of contigs is produced p Contig stands for contiguous sequence, resulting from merging reads 2 CGGTGAAG 10 GTCGGTGA 11 GGTCGGTG 7 ATGCGGTC --------------------- ATGCGGTCGGTGAAG
  • 144.
    144 Whole-genome shotgun sequencing: summary pOrdering of the reads is initially unknown p Overlaps resolved by aligning the reads p In a 3x109 bp genome with 500 bp reads and 5x coverage, there are ~107 reads and ~107(107-1)/2 = ~5x1013 pairwise sequence comparisons … … Original genome sequence Reads Non-overlapping read Overlapping reads => Contig
  • 145.
    145 Repeats in DNAand genome assembly Pop, Salzberg, Shumway (2002) Two instances of the same repeat
  • 146.
    146 Repeats in DNAcause problems in sequence assembly p Recap: if repeat length exceeds read length, we might not get the correct assembly p This is a problem especially in eukaryotes n ~3.1% of genome consists of repeats in Drosophila, ~45% in human p Possible solutions 1. Increase read length – feasible? 2. Divide genome into smaller parts, with known order, and sequence parts individually
  • 147.
    147 ”Divide and conquer”sequencing approaches: BAC-by-BAC Whole-genome shotgun sequencing Divide-and-conquer Genome Genome BAC library
  • 148.
    148 BAC-by-BAC sequencing p EachBAC (Bacterial Artificial Chromosome) is about 150 kbp p Covering the human genome requires ~30000 BACs p BACs shotgun-sequenced separately n Number of repeats in each BAC is significantly smaller than in the whole genome... n ...needs much more manual work compared to whole-genome shotgun sequencing
  • 149.
    149 Hybrid method p Divide-and-conquerand whole-genome shotgun approaches can be combined n Obtain high coverage from whole-genome shotgun sequencing for short contigs n Generate of a set of BAC contigs with low coverage n Use BAC contigs to ”bin” short contigs to correct places p This approach was used to sequence the brown Norway rat genome in 2004
  • 150.
    150 Paired end sequencing pPaired end (or mate-pair) sequencing is technique where n both ends of an insert are sequenced n For each insert, we get two reads n We know the distance between reads, and that they are in opposite orientation n Typically read length < insert length k Read 1 Read 2
  • 151.
    151 Paired end sequencing pThe key idea of paired end sequencing: n Both reads from an insert are unlikely to be in repeat regions n If we know where the first read is, we know also second’s location p This technique helps to WGSS higher organisms k Read 1 Read 2 Repeat region
  • 152.
    152 First whole-genome shotgunsequencing project: Drosophila melanogaster p Fruit fly is a common model organism in biological studies p Whole-genome assembly reported in Eugene Myers, et al., A Whole-Genome Assembly of Drosophila, Science 24, 2000 p Genome size 120 Mbp http://en.wikipedia.org/wiki/Drosophila_melanogaster
  • 153.
    153 Sequencing of theHuman Genome p The (draft) human genome was published in 2001 p Two efforts: n Human Genome Project (public consortium) n Celera (private company) p HGP: BAC-by-BAC approach p Celera: whole-genome shotgun sequencing HGP: Nature 15 February 2001 Vol 409 Number 6822 Celera: Science 16 February 2001 Vol 291, Issue 5507
  • 154.
    154 Genome assembly software pphrap (Phil’s revised assembly program) p AMOS (A Modular, Open-Source whole- genome assembler) p CAP3 / PCAP p TIGR assembler
  • 155.
    155 Next generation sequencingtechniques p Sanger sequencing is the prominent first- generation sequencing method p Many new sequencing methods are emerging p See Lars Paulin’s slides (course web page) for details
  • 156.
    156 Next-gen sequencing: 454 pGenome Sequencer FLX (454 Life Science / Roche) n >100 Mb / 7.5 h run n Read length 250-300 bp n >99.5% accuracy / base in a single run n >99.99% accuracy / base in consensus
  • 157.
    157 Next-gen sequencing: IlluminaSolexa p Illumina / Solexa Genome Analyzer n Read length 35 - 50 bp n 1-2 Gb / 3-6 day run n > 98.5% accuracy / base in a single run n 99.99% accuracy / consensus with 3x coverage
  • 158.
    158 Next-gen sequencing: SOLiD pSOLiD n Read length 25-30 bp n 1-2 Gb / 5-10 day run n >99.94% accuracy / base n >99.999% accuracy / consensus with 15x coverage
  • 159.
    159 Next-gen sequencing: Helicos pHelicos: Single Molecule Sequencer n No amplification of sequences needed n Read length up to 55 bp p Accuracy does not decrease when read length is increased p Instead, throughput goes down n 25-90 Mb / h n >2 Gb / day
  • 160.
    160 Next-gen sequencing: Pacific Biosciences pPacific Biosciences n Single-Molecule Real-Time (SMRT) DNA sequencing technology n Read length “thousands of nucleotides” p Should overcome most problems with repeats n Throughput estimate: 100 Gb / hour n First instruments in 2010?
  • 161.
    161 Exercise groups p 1stgroup: Tuesdays 16.15-18.00 C221 p 2nd group: Wednesday 14.15-16.00 B120 p You can choose freely which group you want to attend to p Send exercise notes before the 1st group starts (Tue 16.15), even if you go to the 2nd group
  • 162.
  • 163.
    163 Sequence alignment p Thebiological problem p Global alignment p Local alignment p Multiple alignment
  • 164.
    164 Background: comparative genomics pBasic question in biology: what properties are shared among organisms? p Genome sequencing allows comparison of organisms at DNA and protein levels p Comparisons can be used to n Find evolutionary relationships between organisms n Identify functionally conserved sequences n Identify corresponding genes in human and model organisms: develop models for human diseases
  • 165.
    165 Homologs p Two genes(sequences in general) gB and gC evolved from the same ancestor gene gA are called homologs p Homologs usually exhibit conserved functions p Close evolutionary relationship => expect a high number of homologs gB = agtgccgttaaagttgtacgtc gC = ctgactgtttgtggttc gA = agtgtccgttaagtgcgttc
  • 166.
    166 p We expecthomologs to be ”similar” to each other p Intuitively, similarity of two sequences refers to the degree of match between corresponding positions in sequence p What about sequences that differ in length? Sequence similarity agtgccgttaaagttgtacgtc ctgactgtttgtggttc
  • 167.
    167 Similarity vs homology pSequence similarity is not sequence homology n If the two sequences gB and gC have accumulated enough mutations, the similarity between them is likely to be low Homology is more difficult to detect over greater evolutionary distances. 0 agtgtccgttaagtgcgttc 1 agtgtccgttatagtgcgttc 2 agtgtccgcttatagtgcgttc 4 agtgtccgcttaagggcgttc 8 agtgtccgcttcaaggggcgt 16 gggccgttcatgggggt 32 gcagggcgtcactgagggct 64 acagtccgttcgggctattg 128 cagagcactaccgc 256 cacgagtaagatatagct 512 taatcgtgata 1024 acccttatctacttcctggagtt 2048 agcgacctgcccaa 4096 caaac #mutations #mutations
  • 168.
    168 Similarity vs homology(2) p Sequence similarity can occur by chance n Similarity does not imply homology p Consider comparing two short sequences against each other
  • 169.
    169 Orthologs and paralogs pWe distinguish between two types of homology n Orthologs: homologs from two different species, separated by a speciation event n Paralogs: homologs within a species, separated by a gene duplication event gA gB gC Organism B Organism C gA gA gA’ gB gC Organism A Gene duplication event Orthologs Paralogs
  • 170.
    170 Orthologs and paralogs(2) p Orthologs typically retain the original function p In paralogs, one copy is free to mutate and acquire new function (no selective pressure) gA gB gC Organism B Organism C gA gA gA’ gB gC Organism A
  • 171.
    171 Paralogy example: hemoglobin pHemoglobin is a protein complex which transports oxygen p In humans, hemoglobin consists of four protein subunits and four non- protein heme groups Hemoglobin A, www.rcsb.org/pdb/explore.do?structureId=1GZX Sickle cell diseases are caused by mutations in hemoglobin genes http://en.wikipedia.org/wiki/Image:Sicklecells.jpg
  • 172.
    172 Paralogy example: hemoglobin pIn adults, three types are normally present n Hemoglobin A: 2 alpha and 2 beta subunits n Hemoglobin A2: 2 alpha and 2 delta subunits n Hemoglobin F: 2 alpha and 2 gamma subunits p Each type of subunit (alpha, beta, gamma, delta) is encoded by a separate gene Hemoglobin A, www.rcsb.org/pdb/explore.do?structureId=1GZX
  • 173.
    173 Paralogy example: hemoglobin pThe subunit genes are paralogs of each other, i.e., they have a common ancestor gene p Exercise: hemoglobin human paralogs in NCBI sequence databases http://www.ncbi.nlm.nih.gov/sites/entre z?db=Nucleotide n Find human hemoglobin alpha, beta, gamma and delta n Compare sequences Hemoglobin A, www.rcsb.org/pdb/explore.do?structureId=1GZX
  • 174.
    174 Orthology example: insulin pThe genes coding for insulin in human (Homo sapiens) and mouse (Mus musculus) are orthologs: n They have a common ancestor gene in the ancestor species of human and mouse n Exercise: find insulin orthologs from human and mouse in NCBI sequence databases
  • 175.
    175 Sequence alignment p Alignmentspecifies which positions in two sequences match acgtctag ||||| -actctag 5 matches 2 mismatches 1 not aligned acgtctag || actctag- 2 matches 5 mismatches 1 not aligned acgtctag || ||||| ac-tctag 7 matches 0 mismatches 1 not aligned
  • 176.
    176 Sequence alignment p Maximumalignment length is the total length of the two sequences acgtctag------- --------actctag 0 matches 0 mismatches 15 not aligned -------acgtctag actctag-------- 0 matches 0 mismatches 15 not aligned
  • 177.
    177 Mutations: Insertions, deletionsand substitutions p Insertions and/or deletions are called indels n We can’t tell whether the ancestor sequence had a base or not at indel position! acgtctag ||||| -actctag Indel: insertion or deletion of a base with respect to the ancestor sequence Mismatch: substitution (point mutation) of a single base
  • 178.
    178 Problems p What sortsof alignments should be considered? p How to score alignments? p How to find optimal or good scoring alignments? p How to evaluate the statistical significance of scores? In this course, we discuss each of these problems briefly.
  • 179.
    179 Sequence Alignment (chapter6) p The biological problem p Global alignment p Local alignment p Multiple alignment
  • 180.
    180 Global alignment p Problem:find optimal scoring alignment between two sequences (Needleman & Wunsch 1970) p Every position in both sequences is included in the alignment p We give score for each position in alignment n Identity (match) +1 n Substitution (mismatch) -µ n Indel - p Total score: sum of position scores
  • 181.
    181 Scoring: Toy example pConsider two sequences with characters drawn from the English language alphabet: WHAT, WHY WHAT || WH-Y S(WHAT/WH-Y) = 1 + 1 – – µ WHAT -WHY S(WHAT/-WHY) = – – µ – µ – µ
  • 182.
    182 Dynamic programming p Howto find the optimal alignment? p We use previous solutions for optimal alignments of smaller subsequences p This general approach is known as dynamic programming
  • 183.
    183 Introduction to dynamicprogramming: the money change problem p Suppose you buy a pen for 4.23€ and pay for with a 5€ note p You get 77 cents in change – what coins is the cashier going to give you if he or she tries to minimise the number of coins? p The usual algorithm: start with largest coin (denominator), proceed to smaller coins until no change is left: n 50, 20, 5 and 2 cents p This greedy algorithm is incorrect, in the sense that it does not always give you the correct answer
  • 184.
    184 The money changeproblem p How else to compute the change? p We could consider all possible ways to reduce the amount of change p Suppose we have 77 cents change, and the following coins: 50, 20, 5 cents p We can compute the change with recursion n C(n) = min { C(n – 50) + 1, C(n – 20) + 1, C(n – 5) + 1 } p Figure shows the recursion tree for the example 77 72 27 57 7 22 7 37 52 22 52 67 … 50 20 5 p Many values are computed more than once! p This leads to a correct but very inefficient algorithm
  • 185.
    185 The money changeproblem p We can speed the computation up by solving the change problem for all i n n Example: solve the problem for 9 cents with available coins being 1, 2 and 5 cents p Solve the problem in steps, first for 1 cent, then 2 cents, and so on p In each step, utilise the solutions from the previous steps
  • 186.
    186 The money changeproblem 0 1 2 3 4 5 6 7 8 9 Amount of change left p Algorithm runs in time proportional to Md, where M is the amount of change and d is the number of coin types p The same technique of storing solutions of subproblems can be utilised in aligning sequences Number of coins used 0 1 1 2 2 1 2 2 3 3
  • 187.
    187 Representing alignments andscores Y H W - T A H W - WHAT || WH-Y Alignments can be represented in the following tabular form. Each alignment corresponds to a path through the table.
  • 188.
    188 Representing alignments andscores Y H W - T A H W - WHAT--- ----WHY WH-AT || WHY--
  • 189.
    189 Y H W - T A H W - WHAT || WH-Y Global alignment score S3,4= 2- -µ 2- -µ 2- 2 1 0 Representing alignments and scores
  • 190.
    190 Filling the alignmentmatrix Y H W - T A H W - Case 1 Case 2 Case 3 Consider the alignment process at shaded square. Case 1. Align H against H (match) Case 2. Align H in WHY against – (indel) in WHAT Case 3. Align H in WHAT against – (indel) in WHY
  • 191.
    191 Filling the alignmentmatrix (2) Y H W - T A H W - Case 1 Case 2 Case 3 Scoring the alternatives. Case 1. S2,2 = S1,1 + s(2, 2) Case 2. S2,2 = S1,2 Case 3. S2,2 = S2,1 s(i, j) = 1 for matching positions, s(i, j) = - µ for substitutions. Choose the case (path) that yields the maximum score. Keep track of path choices.
  • 192.
    192 Global alignment: formal development A= a1a2a3…an, B = b1b2b3…bm a3 a2 a1 - b4 b3 b2 b1 - 3 2 1 0 4 3 2 1 0 b1 b2 b3 b4 - - - a1 a2 a3 l Any alignment can be written as a unique path through the matrix l Score for aligning A and B up to positions i and j: Si,j = S(a1a2a3…ai, b1b2b3…bj)
  • 193.
    193 Scoring partial alignments pAlignment of A = a1a2a3…ai with B = b1b2b3…bj can be end in three possible ways n Case 1: (a1a2…ai-1) ai (b1b2…bj-1) bj n Case 2: (a1a2…ai-1) ai (b1b2…bj) - n Case 3: (a1a2…ai) – (b1b2…bj-1) bj
  • 194.
    194 Scoring alignments p Scoresfor each case: n Case 1: (a1a2…ai-1) ai (b1b2…bj-1) bj n Case 2: (a1a2…ai-1) ai (b1b2…bj) – n Case 3: (a1a2…ai) – (b1b2…bj-1) bj s(ai, bj) = { -µ otherwise +1 if ai = bj s(ai, -) = s(-, bj) = -
  • 195.
    195 Scoring alignments (2) pFirst row and first column correspond to initial alignment against indels: S(i, 0) = -i S(0, j) = -j p Optimal global alignment score S(A, B) = Sn,m a3 a2 a1 - b4 b3 b2 b1 - -3 3 -2 2 - 1 -4 -3 -2 - 0 0 4 3 2 1 0
  • 196.
    196 Algorithm for globalalignment Input sequences A, B, n = |A|, m = |B| Set Si,0 := - i for all i Set S0,j := - j for all j for i := 1 to n for j := 1 to m Si,j := max{Si-1,j – , Si-1,j-1 + s(ai,bj), Si,j-1 – } end end Algorithm takes O(nm) time
  • 197.
  • 198.
  • 199.
    199 Global alignment: example(2) -2 0 -3 -4 -7 -10 T -4 -3 -1 -2 -5 -8 G -5 -5 -3 -2 -3 -6 C -6 -4 -4 -2 -1 -4 T -9 -7 -5 -3 -1 -2 A -10 -8 -6 -4 -2 0 - G T G G T - µ = 1 = 2 ATCGT- | || -TGGTG
  • 200.
    200 Sequence Alignment (chapter6) p The biological problem p Global alignment p Local alignment p Multiple alignment
  • 201.
    201 Local alignment: rationale pOtherwise dissimilar proteins may have local regions of similarity -> Proteins may share a function Human bone morphogenic protein receptor type II precursor (left) has a 300 aa region that resembles 291 aa region in TGF- receptor (right). The shared function here is protein kinase.
  • 202.
    202 Local alignment: rationale pGlobal alignment would be inadequate p Problem: find the highest scoring local alignment between two sequences p Previous algorithm with minor modifications solves this problem (Smith & Waterman 1981) A B Regions of similarity
  • 203.
    203 From global tolocal alignment p Modifications to the global alignment algorithm n Look for the highest-scoring path in the alignment matrix (not necessarily through the matrix), or in other words: n Allow preceding and trailing indels without penalty
  • 204.
    204 Scoring local alignments A= a1a2a3…an, B = b1b2b3…bm Let I and J be intervals (substrings) of A and B, respectively: Best local alignment score: where S(I, J) is the alignment score for substrings I and J.
  • 205.
    205 Allowing preceding andtrailing indels p First row and column initialised to zero: Mi,0 = M0,j = 0 a3 a2 a1 - b4 b3 b2 b1 - 0 3 0 2 0 1 0 0 0 0 0 0 4 3 2 1 0 b1 b2 b3 - - a1
  • 206.
    206 Recursion for localalignment p Mi,j = max { Mi-1,j-1 + s(ai, bi), Mi-1,j – , Mi,j-1 – , 0 } 0 2 0 0 1 0 T 1 0 1 1 0 0 G 0 0 0 0 0 0 C 0 1 0 0 1 0 T 0 0 0 0 0 0 A 0 0 0 0 0 0 - G T G G T - Allow alignment to start anywhere in sequences
  • 207.
    207 Finding best localalignment p Optimal score is the highest value in the matrix = maxi,j Mi,j p Best local alignment can be found by backtracking from the highest value in M p What is the best local alignment in this example? 0 2 0 0 1 0 T 1 0 1 1 0 0 G 0 0 0 0 0 0 C 0 1 0 0 1 0 T 0 0 0 0 0 0 A 0 0 0 0 0 0 - G T G G T -
  • 208.
    208 Local alignment: example 0 G 8 0 G 7 0 A 6 0 A 5 0 T 4 0 C 3 0 C 2 0 A 1 0 0 0 0 0 0 0 0 0 0 0 - 0 A C T A A C T C G G - 10 9 8 7 6 5 4 3 2 1 0 Mi,j= max { Mi-1,j-1 + s(ai, bi), Mi-1,j , Mi,j-1 , 0 } 0 Scoring (for example) Match: +2 Mismatch: -1 Indel: -2
  • 209.
    209 Local alignment: example 0 G 8 0 G 7 0 A 6 0 A 5 0 T 4 0 C 3 0 C 2 0 A 1 0 0 0 0 0 0 0 0 0 0 0 - 0 A C T A A C T C G G - 10 9 8 7 6 5 4 3 2 1 0 Mi,j= max { Mi-1,j-1 + s(ai, bi), Mi-1,j , Mi,j-1 , 0 } 0 0 0 0 0 2 Scoring (for example) Match: +2 Mismatch: -1 Indel: -2
  • 210.
    210 C T –A A C T C A A Local alignment: example Scoring (for example) Match: +2 Mismatch: -1 Indel: -2 Optimal local alignment: 2 4 3 2 1 0 0 2 4 2 0 G 8 1 3 5 4 3 0 0 0 2 2 0 G 7 3 2 4 6 5 1 0 0 0 0 0 A 6 3 1 1 3 4 3 2 0 0 0 0 A 5 2 1 2 0 1 2 4 0 0 0 0 T 4 1 3 0 0 1 2 1 2 0 0 0 C 3 0 2 1 1 0 2 0 2 0 0 0 C 2 2 0 0 2 2 0 0 0 0 0 0 A 1 0 0 0 0 0 0 0 0 0 0 0 - 0 A C T A A C T C G G - 10 9 8 7 6 5 4 3 2 1 0
  • 211.
    211 Multiple optimal alignments Non-optimal,good-scoring alignments 2 4 3 2 1 0 0 2 4 2 0 G 8 1 3 5 4 3 0 0 0 2 2 0 G 7 3 2 4 6 5 1 0 0 0 0 0 A 6 3 1 1 3 4 3 2 0 0 0 0 A 5 2 1 2 0 1 2 4 0 0 0 0 T 4 1 3 0 0 1 2 1 2 0 0 0 C 3 0 2 1 1 0 2 0 2 0 0 0 C 2 2 0 0 2 2 0 0 0 0 0 0 A 1 0 0 0 0 0 0 0 0 0 0 0 - 0 A C T A A C T C G G - 10 9 8 7 6 5 4 3 2 1 0 How can you find 1. Optimal alignments if more than one exist? 2. Non-optimal, good-scoring alignments?
  • 212.
    212 Overlap alignment p Overlapmatrix used by Overlap-Layout- Consensus algorithm can be computed with dynamic programming p Initialization: Oi,0 = O0,j = 0 for all i, j p Recursion: Oi,j = max { Oi-1,j-1 + s(ai, bi), Oi-1,j – , Oi,j-1 – , } Best overlap: maximum value from rightmost column and bottom row
  • 213.
    213 Non-uniform mismatch penalties pWe used uniform penalty for mismatches: s(’A’, ’C’) = s(’A’, ’G’) = … = s(’G’, ’T’) = µ p Transition mutations (A->G, G->A, C->T, T->C) are approximately twice as frequent than transversions (A->T, T->A, A->C, G->T) n use non-uniform mismatch penalties collected into a substitution matrix 1 -1 -0.5 -1 T -1 1 -1 -0.5 G -0.5 -1 1 -1 C -1 -0.5 -1 1 A T G C A
  • 214.
    214 Gaps in alignment pGap is a succession of indels in alignment p Previous model scored a length k gap as w(k) = -k p Replication processes may produce longer stretches of insertions or deletions n In coding regions, insertions or deletions of codons may preserve functionality C T – - - A A C T C G C A A
  • 215.
    215 Gap open andextension penalties (2) p We can design a score that allows the penalty opening gap to be larger than extending the gap: w(k) = - – (k – 1) p Gap open cost , Gap extension cost p Alignment algorithms can be extended to use w(k) (not discussed on this course)
  • 216.
    216 Amino acid sequences pWe have discussed mainly DNA sequences p Amino acid sequences can be aligned as well p However, the design of the substitution matrix is more involved because of the larger alphabet p More on the topic in the course Biological sequence analysis
  • 217.
    217 Demonstration of theEBI web site p European Bioinformatics Institute (EBI) offers many biological databases and bioinformatics tools at http://www.ebi.ac.uk/ n Sequence alignment: Tools -> Sequence Analysis -> Align
  • 218.
    218 Sequence Alignment (chapter6) p The biological problem p Global alignment p Local alignment p Multiple alignment
  • 219.
    219 Multiple alignment p Considera set of n sequences on the right n Orthologous sequences from different organisms n Paralogs from multiple duplications p How can we study relationships between these sequences? aggcgagctgcgagtgcta cgttagattgacgctgac ttccggctgcgac gacacggcgaacgga agtgtgcccgacgagcgaggac gcgggctgtgagcgcta aagcggcctgtgtgcccta atgctgctgccagtgta agtcgagccccgagtgc agtccgagtcc actcggtgc
  • 220.
    220 Optimal alignment ofthree sequences p Alignment of A = a1a2…ai and B = b1b2…bj can end either in (-, bj), (ai, bj) or (ai, -) p 22 – 1 = 3 alternatives p Alignment of A, B and C = c1c2…ck can end in 23 – 1 ways: (ai, -, -), (-, bj, -), (-, -, ck), (-, bj, ck), (ai, -, ck), (ai, bj, -) or (ai, bj, ck) p Solve the recursion using three-dimensional dynamic programming matrix: O(n3) time and space p Generalizes to n sequences but impractical with even a moderate number of sequences
  • 221.
    221 Multiple alignment inpractice p In practice, real-world multiple alignment problems are usually solved with heuristics p Progressive multiple alignment n Choose two sequences and align them n Choose third sequence w.r.t. two previous sequences and align the third against them n Repeat until all sequences have been aligned n Different options how to choose sequences and score alignments n Note the similarity to Overlap-Layout-Consensus
  • 222.
    222 Multiple alignment inpractice p Profile-based progressive multiple alignment: CLUSTALW n Construct a distance matrix of all pairs of sequences using dynamic programming n Progressively align pairs in order of decreasing similarity n CLUSTALW uses various heuristics to contribute to accuracy
  • 223.
    223 Additional material p R.Durbin, S. Eddy, A. Krogh, G. Mitchison: Biological sequence analysis p N. C. Jones, P. A. Pevzner: An introduction to bioinformatics algorithms p Course Biological sequence analysis in period II, 2008
  • 224.
    224 Rapid alignment methods:FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST
  • 225.
    225 The biological problem pGlobal and local alignment algoritms are slow in practice p Consider the scenario of aligning a query sequence against a large database of sequences n New sequence with unknown function n NCBI GenBank size in January 2007 was 65 369 091 950 bases (61 132 599 sequences) n Feb 2008: 85 759 586 764 bases (82 853 685 sequences)
  • 226.
    226 Problem with largeamount of sequences p Exponential growth in both number and total length of sequences p Possible solution: Compare against model organisms only p With large amount of sequences, chances are that matches occur by random n Need for statistical analysis
  • 227.
    227 Rapid alignment methods:FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST
  • 228.
    228 FASTA p FASTA isa multistep algorithm for sequence alignment (Wilbur and Lipman, 1983) p The sequence file format used by the FASTA software is widely used by other sequence analysis software p Main idea: n Choose regions of the two sequences (query and database) that look promising (have some degree of similarity) n Compute local alignment using dynamic programming in these regions
  • 229.
    229 FASTA outline p FASTAalgorithm has five steps: n 1. Identify common k-words between I and J n 2. Score diagonals with k-word matches, identify 10 best diagonals n 3. Rescore initial regions with a substitution score matrix n 4. Join initial regions using gaps, penalise for gaps n 5. Perform dynamic programming to find final alignments
  • 230.
    230 Search strategies p Howto speed up the computation? n Find ways to limit the number of pairwise comparisons p Compare the sequences at word level to find out common words n Word means here a k-tuple (or a k-word), a substring of length k
  • 231.
    231 Analyzing the wordcontent p Example query string I: TGATGATGAAGACATCAG p For k = 8, the set of k-words (substring of length k) of I is TGATGATG GATGATGA ATGATGAA TGATGAAG … GACATCAG
  • 232.
    232 Analyzing the wordcontent p There are n-k+1 k-words in a string of length n p If at least one word of I is not found from another string J, we know that I differs from J p Need to consider statistical significance: I and J might share words by chance only p Let n=|I| and m=|J|
  • 233.
    233 Word lists andcomparison by content p The k-words of I can be arranged into a table of word occurences Lw(I) p Consider the k-words when k=2 and I=GCATCGGC: GC, CA, AT, TC, CG, GG, GC AT: 3 CA: 2 CG: 5 GC: 1, 7 GG: 6 TC: 4 Start indecies of k-word GC in I Building Lw(I) takes O(n) time
  • 234.
    234 Common k-words p Numberof common k-words in I and J can be computed using Lw(I) and Lw(J) p For each word w in I, there are |Lw(J)| occurences in J p Therefore I and J have common words p This can be computed in O(n + m + 4k) time n O(n + m) time to build the lists n O(4k) time to calculate the sum (in DNA strings)
  • 235.
    235 Common k-words p I= GCATCGGC p J = CCATCGCCATCG Lw(J) AT: 3, 9 CA: 2, 8 CC: 1, 7 CG: 5, 11 GC: 6 TC: 4, 10 Lw(I) AT: 3 CA: 2 CG: 5 GC: 1, 7 GG: 6 TC: 4 Common words 2 2 0 2 2 0 2 10 in total
  • 236.
    236 Properties of thecommon word list p Exact matches can be found using binary search (e.g., where TCGT occurs in I?) n O(log 4k) time p For large k, the table size is too large to compute the common word count in the previous fashion p Instead, an approach based on merge sort can be utilised (details skipped) p The common k-word technique can be combined with the local alignment algorithm to yield a rapid alignment approach
  • 237.
    237 FASTA outline p FASTAalgorithm has five steps: n 1. Identify common k-words between I and J n 2. Score diagonals with k-word matches, identify 10 best diagonals n 3. Rescore initial regions with a substitution score matrix n 4. Join initial regions using gaps, penalise for gaps n 5. Perform dynamic programming to find final alignments
  • 238.
    238 Dot matrix comparisons pWord matches in two sequences I and J can be represented as a dot matrix p Dot matrix element (i, j) has ”a dot”, if the word starting at position i in I is identical to the word starting at position j in J p The dot matrix can be plotted for various k i j I = … ATCGGATCA … J = … TGGTGTCGC … i j
  • 239.
    239 k=1 k=4 k=8 k=16 Dotmatrix (k=1,4,8,16) for two DNA sequences X85973.1 (1875 bp) Y11931.1 (2013 bp)
  • 240.
    240 k=1 k=4 k=8 k=16 Dotmatrix (k=1,4,8,16) for two protein sequences CAB51201.1 (531 aa) CAA72681.1 (588 aa) Shading indicates now the match score according to a score matrix (Blosum62 here)
  • 241.
    241 Computing diagonal sums pWe would like to find high scoring diagonals of the dot matrix p Lets index diagonals by the offset, l = i - j C C A T C G C C A T C G G * C * * A * * T * * C * * G G * C k=2 I J Diagonal l = i – j = -6
  • 242.
    242 Computing diagonal sums pAs an example, lets compute diagonal sums for I = GCATCGGC, J = CCATCGCCATCG, k = 2 p 1. Construct k-word list Lw(J) p 2. Diagonal sums Sl are computed into a table, indexed with the offset and initialised to zero l -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 Sl 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  • 243.
    243 Computing diagonal sums p3. Go through k-words of I, look for matches in Lw(J) and update diagonal sums C C A T C G C C A T C G G * C * * A * * T * * C * * G G * C I J For the first 2-word in I, GC, LGC(J) = {6}. We can then update the sum of diagonal l = i – j = 1 – 6 = -5 to S-5 := S-5 + 1 = 0 + 1 = 1
  • 244.
    244 Computing diagonal sums p3. Go through k-words of I, look for matches in Lw(J) and update diagonal sums C C A T C G C C A T C G G * C * * A * * T * * C * * G G * C I J Next 2-word in I is CA, for which LCA(J) = {2, 8}. Two diagonal sums are updated: l = i – j = 2 – 2 = 0 S0 := S0 + 1 = 0 + 1 = 1 I = i – j = 2 – 8 = -6 S-6 := S-6 + 1 = 0 + 1 = 1
  • 245.
    245 Computing diagonal sums p3. Go through k-words of I, look for matches in Lw(J) and update diagonal sums C C A T C G C C A T C G G * C * * A * * T * * C * * G G * C I J Next 2-word in I is AT, for which LAT(J) = {3, 9}. Two diagonal sums are updated: l = i – j = 3 – 3 = 0 S0 := S0 + 1 = 1 + 1 = 2 I = i – j = 3 – 9 = -6 S-6 := S-6 + 1 = 1 + 1 = 2
  • 246.
    246 Computing diagonal sums Aftergoing through the k-words of I, the result is: l -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 Sl 0 0 0 0 4 1 0 0 0 0 4 1 0 0 0 0 0 C C A T C G C C A T C G G * C * * A * * T * * C * * G G * C I J
  • 247.
    247 Algorithm for computingdiagonal sum of scores Sl := 0 for all 1 – m l n – 1 Compute Lw(J) for all words w for i := 1 to n – k – 1 do w := IiIi+1…Ii+k-1 for j Lw(J) do l := i – j Sl := Sl + 1 end end Match score is here 1
  • 248.
    248 FASTA outline p FASTAalgorithm has five steps: n 1. Identify common k-words between I and J n 2. Score diagonals with k-word matches, identify 10 best diagonals n 3. Rescore initial regions with a substitution score matrix n 4. Join initial regions using gaps, penalise for gaps n 5. Perform dynamic programming to find final alignments
  • 249.
    249 Rescoring initial regions pEach high-scoring diagonal chosen in the previous step is rescored according to a score matrix p This is done to find subregions with identities shorter than k p Non-matching ends of the diagonal are trimmed I: C C A T C G C C A T C G J: C C A A C G C A A T C A I’: C C A T C G C C A T C G J’: A C A T C A A A T A A A 75% identity, no 4-word identities 33% identity, one 4-word identity
  • 250.
    250 Joining diagonals p Twooffset diagonals can be joined with a gap, if the resulting alignment has a higher score p Separate gap open and extension are used p Find the best-scoring combination of diagonals High-scoring diagonals Two diagonals joined by a gap
  • 251.
    251 FASTA outline p FASTAalgorithm has five steps: n 1. Identify common k-words between I and J n 2. Score diagonals with k-word matches, identify 10 best diagonals n 3. Rescore initial regions with a substitution score matrix n 4. Join initial regions using gaps, penalise for gaps n 5. Perform dynamic programming to find final alignments
  • 252.
    252 Local alignment inthe highest-scoring region p Last step of FASTA: perform local alignment using dynamic programming around the highest- scoring p Region to be aligned covers –w and +w offset diagonal to the highest- scoring diagonals p With long sequences, this region is typically very small compared to the whole n x m matrix w w Dynamic programming matrix M filled only for the green region
  • 253.
    253 Properties of FASTA pFast compared to local alignment using dynamic programming only n Only a narrow region of the full matrix is aligned p Increasing parameter k decreases the number of hits: n Increases specificity n Decreases sensitivity n Decreases running time p FASTA can be very specific when identifying long regions of low similarity n Specific method does not find many incorrect results n Sensitive method finds many of the correct results
  • 254.
    254 Properties of FASTA pFASTA looks for initial exact matches to query sequence n Two proteins can have very different amino acid sequences and still be biologically similar n This may lead into a lack of sensitivity with diverged sequences
  • 255.
    255 Demonstration of FASTAat EBI p http://www.ebi.ac.uk/fasta/ p Note that parameter ktup in the software corresponds to parameter k in lectures
  • 256.
    256 Note on sequencesand alignment matrices in exercises p Example solutions to alignment problems will have sequences arranged like this: n Perform global alignment of the sequences p s = AGCTGCGTACT p t = ATGAGCGTTA AGCTGCGTACT ATGAGCGTTA So if you want to be able to compare your solution easily against the example, use this convention.
  • 257.
    257 Rapid alignment methods:FASTA and BLAST p The biological problem p Search strategies p FASTA p BLAST
  • 258.
    258 BLAST: Basic LocalAlignment Search Tool p BLAST (Altschul et al., 1990) and its variants are some of the most common sequence search tools in use p Roughly, the basic BLAST has three parts: n 1. Find segment pairs between the query sequence and a database sequence above score threshold (”seed hits”) n 2. Extend seed hits into locally maximal segment pairs n 3. Calculate p-values and a rank ordering of the local alignments p Gapped BLAST introduced in 1997 allows for gaps in alignments
  • 259.
    259 Finding seed hits pFirst, we generate a set of neighborhood sequences for given k, match score matrix and threshold T p Neighborhood sequences of a k-word w include all strings of length k that, when aligned against w, have the alignment score at least T p For instance, let I = GCATCGGC, J = CCATCGCCATCG and k = 5, match score be 1, mismatch score be 0 and T = 4
  • 260.
    260 Finding seed hits pI = GCATCGGC, J = CCATCGCCATCG, k = 5, match score 1, mismatch score 0, T = 4 p This allows for one mismatch in each k-word p The neighborhood of the first k-word of I, GCATC, is GCATC and the 15 sequences A A C A A CCATC,G GATC,GC GTC,GCA CC,GCAT G T T T G T
  • 261.
    261 Finding seed hits pI = GCATCGGC has 4 k-words and thus 4x16 = 64 5-word patterns to locate in J n Occurences of patterns in J are called seed hits p Patterns can be found using exact search in time proportional to the sum of pattern lengths + length of J + number of matches (Aho-Corasick algorithm) n Methods for pattern matching are developed on course 58093 String processing algorithms p Compare this approach to FASTA
  • 262.
    262 Extending seed hits:original BLAST p Initial seed hits are extended into locally maximal segment pairs or High-scoring Segment Pairs (HSP) p Extensions do not add gaps to the alignment p Sequence is extended until the alignment score drops below the maximum attained score minus a threshold parameter value p All statistically significant HSPs reported AACCGTTCATTA | || || || TAGCGATCTTTT Initial seed hit Extension Altschul, S.F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J., J. Mol. Biol., 215, 403-410, 1990
  • 263.
    263 Extending seed hits:gapped BLAST p In a later version of BLAST, two seed hits have to be found on the same diagonal n Hits have to be non-overlapping n If the hits are closer than A (additional parameter), then they are joined into a HSP p Threshold value T is lowered to achieve comparable sensitivity p If the resulting HSP achieves a score at least Sg, a gapped extension is triggered Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ, Nucleic Acids Res. 1;25(17), 3389-402, 1997
  • 264.
    264 Gapped extensions ofHSPs p Local alignment is performed starting from the HSP p Dynamic programming matrix filled in ”forward” and ”backward” directions (see figure) p Skip cells where value would be Xg below the best alignment score found so far Region potentially searched by the alignment algorithm HSP Region searched with score above cutoff parameter
  • 265.
    265 Estimating the significanceof results p In general, we have a score S(D, X) = s for a sequence X found in database D p BLAST rank-orders the sequences found by p- values p The p-value for this hit is P(S(D, Y) s) where Y is a random sequence n Measures the amount of ”surprise” of finding sequence X p A smaller p-value indicates more significant hit n A p-value of 0.1 means that one-tenth of random sequences would have as large score as our result
  • 266.
    266 Estimating the significanceof results p In BLAST, p-values are computed roughly as follows p There are nm places to begin an optimal alignment in the n x m alignment matrix p Optimal alignment is preceded by a mismatch and has t matching (identical) letters n (Assume match score 1 and mismatch/indel score - ) p Let p = P(two random letters are equal) p The probability of having a mismatch and then t matches is (1-p)pt
  • 267.
    267 Estimating the significanceof results p We model this event by a Poisson distribution (why?) with mean = nm(1-p)pt p P(there is local alignment t or longer) 1 – P(no such event) = 1 – e- = 1 – exp(-nm(1-p)pt) p An equation of the same form is used in Blast: p E-value = P(S(D, Y) s) 1 – exp(-nm t) where > 0 and 0 < < 1 p Parameters and are estimated from data
  • 268.
    268 Scoring amino acid alignments pWe need a way to compute the score S(D, X) for aligning the sequence X against database D p Scoring DNA alignments was discussed previously p Constructing a scoring model for amino acids is more challenging n 20 different amino acids vs. 4 bases p Figure shows the molecular structures of the 20 amino acids http://en.wikipedia.org/wiki/List_of_standard_amino_acids
  • 269.
    269 Scoring amino acid alignments pSubstitutions between chemically similar amino acids are more frequent than between dissimilar amino acids p We can check our scoring model against this http://en.wikipedia.org/wiki/List_of_standard_amino_acids
  • 270.
    270 Score matrices p Scoress = S(D, X) are obtained from score matrices p Let A = A1a2…an and B = b1b2…bn be sequences of equal length (no gaps allowed to simplify things) p To obtain a score for alignment of A and B, where ai is aligned against bi, we take the ratio of two probabilities n The probability of having A and B where the characters match (match model M) n The probability that A and B were chosen randomly (random model R)
  • 271.
    271 Score matrices: randommodel p Under the random model, the probability of having X and Y is where qxi is the probability of occurence of amino acid type xi p Position where an amino acid occurs does not affect its type
  • 272.
    272 Score matrices: matchmodel p Let pab be the probability of having amino acids of type a and b aligned against each other given they have evolved from the same ancestor c p The probability is
  • 273.
    273 Score matrices: log-oddsratio score p We obtain the score S by taking the ratio of these two probabilities and taking a logarithm of the ratio
  • 274.
    274 Score matrices: log-oddsratio score p The score S is obtained by summing over character pair-specific scores: p The probabilities qa and pab are extracted from data
  • 275.
    275 Calculating score matricesfor amino acids p Probabilities qa are in principle easy to obtain: n Count relative frequencies of every amino acid in a sequence database
  • 276.
    276 p To calculatepab we can use a known pool of aligned sequences p BLOCKS is a database of highly conserved regions for proteins p It lists multiply aligned, ungapped and conserved protein segments p Example from BLOCKS shows genes related to human gene associated with DNA-repair defect xeroderma pigmentosum Calculating score matrices for amino acids Block PR00851A ID XRODRMPGMNTB; BLOCK AC PR00851A; distance from previous block=(52,131) DE Xeroderma pigmentosum group B protein signature BL adapted; width=21; seqs=8; 99.5%=985; strength=1287 XPB_HUMAN|P19447 ( 74) RPLWVAPDGHIFLEAFSPVYK 54 XPB_MOUSE|P49135 ( 74) RPLWVAPDGHIFLEAFSPVYK 54 P91579 ( 80) RPLYLAPDGHIFLESFSPVYK 67 XPB_DROME|Q02870 ( 84) RPLWVAPNGHVFLESFSPVYK 79 RA25_YEAST|Q00578 ( 131) PLWISPSDGRIILESFSPLAE 100 Q38861 ( 52) RPLWACADGRIFLETFSPLYK 71 O13768 ( 90) PLWINPIDGRIILEAFSPLAE 100 O00835 ( 79) RPIWVCPDGHIFLETFSAIYK 86 http://blocks.fhcrc.org
  • 277.
    277 BLOSUM matrix p BLOSUMis a score matrix for amino acid sequences derived from BLOCKS data p First, count pairwise matches fx,y for every amino acid type pair (x, y) p For example, for column 3 and amino acids L and W, we find 8 pairwise matches: fL,W = fW,L = 8 RPLWVAPD RPLWVAPR RPLWVAPN PLWISPSD RPLWACAD PLWINPID RPIWVCPD
  • 278.
    278 p Probability pabis obtained by dividing fab with the total number of pairs (note difference with course book): p We get probabilities qa by RPLWVAPD RPLWVAPR RPLWVAPN PLWISPSD RPLWACAD PLWINPID RPIWVCPD Creating a BLOSUM matrix
  • 279.
    279 Creating a BLOSUMmatrix p The probabilities pab and qa can now be plugged into to get a 20 x 20 matrix of scores s(a, b). p Next slide presents the BLOSUM62 matrix n Values scaled by factor of 2 and rounded to integers n Additional step required to take into account expected evolutionary distance n Described in Deonier’s book in more detail
  • 280.
    280 BLOSUM62 A R ND C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
  • 281.
    281 Using BLOSUM62 matrix MQLEANADTSV || | LQEQAEAQGEM = 2 + 5 – 3 – 4 + 4 + 0 + 4 + 0 – 2 + 0 + 1 = 7
  • 282.
    282 Demonstration of BLASTat NCBI p http://www.ncbi.nlm.nih.gov/BLAST/
  • 283.
  • 284.
    284 Why study genomerearrangements? p Provide insight into evolution of species p Fun algorithmic problem! p Structure of this lecture: n The biological phenomenon n How to computationally model it? n How to compute interesting things? n Studying the phenomenon using existing tools (continued in exercises)
  • 285.
    285 Genome rearrangements asan algorithmic problem
  • 286.
    286 Background p Genome sequencingenables us to compare genomes of two or more different species n -> Comparative genomics p Basic observation: n Closely related species (such as human and mouse) can be almost identical in terms of genome contents... n ...but the order of genomic segments can be very different between species
  • 287.
    287 Synteny blocks andsegments p Synteny – derived from Greek ’on the same ribbon’ – means genomic segments located on the same chromosome n Genes, markers (any sequence) p Synteny block (or syntenic block) n A set of genes or markers that co-occur together in two species p Synteny segment (or syntenic segment) n Syntenic block where the order of genes or markers is preserved
  • 288.
    288 Synteny blocks andsegments Chromosome i, species B Chromosome j, species C Synteny segment Synteny block Homologs of the same gene
  • 289.
    289 Observations from sequencing 1.Large chromosome inversions and translocations (we’ll get to these shortly) are common n ...Even between closely related species 2. Chromosome inversions are usually symmetric around the origin of DNA replication 3. Inversions are less common within species...
  • 290.
    290 What causes rearrangements? pRecA, Recombinase A, is a protein used to repair chromosomal damage p It uses a duplicate copy of the damaged sequence as template p Template is usually a homologous sequence on a sister chromosome Diarmaid Hughes: Evaluating genome dynamics: the constraints on rearrangements within bacterial genomes, Genome Biology 2000, 1
  • 291.
    291 Chromosomes: recap p Linearchromosomes n Eukaryotes (mostly) p Circular chromosomes n Prokaryotes (mostly) n Mitochondria chromatid centromere gene 1 gene 3 gene 2 Also double-stranded: genes can be found on both strands (orientations)
  • 292.
    292 What effects doesRecA have on genome? p Repeated sequences cause RecA to fail to choose correct recombination start position p This leads to n Tandem duplications n Translocations n Inversions Repeat 1 Repeat 2 RecA ? Damaged sequence
  • 293.
    293 Diarmaid Hughes: Evaluatinggenome dynamics: the constraints on rearrangements within bacterial genomes, Genome Biology 2000, 1 X, Y, Z and W are repeats of the same sequence. a, b, c and d are sequences on genome bounded by repeats. In a tandem duplication example, RecA recombines a sequence that starts from Y instead of Z after Z. This leads to duplication of segment Y-Z.
  • 294.
    294 Diarmaid Hughes: Evaluatinggenome dynamics: the constraints on rearrangements within bacterial genomes, Genome Biology 2000, 1 Recombination of two repeat sequences in the same chromosome can lead to a fragment translocation Here sequence d is translocated
  • 295.
    295 Diarmaid Hughes: Evaluatinggenome dynamics: the constraints on rearrangements within bacterial genomes, Genome Biology 2000, 1 Inversion happens when two sequences of opposite orientations are recombined.
  • 296.
    296 Example: human vsmouse genome p Human and mouse genomes share thousands of homologous genes, but they are n Arranged in different order n Located in different chromosomes p Examples n Human chromosome 6 contains elements from six different mouse chromosomes n Analysis of X chromosome indicates that rearrangements have happened primarily within chromosome
  • 297.
  • 298.
  • 299.
    299 Representing genome rearrangments pWhen comparing two genomes, we can find homologous sequences in both using BLAST, for example p This gives us a map between sequences in both genomes
  • 300.
    300 Representing genome rearrangments pWe assign numbers 1,...,n to the found homologous sequences p By convention, we number the sequences in the first genome by their order of appearance in chromosomes p If the homolog of i is in reverse orientation, it receives number –i (signed data) p For example, consider human vs mouse gene numbering on the right (il10) 9 (at3) -6 (pdc) 8 (lamc1) -7 (lamc1) 7 (pdc) -8 (at3) 6 (il10) -9 (pklr) 5 (pax3) 15 (gba) 4 (fn1) 14 (ngfb) 3 (cd28) 13 (nras) 2 (inpp1) 12 (gnat2) 1 Mouse Human List order corresponds to physical order on chromosomes!
  • 301.
    301 Permutations p The basicdata structure in the study of genome rearrangements is permutation p A permutation of a sequence of n numbers is a reordering of the sequence p For example, 4 1 3 2 5 is a permutation of 1 2 3 4 5
  • 302.
    302 Genome rearrangement problem pGiven two genomes (set of markers), how many n duplications, n inversions and n translocations do we need to do to transform the first genome to the second? Minimum number of operations? What operations? Which order?
  • 303.
    303 Genome rearrangement problem 61 2 3 4 5 1 2 3 4 5 6 #duplications? #inversions? #translocations?
  • 304.
    304 Genome rearrangement problem 61 2 3 4 5 1 2 3 4 5 6 1 2 3 4 5 6 Keep in mind, that the two genomes have been evolved from a common ancestor genome! Permutation of 1,...,6
  • 305.
    305 Genome rearrangements usingreversals (=inversions) only p Lets consider a simpler problem where we just study reversals with unsigned data p A reversal p(i, j) reverses the order of the segment i i+1 ... j-1 j (indexing starts from 1) p For example, given permutation 6 1 2 3 4 5 and reversal p(3, 5) we get permutation 6 1 4 3 2 5 ...note that we do not care about exact positions on the genome p(3, 5)
  • 306.
    306 Reversal distance problem pFind the shortest series of reversals that, given a permutation , transforms it to the identity permutation (1, 2, ..., n) p This quantity is denoted by d( ) p Reversal distance for a pair of chromosomes: n Find synteny blocks in both n Number blocks in the first chromosome to identity n Set to correspond matching of second chromosome’s blocks against the first n Find reversal distance
  • 307.
    307 Reversal distance problem:discussion p If we can find the minimal series of reversals for some pair of genomes n Is that what happened during evolution? n If not, is it the correct number of reversals? p In any case, reversal distance gives us a measure of evolutionary distance between the two genomes and species
  • 308.
    308 Solving the problemby sorting p Our first approach to solve the reversal distance problem: n Examine each position i of the permutation n At each position, if i i, do a reversal such that i = i p This is a greedy approach: we try to choose the best option at each step
  • 309.
    309 Simple reversal sort:example 6 1 2 3 4 5 -> 1 6 2 3 4 5 -> 1 2 6 3 4 5 -> 1 2 3 4 6 5 -> 1 2 3 4 5 6 Reversal series: p(1,2), p(2,3), p(3,4), p(5,6) Is d(6 1 2 3 4 5) then 4? 6 1 2 3 4 5 -> 5 4 3 2 1 6 -> 1 2 3 4 5 6 D(6 1 2 3 4 5) = 2
  • 310.
    310 Pancake flipping problem pNo pancake made by the chef is of the same size p Pancakes need to be rearranged before delivery p Flipping operation: take some from the top and flip them over p This corresponds to always reversing the sequence prefix 1 2 3 6 4 5 -> 6 3 2 1 4 5 -> 5 4 1 2 3 6 -> 3 2 1 4 5 6 -> 1 2 3 4 5 6
  • 311.
    311 How good issimple reversal sort? p Not so good actually p It has to do at most n-1 reversals with permutation of length n p The algorithm can return a distance that is as large as (n – 1)/2 times the correct result d( ) n For example, if n = 1001, result can be as bad as 500 x d( )
  • 312.
    312 Estimating reversal distanceby cycle decomposition p We can estimate d( ) by cycle decomposition p Lets represent permutation = 1 2 4 5 3 with the following graph where edges correspond to adjacencies (identity, permutation F) 1 2 4 5 3 0 6
  • 313.
    313 Estimating reversal distanceby cycle decomposition p Cycle decomposition: a set of cycles that n have edges with alternating colors n do not share edges with other cycles (=cycles are edge disjoint) 1 2 4 5 3 0 6 1 2 4 5
  • 314.
    314 Cycle decompositions p Letc( ) the maximum number of alternating, edge-disjoint cycles in the graph representation of p The following formula allows estimation of d( ) n d( ) n + 1 – c( ), where n is the permutation length 1 2 4 5 3 0 6 1 2 4 5 d( ) 5 + 1 – 4 = 2 Claim in Deonier: equality holds for ”most of the usual and interesting biological systems.
  • 315.
    315 Cycle decompositions p Cycledecomposition is NP-complete n We cannot solve the general problem exactly for large instances p However, with signed data the problem becomes easy n Before going into signed data, lets discuss another algorithm for the general case
  • 316.
    316 Computing reversals withbreakpoints p Lets investigate a better way to compute reversal distance p First, some concepts related to permutation 1 2,,, n-1 n n Breakpoint: two elements i and i+1 are a breakpoint, if they are not consecutive numbers n Adjacency: if i and i+1 are consecutive, they are called adjacency
  • 317.
    317 Breakpoints and adjacencies 21 3 4 5 8 7 6 This permutation contains four breakpoints begin-2, 13, 58, 6-end and five adjacencies 21, 34, 45, 87, 76 Breakpoints
  • 318.
    318 Breakpoints p Each breakpointin permutation needs to be removed to get to the identity permutation (=our target) n Identity permutation does not contain any breakpoints p First and last positions special cases p Note that each reversal can remove at most two breakpoints p Denote the number of breakpoints by b( ) 2 1 3 4 5 8 7 6 b( ) = 4
  • 319.
    319 Breakpoint reversal sort pIdea: try to remove as many breakpoints as possible (max 2) in every step 1. While b( ) > 0 2. Choose reversal p that removes most breakpoints 3. Perform reversal p to 4. Output 5. return
  • 320.
    320 Breakpoint removal: example 82 7 6 5 1 4 3 b( ) = 6 2 8 7 6 5 1 4 3 b( ) = 5 2 3 4 1 5 6 7 8 b( ) = 3 4 3 2 1 5 6 7 8 b( ) = 2 1 2 3 4 5 6 7 8 b( ) = 0
  • 321.
    321 Breakpoint removal p Theprevious algorithm needs refinement to be correct p Consider the following permutation: 1 5 6 7 2 3 4 8 p There is no reversal that decreases the number of breakpoints! p See Jones & Pevzner for detailed description on this
  • 322.
    322 Breakpoint removal p Reversalcan only decrease breakpoint count if permutation contains decreasing strips 1 5 6 7 2 3 4 8 1 5 6 7 4 3 2 8 1 2 3 4 7 6 5 8 Increasing strip Decreasing strip Strip: maximal segment without breakpoints
  • 323.
    323 Improved breakpoint reversalsort 1. While b( ) > 0 2. If has a decreasing strip 3. Do reversal p that removes most BPs 4. Else 5. Reverse an increasing strip 6. Output 7. return
  • 324.
    324 Is Improved BPremoval enough? p The algorithm works pretty well: n It produces a result that is at most four times worse than the optimal result n ...is this good? p We considered only reversals p What about translocations & duplications?
  • 325.
    325 Translocations via reversals 12 3 4 5 6 7 8 1 5 6 7 8 2 3 4 1 4 3 2 8 7 6 5 1 2 3 4 8 7 6 5 1 2 3 4 5 6 7 8 Translocation of 2,3,4 p(2,8) p(2,4) p(5,8)
  • 326.
    326 Genome rearrangements withreversals p With unsigned data, the problem of finding minimum reversal distances is NP- complete n Why is this so if sorting is easy? p An algorithm has been developed that achieves 1.375-approximation p However, reversal distance in signed data can be computed quickly! n It takes linear time w.r.t. the length of permutation (Bader, Moret, Yan, 2001)
  • 327.
    327 Cycle decomposition withsigned data p Consider the following two permutations that include orientation of markers n J: +1 +5 -2 +3 +4 n K: +1 -3 +2 +4 -5 p We modify this representation a bit to include both endpoints of each marker: n J’: 0 1a 1b 5a 5b 2b 2a 3a 3b 4a 4b 6 n K’: 0 1a 1b 3b 3a 2a 2b 4a 4b 5b 5a 6
  • 328.
    328 Graph representation ofJ’ and K’ p Drawn online in lecture!
  • 329.
    329 Multiple chromosomes p Inunichromosomal genomes, inversion (reversal) is the most common operation p In multichromosomal genomes, inversions, translocations, fissions and fusions are most common
  • 330.
    330 Multiple chromosomes p Letsrepresent multichromosomal genome as a set of permutations, with $ denoting the boundary of a chromosome: 5 9 $ 1 3 2 8 $ 7 6 4 $ This notation is frequently used in software used to analyse genome rearrangements. Chr 1 Chr 2 Chr 3
  • 331.
    331 Multiple chromosomes p Notethat when dealing with multiple chromosomes, you need to specify numbering for elements on both genomes
  • 332.
    332 Reversals & translocations pReversal p( , i, j) p Translocation p( , , i, j) i j Translocation
  • 333.
    333 Fusions & fissions pFusion: merging of two chromosomes p Fission: chromosome is split into two chromosomes p Both events can be represented with a translocation
  • 334.
    334 Fusion p Fusion bytranslocation p( , , n+1, 1) i = n + 1 j = 1 Fusion
  • 335.
    335 Fission p Fission bytranslocation p( , , i, 1) i Empty chromosome Fission
  • 336.
    336 Algorithms for generalgenomic distance problem p Hannenhalli, Pevzner: Transforming Men into Mice (polynomial algorithm for genomic distance problem), 36th Annual IEEE Symposium on Foundations of Computer Science, 1995
  • 337.
    337 Human & mouserevisited p Human and mouse are separated by about 75-83 million years of evolutionary history p Only a few hundred rearrangements have happened after speciation from the common ancestory p Pevzner & Tesler identified in 2003 for 281 synteny blocks a rearrangement from mouse to human with n 149 inversions n 93 translocations n 9 fissions
  • 338.
    338 Discussion p Genome rearrangementevents are very rare compared to, e.g., point mutations n We can study rearrangement events further back in the evolutionary history p Rearrangements are easier to detect in comparison to many other genomic events p We cannot detect homologs 100% correctly so the input permutation can contain errors
  • 339.
    339 Discussion p Genome rearrangementis to some degree constrained by the number and size of repeats in a genome n Notice how the importance of genomic repeats pops up once again p Sequencing gives us (usually) signed data so we can utilize faster algorithms p What if there are more than one optimal solution?
  • 340.
    340 Two different genomerearrangement scenarios giving the same result.
  • 341.
    341 GRIMM demonstration Glenn Tesler,GRIMM: genome rearrangements web server. Bioinformatics, 2002,
  • 342.
    342 GRIMM file format #useful comment about first genome # another useful comment about it >name of first genome 1 -4 2 $ # chromosome 1 -3 5 6 # chromosome 2 >name of second genome 5 -3 $ 6 $ 2 -4 1 $ http://grimm.ucsd.edu/GRIMM/grimm_instr.html GRIMM supports analysis of one, two or more genomes
  • 343.
  • 344.
    344 Inferring the Past:Phylogenetic Trees p The biological problem p Parsimony and distance methods p Models for mutations and estimation of distances p Maximum likelihood methods
  • 345.
    345 Phylogeny p We wantto study ancestor- descendant relationships, or phylogeny, among groups of organisms p Groups are called taxa (singular: taxon) p Organisms are usually called operational taxonomic units or OTUs in the context of phylogeny
  • 346.
    346 Phylogenetic trees p Leaves(external nodes) ~ species, observed (OTUs) p Internal nodes ~ ancestral species/divergence events, not observed p Unrooted tree does not specify ancestor- descendant relationships beyond the observation ”leaves are not ancestors” 1 2 3 4 5 6 7 8 Unrooted tree with 5 leaves and 3 internal nodes. Is node 7 ancestor of node 6?
  • 347.
    347 Phylogenetic trees p Rootinga tree specifies all ancestor-descendant relationships in the tree p Root is the ancestor to the other species p There are n-1 ways to root a tree with n nodes 1 2 3 4 5 6 7 8 R1 R2 2 3 4 5 1 6 7 8 R1 2 3 4 5 1 6 7 8 R2 r o o t ( R 1 ) r o o t ( R 2 )
  • 348.
    348 Questions p Can weenumerate all possible phylogenetic trees for n species (or sequences?) p How to score a phylogenetic tree with respect to data? p How to find the best phylogenetic tree given data?
  • 349.
    349 Finding the bestphylogenetic tree: naive method p How can we find the phylogenetic tree that best represents the data? p Naive method: enumerate all possible trees p How many different trees are there of n species? p Denote this number by bn
  • 350.
    350 Enumerating unordered trees pStart with the only unordered tree with 3 leaves (b3 = 1) p Consider all ways to add a leaf node to this tree p Fourth node can be added to 3 different branches (edges), creating 1 new internal branch p Total number of branches is n external and n – 3 internal branches p Unrooted tree with n leaves has 2n – 3 branches 1 2 3 1 2 3 4 1 2 3 4 1 2 3 4
  • 351.
    351 Enumerating unordered trees pThus, we get the number of unrooted trees bn = (2(n – 1) – 3)bn-1 = (2n – 5)bn-1 = (2n – 5) * (2n – 7) * …* 3 * 1 = (2n – 5)! / ((n-3)!2n-3), n > 2 p Number of rooted trees b’n is b’n = (2n – 3)bn = (2n – 3)! / ((n-2)!2n-2), n > 2 that is, the number of unrooted trees times the number of branches in the trees
  • 352.
    352 Number of possiblerooted and unrooted trees 8.20E+021 2.22E+020 20 4.95E+038 8.69E+036 30 34459425 2027025 10 2027025 135135 9 135135 10395 8 10395 954 7 945 105 6 105 15 5 15 3 4 3 1 3 b’n Bn n
  • 353.
    353 Too many trees? pWe can’t construct and evaluate every phylogenetic tree even for a smallish number of species p Better alternative is to n Devise a way to evaluate an individual tree against the data n Guide the search using the evaluation criteria to reduce the search space
  • 354.
    354 Inferring the Past:Phylogenetic Trees (chapter 12) p The biological problem p Parsimony and distance methods p Models for mutations and estimation of distances p Maximum likelihood methods
  • 355.
    355 Parsimony method p Theparsimony method finds the tree that explains the observed sequences with a minimal number of substitutions p Method has two steps n Compute smallest number of substitutions for a given tree with a parsimony algorithm n Search for the tree with the minimal number of substitutions
  • 356.
    356 Parsimony: an example pConsider the following short sequences 1 ACTTT 2 ACATT 3 AACGT 4 AATGT 5 AATTT p There are 105 possible rooted trees for 5 sequences p Example: which of the following trees explains the sequences with least number of substitutions?
  • 357.
    357 3 AACGT 4 AATGT 5 AATTT 2 ACATT 1 ACTTT 6 AATGT 7 AATTT 8ACTTT 9 AATTT T->C T->G T->A A->C This tree explains the sequences with 4 substitutions
  • 358.
    358 3 AACGT 4 AATGT 5 AATTT 2 ACATT 1 ACTTT 6 AATGT 7 AATTT 8ACTTT 9 AATTT T->C T->G T->A A->C 3 AACGT 4 AATGT 5 AATTT 2 ACATT 1 ACTTT 6 ACCTT C->T 7 AACGT 8 AATGT 9 AATTT G->T T->C T->G A->C C->A 6 substitutions… First tree is more parsimonious! 4 substitutions…
  • 359.
    359 Computing parsimony p Parsimonytreats each site (position in a sequence) independently p Total parsimony cost is the sum of parsimony costs (=required substitutions) of each site p We can compute the minimal parsimony cost for a given tree by n First finding out possible assignments at each node, starting from leaves and proceeding towards the root n Then, starting from the root, assign a letter at each node, proceeding towards leaves
  • 360.
    360 Labelling tree nodes pAn unrooted tree with n leaves contains 2n-1 nodes altogether p Assign the following labels to nodes in a rooted tree n leaf nodes: 1, 2, …, n n internal nodes: n+1, n+2, …, 2n-1 n root node: 2n-1 p The label of a child node is always smaller than the label of the parent node 2 3 4 5 1 6 8 7 9
  • 361.
    361 Parsimony algorithm: firstphase p Find out possible assignments at every node for each site u independently. Denote site u in sequence i by si,u. For i := 1, …, n do Fi := {si,u} % possible assignments at node i Li := 0 % number of substitutions up to node i For i := n+1, …, 2n-1 do Let j and k be the children of node i If Fj Fk = then Li := Lj + Lk + 1, Fi := Fj Fk else Li := Lj + Lk, Fi := Fj Fk
  • 362.
    362 Parsimony algorithm: firstphase 3 AACGT 4 AATGT 5 AATTT 2 ACATT 1 ACTTT Choose u = 3 (for example, in general we do this for all sites) F1 := {T} L1 := 0 F2 := {A} L2 := 0 F3 := {C}, L3 := 0 F4 := {T}, L4 := 0 F5 := {T}, L5 := 0 6 7 8 9
  • 363.
    363 Parsimony algorithm: firstphase 3 AACGT 4 AATGT 5 AATTT 2 ACATT 1 ACTTT 6 {C,T} 7 T 8 {A, T} 9 T F8 := F1 F2 = {A, T} L8 := L1 + L2 + 1 = 1 F6 := F3 F4 = {C, T} L6 := L3 + L4 + 1 = 1 F7 := F5 F6 = {T} L7 := L5 + L6 = 1 F9 := F7 F8 = {T} L9 := L7 + L8 = 2 Parsimony cost for site 3 is 2
  • 364.
    364 Parsimony algorithm: secondphase p Backtrack from the root and assign x Fi at each node p If we assigned y at parent of node i and y Fi, then assign y p Else assign x Fi by random
  • 365.
    365 Parsimony algorithm: secondphase 3 AACGT 4 AATGT 5 AATTT 2 ACATT 1 ACTTT 6 {C,T} 7 T 8 {A,T} 9 T At node 6, the algorithm assigns T because T was assigned to parent node 7 and T F6. T is assigned to node 8 for the same reason. The other nodes have only one possible letter to assign
  • 366.
    366 Parsimony algorithm 3 AACGT 4 AATGT 5 AATTT 2 ACATT 1 ACTTT 6 T 7T 8 T 9 T First and second phase are repeated for each site in the sequences, summing the parsimony costs at each site
  • 367.
    367 Properties of parsimonyalgorithm p Parsimony algorithm requires that the sequences are of same length n First align the sequences against each other and, optionally, remove indels n Then compute parsimony for the resulting sequences n Indels (if present) considered as characters p Is the most parsimonious tree the correct tree? n Not necessarily but it explains the sequences with least number of substitutions n We can assume that the probability of having fewer mutations is higher than having many mutations
  • 368.
    368 Finding the mostparsimonious tree p Parsimony algorithm calculates the parsimony cost for a given tree… p …but we still have the problem of finding the tree with the lowest cost p Exhaustive search (enumerating all trees) is in general impossible p More efficient methods exist, for example n Probabilistic search n Branch and bound
  • 369.
    369 Branch and boundin parsimony p We can exploit the fact that adding edges to a tree can only increase the parsimony cost 1 AATGT 2 AATTT 3 AACGT 1 AATGT 2 AATTT {T} {T} {C, T} cost 0 cost 1
  • 370.
    370 Branch and boundin parsimony Branch and bound is a general search strategy where p Each solution is potentially generated p Track is kept of the best solution found p If a partial solution cannot achieve better score, we abandon the current search path In parsimony… p Start from a tree with 1 sequence p Add a sequence to the tree and calculate parsimony cost p If the tree is complete, check if found the best tree so far p If tree is not complete and cost exceeds best tree cost, do not continue adding edges to this tree
  • 371.
    371 Branch and boundexample 1 1 2 3 1 2 3 1 2 4 … Complete tree: compute parsimony cost Example with 4 sequences Partial tree: Compute parsimony cost and compare against best so far; Do not continue expansion if above cost of the best tree 3 1 2 4 2 1 3 4 2 1 3 4 1 3 2 1 3
  • 372.
    372 Distance methods p Theparsimony method works on sequence (character string) data p We can also build phylogenetic trees in a more general setting p Distance methods work on a set of pairwise distances dij for the data p Distances can be obtained from phenotypes as well as from genotypes (sequences)
  • 373.
    373 Distances in aphylogenetic tree p Distance matrix D = (dij) gives pairwise distances for leaves of the phylogenetic tree p In addition, the phylogenetic tree will now specify distances between leaves and internal nodes n Denote these with dij as well 2 3 4 5 1 6 7 8 Distance dij states how far apart species i and j are evolutionary (e.g., number of mismatches in aligned sequences)
  • 374.
    374 Distances in evolutionarycontext p Distances dij in evolutionary context satisfy the following conditions n Symmetry: dij = dji for each i, j n Distinguishability: dij 0 if and only if i j n Triangle inequality: dij dik + dkj for each i, j, k p Distances satisfying these conditions are called metric p In addition, evolutionary mechanisms may impose additional constraints on the distances additive and ultrametric distances
  • 375.
    375 Additive trees p Atree is called additive, if the distance between any pair of leaves (i, j) is the sum of the distances between the leaves and a node k on the shortest path from i to j in the tree dij = dik + djk p ”Follow the path from the leaf i to the leaf j to find the exact distance dij between the leaves.”
  • 376.
  • 377.
    377 Ultrametric trees p Arooted additive tree is called an ultrametric tree, if the distances between any two leaves i and j, and their common ancestor k are equal dik = djk p Edge length dij corresponds to the time elapsed since divergence of i and j from the common parent p In other words, edge lengths are measured by a molecular clock with a constant rate
  • 378.
    378 Identifying ultrametric data pWe can identify distances to be ultrametric by the three-point condition: D corresponds to an ultrametric tree if and only if for any three species i, j and k, the distances satisfy dij max(dik, dkj) p If we find out that the data is ultrametric, we can utilise a simple algorithm to find the corresponding tree
  • 379.
    379 Ultrametric trees 9 8 7 5 43 2 1 6 Observation time Time
  • 380.
    380 Ultrametric trees 9 8 7 5 43 2 1 6 Observation time Time Only vertical segments of the tree have correspondence to some distance dij: Horizontal segments act as connectors. d8,9
  • 381.
    381 Ultrametric trees 9 8 7 5 43 2 1 6 Observation time Time dik = djk for any two leaves i, j and any ancestor k of i and j
  • 382.
    382 Ultrametric trees 9 8 7 5 43 2 1 6 Observation time Time Three-point condition: there are no leafs i, j for which dij > max(dik, djk) for some leaf k.
  • 383.
    383 UPGMA algorithm p UPGMA(unweighted pair group method using arithmetic averages) constructs a phylogenetic tree via clustering p The algorithm works by at the same time n Merging two clusters n Creating a new node on the tree p The tree is built from leaves towards the root p UPGMA produces a ultrametric tree
  • 384.
    384 Cluster distances p Letdistance dij between clusters Ci and Cj be that is, the average distance between points (species) in the cluster.
  • 385.
    385 UPGMA algorithm p Initialisation nAssign each point i to its own cluster Ci n Define one leaf for each sequence, and place it at height zero p Iteration n Find clusters i and j for which dij is minimal n Define new cluster k by Ck = Ci Cj, and define dkl for all l n Define a node k with children i and j. Place k at height dij/2 n Remove clusters i and j p Termination: n When only two clusters i and j remain, place root at height dij/2
  • 386.
  • 387.
  • 388.
  • 389.
    389 1 2 3 4 5 1 24 5 6 7 8 3
  • 390.
    390 1 2 3 4 5 1 24 5 6 7 8 3 9
  • 391.
    391 UPGMA implementation p Innaive implementation, each iteration takes O(n2) time with n sequences => algorithm takes O(n3) time p The algorithm can be implemented to take only O(n2) time (see Gronau & Moran, 2006, for a survey)
  • 392.
    392 Problem solved? p Wenow have a simple algorithm which finds a ultrametric tree n If the data is ultrametric, then there is exactly one ultrametric tree corresponding to the data (we skip the proof) n The tree found is then the ”correct” solution to the phylogeny problem, if the assumptions hold p Unfortunately, the data is not ultrametric in practice n Measurement errors distort distances n Basic assumption of a molecular clock does not hold usually very well
  • 393.
    393 Incorrect reconstruction ofnon- ultrametric data by UPGMA 1 2 3 4 1 2 3 4 Tree which corresponds to non-ultrametric distances Incorrect ultrametric reconstruction by UPGMA algorithm
  • 394.
    394 Checking for additivity pHow can we check if our data is additive? p Let i, j, k and l be four distinct species p Compute 3 sums: dij + dkl, dik + djl, dil + djk
  • 395.
    395 Four-point condition i j l ki j l k i j l k dik djl dil djk dij dkl p The sums are represented by the three figures n Left and middle sum cover all edges, right sum does not p Four-point condition: i, j, k and l satisfy the four- point condition if two of the sums dij + dkl, dik + djl, dil + djk are the same, and the third one is smaller than these two
  • 396.
    396 Checking for additivity pAn n x n matrix D is additive if and only if the four point condition holds for every 4 distinct elements 1 i, j, k, l n
  • 397.
    397 Finding an additivephylogenetic tree p Additive trees can be found with, for example, the neighbor joining method (Saitou & Nei, 1987) p The neighbor joining method produces unrooted trees, which have to be rooted by other means n A common way to root the tree is to use an outgroup n Outgroup is a species that is known to be more distantly related to every other species than they are to each other n Root node candidate: position where the outgroup would join the phylogenetic tree p However, in real-world data, even additivity usually does not hold very well
  • 398.
    398 Neighbor joining algorithm pNeighbor joining works in a similar fashion to UPGMA n Find clusters C1 and C2 that minimise a function f(C1, C2) n Join the two clusters C1 and C2 into a new cluster C n Add a node to the tree corresponding to C n Assign distances to the new branches p Differences in n The choice of function f(C1, C2) n How to assign the distances
  • 399.
    399 Neighbor joining algorithm pRecall that the distance dij for clusters Ci and Cj was p Let u(Ci) be the separation of cluster Ci from other clusters defined by where n is the number of clusters.
  • 400.
    400 Neighbor joining algorithm pInstead of trying to choose the clusters Ci and Cj closest to each other, neighbor joining at the same time n Minimises the distance between clusters Ci and Cj and n Maximises the separation of both Ci and Cj from other clusters
  • 401.
    401 Neighbor joining algorithm pInitialisation as in UPGMA p Iteration n Find clusters i and j for which dij – u(Ci) – u(Cj) is minimal n Define new cluster k by Ck = Ci Cj, and define dkl for all l n Define a node k with edges to i and j. Remove clusters i and j n Assign length ½ dij + ½ (u(Ci) – u(Cj)) to the edge i -> k n Assign length ½ dij + ½ (u(Cj) – u(Ci)) to the edge j -> k p Termination: n When only one cluster remains
  • 402.
    402 Neighbor joining algorithm:example a b c d a 0 6 7 5 b 0 11 9 c 0 6 d 0 i u(i) a (6+7+5)/2 = 9 b (6+11+9)/2 = 13 c (7+11+6)/2 = 12 d (5+9+6)/2 = 10 i,j dij – u(Ci) – u(Cj) a,b 6 - 9 - 13 = -16 a,c 7 - 9 - 12 = -14 a,d 5 - 9 - 10 = -14 b,c 11 - 13 - 12 = -14 b,d 9 - 13 - 10 = -14 c,d 6 - 12 - 10 = -16 Choose either pair to join
  • 403.
    403 Neighbor joining algorithm:example a b c d a 0 6 7 5 b 0 11 9 c 0 6 d 0 i u(i) a (6+7+5)/2 = 9 b (6+11+9)/2 = 13 c (7+11+6)/2 = 12 d (5+9+6)/2 = 10 i,j dij – u(Ci) – u(Cj) a,b 6 - 9 - 13 = -16 a,c 7 - 9 - 12 = -14 a,d 5 - 9 - 10 = -14 b,c 11 - 13 - 12 = -14 b,d 9 - 13 - 10 = -14 c,d 6 - 12 - 10 = -16 a b c d e dae = ½ 6 + ½ (9 – 13) = 1 dbe = ½ 6 + ½ (13 – 9) = 5 dbe dae This is the first step only…
  • 404.
    404 Inferring the Past:Phylogenetic Trees (chapter 12) p The biological problem p Parsimony and distance methods p Models for mutations and estimation of distances p Maximum likelihood methods n These parts of the book is skipped on this course (see slides of 2007 course for material on these topics) n No questions in exams on these topics!
  • 405.
    405 Problems with tree-building pAssumptions n Sites evolve independently of one other n (Sites evolve according to the same stochastic model; not really covered this year) n The tree is rooted n The sequences are aligned n Vertical inheritance
  • 406.
    406 Additional material onphylogenetic trees p Durbin, Eddy, Krogh, Mitchison: Biological sequence analysis p Jones, Pevzner: An introduction to bioinformatics algorithms p Gusfield: Algorithms on strings, trees, and sequences p Course on phylogenetic analyses in Spring 2009
  • 407.
  • 408.
    408 Topics Sequencing -Sanger -High-throughput atgagccaag ttccgaacaa ggattcgcgg gtcggtaaagagcattggaa cgtcggagat aagaagcgga tgaatttccc cataacgcca gtggaagaga aggaggcggg cctcccgatc actccggccc gaagggttga gagtacccca gaaatcacct ccagaggacc ccttcagcga catagcgata ggaggggatg ctaggagttg Biodatabases BLAST query gtggaagagaaggaggcgg… gaagggttgagagtacccca... ccagaggaccccttcagcga… ggaggggatgctaggagttg… Sequence assembler Contigs => Genomes Reads Similar sequences = homologs? Proteomics Gene expression analysis DNA chips Protein gels MS/MS techniques Comparative genomics - Phylogenetics - Genome rearrangements - … Systems Biology 2010: Human genome in 15 minutes! atgagccaag aagaagcgga cagcggaaga
  • 409.
    409 Exams p Course examWednesday 15 October 16.00-19.00 Exactum A111 p Separate exams n Tue 18 November 16.00-20.00 Exactum A111 n Fri 16 January 16.00-20.00 Exactum A111 n Tue 31 March 16.00-20.00 Exactum A111 p Check exam date and place before taking the exam! (previous week or so)
  • 410.
    410 Exam regulations p Ifyou are late more than 30 min, you cannot take the exam p You are not allowed to bring material such as books or lecture notes to the exam p Allowed stuff: blank paper (distributed in the exam), pencils, pens, erasers, calculators, snacks p Bring your student card or other id!
  • 411.
    411 Grading p Grading: onthe scale 0-5 n To get the lowest passing grade 1, you need to get at least 30 points out of 60 maximum p Course exam gives you maximum of 48 points p Note: if you take the first separate exam, the best of the following options will be considered: n Exam gives you max 48 points, exercises max 12 points n Exam gives you max 60 points p In second and subsequent separate exams, only the 60 point option is in use
  • 412.
    412 Exercise points p Max.marks: 31 p 80% of 31 ~= 24 marks -> 12 points p 2 marks = 1 point
  • 413.
    413 Topics covered byexams p Exams cover everything presented in lectures (exception: biological background not covered) p Word distributions and occurrences (course book chapters 2-3) p Genome rearrangements (chapter 5) p Sequence alignment (chapter 6) p Rapid alignment methods: FASTA and BLAST (chapter 7) p Sequencing and sequence assembly (chapter 8)
  • 414.
    414 Topics covered byexams p Similarity, distance and clustering (chapter 10) p Expression data analysis (chapter 11) p Phylogenetic trees (chapter 12) p Systems biology: modelling biological networks (no chapter in course book)
  • 415.
    415 Bioinformatics courses in2008 p Biological sequence analysis (II period, Kumpula) n Focus on probabilistic methods: Hidden Markov Models, Profile HMMs, finding regulatory elements, … p Modeling of biological networks (20- 24.10., TKK) n Biochemical network modelling and parameter estimation in biochemical networks using mechanistic differential equation models.
  • 416.
    416 Bioinformatics courses inautumn 2008 p Bayesian paradigm in genetic bioinformatics (II period, Kumpula) n Applications of Bayesian approach in computer programs and data analysis of p genetic past, p phylogenetics, p coalescence, p relatedness, p haplotype structure, p disease gene associations.
  • 417.
    417 Bioinformatics courses inautumn 2008 p Statistical methods in genetics (II period, Kumpula) n Introduction to statistical methods in gene mapping and genetic epidemiology. n Basic concepts of linkage and association analysis as well as some concepts of population genetics will be covered.
  • 418.
    418 Bioinformatics courses inSpring 2009 p Practical Course in Biodatabases (III period, Kumpula) p High-throughput bioinformatics (III-IV periods, TKK) p Phylogenetic data analyses (IV period, Kumpula) n Maximum likelihood methods, Bayesian methods, program packages p Metabolic modelling (IV period, Kumpula)
  • 419.
    419 Genomes sequenced –all done? p Sequencing is just the beginning n What do genes and proteins do? p Functional genomics n How do they interact with other genes and proteins? p Systems biology Two sides of the same question!
  • 420.
    420 Bioinformatics (at leastmathematical biology) can exist outside molecular biology The metapopulation capacity of a fragmented landscape Ilkka Hanski and Otso Ovaskainen Nature 404, 755-758(13 April 2000) Melitaea cinxia, Glanville Fritillary butterfly
  • 421.
    421 Metagenomics p Metagenomics orenvironmental genomics n ”At the last count 1.8 million species were known to science. That sounds like a lot, but in truth it's no big deal. We may have done a reasonable job of describing the larger stuff, but the fact remains that an average teaspoon of water, soil or ice contains millions of micro- organisms that have never been counted or named. ” -- Henry Nicholls
  • 422.
    422 Omics p Phenome p Exposome pTextome p Receptorome p Kinome p Neurome p Cytome p Predictome p Omeome p Reactome p Connectome p Genome p Transcriptome p Metabolome p Metallome p Lipidome p Glycome p Interactome p Spliceome p ORFeome p Speechome p Mechanome http://en.wikipedia.org/wiki/-omics
  • 423.
    423 Take-home messages p Don’ttrust biodatabases blindly! n Annotation errors tend to accumulate p Consider n Statistical significance n Sensitivity of your results p Think about the whole ”bioinformatics workflow”: n Biological phenomenon -> Modelling -> Computation -> Validation of results p Results from bioinformatics tools and methods must be validated! p Actively seek cooperation with experts
  • 424.
    424 Bioinformatics journals p Bioinformatics,http://bioinformatics.oupjournals.org/ p BMC Bioinformatics, http://www.biomedcentral.com/bmcbioinformatics p Journal of Bioinformatics and Computational Biology (JBCB), http://www.worldscinet.com/jbcb/jbcb.shtml p Journal of Computational Biology, http://www.liebertpub.com/CMB/ p IEEE/ACM Transactions on Computational Biology and Bioinformatics , http://www.computer.org/tcbb/ p PLoS Computational Biology, www.ploscompbiol.org p In Silico Biology, http://www.bioinfo.de/isb/ p Nature, Science (bedtime reading)
  • 425.
    425 Bioinformatics conferences p ISMB,Intelligent Systems for Molecular Biology (Toronto, July 2008) p ICSB, International Conference on Systems Biology (Göteborg, Sweden; 22-28 August) p RECOMB, Research in Computational Molecular Biology p ECCB, European Conference on Computational Biology p WABI, Workshop on Algorithms in Bioinformatics p PSB, Pacific Symposium on Biocomputing January 5-9, 2009 The Big Island of Hawaii
  • 426.
    426 Master’s degree inbioinformatics? p You can apply to MBI during the application period November ’08 – 2 February ’09 n Bachelor’s degree in suitable field n At least 60 ECTS credits in CS or mathstat n English language certificate p Passing this course gives you the first 4 credits for Bioinformatics MSc!
  • 427.
    427 Information session onMBI p Wednesday 19.11. 13.00-15.00 Exactum D122 p www.cs.helsinki.fi/mbi/events/info08 p Talks in Finnish
  • 428.
    428 Mailing list forbioinformatics courses and events p MBI maintains a mailing list for announcement on bioinformatics courses and events p Send email to bioinfo a t cs.helsinki.fi if you want to subscribe to the list (you can unsubscribe in the same way) p List is moderated
  • 429.
    429 Computer Science Mathematics and Statistics Biology& Medicine Bioinformatics The aim of this course Where would you be in this triangle? Has your position shifted during the course?
  • 430.
    430 Feedback p Please givefeedback on the course! n https://ilmo.cs.helsinki.fi/kurssit/servlet/Valint a?kieli=en p Don’t worry about your grade – you can give feedback anonymously
  • 431.
    431 Thank you! p Ihope you enjoyed the course! taivasalla.net Halichoerus grypus, Gray seal or harmaahylje in Finnish