3. Reality
[Diagram. Labels: Nature (observation; perturbation: take it apart “in vitro”, recreate life “synthetic biology”); Catalogue (Georg Dionysius Ehret's illustration of Linnaeus's sexual system of plant classification, 1736); Model f(x) (formulate/select; validate by simulation; validate on real data; prediction; estimate).]
4. Learning Outcomes
• Understand basic concepts of molecular
biology
• Understand and apply fundamental
models, algorithms, data structures, and
computational techniques to answer
biological questions
• Wide range of topics, but special focus on
biological sequences and their evolutionary
context.
5. Topics
• Biological topics: Molecular Genetics; Gene Evolution; Genome Evolution; Mass Spectrometry; Codon Bias
• Modeling: Dynamic programming; Markov models; Least squares; Maximum Likelihood; Optimization; Heuristics; Simulation
6. Organization
• Lecture
• Wed 13-14 (CAB G52), Fri 13-15 (ML F34)
• Prof. Gonnet will hold the lectures
• Exercises:
• Thu 14-16 (CAB H56), starting this week
• If you do not have a nethz account, ask
Stefan Zoller as soon as possible.
8. Date | Topic | Lecturer
Sept. 19/21 | Course Introduction; Basic Molecular Biology | NS
Sept. 26/28 | Markov models/String Alignment I | GHG
Oct. 3/5 | String Alignment II (indels, estimating distances) | GHG
Oct. 10/12 | Substitution Matrices | GHG
Oct. 17/19 | Approximate Alignment Methods; Statistics of Pairwise Alignments | GHG
Oct. 24/26 | Phylogeny I | GHG
Oct. 31/Nov. 2 | Phylogeny II | GHG
Nov. 7/9 | Phylogeny III | GHG
Nov. 14/16 | Multiple Sequence Alignments | AS
Nov. 21/23 | Synthetic Evolution; Evaluation of Estimators | DD/GHG
Nov. 28/30 | Current research; Mass profiling | Guests/GHG
Dec. 5/7 | Orthology/Lateral Gene Transfer | NS
Dec. 12/14 | Codon bias | SZ
Dec. 19/21 | Genome Rearrangements | GHG
9. Course Grade & Credits
• Participation in the exercises is strongly
encouraged, but not mandatory
• Written Exam
• During winter session
• 3 hours
• The only permitted support material is 2 A4 pages
(4 sides), personally handwritten.
11. Darwin
• Interpreted language based on Maple
• Environment for bioinformatics, can do
sequence management, mathematics,
alignments, trees, drawing, etc.
• Available for download for Mac and Linux
(http://www.cbrg.ethz.ch/darwin)
12. Biorecipes
• A collection of real
problems with coded
solutions in the
Darwin language
• Darwin input in green
• Darwin output in red
www.biorecipes.com
13. Other materials
• Slides can be downloaded from the
course homepage.
• Additional notes and references will be
made available as well.
15. Basic Principles
• Universality of life on earth: water,
carbon-based biochemistry; genetic
material; genetic code (largely) universal.
→ common origin!
• Life is compartmentalized: cells are
fundamental units of structure,
function, organization
• Self-replicating
• Capable of Darwinian evolution
[Image: Cryptomonadales, scale bar 10 µm. Source: Encyclopedia of Life (eol.org)]
16. So what is life?
• What about endospores? viruses? mules? priests?
prions? computer viruses?
• In biology, there are exceptions to almost every rule.
“Living organisms undergo metabolism,
maintain homeostasis, possess a capacity
to grow, respond to stimuli, reproduce
and, through natural selection, adapt to
their environment in successive
generations.”
18. Relevant components
• Ribosomes translate mRNA into proteins.
• Mitochondria (eukaryotes) have their own
DNA and are a result of early inclusion of α-
proteobacteria into a eukaryotic cell.
• Chloroplasts (plants, protists) have their own
DNA as a result of early inclusion of
cyanobacteria into a eukaryotic cell.
• Plasmids (bacteria) are short pieces of circular
DNA in multiple copies; nonessential; get
transferred between bacteria.
19. Genome
• Genome: all the genetic material of an organism.
• The genome consists of genes and non-coding regions.
• Genes consist of regulatory regions, introns, exons, and untranslated regions.
[Figure labels: chromosome, chromatin, histone. Source: http://www.scfbio-iitd.res.in/tutorial/geneorganization.html]
20. Escherichia coli vs. Homo sapiens
Escherichia coli:
• 1 circular chromosome
• 1 plasmid (multiple copies)
• ~4.6 million base pairs
• ~3.9 million coding bases (85%)
• 4132 protein-coding genes
• 172 RNA (tRNA, rRNA, etc.)
• 578 pseudogenes
Homo sapiens:
• 23 chromosome pairs
• ~3 billion base pairs
• ~50 million coding bases (1.5%)
• ~21,000 protein-coding genes
• ~294,000 exons
• ~60,000 different transcripts
• ~6,000 pseudogenes
• ~4,800 RNA genes
• ~2,900 RNA pseudogenes
21. DNA
Deoxyribonucleic acid
• Double helix
• Backbones: phosphate and deoxyribose, directed (5’ → 3’), antiparallel
• Connection: 4 bases: Adenine, Thymine, Cytosine, Guanine.
• A-T and C-G are paired by hydrogen bonds (relatively weak)
[Figure (Wikipedia): double helix with labelled dimensions 34 Å (3.4 nm) and 3.3 Å (0.33 nm)]
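The pairing rules and the antiparallel orientation described above can be illustrated with a small sketch. The function name `reverse_complement` is chosen here for illustration and is not part of the course material; the sketch assumes the input contains only the four bases A, C, G and T.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' -> 3'.

    The two strands are antiparallel, so the partner is the complement
    of the input read in the opposite direction.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


print(reverse_complement("ATGC"))  # GCAT
```

Applying the function twice returns the original strand, which reflects the symmetry of the pairing.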
23. Hydrogen Bond
• X-H ···· Y where X and Y are
electronegative
atoms (typically N, O, F)
• Responsible for high
boiling point of water
(each H2O can have
up to 4 H bonds)
28. RNA
• Single stranded (can form structure)
• Uracil instead of Thymine
• mRNA: messenger RNA, for translation
• rRNA: component of the ribosome
• tRNA: specific for one amino acid,
binds selectively to its codon at the ribosome.
• microRNA: short RNAs (~22 nt)
which regulate gene expression
http://www.pdb.org/pdb/static.do?p=education_discussion/molecule_of_the_month/pdb15_2.html
29. Transcription
• Transcription factors bind to promoter sites at
the 5’ regulatory region.
• RNA polymerase binds to the complex.
• Working together, they open the DNA double
helix.
• Genes can be on either strand, but direction of
growing mRNA sequence is always 5’ → 3’
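The last point, that a gene can lie on either strand while the mRNA always grows 5’ → 3’, can be sketched in code. This is a simplified illustration: the function `transcribe` and its parameter `gene_on_plus` are hypothetical names, and promoters, untranslated regions and processing are ignored.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def transcribe(plus_strand: str, gene_on_plus: bool = True) -> str:
    """Return the mRNA (5' -> 3') for a gene on either strand.

    plus_strand is the forward strand written 5' -> 3'. The mRNA has the
    same sequence as the gene's coding strand, with U in place of T.
    """
    seq = plus_strand.upper()
    if not gene_on_plus:
        # Gene lies on the opposite strand: its coding strand is the
        # reverse complement of the forward strand.
        seq = "".join(COMPLEMENT[b] for b in reversed(seq))
    return seq.replace("T", "U")


print(transcribe("ATGGCC"))         # AUGGCC
print(transcribe("GGCCAT", False))  # AUGGCC
```

Both calls give the same mRNA because the second input is the reverse complement of the first.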
30. Roger Kornberg
Nobel Prize Chemistry 2006
The chain shown in grey is RNA polymerase,
with the portion that clamps on the DNA
shaded in yellow. The DNA helix being
unwound and transcribed by RNA
polymerase is shown in green and blue, and
the growing RNA strand is shown in red.
http://med.stanford.edu/featured_topics/nobel/kornberg/release.html
31. Post-transcriptional
modifications (Eukaryotes)
• 5’ Cap
• Poly-A tail
• Splicing (removal of introns)
Research questions: Where are the introns? Where are the
coding sequences? Where does transcription start and
stop? Where are the binding sites for the transcription
factors that control when transcription takes place?
32. Alternative Splicing
• Humans: >50% of genes have splice variants.
• Dscam gene in D. melanogaster: 95 alternative
exons can express 38,016 different mRNAs through
alternative splicing.
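The Dscam numbers can be checked with a short calculation. The per-cluster counts below are not on the slide; they are the commonly cited values for the four clusters of mutually exclusive exons (exons 4, 6, 9 and 17) and should be read as an assumption. One variant is chosen from each cluster, so the number of possible mRNAs is the product of the cluster sizes.

```python
from math import prod

# Alternative variants per exon cluster of Dscam (assumed values, see text).
variants = {"exon 4": 12, "exon 6": 48, "exon 9": 33, "exon 17": 2}

total_alternative_exons = sum(variants.values())  # 12 + 48 + 33 + 2
possible_mrnas = prod(variants.values())          # 12 * 48 * 33 * 2

print(total_alternative_exons, possible_mrnas)  # 95 38016
```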
35. Proteins
• Participate in most (all?) cellular processes
• Made of 20 amino-acids (+ occasionally a
cofactor, such as metal ion, heme, ATP, etc.)
• Encoded in DNA
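As a rough illustration of "encoded in DNA", the sketch below translates a coding sequence into a protein using the standard genetic code. The names `CODON_TABLE` and `translate` are chosen here for illustration; the sketch assumes the input is the coding strand in the correct reading frame and contains only A, C, G and T.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' is a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate codon by codon (one-letter amino acids) until a stop codon."""
    seq = coding_dna.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA
```

The table maps 64 codons to 20 amino acids plus the stop signal, so several codons encode the same amino acid.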
51. Darwinian Evolution
• Start from an initial population
• Repeat:
• reproduce and “mutate” randomly
• natural selection: fittest individuals
survive and have descendants
→ selects “good” mutations
• sometimes: a “branching” occurs (e.g.
speciation, duplication)
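The loop above can be sketched as a toy simulation. Everything specific here is an assumption made for illustration: the `TARGET` genotype that defines fitness, the mutation rate, and the selection scheme in which parents and offspring compete and the fittest `pop_size` individuals survive. Branching events are not modelled.

```python
import random

TARGET = "ACGTACGTACGT"  # hypothetical "optimal" genotype used to score fitness


def fitness(genotype: str) -> int:
    """Number of positions matching TARGET (a toy fitness function)."""
    return sum(a == b for a, b in zip(genotype, TARGET))


def mutate(genotype: str, rate: float, rng: random.Random) -> str:
    """Copy a genotype; each site is redrawn at random with probability rate.

    The redrawn base may equal the original one.
    """
    return "".join(
        rng.choice("ACGT") if rng.random() < rate else base
        for base in genotype
    )


def evolve(pop_size=20, generations=30, rate=0.05, seed=0):
    """Return the best fitness in the population after each generation."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("ACGT") for _ in TARGET)
                  for _ in range(pop_size)]
    best_per_generation = [max(map(fitness, population))]
    for _ in range(generations):
        # Reproduce and mutate randomly.
        offspring = [mutate(g, rate, rng) for g in population]
        # Selection: parents and offspring compete; the fittest survive.
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:pop_size]
        best_per_generation.append(fitness(population[0]))
    return best_per_generation


print(evolve())
```

Because the parents stay in the competition, the best fitness in this sketch can never decrease from one generation to the next.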
52. Not only the “good”
characters survive
• Genetic drift (random sampling)
• Population bottleneck
• Founder effect
• Genetic hitchhiking (neutral or mildly
deleterious alleles linked to positively
selected gene)
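Genetic drift can be illustrated with a minimal Wright-Fisher-style simulation. This model is not named on the slide and is used here as an assumption: a fixed number of gene copies, non-overlapping generations, and no selection. The function name and parameters are hypothetical.

```python
import random


def wright_fisher(pop_size=20, start_freq=0.5, generations=100, seed=1):
    """Track the frequency of a neutral allele under pure random sampling."""
    rng = random.Random(seed)
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Each of the pop_size gene copies in the next generation is drawn
        # at random from the current generation (no selection).
        copies = sum(rng.random() < freq for _ in range(pop_size))
        freq = copies / pop_size
        history.append(freq)
    return history


print(wright_fisher()[:10])
```

Once the frequency reaches 0 or 1 it stays there, which is how an allele can be lost or fixed by chance alone. Smaller values of `pop_size`, as in a bottleneck or a founder population, make the fluctuations larger.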
53. Species Evolution
• Speciation: the
evolutionary process by
which new species arise
• Can occur from
geographic isolation or
barriers, new niche
entered, animal
husbandry
Diane Dodd’s fruit fly experiment
http://evolution.berkeley.edu/evolibrary/article/_0_0/evo_45
54. Genome Rearrangements
e.g. Human vs. Dog
Krzywinski et al. Circos: an information aesthetic for comparative genomics. Genome Research (2009) vol. 19 (9) pp. 1639-45
65. Evolutionary Distances
How can we quantify the amount of evolution
between two subjects?
• Time since divergence
• Number of common traits
• Edit distance (minimum # of elementary
operations to transform one object into the
other)
• ...
• Desirable properties:
• distance estimable without knowing history
• metric properties (e.g. triangle inequality)
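The edit distance mentioned on this slide can be computed by dynamic programming. The sketch below assumes the elementary operations are single-character insertion, deletion and substitution, each with cost 1 (the Levenshtein distance); the slide does not fix a particular set of operations.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("ACGT", "AGT"))  # 1
```

This distance needs no knowledge of the evolutionary history, and with unit costs it satisfies the metric properties, including the triangle inequality.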
66. Markovian Evolution
Markov Model: every site evolves independently;
the probability of mutation depends only on the present
state (no memory); the mutation probabilities are
given by a transition matrix.
M1 =
      A      C      G      T
A   0.900  0.033  0.033  0.033
C   0.033  0.900  0.033  0.033
G   0.033  0.033  0.900  0.033
T   0.033  0.033  0.033  0.900
After “one unit” of evolution, the
probability that an A mutates into a
C is given by the corresponding
entry in the matrix:
p(A→C | d=1) = M1[A→C] = 0.033
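Under the no-memory assumption, probabilities for more than one unit of evolution follow from repeated multiplication of the one-unit matrix. The sketch below assumes that d is a whole number of units and that the same matrix applies in every unit; the helper names `matmul`, `transition_matrix` and `p` are chosen here for illustration.

```python
STATES = "ACGT"
# One-unit transition matrix from the slide (entries are rounded, so each
# row sums to 0.999 rather than exactly 1).
M1 = [
    [0.900, 0.033, 0.033, 0.033],
    [0.033, 0.900, 0.033, 0.033],
    [0.033, 0.033, 0.900, 0.033],
    [0.033, 0.033, 0.033, 0.900],
]


def matmul(x, y):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transition_matrix(d: int):
    """Return M1 raised to the power d (d >= 1)."""
    result = M1
    for _ in range(d - 1):
        result = matmul(result, M1)
    return result


def p(src: str, dst: str, d: int) -> float:
    """Probability that src has become dst after d units of evolution."""
    return transition_matrix(d)[STATES.index(src)][STATES.index(dst)]


print(p("A", "C", 1))  # 0.033
print(round(p("A", "C", 2), 6))  # 0.061578
```

For d=2 the value sums over the four possible intermediate states: A→A→C, A→C→C, A→G→C and A→T→C.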