Bioinformatics Before, During and After The Era of Genomics, One Person's View - Terry Speed

  1. Bioinformatics before, during and after the era of genomics
     One person's view
     BioInfoSummer, 2013
  2. Caveats
     This is a personal potted history, along with commentary, opinions and summary remarks. For the real history, perhaps start with Wikipedia, then go to secondary or primary sources.
     All statements are implicitly prefaced by "In my humble (but perhaps not humble enough) opinion".
     Finally, I will use a lot of terms that I do not properly define. My apologies, but there is insufficient time for such care. This is a superficial overview: care with details comes from others here.
  3. Bioinformatics is driven by two things:
     new methods for generating molecular data
     and
     the desires of scientists to use newly generated data to address questions of interest to them.
  4. As a result, there are general problems in the field that need revisiting, and general issues that need attention, with each new kind of data. There are general techniques, which can be adapted to new contexts, and there are a host of specific problems arising with each new kind of data and each new application, many of which demand new techniques.
  5. Challenges with Bioinformatics
     •  It is highly interdisciplinary
        Pro: there is always lots of interesting stuff to learn
        Con: no clear boundaries. It merges into biology, biochemistry, biophysics, biomathematics, biostatistics, biocomputing, ...
     •  It sits within a rapidly changing landscape
        Pro: there is always something new to dive into
        Con: by the time you eventually solve your problem, no-one may care (so work quickly!)
     •  It is currently not so well funded in Australia
        Pro: this should focus your mind on doing really good work
        Con: you might not realize your dreams too readily (dguydj)
  6. •  Before
        •  Molecular evolution
        •  Protein structure and classification
        •  Sequence alignment
        •  Database searching
     •  During
        •  Genetic mapping
        •  Physical mapping
        •  DNA sequencing
        •  Genome assembly
        •  Gene-finding
     •  After
        •  Sequencing, sequencing, sequencing, sequencing
        •  Functional genomics, …
        •  Proteomics, Metabolomics, …
        •  Imaging, …
  7. Before: 1950s
  8. Proteins are single chains of amino acids
     This point of view arose out of the work of Frederick Sanger, who determined the amino acid sequence of (bovine) insulin at the beginning of the decade.
  9. Protein structure by X-ray crystallography
     The view of proteins as amino acid (polypeptide) chains was strengthened by later work of Max Perutz and John Kendrew, who determined the structure of (horse) haemoglobin and (sperm-whale) myoglobin respectively, publishing in 1960. At that time, the amino acid sequences of these globins were not known.
  10. Another beginning: DNA
      Arising from James Watson and Francis Crick's famous determination of the structure of DNA in 1953.
      But it took another two decades to have a method for determining DNA sequences (Sanger again).
  11. The 20 common amino acids
  12. Before: 1960s, Molecular evolution
      "The idea that macromolecules have an evolutionary history fully as traceable as that of bone structure or any other large-scale feature developed on the heels of the first determinations of amino acid sequences of proteins."
      Hemoglobin: Structure, Function, Evolution and Pathology by RE Dickerson and I Geis 1983, p.66.
  13. From Perutz et al, Nature 1959
      The polypeptide chain…first discovered in sperm whale myoglobin has since been found in seal myoglobin. Its appearance in horse haemoglobin suggests that all haemoglobins and myoglobins of vertebrates follow a similar pattern. How is this possible? ….This suggests the occurrence of similar sequences throughout this group of proteins….
      …. their structural similarity suggests that they have developed from a common genetic precursor.
  14. Comparison between the Amino-Acid Sequences of Sperm Whale Myoglobin and Human Haemoglobin
      Watson & Kendrew Nature 1961
  15. Was this the first a.a. sequence alignment?
      In the paper we find the sentence:
      ..the human haemoglobin chains have been placed alongside that of myoglobin ...in what seems to us the most plausible manner….we have tentatively assumed that interpolations or deletions occur only in or near inter-helical regions since even a single change in the middle of a helix would.…
      A modern alignment will be given later.
  16. Evolution using molecules
      •  DNA is passed from parents to offspring more or less unchanged.
      •  Molecular evolution is dominated by mutations that are neutral from the standpoint of natural selection (neutral hypothesis: Kimura, 1968)
      •  Mutations accumulate at fairly steady rates in surviving lineages (molecular clock hypothesis: Zuckerkandl & Pauling, 1962)
      •  We can study the evolution of (macro) molecules and reconstruct the evolutionary history of organisms using their molecules.
  17. Margaret O Dayhoff (1925-1983)
      Has been called the mother and father of bioinformatics.
      “She anticipated the potential of computers to the current theories of Zuckerkandl & Pauling and the method which Sanger had engineered. With Richard Eck, she published the first reconstruction of a phylogeny (evolutionary tree) by computers from molecular sequences, using a maximum parsimony method. She also formulated the first probability model of protein evolution, the PAM model, in 1966.
      She initiated the collection of protein sequences in the Atlas of Protein Sequence and Structure, a book collecting all known protein sequences that she published in 1965.”
      All from Wikipedia
  18. Protein evolution: some terms
      Two proteins that share a common ancestry are said to be homologues. (Important to distinguish from similarity.)
      We have two different types of homologous proteins.
      Orthologues: the same gene in different organisms; common ancestry goes back to speciation.
      A common mode of protein evolution is by duplication.
      Paralogues: different genes in the same organism; common ancestry goes back to a gene duplication.
      Lateral gene transfer gives another form of homology, one not wholly describable by a tree (need networks).
  19. Some human globins (paralogues)
      [Alignment figure: the first ~45 residues of human alpha-, beta-, delta-, epsilon- and gamma-globin and myoglobin]
      The sequences continue up to ~146 a.a.
      "." means ditto, "-" means deletion.
      Producing multiple sequence alignments like this (and much longer, and with many more sequences) remains a major continuing task for bioinformatics.
  20. Some vertebrate beta-globins (orthologues)
      [Alignment figure: beta-globins from human, macaque, cow, platypus, chicken and shark]
  21. Beta-globins: Uncorrected pairwise distances
      Distances: between protein sequences. Calculated over: residues 1 to 147
      Below diagonal: observed number of differences
      Above diagonal: number of differences per 100 amino acids

             hum   mac   cow   pla   chi   sha
      hum   ----     5    16    23    31    65
      mac      7  ----    17    23    30    62
      cow     23    24  ----    27    37    65
      pla     34    34    39  ----    29    64
      chi     45    44    52    42  ----    61
      sha     91    88    91    90    87  ----
  22. Beta-globins: Corrected pairwise distances
      Distances: between protein sequences. Calculated over: residues 1 to 147
      Below diagonal: observed number of differences
      Above diagonal: estimated number of substitutions per 100 amino acids
      Correction method: Jukes-Cantor (1969)

             hum   mac   cow   pla   chi   sha
      hum   ----     5    17    27    37   108
      mac      7  ----    18    27    36   102
      cow     23    24  ----    32    46   110
      pla     34    34    39  ----    34   106
      chi     45    44    52    42  ----    98
      sha     91    88    91    90    87  ----
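The step from the uncorrected to the corrected table can be sketched in a few lines of Python. This is a minimal illustration, assuming the 20-state (amino acid) form of the Jukes-Cantor correction and assuming that gapped columns are simply skipped; the slides do not say how gaps were handled, and the function names `p_distance` and `jukes_cantor_protein` are made up for this example.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption).
    """
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    differences = sum(x != y for x, y in pairs)
    return differences / len(pairs)


def jukes_cantor_protein(p, n_states=20):
    """Jukes-Cantor style correction for multiple hits at the same site.

    With n_states equally likely residues the estimated number of
    substitutions per site is d = -b * ln(1 - p / b), b = (n - 1) / n.
    """
    b = (n_states - 1) / n_states
    return -b * math.log(1 - p / b)


# Human vs cow: 23 observed differences over 147 residues.
p = 23 / 147
print(round(100 * p), round(100 * jukes_cantor_protein(p)))
```

For small distances the correction barely matters, but it grows quickly as the proportion of observed differences approaches saturation, which is why the shark column changes far more than the macaque column.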
  23. The Rates of Macromolecular Evolution
      [Figure: corrected amino acid changes per 100 residues plotted against millions of years since divergence (0 to 1400), with geological periods and major divergence events marked, for fibrinopeptides (1.1 MY), the globins and cytochrome c (20.0 MY). After Dickerson (1971).]
  24. Unrooted UPGMA tree for beta-globins
      A gene tree across species
      [Tree figure with tips: shark, chicken, platypus, cow, macaque, human]
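As a rough illustration of how such a tree can be built from a distance table, here is a small UPGMA sketch in Python applied to the observed difference counts from the beta-globin table. It is an assumption that the tree on the slide was built from these particular numbers; the function name `upgma` and the nested-frozenset representation of the tree are choices made for this example only, and branch lengths are not computed.

```python
def upgma(names, lower_triangle):
    """Cluster by repeatedly merging the two closest clusters (UPGMA).

    lower_triangle[i][j] is the distance between names[i] and names[j], j < i.
    Returns the tree topology as nested frozensets.
    """
    dist = {
        frozenset((names[i], names[j])): float(lower_triangle[i][j])
        for i in range(len(names))
        for j in range(i)
    }
    sizes = {name: 1 for name in names}  # active clusters and their sizes
    while len(sizes) > 1:
        closest = min(dist, key=dist.get)
        a, b = closest
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del dist[closest]
        merged = frozenset((a, b))
        for other in sizes:
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            # Size-weighted average keeps the distance equal to the mean
            # over all pairs of original sequences.
            dist[frozenset((merged, other))] = (
                size_a * d_a + size_b * d_b
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (tree,) = sizes
    return tree


names = ["human", "macaque", "cow", "platypus", "chicken", "shark"]
observed_differences = [
    [],
    [7],
    [23, 24],
    [34, 34, 39],
    [45, 44, 52, 42],
    [91, 88, 91, 90, 87],
]
tree = upgma(names, observed_differences)
```

On these counts the merges happen in the order human with macaque, then cow, platypus, chicken and finally shark, which agrees with the order of the tips in the figure.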
  25. Before: 1970s, Sequence alignment
      Gibbs & McIntyre (1970): The Diagram (now the DotPlot)
      Needleman & Wunsch (1970): Global alignment using a dynamic programming algorithm
  26. Cytochrome c dot plots from Gibbs & McIntyre
  27. Needleman and Wunsch (1970)
      Figure 1
      Figure 2
  28. Human beta haemoglobin vs. pea leghaemoglobin

       1 VHLTPEEKSAVTALWG.KVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL 78
         :|..:.. |.. .: | |:.: : .. :| | .. :|. : | ..:| :.||:.||:..|:| ..|: |:|
       1 .GFTDKQEALVNSSSEFKQNLPGYSILFYTIVLEKAPAAKGLFSFLKD...TAGVEDSPKLQAHAEQVFGLVRDSAAQL 74

      79 ...DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANAL..AHKYH 146
         ::: . |||:.:|.:| |. .:| :: :.|: .: . |.::...:..|:: . .|:|.|: | |
      75 RTKGEVVLGNATLGAIHVQK.GVTNPHFVVVKEALLQTIKKASGNNWSEELNTAWEVAYDGLATAIKKAMKTA 146
  29. Globin fold
      α protein: myoglobin (PDB: 1MBN)
  30. β sandwich
      β protein: immunoglobulin (PDB: 7FAB)
  31. Before: 1980s, Molecular databases
  32. Sequence databases
      In the mid-1970s, Sanger and Maxam-Gilbert DNA sequencing were invented.
      In 1965 M.O. Dayhoff published her first Atlas of Protein Sequence and Structure. It became the PIR.
      By the end of the 1970s, a rapidly increasing amount of DNA sequence data joined the steadily growing amount of sequence in protein databases.
  33. Bioinformatics comes of age (but the name doesn't yet exist)
      •  1982 GenBank opens
         Many developments in database searching, pairwise and multiple alignment and protein structure prediction
      •  1987 First of many high-resolution human genetic maps
      •  1988 NCBI starts
  34. Substitution/Scoring Matrices
      PAM matrices (Dayhoff et al. 1978): phylogeny-based.
      PAM1: expected number of mutations = 1% (one accepted point mutation per 100 residues)
      [Figure: PAM250 matrix, log-odds representation]
  35. Databases, cont.
      Not only are databases valuable repositories for knowledge transfer, they can help generate new knowledge.
      We can see the Darwinian notion of descent with modification very clearly, that the dominant mode of molecular evolution was duplicate and modify, and find unexpected match-ups occurring, such as the early one on the next page.
  36. Simian Sarcoma Virus onc Gene, v-sis, Is Derived from the Gene (or Genes) Encoding a Platelet-Derived Growth Factor
      However, previous efforts to demonstrate any functional or evolutionary relatedness between transforming gene products and any growth factor have been unsuccessful.
      Doolittle et al, Science 1983
  37. Simian Sarcoma Virus onc Gene, v-sis, Is Derived from the Gene (or Genes) Encoding a Platelet-Derived Growth Factor
      Doolittle et al, Science 1983
  38. 1990s: The explosion
      The decade of FASTA, BLAST, CLUSTAL W and HMMs
      Database searching on a large scale
      Whole genome mapping and sequencing
      Proteomics, gene expression arrays, and much more.
  39. BLO(ck)SU(bstitution)M(atrix) (Henikoff & Henikoff 1992)
      Derived from a set (2000) of aligned, ungapped regions from protein families; it puts more emphasis on chemical similarity (versus how easy it is to mutate from one residue to another). BLOSUMx is derived from segments clustered at x% identity.
      [Figure: BLOSUM62 matrix, log-odds representation]
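The "log-odds representation" mentioned for both PAM and BLOSUM can be illustrated with a toy calculation. The sketch below follows the general BLOSUM recipe (observed pair frequency divided by the frequency expected from the residue frequencies, in half-bit units), but the two-letter alphabet and the frequencies are invented for the example, and the function name `log_odds_scores` is hypothetical.

```python
import math


def log_odds_scores(pair_freq, scale=2.0):
    """Turn observed aligned-pair frequencies into rounded log-odds scores.

    pair_freq maps unordered residue pairs (a, b), written once each,
    to their observed frequency q_ab; the frequencies sum to 1.
    Scores are scale * log2(q_ab / e_ab); scale = 2 gives half-bits.
    """
    # Marginal frequency of each residue implied by the pair frequencies.
    residue_freq = {}
    for (a, b), q in pair_freq.items():
        if a == b:
            residue_freq[a] = residue_freq.get(a, 0.0) + q
        else:
            residue_freq[a] = residue_freq.get(a, 0.0) + q / 2
            residue_freq[b] = residue_freq.get(b, 0.0) + q / 2

    scores = {}
    for (a, b), q in pair_freq.items():
        # Expected frequency if residues paired up independently.
        if a == b:
            expected = residue_freq[a] ** 2
        else:
            expected = 2 * residue_freq[a] * residue_freq[b]
        scores[(a, b)] = round(scale * math.log2(q / expected))
    return scores


# Invented frequencies for a two-letter alphabet.
toy = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.2}
scores = log_odds_scores(toy)
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative ones, which is what lets an alignment score be read as evidence for or against common ancestry.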
  40. DNA markers: RFLPs
  41. The point: there were lots of these markers
      From Donis-Keller et al, Cell 1987
  42. Chromosomes are mosaics of grandparental chromosomes, and the breakpoints can be inferred.
      From Donis-Keller et al, Cell 1987
  43. Bioinformatics during the genome era
      Life went on as before, but a lot of bioinformatic effort was devoted to the development of tools to deal with the growing amounts of complete genome sequence data. Why?
      Why do people rob banks?
      •  Genetic mapping
      •  Physical mapping
      •  DNA sequencing
      •  Genome assembly
      •  Gene-finding
  44. DNA sequencing: the march of technology
  45. Sanger DNA sequencing ~1975: radio-labelled chain terminators; slab gel electrophoresis
      http://en.wikipedia.org/wiki/Sequencing
  46. Automated DNA sequencing machines: initially slab gel, later capillary electrophoresis
  47. Automated Sanger DNA sequencing ~2000: fluorescently labelled chain terminators, capillary electrophoresis
  48. Sequencing technology growth, per machine per year
  49. Helicos, AB SOLiD, Illumina, Pacific Biosciences, Oxford Nanopore, Roche 454
  50. HiSeq2000, MiSeq
  51. Post-genome bioinformatics
      A very large fraction is about sequencing
      •  Take some “cells”
      •  Do something to them
      •  Sequence the resulting DNA
      •  Do something with the resulting reads
  52. The “cells” you take can be
      •  Very heterogeneous pools, e.g. from soil, gut, seawater, …
      •  Mixed cell populations, e.g. PBMC from blood
      •  Cell lines
      •  Sorted cells, e.g. CD4+ T-cells from blood
      •  Single cells, e.g. neurons, B-cells, …
      •  Not cells at all, e.g. cell-free DNA from plasma
  53. The something you do to the cells may be
      •  Not much, e.g. metagenomics
      •  Chromatin immunoprecipitation, e.g. for transcription factor binding, histone modification
      •  RNA extraction, to do RNA-seq, to find noncoding RNAs
      •  RNA interference, e.g. siRNA to silence genes
      •  Bisulphite conversion to measure methylation
      •  …. (much more)
  54. What you do with the resulting reads
      •  Map them back to a reference genome
      •  Assemble them de novo into a genome
      •  Try to sort out what is there (a mix of spp)
      •  Map them to a custom transcriptome
      •  Map them to some other custom database
      •  …..
  55. Metagenomics @ Melbourne
      Wed 27 November 2013, 12.30pm to 6.00pm
      Bio21 Molecular Science and Biotechnology Institute, 30 Flemington Road, Parkville
      Enquiries to kholt@unimelb.edu.au, www.bio21.org
      Guest speakers: Ian Paulsen, Macquarie University; Gene Tyson, UQ Centre for Ecogenomics; Aaron Darling, i3 Institute, UTS; Brendan Burns, UNSW; Rob Moore, CSIRO
      Topics: Human microbiome; Animal & plant microbiome; Ecogenomics; Marine metagenomics; Bioinformatics
      Plus short talks from 10 local researchers
      Register by November 20: research.mdhs.unimelb.edu.au/event/advancing systems biology
      Registration is FREE and includes lunch, afternoon tea and evening refreshments. All welcome.
      A Bio21 Institute Structural and Cellular Biology Theme Event. Part of Advancing Systems Biology @ Melb.
  56. Forshew et al. Sci Transl Med (2012)
  57. And then there is lots of other stuff, too numerous to mention
      •  The other omics (prote-, metabol-, lipid-, phen-…)
      •  many imaging modalities (is this bioinformatics?)
      •  ….
      •  and, perhaps hardest of all, combinations of the above
  58. High-content imaging screens
      •  Robotic instruments treat cells (e.g. RNAi, drugs, peptides)
      •  Phenotype measurement is image-based
      Image from Drug Discovery World
  59. Combining image and molecular data
      Yuan et al. Sci Transl Med 4, (2012)
  60. A general problem in the field, that needs revisiting with each new kind of data
      Sequence alignment
  61. first sequence  = GAATTCAGTTA
      second sequence = GGATCGA

      An optimal alignment:
        G A A T T C A G T T A
        G G A T _ C _ G _ _ A
      (score: match = 2, mismatch = -1, indel = -2)

      A different optimal alignment:
        _ G A A T T C A G T T A
        G G A _ T _ C _ G _ _ A
      (score: match = 1, mismatch = 0, indel = 0)
  62. Sequence alignment methods
  63. General issues that need attention, with each new kind of data
      •  Data quality assessment
      •  Data cleaning/normalization/adjusting
      •  Data visualization
  64. Some general techniques which can be adapted to new contexts
      •  Dynamic programming
      •  Hidden Markov models
      •  Dealing with multiplicity (p-value adjustment)
      •  Resampling/bootstrapping
      •  Network analyses
      •  Bayesian inference
  65. Some applications of hidden Markov models in bioinformatics
      •  mapping chromosomes
      •  aligning biological sequences
      •  predicting sequence structure
      •  inferring evolutionary relationships
      •  finding genes in DNA sequence
  66. A very short profile HMM
      M = Match state, I = Insert state, D = Delete state.
      To operate, go from left to right. I and M states output amino acids; B, D and E states are silent.
  67. How profile HMMs work, in brief
      •  Instances of the motif are identified by calculating
         log{pr(sequence | M) / pr(sequence | B)},
         where M and B are the motif and background HMMs.
      •  Alignments of instances of the motif to the HMM are found by calculating
         arg max_{states} pr(states | instance, M).
      •  Estimation of HMM parameters is by calculating
         arg max_{parameters} pr(sequences | M, parameters).
      In all cases, we use the efficient HMM algorithms.
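Two of the "efficient HMM algorithms" referred to here are the forward algorithm (for pr(sequence | model)) and the Viterbi algorithm (for the most probable state path). The sketch below shows both on a toy two-state HMM over a two-letter alphabet. It is a deliberate simplification: a real profile HMM has match, insert and silent delete states, and all the state names, probabilities and function names here are invented for illustration.

```python
import math


def forward(obs, start, trans, emit):
    """pr(obs | model), summing over all state paths."""
    states = list(start)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            t: sum(alpha[s] * trans[s][t] for s in states) * emit[t][symbol]
            for t in states
        }
    return sum(alpha.values())


def viterbi(obs, start, trans, emit):
    """Most probable state path and its joint probability with obs."""
    states = list(start)
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        step = {}
        for t in states:
            prob, path = max(
                ((best[s][0] * trans[s][t], best[s][1]) for s in states),
                key=lambda item: item[0],
            )
            step[t] = (prob * emit[t][symbol], path + [t])
        best = step
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


def log_odds(obs, model, background):
    """log{pr(obs | model) / pr(obs | background)}."""
    return math.log(forward(obs, *model) / forward(obs, *background))


# Toy "motif" model: state H prefers symbol h, state P prefers symbol p.
motif = (
    {"H": 0.5, "P": 0.5},
    {"H": {"H": 0.9, "P": 0.1}, "P": {"H": 0.1, "P": 0.9}},
    {"H": {"h": 0.8, "p": 0.2}, "P": {"h": 0.2, "p": 0.8}},
)
# Toy background model: one state, both symbols equally likely.
background = (
    {"B": 1.0},
    {"B": {"B": 1.0}},
    {"B": {"h": 0.5, "p": 0.5}},
)

path, prob = viterbi("hhpp", *motif)
```

Sequences with long runs of one symbol score above zero against this background, while rapidly alternating sequences score below it. The third task on the slide, parameter estimation, is usually handled by an iterative procedure such as Baum-Welch and is not shown.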
  68. Dealing with multiplicity (p-value adjustment)
      BLAST (Basic Local Alignment Search Tool) aligns short sequences to databases. It uses extreme value theory in calculating its E-value.
      Mascot uses mass spectrometry data to identify proteins from peptide sequence databases. It has calibrated its probability score to deal with the huge multiplicity. The details are not fully known, but it works.
      With microarray data the Benjamini-Hochberg false discovery rate (FDR) has become widely used. In other areas Bonferroni still reigns.
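The two adjustments named last are simple enough to sketch directly. The following Python is a minimal illustration of Bonferroni and Benjamini-Hochberg adjusted p-values, written from the standard definitions rather than taken from any particular package; the function names and the example p-values are made up.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the original order.

    The k-th smallest p-value is scaled by m / k, then a running minimum
    from the largest downwards makes the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvalues))
print(benjamini_hochberg(pvalues))
```

With these four made-up p-values, all four tests pass a 5% FDR threshold after Benjamini-Hochberg, while only two survive Bonferroni at the 5% level, which illustrates why FDR control became popular when thousands of genes are tested at once.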
  69. A specific problem arising with a new kind of data and a new application, which demanded a new technique.
      Gene set tests
  70. Suppose that we do thousands of tests, comparing gene expression levels between treated and control cells, and find no (“significantly”) differentially expressed genes. What might this mean?
      Or, suppose we find 100s of (“significantly”) differentially expressed genes. What might this mean?
      In each case there are gene-set tests that can be used which might illuminate the situation.
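One simple kind of gene-set test, relevant to the second scenario, asks whether a predefined set of genes is over-represented among the differentially expressed ones. The sketch below computes the corresponding one-sided hypergeometric p-value. It is only one of several possible gene-set tests and not necessarily the kind the talk had in mind; the function name and the numbers are invented for illustration.

```python
from math import comb


def overrepresentation_p(n_genes, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value for gene-set over-representation.

    Probability of seeing at least n_overlap set members among n_selected
    genes drawn at random from n_genes, of which n_in_set are in the set.
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_genes, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_genes - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Invented example: 10 genes, 3 in the set, 3 called differentially
# expressed, and all 3 of those belong to the set.
p_value = overrepresentation_p(10, 3, 3, 3)
```

In the first scenario, where no single gene reaches significance, tests that pool weak evidence across all members of a set are the more natural choice; those are not shown here.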
  71. Finally
  72. Two important general themes
      •  Evolution and the comparative method
         "Nothing in Biology Makes Sense Except in the Light of Evolution" Theodosius Dobzhansky 1973
      •  The use of positive and negative controls (“truth”) in order to see whether methods work in practice.
         "In theory, there is no difference between theory and practice. But in practice, there is." Yogi Berra (undated)
  73. THANKS FOR LISTENING!
      The Deluge (1840) by Francis Danby (1793-1861)
