3. Your Genome and You
23 chromosomes
20,000 genes
3.1 billion nucleotides
Mycobacterium tuberculosis
1 chromosome
4,000 genes
4.4 million nucleotides
Tremblaya princeps
1 chromosome
110 genes
138,931 nucleotides
Daphnia pulex
12 chromosomes
31,000 genes
200 million nucleotides
Paris japonica
?? chromosomes
??? genes
150 billion nucleotides
4. DNA Encodes the Business of the Cell
Chromosome
Chromosome region
Gene GGATCCTATGGATGCATGCCGCCGTAGTATAAT…
Protein
Protein functions
Copying the genome and the cell
Transport into and out of the cell
Energy production and storage
Cellular defense
etc…
5. Three key questions
(1) What genes in an organism’s genome are responsible for its
unique properties? For example:
- Ability to withstand environmental challenges
- Developmental “plan”
- Sources of nutrients
(2) How can we use properties of an organism’s genome as a
“fingerprint” to identify that organism?
(3) What mutations to an organism’s genome (including single base
changes) are responsible for altered properties of that organism?
6. Microbes: hot or not?
+ ++ +++ +++++++
Strain 121
MacDonald, NJ and Beiko, RG (2010). Efficient learning of microbial
genotype–phenotype association rules. Bioinformatics 26: 1834-1840.
7. Beating the heat
Proteins tend to stop working at temperatures above 37-40° C
Heat shock – “Things are getting uncomfortable here”
Extreme heat shock – “Make it stop make it stop make it stop!!!!”
What does an organism need to get by at higher temperatures?
(1) Specific proteins that help keep everything working
(2) Changes to all proteins that make them more heat tolerant
(3) Various other things
8. The “genotype-phenotype association” problem
Genotype: An organism’s DNA sequence, somehow defined
Phenotype: An organism’s physical properties
In this case, “genotype” will refer to the presence of genes that
are similar enough that they likely share the same function
10. A suitable approach
Problem: a typical dataset will contain between 50 and 500 genomes,
and presence / absence data for >10,000 genes
We need an approach that can detect interactions among genes, so
the potential feature space is very large. Searching all 2^10,000 rule
combinations is obviously not going to happen.
ASSOCIATION RULE MINING (Agrawal et al 1993):
Discover associative rules between items, e.g. {Milk, Eggs} -> {Flour}
Classification Based on Predictive Association Rules (Yin and Han,
2003): iteratively generate rules to “cover” each subset of the data
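To make the {Milk, Eggs} -> {Flour} example concrete, here is a minimal sketch of the two quantities usually attached to an association rule: support and confidence. The basket data and the function names are invented for illustration; this is not the CPAR implementation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical shopping baskets, made up for this example.
baskets = [
    {"Milk", "Eggs", "Flour"},
    {"Milk", "Eggs", "Flour", "Sugar"},
    {"Milk", "Eggs"},
    {"Milk", "Bread"},
    {"Eggs", "Flour"},
]

rule_support = support(baskets, {"Milk", "Eggs", "Flour"})
rule_confidence = confidence(baskets, {"Milk", "Eggs"}, {"Flour"})
```

In the genomic setting the "items" would be genes present in a genome and the consequent would be the phenotype label.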
11. Classification based on Predictive Association Rules (CPAR)
[Figure: iterative rule growth. Candidate rules F; F, Q; F, Z; and A are extended until none of the remaining candidates is above the gain threshold.]
Rules discovered:
1. F, Q -> POSITIVE
2. F, Z -> POSITIVE
3. A -> POSITIVE
Covered samples get their weight reduced before the next iteration
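The covering loop on this slide can be sketched roughly as follows. This is a simplified, assumed version of the idea (greedy rule growth scored by FOIL gain, then down-weighting of covered positive samples), not the published CPAR algorithm or the authors' code. The toy genomes, the gain threshold of 0.7, and the decay factor of 0.5 are all hypothetical.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """FOIL gain of adding one gene to a rule. p/n are weighted positive and
    negative totals before (0) and after (1) adding the gene."""
    if p0 <= 0 or p1 <= 0:
        return 0.0
    return p1 * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))


def grow_rule(samples, labels, weights, min_gain):
    """Greedily add the best-scoring gene until no gene beats min_gain."""
    rule = set()
    covered = list(range(len(samples)))
    genes = set().union(*samples)
    while True:
        p0 = sum(weights[i] for i in covered if labels[i])
        n0 = sum(weights[i] for i in covered if not labels[i])
        best_gene, best_gain = None, min_gain
        for gene in sorted(genes - rule):
            sub = [i for i in covered if gene in samples[i]]
            p1 = sum(weights[i] for i in sub if labels[i])
            n1 = sum(weights[i] for i in sub if not labels[i])
            gain = foil_gain(p0, n0, p1, n1)
            if gain > best_gain:
                best_gene, best_gain = gene, gain
        if best_gene is None:
            return rule, covered
        rule.add(best_gene)
        covered = [i for i in covered if best_gene in samples[i]]


def learn_rules(samples, labels, min_gain=0.7, decay=0.5, stop_fraction=0.1):
    """Learn rules for the positive class; returns distinct rules in order found."""
    weights = [1.0] * len(samples)
    total_pos = float(sum(1 for label in labels if label))
    rules = []
    while sum(w for w, l in zip(weights, labels) if l) > stop_fraction * total_pos:
        rule, covered = grow_rule(samples, labels, weights, min_gain)
        if not rule:
            break  # nothing above the gain threshold
        if frozenset(rule) not in rules:
            rules.append(frozenset(rule))
        for i in covered:
            if labels[i]:
                weights[i] *= decay  # covered positives count less next time
    return rules


# Hypothetical gene-presence sets; True marks a positive (e.g. thermophile).
genomes = [{"F", "Q"}, {"F", "Q", "B"}, {"F", "Z"}, {"A"},
           {"F"}, {"F", "B"}, {"Q"}, {"Z", "B"}, {"B"}]
is_positive = [True, True, True, True, False, False, False, False, False]

rules = learn_rules(genomes, is_positive)
```

Because covered samples are only down-weighted rather than removed, the same sample can contribute to more than one rule, which is the point of the weight reduction step on the slide.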
12. CPAR results
One example for now: THERMOPHILY – the ability of an organism to
grow at temperatures above 42° C
427 genomes in the dataset: 376 mesophiles (negative set), 51
thermophiles (positive set)
26,290 genes to consider
Use CPAR to learn rules, submit identified genes to SVM for
classification. 10x 5-fold cross-validation
CPAR accuracy: 84.3% (obtained in 10.6 seconds)
Best competitor (NETCAR): 79.3% (obtained in 1250.9 seconds)
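For readers unfamiliar with the "10x 5-fold cross-validation" protocol, the sketch below generates the 50 train/test splits it implies for 427 genomes. It uses plain shuffled folds; whether the original experiments stratified the folds by class is not stated on the slide, so that detail is left out. The function name and seed are assumptions.

```python
import random


def repeated_kfold(n_samples, n_folds=5, n_repeats=10, seed=0):
    """Yield (train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_folds roughly equal folds.
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            yield train, test


splits = list(repeated_kfold(427))
```

Each genome is used for testing exactly once per repeat, and the reported accuracy would be averaged over all splits.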
14. A complication
Organisms are not independent observations!
They share common ancestry
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5
15. What to do?
MUTUAL INFORMATION:
CONDITIONAL MUTUAL INFORMATION:
Weight CMI by total MI – CONDITIONAL WEIGHTED MUTUAL INFORMATION (CWMI)
Reweight CPAR rules to reflect MI or CWMI: what patterns emerge?
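The formulas on this slide did not survive extraction. As a hedged illustration using the standard textbook definitions (not necessarily the exact estimators used in the talk), mutual information and conditional mutual information can be estimated from discrete observations as below. Conditioning on a taxonomic group label is an assumption made here for the example, and the CWMI weighting itself is not reproduced because its formula is not given.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """I(X;Y) in bits, estimated from paired discrete observations."""
    n = len(x)
    pxy = Counter(zip(x, y))
    px = Counter(x)
    py = Counter(y)
    mi = 0.0
    for (a, b), count in pxy.items():
        joint = count / n
        mi += joint * math.log2(joint / ((px[a] / n) * (py[b] / n)))
    return mi


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) in bits: the average over values of Z of I(X;Y) within that value."""
    n = len(z)
    cmi = 0.0
    for value, count in Counter(z).items():
        idx = [i for i in range(n) if z[i] == value]
        cmi += (count / n) * mutual_information([x[i] for i in idx],
                                                [y[i] for i in idx])
    return cmi


# Hypothetical data: a gene found only in clade "a", where every member
# also happens to be a thermophile.
gene = [1, 1, 1, 1, 0, 0, 0, 0]
thermophile = [1, 1, 1, 1, 0, 0, 0, 0]
clade = ["a", "a", "a", "a", "b", "b", "b", "b"]

mi = mutual_information(gene, thermophile)
cmi = conditional_mutual_information(gene, thermophile, clade)
```

In this toy case the gene looks perfectly informative on its own, but once clade membership is known it adds nothing, which is the kind of shared-ancestry confounding the previous slide warns about.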
16. What genes are identified?
Highlighted boxes: genes identified in “A DNA repair system specific for
thermophilic Archaea and bacteria predicted by genomic context analysis”
(Makarova et al., Nucleic Acids Research, 2002, 30 (2) , 482-496)
Top CWMI    Top MI
18. Summary
Misclassifications (10 replicates)
- CPAR is FAST and fairly accurate, but the problem is challenging:
no “magic” set of genes that automatically make you a thermophile
- But we can investigate what pops up in the rules to find out which
genes are most likely associated with heat tolerance
- The hardest organisms to classify are from weird groups, with few
or no close relatives that are also thermophilic
- Different weighting schemes, especially those that consider the
confounding effects of taxonomy, have different strengths and can
identify different candidate genes
19. What’s next?
- Much larger microbial datasets with much broader taxonomic
coverage are now available
- Will give us more precise models of what genes make a
thermophile, pathogen, etc.
- Consider other lines of evidence: variation WITHIN genes in
addition to gene presence/absence
- Apply to emerging pathogen data: classify outbreak isolates
based on antibiotic resistance, virulence and other properties
(SFU, BCCDC, National Microbiology Laboratory)
Jie (Jessie) Ning
20. METAGENOMICS:
Because one genome at a time is too easy
MacDonald NJ, Parks DH, and Beiko, RG (2012). Rapid identification of high-confidence
taxonomic assignments for metagenomic data. Nucleic Acids Research 40: e111.
Parks DH, MacDonald NJ, and Beiko, RG (2011). Classifying short genomic fragments from novel
lineages using composition and homology. BMC Bioinformatics 12: 328.
21. The microbial community problem
- Microbes almost never act alone; samples will typically contain
dozens or hundreds of different species
- How can we answer the following questions:
  - What microbes are present in a given sample?
  - What functions do they carry out?
  - How do they interact with one another?
23. The species assignment problem
GATAAATCTGG
- UNSUPERVISED (clustering-ish) and SUPERVISED approaches
- For supervised classification, we need a set of known genomes
- Two attributes provide key clues:
(i) Genomic composition of k-mers (aka n-grams)
(ii) Comparison with known gene sequences
24. The species assignment problem
GATAAATCTGG
Mystery sequence: Where did I come from?
COMPOSITION (k-mers)
Genome models (one k-mer frequency profile per reference genome), e.g.:
k-mer   frequency
AA      2/10
AC      0/10
AG      0/10
AT      1/10
SIMILARITY
Sequence from metagenome:
GATAAATCTGG
Sequences from reference genomes:
GATAAGTCTGG
GACCAATCTGG
GATAAACTTAG
CAAGGATAAGC
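The composition side of this slide comes down to counting overlapping k-mers. A minimal sketch, assuming simple overlapping windows and no handling of ambiguous bases or reverse complements (the function name is made up for this example):

```python
from collections import Counter


def kmer_frequencies(sequence, k):
    """Relative frequency of each overlapping k-mer that occurs in sequence."""
    total = len(sequence) - k + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


# The mystery sequence from the slide has 11 bases, so 10 overlapping 2-mers.
profile = kmer_frequencies("GATAAATCTGG", 2)
```

A reference genome would get the same treatment, giving one frequency profile ("genome model") per genome to compare the fragment against.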
25. Metagenomes - the first few years
Cost of DNA sequencing
(note log scale)
Study                   Author, Year              # of nucleotides   Size of each “read”
Acid mine drainage      Tyson et al., 2004        7.62 x 10^7        737 nt
Obese / Lean twins      Turnbaugh et al., 2009    1.83 x 10^9        341 nt
Human gut “catalogue”   Qin et al., 2010          5.77 x 10^11       75 nt
26. Summary of challenges
- Datasets are already huge, and getting bigger and more numerous
- DNA sequences that we need to classify are SHORT: unstable
estimates of composition and similarity
- Our predictions depend on the coverage in our reference database
- We need to combine different lines of evidence into a coherent
prediction scheme
27. Two approaches
PhymmBL: Brady and Salzberg, 2010
- Similarity of sequences assessed through the BLAST algorithm
- Composition assessed using interpolated context models
- Predictions are combined using a formula
RITA: MacDonald, Parks and Beiko, 2012
- Similarity of sequences assessed using UBLAST and BLAST
- Composition assessed using a naïve Bayes approach
- Look for agreement between predictors; if no agreement,
decide based on best evidence
28. The naïve Bayes approach
- Build k-mer profiles for each reference genome
- The probability that a given DNA sequence fragment F originated from
a given genome G_i is:
$$P(F \mid G_i) = \prod_{j=1}^{M} P(w_j \mid G_i)$$
where w_1, ..., w_M are the k-mers in F
- (that is, the combined frequencies of all k-mers from F in genome G_i)
- Note that naïve Bayes assumes INDEPENDENCE, which is a bit funny
with overlapping k-mers (But We Did It Anyway)
AGGCTTGTCAA
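A minimal sketch of the naïve Bayes idea on this slide: score a fragment against each reference genome by multiplying the genome's k-mer frequencies over all k-mers in the fragment, done here as a sum of logs to avoid underflow. The add-one pseudocount, the toy genomes, and all function names are assumptions for illustration; this is not the RITA implementation.

```python
import math
from collections import Counter
from itertools import product


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_profile(genome, k, pseudocount=1.0):
    """Log-probability of every possible k-mer in a genome (smoothed)."""
    counts = Counter(kmers(genome, k))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[w] for w in all_kmers) + pseudocount * len(all_kmers)
    return {w: math.log((counts[w] + pseudocount) / total) for w in all_kmers}


def log_likelihood(fragment, profile, k):
    """Sum of log k-mer probabilities; k-mers with non-ACGT letters are skipped."""
    return sum(profile[w] for w in kmers(fragment, k) if w in profile)


def classify(fragment, profiles, k):
    """Name of the genome whose profile gives the fragment the highest score."""
    return max(profiles, key=lambda name: log_likelihood(fragment, profiles[name], k))


# Hypothetical toy "genomes"; real ones are millions of bases long.
k = 2
profiles = {
    "AT_rich": build_profile("ATATATTAATATAATTATAT", k),
    "GC_rich": build_profile("GCGCGGCCGCGCCGGCGCGC", k),
}

call_1 = classify("ATTATAAT", profiles, k)
call_2 = classify("GCCGGCGC", profiles, k)
```

Treating overlapping k-mers as independent is the same simplification the slide admits to; the pseudocount only keeps unseen k-mers from forcing a probability of zero.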
29. Naïve Bayes in action
Build fake metagenomes by chopping up real sequenced genomes into
pieces of length 200
Build a reference database that excludes the chopped up genomes AND
their close relatives (leave-one-out)
How accurate is the classifier, for different values of k?
[Figure: average proportion of sequences correctly classified (y-axis) versus k (x-axis)]
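The evaluation setup described above can be sketched as two small helpers: one that chops a genome into 200-nt pieces and one that builds a reference set with the held-out genome and its close relatives removed. Non-overlapping fragments, the `relatives` mapping, and the genome names are assumptions made for this example.

```python
def chop_genome(genome, fragment_length=200):
    """Split a genome into non-overlapping fragments; a short tail is dropped."""
    last_start = len(genome) - fragment_length
    return [genome[i:i + fragment_length]
            for i in range(0, last_start + 1, fragment_length)]


def build_reference(genomes, held_out, relatives):
    """Reference set without the held-out genome or its listed close relatives."""
    excluded = {held_out} | set(relatives.get(held_out, ()))
    return {name: seq for name, seq in genomes.items() if name not in excluded}


# Hypothetical toy data (real genomes are far longer than 450 nt).
toy_genome = "ACGT" * 112 + "AC"
fragments = chop_genome(toy_genome)

toy_genomes = {"strain_A": "ACGT", "strain_A2": "ACGA", "strain_B": "GGCC"}
toy_relatives = {"strain_A": ["strain_A2"]}
reference = build_reference(toy_genomes, "strain_A", toy_relatives)
```

Excluding close relatives matters because a fragment from a genome that is nearly identical to something in the reference set would make the classifier look better than it is on truly novel lineages.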
31. RITA:
Rapid Identification of Taxonomic Assignments
[Flowchart]
Query DNA sequence fragment
Run naïve Bayes classifier
UBLAST filter (fast, imprecise)
BLAST comparisons (slower, better)
Is there a BLAST match?
Is there a strong naïve Bayes preference?
Do BLAST and naïve Bayes agree?
Is there a strong BLAST preference?
Outcomes: Group 1a, Group 1b, Group 2, Group 3
Branch labels: Yes! / No!
34. Application to human microbiome
data sets
Homology+Composition    Composition
Without HMP genomes:
Clostridium, Bacteroides and Eubacterium, but
lots of low-confidence calls too
With HMP reference genomes:
Add Ruminococcus, Faecalibacterium,
Lachnospiraceae
Good Less Good
Data from Turnbaugh et al., 2010
35. Application to bioremediation metagenome
Hug et al., 2012
Three sets of microbes, all can clean up
PCEs. Are there differences in the
composition of these sets?
36. Summary
- Naïve Bayes is FAST and performs as well as alternative, more
complicated approaches
- The combination of composition and similarity is superior to either
approach in isolation
- The accuracy on short reads is good, but a substantial minority of
reads are misclassified so the question of “who is doing what”
remains somewhat open
37. What’s next?
- Apply to emerging metagenomic data sets:
- Bioremediation
- Aging and frailty in mice and humans
- Refine the approach to include both
unsupervised and supervised components
38. Coda #1: mammalian fertility
Random mating
CONTROL (105)
Selective breeding
SELECTED (344)
Starting colony
30 years of….
Examine genetic variation at >8000 positions
within the genome.
Are there any genetic differences at one or
more sites that distinguish the populations
and individuals within the populations?
Alex Keddy
Katherine Rutherford
40. What’s next?
Jeremy Koenig
- Expand the project: more data, and more types of data!
- Integrating lines of evidence from multiple sources will be a
significant challenge – each yields overlapping / different
predictions
- Map interesting results into the cow genome and test effectiveness
Developer to be
named later
41. Coda #2: data retrieval and GIS
20,304 samples
1.7 billion sequences
43. Objectives
- Automated classification of data from sources such as the EMP
- Retrieval of data from EMP via Web services under development
(some plugins already completed – come in October for the story)
45. Classifying DNA: Adventures in
Multidisciplinarity
Genetics
Evolution
Statistics
Machine Learning
Throw in the challenges of massive data sets,
data retrieval,
emerging technologies,
and uncertain reliability of some data sets,
and there is a lot of work still to be done!!
Chris Whidden
Donovan Parks
Morgan Langille