Comprehensive Exam Slides 11/13/2013
1. Investigate the diversity of extremely
complex metagenomic samples
Qingpeng Zhang
Department of Computer Science and Engineering
Michigan State University
Supervisor: Dr. Titus Brown
2. Outline
● Significance and background
– Metagenomics
– Microbial diversity measurement
● Preliminary results
– A novel method to investigate microbial diversity
based on an efficient k-mer counting approach
● Proposed research
– Prove effectiveness using test data sets
– Tackle extremely large metagenomic data sets
generated from extremely complex microbial
samples
3. The Great Prairie Grand Challenge
● How many different species are in a soil sample? What is their abundance distribution? How
different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa
prairie?
● “Grand Challenge” - extremely large data sets from an extremely complex microbial community
– An estimated 50 Tbp of sequence is needed for a single gram of soil (Jason Gans, 2005)
– A gram of soil contains approximately a billion microbial cells, with an estimated
4 petabase pairs of DNA (Jack A. Gilbert, 2013)
– Over a terabase of sequence from Iowa cultivated and uncultivated soils
5. Diversity measurement based on different unit concepts
[Figure: units used to measure diversity]
– Species / individuals
– OTUs (97% similarity of 16S sequences) / 16S rRNA sequences
– Unique k-mers / total k-mers in WGS data (whole genome sequencing reads)
– Source: Nature Reviews Genetics 6, 805-814, etc.
6. Statistics for Diversity Estimation
● Rarefaction curves
– Cannot cope with the scale of diversity of the microbial world
● Extrapolation from curves
● Parametric estimators (need relative species abundance)
● Non-parametric estimators (Chao1, etc.)
– Lower-bound estimators
– Sensitive to the underlying distribution
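As an illustration of the non-parametric estimators named above, here is a minimal sketch of the classic Chao1 formula, S_obs + F1^2 / (2 * F2), where F1 and F2 are the numbers of species seen exactly once and exactly twice. The function name `chao1` and the input layout (one count per observed species) are assumptions made for this example, not part of the slides.

```python
def chao1(abundances):
    """Chao1 lower-bound richness estimate from per-species counts.

    Uses S_obs + F1^2 / (2 * F2); when no doubletons are observed (F2 == 0)
    it falls back to the bias-corrected form S_obs + F1 * (F1 - 1) / 2.
    """
    counts = [c for c in abundances if c > 0]
    s_obs = len(counts)                      # species actually observed
    f1 = sum(1 for c in counts if c == 1)    # singletons
    f2 = sum(1 for c in counts if c == 2)    # doubletons
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)


# Seven observed species: three singletons and two doubletons.
print(chao1([10, 5, 1, 1, 1, 2, 2]))  # 7 + 9/4 = 9.25
```

Because the estimate depends only on the rare classes F1 and F2, it tends to behave as a lower bound when many species remain unseen.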
7. The Goal of this Project
● Use whole genome shotgun metagenomic data sets rather than 16S
rRNA
– Measure the microbial diversity of samples (alpha-diversity)
– Compare microbial samples (beta-diversity)
● A novel method that is:
– Binning-free
– Assembly-free
– Annotation-free
– Reference-free
● Efficient (Memory and Time)
– extremely large shotgun metagenomic data sets (Terabytes, etc.)
– extremely diverse microbial communities (Soil, etc.)
8. Diversity measurement based on different unit concepts
[Figure: units used to measure diversity]
– Species / individuals
– OTUs (97% similarity of 16S sequences) / 16S rRNA sequences
– Unique k-mers / total k-mers in WGS data (whole genome sequencing reads)
– Source: Nature Reviews Genetics 6, 805-814, etc.
9. Preliminary Results
● A novel method to investigate microbial
diversity based on an efficient k-mer counting
approach
– Diversity measurement of one sample
– Comparison of multiple samples
10. An Approach to Count k-mers Efficiently
● An approach to count k-mers efficiently
• Highly scalable: constant memory consumption,
independent of k and data set size
• Probabilistic properties well suited to next-
generation sequencing data sets
• A certain counting false positive rate as a tradeoff,
because of collisions
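The properties listed above (fixed memory, online updates, counts inflated only by collisions) can be sketched with a small Count-Min-Sketch-style counter. This is an illustrative toy under stated assumptions, not khmer's actual API or implementation: the class name `CountMinSketch`, the helper `count_kmers`, the table sizes, and the use of salted BLAKE2b hashes are all choices made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter; counts can be too high, never too low."""

    def __init__(self, width=100003, depth=4):
        self.width = width
        self.depth = depth
        # Memory is width * depth counters, regardless of k or data set size.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One salted hash per table, so each table uses a different hash function.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode("ascii"), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def get(self, kmer):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def count_kmers(sketch, sequence, k):
    """Add every k-mer of `sequence` to the sketch (an online update)."""
    for i in range(len(sequence) - k + 1):
        sketch.add(sequence[i:i + k])


sketch = CountMinSketch(width=1009, depth=4)
count_kmers(sketch, "ACGTACGTAC", k=4)
print(sketch.get("ACGT"))  # at least 2, the true count of ACGT in the sequence
```

Counts can be queried at any point while reads are still being added, which is the "online" behaviour the later slides rely on.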
11. What is khmer's advantage?
● Good performance in
time/memory usage
● Online counting, updating and
retrieving (important for this
project!!)
● With Python API – flexible and
expandable
(Zhang, Pell, Canino-Koning, Howe, & Brown,
2013, submitted)
12. Median k-mer frequency represents the sequencing
coverage of a read
Using the median k-mer frequency rather than the
average k-mer frequency reduces the influence
of sequencing errors
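A minimal sketch of this idea, assuming k-mer counts are already available in a dictionary-like object (here a plain `Counter`; the function names `kmers_of` and `median_kmer_frequency` are hypothetical). A single sequencing error creates a few low-frequency k-mers, which pull the mean down but leave the median unchanged.

```python
from collections import Counter
from statistics import mean, median


def kmers_of(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def median_kmer_frequency(read, counts, k):
    """Estimate a read's coverage as the median count of its k-mers."""
    freqs = [counts[kmer] for kmer in kmers_of(read, k)]
    return median(freqs) if freqs else 0


# Toy counts: the last k-mer is rare, as if it contained a sequencing error.
counts = Counter({"ACGT": 10, "CGTA": 10, "GTAC": 1})
read = "ACGTAC"

print(median_kmer_frequency(read, counts, 4))        # 10
print(mean(counts[km] for km in kmers_of(read, 4)))  # 7
```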
13. Mapping and k-mer coverage measures correlate for
simulated genome data and a real E. coli data set (5m reads).
(Brown, Howe, Zhang, Pyrkosz, & Brom, 2012)
14. IGS
If there are Y reads with a sequencing depth of X, then for each
of those Y reads there are X-1 other reads that cover the same DNA
segment of the genome from which the read originates. We can
therefore estimate that there are Y/X distinct DNA segments with
read coverage X. We term these distinct DNA segments in a species
genome IGSs (informative genomic segments).
An IGS (informative genomic segment) can represent the novel
information of a genome.
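The Y/X argument above can be sketched as follows, assuming each read already has an integer coverage estimate (for instance its median k-mer frequency). The function names `igs_spectrum` and `estimate_igs_count` are hypothetical, and summing Y/X over all coverage levels is this example's reading of the slide, not a confirmed implementation.

```python
from collections import Counter


def igs_spectrum(read_coverages):
    """Map each coverage X to the estimated number of IGSs, Y / X,
    where Y is the number of reads observed with coverage X."""
    reads_per_coverage = Counter(c for c in read_coverages if c > 0)
    return {x: y / x for x, y in reads_per_coverage.items()}


def estimate_igs_count(read_coverages):
    """Total estimated number of IGSs across all coverage levels."""
    return sum(igs_spectrum(read_coverages).values())


# 8 reads at coverage 4, 6 reads at coverage 2, 3 reads at coverage 1.
coverages = [4] * 8 + [2] * 6 + [1] * 3
print(igs_spectrum(coverages))        # {4: 2.0, 2: 3.0, 1: 3.0}
print(estimate_igs_count(coverages))  # 8.0
```

The spectrum plays the role of an abundance distribution: two IGSs seen at coverage 4, three at coverage 2, and three at coverage 1.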
15. N = G / (L - k + 1)
e.g. 1000000 / (80 - 22 + 1) ≈ 16,949
Borrowing statistical methods from OTU-based diversity
analysis (rarefaction curves, estimators, etc.)
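The arithmetic in the formula above can be checked with a few lines. The slide does not define its symbols, so this sketch assumes G is the genome size in bases, L the read length, k the k-mer size, and N the expected number of IGSs; the function name `expected_igs_count` is hypothetical.

```python
def expected_igs_count(genome_size, read_length, k):
    """N = G / (L - k + 1), under the assumed meaning of the symbols:
    G = genome size, L = read length, k = k-mer size."""
    return genome_size / (read_length - k + 1)


# The slide's example: G = 1,000,000, L = 80, k = 22.
print(round(expected_igs_count(1_000_000, 80, 22)))  # 16949
```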
16. Compare the contents of multiple metagenomic
samples
● How different are two samples?
– If the sequencing coverage of a read from sample A is > 0 in
sample B, the segment in sample A from which that read
originates also exists in sample B
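A minimal sketch of this comparison, assuming exact k-mer counts in a `Counter` and the median k-mer frequency as the coverage estimate. The function names `count_sample_kmers` and `fraction_present` are hypothetical, and a real data set would need an approximate counter rather than an in-memory dictionary.

```python
from collections import Counter
from statistics import median


def count_sample_kmers(reads, k):
    """Exact k-mer counts for every read in a sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def fraction_present(reads_a, reads_b, k):
    """Fraction of reads from sample A whose coverage in sample B is > 0."""
    counts_b = count_sample_kmers(reads_b, k)
    present = 0
    for read in reads_a:
        freqs = [counts_b[read[i:i + k]] for i in range(len(read) - k + 1)]
        if freqs and median(freqs) > 0:
            present += 1
    return present / len(reads_a) if reads_a else 0.0


sample_a = ["ACGTAC", "TTTTTT"]
sample_b = ["ACGTACGG"]
print(fraction_present(sample_a, sample_b, k=4))  # 0.5
```

Only the first read of sample A shares its k-mers with sample B, so half of A's reads are judged present in B.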
17. Synthetic data set A (same abundance):
– Sample A: 100 species with 80 common to B
– Sample B: 100 species with 80 common to A
– Sample C: 100 species with 20 common to A/B, and 60 common to D
– Sample D: 100 species with 20 common to A/B, and 60 common to C
18. Synthetic data set B:
– Sample 1A:
● species IDs: 1,2,3,4,5,6,7,8,9,10 relative abundance: 20:18:16:4:3:2:2:2:2:2
– Sample 1B:
● species IDs: 1,2,3,14,15,16,17,18,19,20 relative abundance: 20:18:16:4:3:2:2:2:2:2
– Sample 1C:
● species IDs: 21,22,3,4,5,6,7,8,9,10 relative abundance: 2:2:2:2:2:3:4:16:18:20
– 1A and 1B: high overlap at the individual level, low overlap at the species level
– 1A and 1C: high overlap at the species level, low overlap at the individual level
– 1B and 1C: low overlap at the species level and low overlap at the individual level
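The overlap pattern described above can be checked numerically. The two measures used here are assumptions chosen for illustration, not necessarily the ones used in the project: species-level overlap is the Jaccard index of the species sets, and individual-level overlap is the sum over shared species of the smaller relative abundance.

```python
def make_sample(species_ids, abundances):
    """Build a {species_id: abundance} dictionary."""
    return dict(zip(species_ids, abundances))


def species_overlap(a, b):
    """Jaccard index of the two species sets."""
    return len(set(a) & set(b)) / len(set(a) | set(b))


def individual_overlap(a, b):
    """Sum of the smaller relative abundance over shared species."""
    total_a, total_b = sum(a.values()), sum(b.values())
    return sum(min(a[s] / total_a, b[s] / total_b) for s in set(a) & set(b))


sample_1a = make_sample([1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        [20, 18, 16, 4, 3, 2, 2, 2, 2, 2])
sample_1b = make_sample([1, 2, 3, 14, 15, 16, 17, 18, 19, 20],
                        [20, 18, 16, 4, 3, 2, 2, 2, 2, 2])
sample_1c = make_sample([21, 22, 3, 4, 5, 6, 7, 8, 9, 10],
                        [2, 2, 2, 2, 2, 3, 4, 16, 18, 20])

pairs = [("1A", "1B", sample_1a, sample_1b),
         ("1A", "1C", sample_1a, sample_1c),
         ("1B", "1C", sample_1b, sample_1c)]
for name_x, name_y, x, y in pairs:
    print(name_x, name_y,
          round(species_overlap(x, y), 2),
          round(individual_overlap(x, y), 2))
```

Under these measures the pairs come out as 1A/1B: 0.18 species, 0.76 individual; 1A/1C: 0.67 species, 0.23 individual; 1B/1C: 0.05 species, 0.03 individual, matching the high/low pattern stated on the slide.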
19. What's Next
● Refine the methods
– Errors are still a problem.
– More statistics on IGSs (informative
genomic segments)
● Prove effectiveness using test data
sets
– Simulated data sets based on real
microbial genomes
– MetaHIT, 124 metagenomic
samples from 99 healthy people,
and 25 patients with inflammatory
bowel disease (IBD) syndrome.
Each sample has on average 65 ±
21 million reads.
● Integrate functions into khmer package
20. The Great Prairie Grand Challenge
● How many different species are in a soil sample? What is their abundance distribution? How
different are the soil samples from 100-year cultivated Iowa agricultural soil and native Iowa
prairie?
● “Grand Challenge” - extremely large data sets from an extremely complex microbial community
– Over a terabase of sequence from Iowa cultivated and uncultivated soils
– Be prepared for technical challenges when dealing with such large-scale
data sets (storage, computing resources, HPCC, etc.)
– A preliminary result: the majority of the prairie reads (50%) are present in the corn
with a coverage of > 0
21. Acknowledgement
● Dr. Titus Brown
● Lab members of GED
● Dr. Jason Pell
● Dr. Adina Howe
● Eric McDonald
● Everybody in this room