Novel Computational Approaches to
Investigate Microbial Diversity
Qingpeng Zhang
Department of Computer Science and Engineering
Michigan State University
Advisor: Dr. Titus Brown
1
Outline
• Background and motivation
• An efficient k-mer counting approach
• Novel method to investigate microbial
diversity
– Concept of IGS (informative genomic segment)
– Testing IGS method on simulated data sets
– Testing IGS method on real metagenome data
• Summary
2
Background and motivation
Microbial diversity: Sorcerer II Global Ocean Sampling Expedition
• How many different species are in a seawater sample? What is their abundance distribution? (alpha-diversity) [Hint: ~10,000 different types of microorganisms]
• Are samples from tropical and temperate areas different? (beta-diversity) [Hint: Yes, but how different?]
“...functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil.”
- J. Handelsman et al., 1998
Most microorganisms cannot be cultured, isolated, and sequenced in the traditional way.
In a gram of soil, there are approximately a billion microbial cells, containing an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013)
+
Improvement of next generation sequencing
+
Decreasing cost of sequencing
= data deluge
Metagenomics and
“big data”
http://www.e-janco.com/
6
Metagenomics and
“big data”
7
Bacterial genome size: 139,000 bps to 13,000,000 bps
Human genome size: 3,000,000,000 bps
Metagenomics and
“big data”
Tools friendly to big data are required.
Sorcerer II Global Ocean Sampling Expedition
• How many different species are in a seawater sample? What is their abundance distribution? (alpha-diversity) [Hint: ~10,000 different types of microorganisms; 10 million viruses and one million bacteria in a drop.]
• Are samples from tropical and temperate areas different? (beta-diversity) [Hint: Yes, but how different?]
Analogy: a genome is a book, and a sample is a library.
Which library has more titles, when we only have shredded pieces of pages with (partial) sentences on them to read?
How similar are the libraries' collections, when we only have shredded pieces of pages with (partial) sentences on them to read?
A typical pipeline of metagenome diversity analysis (loosely based on a slide by Mads Albertsen)
Sequencing -> Reads (100-150 bp) -> Assembly -> Contigs (1000+ bp) -> Mapping -> Abundance -> Binning -> Species
Abundance: Species A 3x, Species B 1x, Species C 1x, Species D 2x
           Sample1  Sample2  Sample3  Sample4  Sample5
Species A  3        3        4        4        0
Species B  1        3        0        2        2
Species C  1        2        2        2        3
Species D  2        1        2        1        1
14
          Sample1  Sample2  Sample3  Sample4  Sample5  Sample6
SpeciesA  3        3        4        4        0        0
SpeciesB  1        3        0        2        2        2
SpeciesC  1        2        2        2        3        3
SpeciesD  2        1        2        1        1        1
SpeciesE  4        1        3        0        2        5
SpeciesF  5        1        2        2        2        5
SpeciesG  2        2        1        2        1        3
Alpha diversity:
“How many different species in a seawater sample?”
Beta diversity:
“How different are the seawater samples?”
         Sample1  Sample2  Sample3  Sample4  Sample5  Sample6
Sample1  0.0      0.25     0.5      0.5      0.75     1.0
Sample2  0.25     0.0      0.25     0.25     0.75     1.0
Sample3  0.5      0.25     0.0      0.25     0.75     1.0
Sample4  0.5      0.25     0.25     0.0      0.75     1.0
Sample5  0.75     0.75     0.75     0.75     0.0      0.25
Sample6  1.0      1.0      1.0      1.0      0.25     0.0
15
Tools: QIIME, mothur, etc.
We need this!
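As a rough illustration of how a samples-by-species table turns into alpha and beta diversity, the sketch below computes richness per sample and a pairwise dissimilarity matrix from the counts on this slide. The slide does not name a dissimilarity measure, so Bray-Curtis is used here purely as an assumption; the dissimilarity values shown on the slide are illustrative and are not expected to be reproduced.

```python
# Samples-by-species counts copied from the slide (columns are Sample1..Sample6).
species_counts = {
    "SpeciesA": [3, 3, 4, 4, 0, 0],
    "SpeciesB": [1, 3, 0, 2, 2, 2],
    "SpeciesC": [1, 2, 2, 2, 3, 3],
    "SpeciesD": [2, 1, 2, 1, 1, 1],
    "SpeciesE": [4, 1, 3, 0, 2, 5],
    "SpeciesF": [5, 1, 2, 2, 2, 5],
    "SpeciesG": [2, 2, 1, 2, 1, 3],
}
sample_names = [f"Sample{i}" for i in range(1, 7)]

# One abundance vector per sample (a column of the table).
columns = [list(col) for col in zip(*species_counts.values())]


def richness(abundances):
    """Alpha diversity as richness: number of species observed in a sample."""
    return sum(1 for a in abundances if a > 0)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity (an assumed choice, not named on the slide)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


alpha = {name: richness(col) for name, col in zip(sample_names, columns)}
beta = [[bray_curtis(x, y) for y in columns] for x in columns]

print(alpha)
for name, row in zip(sample_names, beta):
    print(name, " ".join(f"{value:.2f}" for value in row))
```

Tools such as QIIME and mothur offer many more estimators; this only shows the shape of the computation.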
Challenges in a typical pipeline of metagenome diversity analysis (the steps that produce the samples-by-species table):
• Assembly is difficult (e.g., only a small portion of reads can be assembled)
• Binning is difficult (e.g., it may need reference genomes and is computationally expensive)
• Mapping is difficult (e.g., limited accuracy, computationally expensive)
• A gram of soil contains an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013)
Reads -> k-mers
18
K-mer: a DNA segment of length k.
>24288227
TTTGCCTGTTGAAAACCTGTAAACGGGATGTTTCCGGGGTTTTCCGCCCCGGTAGGGTTGAGAC
GTGCCCCGCGGTCGCGAACAGCTCGCCGCTTACCGGCGTAAGAGCGATCCGCAACACACCCCG
GAGCCTCTACCCCCCCCTCACGC
TTTGCCTGTT
TTGCCTGTTG
TTTCCGGGGT
TTCCGGGGTT
TTCCGGGGTT
ATCCGCGGTT
TTCCGGGGTT
TTCCGGGGTT
TTCCGGCCGTT
GGTTTTTCCG
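A minimal sketch of breaking a read into k-mers, using k = 10 to match the 10-base segments shown on the slide. The function name `kmers` is chosen here for illustration and is not taken from khmer.

```python
def kmers(sequence, k):
    """Return every overlapping substring of length k, from left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# First 20 bases of the example read above.
read = "TTTGCCTGTTGAAAACCTGT"

for kmer in kmers(read, 10):
    print(kmer)
```

A read of length L yields L - k + 1 k-mers, so the 20-base prefix above gives 11 k-mers, starting with TTTGCCTGTT and TTGCCTGTTG.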
Reads – sample A K-mers – sample A
Reads – sample B K-mers – sample B
19
Counting (unique) k-mers to evaluate diversity
Reads – sample A K-mers – sample A
Reads – sample B K-mers – sample B
20
Counting (unique) words to evaluate diversity
• Handle a large number of k-mers (a simple hash table will not work).
• Retrieve the count of any k-mer by random access (in memory).
– To check the occurrence of a k-mer in multiple samples and find shared k-mers
21
Challenges in counting k-mers
Soil metagenome sample  Number of reads  Number of k-mers (k=29)  Number of unique k-mers
IOHH.2301.7.1859        273,021,397      32,762,567,744           22,333,571,795
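A back-of-the-envelope estimate of why an exact table is impractical for this sample. The byte sizes are assumptions made for illustration: a 29-mer packed at 2 bits per base fits in 8 bytes, and each count is given 4 bytes; real hash tables add overhead on top of this.

```python
# Figures from the table above.
reads = 273_021_397
total_kmers = 32_762_567_744
unique_kmers = 22_333_571_795

# Assumptions (not from the slides): 2 bits per base, so a 29-mer needs
# 58 bits and fits in one 8-byte word; each counter takes 4 bytes.
BYTES_PER_KEY = 8
BYTES_PER_COUNT = 4

exact_table_bytes = unique_kmers * (BYTES_PER_KEY + BYTES_PER_COUNT)
unique_fraction = unique_kmers / total_kmers
kmers_per_read = total_kmers / reads

print(f"exact table, no overhead: {exact_table_bytes / 1e9:.0f} GB")
print(f"fraction of k-mers that are unique: {unique_fraction:.2f}")
print(f"mean k-mers per read: {kmers_per_read:.0f}")
```

Under these assumptions the keys and counters alone need roughly 268 GB, which motivates a fixed-memory probabilistic structure.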
An efficient k-mer counting approach
• The Count-Min Sketch consists of one or more hash tables of different sizes.
• Each entry in the hash tables is a counter representing the number of k-mers that hash to that location.
• To retrieve the count for a given k-mer, we take the minimum count across all of its hash entries.
• Highly scalable: memory use is constant, independent of k and dataset size
• Probabilistic properties well suited to next generation sequencing datasets
• Memory usage is fixed and low; the tradeoff is a counting false positive rate caused by hash collisions
• The one-sided counting error is low and predictable.
Khmer - An efficient k-mer counting approach
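The three properties above can be sketched in a few lines. This is a toy illustration and not khmer's implementation: the class name, the use of `hashlib.blake2b`, and the choice of one hash value reduced modulo each table size are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: several counter tables of different sizes."""

    def __init__(self, table_sizes):
        # Memory is fixed at construction time, whatever the data size.
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # Assumption: one 64-bit hash, reduced modulo each table size.
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        return [value % len(table) for table in self.tables]

    def add(self, kmer):
        # Increment the counter this k-mer maps to in every table.
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across
        # tables is the tightest estimate and never undercounts.
        return min(table[slot] for table, slot in zip(self.tables, self._slots(kmer)))


sketch = CountMinSketch([101, 103, 107])  # small prime sizes for the demo
for kmer in ["GAAAACCTGT", "GAAAACCTGT", "GAAAACCTGT", "AAAACCTGTA"]:
    sketch.add(kmer)

print(sketch.count("GAAAACCTGT"))
print(sketch.count("AAAACCTGTA"))
```

The error is one-sided: a reported count can be higher than the true count when k-mers collide in every table, but it is never lower.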
23
• My contributions:
• algorithm design and analysis, exploring the underlying mathematics, choosing optimal parameters
• contributing code, including unique k-mer counting, overlap k-mer counting, optimal parameter choice, and other code related to my specific research project
• benchmarking, testing
• exploring applications such as error trimming, filtering low-abundance reads, and digital normalization; suggesting features
• work on the khmer manuscript
An efficient k-mer counting approach
>24288227
TTTGCCTGTTGAAAACCTGAAAACGGGATGTTTCCG
GGGTTTTCCGCCCCGGTAGGGTTGAGACGTGCCCC
GCGGTCGCGAACAGCTCGCCGCTTACCGGCGTAAG
AGCGATCCGCAACACACCCCGGAGCCTCTACCCCCC
CCTCACGC
GAAAACCTGA
AAAACCTGAA
AAACCTGAAA
Sequencing errors cause serious problems
In a high-coverage read data set, most of the unique k-mers are erroneous, caused by sequencing errors.
Can we count k-mers to measure diversity directly?
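To make the effect of one sequencing error concrete, the sketch below compares the k-mer sets of two 36-base reads taken from the slides: the first 36 bases of the error-free read shown earlier and of the read above, which differ at a single base. With k = 10, one wrong base can introduce up to 10 k-mers that do not exist in the true sequence.

```python
def kmer_set(sequence, k):
    """Set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


k = 10
correct = "TTTGCCTGTTGAAAACCTGTAAACGGGATGTTTCCG"
with_error = "TTTGCCTGTTGAAAACCTGAAAACGGGATGTTTCCG"  # T -> A at one position

# K-mers that appear only because of the erroneous base.
novel = kmer_set(with_error, k) - kmer_set(correct, k)

print(len(novel), "erroneous k-mers from a single wrong base")
for kmer in sorted(novel):
    print(kmer)
```

Each such k-mer looks like a new unique k-mer, which is why raw unique k-mer counts overestimate diversity in high-coverage data.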
Concept of IGS (informative genomic segment)
• IGS (informative genomic segment)
– can represent the novel information of a genome
– can be used as a new unit to do diversity analysis to replace species
– Assembly-free
– Reference-free
– Efficient and scalable
Replace the samples-by-species table with a samples-by-IGSs table
Figure labels: Genome, Reads, IGSs (informative genomic segments)
1. We can use median k-mer abundance to estimate the
sequencing coverage of a read without a reference assembly.
2. We can retrieve the coverage of a read across different
samples.
3. If a read has coverage C, approximately C-1 other reads with
the same coverage cover the same DNA region.
4. We can estimate the size of covered genomic region by the
number of reads and the corresponding coverage.
Four tricks!
From reads to IGSs
30
Reads – sample A K-mers – sample A
The abundance of the k-mers in a read tends to be
similar, which is close to the coverage of the read.
Coverage of a read and how to get it without a
reference assembly
31
Coverage of read: the sequencing depth of the DNA region that read
originates from
>24288227
TTTGCCTGTTGAAAACCTGTAAACGGGATGTTTCCGGGGTTTTCCGCCCCGGTAGGGTTGAGAC
GTGCCCCGCGGTCGCGAACAGCTCGCCGCTTACCGGCGTAAGAGCGATCCGCAACACACCCCG
GAGCCTCTACCCCCCCCTCACGC
GAAAACCTGT
AAAACCTGTA
AAACCTGTAA
TTTGCCTGTT
TTGCCTGTTG
TTTCCGGGGT
TTCCGGGGTT
TTCCGGGGTT
ATCCGCGGTT
TTCCGGGGTT
TTCCGGGGTT
TTCCGGCCGTT
GGTTTTTCCG
K-mer abundances (figure labels): 4, 5, 4, 5, 4, 5, 4, 4, 3, 4, 5, 4, 5, 4
Reads – sample A K-mers – sample A
32
K-mer abundance
>24288227
TTTGCCTGTTGAAAACCTGAAAACGGGATGTTTCCGGGGTTTTCCGCCCCGGTAGGGTTGAGA
CGTGCCCCGCGGTCGCGAACAGCTCGCCGCTTACCGGCGTAAGAGCGATCCGCAACACACCCC
GGAGCCTCTACCCCCCCCTCACGC
GAAAACCTGA
AAAACCTGAA
AAACCTGAAA
TTTGCCTGTT
TTGCCTGTTG
TTTCCGGGGT
TTCCGGGGTT
TTCCGGGGTT
ATCCGCGGTT
TTCCGGGGTT
TTCCGGGGTT
TTCCGGCCGTT
GGTTTTTCCG
K-mer abundances (figure labels): 4, 5, 4, 5, 4, 5, 4, 4, 3, 4, 5, 1, 1, 1
Reads – sample A K-mers – sample A
33
median k-mer abundance to represent
the sequencing coverage of the read
34
35
Trick 1: We can use median k-mer
abundance to estimate the sequencing
coverage of a read without a reference
assembly.
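Trick 1 can be sketched with an exact counter standing in for the k-mer counting structure. The toy data set is an assumption made for illustration: four copies of the error-free 36-base read prefix from the slides plus one copy containing the single-base error. The helper names are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def read_coverage(read, counts, k):
    """Estimate a read's coverage as the median abundance of its k-mers."""
    abundances = [counts[kmer] for kmer in kmers(read, k)]
    return median(abundances) if abundances else 0


k = 10
correct = "TTTGCCTGTTGAAAACCTGTAAACGGGATGTTTCCG"
with_error = "TTTGCCTGTTGAAAACCTGAAAACGGGATGTTTCCG"

# Toy sample: the same region sequenced five times, once with an error.
sample_reads = [correct] * 4 + [with_error]
counts = Counter(kmer for read in sample_reads for kmer in kmers(read, k))

print(read_coverage(correct, counts, k))
print(read_coverage(with_error, counts, k))
```

Both reads get an estimate of 5 here: the ten abundance-1 k-mers created by the error are a minority of the read's 27 k-mers, so they do not move the median, whereas they would pull a mean down.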
Reads – sample A K-mers – sample B
36
Trick 2: We can estimate the sequencing
coverage of a read across different
samples
Coverage = 4
Sample A
Sample B
Sample C
Read Coverage Profile:
Coverage = 6
Coverage = 2
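A read coverage profile can be sketched by looking up the same read's k-mers in each sample's count table. The region sequence, the sample contents, and the function names below are made up for illustration; exact counters again stand in for the k-mer counting structure.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(reads, k):
    return Counter(kmer for read in reads for kmer in kmers(read, k))


def coverage_profile(read, per_sample_counts, k):
    """Median k-mer abundance of one read in each sample, in sample order."""
    profile = []
    for counts in per_sample_counts:
        abundances = [counts[kmer] for kmer in kmers(read, k)]
        profile.append(median(abundances) if abundances else 0)
    return tuple(profile)


k = 5
region = "ACGTTGCAAGGCTTAC"  # hypothetical genomic region

# Hypothetical samples: the region is sequenced 3x in A, 2x in B, absent in C.
sample_a = [region] * 3
sample_b = [region] * 2
sample_c = ["TTTTTTTTTT"]

tables = [count_kmers(reads, k) for reads in (sample_a, sample_b, sample_c)]
print(coverage_profile(region, tables, k))
```

The read itself only needs to come from one sample; its profile across the others falls out of k-mer lookups, with no assembly or mapping step.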
[Figure: reads labeled with their coverage profiles across Sample A : Sample B : Sample C; the profiles shown are 3:2:2, 2:0:3, and 2:4:3]
“Contigs with similar coverage profiles are
likely to have originated from the same
microbial population” (Imelfort et al. 2014)
Reads with similar coverage profiles are
likely to have originated from the same
species.
39
[Figure: reads with coverage profiles 3:2:2, 2:0:3, and 2:4:3 grouped into bin 1, bin 2, and bin 3, drawn against the genomic content of Sample A, Sample B, and Sample C]
Do the reads in each bin cover the same size of genomic region?
6 reads with coverage of 3 in the sample where they are from
vs.
2 reads with coverage of 2 in the sample where they are from
What is the size of genomic region covered by the reads in each bin?
What we know:
C: coverage of the read in the sample in the bin
N: number of reads in the bin
Reads – sample A K-mers – sample A
42
Trick 3: If a read has coverage C, approximately C-1 other reads
with the same coverage cover the same DNA region.
43
Reads – sample A
Trick 4: We can estimate the size of
covered genomic region by the number
of reads(N) and the corresponding
coverage(C).
44
Trick 4: We can estimate the size of the covered genomic region from the number of reads (N) and the corresponding coverage (C).
N ~= 30, C ~= 10: size of covered DNA region = N/C = 30/10 = 3
N ~= 7, C ~= 7: size of covered DNA region = N/C = 7/7 = 1
Unit of size of covered DNA region: IGS (informative genomic segment)
47
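The N/C arithmetic on this slide, written out as a small sketch. The function name is hypothetical. The second form shows an equivalent view: each read with coverage C contributes 1/C of an IGS, so summing over reads gives the same size.

```python
def covered_size_in_igs(num_reads, coverage):
    """Size of the covered region, in IGS units, from N reads at coverage C."""
    return num_reads / coverage


# The two examples from the slide.
print(covered_size_in_igs(30, 10))
print(covered_size_in_igs(7, 7))

# Equivalent per-read view: every read at coverage C counts for 1/C.
read_coverages = [10] * 30
per_read_total = sum(1 / c for c in read_coverages)
print(round(per_read_total, 6))
```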
Do the reads in each bin cover the same size of genomic region?
6 reads with coverage of 3 in the sample where they are from
vs.
2 reads with coverage of 2 in the sample where they are from
Size of covered region per bin (N/C): 6/3 = 2, 4/2 = 2, 2/2 = 1
[Figure: reads labeled with coverage profiles 3:2:2, 2:0:3, and 2:4:3]
      SampleA  SampleB  SampleC
IGS1  3        2        2
IGS2  3        2        2
IGS3  2        0        3
IGS4  2        0        3
IGS5  2        4        3
48
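The step from bins to the samples-by-IGSs table can be sketched as follows, using the read counts and profiles from this worked example. It assumes, as the figure labels suggest, that the reads come from Sample A, so the first entry of each profile is the coverage C used in N/C.

```python
# (coverage profile across Samples A, B, C; number of reads in the bin),
# taken from the worked example on the slide.
bins = [((3, 2, 2), 6), ((2, 0, 3), 4), ((2, 4, 3), 2)]

SOURCE = 0  # assumption: the reads come from Sample A (first profile entry)

igs_table = []
for profile, num_reads in bins:
    # N reads at coverage C cover about N / C informative genomic segments.
    num_igs = round(num_reads / profile[SOURCE])
    # Each of those IGSs inherits the bin's abundance profile.
    igs_table.extend([profile] * num_igs)

print("     SampleA SampleB SampleC")
for index, row in enumerate(igs_table, start=1):
    print(f"IGS{index}", *row, sep="       ")
```

This reproduces the five rows of the table above: two IGSs at 3:2:2, two at 2:0:3, and one at 2:4:3.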
IGS (informative genomic segment) can
represent the novel information of a genome
• The more IGSs a sample contains, the higher its diversity and richness.
• The more IGSs two samples share, the more similar the two samples are.
IGS (informative genomic segment):
genomic segments, each the length of a read, that do not share genomic content with one another
50
(meta)Genome size = number of IGSs x IGS length (read length)
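A sketch of how the statements above translate into numbers, using the five-IGS example table from the earlier slide. The Jaccard index is used as one possible incidence-based similarity, and the 100 bp read length is a hypothetical value; neither is specified on this slide.

```python
# Samples-by-IGSs abundances (SampleA, SampleB, SampleC) from the example.
igs_table = {
    "IGS1": (3, 2, 2),
    "IGS2": (3, 2, 2),
    "IGS3": (2, 0, 3),
    "IGS4": (2, 0, 3),
    "IGS5": (2, 4, 3),
}
samples = ("SampleA", "SampleB", "SampleC")
READ_LENGTH = 100  # hypothetical read length in bp


def present(sample_index):
    """IGSs observed (abundance > 0) in one sample."""
    return {name for name, row in igs_table.items() if row[sample_index] > 0}


def jaccard(i, j):
    """Shared IGSs divided by all IGSs seen in either sample (assumed metric)."""
    a, b = present(i), present(j)
    return len(a & b) / len(a | b)


# Richness: more IGSs means a richer sample.
richness = {name: len(present(i)) for i, name in enumerate(samples)}

# (meta)Genome size = number of IGSs x IGS length (read length).
genome_size_bp = {name: count * READ_LENGTH for name, count in richness.items()}

print(richness)
print(genome_size_bp)
print(jaccard(0, 1), jaccard(0, 2))
```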
Testing IGS method on simulated data sets
Simulated data sets with different species composition
52
1. Estimated (meta)genome size matches real size perfectly.
2. Evenness metric can represent species distribution correctly
No sequencing error is introduced in this simulated data
53
Ground truth: samples-by-species table
Using IGS method: samples-by-IGSs table
Mantel correlation: 0.9714
The dissimilarity matrix calculated with the IGS method matches the matrix calculated from species composition.
54
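For readers unfamiliar with the Mantel correlation reported here, the sketch below shows the usual form of the statistic: the Pearson correlation between the off-diagonal entries of two dissimilarity matrices, with significance assessed by permuting the sample order of one matrix. The two 4x4 matrices are invented for the example and are not the matrices from the talk.

```python
import random
from statistics import mean


def upper_triangle(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def mantel(m1, m2, permutations=999, seed=0):
    """Return (correlation, one-sided permutation p-value)."""
    observed = pearson(upper_triangle(m1), upper_triangle(m2))
    rng = random.Random(seed)
    n = len(m1)
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        # Relabel the samples of m1: rows and columns move together.
        shuffled = [[m1[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(shuffled), upper_triangle(m2)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Made-up dissimilarity matrices for four samples.
truth = [[0.0, 0.2, 0.6, 0.9], [0.2, 0.0, 0.5, 0.8],
         [0.6, 0.5, 0.0, 0.3], [0.9, 0.8, 0.3, 0.0]]
estimate = [[0.0, 0.25, 0.55, 0.95], [0.25, 0.0, 0.5, 0.75],
            [0.55, 0.5, 0.0, 0.35], [0.95, 0.75, 0.35, 0.0]]

r, p = mantel(truth, estimate, permutations=99)
print(f"Mantel r = {r:.4f}, p = {p:.2f}")
```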
Ordination (PCoA) Clustering
AAAB
ABCD
AABC
ABCE
FGHI
AFGH
With an accurate dissimilarity matrix, it will be straightforward to perform PCoA
or clustering analysis to demonstrate the relationship between samples.
Influence of error rate and coverage on
alpha diversity analysis
Estimated metagenome size is more accurate with higher coverage
and a lower error rate
55
56
Beta diversity is less sensitive to error rate and low coverage
Influence of error rate and coverage on
beta diversity analysis
Testing IGS method on real metagenome data
Sorcerer II Global Ocean Sampling Expedition
It has been calculated that microbes in ocean
account for about half of the biomass on Earth.
In a drop (one millilitre) of seawater, one can
find 10 million viruses, one million bacteria
Tropical - Galapagos
Tropical - Open Ocean
Sargasso
Temperate - North
Temperate - South
Estuary
Samples can be separated by location using the IGS-based method, with results as good as or better than those of the original research, which used an alignment-based method.
Temperate / Tropical and Sargasso
Estuary
Temperate - North
Temperate - South
Tropical - Open Ocean
Tropical - Galapagos
Sargasso
Alpha diversity analysis shows that samples from the tropical area have
higher richness.
Temperate
Tropical and Sargasso
ARMO (Amazon Rain Forest Microbial Observatory) project
Soil metagenome sample  Number of reads  Number of k-mers (k=29)  Number of unique k-mers
IOHH.2301.7.1859        273,021,397      32,762,567,744           22,333,571,795
21 samples, 1 TB of data, 5 billion reads.
62
Blue: Prairie
Red: Forest
1. Conversion (forest -> pasture) affects
composition
2. Prairie samples more similar in space
3. Subsamples with 8M reads are used.
Figure labels: scale 0.0080; samples ITXB_1, IOHH_1, IOAH_1, ITXB_2, IOAH_2, INPP_2, INPO_1, ITXC, INPP_1, INPN_2, IOHH_2, INPN_1, INPI_2, INPO_2, IOAG_1, IOAG_2, INPI_1
63
Samples can be clustered by different treatments and distances using IGS method.
Figure labels (sampling distances): 100m, 0.01m, 10m, 0.1m, 0.1m, 0.01m, 1m, 10m
Blue: Prairie
Red: Forest
64
Using subsamples can give a decent
estimate of beta diversity.
Blue: Prairie
Red: Forest
Mantel correlation to the
similarity matrix calculated
from the subsample with 8M
reads
Alpha diversity analysis shows that
samples from the prairie area have
higher richness than those from the forest area.
Blue: Prairie
Red: Forest
Summary
67
• A complete framework with potential:
• Takes all information into account, including rare species
• Alpha diversity (richness, evenness) (Chao1, ACE estimators, etc.)
• Beta diversity (abundance-based, incidence-based)
• Integrates with existing packages (khmer, mothur, QIIME, scikit-bio, etc.)
• Scalable, efficient:
• Based on efficient k-mer counting
• Using subsamples may be enough for specific tasks
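As one example of the estimators named above, here is a sketch of Chao1 applied to a vector of abundances, such as one sample's column of a samples-by-IGSs table. It uses the classic form S_obs + F1^2 / (2 F2), falling back to the bias-corrected form when no doubletons are observed; how the talk's pipeline computes it is not shown on the slides, and the input vectors are made up.

```python
def chao1(abundances):
    """Chao1 richness estimate from a list of per-unit abundances."""
    observed = [a for a in abundances if a > 0]
    s_obs = len(observed)
    singletons = sum(1 for a in observed if a == 1)  # F1
    doubletons = sum(1 for a in observed if a == 2)  # F2
    if doubletons > 0:
        return s_obs + singletons ** 2 / (2 * doubletons)
    # Bias-corrected form, used here when F2 = 0 to avoid dividing by zero.
    return s_obs + singletons * (singletons - 1) / 2


# Hypothetical abundances of eight IGSs in one sample.
print(chao1([5, 3, 1, 1, 1, 2, 2, 0]))
print(chao1([4, 1, 1, 0]))
```

Many singletons relative to doubletons suggest that more units remain unseen, which pushes the estimate above the observed richness.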
Summary
• Extend applications beyond diversity analysis
– Extracts interesting reads by coverage profile
– Reads binning/classification based on reads coverage
profile
– Genome size estimation/sequencing effort evaluation
• Refine pipeline, optimize performance
– Adjust IGS resolution
– Improve methods to handle large number of IGSs
more efficiently
– Iterative beta diversity analysis
68
What’s next
Publications
• Using the concept of informative genomic segment to investigate
microbial diversity of metagenomic sample (in preparation)
• [stream] Crossing the streams: a framework for streaming analysis of
short DNA sequencing reads. Qingpeng Zhang, Sherine Awad, C. Titus Brown.
(submitted) https://peerj.com/preprints/890/
• [khmer] These are not the k-mers you are looking for: efficient online k-mer
counting using a probabilistic data structure. Qingpeng Zhang, Jason Pell,
Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown. PLOS ONE, July 25,
2014, DOI: 10.1371/journal.pone.0101271
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0101271
• [diginorm] A Reference-Free Algorithm for Computational
Normalization of Shotgun Sequencing Data. C. Titus Brown, Adina Howe,
Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. Submitted.
http://arxiv.org/abs/1203.4802
• [Osedax] Genomic versatility and functional variation between two
dominant heterotrophic symbionts of deep-sea Osedax worms. Shana K Goffredi,
Hana Yi, Qingpeng Zhang, Jane E Klann, Isabelle A Struve, Robert C Vrijenhoek and
C Titus Brown. The ISME Journal doi:10.1038/ismej.2013.201
IGS method
seaworm symbiont
digital normalization
khmer
error analysis
69
Dr. Titus Brown
Lab members of GED
Dr. Sherine Awad
Michael Crusoe
Jiarong Guo
Luiz Irber
Camille Scott
Collaborators
Dr. James Tiedje
Dr. Joan Ross
Dr. Shana K Goffredi
Yiseul Kim
Acknowledgement
Committee members
James Cole
Richard Enbody
Yanni Sun
Eric Torng
Former members of GED
Dr. Adina Howe
Eric McDonald
Dr. Jason Pell
Dr. Likit Preeyanon
Dr. Elijah Lowe
khmer developers

More Related Content

What's hot

2015. Petr Smykal. Study domestication and to broaden genetic diversity of w...
2015. Petr Smykal.  Study domestication and to broaden genetic diversity of w...2015. Petr Smykal.  Study domestication and to broaden genetic diversity of w...
2015. Petr Smykal. Study domestication and to broaden genetic diversity of w...
FOODCROPS
 
Metagenomics
MetagenomicsMetagenomics
Metagenomics
Surender Rawat
 
Centre for Genetic Resources The Netherlands
Centre for Genetic Resources  The NetherlandsCentre for Genetic Resources  The Netherlands
Centre for Genetic Resources The Netherlands
CIAT
 
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Mick Watson
 
Techniques of-biotechnology-mcclean-good
Techniques of-biotechnology-mcclean-goodTechniques of-biotechnology-mcclean-good
Techniques of-biotechnology-mcclean-good
rcolatru
 
Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...
Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...
Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...
Pat (JS) Heslop-Harrison
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
c.titus.brown
 
Accelerating crop genetic gains with genomic selection
Accelerating crop genetic gains with genomic selectionAccelerating crop genetic gains with genomic selection
Accelerating crop genetic gains with genomic selection
ViolinaBharali
 
Genotype imputation study in Gir dairy cattle of Gujarat
Genotype imputation study in Gir dairy cattle of GujaratGenotype imputation study in Gir dairy cattle of Gujarat
Genotype imputation study in Gir dairy cattle of Gujarat
Superior Animal Genetics (SAG)
 
Testing for Food Authenticity
Testing for Food AuthenticityTesting for Food Authenticity
Testing for Food Authenticity
Richard Stansfield
 
Genomic Selection in dairy cattle breeding -An overview
Genomic Selection in dairy cattle breeding -An overviewGenomic Selection in dairy cattle breeding -An overview
Genomic Selection in dairy cattle breeding -An overview
Superior Animal Genetics (SAG)
 
Human genetics evolutionary genetics
Human genetics   evolutionary geneticsHuman genetics   evolutionary genetics
Human genetics evolutionary genetics
Dan Gaston
 
Case studies of HTS / NGS applications
Case studies of HTS / NGS applicationsCase studies of HTS / NGS applications
Case studies of HTS / NGS applications
rjorton
 
'Novel technologies to study the resistome'
'Novel technologies to study the resistome''Novel technologies to study the resistome'
'Novel technologies to study the resistome'
Willem van Schaik
 
Mini core collection - an international public good
Mini core collection - an international public goodMini core collection - an international public good
Mini core collection - an international public good
ICRISAT
 
The science of genomics and livestock genetic improvement
The science of genomics and livestock genetic improvementThe science of genomics and livestock genetic improvement
The science of genomics and livestock genetic improvement
ILRI
 
GRM 2011: Approaches, resources and tools for rice gene discovery and breeding
GRM 2011: Approaches, resources and tools for rice gene discovery and breedingGRM 2011: Approaches, resources and tools for rice gene discovery and breeding
GRM 2011: Approaches, resources and tools for rice gene discovery and breeding
CGIAR Generation Challenge Programme
 
MAGIC population and its application in crop improvement
MAGIC population and its application in crop improvementMAGIC population and its application in crop improvement
MAGIC population and its application in crop improvement
SanghaviBoddu
 
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
FOODCROPS
 
Metagenomics sequencing
Metagenomics sequencingMetagenomics sequencing
Metagenomics sequencing
cdgenomics525
 

What's hot (20)

2015. Petr Smykal. Study domestication and to broaden genetic diversity of w...
2015. Petr Smykal.  Study domestication and to broaden genetic diversity of w...2015. Petr Smykal.  Study domestication and to broaden genetic diversity of w...
2015. Petr Smykal. Study domestication and to broaden genetic diversity of w...
 
Metagenomics
MetagenomicsMetagenomics
Metagenomics
 
Centre for Genetic Resources The Netherlands
Centre for Genetic Resources  The NetherlandsCentre for Genetic Resources  The Netherlands
Centre for Genetic Resources The Netherlands
 
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequenc...
 
Techniques of-biotechnology-mcclean-good
Techniques of-biotechnology-mcclean-goodTechniques of-biotechnology-mcclean-good
Techniques of-biotechnology-mcclean-good
 
Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...
Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...
Banana, Ensete and Boesenbergia Genomics - Schwarzacher, Heslop-Harrison, Har...
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 
Accelerating crop genetic gains with genomic selection
Accelerating crop genetic gains with genomic selectionAccelerating crop genetic gains with genomic selection
Accelerating crop genetic gains with genomic selection
 
Genotype imputation study in Gir dairy cattle of Gujarat
Genotype imputation study in Gir dairy cattle of GujaratGenotype imputation study in Gir dairy cattle of Gujarat
Genotype imputation study in Gir dairy cattle of Gujarat
 
Testing for Food Authenticity
Testing for Food AuthenticityTesting for Food Authenticity
Testing for Food Authenticity
 
Genomic Selection in dairy cattle breeding -An overview
Genomic Selection in dairy cattle breeding -An overviewGenomic Selection in dairy cattle breeding -An overview
Genomic Selection in dairy cattle breeding -An overview
 
Human genetics evolutionary genetics
Human genetics   evolutionary geneticsHuman genetics   evolutionary genetics
Human genetics evolutionary genetics
 
Case studies of HTS / NGS applications
Case studies of HTS / NGS applicationsCase studies of HTS / NGS applications
Case studies of HTS / NGS applications
 
'Novel technologies to study the resistome'
'Novel technologies to study the resistome''Novel technologies to study the resistome'
'Novel technologies to study the resistome'
 
Mini core collection - an international public good
Mini core collection - an international public goodMini core collection - an international public good
Mini core collection - an international public good
 
The science of genomics and livestock genetic improvement
The science of genomics and livestock genetic improvementThe science of genomics and livestock genetic improvement
The science of genomics and livestock genetic improvement
 
GRM 2011: Approaches, resources and tools for rice gene discovery and breeding
GRM 2011: Approaches, resources and tools for rice gene discovery and breedingGRM 2011: Approaches, resources and tools for rice gene discovery and breeding
GRM 2011: Approaches, resources and tools for rice gene discovery and breeding
 
MAGIC population and its application in crop improvement
MAGIC population and its application in crop improvementMAGIC population and its application in crop improvement
MAGIC population and its application in crop improvement
 
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
2015. Jason Wallace. Applying high throughput genomics to crops for the devel...
 
Metagenomics sequencing
Metagenomics sequencingMetagenomics sequencing
Metagenomics sequencing
 

Viewers also liked

Talk by J. Eisen for NZ Computational Genomics meeting
Talk by J. Eisen for NZ Computational Genomics meetingTalk by J. Eisen for NZ Computational Genomics meeting
Talk by J. Eisen for NZ Computational Genomics meeting
Jonathan Eisen
 
Managing environmental- molecular- and associated meta-data: The Micro B3 Inf...
Novel Computational Approaches to Investigate Microbial Diversity

  • 1. Novel Computational Approaches to Investigate Microbial Diversity. Qingpeng Zhang, Department of Computer Science and Engineering, Michigan State University. Advisor: Dr. Titus Brown
  • 2. Outline • Background and motivation • An efficient k-mer counting approach • Novel method to investigate microbial diversity – Concept of IGS (informative genomic segment) – Testing IGS method on simulated data sets – Testing IGS method on real metagenome data • Summary. Current section: Background and motivation
  • 3. Microbial diversity: Sorcerer II Global Ocean Sampling Expedition • How many different species are in a seawater sample? What is their abundance distribution? (alpha diversity) [Hint: ~10,000 different types of microorganisms] • Are samples from tropical and temperate areas different? (beta diversity) [Hint: Yes, but how different?]
  • 4. Metagenomics and “big data”. “...functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil.” - J. Handelsman et al., 1998. Most microorganisms cannot be cultured, isolated, and sequenced in the traditional way. In a gram of soil, there are approximately a billion microbial cells, containing an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013). This, plus the improvement of next-generation sequencing and the decreasing cost of sequencing, produces a data deluge.
  • 6. Metagenomics and “big data”. “...functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil.” - J. Handelsman et al., 1998. Most microorganisms cannot be cultured, isolated, and sequenced in the traditional way. In a gram of soil, there are approximately a billion microbial cells, containing an estimated 4 petabase pairs of DNA (Jack A. Gilbert, 2013). This, plus the improvement of next-generation sequencing and the decreasing cost of sequencing, produces a data deluge. http://www.e-janco.com/
  • 7. Metagenomics and “big data”. Bacterial genome size: 139,000 bp to 13,000,000 bp. Human genome size: 3,000,000,000 bp. Tools friendly to big data are required.
  • 8. Sorcerer II Global Ocean Sampling Expedition • How many different species are in a seawater sample? What is their abundance distribution? (alpha diversity) [Hint: ~10,000 different types of microorganisms, 10 million viruses, one million bacteria in a drop.] • Are samples from tropical and temperate areas different? (beta diversity) [Hint: Yes, but how different?]
  • 11. Which library has more titles?
  • 12. Which library has more titles? With only shredded pieces of pages with (partial) sentences on them to read.
  • 13. How similar are the libraries in their collections? With only shredded pieces of pages with (partial) sentences on them to read.
  • 14. A typical pipeline of metagenome diversity analysis (loosely based on a slide by Mads Albertsen). Figure labels: Sequencing, Reads (100-150 bp), Assembly, Contigs (1000+ bp), Binning, Species (A to D), Mapping, Abundance (3x, 1x, 1x, 2x). Resulting table:
      Species     Sample1  Sample2  Sample3  Sample4  Sample5
      Species A   3        3        4        4        0
      Species B   1        3        0        2        2
      Species C   1        2        2        2        3
      Species D   2        1        2        1        1
  • 15. Alpha diversity: “How many different species in a seawater sample?” Species-by-sample table:
      Species    Sample1  Sample2  Sample3  Sample4  Sample5  Sample6
      SpeciesA   3        3        4        4        0        0
      SpeciesB   1        3        0        2        2        2
      SpeciesC   1        2        2        2        3        3
      SpeciesD   2        1        2        1        1        1
      SpeciesE   4        1        3        0        2        5
      SpeciesF   5        1        2        2        2        5
      SpeciesG   2        2        1        2        1        3
    Beta diversity: “How different are the seawater samples?” Sample-by-sample distance matrix:
                 Sample1  Sample2  Sample3  Sample4  Sample5  Sample6
      Sample1    0.0      0.25     0.5      0.5      0.75     1.0
      Sample2    0.25     0.0      0.25     0.25     0.75     1.0
      Sample3    0.5      0.25     0.0      0.25     0.75     1.0
      Sample4    0.5      0.25     0.25     0.0      0.75     1.0
      Sample5    0.75     0.75     0.75     0.75     0.0      0.25
      Sample6    1.0      1.0      1.0      1.0      0.25     0.0
    Tools: QIIME, mothur, etc. We need this!
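As a rough illustration of how alpha and beta diversity could be computed from the species-by-sample table on slide 15, the sketch below uses species richness (the number of species with a non-zero count) and Bray-Curtis dissimilarity. Both measures are assumptions chosen for illustration: the slide does not say which measures produced its distance matrix, and this sketch is not expected to reproduce those values.

```python
# Species-by-sample counts copied from the example table on slide 15.
counts = {
    "SpeciesA": [3, 3, 4, 4, 0, 0],
    "SpeciesB": [1, 3, 0, 2, 2, 2],
    "SpeciesC": [1, 2, 2, 2, 3, 3],
    "SpeciesD": [2, 1, 2, 1, 1, 1],
    "SpeciesE": [4, 1, 3, 0, 2, 5],
    "SpeciesF": [5, 1, 2, 2, 2, 5],
    "SpeciesG": [2, 2, 1, 2, 1, 3],
}
n_samples = 6


def sample_column(j):
    """Counts of every species in sample j (0-based index)."""
    return [row[j] for row in counts.values()]


def richness(j):
    """Alpha diversity as species richness: species with a non-zero count."""
    return sum(1 for c in sample_column(j) if c > 0)


def bray_curtis(i, j):
    """Beta diversity as Bray-Curtis dissimilarity between samples i and j."""
    a, b = sample_column(i), sample_column(j)
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))


richness_per_sample = [richness(j) for j in range(n_samples)]
distance_matrix = [[bray_curtis(i, j) for j in range(n_samples)]
                   for i in range(n_samples)]
```

Richness ignores abundance entirely, which is why an abundance-aware measure is usually reported alongside it.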
  • 16. A typical pipeline of metagenome diversity analysis. Figure labels: Sequencing, Reads (100-150 bp), Assembly, Contigs (1000+ bp), Binning, Species (A to D), Mapping, Abundance (3x, 1x, 1x, 2x), samples-by-species table. Challenges: • Assembly? Difficult (e.g., only small portions of reads can be assembled) • Binning? Difficult (e.g., may need reference genomes, computationally expensive) • Mapping? Difficult (e.g., accuracy, computationally expensive) • 4 petabase pairs of DNA in a gram of soil (Jack A. Gilbert, 2013)
  • 17. A typical pipeline of metagenome diversity analysis. Figure labels: Sequencing, Reads (100-150 bp), Assembly, Contigs (1000+ bp), Binning, Species (A to D), Mapping, Abundance (3x, 1x, 1x, 2x), species-by-sample table.
  • 18. Reads -> k-mers. K-mer: a DNA segment of length k. Example read: >24288227 TTTGCCTGTTGAAAACCTGTAAACGGGATGTTTCCGGGGTTTTCCGCCCCGGTAGGGTTGAGACGTGCCCCGCGGTCGCGAACAGCTCGCCGCTTACCGGCGTAAGAGCGATCCGCAACACACCCCGGAGCCTCTACCCCCCCCTCACGC. K-mers shown in the figure: TTTGCCTGTT, TTGCCTGTTG, TTTCCGGGGT, TTCCGGGGTT, TTCCGGGGTT, ATCCGCGGTT, TTCCGGGGTT, TTCCGGGGTT, TTCCGGCCGTT, GGTTTTTCCG
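A minimal sketch of breaking a read into k-mers with a sliding window, using the first 27 bases of the example read on slide 18 and k = 10. The function name `kmers` is hypothetical, and the sketch ignores reverse complements, which a real k-mer counter may treat as equivalent.

```python
def kmers(read, k):
    """Return every length-k substring of the read, in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Prefix of read 24288227 from the slide.
read = "TTTGCCTGTTGAAAACCTGTAAACGGG"
all_kmers = kmers(read, 10)
first_two = all_kmers[:2]
```

A read of length L yields L - k + 1 k-mers, so the number of k-mers in a data set is close to the number of sequenced bases.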
  • 19. Counting (unique) k-mers to evaluate diversity. Figure: reads and k-mers of sample A compared with reads and k-mers of sample B.
  • 20. Counting (unique) words to evaluate diversity. Figure: reads and k-mers of sample A compared with reads and k-mers of sample B.
  • 21. Challenges in counting k-mers • Handle a large number of k-mers (a simple hash table will not work) • Retrieve the count of any k-mer by random access (in memory) – to check the occurrence of a k-mer in multiple samples and find shared k-mers. Example: soil metagenome sample IOHH.2301.7.1859 has 273,021,397 reads, 32,762,567,744 k-mers (k = 29), and 22,333,571,795 unique k-mers.
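A back-of-the-envelope calculation suggests why an exact table struggles with the soil sample on slide 21. The storage model is an assumption made only for this estimate: each unique k-mer is packed at 2 bits per base, paired with a 1-byte counter, and stored with no hash-table overhead at all.

```python
import math

unique_kmers = 22_333_571_795  # unique 29-mers reported on the slide
k = 29

# Assumed storage model: 2 bits per base, rounded up to whole bytes,
# plus a 1-byte counter per k-mer. Real hash tables need more than this.
bits_per_kmer = 2 * k
key_bytes = math.ceil(bits_per_kmer / 8)
counter_bytes = 1

total_bytes = unique_kmers * (key_bytes + counter_bytes)
total_gb = total_bytes / 1e9
print(f"lower bound: about {total_gb:.0f} GB for exact counting")
```

Even this optimistic lower bound is far beyond the memory of a typical workstation, which motivates a fixed-memory probabilistic structure.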
  • 22. Outline • Background and motivation • An efficient k-mer counting approach • Novel method to investigate microbial diversity – Concept of IGS (informative genomic segment) – Testing IGS method on simulated data sets – Testing IGS method on real metagenome data • Summary. Current section: An efficient k-mer counting approach
  • 23. khmer: an efficient k-mer counting approach • The Count-Min Sketch consists of one or more hash tables of different sizes • Each entry in the hash tables is a counter representing the number of k-mers that hash to that location • To retrieve the count for a given k-mer, we select the minimum count across all of its hash entries • Highly scalable: constant memory consumption, independent of k and data set size • Probabilistic properties well suited to next-generation sequencing data sets • Memory usage is fixed and low; the tradeoff is a certain false positive rate in counting, caused by collisions • The counting error is one-sided, low, and predictable.
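The structure described on slide 23 can be sketched in a few lines. This is a toy illustration, not khmer's implementation: the class name, the use of SHA-1, the choice to reduce one hash value modulo each table size, and the small prime table sizes are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch: several counter tables of different sizes."""

    def __init__(self, table_sizes):
        self.table_sizes = list(table_sizes)
        self.tables = [[0] * size for size in self.table_sizes]

    def _slots(self, kmer):
        # One hash value, reduced modulo each table size (assumed scheme).
        digest = hashlib.sha1(kmer.encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        return [value % size for size in self.table_sizes]

    def add(self, kmer):
        """Increment the k-mer's counter in every table."""
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def count(self, kmer):
        """Minimum across tables: may overestimate, never underestimates."""
        return min(table[slot]
                   for table, slot in zip(self.tables, self._slots(kmer)))


# Hypothetical small table sizes; real data sets need far larger tables.
sketch = CountMinSketch([1009, 1013, 1019])
for kmer in ["TTTGCCTGTT"] * 3 + ["TTGCCTGTTG"]:
    sketch.add(kmer)
```

Memory depends only on the table sizes, not on k or on the number of k-mers inserted. Collisions can only inflate a counter, which is why the error is one-sided.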
  • 24. An efficient k-mer counting approach • My contributions: • Algorithm design and analysis: exploring the underlying mathematics and the choice of optimal parameters • Contributing code, including unique k-mer counting, overlap k-mer counting, optimal parameter choice, and other code related to my specific research project • Benchmarking and testing • Exploring applications such as error trimming, filtering low-abundance reads, and digital normalization; suggesting features • Working on the khmer manuscript
  • 25. Sequencing errors cause serious problems. Example read: >24288227 TTTGCCTGTTGAAAACCTGAAAACGGGATGTTTCCGGGGTTTTCCGCCCCGGTAGGGTTGAGACGTGCCCCGCGGTCGCGAACAGCTCGCCGCTTACCGGCGTAAGAGCGATCCGCAACACACCCCGGAGCCTCTACCCCCCCCTCACGC. K-mers shown in the figure: GAAAACCTGA, AAAACCTGAA, AAACCTGAAA. In a high-coverage read data set, most of the unique k-mers are erroneous, caused by sequencing errors. Can we count k-mers to measure diversity directly?
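To see why errors inflate the number of unique k-mers, the sketch below introduces one hypothetical base substitution into the first 30 bases of the read shown on slide 18 and counts the k-mers that do not occur in the unmodified sequence. Treating the slide 18 sequence as error-free, and the position and base of the substitution, are assumptions for illustration.

```python
def kmers(seq, k):
    """Return every length-k substring of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def substitute(seq, pos, base):
    """Return seq with the base at 0-based position pos replaced."""
    return seq[:pos] + base + seq[pos + 1:]


k = 10
true_read = "TTTGCCTGTTGAAAACCTGTAAACGGGATG"  # assumed error-free
bad_read = substitute(true_read, 19, "A")     # hypothetical T -> A error

# K-mers that exist only because of the single substituted base.
novel = set(kmers(bad_read, k)) - set(kmers(true_read, k))
print(len(novel), "new k-mers from one error")
```

A single error away from the read ends can create up to k spurious k-mers, so at high coverage the erroneous k-mers can outnumber the genuine ones.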
  • 27. Outline • Background and motivation • An efficient k-mer counting approach • Novel method to investigate microbial diversity – Concept of IGS (informative genomic segment) – Testing IGS method on simulated data sets – Testing IGS method on real metagenome data • Summary. Current section: Concept of IGS (informative genomic segment)
  • 28. A typical pipeline of metagenome diversity analysis. Figure labels: Sequencing, Reads (100-150 bp), Assembly, Contigs (1000+ bp), Binning, Species (A to D), Mapping, Abundance (3x, 1x, 1x, 2x), species-by-sample table.
  • 29. IGSs (informative genomic segments) • An IGS can represent the novel information of a genome • It can be used as a new unit for diversity analysis, replacing species • Assembly-free • Reference-free • Efficient and scalable. Replace the samples-by-species table with a samples-by-IGSs table. Figure labels: Reads, Genome.
  • 30. From reads to IGSs: four tricks! 1. We can use median k-mer abundance to estimate the sequencing coverage of a read without a reference assembly. 2. We can retrieve the coverage of a read across different samples. 3. If a read has a coverage of C, there will be approximately C-1 other reads with the same coverage covering the same DNA region. 4. We can estimate the size of the covered genomic region from the number of reads and the corresponding coverage.
  • 31. Coverage of a read and how to get it without a reference assembly. Coverage of a read: the sequencing depth of the DNA region that the read originates from. The abundances of the k-mers in a read tend to be similar and close to the coverage of the read. Figure: reads and k-mers of sample A.
  • 35. Trick 1: We can use median k-mer abundance to estimate the sequencing coverage of a read without a reference assembly. The median k-mer abundance represents the sequencing coverage of the read.
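Trick 1 can be sketched with exact counts on a toy data set. The 20-base region, the erroneous read, and k = 5 are hypothetical, and an exact `Counter` stands in for the probabilistic counting table described earlier in the talk.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every length-k substring of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def median_kmer_abundance(read, counts, k):
    """Estimate read coverage as the median count of the read's k-mers."""
    return median(counts[kmer] for kmer in kmers(read, k))


k = 5
region = "ACGTTGCAAGGCTTACGGAT"      # hypothetical genomic region
error_read = "ACTTTGCAAGGCTTACGGAT"  # same region with one substitution
reads = [region] * 4 + [error_read]  # true coverage of the region is 5

counts = Counter(kmer for read in reads for kmer in kmers(read, k))
coverage_clean = median_kmer_abundance(region, counts, k)
coverage_error = median_kmer_abundance(error_read, counts, k)
```

Both reads get the same estimate because only a few k-mers overlap the erroneous base, and the median ignores those outliers; a mean would be pulled down by them.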
  • 36. Trick 1: We can use median k-mer abundance to estimate the sequencing coverage of a read without a reference assembly. Trick 2: We can estimate the sequencing coverage of a read across different samples. Figure: reads of sample A looked up against k-mers of sample B.
  • 37. Read coverage profile across Sample A, Sample B, and Sample C. Coverage values shown in the figure: 4, 6, and 2.
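A read coverage profile can be sketched by looking up the same read in one k-mer count table per sample. The read, the sample contents, and the assignment of coverages 4, 6, and 2 to samples A, B, and C are hypothetical values for illustration, and exact counting again stands in for a probabilistic table.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every length-k substring of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Exact k-mer count table for one sample."""
    return Counter(kmer for read in reads for kmer in kmers(read, k))


def coverage_profile(read, sample_tables, k):
    """Median k-mer abundance of the read in each sample's count table."""
    return tuple(median(table[kmer] for kmer in kmers(read, k))
                 for table in sample_tables)


k = 5
read = "ACGTTGCAAGGCTTACGGAT"  # hypothetical read
samples = {"A": [read] * 4, "B": [read] * 6, "C": [read] * 2}
tables = [count_kmers(sample_reads, k) for sample_reads in samples.values()]
profile = coverage_profile(read, tables, k)
```

A sample that does not contain the read's region contributes a coverage of 0, since none of the read's k-mers appear in its table.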
  • 39. 3:2:2 3:2:2 3:2:2 3:2:2 3:2:2 3:2:2 3:2:2 2:0:3 2:0:3 2:0:3 2:0:3 2:0:3 2:0:3 2:4:3 2:4:3 2:4:3 2:4:3 2:4:3 2:4:3 2:4:3 3:2:2 Sample A Sample B Sample C “Contigs with similar coverage profiles are likely to have originated from the same microbial population” (Imelfort et al. 2014) Reads with similar coverage profiles are likely to have originated from the same species. 39
  • 40. Do the reads in each bin cover a genomic region of the same size? 6 reads with a coverage of 3 in the sample they come from vs. 2 reads with a coverage of 2 in the sample they come from. [Figure: reads with coverage profiles 3:2:2, 2:0:3, and 2:4:3 grouped into bin 1, bin 2, and bin 3, shown against the genomic content of Sample A, Sample B, and Sample C]
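Grouping reads into bins by coverage profile can be sketched as follows. The sketch assumes that profiles are matched exactly, which is a simplification; real profiles are noisy and would need some tolerance or rounding. The read IDs and the function name are hypothetical.

```python
from collections import defaultdict


def bin_reads_by_profile(read_profiles):
    """Group read IDs by their (Sample A, Sample B, Sample C) coverage profile."""
    bins = defaultdict(list)
    for read_id, profile in read_profiles:
        bins[tuple(profile)].append(read_id)
    return dict(bins)


# Toy input: (read ID, coverage profile across three samples).
read_profiles = [
    ("r1", (3, 2, 2)),
    ("r2", (2, 0, 3)),
    ("r3", (3, 2, 2)),
    ("r4", (2, 4, 3)),
    ("r5", (2, 0, 3)),
    ("r6", (3, 2, 2)),
]

bins = bin_reads_by_profile(read_profiles)
```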
  • 41. What is the size of the genomic region covered by the reads in each bin? What we know: C: coverage of the reads in the bin, in the sample they come from. N: number of reads in the bin.
  • 42. Trick 3: If a read has coverage C, there will be approximately C-1 other reads with the same coverage covering the same DNA region. [Figure: reads from sample A, k-mers from sample A]
  • 43. Trick 4: We can estimate the size of the covered genomic region from the number of reads (N) and the corresponding coverage (C). [Figure: reads from sample A]
  • 44. Trick 4: We can estimate the size of the covered genomic region from the number of reads (N) and the corresponding coverage (C). Examples: N ≈ 7, C ≈ 7; N ≈ 30, C ≈ 10. [Figure: reads from sample A]
  • 45. Trick 4: We can estimate the size of the covered genomic region from the number of reads (N) and the corresponding coverage (C). Size of covered DNA region: for N ≈ 7, C ≈ 7, N/C = 7/7 = 1; for N ≈ 30, C ≈ 10, N/C = 30/10 = 3. [Figure: reads from sample A]
  • 46. Trick 4: We can estimate the size of the covered genomic region from the number of reads (N) and the corresponding coverage (C). Size of covered DNA region: for N ≈ 7, C ≈ 7, N/C = 7/7 = 1; for N ≈ 30, C ≈ 10, N/C = 30/10 = 3. Unit of the size of the covered DNA region: IGS (informative genomic segment). [Figure: reads from sample A]
  • 47. Do the reads in each bin cover a genomic region of the same size? 6 reads with a coverage of 3 in the sample they come from vs. 2 reads with a coverage of 2 in the sample they come from. Size of covered region (N/C) for the three bins: 6/3 = 2, 4/2 = 2, 2/2 = 1.
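The N/C arithmetic on these slides can be written as a small helper. This is a sketch only; the function name is hypothetical, and the inputs are the (N, C) pairs quoted on the slides.

```python
def igs_in_bin(n_reads, coverage):
    """Size of the region covered by a bin, in IGS units: N / C."""
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    return n_reads / coverage


# (N, C) pairs from the slides.
bin_sizes = [igs_in_bin(n, c) for n, c in [(6, 3), (4, 2), (2, 2)]]
small_region = igs_in_bin(7, 7)
large_region = igs_in_bin(30, 10)
```

Dividing by coverage removes the effect of sequencing depth: a bin with many reads at high coverage can cover the same amount of genome as a bin with few reads at low coverage.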
  • 49. A typical pipeline of metagenome diversity analysis. [Figure: pipeline stages Sequencing, Reads (100-150 bp), Assembly, Contigs (1000+ bp), Binning, Species, Mapping, Abundance; example abundances 3x, 1x, 1x, 2x for Species A to D in Sample1, giving a species-by-sample table] Challenges: • Assembly is difficult (e.g. only a small portion of the reads can be assembled) • Binning is difficult (e.g. may need reference genomes; computationally expensive) • Mapping is difficult (e.g. accuracy; computationally expensive) • 4 petabase pairs of DNA in a gram of soil (Jack A. Gilbert, 2013)
  • 50. IGS (informative genomic segment) can represent the novel information of a genome. IGSs (informative genomic segments): genomic segments with the same length as a read that do not share genomic content with each other. • The more IGSs there are in a sample, the higher the diversity (richness) of the sample. • The more IGSs two samples share, the more similar the two samples are. (Meta)genome size = number of IGSs x IGS length (read length)
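As a worked example of the size formula, the sketch below sums N/C over bins to get the number of IGSs and multiplies by the read length. The bins, the read length of 100 bp, and the function name are hypothetical; the toy numbers correspond to a 1000 bp region at coverage 10 (100 reads) and a 500 bp region at coverage 4 (20 reads).

```python
def estimate_size_bp(bins, read_length):
    """(Meta)genome size = number of IGSs x IGS length (read length).

    bins is a list of (number of reads N, coverage C) pairs; each bin
    contributes N / C IGSs.
    """
    total_igs = sum(n_reads / coverage for n_reads, coverage in bins)
    return total_igs * read_length


# 100 reads at coverage 10 -> 10 IGSs; 20 reads at coverage 4 -> 5 IGSs.
size_bp = estimate_size_bp([(100, 10), (20, 4)], read_length=100)
```

Here the estimate recovers the 1500 bp total of the two toy regions, because reads, coverage, and read length were chosen to be exactly consistent.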
  • 51. Testing IGS method on simulated data sets
  • 52. Simulated data sets with different species composition. 1. The estimated (meta)genome size matches the real size perfectly. 2. The evenness metric represents the species distribution correctly. No sequencing error was introduced in this simulated data.
  • 53. The dissimilarity matrix calculated with the IGS method (samples-by-IGSs table) matches the ground-truth matrix calculated from species composition (samples-by-species table). Mantel correlation: 0.9714
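The comparison on this slide can be sketched in two steps: build a dissimilarity matrix from each table, then correlate the two matrices. The slide does not say which dissimilarity measure was used, so Bray-Curtis here is an assumption. The Mantel statistic is taken as the Pearson correlation between the upper triangles of the two matrices; the permutation test that normally accompanies it is omitted. Both toy tables are hypothetical.

```python
from itertools import combinations
from math import sqrt


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(a + b for a, b in zip(u, v))
    return sum(abs(a - b) for a, b in zip(u, v)) / total if total else 0.0


def dissimilarity_matrix(table):
    """Pairwise dissimilarities between the rows (samples) of a table."""
    n = len(table)
    return [[bray_curtis(table[i], table[j]) for j in range(n)] for i in range(n)]


def mantel_statistic(d1, d2):
    """Pearson correlation between the upper triangles of two square matrices."""
    pairs = list(combinations(range(len(d1)), 2))
    x = [d1[i][j] for i, j in pairs]
    y = [d2[i][j] for i, j in pairs]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Toy tables: rows are samples, columns are species or IGS groups.
species_table = [[3, 1, 0], [1, 1, 2], [0, 2, 4]]
# Hypothetical IGS table in which every species contributes 10 IGSs per unit.
igs_table = [[30, 10, 0], [10, 10, 20], [0, 20, 40]]

d_species = dissimilarity_matrix(species_table)
d_igs = dissimilarity_matrix(igs_table)
r = mantel_statistic(d_species, d_igs)
```

In this idealized case the two matrices are identical, so the statistic is 1 up to floating-point error; on real data it would fall below 1, as with the 0.9714 reported on the slide.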
  • 54. With an accurate dissimilarity matrix, it is straightforward to perform ordination (PCoA) or clustering analysis to show the relationships between samples. [Figure: PCoA ordination and clustering of the simulated samples AAAB, ABCD, AABC, ABCE, FGHI, AFGH]
  • 55. Influence of error rate and coverage on alpha diversity analysis: the estimated metagenome size is more accurate with higher coverage and lower error rate.
  • 56. Influence of error rate and coverage on beta diversity analysis: beta diversity is less sensitive to error rate and low coverage.
  • 57. Testing IGS method on real metagenome data
  • 58. Sorcerer II Global Ocean Sampling Expedition. It has been calculated that microbes in the ocean account for about half of the biomass on Earth. In a drop (one millilitre) of seawater, one can find 10 million viruses, one million bacteria
  • 59. Samples can be separated by location using the IGS-based method, with results as good as or better than those of the original, alignment-based study. [Figure: samples labeled Tropical - Galapagos, Tropical - Open Ocean, Sargasso, Temperate - North, Temperate - South, Estuary; groups: Temperate; Tropical and Sargasso]
  • 60. Alpha diversity analysis shows that samples from the tropical area have higher richness. [Figure: samples labeled Estuary, Temperate - North, Temperate - South, Tropical - Open Ocean, Tropical - Galapagos, Sargasso; groups: Temperate; Tropical and Sargasso]
  • 61. ARMO (Amazon Rain Forest Microbial Observatory) project: 21 samples, 1T size of data, 5 billion reads. Example soil metagenome sample IOHH.2301.7.1859: number of reads 273,021,397; number of k-mers (k=29) 32,762,567,744; number of unique k-mers 22,333,571,795.
  • 62. 1. Conversion (forest -> pasture) affects composition. 2. Prairie samples are more similar across space. 3. Subsamples with 8M reads are used. Blue: Prairie. Red: Forest.
  • 63. Samples can be clustered by treatment and by distance using the IGS method. [Figure: clustering tree (scale bar 0.0080) of samples ITXB_1, IOHH_1, IOAH_1, ITXB_2, IOAH_2, INPP_2, INPO_1, ITXC, INPP_1, INPN_2, IOHH_2, INPN_1, INPI_2, INPO_2, IOAG_1, IOAG_2, INPI_1, annotated with sampling distances from 0.01 m to 100 m] Blue: Prairie. Red: Forest.
  • 64. Using subsamples can give a decent beta diversity result. [Figure: Mantel correlation to the similarity matrix calculated from the subsample with 8M reads] Blue: Prairie. Red: Forest.
  • 65. Alpha diversity analysis shows that samples from the prairie area have higher richness than samples from the forest area. Blue: Prairie. Red: Forest.
  • 66. Summary
  • 67. Summary • A complete framework with potential: – Takes all information into account, including rare species – Alpha diversity (richness, evenness) (Chao1, ACE estimators, etc.) – Beta diversity (abundance-based, incidence-based) – Integrates with existing packages (khmer, mothur, QIIME, scikit-bio, etc.) • Scalable and efficient: – Based on efficient k-mer counting – Using subsamples may be enough for a specific task
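As an illustration of how a standard richness estimator could be applied with IGSs in place of species, here is a sketch of the bias-corrected Chao1 estimator. How IGS abundances are defined in the author's pipeline is not stated on the slides, so the abundance list below is purely hypothetical.

```python
from collections import Counter


def chao1(abundances):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of units observed exactly once and twice.
    """
    observed = [a for a in abundances if a > 0]
    frequency = Counter(observed)
    singletons, doubletons = frequency[1], frequency[2]
    return len(observed) + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Hypothetical abundances of 7 observed units (3 singletons, 2 doubletons).
estimate = chao1([1, 1, 1, 2, 2, 5, 10])
```

The estimate exceeds the 7 observed units because singletons suggest that further rare units were missed by the sampling.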
  • 68. What’s next • Extend applications beyond diversity analysis – Extract interesting reads by coverage profile – Read binning/classification based on read coverage profiles – Genome size estimation/sequencing effort evaluation • Refine pipeline, optimize performance – Adjust IGS resolution – Improve methods to handle large numbers of IGSs more efficiently – Iterative beta diversity analysis
  • 69. Publications • [IGS method] Using the concept of informative genomic segment to investigate microbial diversity of metagenomic sample (in preparation) • [stream] Crossing the streams: a framework for streaming analysis of short DNA sequencing reads. Qingpeng Zhang, Sherine Awad, C. Titus Brown. (submitted) https://peerj.com/preprints/890/ • [khmer] These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. Qingpeng Zhang, Jason Pell, Rosangela Canino-Koning, Adina Chuang Howe, C. Titus Brown. PLOS ONE, July 25, 2014, DOI: 10.1371/journal.pone.0101271 http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0101271 • [diginorm] A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. Submitted. http://arxiv.org/abs/1203.4802 • [Osedax] Genomic versatility and functional variation between two dominant heterotrophic symbionts of deep-sea Osedax worms. Shana K Goffredi, Hana Yi, Qingpeng Zhang, Jane E Klann, Isabelle A Struve, Robert C Vrijenhoek and C Titus Brown. The ISME Journal, doi:10.1038/ismej.2013.201
  • 70. Acknowledgement • Dr. Titus Brown • Lab members of GED: Dr. Sherine Awad, Michael Crusoe, Jiarong Guo, Luiz Irber, Camille Scott • Former members of GED: Dr. Adina Howe, Eric McDonald, Dr. Jason Pell, Dr. Likit Preeyanon, Dr. Elijah Lowe • Committee members: Jaimes Cole, Richard Enbody, Yanni Sun, Eric Torng • Collaborators: Dr. James Tiedje, Dr. Joan Ross, Dr. Shana K Goffredi, Yiseul Kim • khmer developers